Text is no longer the only input
Throughout the AI100 corpus, we have discussed visibility in the context of text queries: the user types a question, and the model produces an answer. But search stopped being reducible to words typed on a keyboard a long time ago. A user photographs a product in a store and asks, “How much is this online?” They say aloud, “What model is this?” while pointing the camera at a pair of headphones. They upload a screenshot from Instagram and ask, “Find something similar, but cheaper.” They record a video and add a text question: “What material is this made of?”
These are not exotic scenarios. Google Lens handles more than 20 billion visual queries per month, and 20% of them are shopping-related [1]. AI Mode is integrated with Google Lens: a user can take a photo or upload an image, and the system, using Gemini’s multimodal capabilities, analyzes the entire scene (objects, their context, materials, colors, and shapes) and produces a composite answer [2]. ChatGPT with GPT-4o processes images, voice, and text simultaneously. Voice search is already used by 27% of mobile users [3].
For a brand, this means that text optimization is a necessary but no longer sufficient condition for visibility. If your product cannot be recognized from a photo, if your YouTube video has no transcript, if a voice assistant cannot link the spoken company name to the correct entity, you lose the audience that searches without words.
How visual search changes the rules
Visual search works in a fundamentally different way from text search. The user does not describe what they are looking for—they show it. Convolutional neural networks (CNNs) transform an image into a numerical vector and compare it with a database of indexed images [4]. This means that the quality, consistency, and technical accessibility of images on a site directly affect whether your product will be found.
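To make the mechanism concrete, here is a minimal sketch of the embed-and-compare step. It uses the open-source sentence-transformers CLIP encoder as a stand-in for the proprietary models behind production systems like Lens; the catalog and file names are invented.

```python
# Minimal sketch: image -> vector -> nearest-neighbor lookup.
# Assumes a small local "catalog" of product photos; all file names
# are placeholders.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # maps images to 512-d vectors

# The "index": embeddings of the catalog's product photos.
catalog_files = ["dress_red.jpg", "dress_blue.jpg", "jacket.jpg"]
catalog_embs = model.encode([Image.open(f) for f in catalog_files])

# The user's photo becomes a vector in the same embedding space.
query_emb = model.encode(Image.open("user_photo.jpg"))

# Cosine similarity ranks the catalog; the top hit is the closest visual match.
scores = util.cos_sim(query_emb, catalog_embs)[0]
best = scores.argmax().item()
print(catalog_files[best], float(scores[best]))
```

Production systems run this lookup against billions of indexed images, which is why image quality and consistency matter: a blurry or heavily stylized photo lands in a less useful region of the embedding space.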
For e-commerce, the consequences are most obvious. A shopper sees a dress on the street, photographs it, and Google Lens shows similar products with prices from different online stores within three seconds. If your product images are low quality, lack descriptive alt text and Product schema, or follow no consistent photography style, they will not make it into that selection. A competitor with clean, marked-up photos will.
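In practice, “Product schema” is a JSON-LD block embedded in the product page. The sketch below assembles one in Python; every name, URL, SKU, and price is a placeholder, and the properties shown are a minimal illustrative subset of what schema.org’s Product type supports.

```python
import json

# Illustrative Product JSON-LD for one catalog item; all names, URLs,
# and prices here are placeholders, not real data.
product_jsonld = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Linen Midi Dress",
    "image": [
        "https://example.com/img/dress-front.jpg",
        "https://example.com/img/dress-detail.jpg",
    ],
    "description": "A-line midi dress in washed linen with side pockets.",
    "sku": "DRS-0042",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "offers": {
        "@type": "Offer",
        "price": "89.00",
        "priceCurrency": "EUR",
        "availability": "https://schema.org/InStock",
        "url": "https://example.com/dress-0042",
    },
}

# The output belongs in a <script type="application/ld+json"> tag
# on the corresponding product page.
print(json.dumps(product_jsonld, indent=2))
```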
Visual consistency across platforms is also becoming a factor. Google Lens is better at recognizing brands that use the same photography style across their website, marketplaces, and social platforms. A fragmented visual layer makes it harder to tie content to an entity [5].
Voice search and long queries
Voice queries differ from text queries not only in modality, but also in structure. A person speaking aloud uses natural sentences: “What’s the best café near me that’s open right now?” instead of “cafe near me open.” Queries in AI Mode are, on average, three times longer than ordinary search queries [6]. This means that content optimized for short keyword phrases may not match the way people formulate queries by voice.
For a brand, the practical implication is clear: FAQ sections written in a question-and-direct-answer format work better for voice search than long marketing texts. Structured data (FAQ schema, HowTo schema) helps voice assistants extract a specific answer. The brand name should be pronounceable and unambiguous—a model that cannot connect the spoken “Exco-Data” to the entity “ExcoData” will lose the brand in a voice query.
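As an illustration, here is a minimal sketch of FAQPage markup in that question-and-direct-answer shape, built as JSON-LD; the questions and answers are invented placeholders, not recommended copy.

```python
import json

# Illustrative FAQPage JSON-LD: each entry pairs a natural-language
# question with a short, direct answer that an assistant can read aloud.
faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Is the dress machine washable?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes. Machine wash cold on a gentle cycle and line dry.",
            },
        },
        {
            "@type": "Question",
            "name": "How long does delivery take?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Orders ship within 24 hours and arrive in 2 to 4 business days.",
            },
        },
    ],
}
print(json.dumps(faq_jsonld, indent=2))
```

HowTo markup for instructions follows the same pattern, with an ordered list of HowToStep items in place of question-answer pairs.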
Video and transcripts
AI systems are making increasing use of video content. YouTube video transcripts become a source for citation: if an expert in your video explains in detail how the product works, and the transcript is available, the model can extract a passage from it for an answer. If there is no transcript, the video remains invisible to the text layer of the answer system.
Google explicitly states that AI Mode uses multimodal analysis: the system works simultaneously with text, images, video, and context [2]. For a brand that publishes educational videos, reviews, or product demos, a clean and accurate transcript is not optional—it is a condition of discoverability.
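At the markup level, schema.org’s VideoObject type includes a transcript property, so the transcript can travel with the rest of the video’s metadata. The sketch below shows the shape; all URLs, dates, and the transcript excerpt are placeholders.

```python
import json

# Illustrative VideoObject JSON-LD that exposes the transcript to the
# text layer of answer systems; duration uses ISO 8601 format.
video_jsonld = {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "How the X200 water filter works",
    "description": "A four-minute demo of the X200 filtration stages.",
    "thumbnailUrl": "https://example.com/img/x200-thumb.jpg",
    "uploadDate": "2025-01-15",
    "duration": "PT4M12S",
    "contentUrl": "https://example.com/video/x200-demo.mp4",
    "transcript": (
        "The X200 passes water through three stages: a sediment "
        "pre-filter, an activated-carbon block, and a ceramic membrane..."
    ),
}
print(json.dumps(video_jsonld, indent=2))
```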
What to do right now
Multimodal optimization does not require a revolution. It requires extending familiar work into new formats.
Images: high quality, descriptive filenames and alt text, Product schema tied to specific products, and a consistent photography style across platforms.
Voice: FAQ sections in question-answer format, HowTo schema for instructions, and a pronounceable, unambiguous brand name.
Video: transcripts for every video on YouTube and on the site, VideoObject schema, and descriptive titles and metadata.
General layer: the same principle as for text visibility—structured data, machine readability, and external confirmation. Multimodality does not replace these foundations; it adds new input channels on top of them.
Visual search already processes tens of billions of queries per month. AI Mode integrates multimodal input (photo + text + voice). Video transcripts are used as a source for citation. Voice queries are longer and more conversational than text queries.
The exact share of AI answers initiated by visual or voice input is still poorly measured outside Google Lens. The effect of multimodal optimization on brand citation across different platforms has so far been studied only in fragments.
A brand needs to optimize not only text, but also images, video, and voice discoverability. The basic actions (alt text, transcripts, FAQ schema) are simple and can be started right away.
Sources