Text is no longer the only input
Throughout the AI100 corpus, we have discussed visibility in the context of text queries: the user types a question, and the model produces an answer. But search stopped being reducible to words typed on a keyboard a long time ago. A user photographs a product in a store and asks, “How much is this online?” They say aloud, “What model is this?” while pointing the camera at a pair of headphones. They upload a screenshot from Instagram and ask, “Find something similar, but cheaper.” They record a video and add a text question: “What material is this made of?”
These are not exotic scenarios. Google Lens handles more than 20 billion visual queries per month, and 20% of them are shopping-related [1]. AI Mode is integrated with Google Lens: a user can take a photo or upload an image, and the system, using Gemini’s multimodal capabilities, analyzes the entire scene (objects, their context, materials, colors, and shapes) and produces a composite answer [2]. ChatGPT with GPT-4o processes images, voice, and text simultaneously. Voice search is already used by 27% of mobile users [3].
For a brand, this means that text optimization is a necessary but no longer sufficient condition for visibility. If your product cannot be recognized from a photo, if your YouTube video has no transcript, if a voice assistant cannot link the spoken company name to the correct entity, you lose the audience that searches without words.
How visual search changes the rules
Visual search works in a fundamentally different way from text search. The user does not describe what they are looking for—they show it. Convolutional neural networks (CNNs) transform an image into a numerical vector and compare it with a database of indexed images [4]. This means that the quality, consistency, and technical accessibility of images on a site directly affect whether your product will be found.
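To make the mechanism concrete, here is a minimal sketch of the embed-and-compare step. It uses the open-source sentence-transformers CLIP encoder as a stand-in for the proprietary models behind production systems like Lens; the catalog and file names are invented.

```python
# Minimal sketch: image -> vector -> nearest-neighbor lookup.
# Assumes a small local "catalog" of product photos; all file names
# are placeholders.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # maps images to 512-d vectors

# The "index": embeddings of the catalog's product photos.
catalog_files = ["dress_red.jpg", "dress_blue.jpg", "jacket.jpg"]
catalog_embs = model.encode([Image.open(f) for f in catalog_files])

# The user's photo becomes a vector in the same embedding space.
query_emb = model.encode(Image.open("user_photo.jpg"))

# Cosine similarity ranks the catalog; the top hit is the closest visual match.
scores = util.cos_sim(query_emb, catalog_embs)[0]
best = scores.argmax().item()
print(catalog_files[best], float(scores[best]))
```

Production systems run this lookup against billions of indexed images, which is why image quality and consistency matter: a blurry or heavily stylized photo lands in a less useful region of the embedding space.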
For e-commerce, the consequences are most obvious. A shopper sees a dress on the street, photographs it, and Google Lens shows similar products with prices from different online stores within three seconds. If your product images are low quality, lack descriptive alt text and Product schema, or follow no consistent photography style, they will not make it into that selection. A competitor with clean, marked-up photos will.
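In practice, “Product schema” is a JSON-LD block embedded in the product page. The sketch below assembles one in Python; every name, URL, SKU, and price is a placeholder, and the properties shown are a minimal illustrative subset of what schema.org’s Product type supports.

```python
import json

# Illustrative Product JSON-LD for one catalog item; all names, URLs,
# and prices here are placeholders, not real data.
product_jsonld = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Linen Midi Dress",
    "image": [
        "https://example.com/img/dress-front.jpg",
        "https://example.com/img/dress-detail.jpg",
    ],
    "description": "A-line midi dress in washed linen with side pockets.",
    "sku": "DRS-0042",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "offers": {
        "@type": "Offer",
        "price": "89.00",
        "priceCurrency": "EUR",
        "availability": "https://schema.org/InStock",
        "url": "https://example.com/dress-0042",
    },
}

# The output belongs in a <script type="application/ld+json"> tag
# on the corresponding product page.
print(json.dumps(product_jsonld, indent=2))
```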
Visual consistency across platforms is also becoming a factor. Google Lens is better at recognizing brands that use the same photography style across their website, marketplaces, and social platforms. A fragmented visual layer makes it harder to tie content to an entity [5].
Voice search and long queries
Voice queries differ from text queries not only in modality, but also in structure. A person speaking aloud uses natural sentences: “What’s the best café near me that’s open right now?” instead of “cafe near me open.” Queries in AI Mode are, on average, three times longer than ordinary search queries [6]. This means that content optimized for short keyword phrases may not match the way people formulate queries by voice.
For a brand, the practical implication is clear: FAQ sections written in a question-and-direct-answer format work better for voice search than long marketing texts. Structured data (FAQ schema, HowTo schema) helps voice assistants extract a specific answer. The brand name should be pronounceable and unambiguous—a model that cannot connect the spoken “Exco-Data” to the entity “ExcoData” will lose the brand in a voice query.
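As an illustration, here is a minimal sketch of FAQPage markup in that question-and-direct-answer shape, built as JSON-LD; the questions and answers are invented placeholders, not recommended copy.

```python
import json

# Illustrative FAQPage JSON-LD: each entry pairs a natural-language
# question with a short, direct answer that an assistant can read aloud.
faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Is the dress machine washable?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes. Machine wash cold on a gentle cycle and line dry.",
            },
        },
        {
            "@type": "Question",
            "name": "How long does delivery take?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Orders ship within 24 hours and arrive in 2 to 4 business days.",
            },
        },
    ],
}
print(json.dumps(faq_jsonld, indent=2))
```

HowTo markup for instructions follows the same pattern, with an ordered list of HowToStep items in place of question-answer pairs.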
Video and transcripts
AI systems are making increasing use of video content. YouTube video transcripts become a source for citation: if an expert in your video explains in detail how the product works, and the transcript is available, the model can extract a passage from it for an answer. If there is no transcript, the video remains invisible to the text layer of the answer system.
Google explicitly states that AI Mode uses multimodal analysis: the system works simultaneously with text, images, video, and context [2]. For a brand that publishes educational videos, reviews, or product demos, a clean and accurate transcript is not optional—it is a condition of discoverability.
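At the markup level, schema.org’s VideoObject type includes a transcript property, so the transcript can travel with the rest of the video’s metadata. The sketch below shows the shape; all URLs, dates, and the transcript excerpt are placeholders.

```python
import json

# Illustrative VideoObject JSON-LD that exposes the transcript to the
# text layer of answer systems; duration uses ISO 8601 format.
video_jsonld = {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "How the X200 water filter works",
    "description": "A four-minute demo of the X200 filtration stages.",
    "thumbnailUrl": "https://example.com/img/x200-thumb.jpg",
    "uploadDate": "2025-01-15",
    "duration": "PT4M12S",
    "contentUrl": "https://example.com/video/x200-demo.mp4",
    "transcript": (
        "The X200 passes water through three stages: a sediment "
        "pre-filter, an activated-carbon block, and a ceramic membrane..."
    ),
}
print(json.dumps(video_jsonld, indent=2))
```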
What to do right now
Multimodal optimization does not require a revolution. It requires extending familiar work into new formats.
Images: high quality, descriptive filenames and alt text, Product schema tied to specific products, and a consistent photography style across platforms.
Voice: FAQ sections in question-answer format, HowTo schema for instructions, and a pronounceable, unambiguous brand name.
Video: transcripts for every video on YouTube and on the site, VideoObject schema, and descriptive titles and metadata.
General layer: the same principle as for text visibility—structured data, machine readability, and external confirmation. Multimodality does not replace these foundations; it adds new input channels on top of them.
Visual search already processes tens of billions of queries per month. AI Mode integrates multimodal input (photo + text + voice). Video transcripts are used as a source for citation. Voice queries are longer and more conversational than text queries.
The exact share of AI answers initiated by visual or voice input is still poorly measured outside Google Lens. The effect of multimodal optimization on brand citation across different platforms has so far been studied only in fragments.
A brand needs to optimize not only text, but also images, video, and voice discoverability. The basic actions (alt text, transcripts, FAQ schema) are simple and can be started right away.
Sources