TikTok and Instagram have integrated AI directly into search: when a user asks for something, the system responds with content — and among that content there might be your competitor's video while yours doesn't show up. If your videos and your descriptions aren't structured to be read as answers, you're giving up space on two platforms where your customers spend hours every day. Adapting what you already publish so it becomes an AI answer is simpler than it seems.
Photograph your product, upload the picture to ChatGPT or Gemini and ask “what is this and who makes it”. Does it recognize it? If the answer is no, you’re losing an emerging visibility channel — and it’s not just an image problem, it’s a problem of how TikTok and Instagram are learning to read your content.
Let me explain why this blunt test is the right starting point for understanding where search is heading inside social media, and what you can change today in your captions and video descriptions to become an answer inside the app, not just a piece of content in the feed.
A multimodal engine inside the user’s phone
For years we thought of search as a white bar with text inside it. Now search inside TikTok and Instagram looks more like an assistant that watches, listens, reads the captions, cross-references the hashtags and tries to give you an answer. The engine behind this paradigm shift is the family of multimodal models, of which Gemini is the most documented example.
In the world of research on multimodal AI, the Gemini team describes the turning point like this:
“The visual encoding of Gemini models is inspired by our own foundational work on Flamingo, CoCa, and PaLI, with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens.”
Translated: Gemini is not a text model with an image component bolted on top. It was born multimodal, so it processes text, images and, by extension, video as if they were the same language. The operational consequence for your business is direct: when a social platform integrates multimodal models into its internal search, it stops searching only in the caption text and starts understanding the content of the video itself, the product packaging, the color of the ceramic, the writing on the bottom of the plate.
From the caption to the frame: what changes for those who publish
If you produce content on TikTok or Instagram, until yesterday 90% of the internal ranking signal came from the caption, the hashtags and the trending audio. With multimodal search entering the apps, the video frame and the static photo start to weigh as much as the text.
The Gemini team adds:
“In addition, Gemini models can directly ingest audio”
This means the caption text is no longer the only thing telling the engine what’s in the video: the audio does too, including your voice-over, the product name you say out loud, the city you’re in. This changes how you should think about recording a Reel or a TikTok: saying your brand name out loud in the first 3 seconds becomes an entity signal, not a quirk.
If you’ve already read how author entity recognition in AI models works, you understand where I’m going: the multimodal engine inside social media works with the same logic as the textual knowledge graph, except that the “text” is what it sees and hears.
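To make the 3-second rule concrete, here is a minimal sketch that checks a timestamped transcript for an early brand mention. The `Segment` structure, the hand-written transcripts and the speaker name are all hypothetical, invented for illustration; this is a self-check you could run on your own scripts, not a description of how any platform processes audio.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One chunk of a transcript: start time in seconds plus the spoken text."""
    start: float
    text: str


def brand_in_opening(segments: list[Segment], brand: str, window: float = 3.0) -> bool:
    """Return True if `brand` appears in a segment that starts within `window` seconds.

    Only the segment's start time is checked, so a long opening segment
    counts as a whole even if it runs past the window.
    """
    needle = brand.casefold()
    return any(
        seg.start < window and needle in seg.text.casefold()
        for seg in segments
    )


# Hypothetical transcripts, written by hand for illustration.
early = [
    Segment(0.0, "Hi, I'm Anna from Bottega Mediterranea"),
    Segment(3.5, "today I'll show you how we paint a plate"),
]
late = [
    Segment(0.0, "Today I'll show you how we paint a plate"),
    Segment(12.0, "This is Bottega Mediterranea, see you soon"),
]

print(brand_in_opening(early, "Bottega Mediterranea"))  # True
print(brand_in_opening(late, "Bottega Mediterranea"))   # False
```

The comparison is a plain case-insensitive substring match, so it will miss a brand name that a transcription tool has misspelled.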
Why the Caltagirone ceramist disappears (and what it has to do with your small business)
Imagine an artisan maker of Caltagirone-style ceramics. Let’s call it “Bottega Mediterranea”, a family business based in Agrigento that sells hand-painted plates online and follows the tradition of Sicilian maiolica. It publishes 3 Reels a week: hands painting, close-ups of the brush, the open kiln. Great images, but the captions carry only 2 generic hashtags such as #handmade and #madeinitaly.
When a potential customer opens Instagram and searches for “hand-painted Caltagirone ceramics”, the internal algorithm has to work out that this Reel is relevant. If the caption doesn’t say “Caltagirone ceramics”, if the audio never mentions “Bottega Mediterranea” or “Caltagirone”, and if the frames contain no clear references to the Sicilian tradition, the in-app search moves on, however beautiful the content looks.
The paradox is this: the content is well made for the human eye, but invisible to the multimodal engine that is maturing inside the app.
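A toy illustration of that gap, under heavy assumptions: the sketch below only counts which words of the customer's query appear in the caption and in a spoken transcript. It is not how Instagram or TikTok rank content (their engines are not public), it ignores the visual signal entirely, and the captions and transcripts are invented for the example.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercase ASCII word tokens; hyphenated words such as 'hand-painted' stay whole."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.casefold()))


def coverage(query: str, signals: dict[str, str]) -> dict[str, list[str]]:
    """For each text signal, list the query terms it contains (exact word match)."""
    wanted = terms(query)
    return {name: sorted(wanted & terms(text)) for name, text in signals.items()}


query = "hand-painted Caltagirone ceramics"

# Hypothetical text signals for the same Reel, before and after a rewrite.
before = {
    "caption": "#handmade #madeinitaly",
    "audio": "",
}
after = {
    "caption": "Hand-painted Caltagirone ceramics by Bottega Mediterranea",
    "audio": "Hi, this is Bottega Mediterranea, today we paint a Caltagirone plate",
}

print(coverage(query, before))  # {'caption': [], 'audio': []}
print(coverage(query, after))
# {'caption': ['caltagirone', 'ceramics', 'hand-painted'], 'audio': ['caltagirone']}
```

Exact word matching is deliberately crude: “ceramic” would not match “ceramics” here, which is a reminder to use the words your customers actually type.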
The test I ran with 15 artisan products
To understand where we are today, I ran a simple and honest hands-on test — with all its limitations, which I’ll state right away.
I took 15 artisan products from small Italian producers: Sicilian ceramics, Umbrian maiolica, Murano glass, Tuscan leather, Como silks. For each one I shot a frontal photo with a smartphone (neutral background, natural light) and uploaded the photo to ChatGPT (with vision) and to Gemini, asking both: “what is this object and who makes it?”.
The results, on a sample of 15 (indicative test, not a scientific study):
- In 11 cases out of 15, both models correctly recognized the product category (“it’s a decorated ceramic plate in Sicilian style”).
- In 4 cases out of 15, one of the two models also ventured a guess at the area of production (“it looks like Caltagirone ceramics” or “Deruta maiolica style”).
- In 0 cases out of 15, the models correctly named the specific producer.
The zero on the producer is the figure that matters. Multimodal AI today recognizes the style, the tradition, the category. It does not recognize the brand. And this opens a window of opportunity for those who move now: building the bridge between product-image and brand-name inside social media, before the competitors do.
Limitations of the test: 15 products are few, and I tested only ChatGPT and Gemini, not TikTok Search or Instagram Search directly, because their internal engines are not exposed via a public API. A rigorous analysis would require professional social tracking tools and test sessions inside the apps with multiple accounts.
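For readers who prefer percentages, the tallies reported above convert as follows. The counts are the ones from the test; the labels and the helper function are just an illustrative way to lay out the arithmetic, and with a sample of 15 each percentage should be read as a rough indication.

```python
SAMPLE_SIZE = 15

# Counts exactly as reported in the test above.
results = {
    "category recognized by both models": 11,
    "production area guessed by one of the two models": 4,
    "specific producer named correctly": 0,
}


def share(count: int, total: int = SAMPLE_SIZE) -> float:
    """Percentage of the sample, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in results.items():
    print(f"{label}: {count}/{SAMPLE_SIZE} = {share(count)}%")

# category recognized by both models: 11/15 = 73.3%
# production area guessed by one of the two models: 4/15 = 26.7%
# specific producer named correctly: 0/15 = 0.0%
```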
The mistakes I see most often
Going through portfolios of Italian small businesses and artisan companies, I always see the same 4 patterns that strip away visibility inside in-app search.
A single-line caption with nothing but emoji. “New collection ❤️🌊” tells the multimodal engine nothing: there’s no product name, no place, no category. It may earn aesthetic engagement, but it does nothing for internal search.
Generic hashtags copied from old tools. #instagood #photooftheday #love: in 2026 they add nothing. Three specific hashtags (#ceramicacaltagirone #maiolicasiciliana #artigianatosicilia) beat 30 generic ones, because they let the algorithm associate the content with relevant semantic clusters.
Brand name never pronounced in the audio. If you make a 30-second Reel and never say your company’s name, you’re telling the multimodal engine that the content belongs to a generic “ceramist”, not to you.
Video description with no location. A producer from Agrigento who never writes “Agrigento” or “Sicily” in the caption loses all the local queries like “ceramists Agrigento” or “Sicilian maiolica where to buy”.
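Two of these patterns, generic hashtags and a missing location, are easy to check for in your own captions. The sketch below is one possible way to do it: the `GENERIC` set contains only the hashtags named in this article, and the list of place names is something you would supply yourself. Both are assumptions of the example, not a list any platform publishes.

```python
import re

# Generic hashtags named in this article; extend the set to suit your niche.
GENERIC = {"#instagood", "#photooftheday", "#love", "#handmade", "#madeinitaly"}

HASHTAG = re.compile(r"#\w+")


def audit_caption(caption: str, places: list[str]) -> list[str]:
    """Return the problems found in a caption; an empty list means none were found."""
    problems = []

    # Pattern: generic hashtags.
    tags = {tag.casefold() for tag in HASHTAG.findall(caption)}
    generic = sorted(tags & GENERIC)
    if generic:
        problems.append("generic hashtags: " + " ".join(generic))

    # Pattern: no location anywhere in the caption (case-insensitive substring).
    text = caption.casefold()
    if not any(place.casefold() in text for place in places):
        problems.append("no location mentioned")

    return problems


places = ["Agrigento", "Sicily"]

print(audit_caption("New collection ❤️🌊 #handmade #madeinitaly", places))
# ['generic hashtags: #handmade #madeinitaly', 'no location mentioned']

print(audit_caption("Hand-painted maiolica plate from Agrigento, Sicily #maiolicasiciliana", places))
# []
```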
What to do concretely in the next 30 days
You don’t need to overhaul your content strategy. You need 4 concrete adjustments.
- Rewrite the captions as direct answers. The first 125 characters must contain: product name, material, technique, location. Example: “Hand-painted Caltagirone ceramic plate, Moor’s head decoration, Bottega Mediterranea Agrigento.”
- Say the brand name in the first 3 seconds of the Reels: “Hi, I’m [name] from [Bottega Mediterranea], today I’ll show you…” The audio becomes part of the multimodal signal.
- Add 1 location hashtag + 1 tradition hashtag + 1 technique hashtag. No generic #love or #handmade: the engine looks for clusters, not popularity.
- Write detailed alt text for Instagram photos. Instagram reads it and uses it: describe what’s in the image as if you were describing it to someone who can’t see.
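The first adjustment lends itself to a quick self-check before you publish. The sketch below looks for the four elements in the opening of a caption; the `Product` fields and the way the example caption is split into them are my own assumptions, and the 125-character limit is simply the rule of thumb stated above.

```python
from dataclasses import asdict, dataclass

OPENING_CHARS = 125  # rule of thumb from the list above


@dataclass
class Product:
    name: str
    material: str
    technique: str
    location: str


def missing_from_opening(caption: str, product: Product) -> list[str]:
    """Names of the product fields that do not appear in the caption's opening."""
    opening = caption[:OPENING_CHARS].casefold()
    return [
        field
        for field, value in asdict(product).items()
        if value.casefold() not in opening
    ]


# Hypothetical split of the example caption into the four elements.
plate = Product(
    name="plate",
    material="ceramic",
    technique="hand-painted",
    location="Agrigento",
)

good = ("Hand-painted Caltagirone ceramic plate, Moor's head decoration, "
        "Bottega Mediterranea Agrigento.")
weak = "New collection ❤️🌊"

print(missing_from_opening(good, plate))  # []
print(missing_from_opening(weak, plate))  # ['name', 'material', 'technique', 'location']
```

A caption can contain all four elements and still fail the check if they sit after the 125th character, which is exactly the case the rule is meant to catch.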
If you want to dig deeper into how the AI engine builds the link between image, brand name and recognizability, I recommend reading how to enter the Google Knowledge Graph and the weight of implicit mentions as an authority signal. They are two pieces of the same construction.
The thread: showing up in AI answers, even inside the apps
In the articles of this series I’m leading you to a precise point: visibility in AI answers is no longer just a matter of ChatGPT or Perplexity in the browser. It’s shifting inside the platforms your customers use every day — TikTok, Instagram, soon WhatsApp Business with Meta AI.
The multimodal approach behind Gemini that I quoted is not confined to a closed lab: the same logic is entering the internal search of the apps. Those who optimize captions, audio and descriptions now as if they were SEO content (more precisely, “GEO content”) build an advantage that competitors will close only slowly.
In the next articles of the series we’ll see how in-app search works on Bing Copilot and how the strategy changes when the user moves from the browser bar to the assistant built into the app. The thread stays the same: be the answer, not just the result.