AI Platforms

AI in Social Media (TikTok, Instagram): how your videos become answers inside the apps

Roberto Serra · 25 June 2026 · ~8 min read

TikTok and Instagram have integrated AI directly into search: when a user asks for something, the system responds with content — and among that content there might be your competitor's video while yours doesn't show up. If your videos and your descriptions aren't structured to be read as answers, you're giving up space on two platforms where your customers spend hours every day. Adapting what you already publish so it becomes an AI answer is simpler than it seems.

Photograph your product, upload the picture to ChatGPT or Gemini and ask “what is this and who makes it”. Does it recognize it? If the answer is no, you’re losing an emerging visibility channel — and it’s not just an image problem, it’s a problem of how TikTok and Instagram are learning to read your content.

Let me explain why this quick test is the right starting point for understanding where search inside social media is heading, and what you can do today to your captions and video descriptions to become an answer inside the app, not just a piece of content in the feed.

A multimodal engine inside the user’s phone

For years we thought of search as a white bar with text inside it. Now search inside TikTok and Instagram looks more like an assistant that watches, listens, reads the captions, cross-references the hashtags and tries to give you an answer. The engine behind this paradigm shift is the family of multimodal models, of which Gemini is the most documented example.

In the world of research on multimodal AI, the Gemini team describes the turning point like this:

The visual encoding of Gemini models is inspired by our own foundational work on Flamingo, CoCa, and PaLI, with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens.

Gemini Team, 2023

Translated: Gemini is not a text model with an image component bolted on top. It was born multimodal, so it processes text, images and, predictably, video as if they were one language. The operational consequence for your business is direct: when a social platform integrates multimodal models into its internal search, it stops searching only in the caption text and starts understanding the content of the video itself, the product packaging, the color of the ceramic, the writing on the bottom of the plate.

From the caption to the frame: what changes for those who publish

If you produce content on TikTok or Instagram, until yesterday 90% of the internal ranking signal came from the caption, the hashtags and the trending audio. With multimodal search entering the apps, the video frame and the static photo start to weigh as much as the text.

The Gemini team adds:

In addition, Gemini models can directly ingest audio

Gemini Team, 2023

It means it’s no longer just the caption text that tells the engine what’s in the video: it’s also the audio, including your voice-over, the name of the product you pronounce, the city you’re in. This changes the way you think about recording a Reel or a TikTok: saying your brand name out loud in the first 3 seconds becomes an entity signal, not a quirk.

If you’ve already read how author entity recognition in AI models works, you understand where I’m going: the multimodal engine inside social media works with the same logic as the textual knowledge graph, except that the “text” is what it sees and hears.

Common mistake

“New collection ❤️🌊” tells the multimodal engine nothing: there’s no product name, no place, no category.

Why the Caltagirone ceramist disappears (and what it has to do with your small business)

Imagine an artisan workshop that makes Caltagirone-style ceramics. Let’s call it “Bottega Mediterranea”: a family business from Agrigento that sells hand-painted plates online and follows the tradition of Sicilian maiolica. It publishes 3 Reels a week: hands painting, close-ups of the brush, the open kiln. Great images, but captions with 2 generic hashtags like #handmade #madeinitaly.

When a potential customer opens Instagram and searches for “hand-painted Caltagirone ceramics”, the internal algorithm has to understand that this Reel is relevant. If the caption doesn’t say “Caltagirone ceramics”, if the audio never pronounces “Bottega Mediterranea” and “Caltagirone”, if the frames contain no clear references to the Sicilian tradition — the in-app search moves on, even if the content is visually beautiful.

The paradox is this: the content is well made for the human eye, but invisible to the multimodal engine that is maturing inside the app.

Pro tip

The first 125 characters must contain: product name, material, technique, location.

The test I ran with 15 artisan products

To understand where we are today, I ran a simple and honest hands-on test — with all its limitations, which I’ll state right away.

I took 15 artisan products from small Italian producers: Sicilian ceramics, Umbrian maiolica, Murano glass, Tuscan leather, Como silks. For each one I shot a frontal photo with a smartphone (neutral background, natural light) and uploaded the photo to ChatGPT (with vision) and to Gemini, asking both: “what is this object and who makes it?”.

The results, on a sample of 15 (indicative test, not a scientific study):

In 11 cases out of 15, both models correctly recognized the product category (“it’s a decorated ceramic plate in Sicilian style”).
In 4 cases out of 15, one of the two models also ventured a guess at the area of production (“it looks like Caltagirone ceramics” or “Deruta maiolica style”).
In 0 cases out of 15, the models correctly named the specific producer.
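
To make the three tallies easier to compare, they can be turned into percentages. This is only a small illustrative sketch in Python: the function name recognition_rates and the labels are my own assumptions, and with 15 products the rates stay indicative, exactly like the counts they come from.

```python
# Counts taken from the list above (sample of 15 products).
SAMPLE_SIZE = 15
hits = {
    "category": 11,         # both models recognized the product category
    "production area": 4,   # one of the two models guessed the area
    "producer": 0,          # the specific producer was named correctly
}

def recognition_rates(hits, sample_size):
    """Convert hit counts into percentages rounded to one decimal place."""
    return {level: round(100 * count / sample_size, 1)
            for level, count in hits.items()}

print(recognition_rates(hits, SAMPLE_SIZE))
# prints {'category': 73.3, 'production area': 26.7, 'producer': 0.0}
```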

The zero on the producer is the figure that matters. Multimodal AI today recognizes the style, the tradition, the category. It does not recognize the brand. And this opens a window of opportunity for those who move now: building the bridge between product-image and brand-name inside social media, before the competitors do.

Limitations of the test: 15 products are a small sample, and I tested only ChatGPT and Gemini, not TikTok Search or Instagram Search directly, because their internal engines are not exposed via a public API. A rigorous analysis would require professional social tracking tools and test sessions in the apps with multiple accounts.

The mistakes I see most often

Going through portfolios of Italian small businesses and artisan companies, I always see the same 4 patterns that strip away visibility inside in-app search.

A single-line caption with little more than emoji. “New collection ❤️🌊” tells the multimodal engine nothing: there’s no product name, no place, no category. It may work for aesthetic engagement, but it does nothing for internal search.

Generic hashtags copied from old tools. #instagood #photooftheday #love: in 2026 they are noise. Three specific hashtags (#ceramicacaltagirone #maiolicasiciliana #artigianatosicilia) beat 30 generic ones, because they let the algorithm associate the content with relevant semantic clusters.

Brand name never pronounced in the audio. If you make a 30-second Reel and never say your company’s name, you’re telling the multimodal engine that the content belongs to a generic “ceramist”, not to you.

Video description with no location. A producer from Agrigento who never writes “Agrigento” or “Sicily” in the caption loses all the local queries like “ceramists Agrigento” or “Sicilian maiolica where to buy”.
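
If you want to check your own posts against these four patterns, a rough script is enough. The sketch below is an illustration in Python, not a tool the platforms provide: the function name audit_post, the list of generic hashtags and the four-word threshold are all assumptions, and it only reads text you paste in yourself (the caption and a transcript of your audio).

```python
import re

# Hypothetical list of generic hashtags; extend it for your own niche.
GENERIC_HASHTAGS = {"#instagood", "#photooftheday", "#love",
                    "#handmade", "#madeinitaly"}

def audit_post(caption, audio_transcript, brand, locations):
    """Return the mistakes found in one post (an empty list means none)."""
    issues = []
    text = caption.lower()
    hashtags = re.findall(r"#\w+", text)
    # Alphabetic words left once hashtags are stripped; emoji do not count.
    words = re.findall(r"[^\W\d_]+", re.sub(r"#\w+", " ", text))
    if len(words) < 4:  # assumed threshold for "almost no text"
        issues.append("caption has almost no descriptive text")
    if hashtags and all(tag in GENERIC_HASHTAGS for tag in hashtags):
        issues.append("only generic hashtags")
    if brand.lower() not in audio_transcript.lower():
        issues.append("brand name never spoken in the audio")
    if not any(loc.lower() in text for loc in locations):
        issues.append("no location in the caption")
    return issues

# Example with the caption from the first mistake and an invented transcript.
for issue in audit_post(
    "New collection ❤️🌊 #handmade #madeinitaly",
    "Today I paint a plate",
    "Bottega Mediterranea",
    ["Caltagirone", "Agrigento", "Sicily"],
):
    print(issue)
```

The checks are simple substring matches, so a place name that appears only inside a hashtag still counts as a location.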

What to do concretely in the next 30 days

You don’t need to overhaul your content strategy. You need 4 concrete adjustments.

Rewrite the captions as direct answers. The first 125 characters must contain: product name, material, technique, location. Example: “Hand-painted Caltagirone ceramic plate, Moor’s head decoration, Bottega Mediterranea Agrigento.”
Pronounce the brand in the first 3 seconds of the Reels. “Hi, I’m [name] from [Bottega Mediterranea], today I’ll show you…”. The audio enters the multimodal signal.
Add 1 location hashtag + 1 tradition hashtag + 1 technique hashtag. Skip generic tags such as #love or #handmade: the engine looks for clusters, not popularity.
Write detailed alt text for Instagram photos. Instagram reads it and uses it: describe what’s in the image as if you were telling it to someone who can’t see.
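
The first adjustment is easy to verify before you hit publish. Here is a minimal sketch in Python, assuming you list the terms you expect yourself: the function name missing_from_lead and the required dictionary are hypothetical, and the 125-character window is the rule of thumb from this article, not a limit documented by the platforms.

```python
LEAD_LENGTH = 125  # rule of thumb from the article, not a platform constant

def missing_from_lead(caption, required):
    """Return the labels whose term is absent from the first 125 characters."""
    lead = caption[:LEAD_LENGTH].lower()
    return [label for label, term in required.items()
            if term.lower() not in lead]

# Terms chosen by hand for the example caption; adapt them to each product.
required = {
    "product": "plate",
    "material": "ceramic",
    "technique": "hand-painted",
    "location": "Caltagirone",
}
caption = ("Hand-painted Caltagirone ceramic plate, Moor's head decoration, "
           "Bottega Mediterranea Agrigento.")

print(missing_from_lead(caption, required))  # prints []
```

An empty list means all four elements appear early enough. Any label that comes back tells you what to move toward the start of the caption.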

If you want to dig deeper into how the AI engine builds the link between image, brand name and recognizability, I recommend reading how to enter the Google Knowledge Graph and the weight of implicit mentions as an authority signal. They are two pieces of the same construction.

The thread: showing up in AI answers, even inside the apps

In the articles of this series I’m leading you to a precise point: visibility in AI answers is no longer just a matter of ChatGPT or Perplexity in the browser. It’s shifting inside the platforms your customers use every day — TikTok, Instagram, soon WhatsApp Business with Meta AI.

The multimodal approach behind Gemini that I quoted is not confined to a lab: the same logic is entering the internal search of the apps. Those who optimize captions, audio and descriptions now as if they were SEO content (more precisely, “GEO content”) build an advantage that competitors will take a long time to close.

In the next articles of the series we’ll see how in-app search works on Bing Copilot and how the strategy changes when the user moves from the browser bar to the assistant built into the app. The thread stays the same: be the answer, not just the result.


The author

Roberto Serra at the Senate of the Republic

Senate of the Republic · Palazzo Giustiniani Conference “The power of artificial intelligence”

Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
