Content Structure for AI

Got hours of excellent video? Without a transcript, they don’t exist to AI

Roberto Serra · 25 June 2026 · ~7 min read

You have hours of video and podcasts with excellent technical content: interviews, analyses, case studies. To AI, none of it exists. Without text, every hour of audio might as well never have been produced: it's invisible to ChatGPT, invisible to Perplexity, invisible to anyone using AI to research your field. You're producing value that never accumulates. Turning that content into citable assets, without creating anything new, is simpler and faster than you think.

Think about all the videos you’ve published. The interviews, the webinars, the episodes of your company podcast. Content where you explain your work, answer customer questions and walk through case studies in a depth you’ve never reached in the text on your site. For the people who listen to it, it’s your best material.

To AI, they don’t exist.

That’s not a figure of speech. The systems that power the answers of ChatGPT, Perplexity and Gemini work on text. When the crawler reaches your page and finds an embedded video player or a link to Spotify, it can’t extract anything. It doesn’t listen to the audio, it doesn’t watch the video. It sees an element on the page and skips it. All the value you put into that content — the expertise, the data, the explanations — stays locked in a format the retrieval system can’t read.

The solution is simpler than you think: transcribe everything and publish the text on the page. Every transcribed episode becomes a textual asset that AI can index, break into chunks and return as an answer.

Why text is the only currency that counts

In research on retrieval-augmented generation (RAG) systems, the ones that retrieve information to generate AI answers, one principle comes up again and again:

“Unstructured Data, such as text, is the most widely used retrieval source.”

Gao et al., 2024

Not images, not audio, not video. Text. This doesn’t mean audio and video have no value: it means that value doesn’t enter the retrieval cycle as long as it stays in audio or video format.

The mechanism is concrete: the system takes the text of a page, breaks it into blocks of a few hundred tokens, and indexes them in a vector space. When a query comes in, it looks for the most relevant blocks and passes them to the model as context to generate the answer. If your content isn’t text, it isn’t broken up, it isn’t indexed, it isn’t found. It’s not a penalty — it’s a technical impossibility.
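
To make the mechanism tangible, here is a minimal sketch of the chunk, index and retrieve cycle. It is only an illustration: a word-count vector stands in for the learned embeddings a real system would use, and the function names, the sample text and the tiny chunk size are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 200) -> list[str]:
    """Split text into consecutive blocks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def vectorize(text: str) -> Counter:
    """Bag-of-words counts. A real system would use a learned embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


# Hypothetical page text (a transcript excerpt) versus a page with only a player.
page_text = (
    "We cut onboarding time by moving the setup wizard to the first login. "
    "Pricing for the enterprise plan depends on the number of seats."
)
video_only_page = ""  # an embedded player contributes no text at all

chunks = chunk_words(page_text, size=13)  # tiny size, just for the demo
print(retrieve("how is enterprise pricing calculated", chunks))
print(retrieve("how is enterprise pricing calculated", chunk_words(video_only_page)))
```

The page with text returns the pricing sentence. The page with only a player produces no chunks, so there is nothing to rank and nothing to return.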

From audio to text: what changes in retrieval

When you transcribe a video or a podcast and publish the transcript as text on the page, you’re creating new content in the eyes of the system. Before, you had a page with a player and maybe an introductory paragraph. After, you have a page with thousands of words of dense content, full of answers to specific questions, detailed explanations, concrete examples.

Gao et al.’s paper explains what happens at that point:

“These chunks are subsequently used as the expanded context in prompt.”

Gao et al., 2024

Every block of text becomes a potential piece of context the model can use to build an answer. A 40-minute podcast, transcribed, produces around 5,000-7,000 words. That’s dozens of chunks, each with the potential to be extracted and cited. Before transcription, that content generated zero usable chunks.
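
The “dozens of chunks” figure is easy to sanity-check with a rough estimate. The sketch below assumes about 1.3 tokens per English word and chunks of 300 tokens; both values are illustrative assumptions, not figures from Gao et al., and real tokenizers and chunkers will vary.

```python
import math


def estimate_chunks(words: int, tokens_per_word: float = 1.3, chunk_tokens: int = 300) -> int:
    """Rough number of chunks a transcript of `words` words would produce."""
    return math.ceil(words * tokens_per_word / chunk_tokens)


def speaking_rate(words: int, minutes: int) -> float:
    """Words per minute implied by a transcript length."""
    return words / minutes


for words in (5_000, 7_000):
    print(words, "words:", estimate_chunks(words), "chunks,",
          speaking_rate(words, 40), "words per minute")
```

Under these assumptions a 40-minute episode lands somewhere between 22 and 31 chunks. Before transcription the same function gets zero words and returns zero chunks.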

And here’s something many people don’t consider. In a podcast or a video, you tend to speak differently from how you write. You’re more direct, you use more concrete examples, you answer questions with language that resembles users’ queries. This linguistic naturalness is an advantage: the chunks that come out of it have a strong semantic alignment with the questions people ask AI engines.

The quality of indexing depends on the quality of the text

Transcribing isn’t enough, though. A raw transcript, full of filler words, repetitions and unfinished sentences, produces low-quality chunks. And the quality of the indexed text matters:

“The goal of optimizing indexing is to enhance the quality of the content being indexed.”

Gao et al., 2024

In practice this means that a clean transcript, with complete sentences and headings that organize the topics, gets indexed better than a wall of text full of “um”, “I mean” and “as I was saying”.
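
A first pass over the filler words can be automated before the manual edit. This is a minimal sketch, assuming a short hand-picked filler list (the `FILLERS` list and the function name are hypothetical). A phrase like “I mean” is sometimes meaningful, so the output still needs a human read.

```python
import re

# Assumed filler list for illustration; extend it for your own speakers.
FILLERS = ["um", "uh", "I mean", "you know", "as I was saying"]

_FILLER_RE = re.compile(
    r"(?:,\s*)?\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?",
    flags=re.IGNORECASE,
)


def strip_fillers(text: str) -> str:
    """Remove filler phrases, then tidy spacing and sentence capitals."""
    cleaned = _FILLER_RE.sub("", text)
    cleaned = re.sub(r"\s{2,}", " ", cleaned)            # collapse double spaces
    cleaned = re.sub(r"\s+([,.;:!?])", r"\1", cleaned)   # no space before punctuation
    cleaned = cleaned.strip()
    # Re-capitalize the first letter of each sentence.
    return re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), cleaned)


raw = "So, um, the crawler reads text. I mean, it skips the player."
print(strip_fillers(raw))
# So the crawler reads text. It skips the player.
```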

This doesn’t mean rewriting the content from scratch. It means doing a light edit: removing repetitions, completing interrupted sentences, adding headings that signal shifts in topic. If in your podcast you move from topic A to topic B, a heading between the two blocks lets the system create two distinct chunks instead of a single mixed one — and a chunk focused on a specific topic is more likely to be retrieved for a query on that topic.
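
The effect of headings on chunking can be sketched in a few lines. The example assumes the edited transcript is plain text in which each heading line starts with `## `; that marker, the function name and the sample transcript are assumptions for illustration, and real chunkers use their own rules.

```python
def split_by_heading(transcript: str, marker: str = "## ") -> list[dict]:
    """Group transcript lines into one section per heading line."""
    sections: list[dict] = []
    current = None
    for line in transcript.splitlines():
        if line.startswith(marker):
            # A heading opens a new section, and so a new chunk.
            current = {"heading": line[len(marker):].strip(), "lines": []}
            sections.append(current)
        elif line.strip():
            if current is None:
                # Text before the first heading goes into an untitled section.
                current = {"heading": "", "lines": []}
                sections.append(current)
            current["lines"].append(line.strip())
    return [{"heading": s["heading"], "text": " ".join(s["lines"])} for s in sections]


transcript = """## How we cut onboarding time
We moved the setup wizard to the first login.
Support tickets dropped in the first month.

## How enterprise pricing works
Pricing depends on the number of seats.
"""

for section in split_by_heading(transcript):
    print(section["heading"], "->", section["text"])
```

Each section stays on a single topic, so a query about pricing can match the pricing chunk without dragging in the onboarding material.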

How to structure the transcript on the page

I’ve seen sites that publish the transcript as a downloadable PDF. For the user it can work, for AI retrieval it’s almost useless: the crawler has to download the file, parse it, extract the text — and many systems simply don’t do that. The text needs to be in the HTML page, directly in the body.
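
A quick way to check a page is to measure how much readable text sits in its HTML. The sketch below uses Python's standard `html.parser` to count visible words; the two sample pages and their URLs are hypothetical, and a real crawler's extraction logic will be more elaborate than this.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect text nodes, ignoring script and style content."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_word_count(html: str) -> int:
    """Number of words a text-only reader would find in the page."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts).split())


player = '<video src="https://example.com/ep12.mp4" controls></video>'
pdf_page = '<h1>Episode 12</h1>' + player + '<a href="transcript.pdf">Download transcript</a>'
html_page = ('<h1>Episode 12</h1>' + player +
             '<h2>How we cut onboarding time</h2>'
             '<p>We moved the setup wizard to the first login.</p>')

print(visible_word_count(pdf_page), visible_word_count(html_page))
```

The page that links to a PDF exposes only its title and the link label. The page with the transcript in the body exposes the content itself.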

The structure that works best is this:

Headings with the topic of each section. Not generic timestamps like “Minute 12:30” but descriptive headings: “How we solved problem X for client Y”. The heading becomes the signal that tells the system what that block of text is about.
Timestamps as a reference, not as structure. Timestamps help the user who wants to jump to a point in the video — but they shouldn’t replace section titles. Put them as a note in parentheses next to the heading, not as headings in their own right.
Blocks of 200-400 words per section. Just like I explained when talking about alt text as content for AI: every multimedia element needs a textual representation the system can break into standalone chunks. The same principle applies to transcripts — each section must contain a complete concept.
VideoObject or PodcastEpisode schema markup. Even though JSON-LD doesn’t directly affect RAG retrieval, it helps traditional search engines connect the page to the original multimedia content. It’s double coverage that costs a few minutes to implement.
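
For the schema markup, a minimal sketch of generating a `VideoObject` JSON-LD block might look like this. The property names (`name`, `description`, `uploadDate`, `contentUrl`, `duration`, `transcript`) come from schema.org; the function name and every value in the example are placeholders. A `PodcastEpisode` would need its own set of properties, so treat this as a starting point to check against the schema.org documentation.

```python
import json


def video_jsonld(name: str, description: str, upload_date: str,
                 content_url: str, duration_iso: str, transcript: str) -> str:
    """Build a <script> tag holding VideoObject JSON-LD for one episode."""
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "uploadDate": upload_date,    # ISO 8601 date
        "contentUrl": content_url,
        "duration": duration_iso,     # ISO 8601 duration, e.g. PT40M
        "transcript": transcript,
    }
    body = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so transcript text cannot close the script tag early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values for illustration only.
tag = video_jsonld(
    name="How we cut onboarding time",
    description="Interview about moving the setup wizard to the first login.",
    upload_date="2026-06-25",
    content_url="https://example.com/videos/onboarding.mp4",
    duration_iso="PT40M",
    transcript="We moved the setup wizard to the first login.",
)
print(tag)
```

The markup complements the visible transcript; it does not replace it. The text still has to be in the page body.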

The hidden competitive advantage

Most of your competitors publish video and podcasts without a transcript. At most they add a summary paragraph and the player. This means hours and hours of valuable content stay invisible to AI systems.

If you transcribe everything, you’re doing two things at once. First: you’re multiplying the volume of indexable textual content on your site without producing new content — that content already exists, you just have to make it readable. Second: you’re occupying space in an area where the competitors aren’t, because they didn’t bother to do it.

I’ve seen cases where a single monthly podcast, transcribed, generates more indexable textual content than the entire company blog. Think about what happens over the long term: 12 episodes a year, 5,000 words each, 60,000 words of dense, specific content that AI can cite. It’s content that speaks the language of queries — because it’s born as an answer to real questions.

What you can do this week

Take your most recent video or podcast. Use an automatic transcription service — there are dozens, many free — and get the raw text. Then spend half an hour editing it: remove the repetitions, complete the sentences, add 4-5 descriptive headings that signal the main topics.

Publish that transcript as text on the same page as the video, below the player. Not as a separate page, not as a PDF — as HTML content on the page. Then do the same with the next episodes, and when you have time, go back through the archive.
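
As a rough sketch of what that HTML could look like, the function below renders edited sections as headings and paragraphs, with the timestamp in parentheses next to each heading. The `id="transcript"` wrapper, the `h2` level and the section dictionary shape are assumptions for the example; adapt them to your own templates. A second helper flags sections outside the 200-400 word range.

```python
from html import escape


def render_transcript(sections: list[dict]) -> str:
    """Render sections as HTML to place below the player."""
    parts = ['<section id="transcript">']
    for s in sections:
        stamp = f' ({escape(s["timestamp"])})' if s.get("timestamp") else ""
        parts.append(f'  <h2>{escape(s["heading"])}{stamp}</h2>')
        for para in s["text"].split("\n\n"):
            if para.strip():
                parts.append(f"  <p>{escape(para.strip())}</p>")
    parts.append("</section>")
    return "\n".join(parts)


def out_of_range(sections: list[dict], low: int = 200, high: int = 400) -> list[str]:
    """Headings of sections whose word count falls outside low..high."""
    return [s["heading"] for s in sections
            if not low <= len(s["text"].split()) <= high]


# Hypothetical section, far shorter than a real one would be.
sections = [{
    "heading": "How we solved problem X for client Y",
    "timestamp": "12:30",
    "text": "First paragraph.\n\nSecond & last.",
}]

print(render_transcript(sections))
print("Check length:", out_of_range(sections))
```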

It’s a first step. For systematic work — schema markup, chunk optimization, an integrated audio-text editorial strategy — you need specific skills and a holistic view of how your multimedia content connects to the rest of the site. But even just the basic transcript turns invisible content into something AI can find.

The next step is understanding how to apply the same principle to other visual formats. I cover it in the deep dives on infographics with parallel text, informative captions and diagrams as structured text — because the problem is the same: anything that isn’t text doesn’t exist to AI. And the solution is always to convert it.

The author

Roberto Serra at the Senate of the Republic, Palazzo Giustiniani, for the conference “The power of artificial intelligence”

Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
