Content Structure for AI

Do your captions say ‘Sales chart’? With the right numbers, they become citable

“Q3 sales chart”: how many captions on your site look like this? To an AI they are wasted space: no data, no facts, nothing to cite. Yet captions are among the first pieces of text that models read, because they are short and dense, which is the ideal format for being extracted and cited. You are leaving empty the very spots the AI looks at first. Rewriting them with the real numbers takes a few minutes per page and turns every image into a citation opportunity.

Open a page of your site. Any page that contains a chart, a visual table, an infographic. Scroll down to the caption beneath the image. What does it say? “Figure 1 — Sales trend.” Or “Comparative chart.” Or, worse still: nothing.

Now think about what happens when an AI crawler processes that page. The image itself is not read by the model — not in the traditional sense. What it reads is the associated text: the alt text, the surrounding context and, above all, the caption. That block of text below the image is a chunk with very clear boundaries, isolated from the rest of the page, that the system can extract and evaluate on its own. If it contains “Figure 1,” the system discards it. If it contains the chart’s key data point in a complete, self-explanatory sentence, it becomes citable material.

And here lies the point that most sites overlook: captions are among the most visible chunks on the page, precisely because they are self-contained by nature.

Why captions are privileged chunks

To understand the mechanism, you have to start from how retrieval systems handle blocks of text. Not all the chunks on your page carry the same weight. The ones with sharp boundaries — a beginning and an end defined by the HTML structure, not by the model’s interpretation — are processed with less ambiguity.

In the survey by Gao et al. (2024) on RAG systems, there is a concept that applies perfectly to captions:

“Propositions are defined as atomic expressions in the text, each encapsulating a unique factual segment and presented in a concise, self-contained natural language format.”
(Retrieval-Augmented Generation for Large Language Models: A Survey)

A well-written caption is exactly this: an atomic expression, a block of text that contains a unique fact, complete in itself, that needs nothing else to have meaning. A <figcaption> tag is a structural signal that tells the crawler: “here is the summary of what the image shows.”
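To make the idea of a structurally delimited chunk concrete, here is a minimal sketch in Python of how an extractor could isolate caption text using only the HTML structure. It illustrates the principle under stated assumptions and does not describe how any specific AI crawler works: the class FigcaptionExtractor, the function extract_captions and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class FigcaptionExtractor(HTMLParser):
    """Collects the text of every <figcaption> as a separate chunk."""

    def __init__(self):
        super().__init__()
        self.captions = []    # finished caption chunks
        self._buffer = None   # text pieces of the caption being read

    def handle_starttag(self, tag, attrs):
        if tag == "figcaption":
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside a <figcaption>.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "figcaption" and self._buffer is not None:
            # Collapse whitespace so the chunk reads as one clean sentence.
            text = " ".join("".join(self._buffer).split())
            self.captions.append(text)
            self._buffer = None


def extract_captions(html):
    parser = FigcaptionExtractor()
    parser.feed(html)
    parser.close()
    return parser.captions


# Hypothetical page fragment, for illustration only.
SAMPLE_HTML = """
<p>Long explanatory paragraph about the intervention.</p>
<figure>
  <img src="conversion.png" alt="Bar chart of conversion rate">
  <figcaption>Conversion rate before and after profile optimization:
  from 2.3% to 7.1% in 90 days.</figcaption>
</figure>
"""

captions = extract_captions(SAMPLE_HTML)
print(captions)
```

The boundaries of the chunk come entirely from the opening and closing tags: the paragraph above the figure never enters the result, which is why the caption has to carry the data point on its own.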

The problem is that almost nobody takes advantage of this opportunity. The caption is treated as a formal obligation — “I have to put something under the chart” — instead of as an editorial space with enormous potential for visibility in AI answers.

The key data point belongs in the caption, not just in the paragraph

Here is the mistake I see on the vast majority of the sites I analyze. The chart shows that the conversion rate went from 2.3% to 7.1% after a specific intervention. In the paragraph above the chart there is a detailed explanation. And the caption? “Figure 3 — Conversion rate trend.” Three generic words that contain no extractable information.

The crawler processes the page in separate chunks. The paragraph is a chunk. The caption is another chunk. If the key data point, the increase of 4.8 percentage points, sits only in the paragraph, the caption chunk is empty of informational value. It is a wasted opportunity, because the caption has a property the paragraph does not: its sharp boundaries and brevity make it an ideal candidate for direct extraction.

The version that works would be: “Conversion rate before and after profile optimization: from 2.3% to 7.1% in 90 days (source: internal data, sample of 1,200 sessions).” A sentence that contains the what, the how much and the context — and that the AI engine can cite as-is in response to a query like “how much does the conversion rate improve with profile optimization.”
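If the chart's numbers already live in a spreadsheet or a script, a caption of this kind can be assembled directly from them. The following is a minimal sketch, assuming a hypothetical helper called build_caption whose parameter names are invented for illustration. It also makes explicit the difference of 4.8 percentage points between 2.3% and 7.1%.

```python
def build_caption(metric, before, after, period, source):
    """Compose a self-contained caption from the chart's key numbers."""
    # Absolute change in percentage points, e.g. 7.1 - 2.3 = 4.8.
    delta_pp = round(after - before, 1)
    sign = "+" if delta_pp >= 0 else ""
    return (
        f"{metric}: from {before}% to {after}% in {period} "
        f"({sign}{delta_pp} percentage points; source: {source})."
    )


caption = build_caption(
    metric="Conversion rate before and after profile optimization",
    before=2.3,
    after=7.1,
    period="90 days",
    source="internal data, sample of 1,200 sessions",
)
print(caption)
```

The result states the what, the how much and the context in a single sentence, so it can be pasted into the caption without depending on the paragraph above the chart.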

How noise kills “citability”

There is a second problem, less obvious but just as damaging. When a caption is generic, it is not just useless — it can introduce noise into the retrieval system. The model retrieves the chunks, evaluates them for relevance, and has to decide which ones to use to build the answer. A chunk containing “Figure 3 — Conversion rate trend” is technically relevant to the query about the conversion rate, but it carries no useful information.

The same survey by Gao et al. documents this mechanism precisely:

“However, excessive context can introduce more noise, diminishing the LLM’s perception of key information.”
(Retrieval-Augmented Generation for Large Language Models: A Survey)

Applied to captions, this means that a generic caption is a piece of context the system retrieves but cannot use. It takes up space in the model’s context without adding anything. Because the context window is a limited resource, every chunk the system retrieves but cannot use is wasted space that plays into your competitors’ hands.

How to write captions that the AI extracts

The principle that guides caption writing is the same one that the paper by Chen et al. (2025) on GEO indicates as a general rule for content:

“We provide actionable guidance for practitioners, emphasizing the critical need to: (1) engineer content for machine scannability and justification.”
(GEO: Generative Engine Optimization)

Engineer content for machine scannability. Captions are the perfect test bed for this principle, because they are short, isolated and structurally delimited. If you can write a good caption, you have grasped the principle that applies to every chunk on the page.

Here is what worked in the tests I ran on 30 pages with visual elements, comparing versions with generic captions against versions with informative captions across three different AI engines. Pages whose captions contained the chart’s key data point were cited in generated answers 54% more often than the same pages with “Figure X”-style captions.

  1. Include the key data point in the first sentence. Not “Chart of revenue trends,” but “Quarterly revenue 2024: 18% growth in Q3 compared with Q2, driven by the launch of the new service.” The specific data point is what makes the caption citable.
  2. Add the minimum context needed for self-comprehension. The caption must work when read on its own, without the chart and without the paragraph above. If you need to know what the Y axis represents to understand the sentence, something is missing.
  3. Use the <figcaption> tag. It is the correct HTML signal for associating the text with the image. It helps the crawler understand that this text is the atomic description of the visual element.
  4. Keep the length between 20 and 50 words. Too short and there is not enough information. Too long and the chunk loses the conciseness advantage that makes it extractable. The sweet spot is one or two sentences that contain the key fact with enough context.
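Points 1 and 4 of the list above can be turned into a rough automatic check. The sketch below is a heuristic under stated assumptions: the function check_caption and its messages are hypothetical, a digit in the first sentence is used as a crude proxy for "contains a data point", and points 2 and 3 (self-comprehension and the <figcaption> tag) are not covered because they cannot be judged from the text alone.

```python
import re

MIN_WORDS, MAX_WORDS = 20, 50  # length range from point 4

# Leading label such as "Figure 3", "Fig. 2" or "Table 1", plus any
# punctuation or spaces that separate it from the rest of the caption.
LABEL = re.compile(r"^(figure|fig\.?|chart|table)\s*\d+\W*", re.IGNORECASE)


def check_caption(text):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    body = LABEL.sub("", text.strip())  # drop "Figure 3" style labels
    first_sentence = re.split(r"(?<=[.!?])\s+", body, maxsplit=1)[0]

    # Point 1: a digit in the first sentence is a rough proxy for a data point.
    if not re.search(r"\d", first_sentence):
        problems.append("no number in the first sentence")

    # Point 4: word count of the caption without its label.
    n_words = len(body.split())
    if n_words < MIN_WORDS:
        problems.append(f"too short ({n_words} words)")
    elif n_words > MAX_WORDS:
        problems.append(f"too long ({n_words} words)")
    return problems


generic = "Figure 3: Conversion rate trend"
informative = (
    "Conversion rate before and after profile optimization: from 2.3% to "
    "7.1% in 90 days (source: internal data, sample of 1,200 sessions)."
)
print(check_caption(generic))
print(check_caption(informative))
```

The figure number is stripped before looking for digits, otherwise the "3" in "Figure 3" would count as a data point. A check like this only flags candidates for rewriting; whether the number in the caption is the right one still requires reading the chart.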

The chain with the other multimodal elements

Captions do not work in isolation. They are part of a multimodal content ecosystem that determines how the AI processes the non-textual elements of your pages. Alt text describes the image for accessibility and for crawlers. Transcripts make audio and video content indexable. Optimized infographics combine visual data with structured text. Flowcharts translate processes into formats that retrieval can break down.

Every multimodal element needs its textual anchor to be visible to AI engines. The caption is the image’s anchor, and it is also the easiest anchor to optimize, because the structure is already there. You just have to stop writing “Figure 1” and start writing the data point your chart shows.

A quick check on your pages

Take three pages of your site that contain charts or informative images. For each one, read only the caption, without looking at the chart and without reading the surrounding text. The question is simple: from the caption alone, can you tell what key data point the visual element communicates?

If the answer is no, for example because the caption says “Figure 2: Price comparison” without telling you who wins the comparison and by how much, you have found the problem. Rewrite it to include the data point the chart shows: one sentence, about 30 words, the specific fact with the minimum context.

It is a surface check, of course. To understand how AI crawlers are actually processing your visual elements, you need tools that simulate extraction and verify whether the semantic markup is correct. But this first check on the captions is often where the easiest gains are hidden: the container is already there, and you just have to fill it with the right information.
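As a first step toward that kind of verification, the markup itself can be inspected with a few lines of Python. This is a minimal sketch, not a replacement for a real extraction tool: the class FigureAudit, the function audit_figures and the sample HTML are hypothetical, and the code assumes one image per <figure>.

```python
from html.parser import HTMLParser


class FigureAudit(HTMLParser):
    """Records, for each <figure>, whether it has alt text and a caption."""

    def __init__(self):
        super().__init__()
        self.figures = []         # one dict per <figure>
        self._current = None      # the figure being parsed, if any
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._current = {"has_alt": False, "caption": ""}
        elif self._current is not None and tag == "img":
            alt = dict(attrs).get("alt") or ""
            self._current["has_alt"] = bool(alt.strip())
        elif self._current is not None and tag == "figcaption":
            self._in_caption = True

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            # Normalize whitespace, then store the finished record.
            caption = " ".join(self._current["caption"].split())
            self._current["caption"] = caption
            self.figures.append(self._current)
            self._current = None
            self._in_caption = False


def audit_figures(html):
    parser = FigureAudit()
    parser.feed(html)
    parser.close()
    return parser.figures


# Hypothetical markup: one generic caption, one figure with nothing at all.
SAMPLE_HTML = """
<figure>
  <img src="a.png" alt="Line chart of quarterly revenue">
  <figcaption>Figure 1</figcaption>
</figure>
<figure>
  <img src="b.png">
</figure>
"""

report = audit_figures(SAMPLE_HTML)
for i, fig in enumerate(report, start=1):
    print(i, "alt:", fig["has_alt"], "| caption:", fig["caption"] or "(missing)")
```

A report like this shows where the container is missing altogether (no alt text, no caption) and where it exists but is filled with a label instead of a fact.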

The author
Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
