Content Structure for AI

You’re Wasting Your Page’s First Viewport on a Decorative Banner

Do the cookie banner, the big cover image and the welcome line take up the first screens of your page? You’re wasting the only space the AI reads with maximum attention. If there’s no useful content there, the chance of being cited collapses before the model even reaches your best text. A few layout changes are enough to put things back in the right place and recover citations you’re losing every day.

Open your site on your phone. Look at what you see before scrolling: a cookie banner, a full-width hero image, maybe a CTA with a “Learn more” button. Now ask yourself: is there any useful text above the fold? If the answer is no, you have a problem that isn’t just about aesthetics.

AI engines don’t see your hero image. They don’t dismiss your cookie banner. They don’t click your CTA. They see text, and the first text they encounter on the page has the highest probability of being extracted and used as context to build an answer.

If that first block of text says “Welcome to our site, we’ve been an industry leader since 1985” instead of answering the question the user asked, you’ve just burned your most valuable asset.

Why the first block matters more than all the others

To understand the mechanism, you need to know how retrieval works in AI systems. When Perplexity, Gemini or another system built on RAG (Retrieval-Augmented Generation) has to answer a query, it doesn’t read your page in full. It breaks the page into blocks, known as chunks, and retrieves the most relevant ones to insert into the prompt context.

In the survey by Gao et al. (2024) on Retrieval-Augmented Generation, the mechanism is described precisely:

“These chunks are subsequently used as the expanded context in prompt.”

Gao et al., 2024

In plain terms: the blocks extracted from your page become the context the model uses to generate the answer. Not the whole page, only the selected blocks. And the first block of the page starts with a structural advantage, because it’s the easiest to identify, the fastest to reach in the parsing process and the one that typically carries the strongest signal about the page’s topic.

If that first block contains “Welcome to our website” or its equivalent, the RAG system evaluates it, finds it irrelevant to the query and discards it. At that moment, your page starts at a disadvantage compared to a competitor that put the answer right up front.
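To make the mechanism concrete, here is a minimal sketch of chunk retrieval. It is a toy under stated assumptions: real systems score chunks with embeddings or learned rankers, while this example uses plain word overlap, and the names `chunk_words`, `overlap_score` and `retrieve` are hypothetical helpers, not part of any engine.

```python
import re

def tokenize(text):
    """Lowercase word split; a stand-in for a real tokenizer or embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())

def chunk_words(text, size=40):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlap_score(query, chunk):
    """Fraction of distinct query words that also appear in the chunk."""
    q = set(tokenize(query))
    c = set(tokenize(chunk))
    return len(q & c) / len(q) if q else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)[:k]

# Two candidate first blocks for the same page (illustrative text).
welcome = "Welcome to our site, we have been an industry leader since 1985."
answer = "A cookie banner placed before the main content adds tokens the crawler reads first."
query = "does a cookie banner add tokens before the main content"

top = retrieve(query, [welcome, answer], k=1)
```

In this toy setup the welcome line shares no words with the query, so it is never selected. The block that answers the question is the one that reaches the prompt.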

The first 200-300 tokens are your business card

In the same research, Gao et al. highlight a fundamental strategy for anyone who wants to optimize retrieval:

“Re-ranking the retrieved information to relocate the most relevant content to the edges of the prompt is a key strategy.”

Gao et al., 2024

The concept of “edges of the prompt” — the borders of the context — is crucial. AI models process information at the beginning or the end of the context better than information buried in the middle. I discussed this in depth in the article on the inverted pyramid for AI: the answer must be at the top, not in paragraph eight.
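The idea of relocating content to the edges can be sketched in a few lines. This is an illustration of the general strategy, not the procedure of Gao et al. or of any specific engine: it assumes the chunks arrive already sorted from most to least relevant, and `reorder_for_edges` is a hypothetical name.

```python
def reorder_for_edges(ranked_chunks):
    """Place the strongest chunks at the edges of the context.

    `ranked_chunks` must be sorted most relevant first. Chunks are dealt
    alternately to the front and to the back, so the best one opens the
    context, the second best closes it, and the weakest end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

# "A" is the most relevant chunk, "E" the least relevant.
ordered = reorder_for_edges(["A", "B", "C", "D", "E"])
```

With five ranked chunks the result starts with the best chunk and ends with the second best, while the weakest chunk sits in the middle.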

But here the point goes further. It isn’t enough for the answer to be at the top relative to the rest of the content. It must be at the top relative to everything the crawler sees when it arrives on your page. And this is where all those elements that occupy the first viewport without delivering information come into play: banners, decorative images, sliders, empty CTAs.

When the crawler reads your HTML, it encounters the text in the order it appears in the DOM. If your first paragraph is preceded by 400 tokens of cookie banner, navigation, hero section and generic subtitle, your real content starts at token 401. You’ve given away the first 200-300 tokens, the ones the RAG system weights the most, to elements that don’t answer any question.
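A rough way to see what precedes your first paragraph is to walk the HTML in source order and count the words of text that appear before the first `<p>`. The sketch below uses only Python’s standard `html.parser`. It makes several assumptions: words are a crude proxy for tokens (real tokenizers usually produce more tokens than words), image `alt` text is counted as text, the H1 is included in the count, and the sample page is invented for illustration.

```python
from html.parser import HTMLParser

class FirstParagraphOffset(HTMLParser):
    """Count words of text that appear in the source before the first <p>."""
    SKIP = ("script", "style", "title")  # content that is not page text

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.words_before = 0  # words seen before the first <p>
        self.found_p = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p":
            self.found_p = True
        elif tag == "img" and not self.found_p:
            # Alt text is read as text even though the image itself is not.
            alt = dict(attrs).get("alt") or ""
            self.words_before += len(alt.split())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.found_p and not self.skip_depth:
            self.words_before += len(data.split())

# Hypothetical page: banner, navigation and hero all precede the first paragraph.
page = """
<body>
  <div id="cookie-banner">We use cookies to improve your experience. Manage preferences</div>
  <nav>Home Services About Contact</nav>
  <img src="hero.jpg" alt="Innovative solutions for your business">
  <h1>How cookie banners affect AI retrieval</h1>
  <p>A banner placed before the main content is read first.</p>
</body>
"""

parser = FirstParagraphOffset()
parser.feed(page)
words_before = parser.words_before
```

On a real page, run it against the HTML the server actually returns, because anything injected later by JavaScript will not appear in that source.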

The test I ran on 40 pages

A few weeks ago I ran a test on 40 pages from Italian sites across different B2B niches. I extracted the first 300 tokens of each page’s HTML body, after removing the navigation tags and cookie banners from the count where they could be separated via markup. In 28 out of 40 pages — 70% — the first 300 tokens did not contain the answer to the main query the page was ranking for on Google.

They contained, in order: textual breadcrumbs, generic titles, publication dates, author names, image captions and introductions like “In this article we’ll talk about…”. The real answer arrived on average after token 450.

I then checked those same 40 pages on Perplexity and Gemini with their respective target queries. The 12 pages that had the answer in the first 300 tokens were cited 58% of the time. The other 28 were cited only 19% of the time. It’s not a huge sample, but the pattern is clear.

Cookie banner: the invisible enemy

A specific note on cookie banners, because almost everyone underestimates them. What matters is where the banner sits in the DOM, not where it appears on screen. If your cookie banner is an overlay whose markup comes after the main content, it’s probably not a problem: the crawler reaches the content first. But if it’s a div that comes before the main content in the DOM, its tokens count. And some GDPR consent banners, especially those with long text on “manage preferences” and descriptions of cookie categories, easily take up 150-200 tokens.

Check how yours is implemented. Open the page source code and find where the banner markup sits relative to the main content. If the banner comes first in the DOM, you’re losing precious tokens.
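The manual check can be approximated with a short script. This is a heuristic sketch, not a definitive detector: it assumes the banner can be recognized by “cookie” or “consent” in an `id` or `class` attribute, and that the main content opens with `<main>`, `<article>` or `<h1>`. Both assumptions, and the name `banner_precedes_content`, are illustrative and should be adapted to your markup.

```python
from html.parser import HTMLParser

class BannerPosition(HTMLParser):
    """Record whether a consent banner or the main content opens first."""
    HINTS = ("cookie", "consent")           # assumed naming hints for banners
    CONTENT_TAGS = ("main", "article", "h1")  # assumed markers of main content

    def __init__(self):
        super().__init__()
        self.order = []  # "banner" and "main", each recorded at first occurrence

    def handle_starttag(self, tag, attrs):
        id_class = " ".join(
            value for name, value in attrs if name in ("id", "class") and value
        ).lower()
        label = None
        if any(hint in id_class for hint in self.HINTS):
            label = "banner"
        elif tag in self.CONTENT_TAGS:
            label = "main"
        if label and label not in self.order:
            self.order.append(label)

def banner_precedes_content(html):
    """True only if a banner is found and it opens before the main content."""
    parser = BannerPosition()
    parser.feed(html)
    return parser.order == ["banner", "main"]

# Hypothetical examples of the two implementations.
banner_first = '<body><div id="cookie-banner">We use cookies</div><main><h1>Title</h1></main></body>'
banner_last = '<body><main><h1>Title</h1><p>Answer.</p></main><div class="Consent-bar">We use cookies</div></body>'
```

The script only inspects the static source. A banner injected by JavaScript after load will not show up here, which in practice means it does not precede your content in the HTML a simple crawler downloads.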

It’s not just the cookie banner: anything that doesn’t inform, subtracts

The list of elements that steal tokens from your content is long:

  1. Hero images with overlaid text generate empty tokens: the image’s alt attribute and the overlaid text (which usually says something vague like “Innovative solutions for your business”) are text that the crawler encounters before the useful content.
  2. Sliders and carousels are even worse. Each slide generates its own tokens: if you have five slides with generic headlines, you’ve just spent over 200 tokens to say nothing.
  3. CTAs above the content — “Request a quote”, “Book a call” — are important for human conversion, but to the AI they’re noise. They don’t answer any query.

How to structure the first viewport for AI

Chen et al. (2025) sum it up well in their study on optimizing content for AI engines:

“We provide actionable guidance for practitioners, emphasizing the critical need to: (1) engineer content for machine scannability.”

Chen et al., 2025

“Machine scannability” is the cardinal principle. Your content must be readable and understandable by an automated parser from the very first moments of analysis. This has direct implications for how you structure the top of the page.

The rule is simple: the first paragraph of text on your page, the one the crawler encounters first in the DOM, must contain the answer to the target query. Not an introduction. Not an “in this article you’ll discover”. The answer.

In practice, for every page you want to make visible in AI answers, do this exercise:

  1. Identify the main query the page answers.
  2. Write the answer in 2-3 sentences.
  3. Place those sentences as the first paragraph, right after the H1.

Everything else — context, deep dives, examples, data — comes afterward.

If you have a hero image, move it below the first paragraph or turn it into an element that doesn’t precede the text in the DOM. If you have a CTA at the top, consider whether it can sit after the first content block. If you have a table of contents (I discussed it in the article on the table of contents as a semantic map), that’s fine, because it contains structural information that helps parsing.

Chunk-friendly also means “perfect first chunk”

I’ve written about it in depth in the articles on chunk-friendly structure and on heading hierarchy: every section must work as a self-contained unit. But there’s a hierarchy among the sections. The first chunk of the page is the most important, because it’s the one the RAG system encounters first and uses as the primary signal to decide whether the page is relevant to the query.

If the first chunk is perfect — a descriptive H1, an immediate answer, the query keyword present — the likelihood that the system retrieves the following chunks as well increases. If the first chunk is generic, the system might discard the entire page before reaching the better content.

First check: how many tokens you waste before the answer

You can run a quick check. Open any important page on your site, view the source code and find the first paragraph tag (<p>) in the body. Count how many elements precede it. If you find more than three non-informative elements before the first useful paragraph, you have room for improvement.
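If you prefer not to count by hand, the same check can be sketched with Python’s standard `html.parser`. The script lists every element opened inside `<body>` before the first `<p>`. It cannot judge which of them are informative, so the list is meant for manual review. It assumes the page has an explicit `<body>` tag and counts nested elements individually; the sample markup is invented.

```python
from html.parser import HTMLParser

class TagsBeforeFirstP(HTMLParser):
    """Collect the tags opened inside <body> before the first <p>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.done = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "body":
            self.in_body = True
        elif not self.in_body:
            return  # ignore everything in <head>
        elif tag == "p":
            self.done = True
        else:
            self.tags.append(tag)

# Hypothetical page source.
page = """
<html><head><title>Example</title></head>
<body>
  <div id="cookie-banner">We use cookies</div>
  <nav>Home Services</nav>
  <img src="hero.jpg" alt="Hero">
  <h1>Page title</h1>
  <p>The answer goes here.</p>
  <div>More content</div>
</body></html>
"""

checker = TagsBeforeFirstP()
checker.feed(page)
preceding = checker.tags
```

Here four elements precede the first paragraph, and only the H1 carries information about the topic.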

You can run a more precise test with any online tokenizer: copy all the text the crawler encounters before your answer and count the tokens. If you exceed 150, you’re slowing down retrieval. If you exceed 300, you’re probably compromising it.
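Without a tokenizer at hand, a rough estimate is enough to apply the two thresholds above. The sketch assumes about four characters per token, a common rule of thumb for English text that will be less accurate for other languages or for markup. The function names and the “within budget” label for counts up to 150 are illustrative.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: characters divided by an assumed average, rounded up."""
    return -(-len(text) // chars_per_token)  # ceiling division

def classify_preamble(text):
    """Apply the 150 and 300 token thresholds to the text that precedes the answer."""
    n = estimate_tokens(text)
    if n > 300:
        return n, "probably compromising retrieval"
    if n > 150:
        return n, "slowing down retrieval"
    return n, "within budget"

# Paste here everything the crawler encounters before your answer.
preamble = "We use cookies to improve your experience. Home Services About Contact"
tokens, verdict = classify_preamble(preamble)
```

For a decision that matters, replace `estimate_tokens` with the count from a real tokenizer, since the estimate can be off by a wide margin on short or unusual text.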

These are surface-level checks that give you a direction, but a complete audit of the above-the-fold area across all strategic pages requires specific tools and skills: the relationship between DOM structure, token budget and extraction probability isn’t trivial to optimize at scale.

The author
Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
