Content Structure for AI

Your sidebar is polluting the content the AI extracts

When the AI reads your site, it doesn't tell the text that matters apart from the side column with its links, widgets and CTAs: it blends them all together. If almost half of what it reads is noise, the answer it builds will be poor even if what you wrote is excellent. It's not a content problem: it's a structure problem that you fix with a quick technical change — and it can make the difference between being cited and not being cited.

Open your site on desktop. Look at the right-hand column: a “recent posts” widget, a newsletter banner, an “about us” box, maybe a contact form. All of this ends up in the text that AI crawlers extract from your page. It’s not a hypothesis — it’s mechanics.

RAG systems don’t “see” your page the way you see it in the browser. Unless the markup tells them otherwise, they don’t tell the main column apart from the sidebar, or the content from the furniture. They extract text. And when that text includes 200 tokens of widgets, 150 of navigation and 100 of footer repeated on every page, the content you wrote so carefully gets diluted into a block where the useful signal is only a fraction of the total.
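
To make the mechanics concrete, here is a minimal sketch of a layout-blind extractor, written with Python's standard html.parser module. It is not how any specific crawler works: the class name, the function name and the sample page are all assumptions made up for illustration. It only shows what happens when every visible text node is collected with no notion of columns.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects every text node outside <script>/<style>, ignoring page layout."""

    SKIPPED = {"script", "style"}  # code and CSS are not readable text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_all_text(html):
    """Return all text of the page as one flat string."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page: one article sentence plus typical navigation,
# sidebar and footer, all wrapped in generic <div> elements.
PAGE = """
<html><body>
  <nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
  <div class="content"><p>Semantic tags isolate the main content.</p></div>
  <div class="sidebar"><h3>Recent posts</h3><p>Subscribe to the newsletter</p></div>
  <footer>Legal disclaimer</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_all_text(PAGE))
```

In the flat string this produces, the single article sentence sits between the menu labels and the widget text, with nothing left to mark where one ends and the other begins.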

If you’ve read my articles on how to structure content in the first 150 tokens or on how to build self-contained sections for retrieval, you already know that every token counts. Here I’ll explain why you’re wasting many of those tokens on noise you shouldn’t even be showing to crawlers.

The noise you don’t know you have

Think about what a typical page on your site contains, beyond the actual article:

  • Navigation menu with every section of the site
  • Sidebar with widgets: recent posts, categories, tag cloud, banners
  • Footer with company details, social links, legal disclaimers, secondary menu
  • Cookie bar, pop-ups, repeated inline CTAs

On an 800-word article, these elements can add 400-600 extra tokens. That means the real content — the part that should answer the user’s question — accounts for 55-65% of the total extracted text. The rest is noise.

The research literature documents the concept clearly. In their survey of large language models, Minaee et al. (2025) describe data filtering as follows:

“Common data filtering techniques include: Removing Noise: refers to eliminating irrelevant or noisy data that might impact the model’s ability to generalize well.”
(Large Language Models: A Survey)

The passage is about training data, but the same logic carries over to retrieval: noise makes it harder for the model to extract the relevant meaning. If a third of the chunk the AI retrieves from your page is sidebar widgets, the answer it generates from that chunk is likely to be worse.

Why noise weighs more than you think

There’s an aspect that makes the problem even more concrete. In the survey by Gao et al. (2024) on RAG systems, there’s a passage that frames it perfectly:

“However, excessive context can introduce more noise, diminishing the LLM’s perception of key information.”
(Retrieval-Augmented Generation for Large Language Models: A Survey)

“Diminishing the LLM’s perception of key information” — that’s the phrase that counts. It’s not just that noise takes up space. Noise actively lowers the model’s ability to perceive the key information in your content. Every widget token is a token competing for the model’s attention against your main message.

And there’s a second level. In the same survey, the authors define a concept that applies directly to your situation:

“Noise Robustness appraises the model’s capability to manage noise documents that are question-related but lack substantive information.”
(Retrieval-Augmented Generation for Large Language Models: A Survey)

Here’s the point: your site’s sidebar is “question-related but lacks substantive information.” A “recent posts” widget on your industry’s topic looks relevant — but it contains no answer. It’s a noise document by definition. When your article’s chunk gets extracted with that widget attached, you’re forcing the model to separate the signal from the noise. Some models do it well. Others don’t. And you don’t control which model is processing your page.

The problem multiplies with repetition

Sidebar noise isn’t a single-page problem. It’s a structural problem. The exact same block of widgets, the same navigation, the same footer get extracted from every page of your site.

I analyzed 25 Italian company sites — SMEs with an active blog and a classic sidebar — crawling the pages the way a RAG system would. On average, 35% of the extracted tokens were identical across all pages: navigation, sidebar, footer. The same block of text, repeated a hundred times.

To the AI processing these pages, your site looks like it contains a huge volume of repetitive content with small variations in the middle. The unique signal — the actual article — is drowned in a sea of constant noise.
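
If you want to measure this kind of repetition on your own site, the idea can be sketched in a few lines of Python. This is not the tool I used for the analysis above, just an illustration under simple assumptions: each page has already been reduced to a list of text blocks, a block counts as boilerplate only if it appears identically on every page, and words stand in for tokens. The function name and the sample data are hypothetical.

```python
def boilerplate_share(pages):
    """Fraction of extracted words that sit in blocks shared by every page.

    `pages` maps a URL to the list of text blocks extracted from that page.
    Words are used as a rough stand-in for tokens.
    """
    block_lists = list(pages.values())
    if not block_lists:
        return 0.0

    # Blocks that appear, character for character, on every single page.
    shared = set(block_lists[0]).intersection(*block_lists[1:])

    total_words = 0
    shared_words = 0
    for blocks in block_lists:
        for block in blocks:
            words = len(block.split())
            total_words += words
            if block in shared:
                shared_words += words
    return shared_words / total_words if total_words else 0.0


# Hypothetical two-page site: same menu and sidebar, different articles.
PAGES = {
    "/post-a": [
        "Home Blog Contact",
        "Article A explains semantic tags in detail",
        "Recent posts Subscribe now",
    ],
    "/post-b": [
        "Home Blog Contact",
        "Article B covers footer cleanup in detail",
        "Recent posts Subscribe now",
    ],
}

if __name__ == "__main__":
    print(f"{boilerplate_share(PAGES):.0%} of the extracted words are repeated")
```

Exact matching is deliberately strict. A widget that changes slightly from page to page, such as a "recent posts" list, would escape it, so a real measurement would need fuzzier comparison.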

How to isolate the main content

The technical solution exists and is simple to implement. It relies on the HTML5 semantic tags that modern crawlers recognize.

The <main> or <article> tag signals to crawlers where the main content begins and ends. If your page uses these tags correctly, a smart crawler can decide to extract only the content inside them, ignoring the sidebar and footer.
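
Here is a minimal sketch of what "extracting only the content inside them" could look like, again with Python's standard html.parser. Whether a given crawler actually behaves this way is an assumption. The class, the function and the sample page are hypothetical, and the sketch does not handle <script> or <style> blocks nested inside the content.

```python
from html.parser import HTMLParser


class MainContentExtractor(HTMLParser):
    """Keeps only the text nested inside <main> or <article>."""

    CONTENT_TAGS = {"main", "article"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._depth = 0  # number of content tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._depth > 0:
            self.parts.append(text)


def extract_main_text(html):
    """Return the text inside <main>/<article>, or "" if neither tag exists."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page that uses semantic tags correctly.
SEMANTIC_PAGE = """
<body>
  <nav>Home Blog</nav>
  <main><article><h1>Clean structure</h1><p>Only this text is the article.</p></article></main>
  <aside>Recent posts</aside>
  <footer>Legal disclaimer</footer>
</body>
"""

if __name__ == "__main__":
    print(extract_main_text(SEMANTIC_PAGE))
```

On a page built only from generic <div> elements the same function returns an empty string, and an extractor in that position has to fall back to taking everything.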

In practice:

  1. Check the semantic tags: Make sure your theme uses <main> or <article> to wrap the primary content. Most modern WordPress themes do, but not all. Open the source code of one of your pages and look for these tags — if your article’s content is inside a generic <div>, the crawler has no way to tell it apart from the sidebar with any certainty.
  2. Minimize widgets: Every widget you remove is tokens that won’t pollute the extracted chunks. Ask yourself: does this widget help the reader on this specific page, or is it only there to fill space? If it’s the latter, get rid of it.
  3. Avoid repetitive inline CTAs: A “subscribe to the newsletter” box after every paragraph adds noisy tokens inside the <article> tag itself, and in that case not even the semantic tag can save you.
  4. Clean up the footer: A footer with 200 words of disclaimers, links to every page of the site and social widgets is a block of pure noise. Cut it down to the bare minimum.
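
The first step of the list, checking the semantic tags, can also be automated instead of reading the source by eye. The sketch below is one possible way to do it with Python's standard html.parser. The names are hypothetical, and it only reports whether the tags are present, not whether the theme uses them around the right content.

```python
from html.parser import HTMLParser


class SemanticTagFinder(HTMLParser):
    """Records which layout-related HTML5 tags appear in a page."""

    TRACKED = {"main", "article", "aside", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names already lowercased.
        if tag in self.TRACKED:
            self.found.add(tag)


def audit_semantic_tags(html):
    """Report the tracked tags found and whether a content wrapper exists."""
    finder = SemanticTagFinder()
    finder.feed(html)
    finder.close()
    return {
        "has_content_wrapper": bool(finder.found & {"main", "article"}),
        "found": sorted(finder.found),
    }


if __name__ == "__main__":
    # Hypothetical theme output that relies on generic <div> elements only.
    div_only = '<div class="post"><p>Text</p></div><div class="sidebar">Widgets</div>'
    print(audit_semantic_tags(div_only))
```

If has_content_wrapper comes back False for your article pages, that is the first thing to fix in the theme.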

A quick check to get started

Copy the text of one of your pages, not from the CMS but from the browser, selecting everything from top to bottom the way a crawler would. Paste it into an editor and count the words. Then select only the article’s real content and count the words again. The ratio of the second figure to the first tells you how much of what you serve to AI crawlers is content and how much is noise.

If the real content is below 60%, you have a concrete problem. It’s a first surface-level check — for a precise picture you need tools that analyze the actual rendering the way crawlers see it — but it already gives you a clear direction on where to act.
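
If you prefer not to count by hand, the same check fits in a few lines of Python. It is a sketch under the same assumptions as the manual version: you paste in the two texts yourself, words stand in for tokens, and 60% is the threshold given above. The function names and the sample strings are hypothetical.

```python
def content_ratio(full_page_text, article_text):
    """Share of the page's words that belong to the article itself."""
    total = len(full_page_text.split())
    content = len(article_text.split())
    return content / total if total else 0.0


def has_noise_problem(full_page_text, article_text, threshold=0.60):
    """True when the article is less than `threshold` of the copied text."""
    return content_ratio(full_page_text, article_text) < threshold


if __name__ == "__main__":
    # Hypothetical texts: 6 words of article inside 12 words of page.
    article = "Semantic tags isolate the main content"
    full_page = "Home Blog Contact " + article + " Recent posts Newsletter"
    print(f"content share: {content_ratio(full_page, article):.0%}")
    print("noise problem:", has_noise_problem(full_page, article))
```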

I’ve written an entire block of deep dives on page architecture to help you understand how each element influences what the AI extracts. If you want the model to cite your content and not your widget, the starting point is the TL;DR section as a structural element — a clean block, free of noise, designed to be extracted as-is.

The sidebar isn’t an enemy. But every token it steals from your main content is a token the AI doesn’t use to cite you.

The author
Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
