Content Structure for AI

Your sidebar is polluting the content the AI extracts

Roberto Serra · 25 June 2026 · ~6 min read

When the AI reads your site, it doesn't tell the text that matters apart from the side column with its links, widgets and CTAs: it blends them all together. If almost half of what it reads is noise, the answer it builds will be poor even if what you wrote is excellent. It's not a content problem: it's a structure problem that you fix with a quick technical change — and it can make the difference between being cited and not being cited.

Open your site on desktop. Look at the right-hand column: a “recent posts” widget, a newsletter banner, an “about us” box, maybe a contact form. All of this ends up in the text that AI crawlers extract from your page. It’s not a hypothesis — it’s mechanics.

RAG systems don’t “see” your page the way you see it in the browser. They don’t tell the main column apart from the sidebar, or the content from the page chrome around it. They extract text. And when that text includes 200 tokens of widgets, 150 of navigation and 100 of footer repeated on every page, the content you wrote so carefully gets diluted into a block where the useful signal is only a fraction of the total.

If you’ve read my articles on how to structure content in the first 150 tokens or on how to build self-contained sections for retrieval, you already know that every token counts. Here I’ll explain why you’re wasting many of those tokens on noise you shouldn’t even be showing to crawlers.

The noise you don’t know you have

Think about what a typical page on your site contains, beyond the actual article:

Navigation menu with every section of the site
Sidebar with widgets: recent posts, categories, tag cloud, banners
Footer with company details, social links, legal disclaimers, secondary menu
Cookie bar, pop-ups, repeated inline CTAs

On an 800-word article, these elements can add 400-600 extra tokens. Counting roughly one token per word, that leaves the real content (the part that should answer the user’s question) at about 57-67% of the total extracted text. The rest is noise.
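If you want to check the arithmetic yourself, here is a minimal sketch. It assumes roughly one token per word, which is a simplification (real tokenizers usually produce more tokens than words), and the function name content_share is mine, made up for illustration. Under that assumption the result lands in the same ballpark as the range quoted above.

```python
def content_share(article_tokens: int, noise_tokens: int) -> float:
    """Return the fraction of extracted tokens that belong to the article."""
    return article_tokens / (article_tokens + noise_tokens)


# Assumption: the 800-word article is counted as 800 tokens (one token per word).
article_tokens = 800

for noise_tokens in (400, 600):
    share = content_share(article_tokens, noise_tokens)
    print(f"{noise_tokens} noise tokens -> {share:.0%} real content")
# 400 noise tokens -> 67% real content
# 600 noise tokens -> 57% real content
```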

The research literature documents the concept clearly. The survey on large language models by Minaee et al. (2025) describes data filtering like this:

“Common data filtering techniques include: Removing Noise: refers to eliminating irrelevant or noisy data that might impact the model’s ability to generalize well.”
(Large Language Models: A Survey)

In plain terms: noise compromises the model’s ability to extract the relevant meaning. The survey is talking about training data, but the same logic applies at retrieval time: if a third of the chunk the AI retrieves from your page is sidebar widgets, the answer it generates from that chunk is likely to be worse.

Why noise weighs more than you think

There’s an aspect that makes the problem even more concrete. In the survey by Gao et al. (2024) on RAG systems, there’s a passage that frames it perfectly:

“However, excessive context can introduce more noise, diminishing the LLM’s perception of key information.”
(Retrieval-Augmented Generation for Large Language Models: A Survey)

“Diminishing the LLM’s perception of key information” — that’s the phrase that counts. It’s not just that noise takes up space. Noise actively lowers the model’s ability to perceive the key information in your content. Every widget token is a token competing for the model’s attention against your main message.

And there’s a second level. In the same survey, the authors define a concept that applies directly to your situation:

“Noise Robustness appraises the model’s capability to manage noise documents that are question-related but lack substantive information.”
(Retrieval-Augmented Generation for Large Language Models: A Survey)

Here’s the point: your site’s sidebar is “question-related but lacks substantive information.” A “recent posts” widget on your industry’s topic looks relevant — but it contains no answer. It’s a noise document by definition. When your article’s chunk gets extracted with that widget attached, you’re forcing the model to separate the signal from the noise. Some models do it well. Others don’t. And you don’t control which model is processing your page.

The problem multiplies with repetition

Sidebar noise isn’t a single-page problem. It’s a structural problem. The exact same block of widgets, the same navigation, the same footer get extracted from every page of your site.

I analyzed 25 Italian company sites — SMEs with an active blog and a classic sidebar — crawling the pages the way a RAG system would. On average, 35% of the extracted tokens were identical across all pages: navigation, sidebar, footer. The same block of text, repeated a hundred times.

To the AI processing these pages, your site looks like it contains a huge volume of repetitive content with small variations in the middle. The unique signal — the actual article — is drowned in a sea of constant noise.
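To get a feel for how a figure like this can be measured, here is a rough sketch. It is not the method I used for the analysis above, just one simple way to approximate it: treat every line of extracted text that appears on all pages as boilerplate, and count whitespace-separated words as a stand-in for tokens. The function name shared_token_share and the sample pages are hypothetical.

```python
def shared_token_share(pages: list[str]) -> float:
    """Average share of each page's words that sit on lines present on every page.

    Expects at least one page of already-extracted plain text.
    """
    # Split every page into stripped, non-empty lines.
    page_lines = [
        [line.strip() for line in page.splitlines() if line.strip()]
        for page in pages
    ]
    # Lines found on all pages are treated as boilerplate (nav, sidebar, footer).
    common = set(page_lines[0]).intersection(*map(set, page_lines[1:]))

    shares = []
    for lines in page_lines:
        total = sum(len(line.split()) for line in lines)
        repeated = sum(len(line.split()) for line in lines if line in common)
        shares.append(repeated / total if total else 0.0)
    return sum(shares) / len(shares)


# Hypothetical extracted text from two pages of the same site.
pages = [
    "Home Blog Contact\nArticle one talks about chunking in detail\nSubscribe to the newsletter",
    "Home Blog Contact\nArticle two covers semantic tags\nSubscribe to the newsletter",
]
print(f"Repeated across pages: {shared_token_share(pages):.0%}")
```

Exact line matching is deliberately crude: a widget whose text changes slightly from page to page will not be counted, so the result is a lower bound rather than a precise measurement.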

How to isolate the main content

The technical solution exists and is simple to implement. It relies on the HTML5 semantic tags that modern crawlers recognize.

The <main> or <article> tag signals to crawlers where the main content begins and ends. If your page uses these tags correctly, a smart crawler can decide to extract only the content inside them, ignoring the sidebar and footer.
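To make this concrete, here is a minimal sketch of what such a crawler might do, using only html.parser from Python’s standard library. The class name MainContentExtractor, the function extract_main_text and the sample markup are invented for illustration; real crawlers run their own, usually more sophisticated, extraction logic.

```python
from html.parser import HTMLParser


class MainContentExtractor(HTMLParser):
    """Collect only the text found inside <main> or <article> elements."""

    CONTENT_TAGS = {"main", "article"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0       # nesting depth inside <main>/<article>
        self.skip_depth = 0  # nesting depth inside <script>/<style>
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.depth += 1
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.depth:
            self.depth -= 1
        elif tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are inside the main content and outside scripts/styles.
        if self.depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: navigation, sidebar and footer sit outside <main>.
sample_html = """
<body>
<nav>Home Blog Contact</nav>
<main><article><h1>Sidebar noise</h1><p>Real content here.</p></article></main>
<aside>Recent posts Newsletter</aside>
<footer>Legal disclaimers</footer>
</body>
"""
print(extract_main_text(sample_html))  # Sidebar noise Real content here.
```

If the same article sat inside a generic <div>, this extractor would return an empty string: without the semantic tags there is nothing for it to anchor on.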

In practice:

Check the semantic tags: Make sure your theme uses <main> or <article> to wrap the primary content. Most modern WordPress themes do, but not all. Open the source code of one of your pages and look for these tags — if your article’s content is inside a generic <div>, the crawler has no way to tell it apart from the sidebar with any certainty.
Minimize widgets: Every widget you remove is tokens that won’t pollute the extracted chunks. Ask yourself: does this widget help the reader on this specific page, or is it only there to fill space? If it’s the latter, get rid of it.
Avoid repetitive inline CTAs: A “subscribe to the newsletter” box after every paragraph adds noisy tokens inside the <article> tag itself, where not even the semantic tag can filter them out.
Clean up the footer: A footer with 200 words of disclaimers, links to every page of the site and social widgets is a block of pure noise. Cut it down to the bare minimum.

A quick check to get started

Copy the text of one of your pages — not from the CMS, but from the browser, selecting everything from top to bottom the way a crawler would. Paste it into an editor and count the words. Then highlight only the article’s real content. The ratio between the two figures tells you how much noise you’re serving to AI crawlers.

If the real content is below 60%, you have a concrete problem. It’s a first surface-level check — for a precise picture you need tools that analyze the actual rendering the way crawlers see it — but it already gives you a clear direction on where to act.
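If you prefer a script to counting by hand, the same check can be sketched in a few lines. This assumes you have already pasted the full page text and the article text into two strings; the names real_content_ratio and NOISE_THRESHOLD and the sample strings are hypothetical.

```python
NOISE_THRESHOLD = 0.60  # the 60% rule of thumb from the check above


def real_content_ratio(full_page_text: str, article_text: str) -> float:
    """Share of the page's words that belong to the article itself."""
    full_words = len(full_page_text.split())
    article_words = len(article_text.split())
    return article_words / full_words if full_words else 0.0


# Hypothetical pasted text: 3 menu words + 9 article words + 6 sidebar/footer words.
article = "Your sidebar is polluting the content the AI extracts"
full_page = "Home Blog Contact " + article + " Recent posts Newsletter Subscribe Legal notice"

ratio = real_content_ratio(full_page, article)
verdict = "too noisy" if ratio < NOISE_THRESHOLD else "acceptable"
print(f"Real content: {ratio:.0%} ({verdict})")  # Real content: 50% (too noisy)
```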

I’ve written an entire block of deep dives on page architecture to help you understand how each element influences what the AI extracts. If you want the model to cite your content and not your widget, the starting point is the TL;DR section as a structural element — a clean block, free of noise, designed to be extracted as-is.

The sidebar isn’t an enemy. But every token it steals from your main content is a token the AI doesn’t use to cite you.

The author

Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
