Content Structure for AI

The Same Content Lives on Three Different URLs? The AI Doesn’t Know Which to Choose

Roberto Serra · 25 June 2026 · ~7 min read

Does the same content on your site exist at three different web addresses because you use tracking links or filters? The AI doesn't know which of the three versions to pick — and often it picks none of them. This isn't a rare case: it's the situation of nearly every site that uses standard marketing tools. Each duplicate page is a citation opportunity that gets lost. There's a quick way to tell the AI which version is the authoritative one — and reclaim those citations.

It happens more often than you think. The same page on your site is reachable from two different addresses: with and without www, with and without a trailing slash, over HTTP and over HTTPS, or as a product page that also appears under its category. Or two articles say essentially the same thing in slightly different words, written at different times by someone who had forgotten that the first one already existed.

To you these are technical details. To the AI engine they’re a concrete problem: when it needs to cite a source, which of the two versions does it choose? And above all: how does it assess the reliability of a site that has duplicated or near-identical content scattered across multiple URLs?

The short answer is that duplicate content fragments the signal. Instead of one strong page on a topic, you have two weak ones. The canonical tag, a line of code that tells the crawler “this is the official version,” is the tool that solves the technical side of the problem. But it is rarely implemented correctly.

How AI models handle duplicates during training

To understand the weight of the problem, let’s start with how duplicates are handled as early as the training phase of language models. This isn’t just a crawler issue: it’s a problem research has documented thoroughly because it directly impacts model quality.

Cheng et al. (2024), in the paper on tracing knowledge cutoffs, describe what happens when training datasets contain duplicate documents:

“This retrieves old versions of the documents, near duplicates, and copied fragments — all of which may impact information in the model and our perplexity measurements.”
(Dated Data: Tracing Knowledge Cutoffs in Large Language Models)

You don’t need identical duplicates to create confusion. Similar versions of the same content are enough — what we call “thinly duplicated content” in the SEO world. Two pages saying the same thing with different words, two URLs serving the same page.

And the survey by Minaee et al. (2025) confirms that deduplication is a critical stage of data preparation:

“Duplicate data points can introduce biases in the model training process and reduce the diversity, as the model may learn from the same examples multiple times, potentially leading to overfitting on those particular instances.”
(Large Language Models: A Survey)

Translated into the context of your visibility: duplicates introduce bias and reduce perceived diversity. If your site has three nearly identical pages on a topic, the system doesn’t read them as “three confirmations of the same information.” It reads them as noise — redundant signal that adds no value.
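To make the idea of deduplication concrete, here is a minimal sketch of the simplest form it can take: exact-match removal after light normalization. This is an illustration, not the pipeline used by any specific model; the function names and the sample pages are my own assumptions.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a document after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(documents: list[str]) -> list[str]:
    """Keep the first copy of each document, comparing by fingerprint."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical sample: two formatting variants and one reworded version.
pages = [
    "Canonical tags tell crawlers which URL is official.",
    "canonical tags tell   crawlers which URL is official.",
    "A canonical tag tells the crawler which address is the official one.",
]
print(len(drop_exact_duplicates(pages)))
```

The first two pages collapse into one because they differ only in case and spacing. The reworded third page survives, which is exactly the kind of near duplicate a hash-based filter cannot see.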

The canonical tag: a declaration of authorship

The canonical tag is an HTML element you insert in the head of every page on your site. It tells the crawler: “if you find multiple versions of this content, the official version is at the URL I point you to here.” It’s a simple instruction, one line of code, that solves the problem at its root.

But it only works if you implement it consistently. Every page on your site should have a canonical tag that points to itself — yes, even the pages that have no duplicates. It’s a basic hygiene practice that prevents future problems: if tomorrow someone adds a parameter to the URL or creates a printable version of the page, the canonical is already in place.
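As a rough illustration of what that line looks like, the sketch below builds a self-referencing canonical element for a page. The helper name and the example.com URL are placeholders I made up; in practice your CMS or template usually emits this tag for you.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Return the <link rel="canonical"> element to place inside <head>."""
    # escape() with quote=True keeps characters such as & and " safe
    # inside the href attribute.
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


# Self-referencing canonical: the page declares its own URL as official.
print(canonical_link_tag("https://www.example.com/services/seo-audit/"))
```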

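For pages that actually have duplicates (the www version and the non-www one, tracking parameters, filters that generate different URLs for the same content), the canonical points to the main version. Paginated series are the exception: page 2 of a list is not a copy of page 1, so each page in the series keeps its own self-referencing canonical. The crawler knows which one to index, the signal concentrates on a single URL, and the AI engine has only one version to evaluate and potentially cite.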
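One way to decide which URL is “the main version” is to write the rule down as code. The sketch below assumes a site that prefers HTTPS, the www host and a trailing slash, and that treats utm_*, gclid and fbclid as tracking parameters. All of these are assumptions for the example, not universal rules, and the host names are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed conventions for a hypothetical site.
PREFERRED_HOST = "www.example.com"
KNOWN_HOSTS = {"example.com", "www.example.com"}
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def canonical_url(url: str) -> str:
    """Map the technical variants of a URL onto one preferred form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in KNOWN_HOSTS:
        host = PREFERRED_HOST  # fold www and non-www together
    # Drop tracking parameters, keep everything else untouched.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path or "/"
    if not path.endswith("/"):
        path += "/"  # assumed: every page URL on this site ends with a slash
    return urlunsplit(("https", host, path, urlencode(query), ""))


for variant in (
    "http://example.com/pricing?utm_source=newsletter",
    "https://www.example.com/pricing/",
    "https://example.com/pricing/?gclid=abc123",
):
    print(canonical_url(variant))
```

All three variants resolve to the same address. Whether filter parameters such as a color or sort option should also be dropped depends on whether they change the content, so the sketch leaves them in place.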


When the problem isn’t technical but editorial

The most insidious duplicates aren’t the technical ones. Those are solved with a canonical tag and proper server configuration. The dangerous duplicates are the editorial ones: two different articles covering the same topic with substantial overlaps.

Cheng et al. (2024) document an emblematic case at the dataset level:

“This mismatch is due to two main factors: (1) deduplication pipelines that ignore semantically equivalent but lexically near duplicates and (2) temporal biases of CommonCrawl dumps.”
(Dated Data: Tracing Knowledge Cutoffs in Large Language Models)

“Semantically equivalent but lexically near duplicates” — pages that say the same thing with different words. If your blog has a 2022 article on “how to improve your online presence” and a 2024 one on “strategies to boost your digital visibility,” and the content overlaps by 70%, you have a semantic duplicate. The canonical tag doesn’t help you here, because they’re two different pages with two different URLs. The solution is to consolidate: choose the better version, update it, and set up a 301 redirect from the other.
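If you want a first-pass number before deciding what to consolidate, a rough lexical measure can help. The sketch below compares two texts by the three-word sequences they share (Jaccard similarity over word shingles). It is my own illustrative choice, not the method used in the paper, and it only catches rewrites that reuse phrasing. Two articles that make the same points in entirely different words will score low and still need a human read.

```python
import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of consecutive word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(text_a: str, text_b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0  # texts too short to compare
    return len(a & b) / len(a | b)


# Hypothetical usage: pass the body text of the two articles.
older = "Update your profiles, publish useful content, and ask clients for reviews."
newer = "Update your profiles, publish useful content, and request client reviews."
print(round(overlap(older, newer), 2))
```

Pairs that score high are candidates for the consolidation described above: keep the stronger article, fold in anything unique from the other, and redirect the retired URL.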

I analyzed this pattern across 30 professional services sites, testing with topical queries on three AI engines. Sites with unconsolidated editorial duplicates were cited in 15% of responses. After consolidation — one strong page instead of two weak ones — the percentage rose to 38%. Same total content, different signal distribution.


How to check for duplicates on your site

You can do a first surface-level check right away. Take a key paragraph from one of your main pages and search for it in quotes on Google. If multiple URLs from your own site show up, you have a duplication problem. Do the same with your page titles: if two pages have nearly identical titles, they probably also have overlapping content.
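The title check can be scripted once you have a list of URLs and titles, for example from a crawl export. The sketch below uses Python’s standard difflib to flag pairs of nearly identical titles. The 0.8 threshold and the sample URLs are assumptions to tune on your own site.

```python
from collections.abc import Iterator
from difflib import SequenceMatcher
from itertools import combinations


def similar_titles(
    titles: dict[str, str], threshold: float = 0.8
) -> Iterator[tuple[str, str, float]]:
    """Yield (url_a, url_b, ratio) for pairs with nearly identical titles."""
    for (url_a, title_a), (url_b, title_b) in combinations(titles.items(), 2):
        ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
        if ratio >= threshold:
            yield url_a, url_b, ratio


# Hypothetical crawl export: URL path mapped to page title.
site_titles = {
    "/blog/online-presence": "How to improve your online presence",
    "/blog/online-presence-2024": "How to improve your online presence in 2024",
    "/blog/canonical-tags": "Canonical tags explained",
}
for url_a, url_b, ratio in similar_titles(site_titles):
    print(url_a, url_b, round(ratio, 2))
```

A flagged pair is only a prompt to open both pages and compare them; similar titles do not prove the content overlaps.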

A second level of verification is checking the existing canonical tags. Open the source code of five random pages on your site and look for <link rel="canonical">. If it’s not there, basic hygiene is missing. If it’s there but points to a URL other than the page you’re looking at, verify that this is intentional and correct.
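The same check can be sketched in a few lines using only Python’s standard library. The class and function names below are hypothetical, and the sketch expects you to supply the page’s HTML yourself (for example, by saving the source from your browser). It compares URLs as exact strings, so a relative href or a missing trailing slash counts as “points elsewhere.”

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.canonicals: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def check_canonical(page_url: str, html: str) -> str:
    """Classify a page as 'missing', 'multiple', 'self' or 'points elsewhere'."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    if len(finder.canonicals) > 1:
        return "multiple"
    return "self" if finder.canonicals[0] == page_url else "points elsewhere"


sample = '<html><head><link rel="canonical" href="https://www.example.com/a/"></head></html>'
print(check_canonical("https://www.example.com/a/", sample))
print(check_canonical("https://www.example.com/a/?utm_source=x", sample))
```

A result of “points elsewhere” is not automatically a problem: it is what you expect on a tracking-parameter variant. It becomes a problem when the main version of a page points away from itself.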

These checks give you a sense of the situation, but they aren’t the complete analysis. There are duplicates that arise from the CMS structure — parameters, pagination, taxonomies that generate multiple URLs — that require an in-depth technical audit with dedicated tools to be identified and resolved.

The link to site structure

Duplicates aren’t an isolated problem. They’re the symptom of a structure that wasn’t designed with AI visibility in mind. If you’ve read my articles on silo architecture and on the hub and spoke model, you know that the site structure must be intentional — each page has a role, covers a specific topic, connects to the others with precise logic.

Duplicates break this logic. Two pages on the same topic mean the network has a doubled node: internal links split between the two versions, the topical signal fragments, and the AI engine doesn’t know which of the two to consider as the authoritative source on that topic.

Consolidating duplicates and implementing canonical tags isn’t glamorous work. You won’t see a traffic spike the next day. But it’s structural cleanup work that lets everything else — internal links, taxonomy, the hub and spoke model — work as it should. Without this clean foundation, even the best content strategy disperses the signal across URLs that compete with one another. With this foundation, each page has a clear role and a concentrated signal that the AI engine can read without ambiguity.


The author

Photo: Roberto Serra at the Senate of the Republic, Palazzo Giustiniani, conference “The power of artificial intelligence”

Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
