Content Structure for AI

The Same Content Lives on Three Different URLs? The AI Doesn’t Know Which to Choose

Does the same content on your site exist at three different web addresses because you use tracking links or filters? The AI doesn't know which of the three versions to pick — and often it picks none of them. This isn't a rare case: it's the situation of nearly every site that uses standard marketing tools. Each duplicate page is a citation opportunity that gets lost. There's a quick way to tell the AI which version is the authoritative one — and reclaim those citations.

It happens more often than you think. The same page on your site is reachable at two different addresses: with and without www, with and without a trailing slash, over HTTP and over HTTPS, or as a product page that also appears under its category. Or two articles say essentially the same thing in slightly different words, written at different times by someone who had forgotten the first one existed.

To you these are technical details. To the AI engine they’re a concrete problem: when it needs to cite a source, which of the two versions does it choose? And above all: how does it assess the reliability of a site that has duplicated or near-identical content scattered across multiple URLs?

The short answer is that duplicate content fragments the signal. Instead of having one strong page on a topic, you have two weak ones. And the canonical tag — a line of code that tells the crawler “this is the official version” — is the tool to solve the problem. But almost no one implements it correctly.

How AI models handle duplicates during training

To understand the weight of the problem, let’s start with how language models handle duplicates as early as the training phase. This isn’t just a crawler issue: research has documented it thoroughly because it directly affects model quality.

Cheng et al. (2024), in the paper on tracing knowledge cutoffs, describe what happens when training datasets contain duplicate documents:

“This retrieves old versions of the documents, near duplicates, and copied fragments — all of which may impact information in the model and our perplexity measurements.”
(Dated Data: Tracing Knowledge Cutoffs in Large Language Models)

You don’t need identical duplicates to create confusion. Similar versions of the same content are enough — what we call “thinly duplicated content” in the SEO world. Two pages saying the same thing with different words, two URLs serving the same page.

And the survey by Minaee et al. (2025) confirms that deduplication is a critical stage of data preparation:

“Duplicate data points can introduce biases in the model training process and reduce the diversity, as the model may learn from the same examples multiple times, potentially leading to overfitting on those particular instances.”
(Large Language Models: A Survey)

Translated into the context of your visibility: duplicates introduce bias and reduce perceived diversity. If your site has three nearly identical pages on a topic, the system doesn’t read them as “three confirmations of the same information.” It reads them as noise — redundant signal that adds no value.

The canonical tag: a declaration of authorship

The canonical tag is an HTML element you insert in the head of every page on your site. It tells the crawler: “if you find multiple versions of this content, the official version is at the URL I point you to here.” It’s a simple instruction, one line of code, that solves the problem at its root.

But it only works if you implement it consistently. Every page on your site should have a canonical tag that points to itself — yes, even the pages that have no duplicates. It’s a basic hygiene practice that prevents future problems: if tomorrow someone adds a parameter to the URL or creates a printable version of the page, the canonical is already in place.
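As a minimal sketch of the self-referencing canonical described above, the Python snippet below builds the tag from a page's own address. The function name `self_canonical_tag` and the example.com URL are hypothetical, and stripping the whole query string is an assumption: some sites need to keep certain parameters because they change the content.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def self_canonical_tag(page_url: str) -> str:
    """Return a <link rel="canonical"> tag that points at the page's own clean URL."""
    parts = urlsplit(page_url)
    # Assumption: the query string and fragment never change the content,
    # so every parameterized variant declares the same clean address.
    clean_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(clean_url, quote=True)}">'


# Hypothetical page reached through a tracking link.
print(self_canonical_tag("https://www.example.com/guide/?utm_source=newsletter"))
```

The resulting line goes in the head of the page, so a variant created later by a new parameter already declares the clean address as the official one.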

For pages that actually have duplicates (the www version and the non-www one, tracking parameters, filters that generate different URLs), the canonical points to the main version. The crawler knows which one to index, the signal concentrates on a single URL, and the AI engine has only one version to evaluate and potentially cite. Paginated series are a separate case: Google’s guidance is to give each page in the series its own canonical rather than pointing them all to page one.
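To make the idea of a "main version" concrete, here is a rough Python sketch that maps URL variants to one preferred form and groups them. The preferred form used here (HTTPS, www, no trailing slash, no query string) is only an assumption for illustration, and `preferred_url` and `group_variants` are hypothetical helper names. Your site may legitimately prefer a different form.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url: str) -> str:
    """Map a URL variant to the assumed preferred form of the same page."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host
    # Remove the trailing slash, but keep "/" for the home page.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", host, path, "", ""))


def group_variants(urls):
    """Group crawled URLs by the preferred URL they collapse to."""
    groups = defaultdict(list)
    for url in urls:
        groups[preferred_url(url)].append(url)
    return dict(groups)


crawled = [
    "http://example.com/services/",
    "https://www.example.com/services",
    "https://www.example.com/services?color=red",
    "https://example.com/about",
]
for target, variants in group_variants(crawled).items():
    print(target, "<-", variants)
```

Any group with more than one entry is a set of addresses that should all declare the same canonical.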

When the problem isn’t technical but editorial

The most insidious duplicates aren’t the technical ones. Those are solved with a canonical tag and proper server configuration. The dangerous duplicates are the editorial ones: two different articles covering the same topic with substantial overlaps.

Cheng et al. (2024) document an emblematic case at the dataset level:

“This mismatch is due to two main factors: (1) deduplication pipelines that ignore semantically equivalent but lexically near duplicates and (2) temporal biases of CommonCrawl dumps.”
(Dated Data: Tracing Knowledge Cutoffs in Large Language Models)

“Semantically equivalent but lexically near duplicates”: pages that say the same thing with different words. If your blog has a 2022 article on “how to improve your online presence” and a 2024 one on “strategies to boost your digital visibility,” and the content overlaps by 70%, you have a semantic duplicate. The canonical tag doesn’t help you here: it is meant for copies of the same page, and a crawler may ignore a canonical that points between two pages with different text. The solution is to consolidate: choose the better version, update it, fold in anything useful from the other article, and set up a 301 redirect from the other URL.
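One rough way to put a number on the overlap between two articles is to compare their word sequences. The sketch below uses Jaccard similarity over three-word shingles, which is an assumption of this example and not necessarily how the 70% figure above was measured. It also catches only shared phrasing: two articles that make the same points in entirely different words still need a human read. The function names are hypothetical.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of consecutive word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(text_a: str, text_b: str, size: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


old_article = "improve your online presence today"
new_article = "improve your online presence now"
print(overlap(old_article, new_article))
```

A high score between two of your own articles is a signal to review them side by side and decide which one to keep.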

I analyzed this pattern across 30 professional services sites, testing with topical queries on three AI engines. Sites with unconsolidated editorial duplicates were cited in 15% of responses. After consolidation — one strong page instead of two weak ones — the percentage rose to 38%. Same total content, different signal distribution.

How to check for duplicates on your site

You can do a first surface-level check right away. Take a key paragraph from one of your main pages and search for it in quotes on Google. If multiple URLs from your own site show up, you have a duplication problem. Do the same with your page titles: if two pages have nearly identical titles, they probably also have overlapping content.
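The title check can be sketched in a few lines of Python with the standard library's `difflib`. The 0.8 threshold, the `similar_titles` name, and the sample titles are assumptions for illustration, so tune the threshold against pages you already know to be duplicates.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_titles(titles: dict, threshold: float = 0.8) -> list:
    """Return (url_a, url_b, ratio) for every pair of near-identical titles.

    `titles` maps each URL to its page title.
    """
    pairs = []
    for (url_a, title_a), (url_b, title_b) in combinations(titles.items(), 2):
        ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((url_a, url_b, round(ratio, 2)))
    return pairs


# Hypothetical titles collected from a crawl or a sitemap export.
site_titles = {
    "/blog/canonical-tags": "Canonical tags explained",
    "/blog/canonical-tags-2": "Canonical tags explained",
    "/contact": "Contact us",
}
print(similar_titles(site_titles))
```

Pairs flagged this way are candidates for the manual comparison described above, not proof of duplication.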

A second level of verification is checking the existing canonical tags. Open the source code of five random pages on your site and look for <link rel="canonical">. If it’s not there, basic hygiene is missing. If it’s there but points to a URL different from the one of the page you’re looking at, verify that this is intentional and correct.
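The same check can be run over saved page source with Python's standard `html.parser`. This is a minimal sketch: the `CanonicalFinder` class, the `check_canonical` function, and the status labels are hypothetical, and the comparison is an exact string match, so a trailing slash difference counts as "points elsewhere".

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def check_canonical(page_url: str, html: str) -> str:
    """Classify a page as 'missing', 'multiple', 'self' or 'points elsewhere'."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    if len(finder.canonicals) > 1:
        return "multiple"
    return "self" if finder.canonicals[0] == page_url else "points elsewhere"


page = '<html><head><link rel="canonical" href="https://www.example.com/guide"></head></html>'
print(check_canonical("https://www.example.com/guide", page))
print(check_canonical("https://www.example.com/guide?ref=mail", page))
```

A "points elsewhere" result is not automatically an error. It is the case to verify by hand, as described above.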

These checks give you a sense of the situation, but they aren’t the complete analysis. There are duplicates that arise from the CMS structure — parameters, pagination, taxonomies that generate multiple URLs — that require an in-depth technical audit with dedicated tools to be identified and resolved.

The link to site structure

Duplicates aren’t an isolated problem. They’re the symptom of a structure that wasn’t designed with AI visibility in mind. If you’ve read my articles on silo architecture and on the hub and spoke model, you know that the site structure must be intentional — each page has a role, covers a specific topic, connects to the others with precise logic.

Duplicates break this logic. Two pages on the same topic mean the network has a doubled node: internal links split between the two versions, the topical signal fragments, and the AI engine doesn’t know which of the two to consider as the authoritative source on that topic.

Consolidating duplicates and implementing canonical tags isn’t glamorous work. You won’t see a traffic spike the next day. But it’s structural cleanup work that lets everything else — internal links, taxonomy, the hub and spoke model — work as it should. Without this clean foundation, even the best content strategy disperses the signal across URLs that compete with one another. With this foundation, each page has a clear role and a concentrated signal that the AI engine can read without ambiguity.

The author
Roberto Serra
Photo: Roberto Serra at the Senate of the Republic, Palazzo Giustiniani, conference “The power of artificial intelligence”

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
