How AI engines think

Copied content? The AI keeps the original and discards yours

If your articles rework content that already exists elsewhere, the AI keeps the original and discards yours — all the visibility goes to whoever wrote it first, even if you did a better job. It's not a plagiarism problem: even a text rewritten with different words, if it adds nothing original, is treated as a copy. You're producing content that gives you no advantage with the AI, while those who publish data or angles that don't exist anywhere else take it all. Understanding what makes content truly unique is the first step to stop wasting that work.

Stop for a second before you keep reading.

Think about the last piece of content you published on your site. How did you build it? You searched the top ten results on Google, you read them, you figured out which concepts to cover — and then you wrote. Similar structure, similar topics, similar angle. Different words, sure. But the model doesn’t compare words. It compares patterns.

If this is your content creation process, there’s a good chance your content doesn’t exist for the AI. Not because it’s bad. Because it was removed before training began.

The filter no one told you about

Before a model like GPT-4 or Gemini learns anything from the web, the data gets cleaned. One of the most impactful cleaning stages is deduplication.

“Data cleaning techniques such as filtering, deduplication, are shown to have a beneficial effect on training.”

Minaee et al., 2025.

“Beneficial effect on training” means that models trained on deduplicated data tend to perform better: they are more accurate, less inclined to memorize repetitive patterns, and more capable of generalizing. That is why the teams building the datasets don’t treat deduplication as optional: it’s a standard stage.

The operational definition is just as clear. From the same paper:

“De-duplication refers to the process of removing duplicate or near-duplicate data from the training set.”

Minaee et al., 2025.

Duplicates and near-duplicates: this distinction is the part that concerns you directly. You’re not only at risk if you copied entire paragraphs. You’re at risk if your content is similar enough to a document that already exists, even if you reworded it.

How deduplication works in practice

Deduplication algorithms such as MinHash and SimHash, usually combined with locality-sensitive hashing, don’t read text the way you do. They don’t require two documents to be identical. They reduce each document to a compact numerical fingerprint, typically computed from its overlapping word or character sequences, then compare these fingerprints across billions of documents.
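To make the idea of a numerical representation concrete, here is a minimal sketch of MinHash in its common textbook formulation, where each document is reduced to a set of overlapping word sequences (shingles). The function names, the three-word shingle size and the 128 hash functions are illustrative assumptions, not the settings of any specific training pipeline.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item: str, seed: int) -> int:
    """One member of a family of 64-bit hash functions, selected by seed."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each hash function, keep the smallest hash value over all shingles."""
    if not items:
        raise ValueError("cannot build a signature for an empty document")
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions approximates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Illustrative sentences: the second changes a single word of the first.
original = "deduplication removes duplicate or near duplicate documents from the training set"
reworded = "deduplication removes duplicate or near duplicate pages from the training set"

print(jaccard(shingles(original), shingles(reworded)))  # 0.5
print(estimated_similarity(
    minhash_signature(shingles(original)),
    minhash_signature(shingles(reworded)),
))
```

The two example sentences share 6 of their 12 distinct three-word shingles, so their exact similarity is 0.5. The signature comparison only estimates that value, but it needs a fixed number of integers per document rather than the full text, which is what makes comparison at web scale affordable.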

When two documents produce representations that are too similar, the system treats them as near-duplicates. One is kept, the other removed. The selection criterion varies depending on the dataset: publication date, domain authority, coverage. But the recurring pattern is that the original — the document that established the pattern first — gets priority.
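The keep-one, drop-the-rest step can be sketched as follows. This is a simplified illustration, assuming that publication date is the only selection criterion and that similarity is measured as exact shingle overlap. The `Document` class, the `keep_earliest` function, the example URLs and the 0.5 threshold are all hypothetical, and real pipelines may weigh other criteria.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    url: str          # hypothetical identifier for the page
    published: date   # assumed to be known and reliable
    text: str


def word_shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word sequences; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = word_shingles(a), word_shingles(b)
    return len(sa & sb) / len(sa | sb)


def keep_earliest(docs: list[Document], threshold: float = 0.5) -> list[Document]:
    """Walk documents from oldest to newest and drop any document that is
    at least `threshold` similar to one that has already been kept."""
    kept: list[Document] = []
    for doc in sorted(docs, key=lambda d: d.published):
        if all(similarity(doc.text, earlier.text) < threshold for earlier in kept):
            kept.append(doc)
    return kept


docs = [
    Document("https://example.com/rewrite", date(2024, 3, 2),
             "deduplication removes duplicate or near duplicate pages from the training set"),
    Document("https://example.com/original", date(2023, 1, 10),
             "deduplication removes duplicate or near duplicate documents from the training set"),
    Document("https://example.com/unique", date(2024, 6, 1),
             "our benchmark measured checkout latency across forty regional stores last winter"),
]

for doc in keep_earliest(docs):
    print(doc.url)
```

In this example the later rewrite is dropped because it overlaps with the earlier page, while the page built on its own measurement is kept regardless of its date.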

The results are measurable. Minaee et al. (2025) cite a concrete case:

“As an example, in Falcon40B, Penedo et al. showed that properly filtered and deduplicated web data can lead to better performance.”

Falcon40B is a real model, trained on real data, that delivered better performance thanks to aggressive deduplication. It’s not a theory — it’s an engineering choice already made, already replicated, already validated at scale.

The practical consequence: deduplication is standard practice in the datasets built to train competitive models, and any piece of content too similar to one that already exists is likely to be discarded.

Common mistake

Stop rewriting, start creating: if your process is “read the top 10 and rewrite”, you’re producing near-duplicates by definition.

The test you need to run right now

This is the moment when you might think: “but I don’t copy, I rephrase”. Fine. The problem is that deduplication doesn’t judge intent — it measures structural similarity.

Take your most important article and run this test:

  1. Search the topic on Google
  2. Open the top five results and read the structure — not the text, the structure
  3. How many sections does each article have? In what order? Which concepts do they cover and in what sequence?
  4. Now compare it to yours. Is the structure the same? Is the order of the topics the same? Do the concepts covered match?

If the answer is yes, you have a near-duplicate. The original — the first result you found on Google, the one with the most authoritative domain — is probably in the training set. Yours, in all likelihood, was removed.

This also applies to content that looks different on the surface. If fifty blog posts in an industry cover “what X is, five benefits of X, how to implement X, conclusion”, deduplication treats them as variations of the same document. It doesn’t keep fifty of them. It keeps one or two. The others don’t exist in the training set.
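The manual test above can be roughly automated once you have written down, by hand, the sequence of concepts each article covers. The sketch below is only a heuristic for that comparison, not what a deduplication pipeline computes. The `outline_similarity` function and the concept labels are hypothetical, and the score comes from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase a concept label and collapse repeated whitespace."""
    return " ".join(label.lower().split())


def outline_similarity(mine: list[str], theirs: list[str]) -> float:
    """Score from 0.0 to 1.0 for how closely two outlines match,
    taking both the concepts covered and their order into account."""
    a = [normalize(label) for label in mine]
    b = [normalize(label) for label in theirs]
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Concept labels assigned by hand while reading each article's structure.
my_article = ["definition", "benefits", "implementation", "conclusion"]
top_result = ["Definition", "Benefits", "Implementation", "Conclusion"]
case_first = ["failure case", "measured data", "definition", "alternative"]

print(outline_similarity(my_article, top_result))  # 1.0
print(outline_similarity(my_article, case_first))  # 0.25
```

A score close to 1.0 against several top results suggests the article follows the standard outline. The second outline shares only one concept with the first, in a different position, which is why it scores 0.25.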

Pro tip

Add your unique data point as an anchor: even a single original number gives the content something no other document contains.

What survives deduplication

The survival criterion is simple: your content must have at least one element that doesn’t exist in the others. Not “better written”. Not “more complete”. A structurally unique element that no other document in the dataset has.

The elements that hold up against deduplication:

  • An original data point: a statistic from your business, a benchmark you measured, a number that exists only in your direct experience. A data point that only you have cannot already exist elsewhere in the corpus.
  • An angle no one else takes: not a “creative” approach, but a genuinely different perspective. Not “10 tips for X” but “why X doesn’t work in industry Y and what to do instead”. A different starting question produces a different structure.
  • A direct, non-replicable experience: a real case, an anonymized client, a project with verifiable results. No one can have the same experience — it’s unique by definition.
  • A narrative structure different from the standard one: start from the reader’s problem, tell the case first and then the mechanism, build the content backwards. A different structure produces a different numerical representation.

The element doesn’t have to be huge. It just has to exist and be real.

The connection with the pre-training data mix

Deduplication doesn’t operate on isolated documents. It operates on entire datasets — The Pile, RedPajama, Common Crawl, Falcon RefinedWeb. As you read in the article on the pre-training data mix, these datasets aggregate billions of pages and balance them by industry and language before using them in training.

Deduplication is applied at this aggregate level. Your content isn’t compared only with the direct competitors in your industry — it’s compared with the entire corpus. If an article with the same structure as yours exists in English on an English-speaking domain published a year earlier, the system can treat yours as a near-duplicate even if you’ve never heard of that site.
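Comparing every document with every other document in a corpus of billions would be too expensive, which is where locality-sensitive hashing comes in. The sketch below shows the standard banding technique on top of MinHash signatures built from word shingles: documents that share at least one identical band of their signature become candidate pairs. The function names, the 64 hash functions and the 16 bands are illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def word_shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word sequences; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def signature(text: str, num_hashes: int = 64) -> list[int]:
    """MinHash signature: the minimum 64-bit hash per seeded hash function."""
    items = word_shingles(text)
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return sig


def candidate_pairs(corpus: dict[str, str], bands: int = 16,
                    num_hashes: int = 64) -> set[tuple[str, str]]:
    """Return pairs of document ids that collide in at least one band."""
    if num_hashes % bands != 0:
        raise ValueError("num_hashes must be divisible by bands")
    rows = num_hashes // bands
    buckets: dict[tuple, set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        sig = signature(text, num_hashes)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


corpus = {
    "a-original": "deduplication removes duplicate or near duplicate documents from the training set",
    "b-mirror": "deduplication removes duplicate or near duplicate documents from the training set",
    "c-unrelated": "our benchmark measured checkout latency across forty regional stores last winter",
}
print(candidate_pairs(corpus))
```

With b bands of r rows each, two documents whose shingle similarity is s become candidates with probability 1 - (1 - s^r)^b, so near-identical pages are almost always paired and unrelated pages almost never are. Only the candidate pairs are then compared in detail, whatever site they come from.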

This carries an operational implication: it’s not enough to be original compared to the Italian competitors in your industry. You have to be original compared to the entire corpus.

Fine-tuning won’t save you if your content isn’t in the pre-training

There’s a widespread belief that models are updated frequently and that fine-tuning can introduce new information. As you read in the article on fine-tuning, this belief is partly wrong.

Fine-tuning doesn’t add knowledge — it aligns the model’s behavior on a specific task. The underlying knowledge comes from pre-training, and pre-training is done on datasets that have undergone deduplication. If your content was removed at that stage, no later stage recovers it. It’s not in the model. It doesn’t contribute to the answers.

The moment when you can act is before, not after. You can influence the next training cycles — which happen regularly on new corpora — by producing content that gets past the deduplication filter. You can do it now.

How to influence the next training cycles

Training large models isn’t a single event: it’s an iterative process. New models are trained on new datasets that include content published after the previous cycle. By producing structurally unique content now, you increase the probability of being included in the next cycles.

The concrete actions:

  • Stop rewriting, start creating: if your process is “read the top 10 and rewrite”, you’re producing near-duplicates by definition. Start from an angle, a data point or an experience — then build the content on top of it.
  • Add your unique data point as an anchor: even a single original number gives the content something no other document contains. Put it at the beginning, not as a side note. Make it the structural starting point.
  • Publish first: deduplication favors the original. If you have a genuinely new idea, publishing it first establishes temporal priority. Waiting for someone else to publish it first means yours becomes, for the algorithm, the near-duplicate.
  • Audit your existing content: review your ten most important pieces of content. For each one, ask: does a similar piece with an identical structure exist on a more authoritative domain? If so, yours has probably already been filtered out of earlier training datasets. Don’t just update it: rewrite it with a different structure and an original data point.

The rule that changes how you produce content

RLHF and Constitutional AI define how models choose their answers and which content to favor in citations. But that logic operates on a corpus that has already been filtered. Deduplication is the stage that precedes everything else — it’s the entry gate.

If your content doesn’t get past the deduplication filter, it doesn’t enter training. If it doesn’t enter training, it can’t influence the answers. It can’t be cited, preferred, used as a source. For the model, it doesn’t exist.

The operational rule is this: every piece of content you publish must have at least one element that doesn’t exist in any other document in the corpus. Not better. Unique.

It’s not a matter of writing quality. It’s a matter of structure. A mediocre piece of content with an original data point gets past the deduplication filter. An excellent piece that replicates the standard structure doesn’t.

Take your next piece of content. Before you write it, ask yourself: what’s the unique element? If you can’t answer in one sentence, you haven’t found the right angle yet.

The author
Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
