You publish original data and industry research, but when Perplexity cites it, it names the competitor who copied it — not you. It's not a model error: it's a structural consequence of how the AI retrieves attributions, which works correctly in only a third of cases. You're producing research that builds authority for others. The solution isn't to chase whoever copied you: it's to write your data so that it's impossible to cite without citing your name — a precise structure that applies to the reports you already have.
The AI cites you, but attributes your sentence to a competitor. It’s not a bug: it’s a known mechanism — and it can be fixed. Here’s how.
Let me explain how it works. You publish a report, an original data point, an industry analysis. Others pick it up, perhaps without linking to you. When Perplexity or ChatGPT generates an answer that uses that data, the source shown next to the citation is the site that acted as a megaphone, not yours. You did the work, someone else builds the authority.
In my articles in this series on citation signals, I’ve already explained that to appear in AI answers you need to be recognized as a source. Let’s move to a subtler level: being cited isn’t enough. You need to be cited with your name inside the data.
What an AI model means by correct attribution
When an LLM generates an answer with citations, it’s doing two different things in parallel. It retrieves pieces of text and then links them to a source. The problem is that these two operations fail far more often than it seems.
A paper published in 2024 measured how reliably a model like ChatGPT-4o attributes a citation to the right paper. The numbers it reports are instructive.
In plain terms, for those who aren’t researchers: even with the full text available, the model retrieves the right citation in just over a third of cases. With only the abstract, recall drops below 20%. Precision is high, but that only means the attributions it does return are usually correct, not that it finds the right source often.
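If the difference between precision and recall feels abstract, a tiny calculation helps. What follows is a minimal sketch in Python with made-up numbers (they are not the paper’s figures), and `precision_recall` is a name I chose for illustration.

```python
def precision_recall(returned: set[str], correct: set[str]) -> tuple[float, float]:
    """Precision: share of returned attributions that are right.
    Recall: share of the right attributions that were actually returned."""
    hits = len(returned & correct)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical toy data: 10 citations the model should have attributed.
correct = {f"c{i}" for i in range(1, 11)}
# The model returns only 4 attributions, 3 of them right and 1 wrong.
returned = {"c1", "c2", "c3", "x1"}

p, r = precision_recall(returned, correct)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.30
```

In this toy case most of what the model returns is right, yet seven of the ten sources never get their credit. That second number is the one that hurts you.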
The operational consequence for you is this: if your data circulates online without your name attached to it, the probability that the model correctly attributes authorship to you is low. It’s not malice on Perplexity’s part. It’s a structural limit of the retrieval mechanism.
Why your brand must sit inside the data, not beside it
The typical cycle is this. You publish an analysis on your blog titled “Survey of Market X 2025”. Your homepage has the brand in the header, the footer links to you, the author’s byline is clear. Then the data gets picked up by an industry outlet, which writes “according to a recent survey, 34% of buyers…”. Your name disappears from the sentence.
An AI model that indexes that outlet sees the sentence but not your brand. When it’s asked “what’s the share of X buyers”, it’ll pull out the 34% and attribute it to the outlet, not to you.
The study “CiteGuard: Faithful Citation Attribution for LLMs via Retrieval-Augmented Validation” puts it this way:
“Therefore, a thorough and faithful literature review and citation attribution of claims are essential to understand the history and scope of a subject area, and ensure that new findings are properly contextualized.”
The point of the paper is that faithful attribution serves to reconstruct the history and scope of a topic. If that attribution breaks down, contributions get “contextualized” wrongly — that is, assigned to the wrong source.
Translated into practice: you have to make citation without attribution impossible. Your brand must be inside the sentence of the data, not in the page footer. “According to the Market X Survey 2025 by [BrandName], 34%…” is a construction that survives reuse. If an outlet copies that sentence, your name travels with the data.
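You can check your own texts for this before publishing. Here is a minimal sketch in Python, assuming a deliberately simple rule: any sentence that contains a digit counts as a “data sentence”. The function name `unsigned_data_sentences` and the sample text are hypothetical, invented for illustration.

```python
import re


def unsigned_data_sentences(text: str, data_name: str) -> list[str]:
    """Return sentences that contain a figure but not the named data source."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        has_figure = re.search(r"\d", sentence) is not None
        has_name = data_name.lower() in sentence.lower()
        if has_figure and not has_name:
            flagged.append(sentence)
    return flagged


# Hypothetical sample: the first sentence carries the name, the second does not.
sample = (
    "According to the Market X Survey 2025 by BrandName, 34% of buyers did X. "
    "According to a recent survey, 34% of buyers did X. "
    "We thank everyone who took part."
)

for sentence in unsigned_data_sentences(sample, "Market X Survey 2025"):
    print("Missing the data name:", sentence)
```

Every sentence it flags is one that would travel without your name if someone copied it as is.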
This mechanism works hand in hand with what I described in the articles on backlinks as citation proxy and on implicit reference weight: unlinked mentions count, but only if they carry your name.
The reverse engineering I did on a vertical sector
Let me give you a concrete example I tested recently. Cremona, violin making — the sector of master luthiers who build violins by hand, one of Lombardy’s artisanal excellences recognized even by UNESCO.
I reverse engineered how Perplexity answers three queries on the topic:
- “how long does it take to build a violin by hand”
- “wood used for violins from Cremonese violin making”
- “how many active master luthiers in Cremona”
Across these three queries, Perplexity returned a total of 9 cited sources. I traced by hand, clicking on every single source and reading the original passages, where each data point in the answer really came from.
Two attributions out of nine were wrong. The data point “200-300 hours of work per instrument” was attributed to a Lombard tourism outlet, but going back through that outlet’s text I found an implicit reference to an interview published years earlier on the site of a consortium of Cremonese luthiers. The consortium had generated the data, the tourism outlet had grabbed the citation in Perplexity.
Similarly, a data point on wood species (Val di Fiemme spruce for the soundboard, Balkan maple for the back) was attributed to a travel blog. The blog had copied almost verbatim from an educational document of a Cremona workshop, which had no brand in the body of the text. Result: the workshop did the work, the blog collects the authority signal.
Let me be clear about the limit: it’s an indicative test on a single sector and three queries. It’s not a study. It serves to show you how the problem manifests concretely — systematic analysis requires professional tools and a much larger sample.
The test you can run yourself in 20 minutes
You don’t need to be technical. You just need method.
Open Perplexity. Write three queries typical of your sector — the ones a potential customer would ask. For a Cremonese master luthier, for an artisanal pasta producer from Parma, for a landscape architecture firm in Padua, only the domain changes.
For each Perplexity answer:
- Open all the cited sources
- Look in those sources’ texts for the specific data point used in the AI answer
- Ask yourself: is the data original to that source, or copied from someone else?
If, repeating the cycle across 5-6 sources, you discover that at least half of the cited data has a different origin from the attributed source, you’ve just mapped the problem of your niche. And you’ve probably discovered that some of your original data is building authority for someone else.
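If you want to keep score while you click through the sources, a few lines are enough. This is a minimal sketch in Python, not a tool: the `CitedDataPoint` structure, the function name, and the example rows (with `.example` domains) are all hypothetical, and you fill in the origin by hand after reading each source.

```python
from dataclasses import dataclass


@dataclass
class CitedDataPoint:
    query: str
    cited_source: str  # the source shown next to the data in the AI answer
    true_origin: str   # where you found the data actually originated


def misattribution_share(rows: list[CitedDataPoint]) -> float:
    """Share of data points whose cited source is not the true origin."""
    if not rows:
        return 0.0
    wrong = sum(1 for row in rows if row.cited_source != row.true_origin)
    return wrong / len(rows)


# Hypothetical notes from a manual check of four cited data points.
rows = [
    CitedDataPoint("query 1", "tourism-outlet.example", "consortium.example"),
    CitedDataPoint("query 1", "consortium.example", "consortium.example"),
    CitedDataPoint("query 2", "travel-blog.example", "workshop.example"),
    CitedDataPoint("query 3", "museum.example", "museum.example"),
]

share = misattribution_share(rows)
print(f"{share:.0%} of the cited data points come from a different origin")
if share >= 0.5:
    print("At least half: you have mapped the problem of your niche.")
```

The threshold of one half is the same one described above. Adjust it only if you have a reason to.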
You can complement this check by looking at the entities in the AI answer with displaCy ENT: paste the answer and see which named brands the NER model recognizes. If your brand isn’t recognized, you know you’re invisible even at the entity level. The related topic is covered in the article on named entity recognition.
The mistakes I see most often
The first is the data without a signature in the body. A 40-page PDF report with the brand only on the cover and in the footer. When a journalist copies a sentence mid-page, they take the data and leave the brand behind.
The second is the claim without a proper name for the data. “According to our analysis” isn’t enough: when that sentence gets reused, “our” becomes whoever reports it. You need a christened name: “Cremona Violin Making Observatory 2025”, “Parma Artisanal Pasta Report 2025”. The name is a lock that travels with the key.
The third is the lack of systematicity. A single well-signed report doesn’t move the needle. You need a cadence — semiannual, annual — that creates expectation and that outlets learn to always call by the same name.
The fourth is the outlet’s copy-paste. When an outlet picks up your data and rewrites it in its own words, the link to your site is often missing. Here you can’t control everything, but if the name of the data is inside the sentence — “the Cremona Luthiers Survey 2025 shows that…” — even a paraphrase carries the brand along.
What to do starting Monday
Concrete steps, in the order I’d do them:
- Give a proper name to every original data point you publish. “Observatory X”, “Report Y”, “Survey Z”. With the year.
- Put the name of the data inside the sentence, not just on the cover. In the chart caption, in the launch tweet, in the H2.
- In the press release, repeat the name at least three times: in the title, in the first paragraph, in the footnote.
- When someone cites you without the name of the data, write to them. Not to request a link (which often doesn’t come), but to ask for a textual correction along the lines of “according to the X Survey by [Brand]”.
- Rerun the Perplexity test at 90 days. If the answers start to contain the name of your data, you’re building the right signal.
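To compare the first test with the rerun at 90 days, you can save the answers as plain text and count how many contain the name of your data. A minimal sketch in Python, assuming you paste the answers in by hand: the function name `share_naming_data` and the sample answers are hypothetical.

```python
def share_naming_data(answers: list[str], data_name: str) -> float:
    """Share of saved AI answers that mention the data by its proper name."""
    if not answers:
        return 0.0
    needle = data_name.casefold()
    named = sum(1 for answer in answers if needle in answer.casefold())
    return named / len(answers)


DATA_NAME = "X Survey 2025"  # placeholder for the name you gave your data

# Hypothetical answers saved on day 0 and on day 90 for the same two queries.
day_0 = [
    "According to a recent survey, 34% of buyers did X.",
    "Industry sources report that 34% of buyers did X.",
]
day_90 = [
    "According to the X Survey 2025 by Brand, 34% of buyers did X.",
    "Industry sources report that 34% of buyers did X.",
]

before = share_naming_data(day_0, DATA_NAME)
after = share_naming_data(day_90, DATA_NAME)
print(f"Answers naming the data: {before:.0%} on day 0, {after:.0%} on day 90")
```

A plain substring match misses paraphrased names, so treat a low number as a prompt to read the answers, not as a verdict.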
The mechanism isn’t magic and doesn’t live on its own: it works within a strategy of E-E-A-T for AI and author entity recognition that I described in the previous articles. Without those foundations, data attribution is a patch. With those foundations it becomes a multiplier.
Where does all this lead?
Accurate citation attribution is the bridge between the work you do and actually appearing in AI answers with your name next to the data. If the bridge breaks, everything else — the backlinks, the author byline, the content structure — pushes authority onto someone else.
In the next articles in this series we’ll look at how to monitor unlinked mentions, how to structure press releases so they survive rewriting, and how to build a proprietary data asset with a proper name that becomes a reference for the sector.