Authority and Credibility for AI

Academic papers, Wikipedia, media: the source hierarchy for AI

Roberto Serra · 25 June 2026 · ~7 min read

Do you know where your site sits in the hierarchy of sources the AI uses to decide who to trust? Academic papers and institutional sites sit at the top, personal blogs at the bottom, and the distance between the levels is enormous. As a rule of thumb, one euro invested in visibility on an authoritative source tends to produce more impact than ten euros spent on generic sources. Understanding where you stand today and which leap is most accessible is the starting point for a strategy that actually works.

If you’ve read my articles on how pre-training data works, you already know that AI models aren’t trained on “the internet” indiscriminately. Datasets have a specific composition: Wikipedia, Common Crawl, Reddit, academic papers, books. Each component carries a different weight in the final mix.

What many people don’t realize is that this composition has a direct consequence: not all sources carry the same weight for the AI. There’s an implicit hierarchy, never stated as such in the technical papers but reconstructable from the mechanics of training and from empirical data. Understanding where your content sits in this hierarchy is the first step toward understanding why the AI ignores you or recommends you.

Where does the hierarchy come from?

The starting point is simple: pre-training datasets don’t treat all sources the same way. Wikipedia is included almost in full, cleaned up and weighted as a high-density information source. Common Crawl is heavily filtered — most web pages are discarded because they’re too noisy, duplicated, or low quality. Reddit comes in as a proxy for conversational language and collective judgments. Academic papers come in as high-reliability sources on technical domains.

This selection isn’t random. It reflects an implicit judgment about which sources deserve to contribute to the model’s knowledge. And that judgment crystallizes in the model’s weights during training: text coming from a high-reputation source leaves a deeper imprint than text coming from just any page.

The empirical fact: the AI prefers third-party sources

If the bias were only in training, you could argue it’s a historical artifact. But the data shows it propagates into the responses as well. Chen et al. in 2025 analyzed which types of sources AI engines tend to favor, and the result leaves little room for interpretation:

“AI Search exhibit a systematic and overwhelming bias towards Earned media — third-party, authoritative sources — over Brand-owned and Social content.” — Chen et al., 2025

“Systematic and overwhelming.” Not a slight preference — a structural and massive bias. The AI favors authoritative third-party sources over the content a brand produces about itself and over social content. Your corporate site, however polished, plays in a different league than an article about you on an industry publication.

From this fact, combined with the composition of the training, you can reconstruct an operational hierarchy of sources. It isn’t written in any paper as an “official ranking” — it’s a deduction I build by cross-referencing the mechanics of training with the empirical results. But the logic is solid.

The reconstructed hierarchy: five levels

Level 1 — Academic papers and official documentation. ArXiv, ACL Anthology, NeurIPS proceedings, technical documentation from OpenAI, Google, Anthropic. These are the sources that enter training with the highest weight for technical domains. A claim backed by a peer-reviewed paper carries a weight no blog post can match.

Level 2 — Wikipedia and structured knowledge bases. Wikipedia is the informational substrate of nearly every language model. If a concept, a brand, or a professional is present on Wikipedia with a well-documented entry, the model knows it at a deep structural level. Wikidata adds the relational layer — the connections between entities that feed the knowledge panels. I cover this in more depth in the article on Wikipedia and AI visibility.

Level 3 — Authoritative media and industry publications. Earned media in the full sense of the term: journalistic articles, independent reviews, industry analyses published on recognized outlets. These sources benefit from what Srba et al. document in their survey on credibility:

“Context-based signals considering user/source cues like domain reputation and publication metadata contribute most towards human judgement.”

Srba et al., 2024

The model learned to weight credibility by observing how humans assess it. And humans give more weight to domain reputation and publication metadata than to the content itself. A mention of your brand in the Financial Times carries a context signal that a mention on just any blog can’t replicate.

Level 4 — Professional directories, technical forums, community content. Stack Overflow, Reddit (in industry-specific subreddits), verified professional directories. These are sources that enter training and carry a medium credibility signal. Their value lies in community endorsement — the collective consensus that emerges from votes, answers, and discussions. It isn’t the same weight as a publication, but it’s a genuine signal the model registers.

Level 5 — Brand-owned content and social media. Your site, your social profiles, your press releases. These are the sources with the lowest weight in the hierarchy. Not because they’re useless — your site remains the starting base, and the structured data it contains feeds the model’s understanding. But the AI has learned the same lesson humans have always applied: those who talk about themselves are not the most reliable source on themselves.

Why this hierarchy matters for your visibility

The operational consequence is direct. If all your investment in content is concentrated on level 5 — website, social, corporate blog — you’re playing in the lowest tier of the hierarchy. Your content exists, the model may have ingested it during training, but the weight it assigns to it is structurally lower than that of a mention on a level-3 source or higher.

It isn’t a question of text quality. A perfect article on your corporate blog weighs less than a three-line mention on an industry publication. The mechanics don’t reward effort — they reward position in the source hierarchy.

This effect is amplified in RAG systems. When an AI engine like Perplexity retrieves external sources before generating the answer, the retrieved documents are ranked and filtered before they reach the synthesis. A level-3 source passes that filter more easily than a level-5 one, because domain reputation is one of the signals such a system can use to decide what to include and what to discard. The result is that the training hierarchy replicates itself in real time during retrieval.
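To make the retrieval step concrete, here is a minimal Python sketch of a reputation-weighted filter. It is not how Perplexity or any specific engine works. The weights in LEVEL_WEIGHT, the Retrieved type, the select_for_synthesis function, the score threshold, and the example URLs are all assumptions invented for illustration. The only idea taken from the text is that source reputation enters the decision alongside relevance.

```python
from dataclasses import dataclass

# Hypothetical reputation weight per hierarchy level (1 = highest).
# These numbers are illustrative assumptions, not measured values.
LEVEL_WEIGHT = {1: 1.0, 2: 0.9, 3: 0.8, 4: 0.5, 5: 0.3}


@dataclass
class Retrieved:
    url: str
    relevance: float  # retriever similarity score, assumed to be in [0, 1]
    level: int        # hierarchy level of the source, 1 (highest) to 5


def select_for_synthesis(docs, min_score=0.4, top_k=3):
    """Score each document as relevance * reputation weight,
    drop those below min_score, and return the top_k by score."""
    scored = [(d.relevance * LEVEL_WEIGHT[d.level], d) for d in docs]
    kept = [(score, d) for score, d in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in kept[:top_k]]


docs = [
    Retrieved("https://brand.example/blog/our-product", relevance=0.9, level=5),
    Retrieved("https://trade-publication.example/review", relevance=0.6, level=3),
    Retrieved("https://en.wikipedia.org/wiki/Example", relevance=0.5, level=2),
]
selected = select_for_synthesis(docs)
selected_urls = [d.url for d in selected]
print(selected_urls)
```

With these assumed weights, the brand's own page is the most relevant document but scores 0.27 and is dropped, while the less relevant level-3 and level-2 pages survive. That is the pattern the paragraph above describes: position in the hierarchy can outweigh how closely the text matches the query.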

This connects to an aspect I covered when discussing expertise validation: the model evaluates not only what you say, but where you say it and who confirms that it’s true. An expertise declared on your site is a claim. The same expertise confirmed by a level-2 or level-3 source is treated as a fact. The AI handles the two cases in radically different ways.

How to climb the hierarchy

The first check is to map where you stand now. Search for your brand while excluding your own domain and count how many mentions come from level-3 sources or higher. If the answer is “few or none,” you’ve identified the bottleneck.
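The mapping check can be sketched as a short Python script. The domain-to-level table below is a hypothetical sample built from the examples named in this article, not a published ranking, and treating unknown domains as level 5 is an assumption you should tighten by hand. The function names are also invented for illustration. The `-site:` exclusion operator is supported by Google and several other search engines. The script assumes you have already collected the result URLs yourself.

```python
from urllib.parse import urlparse

# Hypothetical sample mapping from domain to hierarchy level (1 = highest).
# Extend it with the publications that matter in your own industry.
DOMAIN_LEVELS = {
    "arxiv.org": 1,
    "aclanthology.org": 1,
    "wikipedia.org": 2,
    "wikidata.org": 2,
    "ft.com": 3,
    "stackoverflow.com": 4,
    "reddit.com": 4,
}
DEFAULT_LEVEL = 5  # assumption: anything unrecognized counts as level 5


def brand_query(brand, own_domain):
    """Build a search query for the brand that excludes its own domain."""
    return f'"{brand}" -site:{own_domain}'


def host_of(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def belongs_to(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def level_of(url):
    """Look up the hierarchy level of a URL, falling back to DEFAULT_LEVEL."""
    host = host_of(url)
    for domain, level in DOMAIN_LEVELS.items():
        if belongs_to(host, domain):
            return level
    return DEFAULT_LEVEL


def audit_mentions(urls, own_domain, threshold=3):
    """Count third-party mentions per level and how many sit at
    the threshold level or higher (lower number = higher level)."""
    counts = {level: 0 for level in range(1, 6)}
    for url in urls:
        if belongs_to(host_of(url), own_domain):
            continue  # skip brand-owned pages
        counts[level_of(url)] += 1
    strong = sum(n for level, n in counts.items() if level <= threshold)
    return counts, strong


query = brand_query("Acme", "example.com")
urls = [
    "https://www.example.com/blog/post",
    "https://en.wikipedia.org/wiki/Example",
    "https://www.ft.com/content/abc",
    "https://www.reddit.com/r/seo/comments/1",
    "https://someblog.example.net/review",
]
counts, strong = audit_mentions(urls, "example.com")
print(query)   # "Acme" -site:example.com
print(counts)  # {1: 0, 2: 1, 3: 1, 4: 1, 5: 1}
print(strong)  # 2
```

The number to watch is `strong`, the count of mentions at level 3 or higher. The table lookup is deliberately crude, so review the URLs that fall into the default level before drawing conclusions.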

From there, the strategy is to build presence at the levels that matter. Not all of them — there’s no point in aiming for academic papers if you don’t operate in an academic field. But levels 2 and 3 are accessible to any brand that has something real to offer. A Wikipedia entry requires verifiable notability. A mention on an industry publication requires you to do something worth writing about. Both require real work, not technical optimization.

A mistake I see often is concentrating the entire budget on level-5 content, hoping that quantity will make up for position. A hundred articles on the corporate blog don’t equal one citation on a level-3 source. Not because those hundred articles have no value — they do, for your direct audience, for organic ranking, for brand building. But in the calculation the AI makes to decide who to trust, the signal of an authoritative third-party source weighs disproportionately compared to self-declaration. It’s a disproportion that volume doesn’t offset.

The beauty of this hierarchy is that once you move up a level, the benefit is durable. A mention on an authoritative source that enters a training set stays in the models trained on it, while a social post has a useful life of hours. The difference accumulates over time, and it translates into a structural advantage that competitors who invest only at level 5 will struggle to close.

If you want to understand how citations from government sources fit into this hierarchy — and why they represent a special, very high-weight case — I cover it in the next article.

The hierarchy exists, whether you know it or not. The difference is deciding which level to build your presence on.