Authority and Credibility for AI

Wikipedia is the source every AI model checks first

Roberto Serra · 25 June 2026 · ~7 min read

Wikipedia is present in every large dataset used to train AI models, with a weight no other source matches. Being cited in a Wikipedia entry in your industry, even just as a reference, raises the probability that an AI model will mention your brand. Not every company can have its own page, but almost all of them can become a cited source in entries that already exist. Figuring out which of the two paths is viable for you takes very little time.

If you ask an AI engine something about a public figure, a well-known company, or a technical concept, the answer you get has an extremely high probability of being derived from Wikipedia. Not because the model “chooses” Wikipedia. Because Wikipedia is already inside the model, from the very first day of training.

This changes everything for anyone who wants to appear in AI answers. We are not talking about one channel among many. We are talking about the source that defines what the model considers a recognized entity and what it does not.

The structural fact: Wikipedia is in the DNA of every model

When it comes to training data, Wikipedia is not one element among many. It is a constant. Every large language model (from GPT to Gemini and Claude) has included Wikipedia in its pre-training mix. Not by convention, but for a precise technical reason.

Van Durme et al. document this clearly in 2024:

“Wikipedia is commonly used in pre-training, provides broad topic coverage, and is considered a high-quality reference source.”

Van Durme et al., 2024

Three characteristics in one sentence: common use in pre-training, broad coverage, high quality as a reference. This is not an opinion on the value of Wikipedia; it is a description of how the AI industry treats that source. For example, the training data mix of GPT-3 is listed as: “Common Crawl (filtered), WebText2, Books1, Books2, Wikipedia.” Wikipedia is named explicitly, alongside far larger web and book corpora.
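To make the idea of "weight" concrete, here is a minimal sketch that compares each dataset's share of the GPT-3 training mix with its share of the raw tokens. The figures are the ones reported in Table 2.2 of the GPT-3 paper (Brown et al., 2020), not numbers from this article, so check them against the paper before reusing them. The name `upsampling_factors` is purely illustrative.

```python
# Figures as reported in Table 2.2 of the GPT-3 paper (Brown et al., 2020):
# dataset -> (tokens in billions, share of the training mix in percent).
# The percentages are rounded in the paper, so they add up to 101.
GPT3_MIX = {
    "Common Crawl (filtered)": (410, 60),
    "WebText2": (19, 22),
    "Books1": (12, 8),
    "Books2": (55, 8),
    "Wikipedia": (3, 3),
}


def upsampling_factors(mix):
    """Return, for each dataset, its mix share divided by its raw token share.

    A value above 1 means the dataset is sampled more often than its size
    alone would suggest; a value below 1 means it is sampled less often.
    """
    total_tokens = sum(tokens for tokens, _ in mix.values())
    return {
        name: (weight / 100) / (tokens / total_tokens)
        for name, (tokens, weight) in mix.items()
    }


factors = upsampling_factors(GPT3_MIX)
for name, factor in sorted(factors.items(), key=lambda item: -item[1]):
    print(f"{name:<26} sampled at {factor:.1f}x its raw share")
```

On these figures Wikipedia accounts for well under 1% of the tokens but about 3% of the mix, roughly five times its raw share. WebText2 is upsampled too, while the filtered Common Crawl is sampled less than its size would suggest.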

In practice, when a model is trained, Wikipedia works as a kind of baseline encyclopedia that shapes the fundamental associations between concepts. The model learns what is connected to what, who is relevant to which topic, which entities exist and what relationships they have with one another. All of this before it ever sees a single user query.

From this follows a logical but important deduction: if your brand, your name, your company has no presence on Wikipedia — not even as a mention in a related entry — then for the model that fundamental connection does not exist. You are not in the baseline vocabulary.

Beyond training: Wikipedia in RAG systems

The weight of Wikipedia does not end in the pre-training phase. Modern AI systems, the ones you use when you query Perplexity or when ChatGPT searches for current information, rely on Retrieval-Augmented Generation (RAG): they retrieve external sources before generating the answer. And Wikipedia, together with its structured twin Wikidata, is among the first sources consulted.

Gong et al. in 2026 describe how the entity mapping process works in RAG-based fact-checking systems:

“Then it is followed by entity mapping to Wikidata nodes done by Wikidata API.”

Gong et al., 2026

When the system needs to verify a piece of information or build an answer, it maps the mentioned entities onto Wikidata nodes. Wikidata is the structured knowledge base that feeds data into Wikipedia (infoboxes and links between language editions, for example) and that, in turn, draws much of its content from Wikipedia. They are two sides of the same ecosystem.

Put simply: the system asks “does this entity exist in the structured knowledge graph?” — and Wikidata is the first place it goes to look for the answer. If your Wikidata node exists, with correct properties and verifiable references, the system recognizes you. If it does not exist, it has to reconstruct your identity from fragments scattered across the web — and it may decide there is not enough evidence to cite you.
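The lookup described above can be sketched with the public Wikidata API. This is a minimal illustration and not the pipeline Gong et al. describe: it assumes the `wbsearchentities` action of the Wikidata API, and the helper names (`build_search_url`, `parse_candidates`, `find_entity`) and the User-Agent string are hypothetical. The demonstration at the bottom runs offline on a hand-written payload shaped like a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=5):
    """Build a wbsearchentities query URL for a brand or entity name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def parse_candidates(payload):
    """Extract (id, label, description) tuples from a search response."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def find_entity(name):
    """Query Wikidata over the network and return candidate nodes."""
    request = Request(
        build_search_url(name),
        headers={"User-Agent": "entity-check-sketch/0.1"},  # illustrative value
    )
    with urlopen(request, timeout=10) as response:
        return parse_candidates(json.load(response))


# Offline demonstration: a hand-written payload in the shape the API returns.
sample = {
    "search": [
        {"id": "Q52", "label": "Wikipedia", "description": "free online encyclopedia"}
    ]
}
candidates = parse_candidates(sample)
print(candidates or "no Wikidata node found")
```

Calling `find_entity("Your Brand")` performs the real lookup. An empty list is the situation described above, where the system has to reconstruct your identity from scattered fragments.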

I talked about this in the article on the Knowledge Panel: Wikidata is one of the fundamental building blocks for existing as a structured entity. But here the point goes further. It is not just about having a node in the graph. It is about having an entry in the source that the model considers, by definition, the most reliable.

Wikipedia has not been replaced by AI (and that is where the opportunity lies)

You might think that with the rise of language models Wikipedia is losing relevance. That AI-generated content is contaminating or replacing traditional sources. The data says the opposite.

Huang et al. in 2025 specifically analyzed the impact of language models on Wikipedia, and the conclusion is reassuring for anyone working on this front:

“LLMs have not yet fully changed Wikipedia’s language and knowledge structures.”

Huang et al., 2025

Wikipedia retains its linguistic and knowledge structure. The collaborative editorial process, the notability rules, the system of verifiable citations — all of this has withstood the AI wave. And for the models this is a quality signal: Wikipedia remains a source with unique characteristics that the open web does not have.

This is why a presence on Wikipedia carries disproportionate weight. It is not one source among thousands — it is the reference source that the models treat as a benchmark of truth. If Wikipedia says that an entity exists and is notable, the model treats that information with a level of trust that no corporate website, no press release, no blog article can reach.

The complexity that goes unseen

And this is where the game gets complicated, because Wikipedia does not work like a social profile you open and fill in. It has precise rules, strict notability criteria, and a community of editors that monitors every change. An entry that does not meet the criteria gets deleted within hours. An edit made to an existing entry with promotional intent gets identified and reverted.

This is no game for amateurs. Serious work on Wikipedia and Wikidata requires understanding how the editorial criteria work, which secondary sources are needed to demonstrate notability, how to structure Wikidata properties so that the knowledge graph interprets them correctly, and how to link the Wikidata node to the schema markup of your site to create a coherent system.
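Linking the Wikidata node to your site's schema markup usually comes down to the schema.org `sameAs` property. Below is a minimal sketch that generates an `Organization` JSON-LD block. The company name, URL and item ID are placeholders, and `organization_jsonld` is a hypothetical helper, not part of any library.

```python
import json


def organization_jsonld(name, url, wikidata_id, wikipedia_url=None):
    """Build schema.org Organization markup that points at a Wikidata item."""
    same_as = [f"https://www.wikidata.org/wiki/{wikidata_id}"]
    if wikipedia_url:
        same_as.append(wikipedia_url)
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,
    }


# Placeholder values: replace them with your real name, site and item ID.
markup = organization_jsonld(
    name="Example Company",
    url="https://www.example.com",
    wikidata_id="Q00000000",
)

# Wrap the JSON-LD in the script tag that goes into the page's HTML.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
print(snippet)
```

The design choice is to have the site and the knowledge graph point at each other. The markup declares which Wikidata item the site belongs to, and the item's "official website" property (P856) should point back at the same URL.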

I have seen companies that tried to create their own Wikipedia page and watched it get deleted within a day. Others have a Wikidata item with wrong or incomplete properties, which generates no useful signal at all. The problem is not willingness. It is knowledge of an ecosystem that has its own rules, different from those of any other platform.

This is one of those cases where the self-check you can do right now is useful for understanding where you stand, but the solution requires expert hands. Search for your brand name on Wikipedia: is there a dedicated entry? Are you mentioned in entries related to your industry? Then search on Wikidata: do you have an item? Does it have correct and up-to-date properties? If the answer is no to everything, you have identified one of the bottlenecks with the greatest impact on your AI visibility.
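The Wikidata half of this self-check can be partly automated. Below is a minimal sketch that audits an item for a handful of properties and for the presence of references. It assumes the entity JSON shape returned by the Wikidata `wbgetentities` action, and the checklist itself (`EXPECTED`) is an illustrative selection, not an official requirement. The sample item is hand-written with a placeholder ID.

```python
# Illustrative checklist of Wikidata properties for a company item.
EXPECTED = {
    "P31": "instance of",
    "P856": "official website",
    "P571": "inception",
    "P452": "industry",
}


def audit_item(entity, expected=EXPECTED):
    """Report, per expected property, whether it is present and referenced."""
    claims = entity.get("claims", {})
    report = {}
    for property_id, label in expected.items():
        statements = claims.get(property_id, [])
        if not statements:
            report[label] = "missing"
        elif any(statement.get("references") for statement in statements):
            report[label] = "present, referenced"
        else:
            report[label] = "present, no reference"
    return report


# Hand-written sample: one referenced statement, one unreferenced statement.
sample_item = {
    "id": "Q00000000",  # placeholder item ID
    "claims": {
        "P31": [{"mainsnak": {}, "references": [{"snaks": {}}]}],
        "P856": [{"mainsnak": {}}],
    },
}

report = audit_item(sample_item)
for label, status in report.items():
    print(f"{label:<18} {status}")
```

A report full of "missing" or "present, no reference" lines corresponds to the incomplete items described earlier: the node exists, but it carries little that a system can verify.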

The link with the rest of the chain

A Wikipedia presence does not work in isolation. It is the foundation on which all the other authority signals I have analyzed in this series of articles rest. Your citations from authoritative sources carry more weight if the model already recognizes you as an entity. Community signals anchor themselves to a structured node instead of remaining scattered points. Your level in the hierarchy of sources rises if Wikipedia includes you in its own verified ecosystem.

And for those operating in sectors where citations from institutional sources matter — healthcare, finance, public administration — having a Wikidata node that links your entity to those institutional sources creates a signal that RAG systems interpret as extremely reliable cross-confirmation.

The principle is simple in theory: existing on Wikipedia and Wikidata means existing in the AI’s baseline vocabulary. But the execution is anything but simple. And the distance between “I know I should be on Wikipedia” and “I have a Wikipedia presence that generates the right signals for AI” is exactly the space where the game is played.

The author

Roberto Serra at the Senate of the Republic

Senate of the Republic · Palazzo Giustiniani · Conference “The power of artificial intelligence”

Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
