Authority and Credibility for AI

Your site is excellent but AI doesn’t know you? It could be training bias

Your site is well maintained and your content is good, yet AI doesn’t know you? Before looking for the problem in the text, consider that the models were trained on data that overrepresents tech, finance, and large generalist media. If your industry is underrepresented in that data, the model may overlook you no matter how well you write. It’s not a quality problem: it’s a problem of where you publish. And that can be fixed.

You’ve invested years in your site. Original content, impeccable service, satisfied customers. Then you ask an AI engine to recommend a supplier in your industry and your name doesn’t come up. Instead it lists competitors you know well, and know to be less qualified than you.

The instinctive reaction is to think you need to “optimize” something. But the problem could be upstream, in a place no site optimization can reach: the data the model was trained on.

Training doesn’t cover the world uniformly

Every language model learns from a dataset. And every dataset is a partial snapshot of the web. Common Crawl, The Pile, RedPajama: these are names you may never have heard, but they are the foundations on which everything AI “knows” about the world rests.

The point is that these collections are not balanced. Certain industries, certain languages, certain types of sites are captured massively. Others remain in the shadows. In the study by Gao et al. from 2025, one of the most comprehensive on the construction of LLMs, the concept is expressed bluntly:

“Addressing Imbalances: Balancing the distribution of classes or categories in the dataset to avoid biases and ensure fair representation.”
(A Survey of Large Language Models)

If balancing is a problem recognized by research, it follows that your industry could be on the wrong side of the scale. Not because of you. Because of how the datasets were built.

What ends up in the training and what stays out

To give you a concrete idea: The Pile, one of the most widely used open datasets in research, publishes the list of its sources. There you’ll find Wikipedia, Stack Exchange (the network that includes Stack Overflow), GitHub, PubMed, academic papers, digitized books, technical forums. Entire segments of the economy have no dedicated source in that list and surface, at best, incidentally through the general web crawl: craftsmanship, local services, manufacturing SMEs, healthcare professionals, consultants.

It’s not that these industries produce less quality content. It’s that the content they produce doesn’t end up in the channels the datasets collect. A plumber from Brescia with thirty years of experience and a well-made site has no publications on PubMed, doesn’t write on GitHub, has no Wikipedia entry. For the dataset, he simply doesn’t exist.

And note, this doesn’t only apply to craftsmanship. Law firms, veterinary clinics, agricultural businesses, architecture studios. Entire pieces of the real economy that generate value every day but that, in the training, weigh a fraction of what an open source repository on GitHub weighs.

And here it gets even more interesting. Gao et al. in the same survey document a principle that changes the perspective:

“Properly filtered and deduplicated web data alone can lead to powerful models.”
(A Survey of Large Language Models)

Translated: the quality of the data matters more than the quantity. But filtering and deduplication work on what’s in the dataset. If your industry isn’t there, there’s nothing to filter, nothing to value. It’s not a problem with the quality of your site. It’s a problem of presence in the raw material.

The consequences don’t stop at the training

Bias in the training propagates. When a model knows little about an industry, the answers about that industry are less accurate, less detailed, less confident. The research by Wang et al. from 2025 on model alignment says it directly:

“They may also produce toxic, offensive, or harmful content due to biases present in the training data.”
(A Survey on Progress in LLM Alignment)

Now, I’m not saying AI produces “toxic” content about your industry. But the mechanism is the same: when the training data is imbalanced, the answers inherit that imbalance. If the model has seen a thousand articles about digital marketing tools and three about your niche industry, then when someone asks “who is the best at X,” it responds with what it knows best. And what it knows best is what was in the training.

For an entrepreneur this is a silent vicious circle. AI doesn’t cite you, so you don’t generate traffic from AI, so you don’t generate new mentions, so in the next training update you’re even less present.

How to tell if your industry is underrepresented

This is a surface check, a first step to understand where you stand. It doesn’t replace a full analysis, but it gives you a direction.

Do these three things:

Search for your industry in public datasets. The Pile documents its sources. RedPajama has a known composition. If your industry doesn’t appear among the main sources, you have a partial but significant answer.

Test the AI answers about your industry. Run 10-15 specific queries across multiple AI engines, like “best [your industry] suppliers in [your area].” Don’t look at a single answer: look at the pattern across multiple reworded queries. If the answers are vague, generic, or always cite the same 2-3 names, the industry is probably underrepresented.

Compare with overrepresented industries. Run the same queries for industries that are notoriously present in the datasets (tech, finance, software). The difference in the quality and specificity of the answers shows you the gap.
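If you want to make the pattern measurable instead of judging it by eye, the check above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive tool: you paste in by hand the answers you collected from the AI engines, and the company names used here are hypothetical placeholders. It counts in how many answers each name appears and how much of the total the top names take for themselves.

```python
import re
from collections import Counter


def mention_counts(answers, names):
    """Count in how many answers each name appears (case-insensitive)."""
    counts = Counter()
    for answer in answers:
        for name in names:
            # Whole-phrase match, so "Alfa" does not match "Alfabeto".
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, answer, flags=re.IGNORECASE):
                counts[name] += 1
    return counts


def concentration(counts, top_n=3):
    """Share of all recorded mentions held by the top_n most cited names."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total


# Hypothetical answers, copied by hand from several AI engines.
answers = [
    "For plumbing in Brescia, consider Alfa Idraulica and Beta Service.",
    "Popular options include Alfa Idraulica.",
    "Alfa Idraulica, Beta Service and Gamma Impianti are often mentioned.",
    "It depends on your needs; ask for local references.",
]
# Hypothetical names: your competitors plus your own brand (the last one).
names = ["Alfa Idraulica", "Beta Service", "Gamma Impianti", "Rossi Idraulica"]

counts = mention_counts(answers, names)
your_share = counts["Rossi Idraulica"] / len(answers)
print(counts.most_common())
print(f"Answers citing you: {your_share:.0%}")
print(f"Mentions held by the top 2 names: {concentration(counts, top_n=2):.0%}")
```

Run the same script on the answers for a comparison industry. A high concentration on 2-3 names together with a share near zero for you is the pattern described above. Treat the numbers as a direction, not as a measurement of the training data itself.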

If the picture confirms an underrepresentation, it’s not a final verdict. It’s the starting map for a strategy.
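One more check you can run on your own domain: whether Common Crawl, one of the collections mentioned earlier, captured any of your pages in a given crawl. The sketch below assumes the public Common Crawl index at index.commoncrawl.org and its CDX-style query parameters (`url`, `matchType`, `output`, `limit`). The crawl identifier is something you supply yourself from the list the index publishes; the one in the usage comment is only an example, and `example.com` is a placeholder. The handling of the “nothing found” case is also an assumption about how the service responds.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_url(domain, crawl_id, limit=50):
    """Build the index query URL for every capture under a domain."""
    query = urllib.parse.urlencode({
        "url": domain,
        "matchType": "domain",  # include subdomains
        "output": "json",       # one JSON record per line
        "limit": limit,
    })
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_captures(body):
    """Parse the newline-delimited JSON response into a list of dicts."""
    records = []
    for line in body.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def fetch_captures(domain, crawl_id, limit=50, timeout=30):
    """Query the index over the network and return the parsed captures."""
    url = build_index_url(domain, crawl_id, limit)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_captures(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Assumption: the index answers 404 when no capture matches.
        if err.code == 404:
            return []
        raise


# Usage (needs network access; the crawl id is an example, pick a current one):
# captures = fetch_captures("example.com", "CC-MAIN-2024-10")
# print(len(captures), "captures found")
```

An empty result for one crawl doesn’t prove you’re absent from every dataset, and a non-empty one doesn’t prove your pages survived filtering. It only tells you whether the raw collection saw you at that moment.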

Compensating for the bias: publishing where the training looks

If your site alone isn’t enough because the dataset never collected it, the path is to bring your name and your expertise where the datasets look. It’s not a trick. It’s building presence in the channels that matter for future models, and that already matter today for systems that use retrieval-augmented generation (RAG) to look up information in real time.

Some concrete paths:

Media in your industry. An article in an industry publication with your name associated with your specialization carries enormous weight. National and industry media are among the sources that large datasets capture most consistently. If you publish there, you enter the circuit.

Wikipedia as a cited source. I’m not talking about creating a page about you: that requires notability. But you can be cited as a source in existing entries in your industry. A citation in a Wikipedia entry is one of the strongest signals in training datasets, because Wikipedia is present in nearly all of them.

Platforms with a high presence in datasets. Reddit, Stack Overflow, Quora, technical industry forums. If you have expertise, share it where the datasets collect. A detailed technical comment on Reddit with your name carries more weight, for the training, than a hundred posts on your blog that no dataset ever collected.

Authoritative industry directories. Not the generic directories that sell links. The directories specific to your industry, the ones the datasets collect because they have editorial authority.

Your Google Business Profile. It sounds trivial, but it’s structured data that crawlers collect systematically. A complete profile, with reviews, the correct category, and a description consistent with your positioning, is a signal that enters the circuit.

All of this connects to what I explored in the article on E-E-A-T credibility: authority signals don’t live only on your site, but in the network of external mentions that confirm who you are.

Training bias and the chain of visibility

This mechanism intertwines with others that determine your AI visibility. Consensus among sources amplifies the names that already appear at multiple points in the dataset. Cross-platform reputation weighs more when the platforms you’re present on are the ones the training captured. And temporal authority rewards those who have built presence over time, not those who arrive at the last minute.

None of these mechanisms works if the raw material is missing at the base: your presence in the data the model learns from. Training bias is the floor on which everything else rests. If that floor is missing, it doesn’t matter how solid the structure you build on top of it is.

The good news is that training isn’t static. Models are updated, datasets are refreshed, and RAG systems look up information in real time. Every mention you build today on an authoritative source is one more brick for the next version of the model, and an immediate signal for the engines that already use real-time search.

You can’t rewrite past training. But you can build the presence that future training will find.

The author

Roberto Serra

Roberto Serra at the Senate of the Republic (Palazzo Giustiniani), conference “The power of artificial intelligence”.

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google’s AI Overviews.