How AI engines think

Is your industry underrepresented in the training data? AI already starts at a disadvantage

Roberto Serra · 25 June 2026 · ~8 min read

AI knows some industries much better than others, because during its training it read enormously more about certain topics. If your industry is among the underrepresented ones, the model already starts with less information about you and your market — and this penalizes you regardless of the quality of your site or your content. It's not a problem without a solution: knowing where your industry is weak in AI lets you compensate with targeted actions, achieving results disproportionate to the effort.

Not all industries start out even in the race for AI visibility. The pre-training datasets — The Pile, RedPajama, Common Crawl — over-represent tech, finance, and Anglo-American media. If your industry is underrepresented, the model knows you less regardless of the quality of your site.

But if you know where the imbalance lies, you can compensate.

Content’s default weight: the technical principle

AI models are trained on enormous datasets made up of billions of web pages, books, academic papers, forum conversations, and source code. The distribution, however, is never uniform — and this has consequences that go well beyond the quality of the answers.

As recent research on the behavior of large language models points out, this structural imbalance is not a marginal detail: “This is especially relevant in NLP tasks, where diverse and representative training data is crucial” (Minaee et al., 2025).

Put another way: the representativeness of training data is not just an academic question. It’s the foundation on which every model builds its map of the real world. If a category of content is absent or scarce in that map, the model operates with a structural gap that can’t be solved by a well-formulated query.

The academic papers in The Pile come largely from ArXiv (computer science, physics, mathematics) and PubMed Central (biomedicine). The web pages in Common Crawl over-represent English-language sites, tech media, and mainstream platforms. Book corpora such as Books3 consist predominantly of English-language titles. The result is a model that “thinks” in English, that is most at home with tech topics, and that has little structural familiarity with vertical industries in languages other than English.

What happens when your domain is marginal in the training data

From this composition follows an effect that many brands have not yet considered: content has a default weight in training based on its domain. It’s not a matter of editorial quality. It’s a matter of frequency and statistical distribution during training.

If you work in furniture, in food service, in construction, in Italian tax consulting — your domain probably has a marginal presence in the training data. Not because your content is poor, but because the dataset is skewed toward other industries and other languages.
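To make the idea of a default weight concrete, here is a minimal sketch that computes each domain's share of a corpus. The corpus and its domain labels are invented for illustration; real datasets do not ship with labels this clean, and the numbers say nothing about any actual dataset.

```python
from collections import Counter

def domain_shares(doc_domains):
    """Return each domain's fraction of the documents in a corpus."""
    counts = Counter(doc_domains)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: count / total for domain, count in counts.items()}

# Hypothetical toy corpus: one domain label per document (made-up proportions).
toy_corpus = ["tech"] * 60 + ["finance"] * 25 + ["media"] * 12 + ["furniture"] * 3
shares = domain_shares(toy_corpus)

# Print domains from most to least represented.
for domain, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{domain:10s} {share:.0%}")
```

In this toy corpus the furniture domain accounts for 3 documents out of 100, whatever the editorial quality of those three documents.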

The practical consequences are three:

The model knows you less. If your industry has few documents in the training data, the model has less “experience” with the terms, brands, and dynamics of your market. The answers will be more generic, less accurate, with vague or incorrect references.

Competitors present in the training data win by default. If a competitor has content on sources that are in the training data — national media, Wikipedia, English-language platforms — they have a structural advantage that doesn’t depend on the quality of their service. It’s in the model before any query even begins.

Hallucinations increase in underrepresented industries. Less training data means a greater chance that the model invents information about your industry. And the problem is not trivial: as documented by Zhibin Wu et al. (2025), “they may also produce toxic, offensive, or harmful content due to biases present in the training data”. Structural bias is not limited to ignorance of the domain: it can turn into actively distorted representation.

An Italian pharmaceutical company that doesn’t maintain a presence on the sources in the training data could be described with frames borrowed from contexts that have nothing to do with it. It’s not just a matter of “the model doesn’t know me”: it can be a matter of “the model talks about my industry using distorted or partial references”.

The internal structure of training data: how the dataset is built

To understand where to act, you need to understand how the process works. Training data is not a random collection of text — it’s a dataset curated through specific phases of filtering, weighting, and sampling.

In some of these phases, a document is treated not in isolation but as a data point in relation to the others. Gao et al. (2024), describing a training optimization framework, write that “each training data point consists of one positive sample and five negative samples”. This principle, in which each data point is defined by contrast with its negatives, is relevant here because it suggests that the presence or absence of an industry is reflected not only in how much the model knows about that industry, but also in how the model classifies and compares competing sources.

In practice: if your industry has few positive samples in the training data, the comparison between sources that the model performs with every answer is calibrated on a narrow sample. The model tends to generalize toward the better-represented industries — and your specific source gets statistically diluted.
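As a rough illustration of a data point made of one positive and five negatives, the sketch below scores such a point with an InfoNCE-style softmax loss. The loss function, the temperature parameter, and the similarity scores are all assumptions made for the example; the source does not say which loss the cited framework uses.

```python
import math

def contrastive_loss(pos_score, neg_scores, temperature=1.0):
    """InfoNCE-style loss: negative log-softmax probability of the positive."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]

# Hypothetical similarity scores: one positive and five negatives per data point.
well_separated = contrastive_loss(5.0, [0.0] * 5)    # positive clearly stands out
poorly_separated = contrastive_loss(1.0, [0.8] * 5)  # positive barely stands out

print(f"well separated:   {well_separated:.3f}")
print(f"poorly separated: {poorly_separated:.3f}")
```

The loss is low when the positive sample stands out from its negatives and high when it does not, which is the sense in which a source is defined by contrast with the others.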

This mechanism is separate from, but connected to, the alignment processes that happen after pre-training. Fine-tuning and RLHF can amplify or attenuate certain pre-training biases, but they don’t eliminate them. A model fine-tuned on a vertical domain can compensate for the under-representation of the industry, but only if the fine-tuning dataset is built on quality sources. If the sources aren’t there, the problem remains.
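The filtering, weighting, and sampling phases described in this section can be sketched in a few lines. This is a toy model under stated assumptions: the length filter, the per-source weights, and the document records are hypothetical and do not reproduce the recipe of any real dataset.

```python
import random
from collections import Counter

def build_training_mix(docs, min_chars, source_weights, n_samples, seed=0):
    """Filter out short documents, then sample with per-source weights."""
    # Filtering: drop documents below a minimum length.
    kept = [d for d in docs if len(d["text"]) >= min_chars]
    if not kept:
        return []
    # Weighting: sources missing from source_weights default to 1.0.
    weights = [source_weights.get(d["source"], 1.0) for d in kept]
    # Sampling: draw with replacement, proportionally to the weights.
    rng = random.Random(seed)
    return rng.choices(kept, weights=weights, k=n_samples)

# Hypothetical documents and weights, for illustration only.
toy_docs = [
    {"source": "arxiv", "text": "A long enough paper abstract about transformers."},
    {"source": "trade_magazine", "text": "A long enough article about kitchen design."},
    {"source": "trade_magazine", "text": "Too short"},
]
mix = build_training_mix(toy_docs, min_chars=20,
                         source_weights={"arxiv": 3.0}, n_samples=1000)
print(Counter(d["source"] for d in mix))
```

Even with one surviving document per source, the up-weighted source dominates the sampled mix, which is the kind of imbalance a single well-written article cannot offset.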

Underrepresented industries: opportunity or obstacle?

The correct answer is: it depends on when you move.

There’s a concrete opportunity in industries with low representation: less competition in the training data means that whoever moves first becomes the default reference. If you’re the first brand in your industry to build a strong presence on the sources that enter future training cycles, you occupy a space that others haven’t yet claimed.

You can’t change past training data, but you can influence future training data. Datasets are refreshed from one training cycle to the next, and the sources that get included in the new ones are those with greater citability, structural authority, and presence on the platforms already in the corpus.

How to check your situation

A practical test I often run with clients:

1. Ask a model a generic question about your industry in Italian.
2. Evaluate the answer: is it generic? Does it contain industry errors? Does it cite brands?
3. Ask the same question in English.
4. Compare the two: if the English version is significantly more detailed and accurate, your industry is underrepresented in Italian in the training data.
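The test above can be made repeatable with a small script. This is a sketch under stated assumptions: `ask_model` is a hypothetical callable you would write yourself around whichever chat API you use, the brand names are placeholders, and word count and brand mentions are crude proxies for detail that do not replace reading the answers.

```python
def answer_profile(answer, known_brands):
    """Crude proxies for detail: word count and which known brands are mentioned."""
    lowered = answer.lower()
    mentioned = [b for b in known_brands if b.lower() in lowered]
    return {"words": len(answer.split()), "brands": mentioned}

def compare_languages(ask_model, question_it, question_en, known_brands):
    """Ask the same question in Italian and English and profile both answers."""
    profile_it = answer_profile(ask_model(question_it), known_brands)
    profile_en = answer_profile(ask_model(question_en), known_brands)
    ratio = profile_en["words"] / max(profile_it["words"], 1)
    return {"it": profile_it, "en": profile_en, "en_to_it_word_ratio": ratio}

# Stand-in for a real model call, so the sketch runs without any API.
QUESTION_IT = "Quali sono i principali produttori di cucine in Italia?"
QUESTION_EN = "Who are the main kitchen manufacturers in Italy?"

def fake_model(question):
    canned = {
        QUESTION_IT: "Ci sono diversi produttori.",
        QUESTION_EN: "Several manufacturers operate in Italy, "
                     "including ExampleBrand and SampleCo.",
    }
    return canned[question]

report = compare_languages(fake_model, QUESTION_IT, QUESTION_EN,
                           known_brands=["ExampleBrand", "SampleCo"])
print(report)
```

Running the same pair of questions several times and against more than one model gives a more reliable picture than a single comparison.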

A second indicator: look for your industry in the public documentation of The Pile. If the main sources of your market — trade magazines, industry associations, regulatory bodies — are not in the list of sources, you have indirect confirmation of under-representation.
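Once you have compiled the list of sources by hand from a dataset's public documentation, the check reduces to a set comparison. The sketch below assumes both lists are plain strings; every entry shown is a placeholder, not the actual content of The Pile or of any other dataset.

```python
def source_coverage(dataset_sources, industry_sources):
    """Split an industry's key sources into those listed in a dataset and those missing."""
    listed = {s.strip().lower() for s in dataset_sources}
    present = [s for s in industry_sources if s.strip().lower() in listed]
    missing = [s for s in industry_sources if s.strip().lower() not in listed]
    coverage = len(present) / len(industry_sources) if industry_sources else 0.0
    return present, missing, coverage

# Placeholder lists: replace with what you find in the dataset documentation
# and with the trade magazines, associations, and regulators of your market.
dataset_sources = ["arxiv.org", "stackexchange.com", "wikipedia.org"]
industry_sources = [
    "example-trade-magazine.it",
    "example-association.it",
    "wikipedia.org",
]

present, missing, coverage = source_coverage(dataset_sources, industry_sources)
print(f"coverage: {coverage:.0%}")
print("missing:", missing)
```

A low coverage figure is only indirect evidence, as noted above, since documentation rarely lists every site a crawl has touched.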

RAG systems (like Perplexity or Bing Chat) operate on real-time retrieval: they depend less on historical training data and more on the quality of your site’s current indexing. You can appear in RAG answers even if you’re not in the base training data. But for models that answer without real-time search, the training data is all that exists.

What to do concretely to reduce the disadvantage

Publish on the sources that enter the training data. You can’t control what gets included, but you can increase the probability. National media, Wikipedia, and international platforms like Reddit, Stack Exchange, and Quora are sources that have already been shown to end up in training datasets. An editorial presence on these platforms, in the form of citations, mentions, and accredited articles, increases the probability that your industry and your brand are represented in the next cycle.

Create content in English on your industry’s core topics. If your domain is underrepresented in Italian but covered in English, an English version of your key content is a pragmatic way to enter a much larger corpus. It’s not about abandoning Italian — it’s about maintaining a presence on both surfaces.

Become the cited source of your industry. Training datasets favor highly citable documents — papers, official guides, content that other sites link to as a reference. Building content that becomes a reference point for the industry is not just an SEO strategy: it’s a strategy for increasing the statistical weight of your domain in the next training cycle.

Watch out for safety filters. Under-representation also combines with safety alignment, such as Constitutional AI, which shapes the answers a model gives. An underrepresented industry can be filtered not only out of ignorance, but because the model doesn’t have enough context to distinguish between legitimate and problematic content in the domain. Deduplication adds a further layer: if the few pieces of content from your industry present in the training data are near-duplicates of each other, they get reduced to a single one, leaving even less.
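To see how deduplication can shrink an already small footprint, here is a minimal sketch of near-duplicate removal using exact Jaccard similarity over word trigrams. Production pipelines typically rely on approximate methods such as MinHash; the exact comparison, the 0.7 threshold, and the sample texts are assumptions chosen to keep the example short.

```python
def shingles(text, n=3):
    """Set of word n-grams used to compare two documents."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets of shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def deduplicate(docs, threshold=0.7):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept

# Hypothetical industry pages: the first two differ by a single word.
pages = [
    "our oak dining tables are handmade in northern italy by local craftsmen",
    "our oak dining tables are handmade in northern italy by local artisans",
    "italian tax rules for freelancers changed again this year",
]
unique_pages = deduplicate(pages)
print(len(pages), "pages in,", len(unique_pages), "pages kept")
```

Product pages and syndicated press releases that repeat the same text with minor variations are the typical candidates for this kind of collapse.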

The central point many overlook

AI visibility doesn’t start the moment a user runs a query. It starts from the composition of the dataset the model was trained on months or years earlier. Understanding that composition — and understanding where your industry is underrepresented — is the first step in building a strategy that isn’t just reactive.

It’s not a matter of optimizing a single article. It’s understanding the default weight your content carries into the model and building the conditions to increase it.

Anyone who works on AI visibility without considering the composition of the training data is optimizing on the surface without having solved the foundation.
