AI Platforms

ChatGPT’s recipe: where your brand ended up in its training data

ChatGPT can know your brand through two channels that work in completely different ways: what it learned during training, and what your enterprise clients feed it directly inside their Team or Enterprise accounts. If you act on only one, you're addressing half the problem. You can act on both channels, but only if you know how they work.

The question isn’t “how do I get AI visibility?”. It’s “are my enterprise clients uploading my brand into their ChatGPT Team/Enterprise?” And even before that: “is my brand already inside the data ChatGPT was trained on?”

These are two different questions, with different answers. The one I tackle here is the second, because if your name never passed through GPT’s training corpus, everything else becomes an uphill climb: every time a user asks ChatGPT something about your industry, the model starts from a knowledge base in which you don’t exist.

The good news is that GPT’s training data has a documented recipe, at least for the generation OpenAI described in detail. It’s no mystery: Common Crawl for the bulk of the public web, then books, Wikipedia, Reddit, GitHub. Knowing the relative weights tells you where to invest to have a chance of ending up inside the next version of the model.

What’s inside the corpus that taught ChatGPT

When I say “training data” I mean the pile of text OpenAI fed the model during pre-training. It’s not the same thing as the live web ChatGPT consults when you turn on search: that’s a later layer (I cover it in the articles in the series on platforms).

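In language-model research, the best public documentation of a GPT training corpus is still the 2020 GPT-3 paper, and the recipe it describes is the usual reference point: roughly 60% of the training mix comes from Common Crawl (the filtered snapshot of the open web), about 22% from WebText2 (web pages linked from Reddit posts that passed a karma filter, which is how Reddit enters as a quality filter), about 16% from two corpora of digitized books, and around 3% from Wikipedia. Code sources such as GitHub and other minor sources are generally understood to come in with later models rather than with that original table.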

The exact figures change between GPT-3, GPT-4 and later versions, and OpenAI no longer publishes tables with the granularity of the original 2020 paper. But the qualitative proportion holds: the public web weighs more than everything else put together, Wikipedia carries a weight disproportionate to its size, and Reddit enters the corpus both as a human quality filter (through the pages its users link and upvote) and as a proxy for “conversational” natural language.
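
To make “disproportionate to its size” concrete, here is a minimal sketch that compares each source’s share of the training mix with its share of raw tokens. The numbers are the approximate ones reported in the dataset table of the 2020 GPT-3 paper, so treat them as indicative for that model only; the names `SOURCES` and `oversampling` are mine, purely for illustration.

```python
# Approximate figures from the dataset table of the 2020 GPT-3 paper:
# (tokens in billions, share of the training mix). Indicative only.
# WebText2 is the set of web pages linked from upvoted Reddit posts.
SOURCES = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}


def oversampling(sources):
    """Ratio between a source's share of the mix and its share of raw tokens.

    A value above 1 means the source is sampled more often than its size
    alone would justify; below 1 means it is sampled less often.
    """
    total_tokens = sum(tokens for tokens, _ in sources.values())
    return {
        name: weight / (tokens / total_tokens)
        for name, (tokens, weight) in sources.items()
    }


if __name__ == "__main__":
    ranked = sorted(oversampling(SOURCES).items(), key=lambda kv: -kv[1])
    for name, ratio in ranked:
        print(f"{name:<25} {ratio:4.1f}x its share of raw tokens")
```

With these figures Wikipedia is sampled at roughly five times its share of raw tokens, while Common Crawl sits below its share. That is the arithmetic behind saying a Wikipedia paragraph counts for more than its length suggests.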

From this follows an operational consequence many business owners haven’t brought into focus: not all sources weigh the same. In terms of the probability of being repeated by ChatGPT, a paragraph on Wikipedia is worth far more than ten posts on your company blog. A mention on Reuters or Il Sole 24 Ore (which enter Common Crawl with high authority) is worth more than twenty mentions in industry directories.

Why this sits upstream of everything else

In my earlier articles I told you about tokenization and about E-E-A-T for AI. These are two mechanics that determine how the model processes and evaluates content. But before the model “evaluates” anything, that something must have been seen during training.

The composition of the corpus is filter zero. If your brand doesn’t appear in any of the heavy sources (Common Crawl with high link equity, Wikipedia, books, Reddit), ChatGPT learned nothing about you during pre-training. It can only find you if it activates live search, or if an enterprise user uploads your documents into their Team environment. That’s a valid scenario, but it’s a fallback: the game is played earlier.

The reasoning is the same one I make in the articles on author entity recognition and on backlinks as a citation proxy: the signal that you exist in the model’s knowledge graph comes from where you are cited, not from how much you write on your own site.

The test you can run in 10 minutes

The most honest way to figure out whether your brand is in ChatGPT’s corpus is this:

  1. Open ChatGPT with web search turned off (the model alone, no browsing). Ask: “What do you know about [brand name]?” If it answers with specific, correct details, you’re very likely inside the training data. If it answers “I don’t have information” or makes things up, you’re not (or you’re below the recall threshold).
  2. Check your Wikipedia page. If there isn’t one, that’s the first gap. If there is one but it’s a three-line stub, that’s the second gap. Use Wikidata to see whether the brand exists as a structured entity.
  3. Search for your brand on Google News and filter for national or international outlets (Reuters, ANSA, Il Sole 24 Ore, Corriere). These enter Common Crawl with high weight. If you don’t show up, you’re not sowing where the model harvests.
  4. Check your presence on Reddit. You don’t need to post yourself: what you need is for someone to have cited you in relevant threads in your industry (r/italy, r/travel, vertical subreddits).
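
The Wikidata part of step 2 can be sketched in a few lines. This is a minimal example, assuming the public `wbsearchentities` action of the Wikidata API; the function names and the `User-Agent` string are illustrative, and exact label matching is a simplification (aliases and other languages need a manual look).

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(brand, language="en"):
    """URL for Wikidata's wbsearchentities action (entity search by label)."""
    params = {
        "action": "wbsearchentities",
        "search": brand,
        "language": language,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urllib.parse.urlencode(params)}"


def matching_entities(payload, brand):
    """Return (id, description) for results whose label equals the brand name."""
    wanted = brand.casefold()
    return [
        (item["id"], item.get("description", ""))
        for item in payload.get("search", [])
        if item.get("label", "").casefold() == wanted
    ]


def lookup(brand, language="en"):
    """Network call: fetch the search results and keep exact label matches."""
    request = urllib.request.Request(
        build_search_url(brand, language),
        headers={"User-Agent": "brand-audit-sketch/0.1"},  # illustrative value
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return matching_entities(json.load(response), brand)
```

Calling `lookup("Your Brand")` and getting an empty list suggests there is no structured entity under that exact label. It does not prove absence, so check spelling variants before calling it a gap.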

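Binary threshold: if ChatGPT without browsing doesn’t recognize you AND you don’t have a Wikipedia page AND you don’t appear in national outlets, treat your brand as outside the corpus for all practical purposes. Your site may still have been crawled, but not with enough weight for the model to recall you.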
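
The threshold is simple enough to write down as code, which helps when you repeat the audit for several brands or over time. A minimal sketch: the class and function names are hypothetical, and the Reddit check is recorded but deliberately left out of the threshold, as in the rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusAudit:
    recognized_without_browsing: bool  # step 1
    has_wikipedia_page: bool           # step 2
    in_national_outlets: bool          # step 3
    cited_on_reddit: bool = False      # step 4, informative only


def likely_outside_corpus(audit: CorpusAudit) -> bool:
    """True when all three signals in the binary threshold are missing."""
    return not (
        audit.recognized_without_browsing
        or audit.has_wikipedia_page
        or audit.in_national_outlets
    )


def open_gaps(audit: CorpusAudit) -> list[str]:
    """List the checks that failed, in the order of the audit steps."""
    checks = {
        "no-browsing recall": audit.recognized_without_browsing,
        "Wikipedia page": audit.has_wikipedia_page,
        "national outlets": audit.in_national_outlets,
        "Reddit citations": audit.cited_on_reddit,
    }
    return [label for label, passed in checks.items() if not passed]
```

A brand that fails every check gets `likely_outside_corpus(...) == True` and four open gaps; a single passing signal among the first three is enough to flip the threshold, even if the other gaps remain.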

The case study: a luxury ski hotel in the Aosta Valley

A five-star hotel in Courmayeur (I’ll call it “Hotel Alpha” for confidentiality) contacted me last year with a specific problem: the international B2B tour operators that sell luxury ski packages were asking ChatGPT Enterprise “best luxury hotels in Courmayeur for UHNW clientele” and their name never came up. Two long-standing competitors from Cervinia and one from Zermatt did.

First-hand diagnosis: no Wikipedia page, zero citations on Reuters/Bloomberg/FT (which cover the luxury ski segment), no presence on Reddit in the threads of r/skiing or r/luxurytravel. They were invisible to the training data. They had invested in a gorgeous multilingual website and in Google Ads: none of that enters GPT’s corpus.

The intervention (lasting 4 months): building a public knowledge base in the form of technical fact sheets (indexable PDFs + web pages) with precise specs on rooms, ski-in/ski-out services, partnerships with mountain guides, verifiable operational data. These fact sheets were made available as downloadable, indexed “operator dossiers”, so tour operators could upload them directly into their ChatGPT Enterprise as reference documents. In parallel: pitching to 2 international trade publications (one British, one Swiss) that Common Crawl harvests, and opening the Wikipedia entry with independent sources.

Result after 3 months: ChatGPT without browsing still didn’t know them (logical, the next training cut-off hadn’t arrived). But ChatGPT with browsing active cited their technical fact sheets in 4 out of 10 test queries. And most importantly: the B2B tour operators using ChatGPT Team with the dossiers loaded got detailed, correct answers about the property. Stated limits: a sample of 10 queries, an indicative test and not a controlled study, a single property. The pattern, however, is consistent with dozens of similar clients in high-end tourism.
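
To show how indicative a 10-query sample really is, here is a small worked calculation. It uses the Wilson score interval, a standard choice for proportions on small samples; that choice is mine and was not part of the original test.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(4, 10)
print(f"4 citations out of 10 queries: plausible rate between {low:.0%} and {high:.0%}")
```

The interval runs from roughly 17% to 69%. So 4 out of 10 says the fact sheets do get cited, but not how often: a sample this small cannot separate a modest effect from a strong one.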

The mistakes I see most often

In luxury tourism, but also in other premium B2B sectors, the mistakes around training data composition keep repeating:

  • Betting everything on your own website. Your site does enter Common Crawl, sure, but with low weight if it has no links from authoritative domains. A hotel in Cervinia with a gorgeous website and zero mentions in national outlets is just as invisible as one with no website at all.
  • Ignoring Wikipedia because “clients don’t read it anyway”. Clients don’t, but the model does. And it reads it with a weight multiplied relative to its size.
  • Relying only on category portals (Booking, TripAdvisor, destination portals). They get crawled, but the content is aggregated and doesn’t identify you as a distinct entity.
  • Not providing structured material to B2B partners. In the case of luxury ski hotels, tour operators are the channel that uploads your documents into their ChatGPT Enterprise. If you don’t give them clean technical fact sheets, they’ll upload the competitors’.
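
On the first point, you can check whether your own site was captured by a given Common Crawl snapshot. A minimal sketch, assuming the public CDX index server at index.commoncrawl.org; the crawl ID is a parameter you pick from the list the server publishes at `/collinfo.json`, and the function names are illustrative. Being captured only tells you the pages were crawled, not that they carried any weight in training.

```python
import json
import urllib.parse

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_url(domain, crawl_id):
    """Query URL for every capture under a domain in one crawl snapshot."""
    query = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def captured_urls(ndjson_text):
    """Parse an index response (one JSON object per line).

    Keeps only captures that returned HTTP 200 and removes duplicates.
    """
    urls = set()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if str(record.get("status")) == "200":
            urls.add(record["url"])
    return sorted(urls)
```

Fetch the URL from `build_index_url("yourdomain.com", "CC-MAIN-2024-10")` (the crawl ID is just an example) with any HTTP client and pass the response text to `captured_urls`. A short list, or one that contains only the homepage, is a hint that the site is crawled shallowly.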

What to do concretely

  • Open a Wikipedia page with independent third-party sources (not self-referential). If you don’t have enough media coverage to justify it, build that first: media relations with 2-3 high-authority trade publications.
  • Produce structured public documents (PDFs, web pages with precise data) that your B2B partners can upload into their ChatGPT Enterprise environments. Treat them as a knowledge base, not as a brochure.
  • Get cited where the corpus harvests: national and international outlets, academic publications if your industry allows it, relevant Reddit threads (through PR, not spam).
  • Check your presence with the no-browsing test every 6 months. It’s not a magic factor and it’s not enough on its own, but it’s the most honest check of “do I exist in ChatGPT’s world”.

Real analysis requires professional tools and coordinated media relations work, but this entry-level audit tells you in half an hour whether you’re inside or outside the corpus.
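
The “knowledge base, not a brochure” idea for structured documents can be sketched as follows. This is a hedged example that publishes a fact sheet as JSON-LD using the schema.org `Hotel` vocabulary; schema.org is my assumption for illustration and is not something the case study depended on, and every value below is an invented placeholder.

```python
import json


def hotel_fact_sheet(name, locality, country, stars, rooms, features):
    """Build a schema.org Hotel description as a plain dict (JSON-LD)."""
    return {
        "@context": "https://schema.org",
        "@type": "Hotel",
        "name": name,
        "address": {
            "@type": "PostalAddress",
            "addressLocality": locality,
            "addressCountry": country,
        },
        "starRating": {"@type": "Rating", "ratingValue": str(stars)},
        "numberOfRooms": rooms,
        "amenityFeature": [
            {"@type": "LocationFeatureSpecification", "name": feature, "value": True}
            for feature in features
        ],
    }


# Placeholder values only: replace them with verified operational data.
sheet = hotel_fact_sheet(
    name="Hotel Alpha",
    locality="Courmayeur",
    country="IT",
    stars=5,
    rooms=40,
    features=["Ski-in/ski-out access", "Mountain guide partnership"],
)
print(json.dumps(sheet, indent=2, ensure_ascii=False))
```

The same dictionary can feed both the web page (as an embedded JSON-LD script) and the downloadable dossier, so the numbers your partners upload and the ones a crawler reads never drift apart.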

The thread: training data composition is the foundation

To show up in AI answers, optimizing your site isn’t enough. The model answers well about you only if it has already seen you during training, or if someone (an enterprise user, live search) puts you in front of it at query time. The composition of the GPT corpus tells you what the levers are: Wikipedia, national outlets inside Common Crawl, Reddit, the books corpus. No shortcut.

In the next articles in this series I go into detail on how Anthropic’s Claude handles its training, how Gemini differs, and how Perplexity sidesteps the problem by drawing on the live web. If you’re interested in understanding how ChatGPT recognizes entities, I point you to the article on named entity recognition and to the one on entering the Google Knowledge Graph, which is a close cousin of the mechanism I’ve described here.

The author
Roberto Serra
At the Senate of the Republic, Palazzo Giustiniani, for the conference “The power of artificial intelligence”

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
