How AI engines think

Is your brand invisible to ChatGPT? The problem starts with how it reads it

Roberto Serra · 25 June 2026 · ~8 min read

Your brand might be invisible to ChatGPT for a reason that has nothing to do with content or reputation: the way AI reads text can break your company name into meaningless fragments, making it unrecognizable. This is not a remote hypothesis — it is a mechanical problem that affects many brands with compound names, acronyms or unusual characters. Checking it takes thirty seconds, and fixing it takes just as long — enough to reopen a visibility channel you are handing to your competitors without even knowing it.

You are investing in content, SEO, digital PR — yet when someone asks ChatGPT to recommend a brand in your industry, your name does not come up. The problem might not be your authority or your content. It could be something much more basic: the way AI physically reads your brand name.

Language models do not read words the way we do. They break them into fragments called tokens, and if your brand gets fragmented into meaningless pieces, for the AI it is as if it does not exist as an entity.

What tokenization is and why it decides whether AI sees you

Before generating any response, an LLM like GPT-4 or Claude converts text into a sequence of tokens — numerical units the model can process. Some words become a single token (“Apple” → 1 token), others get broken into sub-pieces (“Pinalli Profumerie” → 5 separate tokens).

If you want to dig deeper, I recommend a very interesting paper by Minaee et al. from 2024, which on this subject states verbatim:

“out-of-vocabulary (OOV) is a problem in this case because the tokenizer only knows words in its dictionary”

In simple terms: if a word is not in the tokenizer’s vocabulary, it gets broken into sub-pieces by a statistical algorithm called BPE (Byte-Pair Encoding). Read this carefully: the resulting pieces are valid tokens, but on their own they do not carry the meaning of the whole word. The model has to reconstruct that meaning from context, and for rare names it often fails.

A separate study shows, with even sharper data, how much depends on this step:

“in the absence of tokenization, [transformers] empirically fail to learn the right distribution and predict characters according to a unigram model. With the addition of tokenization, however, transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally”
(Toward a Theory of Tokenization in LLMs)

In short, tokenization is not a technical detail — it is the step that lets the model move from statistical chaos to the recognition of patterns you MUST be part of if you want to be relevant.

And this matters to you a great deal, because if your brand does not clear this step well, it starts the whole chain at a disadvantage.

How it works in practice: BPE and the model’s vocabulary

Modern tokenizers use an algorithm called Byte-Pair Encoding (BPE). The principle is simple: during training, the algorithm finds the most frequent pair of adjacent bytes (or already-merged tokens) in the corpus and merges it into a single new token. It repeats the process, adding one token per merge, until the vocabulary reaches its target size of about 200,000 tokens (GPT-4o uses 200,019).

A word like “marketing” appears millions of times in the training corpus, so it becomes a single token. But “TecnoImpianti” never appeared often enough — BPE breaks it into fragments: Tec|no|Im|pi|anti. Five tokens for a name that should be one unit.
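To make the merge loop concrete, here is a minimal toy sketch of BPE in Python. It is an illustration under stated assumptions, not how any production tokenizer is implemented: it works on characters rather than bytes, the two-word corpus is invented, and the function names (merge_pair, train_bpe, encode) are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    # Each word starts as single characters (real tokenizers start from bytes).
    words = Counter(tuple(word) for word in corpus)
    merges: List[Pair] = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words: Counter = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Tokenize a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Invented corpus: one very frequent word, one name seen only once.
corpus = ["marketing"] * 100 + ["tecnoimpianti"]
merges = train_bpe(corpus, num_merges=8)
print(encode("marketing", merges))      # ['marketing']
print(encode("tecnoimpianti", merges))  # many short pieces
```

With only eight merges available, all of them go to the frequent word, which ends up as a single piece. The rare name gets almost nothing and stays split into small fragments.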

The crucial point for your business: tokenization happens before the model starts reasoning. When the model receives the sequence [Tec][no][Im][pi][anti], it does not yet know that those 5 fragments form a brand. It has to reconstruct the meaning from context, and it often fails because those 5 tokens appear together in that order too rarely in the training corpus.

Before and after: real numbers tested with tiktoken

To understand the impact, look at this data tested with tiktoken (GPT-4o, o200k_base tokenizer):

1-token brands — “Nike”, “Google”, “Apple”, “Amazon”. The model recognizes them as single units. When it has to generate a recommendation, it produces them as a single block with high probability. They are in the native vocabulary.

2-token brands — “Ferrari” (Ferr|ari), “Barilla” (Bar|illa), “HubSpot” (Hub|Spot). Still manageable: the model has seen these pairs often enough to assemble them easily. The computational cost is minimal.

Fragmented brands — “TecnoImpianti Soluzioni Industriali”: 9 tokens (Tec|no|Im|pi|anti|Sol|uzioni|Industrial|i). The model has to assemble 9 fragments in sequence to produce the full name. At each step, the conditional probability of the next token is low — the model might drift toward a more probable sequence (that is, a competitor with a 2-token name).

An industrial valve manufacturer from Brescia might have the best product on the market, but if its brand gets fragmented into 9 tokens, the AI may favor a competitor with a more compact name. It is not fair; it is the model’s mechanics.
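The effect of name length can be put in rough numbers. A language model generates a name one token at a time, so the probability of producing the whole name is the product of the per-token probabilities. The sketch below assumes a flat 0.9 at every step. That figure is invented purely for illustration: real values depend entirely on the model and the prompt, and they are not the same at each step.

```python
import math
from typing import Sequence


def sequence_probability(step_probs: Sequence[float]) -> float:
    """Chain rule: multiply the conditional probability of each token."""
    return math.prod(step_probs)


# Hypothetical per-token probability, NOT a measured value.
P_STEP = 0.9

two_token_name = sequence_probability([P_STEP] * 2)
nine_token_name = sequence_probability([P_STEP] * 9)

print(f"2 tokens: {two_token_name:.3f}")   # 2 tokens: 0.810
print(f"9 tokens: {nine_token_name:.3f}")  # 9 tokens: 0.387
```

Under this assumption, the 9-token name comes out complete less than half as often as the 2-token name, even though each individual step looks safe.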

Pro tip

Test your brand’s tokenization on tiktokenizer.vercel.app — select the GPT-4o model (o200k_base), enter the exact name of your brand and count how many tokens it produces.

Why the OOV problem hits compound and non-English brands harder

The GPT-4 vocabulary was built predominantly on English text. Compound words, names with Latin prefixes, and legal company names containing “di”, “del” or “e” are especially penalized by BPE because they are not frequent enough in that corpus.

Some recurring patterns I see in companies:

Compound names without spaces — “AutomeccanicaRossi”, “ElettroserviceNord”. BPE does not recognize the compound and fragments it unpredictably. The same name with a space (“Automeccanica Rossi”) might tokenize better because “Rossi” is a known token.

Long legal company names — “Studio Associato Dott. Rossi & Partners Consulenza Tributaria” is the name on the business registry. For the AI it is a wall of fragmented tokens. The communicative name (“Studio Rossi”) is probably 2 tokens. Many websites use the full legal name in the header, the footer, the About page — and the AI sees it every time as a fragmented sequence.

Acronyms and initialisms — Sometimes the acronym tokenizes better than the full name. “ENEL” is 2 tokens, “Ente Nazionale per l’Energia Elettrica” is 12. If your industry uses the acronym, push that one.

The mistakes I see most often

Ignoring the variants. Your brand probably has a long name and an abbreviation. “Politecnico di Milano” and “PoliMi” have different tokenizations. If everyone in your industry uses the abbreviation and you push only the long name, the AI associates the abbreviation with others and does not generate the long name because it is too fragmented.

Never testing. Most companies do not know how their own brand is tokenized. It is a 30-second check that no one does — and that can explain months of invisibility in AI responses.

Using inconsistent variants. If your website says “Digital Marketing Studio Rossi”, your social profiles say “StudioRossi”, the directories say “Studio Rossi S.r.l.” and the press releases say “STUDIO ROSSI” — you are fragmenting the signal. The AI learns from repetitions in the corpus: every different variant dilutes the frequency of the correct token sequence.
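A quick way to see the dilution is to count how mentions split across variants. The sketch below uses invented mention counts for the four spellings above; the numbers and the helper name variant_shares are hypothetical and only show the arithmetic.

```python
from collections import Counter
from typing import Dict, List


def variant_shares(mentions: List[str]) -> Dict[str, float]:
    """Return each spelling's share of all mentions, largest first."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.most_common()}


# Invented figures: 100 mentions spread over four spellings.
mentions = (
    ["Digital Marketing Studio Rossi"] * 40  # website
    + ["StudioRossi"] * 25                   # social profiles
    + ["Studio Rossi S.r.l."] * 20           # directories
    + ["STUDIO ROSSI"] * 15                  # press releases
)

for variant, share in variant_shares(mentions).items():
    print(f"{share:.0%}  {variant}")
```

In this made-up case no single spelling gets more than 40% of the mentions. The same 100 mentions written one way would put all of the repetition behind one token sequence.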

What to do concretely

Test your brand’s tokenization on tiktokenizer.vercel.app — select the GPT-4o model (o200k_base), enter the exact name of your brand and count how many tokens it produces. One token = excellent. Two = acceptable. Four or more = you have a problem.
Test all the variants: full name, abbreviation, acronym, name without spaces. Often a shorter variant tokenizes better. If “Digital Marketing Studio Rossi” is 5 tokens but “Studio Rossi” is 2, you know which variant to push in your content.
Compare with competitors: enter the names of the 3-5 brands the AI recommends in your industry and count their tokens. If they are at 1-2 tokens and you are at 4-5, you have found one of the structural reasons the AI prefers them.
Check consistency: use the same brand variant everywhere — website, social, directories, press releases. The AI learns from repetitions. If you use 4 different variants, you are spreading the frequency across 4 different token sequences instead of concentrating it on one.
Consider future naming: if you are launching a new product or sub-brand, test the tokenization before choosing the name. A name the model tokenizes as a single unit has a permanent structural advantage.
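The same checks can be scripted instead of typed into the web page one name at a time. The sketch below assumes the third-party tiktoken package is installed and can load the o200k_base vocabulary; if it cannot, the script says so and stops. The helper names and the competitor list are placeholders, and the rating labels simply mirror the thresholds above (the text gives no verdict for three tokens, so that case is left unrated).

```python
from typing import Callable, Dict, Iterable, List, Optional

Encoder = Callable[[str], List[int]]


def load_o200k_encoder() -> Optional[Encoder]:
    """Return tiktoken's o200k_base encode function, or None if unavailable."""
    try:
        import tiktoken  # third-party: pip install tiktoken
        return tiktoken.get_encoding("o200k_base").encode
    except Exception:  # not installed, or the vocabulary could not be loaded
        return None


def token_counts(names: Iterable[str], encode: Encoder) -> Dict[str, int]:
    """Map each name to the number of tokens the encoder produces for it."""
    return {name: len(encode(name)) for name in names}


def rating(count: int) -> str:
    """Apply the article's thresholds to a token count."""
    if count == 1:
        return "excellent"
    if count == 2:
        return "acceptable"
    if count >= 4:
        return "problem"
    return "unrated"  # three tokens: no verdict given in the text


# Variants of your own brand, plus placeholder competitor names.
own_variants = [
    "Digital Marketing Studio Rossi",
    "Studio Rossi",
    "StudioRossi",
    "STUDIO ROSSI",
]
competitors = ["Nike", "HubSpot"]  # replace with brands from your industry

encode = load_o200k_encoder()
if encode is None:
    print("tiktoken is not available; install it to run this check.")
else:
    counts = token_counts(own_variants + competitors, encode)
    for name, n in sorted(counts.items(), key=lambda item: item[1]):
        print(f"{n:>2}  {rating(n):<10}  {name}")
```

One caveat: in running text a name is usually preceded by a space, and the tokenizer treats " Studio Rossi" and "Studio Rossi" as different inputs. It can be worth testing both forms.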

Tokenization in the AI visibility chain

This mechanism is the first link in a chain. After tokenization, the model assigns a position to each token (positional encoding), then decides how much weight to give each token in context (attention mechanism), and finally generates the response within a limited context window.

If your brand does not clear the first step well — tokenization — all the other mechanisms start with a weak signal. It is like showing up to an interview with your name misspelled on the badge: you can be the best candidate, but you start at a disadvantage.

Open tiktokenizer.vercel.app today, enter your brand name and check whether it is recognized as a single token or fragmented into pieces. It takes 30 seconds and it can explain months of invisibility.