Have you changed your address, updated your services, or modified your prices in the last few months? The AI is probably still telling your potential customers the old information — or, in some cases, information that was never even true, made up to fill the gaps. Every day someone asks the AI about your company and gets the wrong data, without you knowing it and without you being able to correct it on the spot. Checking what the AI says about you today is the first step — and you can fix the problem by taking control of the right sources.
You moved your offices six months ago. You launched a new service. You updated your prices. And yet when you ask ChatGPT about your company, it replies with data from two years ago — or worse, it makes up information that was never true.
If this has happened to you, know that it isn’t a bug. It’s a direct consequence of how the models are trained, and it’s called the knowledge cutoff. But the interesting part — the one almost nobody tells you about — is that the cutoff doesn’t even work the way you think it does.
The cutoff isn’t a date: it’s a gray area
Every LLM is trained on a corpus of data collected up to a certain date — the cutoff. Every version of GPT, Claude, and Gemini has its own. After that date, the model has no direct information. So far, so simple.
But a 2024 paper by Jeffrey Cheng and colleagues, with Benjamin Van Durme among the authors, investigated what really happens beneath the surface, and the result is more nuanced than it seems:
“This simple metric oversimplifies LLM training in a detrimental manner to usability; it leaves unanswered the questions of ‘is this knowledge cutoff specific for all resources in the model’, ‘how many copies of my resource are in the model’ or ‘which versions of my corpus are included?’ We propose a method to automatically determine the effective cutoff date of LLMs for a given resource and show that although sometimes it does align with the reported cutoff, in many cases it does not.”
(Dated Data: Tracing Knowledge Cutoffs in Large Language Models)
Translated: the declared cutoff is a simplification. In reality, the model might have more recent information for some topics and older information for others, depending on how many copies of that resource were in the corpus and which version was included.
For your brand, this means you can’t assume anything. A model with a declared cutoff of early 2025 might have information about your company from 2022, or a partial version of your Wikipedia page from 2021. It depends on how many times your information was collected and indexed in the training corpus.
And here lies the direct link to your visibility: if the model has an old or partial version of your information, when a potential customer asks “who is the best [your service] in [your city]?”, the AI either ignores you because the information is too dated, or — worse — describes you incorrectly. In both cases, the customer goes to the competitor.
The real risk: confabulation
But the cutoff problem isn’t just obsolescence. The worse problem is what the model does when it doesn’t have the information.
It doesn’t admit that it doesn’t know. It generates a plausible answer based on statistical patterns; in research, this is called confabulation (or, more commonly, hallucination). The result can be completely fabricated information: locations you’ve never had, products you don’t sell, nonexistent phone numbers, names of people who no longer work for you.
For your business, this is a silent reputational risk. A potential customer asks ChatGPT “what services does [your brand] offer?” and receives a made-up list. They don’t know it’s made up. They make a decision based on that list — perhaps ruling out your company because the service they were looking for doesn’t appear, even though you’ve been offering it for two years.
I checked this pattern across 15 Italian SMEs, asking the three main AI engines for basic information: address, services, CEO, year founded. In 11 cases out of 15, at least one piece of data was outdated. In 6 cases, at least one piece of data was completely fabricated: a service the company never offered, a location it never had, a founder with the wrong name.
The three channels and how they behave
Not all AI systems have the same relationship with the cutoff. Understanding the differences is the first step to deciding where to intervene.
Models without browsing (ChatGPT in basic mode, for example) reply only from the training data. If your information changed after the cutoff — or worse, if there was a wrong version in the training — the answer is wrong, full stop.
Systems with RAG (retrieval-augmented generation), such as Perplexity, Bing Chat, or ChatGPT with browsing enabled, search in real time before answering. Here the cutoff matters less, but it’s not irrelevant: if your site isn’t easily crawlable or if the updated sources are weak, the system might give more weight to the information from the training. Training data also tends to be treated as more “reliable” precisely because it’s consolidated, which means your old information in the training can prevail over the new information on the site if the site is the only updated source.
Google Gemini and the AI Overviews combine training and real-time search, but in my experience they lean heavily on the training: what the model learned there tends to weigh more than what it finds live. Updated information only wins if it comes from multiple authoritative sources saying the same thing, and this is where the source hierarchy that determines who the AI trusts comes into play.
The consequence for those who want to be found is clear: if the only updated source is you (your website), it probably isn’t enough. You need multiple concordant signals — the Wikidata profile, the Google Business Profile, the industry directories — all saying the same correct thing. Only then does the updated signal beat the one consolidated in the training.
How to get around the cutoff: take control of the RAG sources
You can’t change the past training data. But you can make sure that RAG systems — which search in real time — find your correct information before falling back on the training.
The sources that RAG systems consult most frequently and with the most trust are not necessarily the ones that rank highest on Google. In my analysis, the sources that most influence updated answers are:
- Wikidata and Wikipedia: Wikidata is the structured database that powers many AI answers — if your brand has a Wikidata entity with correct data, the signal is very strong. Wikipedia is the textual source that many RAG systems consult first. Keeping them updated is not trivial — it requires knowing the notability criteria, the structured data format, the editorial policies — but it’s one of the highest-impact interventions.
- Google Business Profile: address, hours, services, photos. For local queries (“best [service] in [city]”), it’s often the primary source the AI draws from. An incomplete or outdated profile is a missed opportunity every time someone asks a geolocated question.
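- Your website, but with structured data: it isn’t enough to have the information on the About page. You need structured data (Organization schema) with address, phone, services, and year founded, plus a `dateModified` updated every time you change something. Note that in schema.org `dateModified` is a property of the page (a CreativeWork such as WebPage or AboutPage), not of the Organization itself, so it goes on the page that carries the markup. Structured data gives AI crawlers an unambiguous, machine-readable version of the facts, so they don’t have to infer them from the page text.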
- Industry directories: Crunchbase, company LinkedIn, vertical directories. Every updated source that confirms the same information strengthens the signal and makes it harder for the outdated training data to prevail.
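As a rough illustration of the structured data described in the website item above, here is a minimal Python sketch that builds a JSON-LD block for an About page. The company data, the `build_about_page_jsonld` function name, and the input dictionary keys are all hypothetical; the schema.org types and properties (`AboutPage`, `Organization`, `PostalAddress`, `makesOffer`, `foundingDate`, `sameAs`, `dateModified`) are real, but which ones a given AI system actually uses is an assumption.

```python
import json
from datetime import date

# Hypothetical company data, for illustration only.
EXAMPLE_ORG = {
    "name": "Example Srl",
    "url": "https://www.example.com",
    "telephone": "+39 02 1234567",
    "founding_year": "2012",
    "street": "Via Roma 10",
    "city": "Milano",
    "postal_code": "20121",
    "country": "IT",
    "services": ["Tax consulting", "Payroll management"],
    "same_as": ["https://directory.example.org/example-srl"],
}


def build_about_page_jsonld(org: dict, last_changed: date) -> str:
    """Return a JSON-LD string for an AboutPage whose main entity is an Organization."""
    markup = {
        "@context": "https://schema.org",
        "@type": "AboutPage",
        # dateModified is a CreativeWork property, so it sits on the page,
        # not on the Organization node.
        "dateModified": last_changed.isoformat(),
        "mainEntity": {
            "@type": "Organization",
            "name": org["name"],
            "url": org["url"],
            "telephone": org["telephone"],
            "foundingDate": org["founding_year"],
            "address": {
                "@type": "PostalAddress",
                "streetAddress": org["street"],
                "addressLocality": org["city"],
                "postalCode": org["postal_code"],
                "addressCountry": org["country"],
            },
            # One Offer per service, each pointing at a Service item.
            "makesOffer": [
                {"@type": "Offer", "itemOffered": {"@type": "Service", "name": name}}
                for name in org["services"]
            ],
            # Links to the other profiles that should confirm the same data.
            "sameAs": org["same_as"],
        },
    }
    return json.dumps(markup, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    print(build_about_page_jsonld(EXAMPLE_ORG, date.today()))
```

The resulting string would typically be embedded in the page inside a `<script type="application/ld+json">` element, and regenerated with a new date whenever the address, phone, or service list changes.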
The rule is simple: the more authoritative sources say the same updated thing, the more the RAG system trusts the new information. A single source (your website) against the information consolidated in the training often isn’t enough. You need at least 3-4 concordant signals.
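One way to make the "3-4 concordant signals" rule checkable is to keep a small record of what each source currently says and compare it against your canonical data. The sketch below is only that, a sketch: the source names, the sample values, and the helper functions (`normalize`, `concordant_sources`, `weak_fields`) are hypothetical, and the records are assumed to be copied in by hand rather than fetched from any API.

```python
def normalize(value: str) -> str:
    """Keep only letters and digits, case-insensitively, so that
    differences in punctuation or spacing don't count as disagreement."""
    return "".join(ch for ch in value.casefold() if ch.isalnum())


def concordant_sources(canonical: dict, sources: dict, fields: tuple) -> dict:
    """For each field, list the sources whose value matches the canonical one."""
    agreement = {}
    for field in fields:
        target = normalize(canonical[field])
        agreement[field] = sorted(
            name
            for name, record in sources.items()
            if normalize(record.get(field, "")) == target
        )
    return agreement


def weak_fields(agreement: dict, minimum: int = 3) -> list:
    """Return the fields confirmed by fewer than `minimum` sources."""
    return [field for field, names in agreement.items() if len(names) < minimum]


# Hypothetical data, entered manually after checking each profile.
CANONICAL = {"address": "Via Roma 10, Milano", "telephone": "+39 02 1234567"}
SOURCES = {
    "website": {"address": "Via Roma 10, Milano", "telephone": "+39 02 1234567"},
    "google_business_profile": {"address": "via Roma 10 - Milano", "telephone": "+39-02-1234567"},
    "wikidata": {"address": "Via Roma 10, Milano"},  # no phone recorded
    "old_directory": {"address": "Corso Italia 5, Milano", "telephone": "+39 02 7654321"},
}

if __name__ == "__main__":
    agreement = concordant_sources(CANONICAL, SOURCES, ("address", "telephone"))
    for field, names in agreement.items():
        print(field, "confirmed by", names)
    print("Needs more concordant sources:", weak_fields(agreement))
```

With this sample data the address is confirmed by three sources while the phone number is confirmed by only two, so the phone number is the field to fix first (here, by adding it to Wikidata and correcting the old directory).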
Monitoring is part of the strategy
One thing I always recommend to clients: add AI monitoring to your routine. At least once a month, ask the main AI engines the same 5 questions about your brand:
- “What is [your brand]?” — accurate description?
- “Where is [your brand] located?” — right address?
- “What services does [your brand] offer?” — correct list?
- “Who is the founder/CEO of [your brand]?” — right people?
- “What are the opinions about [your brand]?” — realistic sentiment?
Note down every error. Then trace it back to the source: is that wrong information in your Google Business Profile? On Wikipedia? In an old directory? The AI got it from somewhere — finding the source of the error is the first step to correcting it.
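The monthly routine above can be kept in a simple log. The following is a minimal sketch, not a finished tool: `run_check`, `rows_to_csv`, and the topic keys are hypothetical names, and each "engine" is just a function you supply that takes a question and returns the answer text (it could wrap whatever API you have access to, or return answers you pasted in by hand). The check is a crude substring match: it flags expected facts that are missing from an answer, but it cannot detect fabricated details, so the answers still need to be read.

```python
import csv
import io
from datetime import date

# The five monthly questions, keyed by a short topic label.
QUESTIONS = {
    "description": "What is {brand}?",
    "location": "Where is {brand} located?",
    "services": "What services does {brand} offer?",
    "leadership": "Who is the founder/CEO of {brand}?",
    "opinions": "What are the opinions about {brand}?",
}

FIELDNAMES = ["date", "engine", "topic", "question", "missing_facts", "answer"]


def run_check(brand: str, engines: dict, expected: dict, today: date) -> list:
    """Ask every engine every question and record which expected facts are absent.

    engines:  {engine_name: callable(question) -> answer text}
    expected: {topic: [facts that a correct answer should mention]}
    """
    rows = []
    for engine_name, ask in engines.items():
        for topic, template in QUESTIONS.items():
            question = template.format(brand=brand)
            answer = ask(question)
            missing = [
                fact
                for fact in expected.get(topic, [])
                if fact.casefold() not in answer.casefold()
            ]
            rows.append(
                {
                    "date": today.isoformat(),
                    "engine": engine_name,
                    "topic": topic,
                    "question": question,
                    "missing_facts": "; ".join(missing),
                    "answer": answer,
                }
            )
    return rows


def rows_to_csv(rows: list) -> str:
    """Serialize the rows as CSV text, ready to append to a monthly log file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Comparing the `missing_facts` column month over month shows whether corrections made at the source are actually reaching the answers.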
The knowledge cutoff is a structural limit of the models, but it isn’t a final verdict. Those who take control of the RAG sources with correct and updated information (Wikidata, Google Business Profile, a structured website, industry directories) get around the problem and are found with the right data. Those who ignore it let the AI tell an outdated, or fabricated, version of their brand to every potential customer who asks a question.
It’s work that has to be done methodically and maintained over time — it isn’t a one-off intervention. But it’s also one of the few areas where whoever moves first has a clear advantage, because most competitors aren’t doing it yet.