Training Data Lifecycle: why corrections to your site don’t reach the AI right away

Roberto Serra · 25 June 2026 · ~8 min read

You've changed your address, updated your prices, corrected the information on your site, but ChatGPT keeps telling your customers the old version. You can't fix this the way you would on Google: AI models update on their own schedule, and that can take 6 or even 12 months. In the meantime, anyone looking for you gets the wrong data, often without knowing it. There are specific channels that speed up the update, and using them properly shortens the wait considerably.

I remember back in 2015 when Google took 3-6 months to absorb a substantial change to a site: you changed the structure, rewrote headings, updated your company data, and the engine noticed it at its own pace. With AI today the dynamic is similar, but you’re working with retraining cycles that differ for each model, and the window for it to register can be even longer. Let me explain how to think about the timing, so you’re not surprised if ChatGPT or Gemini still tell the old version of your company after you’ve corrected the site.

The case I’m starting from is real: an organic farm in Palermo producing extra virgin olive oil, PGI citrus fruits and natural Sicilian wine. Three years ago it changed its premises and trade name and updated the site right away. Yet in the autumn of 2025, a query to Perplexity for “award-winning Sicilian organic olive oil producers” still returned the old name and the old address. It’s not a bug. It’s the training data lifecycle working at its own pace.

What “training data lifecycle” means for an AI model

A language model is not a database that updates every night. It’s a snapshot of knowledge taken at a certain date, called the knowledge cutoff, and from there on that snapshot only updates in two ways: with a complete retraining (extremely expensive) or with continued pretraining on top of the existing checkpoint.

In the research world, Parmar and colleagues at NVIDIA explained well why almost no one chooses the first path today.

“Due to the large computational cost that pretraining of modern LMs incurs, frequent complete retraining is intractable.”

Parmar et al., 2024

In other words: training modern language models from scratch costs so much compute that frequent complete retraining is practically unfeasible. That’s why labs reuse existing models by adding targeted training phases on top of them.

For you as a business owner this means one simple thing: when you correct a piece of data on your site, that data doesn’t enter ChatGPT’s brain the following week. It enters when the lab that produces it decides to run a model update session. It might happen in 3 months. It might happen in 12. It depends on the model.

Why labs reuse models instead of rebuilding them from scratch

The same paper clarifies the economic logic that governs the market today.

“This makes the reuse of already developed LMs via continued pretraining an attractive proposition.”

Parmar et al., 2024

In plain terms: reusing already developed language models through continued pretraining is an advantageous option. The operational consequence for your business is that a model’s knowledge becomes layered: some parts get refreshed, others stay frozen in an earlier state for many cycles.

This ties in directly with what I’ve already explained about the vector space of embeddings: once your brand is mapped in the model’s semantic space with certain characteristics, moving it takes new data repeated at scale, not a single update to your site.

The three paths by which a correction reaches the AI

A business owner needs a simple mental map. The data you correct today reaches the AI through three channels, with very different timings.

RAG channel (real time or nearly so): when an engine like Perplexity, Google AI Overviews or ChatGPT’s web search builds the answer by pulling pages at that moment, your correction is read right away. Here you work on the site, schema markup, Google Business Profile, Wikidata.

Periodically refreshed index channel (weeks or months): some systems use semantic indexes that are regenerated every so often. Your correction enters when the index is rebuilt.

Model training channel (months or years): the knowledge inside the model only changes during continued pretraining sessions. It’s the slowest and least predictable channel.

For the Sicilian farm I mentioned, the solution was to work on all three in parallel. The site was rebuilt with Organization schema, Wikidata updated with the new name, the new premises and references to industry awards, the Google Business Profile rebuilt from scratch. Within three weeks Perplexity was already citing the corrected version. Google AI Overviews took two months. ChatGPT without browsing, on a direct question, kept returning the old name for several months.
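To make the Organization schema step more concrete, here is a minimal Python sketch that builds the JSON-LD block for a homepage. Every value in it (company names, URL, address, the Wikidata ID) is a placeholder, and exposing the former trade name through alternateName is an assumption of this sketch about how to signal continuity, not a documented requirement of any AI engine.

```python
import json


def organization_jsonld(name, url, street, locality, postal_code, country,
                        former_name=None, same_as=()):
    """Build a schema.org Organization description as a plain dict."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "addressLocality": locality,
            "postalCode": postal_code,
            "addressCountry": country,
        },
    }
    if former_name:
        # Assumption of this sketch: the old trade name is exposed as an alias.
        data["alternateName"] = former_name
    if same_as:
        # Profiles on other structured sources that describe the same entity.
        data["sameAs"] = list(same_as)
    return data


def as_script_tag(data):
    """Wrap the dict in the script tag that goes into the page's HTML."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder data for a hypothetical company.
example = organization_jsonld(
    name="Example Organic Farm",
    url="https://www.example.com",
    street="Via Esempio 1",
    locality="Palermo",
    postal_code="90100",
    country="IT",
    former_name="Old Example Farm",
    same_as=["https://www.wikidata.org/wiki/Q0000000"],
)
print(as_script_tag(example))
```

The printed tag is what you would paste into the page head. Keeping the data in one function makes it easier to reuse the same name and address everywhere else you publish them.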

The test you can run in ten minutes on your brand

You don’t need a professional tool to understand what state you’re in. Follow this path, in order.

Open Google’s Rich Results Test, paste your homepage URL and look for “Organization”: if the name, address and site come up correct, the RAG channel has something to work with. If they don’t show up, you have a problem upstream.
Go to Wikidata and search for your company name. If an entry exists, check that data such as address, sector and site are up to date. If it doesn’t exist, ask a professional to create it: it’s one of the structured sources the models draw on most.
Open Perplexity and then ChatGPT with web search, and run three queries about your sector: one generic (“producers of X in Y”), one specific with a category (“best organic Z in Italy”), one comparative (“differences between A and B”). Look at which sources are cited and with which data.

This is an entry-level check; it doesn’t replace real analysis, which requires professional tools and deeper work on the brand’s entity mapping. But in ten minutes it gives you an honest picture of where you stand.
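If you prefer to script the first step of the check, the sketch below reads a page's HTML and lists the Organization entries found in its JSON-LD markup. It is only a rough local approximation, not a substitute for Google's Rich Results Test: it uses the Python standard library, matches only the exact type "Organization" (subtypes such as LocalBusiness are ignored), and the sample HTML is invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag in a page."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            try:
                self.blocks.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Malformed markup is skipped, not repaired.


def find_organizations(html):
    """Return every JSON-LD node whose @type includes Organization."""
    parser = JsonLdExtractor()
    parser.feed(html)
    found = []
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict):
                continue
            # Some sites nest their entities under an @graph list.
            for node in item.get("@graph", [item]):
                if not isinstance(node, dict):
                    continue
                types = node.get("@type", [])
                if isinstance(types, str):
                    types = [types]
                if "Organization" in types:
                    found.append(node)
    return found


# Invented sample page; in practice you would pass your homepage's HTML.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization",
 "name": "Example Organic Farm", "url": "https://www.example.com"}
</script>
</head><body></body></html>
"""

for org in find_organizations(sample_html):
    print(org.get("name"), org.get("url"), org.get("address", "address missing"))
```

An empty result, or an entry with no address, is the “problem upstream” described in the first step.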

The reasoning behind the uptake times

I haven’t run a controlled test on the labs’ retraining windows, and it would be dishonest to make up percentages. I reason on what the research documents and on what I observe repeatedly with clients.

Parmar and colleagues, in the paper already cited, observe that by applying targeted continued pretraining techniques on a 15-billion-parameter model they achieved an average accuracy improvement of 9% compared to the baseline.

“When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set.”

Parmar et al., 2024

The practical point isn’t the 9%. It’s that labs have a strong economic incentive to refresh models with periodic continued pretraining rather than retraining from scratch. My working estimate, based on what I observe with clients rather than on any published schedule, is that update windows for commercial models range between 3 and 12 months: your correction travels on that track, not on the “online in two days” track. Translated: if you want the AI to describe your brand correctly by a given date, structural corrections need to be live up to a year earlier, not the month before.
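The planning arithmetic can be sketched in a few lines of Python. The 3 and 12 month bounds are the estimate given above, not published lab schedules, and the target date is a hypothetical example.

```python
from datetime import date


def months_before(target, months):
    """Return the date `months` months before `target`.

    The day is clamped to 28 so the result is valid in every month.
    """
    total = target.year * 12 + (target.month - 1) - months
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, min(target.day, 28))


def correction_window(target, fastest_months=3, slowest_months=12):
    """Return (safe_by, latest) dates for getting corrections live.

    safe_by: live by this date, even the slowest assumed cycle can pick it up.
    latest:  after this date, even the fastest assumed cycle may miss it.
    """
    return months_before(target, slowest_months), months_before(target, fastest_months)


# Hypothetical target: the AI should describe the brand correctly by June 2026.
safe_by, latest = correction_window(date(2026, 6, 1))
print("Corrections live by", safe_by, "to be safe;", latest, "at the very latest")
```

This covers only the model training channel. Engines that read pages at query time pick up a correction much sooner.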

The mistakes I see most often when working on the lifecycle

There are four patterns that recur with annoying regularity in companies that call me after having “sorted everything out.”

Updating only the site and waiting. The site helps the RAG channel; it doesn’t touch the training. Without Wikidata, complete Organization schema and external structured profiles, you’re working on a quarter of the problem.

Changing the name without a transition plan. When a company changes its brand name, the AI keeps using the old one for 6-12 months. You need a continuity page, a public note and a mention in authoritative industry sources.

Expecting a single change to move the needle. Models learn by repetition at scale: a single corrected source, if all the others tell the old version, moves nothing. You need consistency across all public touchpoints.

Not monitoring what the AI says. Without a periodic round of queries about your brand, you discover the problem when it’s too late. A monthly cadence on 5-10 key queries is the bare minimum.
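Consistency across touchpoints is easy to check once the data sits side by side. The sketch below compares what each source says against the version you consider correct and reports which fields are stale. The source names and records are invented, you would collect the real ones by hand, and the comparison only ignores case and extra spaces.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.casefold().split())


def find_stale_sources(canonical, sources):
    """Return {source: [fields that differ from the canonical record]}."""
    stale = {}
    for source, record in sources.items():
        diffs = [
            field
            for field, expected in canonical.items()
            # A missing field counts as a mismatch.
            if normalize(record.get(field, "")) != normalize(expected)
        ]
        if diffs:
            stale[source] = diffs
    return stale


# Invented records for a hypothetical company.
canonical = {"name": "Example Organic Farm", "address": "Via Esempio 1, Palermo"}
sources = {
    "site": {"name": "Example Organic Farm", "address": "Via Esempio 1, Palermo"},
    "Google Business Profile": {"name": "example organic farm",
                                "address": "Via Esempio 1,  Palermo"},
    "industry directory": {"name": "Old Example Farm",
                           "address": "Via Vecchia 9, Palermo"},
    "Wikidata": {"name": "Example Organic Farm"},
}

for source, fields in find_stale_sources(canonical, sources).items():
    print(f"{source}: fix {', '.join(fields)}")
```

A strict comparison like this will flag harmless variants such as abbreviated street names, so treat the output as a to-do list to review, not a verdict.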

What to do concretely in the next 30 days

If you’re thinking about an important correction to your brand, or if the AI is already telling something outdated about you, move like this.

Take inventory of the structured sources that talk about you: site, Google Business Profile, Wikidata, industry directories, professional registers. Correct everything within a single time window, not piecemeal.
Build a page on the site that clearly tells the “before and after” of the change (new premises, new name, new range): it serves as an anchor for RAG systems and as a basis for external citations.
Find 3-5 authoritative sources in your sector (niche magazines, trade associations, award registers) and check that they carry the updated version. These are the sources the models tend to weight heavily.
Plan a monthly check on ChatGPT, Perplexity and Gemini with the same 10 queries: that way you see the pace at which each model takes things up.

Always compare against the 3-5 competitors the AI cites today in your sector: if they show up with updated data and you don’t, the gap isn’t about quality, it’s about the work on structured sources.
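To see the pace at which each engine takes up the change, it helps to keep a small log of the answers you get at each check. The sketch below assumes you paste the answers in by hand (it calls no AI service), and the names, dates and answers are invented. It relies on plain substring matching, so it would misclassify a new name that contains the old one.

```python
from datetime import date


def classify_answer(answer, new_name, old_name):
    """Label an answer by which version of the brand name it mentions."""
    text = answer.casefold()
    has_new = new_name.casefold() in text
    has_old = old_name.casefold() in text
    if has_new and has_old:
        return "mixed"
    if has_new:
        return "updated"
    if has_old:
        return "outdated"
    return "absent"


def first_updated(log, new_name, old_name):
    """Return {engine: earliest check date with a fully updated answer}."""
    result = {}
    for check_date, engine, answer in sorted(log, key=lambda row: row[0]):
        if classify_answer(answer, new_name, old_name) == "updated":
            result.setdefault(engine, check_date)
    return result


# Invented log entries: (check date, engine, answer text pasted by hand).
log = [
    (date(2026, 1, 15), "Perplexity", "Example Organic Farm produces award-winning oil."),
    (date(2026, 1, 15), "ChatGPT", "Old Example Farm is a producer in Palermo."),
    (date(2026, 4, 15), "Perplexity", "Example Organic Farm is based in Palermo."),
    (date(2026, 4, 15), "ChatGPT", "Example Organic Farm, formerly Old Example Farm."),
]

for engine, when in first_updated(log, "Example Organic Farm", "Old Example Farm").items():
    print(engine, "first gave the updated name on", when)
```

Engines that never appear in the result have not yet given a clean, updated answer, which is exactly the gap worth tracking from one check to the next.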

The thread with visibility in AI answers

The training data lifecycle is the reason why visibility in AI answers is built in advance, not in reaction. What I’ve told you in previous articles – from knowledge graph reconciliation to the entity’s geographic association – works precisely on the fact that models learn slowly and forget slowly. Once you’re in, you stay in for a long time. But to get in you have to think in months, not weeks.

In the next articles we’ll dive deeper into continuous entity monitoring and the management of domain changes, the two operational sides of this topic.