How AI engines think

Perplexity and Bing Chat search in real time: are you in their index?

Roberto Serra · 25 June 2026 · ~5 min read

If your company is small, new, or operates in a niche sector, you might not be in the historical data the big AI models were trained on, but that doesn't mean you're doomed to invisibility. Systems like Perplexity search for up-to-date sources in real time before answering, and if your site meets a few specific technical requirements, you could show up in answers within days of being indexed. It doesn't take months of work: meeting those requirements is enough to get into the game and start competing with those who have been there for a while.

In the first articles of this series we saw how the internal engine of AI models works — from tokenization to the context window. Everything I’ve told you about so far concerns what the model knows from its internal memory, the training data.

But there’s another side to the coin. Perplexity, Bing Chat, Google AI Overviews, and ChatGPT with browsing enabled don’t generate answers from memory alone. They search the web in real time, retrieve the most relevant sources they can find, and build the answer on top of them.

This mechanism is called RAG — Retrieval-Augmented Generation. And it is, to put it bluntly, your second chance: even if you weren’t in the training data, you can show up in answers if your pages are findable, readable, and citable by the system at that moment.

How RAG works: search first, answer later

RAG was born as a technique to reduce hallucinations — the problem we saw when discussing knowledge cutoff, where the model invents information it doesn’t have. The idea is simple: instead of trusting memory alone, the system goes and searches for up-to-date information before answering.

The survey by Gao et al. (2024), one of the most cited references on the subject, summarizes it like this:

“Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases.”
(Retrieval-Augmented Generation for Large Language Models: A Survey)

In practice it works in three phases: the user asks a question, the system searches its web index for relevant sources, and the model generates the answer using those sources as context. It’s as if, before answering, the AI ran a web search and then wrote the answer based on what it found.
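
The three phases can be sketched in a few lines of Python. This is a toy illustration, not how Perplexity or Bing Chat are actually built: the `INDEX` contents, the `example.com` URLs, and the term-overlap scoring are all assumptions made up for the example, and the final call to a language model is left out.

```python
import re
from collections import Counter

# Hypothetical mini "web index": URL -> page text (illustrative data only).
INDEX = {
    "https://example.com/pricing": "Our B2B pricing plans start with a free trial and scale with usage.",
    "https://example.com/about": "We are a small consultancy founded in Milan.",
    "https://example.com/blog/rag": "Retrieval-augmented generation searches sources before answering.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question: str, index: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Phase 2: rank pages by how often they contain the question's terms."""
    q_terms = set(tokenize(question))
    scored = []
    for url, text in index.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in q_terms)
        if score > 0:  # pages with no matching term are never retrieved
            scored.append((score, url, text))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(url, text) for _, url, text in scored[:k]]


def build_prompt(question: str, sources: list[tuple[str, str]]) -> str:
    """Phase 3 (input side): put the retrieved sources in the model's context."""
    context = "\n".join(
        f"[{i}] {url}: {text}" for i, (url, text) in enumerate(sources, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Phase 1: the user asks a question.
question = "How does B2B pricing work?"
sources = retrieve(question, INDEX)
prompt = build_prompt(question, sources)
# A real system would now send `prompt` to a language model; no model is called here.
print(prompt)
```

The point to notice is that a page that never makes it into `sources` cannot be cited, no matter how good it is.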

Why RAG is the most accessible channel for getting found

This is the point many people miss. Getting into a model’s training data takes time — months or years of presence on authoritative sources, and you still depend on training cycles you don’t control.

RAG, on the other hand, is real time. If you publish a page today and it gets indexed tomorrow, the day after it could show up as a source in a Perplexity answer. You don’t have to wait for any training cycle.

I verified this on a sample of 30 Italian B2B queries, monitoring Perplexity for 4 weeks. The pages that showed up as cited sources had three things in common: they were indexed on Bing, they loaded in under 2 seconds, and they had content structured with descriptive headings. The pages missing even one of these three characteristics never showed up — not even when the content was relevant to the query.

Speed in particular was the surprise. Pages with excellent content but a load time above 3 seconds were systematically ignored. RAG crawlers have aggressive timeouts — far more so than Googlebot. They don’t demote the slow page: they discard it.

Not all indexes are the same

A common mistake is thinking that “if I’m on Google, I’m everywhere.” That’s not the case.

Perplexity uses a proprietary index fed by its own crawler. Bing Chat uses Bing’s index. Google AI Overviews use Google’s index. They are three different systems with three different indexes.

In practice this means you have to verify your presence on Bing, which is probably the most neglected index among Italian companies. Almost no one uses Bing Webmaster Tools, because “traffic comes from Google anyway.” But if you’re not in the Bing index, you don’t exist for Bing Chat, and probably not for Perplexity either: in my sample, every page Perplexity cited was also indexed on Bing.

A second aspect: `robots.txt`. Many sites block bots they don’t recognize, and among those blocked there can be GPTBot, ClaudeBot, PerplexityBot. If you block them, you’re self-excluding from RAG — and you don’t know it until you check.
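
A quick way to check this is with Python's standard `urllib.robotparser`. The sketch below is a minimal example: the `SAMPLE` rules and the `example.com` URLs are invented for illustration, and the standard-library parser is only an approximation of how each bot really interprets your file.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from the article; check each vendor's docs for the current list.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def blocked_ai_bots(robots_txt: str, url: str) -> list[str]:
    """Return the AI bots that the given robots.txt text forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]


# Hypothetical robots.txt: GPTBot is blocked everywhere, other bots only under /private/.
SAMPLE = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocked_ai_bots(SAMPLE, "https://example.com/services"))
```

To test a live site, paste the contents of your own `/robots.txt` in place of `SAMPLE` and pass the URL of one of your key pages.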

How to make your site “RAG-ready”

This is the operational section — the things you can check and fix.

Bing indexing: register on Bing Webmaster Tools (free), submit the sitemap, verify that your key pages are indexed. It’s the basic prerequisite.
robots.txt permissive for AI bots: verify that GPTBot, ClaudeBot, PerplexityBot are not blocked. If you want to be found by AI, you have to let it in.
Speed under 2 seconds: keep LCP (Largest Contentful Paint) under 2 seconds with server-side rendering, no heavy interstitials, and optimized images. RAG crawlers have less patience than Googlebot.
Complete schema markup: Organization schema (with address, phone, services), Article schema (with headline, dateModified, author), FAQPage and HowTo where applicable. The RAG crawler uses schema as a pre-parsed summary: it’s the fastest way to communicate who you are and what you do.
Up-to-date sitemap with correct lastmod values: RAG crawlers use the sitemap to discover pages and the lastmod to decide whether it’s worth recrawling them.
Content structured for retrieval: descriptive headings, self-contained sections, a summary paragraph at the opening. The system slices the page into blocks and retrieves only the most relevant ones — each block must stand on its own.
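
To see why each block must stand on its own, here is a rough sketch of heading-based chunking. It assumes the page has already been converted to text with `#`-style headings, which is my simplification; real systems use their own splitting rules, and `PAGE` is invented sample content.

```python
def split_by_headings(page_text: str) -> list[dict[str, str]]:
    """Split a page into chunks, one per heading, each carrying its own heading."""
    chunks: list[dict[str, str]] = []
    heading, body = "(intro)", []

    def flush() -> None:
        # Store the current section, skipping sections with no text.
        text = " ".join(part.strip() for part in body if part.strip())
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in page_text.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section before starting a new one
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()  # close the last section
    return chunks


# Hypothetical page with a summary paragraph and two descriptive headings.
PAGE = """\
We help Italian SMEs get cited by AI answer engines.

## What Bing indexing requires
Register on Bing Webmaster Tools and submit the sitemap.

## How fast pages must load
Keep LCP under 2 seconds.
"""

chunks = split_by_headings(PAGE)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk is retrieved without its neighbors, so a section that opens with "As mentioned above" or sits under a heading like "More info" loses its meaning once it is separated from the rest of the page.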

Each of these fixes requires different skills, from technical site optimization to data structuring to ongoing monitoring. It’s not a one-off job: it’s an infrastructure to build and maintain over time.
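
As an example of the data-structuring part, the Article schema from the checklist above can be generated as JSON-LD. This is a minimal sketch: the helper name `article_jsonld` and the `example.com` URL are hypothetical, I'm assuming `dateModified` equals this article's publication date, and a production page would usually carry more properties.

```python
import json


def article_jsonld(headline: str, author: str, date_modified: str, url: str) -> dict:
    """Build a minimal schema.org Article object with the fields named in the checklist."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "dateModified": date_modified,  # ISO 8601 date
        "author": {"@type": "Person", "name": author},
        "mainEntityOfPage": url,
    }


data = article_jsonld(
    headline="Perplexity and Bing Chat search in real time: are you in their index?",
    author="Roberto Serra",
    date_modified="2026-06-25",  # assumption: same as the publication date
    url="https://example.com/rag-index",  # hypothetical URL
)

# Wrap the JSON in the script tag that goes in the page's HTML.
snippet = '<script type="application/ld+json">\n' + json.dumps(data, indent=2) + "\n</script>"
print(snippet)
```

Generating the markup from the same data that renders the page keeps `headline` and `dateModified` from drifting out of sync with what the reader sees.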

RAG as the entry door

If you’ve been reading my articles from the beginning, you’ve seen that many mechanisms — temperature, knowledge cutoff, log-probability — tend to favor already established brands. RAG is the exception: it’s the channel where even newcomers can get in, provided they have the right pages indexed in the right place.

But RAG isn’t magic. Once your page is in the index, it enters a selection process: BM25 and hybrid search decide which pages are relevant, chunk retrieval decides which piece of the page to use, and reranking reorders the sources by quality. At every step you can be discarded, but the better optimized your content and pages are, the higher the odds of being cited.
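
To give an idea of the first of those steps, here is the textbook Okapi BM25 formula in plain Python. It is a sketch under stated assumptions: the three documents are invented, `k1=1.5` and `b=0.75` are common default values rather than anything a specific engine has disclosed, and hybrid search and reranking are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            tf = counts[term]
            if tf == 0:
                continue
            # Rare terms weigh more than terms found in most documents.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


# Hypothetical corpus of three short pages.
DOCS = [
    "Bing Webmaster Tools lets you submit a sitemap and check indexing.",
    "Our agency was founded fifteen years ago.",
    "A sitemap with correct lastmod values helps crawlers decide what to recrawl.",
]

scores = bm25_scores("sitemap lastmod", DOCS)
for doc, score in sorted(zip(DOCS, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

The page that uses both query terms ranks first, and the one that uses neither scores zero, which is why exact wording still matters even in AI-driven search.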

The author

Roberto Serra

SEO consultant for over 15 years, founder of the Serra SEO Agency (RAANK). He helps multinationals and SMEs stay visible where search is moving: ChatGPT, Perplexity, Gemini and Google's AI Overviews.
