Measuring AI visibility

The prompt framework that turns AI monitoring into comparable data

Roberto Serra · 25 June 2026 · ~8 min read

If you measure your AI visibility with different questions every month, you're comparing apples with oranges, and the variations you see in the numbers mean nothing. It's not a tooling problem: without fixed, identical questions repeated over time, you don't have data, you have anecdotes. Building a set of standardized questions to reuse every month is the work nobody does at the start, but it's what makes everything else legible and actionable.

Open ChatGPT. Send the same query 3 times, at different hours. The answers change, and so do the brands cited. Anyone who doesn’t monitor over time is measuring a single frame.

This is the test I ran last week with the query “best Italian companies for trade fair stands at Fiera Milano Rho”. Three submissions, three different answers: five brands appeared in the first, four in the second (three of them shared with the first), and six in the third (with partial overlap). Had I stopped at a single submission, I would have told the client “you never appear” or “you always appear”. Both statements would have been false.

This is where the need for a framework of standardized prompts arises: a fixed set of questions to send over time, always identical, always to the same platforms, always at the same cadence. Without this discipline, your measurement of visibility in AI answers is not measurement, it’s a collection of anecdotes.

What I mean by prompt monitoring framework

A prompt monitoring framework is simply the written list of queries you use every month to check whether and how your brand appears in the answers of ChatGPT, Perplexity, Claude and Gemini. It’s your yardstick. If you change the yardstick every time, you can’t tell whether you’ve grown or declined.

In scientific research this principle is called a repeatable measurement instrument: nobody would publish a study claiming to have weighed a sample with a different scale at every reading. The same logic applies to AI visibility monitoring. There is not yet an academic paper that formalizes a standard prompt protocol for brand visibility: it’s new ground. It follows that you have to build the framework yourself and that, once built, it must be frozen. Standardization is an operational choice, not a technical option.

This is a case of deduction: I take a consolidated methodological principle (the repeatability of the measurement) and apply it to the new domain of visibility in AI engines. There is no direct academic citation, there is a transfer of principle.

Why without fixed prompts your numbers are worthless

If this month you send ChatGPT 12 queries and next month 18 (plus three reworded), any variation in share of voice is contaminated. You changed the sample and the instrument at the same time. You don’t know whether you improved because you worked on the content or because you added queries in which you happen to appear more.

The standardized framework unlocks four things you can’t do without it:

compare your AI share of voice month over month (I covered it in share of voice in AI answers)
measure the average position in which your brand is named (first, third, last in the list)
track the sentiment of mentions over time
calculate coverage, that is, how many of the queries in your sector include you at least once

None of these metrics is legible without fixed prompts. It’s the upstream prerequisite.
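As a rough illustration of how three of these metrics fall out of the same fixed set of prompts, here is a minimal Python sketch. The record layout (`prompt_id` plus an ordered list of brands), the function names and the brands other than Acme Stand are assumptions made for the example, not the output of any specific tool. Share of voice is computed here as the brand's mentions divided by all brand mentions, which is one possible definition among several. Sentiment is left out because it needs a manual or model-based label.

```python
from collections import Counter

# Hypothetical observations: one entry per (prompt, submission),
# with brands listed in the order the answer names them.
observations = [
    {"prompt_id": "cat-01", "brands": ["Alpha", "Acme Stand", "Beta"]},
    {"prompt_id": "cat-01", "brands": ["Alpha", "Beta"]},
    {"prompt_id": "cat-02", "brands": ["Acme Stand", "Gamma"]},
    {"prompt_id": "loc-01", "brands": ["Beta", "Gamma"]},
]


def share_of_voice(obs, brand):
    """Mentions of `brand` divided by all brand mentions (assumed definition)."""
    counts = Counter(b for o in obs for b in o["brands"])
    total = sum(counts.values())
    return counts[brand] / total if total else 0.0


def average_position(obs, brand):
    """Mean 1-based position of `brand` in the answers that name it."""
    positions = [o["brands"].index(brand) + 1 for o in obs if brand in o["brands"]]
    return sum(positions) / len(positions) if positions else None


def coverage(obs, brand):
    """Fraction of distinct prompts that name `brand` at least once."""
    prompts = {o["prompt_id"] for o in obs}
    hit = {o["prompt_id"] for o in obs if brand in o["brands"]}
    return len(hit) / len(prompts) if prompts else 0.0
```

With this sample data, Acme Stand gets 2 of 9 mentions, an average position of 1.5 and coverage of 2 prompts out of 3. If the prompt set changes between months, the denominators change with it, which is exactly why the comparison breaks.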

Common mistake

The marketing manager revises the queries “to make them more realistic” and starts over from scratch.

The standard framework: 50 prompts split into 5 types

The set I use with clients is 50 prompts, ten for each of the five families below. It’s a number that’s manageable manually for those just starting out and broad enough to be statistically legible.

Brand prompts (10): queries in which the client’s brand is named explicitly. Example for an exhibit designer in Rho/MI: “What does the company Acme Stand of Rho do?”. They serve to measure whether ChatGPT knows the brand and what it says.
Category prompts (10): generic category queries. “Best Italian companies for B2B trade fair setups”. They serve to measure share of voice in the category.
Comparison prompts (10): comparison queries. “Differences between traditional and modular trade fair setups for international fairs” if you sell stands, or “Is a local accountant in Bologna or a national firm better for an innovative SRL?” if you’re a professional firm in Emilia. They serve to measure whether the brand appears in comparisons.
Recommendation prompts (10): personalized recommendation queries. “I’m an extra-virgin olive oil producer in Puglia, which specialized e-commerce platform do you recommend for selling in Northern Europe?”, or “I’m looking for a notary in Lecce for a generational business handover: who should I turn to?”. They serve to measure the recommendation position — the one that converts the most.
Local prompts (10): queries with a geographic constraint. “Exhibit design companies near Fiera Milano Rho”. They serve to measure local visibility, particularly important for those selling to international buyers on the move.

These five types are not random: they cover the real behaviors with which a buyer queries an AI when choosing a supplier. Skip one of the five types and you have a measurement gap.
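For anyone who prefers to keep the set in a file rather than a spreadsheet, the structure can be sketched in Python as follows. The `Prompt` class, the ID scheme and `validate_framework` are hypothetical names introduced for this example; the only facts taken from the framework are the five families and the ten prompts per family.

```python
from dataclasses import dataclass

FAMILIES = ("brand", "category", "comparison", "recommendation", "local")
PROMPTS_PER_FAMILY = 10


@dataclass(frozen=True)
class Prompt:
    prompt_id: str  # e.g. "local-01"; the ID scheme is an assumption
    family: str
    text: str


def validate_framework(prompts):
    """Return a list of problems; an empty list means the set is complete."""
    problems = []
    for family in FAMILIES:
        count = sum(1 for p in prompts if p.family == family)
        if count != PROMPTS_PER_FAMILY:
            problems.append(
                f"{family}: {count} prompts, expected {PROMPTS_PER_FAMILY}"
            )
    unknown = sorted({p.family for p in prompts if p.family not in FAMILIES})
    if unknown:
        problems.append(f"unknown families: {', '.join(unknown)}")
    texts = [p.text for p in prompts]
    if len(texts) != len(set(texts)):
        problems.append("duplicate prompt texts")
    return problems
```

Running the check before every monthly round is a cheap way to catch a family that was skipped or a prompt that was silently edited into a duplicate.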

Pro tip

Open a spreadsheet, title the first tab “Prompt Framework v1 — locked until [date 12 months from now]”.

The test you can run in 30 minutes

Before building the 50 prompts, create a mini-framework of 10 prompts (two per category) and apply it to three platforms. It’s enough to grasp the discipline. Procedure:

Write the 10 prompts in a spreadsheet, one column “prompt”, one column for each platform (ChatGPT, Perplexity, Gemini), one column “date”.
Send each prompt 3 times in a row on the same platform. Note the brands cited in each answer. Do this for all three platforms.
For each prompt, mark as “stable” the brands that appear in at least 2 out of 3 answers and as “unstable” those that appear in 1 out of 3.
Repeat the entire operation 30 days later, same day of the week if possible.
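The stable/unstable rule from the procedure above can be sketched in a few lines of Python. The function name and the sample brands are assumptions for illustration; the thresholds (at least 2 out of 3 for stable, 1 out of 3 for unstable) are the ones given in the procedure.

```python
from collections import Counter


def classify_brands(answers):
    """Split brands into stable (>= 2 of 3 answers) and unstable (1 of 3).

    `answers` is a list of three collections of brand names, one per submission.
    """
    if len(answers) != 3:
        raise ValueError("expected exactly 3 submissions")
    # set() so that a brand named twice in one answer counts once
    counts = Counter(b for answer in answers for b in set(answer))
    stable = sorted(b for b, n in counts.items() if n >= 2)
    unstable = sorted(b for b, n in counts.items() if n == 1)
    return stable, unstable


# Hypothetical answers to one prompt on one platform
runs = [
    ["Alpha", "Beta", "Acme Stand"],
    ["Alpha", "Gamma"],
    ["Alpha", "Beta", "Delta"],
]
stable, unstable = classify_brands(runs)
print(stable)    # ['Alpha', 'Beta']
print(unstable)  # ['Acme Stand', 'Delta', 'Gamma']
```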

Three consecutive submissions serve to smooth out intra-session variability. Without that step you risk recording as “lost” a brand that was instead merely absent in that single refresh. The check remains an entry-level check: the real analysis, done on 50 prompts × 4 platforms × 3 submissions × 12 months, requires professional tools and automation, not a spreadsheet.
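The gap between the entry-level check and the full analysis is easier to see with the arithmetic written out. This is a small Python sketch; the function name is an assumption, the figures are the ones stated in the text.

```python
def total_submissions(prompts, platforms, repeats, rounds=1):
    """Number of individual queries to send and read."""
    return prompts * platforms * repeats * rounds


mini_round = total_submissions(prompts=10, platforms=3, repeats=3)
full_year = total_submissions(prompts=50, platforms=4, repeats=3, rounds=12)
print(mini_round)  # 90
print(full_year)   # 7200
```

Ninety answers can be read by hand in one sitting; 7,200 over a year (600 per monthly round) cannot, which is the practical reason the full version calls for automation.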

The test I ran myself

I applied this mini-framework of 10 prompts to three SMEs in the exhibit design and B2B trade fair sector in the Rho/MI quadrant, over a span of eight weeks. An indicative test, not a study: small sample, no control for confounders, but the pattern was clear.

Across 30 total prompts (10 × 3 companies), 53% of the brands cited in week 1 were still present in week 8 (16 brands out of 30). 47% had changed: some had disappeared, others had newly appeared, others still oscillated in position (from first to fourth place in the list). On a single submission, week 1, the client who had commissioned the work would have said “I never appear”. Over three submissions across eight weeks the picture was “I appear in 30% of category queries, always at the tail, never in recommendations”. The second reading is useful for practical purposes, the first is merely frustrating.
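A comparison like this one between week 1 and week 8 comes down to set operations. The following Python sketch shows the idea; the function name and the brand lists are hypothetical and are not the data from the test described above.

```python
def compare_periods(first, last):
    """Compare the brands seen in two periods.

    Returns (retained, disappeared, new, retention_rate), where the rate is
    the share of first-period brands still present in the last period.
    """
    first, last = set(first), set(last)
    retained = first & last
    disappeared = first - last
    new = last - first
    rate = len(retained) / len(first) if first else 0.0
    return retained, disappeared, new, rate


# Hypothetical brand lists for illustration only
week_1 = ["Alpha", "Beta", "Gamma", "Delta"]
week_8 = ["Alpha", "Gamma", "Epsilon"]
retained, disappeared, new, rate = compare_periods(week_1, week_8)
print(sorted(retained), sorted(disappeared), sorted(new), rate)
# ['Alpha', 'Gamma'] ['Beta', 'Delta'] ['Epsilon'] 0.5
```

Applied to the figures reported above, 16 retained brands out of 30 gives a retention rate of 16 / 30, which rounds to 53%.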

An honest limitation: eight weeks is not enough to stabilize a trend, and three companies are not a sample. They serve to confirm that the variability between submissions is real and that the standardized framework is the only way to read it.

The mistakes I notice most often

Mistake one: changing the prompts every month. The marketing manager revises the queries “to make them more realistic” and starts over from scratch. Result: zero time series, zero comparability. Once chosen, the 50 prompts are frozen for at least 12 months.

Mistake two: using a single submission per prompt. Result: data too noisy to be interpreted. Three consecutive submissions are the minimum, five is better.

Mistake three: testing only ChatGPT. The answers across ChatGPT, Perplexity, Gemini and Claude diverge substantially. A brand can be strong on Perplexity (because it has good editorial backlinks, I covered the mechanism in backlinks as citation proxy) and weak on ChatGPT. Measuring a single platform is equivalent to measuring a single sales channel.

Mistake four: ignoring local prompts. For those who sell physically near a trade fair hub (and Rho/MI is the busiest in Italy for B2B buyers), local prompts are where last-minute orders are decided. Skipping those ten prompts means not seeing half the market.

What can you do concretely to get started?

Open a spreadsheet, title the first tab “Prompt Framework v1 — locked until [date 12 months from now]”.
Write 10 prompts for each of the 5 families (brand, category, comparison, recommendation, local) calibrated to your sector and your sales geography. For each prompt category, draw inspiration from the real ways your clients have described you over the phone.
Decide on a cadence: monthly is the default; one run every two months is acceptable for sectors with long cycles like exhibit design.
Choose four platforms: ChatGPT, Perplexity, Gemini, Claude. Nothing less.
For each prompt, do 3 consecutive submissions. Note brands cited, position, sentiment.
Compare against the 3-5 competitors the AI cites most often in your sector: they are your real benchmark, not the market competitors you’d expect.

Without these six steps, any “AI visibility” report handed to you is an opinion disguised as a number.
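If the spreadsheet is later replaced by a script, the log those steps produce could look like the following Python sketch. The column names, the `make_row` and `write_log` helpers and the sample values are all assumptions for illustration; the point is only that every row carries the framework version, the date, the platform, the prompt and the submission number, so that months stay comparable.

```python
import csv
import io
from datetime import date

FIELDS = ["framework_version", "run_date", "platform", "prompt_id",
          "submission", "brands", "client_position", "sentiment"]


def make_row(version, run_date, platform, prompt_id, submission,
             brands, client_brand, sentiment=""):
    """Build one log row; position is 1-based, empty if the client is absent."""
    position = brands.index(client_brand) + 1 if client_brand in brands else ""
    return {
        "framework_version": version,
        "run_date": run_date.isoformat(),
        "platform": platform,
        "prompt_id": prompt_id,
        "submission": submission,
        "brands": "; ".join(brands),
        "client_position": position,
        "sentiment": sentiment,  # filled in by hand in this sketch
    }


def write_log(rows, handle):
    """Write the header and all rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # swap for open("log.csv", "w", newline="") on disk
write_log([make_row("v1", date(2026, 6, 25), "ChatGPT", "local-01", 1,
                    ["Alpha", "Acme Stand"], "Acme Stand", "neutral")], buffer)
```

Keeping `framework_version` in every row makes it obvious when someone has changed the prompts: a new version means a new time series, not a continuation of the old one.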

Where all of this leads…

The framework of standardized prompts is the floor on which all the other metrics of visibility in AI answers rest: share of voice, average position, sentiment, coverage. Without the floor, the metrics fluctuate for reasons you can’t isolate. With the floor, you can finally tell whether the actions you take (work on entities, editorial citations, schema markup, inverted-pyramid content, which I covered in inverted pyramid) are working or not.

In the following articles of this series I’ll explain how to turn the data collected by the framework into a monthly scorecard that decision-makers can read in two minutes (monthly AI visibility scorecard), how to compare your results against competitors in a matrix (competitive comparison matrix) and which automated tools are emerging to avoid doing it all by hand (AI visibility tracking tools).