Your site looks great visually, but behind the scenes the code is a mess: headings used at random, tables built in a non-standard way, a structure with no logic. For a human visitor nothing changes, but for AI it's pure noise: it can't figure out what matters and what doesn't, and in the end it may decide to use another source. It's not a Google ranking problem, it's a comprehension problem: the AI loses key information about you because it can't read you precisely. Fixing the most common issues often takes less than a morning, and the effect on AI visibility can show up quickly.
Do you know the first test I run when I analyze a site that doesn’t show up in AI answers? I don’t look at the content. I don’t look at the backlinks. I open the source code and look at the HTML structure. And in most cases, I find the same problem: headings that jump from the main title straight to a third-level subheading, sections with no landmarks, content blocks floating in the markup without any hierarchical relationship between them.
The thing is, to a human reader the page looks perfect. The design is clean, the fonts are right, the text flows well. But AI doesn’t see the design. It sees the code. And if the code doesn’t have a coherent semantic structure, the content loses its hierarchy — and content without hierarchy is content that AI has a harder time processing, breaking into useful chunks, and returning as an answer.
The markup AI actually reads
When I talk about semantic markup I’m not talking about code aesthetics or best practices for fussy developers. I’m talking about the way a RAG system — the kind that powers the answers from ChatGPT, Perplexity, Gemini — interprets the structure of your page to decide what to extract.
RAG systems convert pages into text and then break them into chunks. Structure-aware pipelines don’t cut at random: they use the document’s structural signals to understand where one concept ends and another begins. Section headings are among the strongest of these signals. A correct hierarchical heading structure (main title, subsections, sub-subsections) creates a map that the system can use to isolate self-contained blocks of information.
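To make the idea concrete, here is a minimal sketch of heading-based chunking using only Python’s standard library. It is not how any specific AI product works: the `HeadingChunker` class, the `chunk_by_headings` function and the sample page are all hypothetical names I made up for illustration, and real pipelines are far more sophisticated.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingChunker(HTMLParser):
    """Hypothetical chunker: starts a new chunk at every heading element."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # each chunk: {"level": int, "title": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self.chunks.append({"level": int(tag[1]), "title": "", "text": ""})

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.chunks:
            return  # skip whitespace and anything before the first heading
        key = "title" if self._in_heading else "text"
        sep = " " if self.chunks[-1][key] else ""
        self.chunks[-1][key] += sep + text


def chunk_by_headings(html):
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample page, for illustration only.
sample = """
<h1>Guide</h1><p>Intro text.</p>
<h2>Setup</h2><p>Install it.</p><p>Configure it.</p>
<h2>Usage</h2><p>Run it.</p>
"""
chunks = chunk_by_headings(sample)
for chunk in chunks:
    print(chunk["level"], chunk["title"], "->", chunk["text"])
```

Each heading opens a new block, so the text that follows it stays attached to the right title. Remove the headings from the sample and everything collapses into text with no boundaries.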
Volpini et al. in 2026 precisely defined the advantage of pages with rich semantic structure:
“Enhanced pages transform opaque entity URIs into readable, structured information by resolving linked relationships and presenting them as human-readable content.”
“Readable, structured information” — that’s the key. Pages that turn opaque information into readable, structured content are the ones AI systems can process with less ambiguity. And HTML semantic structure is the first layer of this transformation: without it, the content is flat, undifferentiated text, with no anchor points for retrieval.
Why JSON-LD isn’t enough
If you’ve read my article on structured data, you already know that JSON-LD has a paradox: it works for Google and Bing parsers, but it produces no measurable benefits in RAG systems. The same paper by Volpini et al. says so explicitly:
“JSON-LD markup remains valuable for search engines with dedicated parsers (Google, Bing), but it provides no measurable benefit in RAG-based systems that treat pages as flat text.”
That’s why HTML semantic markup becomes essential. JSON-LD lives in a <script> block, usually in the page head, and a text conversion that strips scripts never passes it to the RAG. Semantic markup, on the other hand, lives inside the text: it’s the headings that give hierarchy, and the <header>, <nav>, <main>, <footer> elements that define the logical boundaries of the content. When the system converts your page into flat text, these structural signals guide the segmentation.
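A small sketch shows why the JSON-LD disappears. I’m assuming a naive HTML-to-text step that discards <script> and <style> content, which is a common choice but not something I can confirm for any specific system; `FlatText`, `to_flat_text` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class FlatText(HTMLParser):
    """Naive HTML-to-text conversion that drops <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.parts.append(text)


def to_flat_text(html):
    parser = FlatText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up page: JSON-LD in the head, visible content in <main>.
page = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Acme"}</script>
</head><body>
<main><h1>About Acme</h1><p>We build widgets.</p></main>
</body></html>
"""
flat = to_flat_text(page)
print(flat)
```

Under that assumption, the structured data never reaches the text that gets chunked, while the heading and the paragraph do.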
The difference between a page with semantic markup and one without is the difference between a book with a table of contents and chapters and a wall of text with no breaks. Both contain the same words. But one is navigable, the other isn’t.
The data point that changes the perspective
When Volpini et al. compared pages with rich semantic structure (the “enhanced pages”) against those with JSON-LD only, the result was clear-cut:
“Enhanced pages exposed 2.4x more discoverable links than JSON-LD pages (102.2 vs. 41.9).”
2.4 times more discoverable links. This doesn’t just mean “more links on the page” — it means the system manages to discover and follow 2.4 times more connections when the HTML structure is semantically rich. The relationships between entities, the links between concepts, the cross-references become accessible because the structure makes them explicit.
In practical terms: if your page has correct hierarchical headings, landmarks that delimit the sections, and ARIA attributes where they’re needed to clarify the role of the components, the AI system manages to extract more useful information from the same amount of content. Not because the content is different, but because the structure makes it readable.
The mistakes I see most often
After analyzing hundreds of sites, the wrong patterns repeat themselves. The first is the heading that skips levels: from the main title you jump straight to a third-level heading because “visually the font was too big”. The problem is that the choice of heading shouldn’t depend on the design — that’s what CSS is for. The heading defines the logical hierarchy of the document, and if it skips a level, the system loses a step in the structure.
The second mistake is using <div> for everything. A site built entirely with generic divs is like a building with no signs: the rooms exist, but no one knows which is the entrance, which is the living room, which is the closet. Semantic tags — <header>, <nav>, <main>, <article>, <footer> — are those signs. They tell the system what each block contains before it even reads it.
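If you want a rough number for this, the sketch below counts generic <div> elements against the five semantic elements named above. The `TagTally` class, the `tally` function and both sample snippets are hypothetical, and the count is only a quick indicator, not a quality score.

```python
from collections import Counter
from html.parser import HTMLParser

# The five semantic elements mentioned in this article.
SEMANTIC = {"header", "nav", "main", "article", "footer"}
GENERIC = {"div"}


class TagTally(HTMLParser):
    """Counts opening tags of semantic elements versus generic divs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC:
            self.counts["semantic"] += 1
        elif tag in GENERIC:
            self.counts["generic"] += 1


def tally(html):
    """Return (semantic_count, generic_count) for the given markup."""
    parser = TagTally()
    parser.feed(html)
    parser.close()
    return parser.counts["semantic"], parser.counts["generic"]


# Two made-up versions of the same page.
div_soup = (
    '<div class="top"><div class="menu">Home</div></div>'
    '<div class="content"><div>Text</div></div>'
    '<div class="bottom">Contact</div>'
)
semantic_page = (
    "<header><nav>Home</nav></header>"
    "<main><article><div>Text</div></article></main>"
    "<footer>Contact</footer>"
)
print("div soup:", tally(div_soup))
print("semantic:", tally(semantic_page))
```

Both snippets carry the same three words of content. Only the second one says which block is navigation, which is the article and which is the footer.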
The third is the most insidious: headings used for decorative purposes. Section titles inserted as headings just because the CMS formats them a certain way, with no relationship to the content’s hierarchy. Every out-of-place heading is a false signal that confuses the segmentation.
And then there’s a fourth pattern that isn’t strictly a technical error, but causes the same damage: pages where the main content is drowned among sidebars, widgets, banners and repeated blocks. If the <main> tag contains more noise than signal, the chunk the system extracts will be diluted. The ratio between useful content and accessory markup matters — and a <main> landmark that wraps only the relevant content helps the system isolate what’s worth processing.
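One way to approximate that ratio is to measure how much of the page’s visible text sits inside <main>. This is a sketch under my own assumptions: `MainRatio` and `main_text_ratio` are hypothetical names, visible text is counted as non-whitespace characters outside <script> and <style>, and no threshold is implied.

```python
from html.parser import HTMLParser


class MainRatio(HTMLParser):
    """Counts visible characters inside <main> versus on the whole page."""

    SKIP = ("script", "style")

    def __init__(self):
        super().__init__()
        self.main_depth = 0
        self.skip_depth = 0
        self.main_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len("".join(data.split()))  # non-whitespace characters only
        self.total_chars += n
        if self.main_depth:
            self.main_chars += n


def main_text_ratio(html):
    """Share of visible text inside <main>; 0.0 if the page has no text."""
    parser = MainRatio()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.main_chars / parser.total_chars


# Made-up page with navigation, main content and a sidebar.
page = (
    "<body><nav>Home Blog Contact</nav>"
    "<main><h1>Title</h1><p>Useful body text here.</p></main>"
    "<aside>Ads and widgets</aside></body>"
)
ratio = main_text_ratio(page)
print(f"{ratio:.0%} of the visible text is inside <main>")
```

A page with no <main> at all scores 0.0, which is a useful warning in itself. Note that this measures what sits outside <main>; noise placed inside it still needs a manual look.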
What you can check right away
Open the source code of your main pages and check three things. First: is there a single main title (one <h1>) per page? Second: do the section titles follow a hierarchical order with no jumps? Third: is the main content wrapped in a <main> tag, or at least in an element with an explicit role="main"?
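These three checks are easy to script. The sketch below is a rough illustration with hypothetical names (`StructureAudit`, `audit`) and made-up sample markup. It treats a "jump" as a heading more than one level deeper than the one before it, and it accepts either a <main> element or role="main".

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Records heading levels in document order and looks for a main landmark."""

    def __init__(self):
        super().__init__()
        self.levels = []
        self.has_main = False

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))
        elif tag == "main" or dict(attrs).get("role") == "main":
            self.has_main = True


def audit(html):
    parser = StructureAudit()
    parser.feed(html)
    parser.close()
    levels = parser.levels
    # A jump is any heading more than one level deeper than its predecessor.
    jumps = [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1]
    return {
        "single_h1": levels.count(1) == 1,
        "no_jumps": not jumps,
        "has_main": parser.has_main,
        "jumps": jumps,
    }


# Made-up examples: one page that skips a level, one that passes.
broken = '<h1>Title</h1><h3>Sub</h3><div role="main">Body</div>'
clean = "<main><h1>Title</h1><h2>A</h2><h3>B</h3><h2>C</h2></main>"
broken_report = audit(broken)
clean_report = audit(clean)
print(broken_report)
print(clean_report)
```

Going back up the hierarchy (from a third-level heading to a second-level one) is not flagged, because closing a subsection and opening a new section is normal. Only skipping a level on the way down counts as a jump.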
If even one of these checks fails, the AI system is working harder than necessary to understand the structure of your content. You can use the HeadingsMap browser extension to view the page’s heading tree in a second — it instantly shows you whether there are jumps or inconsistencies in the hierarchy.
This is a first step to spotting the surface-level problems. But semantic structure goes beyond headings: landmarks, ARIA attributes, template organization and the ratio between content and accessory markup are all interventions that require specific technical expertise and an overall vision of how the site communicates with AI crawlers.
The thread that holds it all together
Semantic markup isn’t an isolated aspect. It connects to everything I’ve talked about in the articles on crawlability — because if the crawler reaches your page but finds a flat structure, the content it extracts will be less usable. It connects to page experience because the technical quality signals add up. It connects to HTTPS because technical trust is a package, not a single factor.
And if you think structured data in JSON-LD solves the problem, I invite you to reread what I wrote about the dual strategy: JSON-LD speaks to the parsers, semantic markup speaks to the AI. You need both. But if you have to choose where to start, start from the HTML structure — because that’s what the RAG system reads first.
The next step is to understand how content freshness fits into all of this: a perfect semantic structure on outdated content is still a problem. But one by one, these optimizations build a technical profile that AI engines recognize as reliable — and from which they prefer to extract their answers.