BrandGEO
SEO Tutorials · 7 min read

Auditing Your Own Site for AI: robots.txt, llms.txt, JSON-LD, and the Four Gates of Citation

Before you chase mentions on other people's sites, make sure an AI model can actually crawl, rank, read, and attribute your own. Here's the audit, gate by gate.

Most AI-visibility advice points outward — earn citations, get on Wikipedia, court the review platforms. All worthwhile. But there's a cheaper, faster lever sitting right under you: your own website. If a model can't retrieve your pages, can't rank them, can't extract clean claims from them, or can't attribute those claims back to you, no amount of off-site work fully compensates. This is a practitioner's walkthrough of the on-site AI audit — the files and signals that matter, organized around the four gates an answer has to pass through to cite you.

When an AI model cites your brand in an answer, that citation survived a gauntlet. The model had to be able to reach your page, choose it over alternatives, understand what it said, and connect the claim back to you by name. Four gates. Fail any one and you're invisible at the surface, no matter how good the content behind the gate is.

The useful thing about the four-gate framing is that it turns a vague goal ("be more visible to AI") into a sequence of concrete, checkable on-site fixes — most of which live in a handful of files you already control. This post walks through each gate, the signals that open it, and how to prioritize the work. It pairs naturally with Schema Markup for LLMs: 7 Elements That Matter, 12 That Don't, which goes deeper on the structured-data specifics.

Gate 1 — Retrieval: can a model reach your pages at all?

Retrieval is the dumbest and most common failure. If a model's crawler or its search layer can't fetch your page, nothing downstream matters.

robots.txt. Open yourdomain.com/robots.txt and read it like an adversary. Two questions: are your important pages allowed, and are AI crawlers blocked? Many sites added blanket disallows for AI bots in 2023–2024 out of caution, then forgot. If you want to be cited, the major AI user-agents need access to the pages that describe you. Decide deliberately — blocking them is a legitimate choice, but it should be a choice, not a leftover.
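One way to make that adversarial read repeatable is a short script. The sketch below uses Python's standard-library robots.txt parser against an inline sample file. The sample rules, the example.com URLs, and the list of AI user-agent tokens are assumptions for illustration; check each vendor's documentation for the tokens it actually uses, and feed in your own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, load yourdomain.com/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

# Assumed AI crawler tokens; verify against each vendor's published docs.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def audit_robots(robots_text, urls, agents=AI_AGENTS):
    """Return {agent: [urls that agent is not allowed to fetch]}."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    blocked = {}
    for agent in agents:
        denied = [u for u in urls if not parser.can_fetch(agent, u)]
        if denied:
            blocked[agent] = denied
    return blocked


if __name__ == "__main__":
    pages = ["https://example.com/", "https://example.com/pricing"]
    for agent, denied in audit_robots(SAMPLE_ROBOTS, pages).items():
        print(f"{agent} is blocked from: {denied}")
```

An empty result means none of the listed agents is blocked from the pages you passed in. It says nothing about agents you didn't list.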

Sitemap. A current sitemap.xml, referenced from robots.txt, is the cheapest way to tell crawlers what exists and what changed. Stale sitemaps that list dead URLs or omit your key pages quietly suppress retrieval. Regenerate it on publish, not once a year.
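A sitemap check can be scripted the same way. This is a minimal sketch that parses a sitemap, reports key pages that are missing from it, and flags entries whose lastmod is older than a threshold. The sample XML, the one-year threshold, and the assumption that every lastmod starts with a full YYYY-MM-DD date are all illustrative, and it does not test whether listed URLs are dead.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-05-30</lastmod></url>
  <url><loc>https://example.com/old-page</loc><lastmod>2023-01-15</lastmod></url>
</urlset>
"""


def audit_sitemap(xml_text, required_urls, today, max_age_days=365):
    """Return (missing_urls, stale_urls) for a sitemap document."""
    root = ET.fromstring(xml_text)
    listed = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        listed[loc] = lastmod
    missing = [u for u in required_urls if u not in listed]
    stale = []
    for loc, lastmod in listed.items():
        # Assumes lastmod begins with YYYY-MM-DD; entries without one are skipped.
        if lastmod and (today - date.fromisoformat(lastmod[:10])).days > max_age_days:
            stale.append(loc)
    return missing, stale


if __name__ == "__main__":
    key_pages = ["https://example.com/", "https://example.com/pricing"]
    missing, stale = audit_sitemap(SAMPLE_SITEMAP, key_pages, today=date(2026, 6, 6))
    print("Missing from sitemap:", missing)
    print("Possibly stale:", stale)
```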

llms.txt. A newer convention (see llmstxt.org): a plain-Markdown file at yourdomain.com/llms.txt that gives AI agents a curated map of your most important pages, with short descriptions, plus an optional llms-full.txt that aggregates the actual content. Think of it as a sitemap written for language models instead of search crawlers: it tells an agent which pages matter and what they're about without making it guess from your navigation. It's not yet universally consumed, so treat it as a cheap, forward-looking bet rather than a guaranteed signal.
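For a sense of what the file looks like, here is a small sketch that assembles one in the layout described at llmstxt.org: a title heading, a one-line summary as a blockquote, then sections of annotated links. The brand name, URLs, and descriptions are hypothetical, and build_llms_txt is an illustrative helper, not part of any published tool.

```python
def build_llms_txt(name, summary, sections):
    """Assemble llms.txt content.

    sections maps a section heading to a list of (title, url, description).
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, description in links:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical brand and pages, for illustration only.
    print(build_llms_txt(
        "Acme",
        "Acme is a payroll platform for restaurants with 10-200 employees.",
        {
            "Product": [
                ("Pricing", "https://example.com/pricing", "Plans and per-employee prices"),
                ("Who we're for", "https://example.com/for-restaurants", "Fit and use cases"),
            ],
            "Comparisons": [
                ("Acme vs alternatives", "https://example.com/compare", "Feature comparison"),
            ],
        },
    ))
```

Generating the file from the same page list that feeds your sitemap keeps the two from drifting apart.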

Rendering. If your key content only appears after client-side JavaScript runs, assume some crawlers won't see it. The claims a model needs to cite should be present in the served HTML, not assembled in the browser.
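A quick way to test this is to take the raw HTML your server returns, before any JavaScript runs, and check whether your key claims appear in its visible text. The sketch below works on an HTML string you supply; the sample markup and claims are hypothetical. It ignores text inside script and style tags, since a claim that only exists as a JavaScript string is exactly the failure being checked for.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def claims_in_served_html(html, claims):
    """Return {claim: True if it appears in the served HTML's visible text}."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace and lowercase so line breaks and casing don't matter.
    text = " ".join(" ".join(parser.parts).split()).lower()
    return {c: " ".join(c.split()).lower() in text for c in claims}


if __name__ == "__main__":
    # Hypothetical served HTML: the main claim only exists inside a script.
    served = """<html><body><div id="root"></div>
    <script>render("Acme is a payroll platform for restaurants")</script>
    <h1>Acme</h1><p>Payroll, reimagined.</p></body></html>"""
    print(claims_in_served_html(served, [
        "Acme is a payroll platform for restaurants",
        "Payroll, reimagined.",
    ]))
```

Because text from adjacent tags is joined with spaces, a claim split across inline elements may not match exactly. Treat a miss as a prompt to view source, not as proof.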

Retrieval fixes are almost always P0: do them this week. They're small, mechanical, and they gate everything else.

Gate 2 — Ranking: does your page get chosen?

A model that browses doesn't read the whole web. It runs a search, takes the top handful of results, and grounds its answer in those. So for any question about you, the practical question is: does your page make the shortlist?

This is where AI visibility overlaps most with classical SEO, and the overlap is real. The pages that rank for "[your category] for [use case]" or "[your brand] vs [competitor]" are the pages a browsing model will read. If a competitor's comparison page outranks your own positioning page, the model grounds its answer in their framing of you — a failure mode we describe in The Visibility Gap: Why Your Brand Scores Differently in AI's Memory vs Live Search.

The on-site work here is conventional but pointed: make sure you actually have pages targeting the questions buyers ask a model, and that those pages are strong enough to rank. A brand with no comparison page, no clear "who we're for" page, and no answers to obvious category questions has nothing for the model to choose, regardless of how good its homepage looks.

Ranking work spans P1 and P2 — some pages you can ship and rank this quarter; durable authority for competitive queries compounds over the year.

Gate 3 — Extraction: can the model understand what it read?

Now the model has your page open. Can it pull clean, quotable, accurate claims from it? Extraction failures are subtle because the page looks fine to a human.

Plain, declarative claims. Models lift sentences that stand on their own. "Acme is a payroll platform for restaurants with 10–200 employees" extracts cleanly. "Reimagine the way you do people ops" does not — it's a vibe, not a fact. Every page that matters should contain a few unambiguous, self-contained statements about what you are, who you serve, and why you're credible.

JSON-LD structured data. This is the highest-signal extraction aid you control. Organization schema tells a model your name, URL, logo, founding date, and social profiles as data rather than prose it has to infer. Product/Service, FAQPage, and Article (with author and date) schema do the same for your offerings, answers, and content. Done right, structured data is the difference between a model guessing and a model knowing. Schema Markup for LLMs covers which types pay off and which are theater. Don't bulk-add everything; add the few that describe entities and answers.
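As an illustration, an Organization block can be generated from a handful of fields and pasted into the page head. This sketch uses schema.org's Organization properties (name, url, logo, foundingDate, sameAs); the company details are hypothetical, and organization_jsonld is an illustrative helper, not a standard function.

```python
import json


def organization_jsonld(name, url, logo, founding_date, same_as):
    """Return a <script> tag containing schema.org Organization JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "logo": logo,
        "foundingDate": founding_date,  # ISO 8601 date, e.g. "2018-03-01"
        "sameAs": list(same_as),        # official profiles on other sites
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    # Hypothetical company details, for illustration only.
    print(organization_jsonld(
        name="Acme",
        url="https://example.com",
        logo="https://example.com/logo.png",
        founding_date="2018-03-01",
        same_as=["https://www.linkedin.com/company/example"],
    ))
```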

Structure and headings. Real headings, short paragraphs, and lists give a model the scaffolding to find the answer to a specific question. A wall of text buries the one sentence that mattered.

Freshness signals. Visible and structured dates help a model trust that a claim is current. A page making a 2026 claim with no date, in HTML that looks like 2019, gets discounted.

Extraction work is a mix of P0 and P1: adding Organization and core entity schema is fast and high-leverage; rewriting key pages for declarative clarity takes a bit longer.

Gate 4 — Attribution: does the claim get tied back to you?

The cruellest failure: the model reads your content, uses it, and attributes it to someone else — or to no one. Your insight becomes "experts say," your data becomes an unsourced statistic, your framing becomes a competitor's talking point.

Attribution is about making your name inseparable from your claims.

Name your claims. Put your brand name in the same breath as the facts you want associated with you — in headings, in the claim sentences, in image alt text and captions. "According to Acme's 2026 payroll benchmark…" travels with attribution; "the 2026 payroll benchmark…" does not.

Author and entity bylines. Bylines with real author entities (and Person/Article schema) signal that a human and an organization stand behind the content. This matters more in categories where models are cautious about authority — see GEO for Fintech: Earning LLM Trust in a Category Full of Scam Warnings for how this plays out under scrutiny.
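If your articles already carry Article JSON-LD, a small check can flag the ones missing the fields that tie content to a person and an organization. The field list below is an assumption about what matters for attribution, drawn from schema.org's Article properties. It is not a validator for the full schema, and the sample article is hypothetical.

```python
def missing_attribution_fields(article):
    """Return the attribution-related fields absent from an Article JSON-LD dict."""
    missing = []

    author = article.get("author") or {}
    if isinstance(author, list):  # schema.org allows several authors
        author = author[0] if author else {}
    if not isinstance(author, dict) or not author.get("name"):
        missing.append("author.name")

    publisher = article.get("publisher") or {}
    if not isinstance(publisher, dict) or not publisher.get("name"):
        missing.append("publisher.name")

    for field in ("headline", "datePublished", "dateModified"):
        if not article.get(field):
            missing.append(field)
    return missing


if __name__ == "__main__":
    # Hypothetical Article markup with no publisher and no dateModified.
    article = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": "Acme's 2026 payroll benchmark",
        "author": {"@type": "Person", "name": "Jane Doe"},
        "datePublished": "2026-06-06",
    }
    print(missing_attribution_fields(article))
```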

Consistent entity identity. Same name, same URL, same descriptors everywhere — your site, your schema, your social profiles, your third-party listings. Inconsistent identity splits your reputation across half-entities the model can't confidently merge.
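A consistency check is easy to sketch once you have collected how each source describes you. The function below compares fields across sources and reports any field with more than one distinct value. The source names and values are hypothetical, and the URL normalization (lowercasing, dropping a trailing slash) is a deliberately crude assumption.

```python
def identity_mismatches(profiles):
    """Compare identity fields across sources.

    profiles maps a source name to a dict such as {"name": ..., "url": ...}.
    Returns {field: {value: [sources using that value]}} for fields where
    the sources disagree.
    """
    def normalize(field, value):
        value = value.strip()
        if field == "url":
            value = value.lower().rstrip("/")
        return value

    fields = sorted({f for profile in profiles.values() for f in profile})
    mismatches = {}
    for field in fields:
        seen = {}
        for source, profile in profiles.items():
            if field in profile:
                seen.setdefault(normalize(field, profile[field]), []).append(source)
        if len(seen) > 1:
            mismatches[field] = seen
    return mismatches


if __name__ == "__main__":
    # Hypothetical identity data gathered by hand from each source.
    print(identity_mismatches({
        "website": {"name": "Acme Payroll", "url": "https://example.com/"},
        "schema": {"name": "Acme Payroll", "url": "https://example.com"},
        "linkedin": {"name": "Acme Inc.", "url": "https://example.com"},
    }))
```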

Attribution work is largely P1/P2 — it's the compounding, brand-and-PR-flavored layer, the same upstream-authority idea behind earning off-site citations.

Turning the gates into a prioritized plan

The four gates aren't just a diagnostic; they're a natural priority order. Retrieval and core extraction are cheap, mechanical, and unblock everything — do them first. Ranking and attribution are where the durable, compounding work lives.

A workable sequence:

  • P0 — this week. Fix robots.txt access, ship a current sitemap, publish an llms.txt, and add Organization JSON-LD. Mechanical, fast, foundational.
  • P1 — this quarter. Add Product/FAQ/Article schema to key pages, rewrite your top positioning and comparison pages for declarative clarity and named claims, and fill obvious gaps in the questions you answer.
  • P2 — this year. Build the comparison and category-authority pages that earn rankings for competitive queries, establish consistent author and entity identity, and align it with your off-site citation work.

That P0/P1/P2 shape isn't arbitrary: it's how a good recommendations pass orders the work, with fast unblockers first and compounding plays last.

You don't have to run all four gates by hand

Most of this is checkable: fetch robots.txt, look for a sitemap and llms.txt, view-source for JSON-LD, read your key pages for declarative claims, search the queries that matter and see who ranks. Walking the four gates once, by hand, is the single most useful afternoon you can spend on AI visibility — and far cheaper than the off-site work most teams jump to first.
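The view-source step for JSON-LD is the easiest of these to automate. As a rough sketch, the parser below pulls every application/ld+json script out of an HTML string and lists the schema types it declares. The sample page is hypothetical, and the code reads only top-level objects and arrays, so types nested under @graph are not reported.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                self.blocks.append(None)  # malformed block, worth fixing


def jsonld_types(html):
    """Return the @type of each top-level JSON-LD item found in the HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    types = []
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and "@type" in item:
                types.append(item["@type"])
    return types


if __name__ == "__main__":
    # Hypothetical page source, for illustration only.
    page = """<html><head>
    <script type="application/ld+json">{"@context": "https://schema.org", "@type": "Organization", "name": "Acme"}</script>
    <script type="application/ld+json">{"@context": "https://schema.org", "@type": "FAQPage", "mainEntity": []}</script>
    </head><body></body></html>"""
    print(jsonld_types(page))
```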

If you'd rather have it done for you, BrandGEO's audit ends with exactly this pass: an agent crawls your live site — robots.txt, sitemap, llms.txt, JSON-LD, metadata, bylines — and returns a gated, prioritized action plan with ready-to-paste structured data and a 90-day roadmap. You can run a free audit in about two minutes, on a seven-day trial with no credit card required.
