Technical SEO

Robots.txt File

Last reviewed May 2026

A robots.txt file is a plain-text file placed at the root of a domain — accessible at `https://example.com/robots.txt` — that uses the Robots Exclusion Protocol (REP, formalized as RFC 9309 in 2022) to communicate crawling instructions to web bots. It declares which user agents may crawl which paths via `Allow:` and `Disallow:` directives, and typically also references the XML sitemap.
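As a minimal sketch of how a client reads these directives, the snippet below parses a made-up robots.txt with Python's standard-library `urllib.robotparser`. The file content and URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/users"))  # False
```

Note that `urllib.robotparser` applies the first matching rule and does not implement wildcards, so on complex files its answers can differ from those of major search crawlers.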

Robots.txt is a request, not an enforcement mechanism. Compliant crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, CCBot, and the major SEO tools — respect it. Non-compliant scrapers can and do ignore it. For genuine access control, sensitive paths must be protected at the application layer (authentication, IP allowlisting), not via robots.txt.

Pattern matching follows specific rules: directives match by URL path prefix, `*` matches any sequence of characters, `$` anchors the end of the URL, and within a user-agent group the most specific (longest) matching rule wins, with `Allow` preferred when an `Allow` and a `Disallow` rule are equally specific. Crawlers also select a single user-agent group rather than merging groups: Googlebot does not inherit rules from the `User-agent: *` group if a `User-agent: Googlebot` group exists.
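The matching rules can be sketched in a few lines. This is an illustrative matcher written for this entry, not the parser any particular crawler uses; the function names and the sample rules are assumptions.

```python
import re

def _pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any run of characters; a trailing "$" anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """Decide whether path is crawlable under one user-agent group.

    rules is a list of ("allow" | "disallow", pattern) tuples.
    The longest matching pattern wins; on a tie, "allow" wins.
    """
    best_len = -1
    allowed = True  # No matching rule means the path may be crawled.
    for kind, pattern in rules:
        if not pattern:
            continue  # An empty pattern matches nothing.
        # re.match anchors at the start, which gives prefix matching.
        if _pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed

# Hypothetical rules for a single group.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/press/"),
    ("disallow", "/*.pdf$"),
]

print(is_allowed(RULES, "/private/notes"))              # False
print(is_allowed(RULES, "/private/press/2024"))         # True (longer Allow wins)
print(is_allowed(RULES, "/docs/guide.pdf"))             # False
print(is_allowed(RULES, "/docs/guide.pdf?download=1"))  # True ("$" no longer matches)
```

Pattern length is measured here in characters of the rule as written, which is a simplification; real parsers also normalize percent-encoding before comparing.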

Common configuration errors include accidentally blocking critical paths (for example `/api/` when pages load their content from the API at render time), forgetting to update the file after a CMS migration, blocking CSS or JavaScript assets that Google needs to render the page (which can lead to incomplete rendering and indexing), and using robots.txt to suppress duplicate content instead of canonical tags or noindex meta tags. The last one backfires because a crawler that cannot fetch a page also cannot see its canonical or noindex tag.

Validation tools include Google Search Console's robots.txt report, the robots.txt testing tool on technicalseo.com, Google's open-source robots.txt parser, and the robots.txt testing features in Screaming Frog SEO Spider. After every robots.txt change, check server access logs for unexpected drops in crawl rate on critical URLs.
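The log check can be sketched as a small script. It assumes the common "combined" access-log format and identifies crawlers by user-agent substring; the function names, the 50% threshold, and the sample lines are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumes the "combined" access-log format; adjust the regex for other formats.
LOG_LINE = re.compile(
    r'\[(?P<day>[^:\]]+):[^\]]*\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def crawler_hits_per_day(log_lines, agent_token, path_prefix):
    """Count requests per day whose user agent contains agent_token
    and whose path starts with path_prefix."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # Skip lines that are not in the assumed format.
        if agent_token.lower() in match["ua"].lower() and match["path"].startswith(path_prefix):
            hits[match["day"]] += 1
    return hits

def flag_drops(hits, ordered_days, threshold=0.5):
    """Return days whose count fell below threshold times the previous day's count."""
    flagged = []
    for previous, current in zip(ordered_days, ordered_days[1:]):
        before, after = hits.get(previous, 0), hits.get(current, 0)
        if before > 0 and after < before * threshold:
            flagged.append(current)
    return flagged

def sample_line(day, path):
    """Build a made-up log line for demonstration."""
    return (f'66.249.66.1 - - [{day}:06:25:14 +0000] "GET {path} HTTP/1.1" '
            '200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

lines = [sample_line("10/May/2026", "/blog/a")] * 4 + [sample_line("11/May/2026", "/blog/a")]
hits = crawler_hits_per_day(lines, "Googlebot", "/blog/")
print(flag_drops(hits, ["10/May/2026", "11/May/2026"]))  # ['11/May/2026']
```

User-agent strings can be spoofed, so a production check would also verify the crawler's identity, for example with the reverse DNS lookup that Google documents for Googlebot.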

Why it matters in GEO / AI search

For GEO, robots.txt is the primary file that gates your AI citation surface; llms.txt, a proposed convention, complements it but is advisory. If GPTBot, ClaudeBot, PerplexityBot, or CCBot is disallowed, the corresponding compliant crawler will not fetch your pages, so the system behind it cannot retrieve, train on, or cite freshly crawled copies of your content. Many sites have inherited blanket disallows from old security recommendations and now silently exclude themselves from AI search.
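One way to catch an inherited blanket disallow is to test each AI crawler token against the live file's content. The sketch below uses Python's standard-library `urllib.robotparser`; the crawler list comes from the paragraph above, while the function name and the sample file are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in the text above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"]

def blocked_ai_crawlers(robots_txt, url="https://example.com/"):
    """Return the AI crawler tokens that may not fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]

# Hypothetical file with a leftover "block scrapers" group.
SAMPLE = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_ai_crawlers(SAMPLE))  # ['CCBot']
```

Because `urllib.robotparser` uses first-match semantics without wildcard support, treat the result as a quick screen and confirm edge cases with a crawler-specific testing tool.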

The single highest-leverage decision in robots.txt for a GEO-focused site is allowing CCBot. Common Crawl data has been used in the training sets of models such as GPT-3 and LLaMA, and many open-source LLMs train on corpora derived from it. Blocking CCBot doesn't just hide your content from one crawler; it can keep you out of an upstream corpus that many downstream models draw on for years. For a new domain with no Common Crawl footprint yet, allowing CCBot is a low-effort, high-payoff allowlist change.

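The decision on Anthropic's crawlers is more nuanced, and the user-agent tokens are easy to mix up. According to Anthropic's published crawler documentation, `ClaudeBot` collects content that may be used for model training, while `Claude-User` and `Claude-SearchBot` handle user-initiated retrieval and search indexing; `anthropic-ai` is an older token that still appears in many templates. Allowing the training crawler means your content may inform future Claude model training. Disallowing it protects IP from training-corpus inclusion but, as long as the retrieval agents remain allowed, does not affect runtime citations. For agencies positioning around AI visibility, allowing aligns with the value proposition; for businesses with strict IP concerns, disallowing is defensible.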

Examples

GEO-aligned baseline for an agency site

A `User-agent: *` group that allows everything, explicit `Allow: /` groups for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, GoogleOther, CCBot, and anthropic-ai, and a sitemap reference at the bottom. This keeps the site open to the major AI crawlers and makes that intent explicit.
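A baseline like this can be generated rather than hand-edited, which makes the allowlist easy to review. The sketch below is one possible layout; the function name and the example.com sitemap URL are assumptions, and the agent list is taken from the description above.

```python
# Agent tokens listed in the baseline description above.
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot",
    "GoogleOther", "CCBot", "anthropic-ai",
]

def build_robots_txt(ai_agents, sitemap_url):
    """Return robots.txt text with an allow-all default group,
    one explicit allow group per AI agent, and a sitemap line."""
    lines = ["User-agent: *", "Allow: /", ""]
    for agent in ai_agents:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"

print(build_robots_txt(AI_AGENTS, "https://example.com/sitemap.xml"))
```

The explicit groups are redundant with the allow-all default from a parsing standpoint; their value is documentation, since they make it harder for a later template change to block one of these agents unnoticed.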

Common mistake: blocking all AI crawlers

A `Disallow: /` group for CCBot, inherited from a generic "block AI scrapers" template, keeps the site out of future Common Crawl snapshots and therefore out of training corpora built on them. This is a major GEO regression that can go unnoticed for months because nothing visibly breaks.

Application-layer protection, not robots.txt

Use authentication or signed URLs for genuinely sensitive paths. Listing `/admin/` in robots.txt advertises its existence to bad actors without preventing access.

Sitemap reference

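Include `Sitemap: https://example.com/sitemap.xml` in robots.txt. The directive is independent of user-agent groups, so it can appear anywhere in the file, though the bottom is conventional. It is a simple, widely supported way to help crawlers discover new content; whether a given AI crawler uses it is up to that crawler.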
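To confirm that the sitemap reference is present and parseable, the declared URLs can be read back with `RobotFileParser.site_maps()`, available in Python 3.8 and later. The helper name and sample content below are assumptions.

```python
from urllib.robotparser import RobotFileParser

def sitemap_urls(robots_txt):
    """Return the sitemap URLs declared in robots.txt text (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

SAMPLE = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

print(sitemap_urls(SAMPLE))  # ['https://example.com/sitemap.xml']
```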

Authority Links

RFC 9309 — Robots Exclusion Protocol

The IETF standard that formalized the robots.txt protocol in 2022.

Google Search Central — Intro to robots.txt

Google's official guidance on writing and validating robots.txt files.

Google Search Central — Robots.txt specifications

Detailed parsing rules: precedence, wildcards, encoding.

Related Terms

Allow Command

A robots.txt directive that explicitly permits crawlers to fetch URLs matching the given path.

Crawl Budget

The number of URLs a crawler will fetch on a site within a given timeframe — a function of server capacity (crawl rate limit) and content popularity/freshness (crawl demand).

Disallow Command

A robots.txt directive that asks crawlers not to fetch URLs matching the given path. It controls crawling, not indexing.

Sitemap Command

The robots.txt directive that tells crawlers where the XML sitemap is located.
