Before a search engine crawler fetches pages from your site, it requests one specific file: robots.txt. The bot reads this file (or a recently cached copy of it) to learn what it is allowed to crawl and what to skip before it touches a single page.

Getting robots.txt right is one of the highest-leverage improvements you can make for SEO. Getting it wrong, which is surprisingly easy, can silently remove your entire site from Google. This guide covers basic syntax, blocking AI training crawlers, crawl budget optimization, and the most common mistakes.

What Is robots.txt?

robots.txt is a plain-text file at the root of your domain — always at https://yourdomain.com/robots.txt, never in a subdirectory. It follows the Robots Exclusion Protocol (REP), a voluntary standard that well-behaved crawlers respect.

robots.txt vs noindex — Know the Difference

This is the most common source of confusion in SEO:

  • robots.txt controls crawling — whether a bot visits a URL at all.
  • noindex (meta tag or HTTP header) controls indexing — whether a visited page appears in search results.

A page blocked by robots.txt can still appear in search results if other sites link to it. Google can index a URL it has never crawled if it discovers the URL from external links. If you need a page definitively removed from search results, use noindex — not robots.txt alone.
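
To make the distinction concrete, here is a minimal Python sketch that checks whether a response carries a noindex signal, either in the X-Robots-Tag HTTP header or in a robots meta tag. The function name has_noindex and its inputs (a headers dictionary and an HTML string) are assumptions made for illustration, and the check ignores crawler-specific variants such as a googlebot meta tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the header or the HTML asks search engines not to index the page."""
    # Header names are case-insensitive, so normalize them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    header_directives = [d.strip().lower() for d in lowered.get("x-robots-tag", "").split(",")]

    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

A crawler can only see either signal if it is allowed to fetch the page, which is why a noindex page must not also be blocked in robots.txt.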

robots.txt Syntax and Directives

The file is line-based. Each directive is a keyword followed by a colon and a value. Blank lines separate rule blocks.

User-agent

Specifies which crawler the following rules apply to.

User-agent: *          # All crawlers
User-agent: Googlebot  # Google only
User-agent: GPTBot     # OpenAI training crawler only

A single asterisk (*) matches every crawler that does not have a more specific block. If a crawler has its own named block, it follows only that block and ignores the wildcard rules entirely, so any shared rules must be repeated inside the named block.
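
The selection logic can be sketched in a few lines of Python. This is a simplified illustration, not any crawler's actual implementation: the select_group function and the dictionary layout are assumptions, and token matching is reduced to a case-insensitive substring test.

```python
def select_group(groups, crawler_name):
    """Pick the single rule group a crawler should obey.

    groups maps a User-agent token (as written in robots.txt) to its rules.
    The longest token contained in the crawler name wins; the "*" group is
    used only when no named group matches.
    """
    name = crawler_name.lower()
    matches = [token for token in groups if token != "*" and token.lower() in name]
    if matches:
        return groups[max(matches, key=len)]
    return groups.get("*", [])


groups = {
    "*": ["Disallow: /private/"],
    "Googlebot": ["Disallow: /drafts/"],
}
googlebot_rules = select_group(groups, "Googlebot")  # only the named group
bingbot_rules = select_group(groups, "Bingbot")      # falls back to "*"
```

In this example Googlebot receives only the /drafts/ rule, so /private/ stays crawlable for it unless that rule is repeated in the Googlebot block.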

Disallow and Allow

Disallow blocks every URL whose path starts with the given value. Allow explicitly permits a path within a broader Disallow rule. When both match a URL, the most specific (longest) rule wins; Google breaks a tie in favor of Allow.

# Each Disallow line below is a separate example; do not deploy them together
User-agent: *
Disallow: /admin/         # Block /admin/ and all sub-paths
Disallow: /search?        # Block parameterized search URLs
Disallow: /               # Block ENTIRE site (dangerous!)
Disallow:                 # Empty = allow everything

# Allow one page inside a blocked directory
Disallow: /private/
Allow: /private/public-page.html

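Path matching is prefix-based. Disallow: /blog blocks /blog, /blog/, and /blog/post-1, but also /blog-archive and /blogroll. Trailing slashes matter: /admin blocks every path that starts with /admin (including /admin.html and /administrator), while /admin/ blocks only URLs inside the directory and leaves the bare /admin URL crawlable.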
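
The following Python sketch illustrates prefix matching together with the longest-match precedence used by Google and described in RFC 9309. The is_allowed function and the (directive, pattern) rule format are assumptions for illustration, and the sketch deliberately ignores the * and $ wildcards.

```python
def is_allowed(rules, path):
    """Apply (directive, pattern) rules to a URL path by longest-match precedence.

    Plain prefix matching only; the * and $ wildcards are not handled here.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue  # empty patterns and non-matching prefixes are skipped
        is_allow = directive.lower() == "allow"
        # The longer pattern wins; on a tie, Allow beats Disallow.
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len = len(pattern)
            allowed = is_allow
    return allowed


rules = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public-page.html"),
    ("Disallow", "/admin"),
]
```

With these rules, /private/notes.html is blocked, /private/public-page.html is allowed because the Allow pattern is longer, and both /admin/users and /administrator are blocked because each starts with /admin.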

→ Generate valid syntax automatically: robots.txt Generator | Validate your file: robots.txt Checker

Sitemap

Tells crawlers the absolute URL of your XML sitemap. Not a crawl rule — a discovery hint.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Crawl-delay

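Requests that the crawler wait N seconds between requests. Bingbot honors this, support among other crawlers varies, and Googlebot ignores it. Google retired the Search Console crawl rate setting in early 2024; Googlebot now adjusts its rate automatically and slows down when your server returns 500, 503, or 429 responses.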

User-agent: Bingbot
Crawl-delay: 10
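
As a quick way to confirm how a parser reads the Sitemap and Crawl-delay directives, the sketch below feeds a sample file to urllib.robotparser from Python's standard library. The file content is an assumed example, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from a string, no network request

bing_delay = parser.crawl_delay("Bingbot")     # seconds requested for Bingbot
other_delay = parser.crawl_delay("Googlebot")  # the * group sets no delay here
sitemaps = parser.site_maps()                  # list of Sitemap URLs
```

Here bing_delay is 10, other_delay is None, and sitemaps lists both sitemap URLs in file order.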

Common Configuration Patterns

Allow everything — recommended starting point

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Block admin and authentication pages

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /login
Disallow: /dashboard/

Sitemap: https://example.com/sitemap.xml

Block low-value URL patterns (large sites)

User-agent: *
Disallow: /search?
Disallow: /tag/
Disallow: /*?sort=      # /?sort= alone would match only the homepage
Disallow: /*?filter=
Disallow: /page/

Sitemap: https://example.com/sitemap.xml

Blocking AI Training Crawlers

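Since 2023, a new category of crawlers has become prominent: bots that collect web content to train large language models. Several major AI companies, including OpenAI, Anthropic, and Google, have published their crawlers' User-agent names and state that they respect robots.txt. Compliance is voluntary, so treat these rules as a request rather than an enforcement mechanism.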

User-agent       Company        Purpose
GPTBot           OpenAI         ChatGPT / GPT model training
ChatGPT-User     OpenAI         ChatGPT browsing plugin
ClaudeBot        Anthropic      Claude model training
anthropic-ai     Anthropic      Anthropic web research
Google-Extended  Google         Gemini / Bard training
PerplexityBot    Perplexity AI  Perplexity AI search
Bytespider       ByteDance      TikTok AI / LLM training
CCBot            Common Crawl   AI training datasets

Block all AI training crawlers while keeping search crawlers

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

This blocks AI training crawlers while keeping Googlebot, Bingbot, and other search crawlers fully active — your site stays indexed in search results while preventing AI companies from ingesting your content as training data.
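
If you maintain the list of blocked crawlers in code, the file can be generated rather than written by hand. The sketch below is one possible approach in Python; the build_robots_txt function and the AI_CRAWLERS list are illustrative names, with the user-agent tokens taken from the table above.

```python
AI_CRAWLERS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "anthropic-ai",
    "Google-Extended",
    "PerplexityBot",
    "Bytespider",
    "CCBot",
]


def build_robots_txt(blocked_agents, sitemap_url):
    """Return robots.txt text that blocks the given agents and allows everyone else."""
    lines = []
    for agent in blocked_agents:
        # One named group per crawler, each followed by a blank separator line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(AI_CRAWLERS, "https://example.com/sitemap.xml")
```

Keeping the crawler list in one place makes it easier to add new user-agent tokens as companies publish them.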

The robots.txt Generator has one-click AI crawler blocking — check the crawlers you want to block and the correct syntax is generated automatically.

Crawl Budget and Why It Matters

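Googlebot will only fetch a limited number of URLs from any given site in a given period. That limit is called crawl budget, and it depends on how much load your server can handle and how much demand Google has for your content. For small sites, it is rarely a concern. For large sites with many thousands of URLs, it becomes critical.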

When crawl budget is wasted on low-value pages, important pages may not be crawled (and therefore not re-indexed) as frequently. Common crawl budget wasters:

  • Paginated archives (/page/2, /page/3...)
  • Faceted navigation URLs (/products?color=red&size=M)
  • Session ID and tracking parameter URLs
  • Duplicate content on multiple URL patterns
  • Soft 404 pages (return 200 but show "No results found")

Use robots.txt Disallow rules to steer crawlers away from these patterns and toward your most important content.
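
Before writing Disallow rules, it helps to measure where crawl activity actually goes. The Python sketch below groups a list of crawled URLs (for example, paths extracted from server logs) into the categories listed above. The parameter names in FACET_PARAMS and SESSION_PARAMS and the /page/N pattern are assumptions; adapt them to your own URL structure.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names; replace with the ones your site uses.
FACET_PARAMS = {"color", "size", "sort", "filter"}
SESSION_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def classify(url):
    """Assign a crawled URL to a rough crawl-budget category."""
    parts = urlsplit(url)
    params = {key.lower() for key in parse_qs(parts.query, keep_blank_values=True)}
    if params & SESSION_PARAMS:
        return "session/tracking"
    if params & FACET_PARAMS:
        return "faceted"
    if re.search(r"/page/\d+/?$", parts.path):
        return "pagination"
    return "content"


def summarize(urls):
    """Count how many crawled URLs fall into each category."""
    return Counter(classify(url) for url in urls)


crawled = ["/page/2", "/products?color=red&size=M", "/blog/post-1?utm_source=x", "/blog/post-1"]
summary = summarize(crawled)
```

A large share of hits outside the "content" category suggests that Disallow rules for those patterns are worth adding.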

Common robots.txt Mistakes

Accidentally blocking the entire site

The most catastrophic mistake: deploying a robots.txt with Disallow: / under User-agent: *. This tells every crawler to skip everything. Rankings drop as Google's cache expires and it cannot re-crawl to verify content.

Always check the live file at https://yourdomain.com/robots.txt immediately after any deployment. Use the robots.txt Checker to validate syntax and detect this mistake automatically.
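
A deployment pipeline can also guard against this mistake automatically. The Python sketch below uses the standard library's urllib.robotparser to test whether a given robots.txt text blocks the homepage; the blocks_everything function name is an assumption. Note that this parser does not implement wildcards and applies the first matching rule rather than the longest, so treat it as a rough safety net rather than an exact model of Googlebot.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt, user_agent="Googlebot"):
    """Return True if the given robots.txt text stops user_agent from fetching "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text directly, no network request
    return not parser.can_fetch(user_agent, "/")


dangerous = "User-agent: *\nDisallow: /\n"
safe = "User-agent: *\nDisallow: /admin/\n"
```

Calling blocks_everything(dangerous) returns True and blocks_everything(safe) returns False, so a build step could fail whenever the check returns True for the production file.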

Blocking CSS and JavaScript

Googlebot renders pages like a browser — it needs CSS and JS to understand layout, content, and links. Blocking /wp-content/ or /assets/ prevents correct rendering and harms rankings. Block only admin panels, search result pages, and private APIs. Never block theme, style, or script directories.

Using robots.txt as a security measure

robots.txt is publicly viewable. Publishing Disallow: /secret-admin-panel/ effectively announces its existence to the world. For sensitive content, use server-level authentication — not robots.txt.

Wrong file location

robots.txt must be at the domain root. https://example.com/robots.txt is valid. https://example.com/blog/robots.txt is silently ignored by all crawlers.
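
The location rule is mechanical enough to express in code. This short Python sketch derives the robots.txt URL that governs any page URL; the robots_txt_url function name is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain and port); drop the rest.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_txt_url("https://example.com/blog/post-1?ref=home")
subdomain = robots_txt_url("https://blog.example.com/archive/")
```

The path, query string, and fragment never matter; only the scheme and host determine which robots.txt applies.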

How to Add robots.txt to Your Site

WordPress

  • Yoast SEO: SEO → Tools → File Editor → robots.txt tab. Edit directly, no FTP needed.
  • Rank Math: Rank Math → General Settings → Edit robots.txt.
  • Manual: Upload robots.txt to the WordPress root directory (same folder as wp-config.php) via FTP or your hosting file manager.

Shopify

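  1. Go to Online Store → Themes, open the actions menu for your theme, and choose Edit code.
  2. Click Add a new template, select robots, and create it. This adds robots.txt.liquid to the Templates folder.
  3. Edit the template to add your custom rules, ideally keeping Shopify's default rules in place. Changes apply immediately.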

Static sites (Next.js, Astro, Hugo)

Place robots.txt in the public/ folder (Next.js, Astro, Create React App) or the static/ folder (Hugo). The build process copies it to the site root automatically.

Testing after setup

  • Visit https://yourdomain.com/robots.txt in a browser to confirm the file is live.
  • Use the robots.txt Checker to validate syntax, find errors, and audit AI crawler blocks — no account needed.
  • Use Google Search Console → Settings → robots.txt to see which robots.txt files Google found, when it last fetched them, and any parsing errors or warnings.
  • The URL Inspection Tool in Search Console shows whether Googlebot can currently access a specific page.

Frequently Asked Questions

Does blocking a page in robots.txt remove it from Google search results?
No. robots.txt prevents crawling but does not guarantee removal from the index. Google can discover and index a URL from external links without ever crawling it. For guaranteed removal from search results, allow crawling and add a noindex meta tag to the page, then request removal via Google Search Console.
How long do robots.txt changes take to take effect?
Google generally caches robots.txt for up to 24 hours and may keep using the old version longer if it cannot re-fetch the file. For urgent changes, use the robots.txt report in Google Search Console to request a recrawl.
Do I need robots.txt if my site is small?
Not required, but recommended. Even small sites benefit from including a Sitemap directive to help Google discover content faster. If you have admin pages, robots.txt is the right place to keep crawlers away from them. Staging content is better protected with authentication, because robots.txt neither hides nor secures it.
Is robots.txt different for subdomains?
Yes. Each subdomain needs its own robots.txt file. The file at example.com/robots.txt does not apply to blog.example.com. Each subdomain is treated as a separate host.
Can I block specific file types?
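Yes. Use wildcards: Disallow: /*.pdf$ blocks every URL ending in .pdf, and Disallow: /*.json$ does the same for JSON files. The asterisk matches any sequence of characters, and the dollar sign anchors the match to the end of the URL; without it, /*.pdf also matches URLs that merely contain .pdf. Google and Bing support both wildcards, but some smaller crawlers do not.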
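
To see how the two wildcard characters behave, the Python sketch below translates a robots.txt pattern into a regular expression. It is an illustration of the matching semantics, not any search engine's actual code, and the function names pattern_to_regex and matches are assumptions.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    "*" matches any run of characters; a trailing "$" anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with ".*" where the wildcards were.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if the pattern matches the path, starting from its first character."""
    return pattern_to_regex(pattern).match(path) is not None
```

For example, matches("/*.pdf$", "/files/report.pdf") is True, while matches("/*.pdf$", "/files/report.pdf?download=1") is False because the URL no longer ends in .pdf. Without the dollar sign, matches("/*.pdf", "/files/report.pdfx") is also True.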

Key Takeaways

  • robots.txt controls crawling — use noindex to remove pages from search results.
  • Always include a Sitemap directive so crawlers discover your content efficiently.
  • Block admin panels, search result URLs, and low-value pagination to focus crawl budget.
  • To block AI training crawlers (GPTBot, ClaudeBot, Google-Extended), add a named User-agent block with Disallow: / for each.
  • Never block CSS or JavaScript — Googlebot needs them to render and evaluate pages.
  • Do not use robots.txt as a security mechanism — the file is publicly readable.
  • Test your live file in Google Search Console after every change.
