Robots.txt & Indexing
You don’t need every page crawled. But you do need the right ones.
What Is robots.txt?
Instructions for Crawlers
robots.txt is a plain text file at the root of your domain that tells search engine bots which URLs they can and can’t crawl.
Important:
- It’s about crawling, not indexing
- It’s the first file bots check when visiting a domain
How Does It Work?
Example:
User-agent: *
Disallow: /checkout/
Allow: /blog/
This means all bots may crawl everything on the site except URLs under /checkout/. The Allow line is technically redundant (anything not disallowed is crawlable by default), but it makes the intent for /blog/ explicit.
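Rules can also be scoped to individual crawlers and can point bots at your sitemap. A quick illustrative sketch (the /internal-search/ path and the sitemap URL are placeholders, not recommendations):

User-agent: Googlebot
Disallow: /internal-search/

User-agent: *
Disallow: /checkout/

Sitemap: https://www.example.com/sitemap.xml

Each bot obeys only the most specific User-agent group that matches it, so Googlebot would follow its own rules here while every other crawler follows the * group.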
Common robots.txt Mistakes
What NOT to Do
Checklist:
- ❌ Blocking the entire site with Disallow: /
- ❌ Disallowing pages that need to be indexed
- ❌ Forgetting to allow CSS/JS (blocked assets break rendering)
- ❌ Thinking 'Disallow' prevents indexing (use noindex instead; see the snippet after this list)
- ❌ Not updating when site structure changes
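If the goal is keeping a URL out of search results rather than out of the crawl, leave it crawlable and give it a noindex directive instead. A minimal illustration of the two common forms:

In the page's <head>:
<meta name="robots" content="noindex">

Or as an HTTP response header (useful for PDFs and other non-HTML files):
X-Robots-Tag: noindex

Either way the URL must stay crawlable: if it is also disallowed in robots.txt, bots never fetch it, never see the noindex, and the page can linger in the index as a URL-only result.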
Best Practices
Crawl Smarter, Not Harder
Tips:
- ✅ Allow all assets needed for rendering
- ✅ Block low-value or duplicate URLs (cart, filters, etc.; see the sample file after this list)
- ✅ Always test changes in Google Search Console's robots.txt report
- ✅ Place robots.txt at yourdomain.com/robots.txt
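Putting the tips together, here is a sketch of a lean robots.txt for a typical store; every path and the sitemap URL are placeholders to adapt to your own URL structure:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /*?filter=

# Only needed if a broader Disallow rule would otherwise catch these assets
Allow: /*.css$
Allow: /*.js$

Sitemap: https://www.yourdomain.com/sitemap.xml

Blocking parameterized filter URLs saves crawl budget, but they can still be indexed if other sites link to them, so pair the rule with canonical tags or noindex where that matters. (The * wildcard and $ anchor are honored by Google and Bing, though not necessarily by every crawler.)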
Related Technical Pages
Build Around Your Stack
XML Sitemaps
Give search engines a clean list of the URLs you actually want crawled and indexed.
Indexing Basics
How pages get discovered, crawled, rendered, and added to Google's index.
Canonical Tags
Point duplicate or near-duplicate URLs to a preferred version so ranking signals consolidate.
Control the Crawl
Don’t let Google waste time on URLs that don’t help you rank.