
Robots.txt Cheatsheet

Today I wanted to create a robots.txt file to control which pages crawlers can access and to block specific bots. Here's what I learned:

Allow Everything

To create a robots.txt file that allows all web crawlers to access everything on the website:

User-agent: *
Disallow:

This tells all user agents (web crawlers) that they are allowed to access all pages on the website. The Disallow: directive with no value means no pages are disallowed.

The file just needs to be saved as robots.txt in the root directory of the website (e.g. https://example.com/robots.txt) to take effect.
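A quick way to sanity-check the rules before deploying them is Python's built-in urllib.robotparser, which can parse the lines directly and answer "can this bot fetch this URL?" questions. The example.com URLs and bot names below are just placeholders:

from urllib.robotparser import RobotFileParser

# Parse the allow-everything rules directly instead of fetching them from a site.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
])

# Every user agent may fetch every path.
print(rp.can_fetch("Googlebot", "https://example.com/"))            # True
print(rp.can_fetch("SomeRandomBot", "https://example.com/blog/x"))  # True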

Block Certain Bots

If you want to block specific bots from accessing your website, you can use the User-agent directive followed by the name of the bot. For example, to block AhrefsBot and PetalBot:

User-agent: AhrefsBot
Disallow: /

User-agent: PetalBot
Disallow: /
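The same urllib.robotparser check works here; bots not named in the file fall through to "allowed" because there is no User-agent: * group (example.com is again a stand-in):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: AhrefsBot",
    "Disallow: /",
    "",
    "User-agent: PetalBot",
    "Disallow: /",
])

print(rp.can_fetch("AhrefsBot", "https://example.com/any-page"))  # False
print(rp.can_fetch("PetalBot", "https://example.com/"))           # False
print(rp.can_fetch("Googlebot", "https://example.com/"))          # True, no rule applies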

Block Certain Pages

To block bot access to specific pages or directories, use the Disallow directive followed by the path:

User-agent: *
Disallow: /user/
Disallow: /admin/

This prevents all bots from accessing any page within the /user/ and /admin/ directories. To allow access to a specific page under /user/, such as the /user/signup/ page:

User-agent: *
Disallow: /user/
Allow: /user/signup/$

The "$" symbol at the end of the Allow directive means that only the exact URL "/user/signup/" is allowed, and no subpages or subdirectories.

To allow access to any page under the /user/signup/ directory:

User-agent: *
Disallow: /user/
Allow: /user/signup/*

The "*" symbol at the end of the Allow directive means that any URL starting with "/user/signup/" is allowed, including subpages and subdirectories.