Robots.txt: Powerful but Picky!

The robots.txt file is powerful but picky! I suspect most of us set up our robots.txt file as basically a one-size-fits-all for the spiders. We'll instruct all spiders to crawl or not to crawl the same files. For instance, a simple robots.txt file covering all spiders would look something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /ar/
Disallow: /el/
Disallow: /ja/
Sitemap: http://www.domain.com/sitemap.xml

This tells all bots (that's the * after User-agent) to stay away from four directories, and the Sitemap line provides the location of the domain's sitemap.
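You can see this file in action with Python's standard-library robots.txt parser (urllib.robotparser). This is just a quick sketch; the bot name and page paths are made-up placeholders, not anything from the example domain:

```python
from urllib import robotparser

# The wildcard-only robots.txt from the example above.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /ar/
Disallow: /el/
Disallow: /ja/
Sitemap: http://www.domain.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Every bot hits the * group, so the four directories are off-limits to all:
print(rp.can_fetch("AnyBot", "/ja/index.html"))   # False: disallowed
print(rp.can_fetch("AnyBot", "/blog/post.html"))  # True: not listed, so allowed
```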

But what if you want to give Google special instructions? You'd think it would be a simple matter of adding a rule just for Google, since you've already used the * wildcard to tell all spiders to avoid certain files or directories. Unfortunately, it's not that easy. Let's say you add these lines to your robots.txt file to keep ONLY Google out of your /info-pages/ directory:

User-agent: Googlebot
Disallow: /info-pages/

User-agent: *
Disallow: /cgi-bin/
Disallow: /ar/
Disallow: /el/
Disallow: /ja/
Sitemap: http://www.domain.com/sitemap.xml

You would think that Google would understand it should stay out of the /info-pages/ directory and, since the * appears in the next User-agent section, would also avoid those designated directories just like all of the other bots.

Danger, Will Robinson!

Sorry, but it doesn't work that way. In this case, Google will avoid the /info-pages/ directory as instructed in its specific section of the robots.txt file and ignore every other instruction in the file — it would still crawl all of those other directories. If you want to give Google (or any other bot) special instructions, they have to be complete. Here, you would need to repeat all of the other directories in the Googlebot section to keep that bot out of /info-pages/ AND the other four directories, along with pointing out where the domain's sitemap is located. This is what the complete robots.txt file would look like:

User-agent: Googlebot
Disallow: /info-pages/
Disallow: /cgi-bin/
Disallow: /ar/
Disallow: /el/
Disallow: /ja/
Sitemap: http://www.domain.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /ar/
Disallow: /el/
Disallow: /ja/
Sitemap: http://www.domain.com/sitemap.xml

Quick robots.txt lesson: the robots.txt file is very literal. If you set up a section for a certain bot, that bot pays attention ONLY to the instructions in THAT section. Everything else in the file is ignored.
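This "pick one section and ignore the rest" behavior can be demonstrated with Python's standard-library parser (urllib.robotparser). The snippet below reproduces the broken intermediate file from above — the Googlebot section without the wildcard directories — and "OtherBot" is just a placeholder name for any non-Google crawler:

```python
from urllib import robotparser

# The incomplete robots.txt: the Googlebot section omits the directories
# that are listed only under the * wildcard section.
rules = """\
User-agent: Googlebot
Disallow: /info-pages/

User-agent: *
Disallow: /cgi-bin/
Disallow: /ar/
Disallow: /el/
Disallow: /ja/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot obeys its own section...
print(rp.can_fetch("Googlebot", "/info-pages/page.html"))  # False
# ...but ignores the wildcard section entirely, so it can still crawl this:
print(rp.can_fetch("Googlebot", "/cgi-bin/script.cgi"))    # True
# Every other bot falls back to the * section:
print(rp.can_fetch("OtherBot", "/cgi-bin/script.cgi"))     # False
```

Notice the middle line: even though /cgi-bin/ is disallowed for everyone else, Googlebot never sees that rule because it already matched its own section.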

For more information, see the Robots Exclusion Standard.
