When Google’s search engine spider, “Googlebot,” crawls a site, it first accesses the Robots.txt to determine which URLs it is not allowed to crawl. This is particularly useful if you have content on your site that you don’t want search engines to crawl, such as private or duplicate content. Having fought half the battle, now knowing what a Robots.txt file is, let’s press on to learn how to execute the protocol.
Before we get into the nitty-gritty details I would like to point out that it’s not the end of the world if your site does not have a Robots.txt as it may not need one. There is no point in telling Googlebot, or any other crawler, to go ahead and crawl everything on your site as that is what it would normally do anyway. If a search engine spider looks up your Robots.txt file and finds that it doesn’t exist then a 404 will be returned and it will just keep on keepin’ on like Curtis Mayfield. However, there is an important implication if your Robots.txt is inaccessible, which we will touch on later on as we discuss testing.
Having said that, if you do need a robots.txt, there are three ways to get these instructions onto a site: uploading the file manually, meta-tagging individual pages, or using a generator tool.
To manually upload your Robots.txt you will need access to the root of your domain and a text editor. Once you have those two squared away, it’s time to get started. When creating your file you want to include two simple rules:
User-agent: the bot the rule(s) apply to
There are countless user-agents that you can choose to block, and many of the most common ones are listed in The Web Robots Database. You can also choose to block them all:
- Apply to a single bot – User-agent: Googlebot
- Apply to all bots – User-agent: *
Disallow: the URL you want to block
You can block everything from specific URLs to an entire site:
- To block an entire site – Disallow: /
- To block a file type – Disallow: /*.jpeg$
- To block a specific URL – Disallow: /for-my-eyes-only/
Once you’re done adding all of your directives, save the file from your text editor as robots.txt and upload it to your root directory.
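To make the file layout concrete, here is a minimal Python sketch that assembles the two rules above into robots.txt text. The function name `build_robots_txt` and the sample rules are my own, chosen only for illustration, and writing the file by hand in a text editor works just as well.

```python
def build_robots_txt(groups):
    """Render {user_agent: [paths to block]} as robots.txt text.

    `groups` is a hypothetical input format invented for this sketch.
    """
    blocks = []
    for user_agent, paths in groups.items():
        lines = [f"User-agent: {user_agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        blocks.append("\n".join(lines))
    # Separate groups with a blank line and end the file with a newline.
    return "\n\n".join(blocks) + "\n"


# Sample rules reusing the examples from this post.
rules = {
    "Googlebot": ["/for-my-eyes-only/"],
    "*": ["/*.jpeg$"],
}
robots_txt = build_robots_txt(rules)
```

Saving `robots_txt` to a file named robots.txt (for example with `pathlib.Path("robots.txt").write_text(robots_txt)`) gives you something ready to upload to the root directory.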
There may be instances where for whatever reason you don’t have access to your root directory, but that’s okay, as you can also meta-tag single pages of a site with the following tags:
- Prevent all robots from indexing a page: <meta name="robots" content="noindex">
- Prevent a single robot from indexing a page: <meta name="googlebot" content="noindex">
When a search engine such as Google reads these tags, it will leave the page out of its search results. There are also several other values you can use in the tag, including:
- Nofollow: if you don’t want links to be followed from a page
- Nosnippet: if you don’t want a snippet to be shown in search results
- Noodp: if you don’t want to use ODP/DMOZ descriptions
- Noarchive: if you don’t want a search engine showing a cached link for a page
- Unavailable_after:[date]: set a date after which the page should no longer be crawled and indexed
- Noimageindex: if you don’t want the page to show up as the referring page for an image in search results
- None: save some code by using one value in place of both noindex and nofollow
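If you want to double-check which of these values a page actually carries, a short Python sketch using the standard library's `html.parser` can pull them out. The class and function names here are hypothetical, and the sample page is made up. It reads tags named `robots` plus one bot-specific name, which is a simplification of how a real crawler decides.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and one bot-specific tag."""

    def __init__(self, bot="googlebot"):
        super().__init__()
        self.names = {"robots", bot.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in self.names:
            content = attributes.get("content") or ""
            # Values are comma-separated, e.g. "noindex, nofollow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html, bot="googlebot"):
    """Return the set of robots meta values found in an HTML string."""
    parser = RobotsMetaParser(bot)
    parser.feed(html)
    return parser.directives


# Made-up page for illustration.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
directives = robots_directives(page)
```

A tag aimed at a different robot is ignored unless you pass that robot's name as `bot`.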
Another option is to use one of the many freely available Robots.txt generator tools. You can find them by performing a Google Search.
Don’t Forget to Test
As hinted at earlier, it is better to not have a Robots.txt than to have one that doesn’t work. If a Robots.txt file exists but is inaccessible, search engines will likely postpone crawling the site until the problem is fixed. Call it “erring on the side of caution,” “respect for privacy,” or a flat-out “fix your $@!#”: they would rather not crawl a site at all than risk crawling blocked URLs.
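The difference between a missing file and an inaccessible one can be sketched as a small decision function. This is only an illustration: the function name is hypothetical, and treating "inaccessible" as a server error (5xx) or no response at all is my assumption, not something a particular search engine guarantees.

```python
def crawl_decision(status):
    """Guess what a cautious crawler does after requesting /robots.txt.

    `status` is the HTTP status code, or None if there was no response.
    The mapping below is an assumption made for illustration.
    """
    if status is None or 500 <= status <= 599:
        # The file may exist but cannot be read, so hold off entirely.
        return "postpone crawling"
    if status == 404:
        # No robots.txt at all, so nothing is blocked.
        return "crawl everything"
    if status == 200:
        return "read the rules and obey them"
    # Other codes are not covered by this post, and crawlers differ.
    return "not covered here"
```

So `crawl_decision(404)` lands on the harmless case, while `crawl_decision(503)` is the one that can quietly stall crawling of your whole site.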
To make sure your file is working and blocking the correct URLs before you upload it to your root directory, go to your Webmaster Tools page and select your site. Then, under Crawl, click Blocked URLs. Once there, click Test robots.txt and copy the contents of your file into the appropriate fields. Note that Google’s Webmaster Tools only tests against Google’s own user-agents.
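For a quick local sanity check alongside Webmaster Tools, Python's standard `urllib.robotparser` module can evaluate a draft file against sample URLs. This is a rough sketch: the file contents, the `is_blocked` helper, and the example.com URLs are made up for illustration. As far as I know, this parser only does simple prefix matching, so wildcard rules such as /*.jpeg$ will not be evaluated the way Google evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Draft file contents (hypothetical), using prefix rules only.
robots_txt = """\
User-agent: Googlebot
Disallow: /for-my-eyes-only/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_blocked(user_agent, url):
    """Return True if the draft rules stop `user_agent` from fetching `url`."""
    return not parser.can_fetch(user_agent, url)


# Spot-check a few URLs before uploading the file.
checks = [
    ("Googlebot", "https://www.example.com/for-my-eyes-only/secret.html"),
    ("Googlebot", "https://www.example.com/about/"),
    ("SomeOtherBot", "https://www.example.com/about/"),
]
for agent, url in checks:
    print(agent, url, "blocked" if is_blocked(agent, url) else "allowed")
```

Unlike Webmaster Tools, this lets you try any user-agent string, but it is no substitute for the search engine's own tester.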
Stay Tuned for More on Robots.txt
In my follow-up post we will review a case study of a client who has a robots.txt implemented to block duplicate search results. We will discuss its impact on SEO and user experience.