Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Improving on Robots Exclusion Protocol

Tuesday, June 03, 2008 at 9:00 AM



Web publishers often ask us how they can maximize their visibility on the web. Much of this has to do with search engine optimization -- making sure a publisher's content shows up on all the search engines.

However, there are some cases in which publishers need to communicate more information to search engines -- like the fact that they don't want certain content to appear in search results. And for that they use something called the Robots Exclusion Protocol (REP), which lets publishers control how search engines access their site: whether it's controlling the visibility of their content across their site (via robots.txt) or down to a much more granular level for individual pages (via META tags).

Since it was introduced in the early '90s, REP has become the de facto standard by which web publishers specify which parts of their site they want public and which parts they want to keep private. Today, millions of publishers use REP as an easy and efficient way to communicate with search engines. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and in the way it works for any publisher, no matter how large or small.

While REP is observed by virtually all search engines, we've never come together to detail how we each interpret different tags. Over the last couple of years, we have worked with Microsoft and Yahoo! to bring forward standards such as Sitemaps and offer additional tools for webmasters. Since the original announcement, we have delivered, and will continue to deliver, further improvements based on what we are hearing from the community.

Today, in that same spirit of making the lives of webmasters simpler, we're releasing detailed documentation about how we implement REP. This will provide a common implementation for webmasters and make it easier for any publisher to know how their REP directives will be handled by three major search providers -- making REP more intuitive and friendly to even more publishers on the web.

So, without further ado...

Common REP Directives
The following list covers all the major REP features currently implemented by Google, Microsoft, and Yahoo!. With each feature, you'll see what it does and how you should communicate it.

Each of these directives can be specified to be applicable for all crawlers or for specific crawlers by targeting them to specific user-agents, which is how any crawler identifies itself. Apart from the identification by user-agent, each of our crawlers also supports Reverse DNS based authentication to allow you to verify the identity of the crawler.
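As a rough illustration of the reverse DNS check mentioned above, here is a minimal Python sketch of a forward-confirmed reverse DNS lookup: resolve the visiting IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it returns the same IP. The function name, the injectable lookup parameters, and the hostname suffixes are assumptions made for this example; consult each search engine's documentation for the domains its crawlers actually use.

```python
import socket

# Hostname suffixes assumed for this sketch; each search engine documents
# the domains its own crawlers resolve to.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None,
                        suffixes=TRUSTED_SUFFIXES):
    """Forward-confirmed reverse DNS check for a client IP address."""
    # Default to real DNS lookups; callers may inject stand-ins for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)  # Step 1: IP -> hostname (PTR record)
    except OSError:
        return False
    if not host.lower().endswith(suffixes):  # Step 2: hostname in a trusted domain?
        return False
    try:
        return ip in forward_lookup(host)  # Step 3: hostname -> IPs must include ours
    except OSError:
        return False
```

Checking the user-agent string alone is not enough, since anyone can send any user-agent; the forward lookup in step 3 is what stops a spoofed PTR record from passing.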

1. Robots.txt Directives
DIRECTIVE | IMPACT | USE CASES
Disallow | Tells a crawler not to index your site -- your site's robots.txt file still needs to be crawled to find this directive, however disallowed pages will not be crawled | 'No Crawl' page from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.
Allow | Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow | This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it
$ Wildcard Support | Tells a crawler to match everything from the end of a URL -- large number of directories without specifying specific pages | 'No Crawl' files with specific patterns, for example, files with certain filetypes that always have a certain extension, say pdf
* Wildcard Support | Tells a crawler to match a sequence of characters | 'No Crawl' URLs with certain patterns, for example, disallow URLs with session ids or other extraneous parameters
Sitemaps Location | Tells a crawler where it can find your Sitemaps | Point to other locations where feeds exist to help crawlers find URLs on a site
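
To make the two wildcards concrete, here is a small Python sketch of how a Disallow or Allow pattern could be compared against a URL path. It assumes that a pattern matches from the start of the path, that * stands for any sequence of characters (including slashes), and that a trailing $ pins the match to the end of the URL. The function names are made up for this example, and this is not the matching code of any search engine.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without a trailing '$', the pattern behaves as a prefix match.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if the URL path is covered by the pattern."""
    return pattern_to_regex(pattern).match(path) is not None
```

Under these assumptions, matches("/*.pdf$", "/docs/report.pdf") is True while matches("/*.pdf$", "/docs/report.pdf?download=1") is False, and matches("/*sessionid=", "/shop/item?sessionid=123") is True. A plain pattern such as /product/sort/ needs no trailing star, because every pattern already matches as a prefix.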

2. HTML META Directives
DIRECTIVE | IMPACT | USE CASES
NOINDEX META Tag | Tells a crawler not to index a given page | Don't index the page. This allows pages that are crawled to be kept out of the index.
NOFOLLOW META Tag | Tells a crawler not to follow a link to other content on a given page | Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW you let the robot know that you are discounting all outgoing links from this page.
NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page | Present no snippet for the page in search results
NOARCHIVE META Tag | Tells a search engine not to show a "cached" link for a given page | Do not make a copy of the page from the search engine cache available to users
NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page | Do not use the ODP (Open Directory Project) title and snippet for this page
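
For a sense of how a crawler might read these tags, here is a short Python sketch that collects the directives from a page's robots META tags using the standard library's HTML parser. It assumes directives are comma-separated and case-insensitive, and that a crawler listens both to the generic name "robots" and to a tag named after its own user-agent token. The class name, the function name, and the default token are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self, user_agent="googlebot"):
        super().__init__()
        # Names this crawler listens to: the generic "robots" plus its own token.
        self.names = {"robots", user_agent.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in self.names:
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html, user_agent="googlebot"):
    """Return the set of lowercase directives that apply to this crawler."""
    parser = RobotsMetaParser(user_agent)
    parser.feed(html)
    return parser.directives
```

A page whose head contains a robots META tag with the content "NOINDEX, NOFOLLOW" would yield the set {"noindex", "nofollow"}; a tag addressed to a different crawler's token is ignored.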


These directives are applicable for all forms of content. They can be placed in either the HTML of a page or in the HTTP header for non-HTML content, e.g., PDF, video, etc. using an X-Robots-Tag. You can read more about it in our X-Robots-Tag post or in our series of posts about using robots and META tags.
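
As a sketch of the HTTP header route, the Python below reads directives out of one or more X-Robots-Tag header values and decides whether a response may be indexed. It assumes simple comma-separated directives with no user-agent prefix, and it assumes that directives found in the header and in META tags are simply combined; a value that itself contains a comma, such as a date, would need extra handling. Both function names are hypothetical.

```python
def parse_x_robots_tag(header_values):
    """Return the set of lowercase directives found in X-Robots-Tag values."""
    directives = set()
    for value in header_values:  # a response may carry the header more than once
        for token in value.split(","):
            token = token.strip().lower()
            if token:
                directives.add(token)
    return directives


def is_indexable(header_values, meta_directives=()):
    """Return False if either the headers or the META tags say noindex."""
    combined = parse_x_robots_tag(header_values) | {d.lower() for d in meta_directives}
    return "noindex" not in combined
```

For a PDF served with the header value "noindex, noarchive", is_indexable(["noindex, noarchive"]) returns False, which is the same outcome a NOINDEX META tag would produce on an HTML page.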

Other REP Directives
The directives listed above are used by Microsoft, Google and Yahoo!, but may not be implemented by all other search engines. In addition, the following directives are supported by Google, but are not supported by all three as are those above:

UNAVAILABLE_AFTER Meta Tag - Tells a crawler when a page should "expire", i.e., after which date it should not show up in search results.

NOIMAGEINDEX Meta Tag - Tells a crawler not to index images for a given page in search results.

NOTRANSLATE Meta Tag - Tells a crawler not to translate the content on a page into different languages for search results.


Going forward, we plan to continue to work together to ensure that as new uses of REP arise, we're able to make it as easy as possible for webmasters to use them. So stay tuned for more!

Learn more
You can find out more about robots.txt at http://www.robotstxt.org and at Google's Webmaster help center, which contains lots of helpful information. We've also done several posts in our webmaster blog about robots.txt that you may find useful. There is also a useful list of the bots used by the major search engines.

To see what our colleagues have to say, you can also check out the blog posts published by Yahoo! and Microsoft.

25 comments:

infrared said...

Before this post, I was under the impression that you can't use wildcards within URLs.

Would the following work now in a hypothetical situation to remove different sorting URLs?

Disallow: /product/sort/*

This would cause pages /product/sort/alpha/ and /product/sort/price/ to then not be indexed.

Ludwig Weinzierl said...

"They can be placed in either the HTML of a page or in the HTTP header for non-HTML content, e.g., PDF, video, etc. using an X-Robots-Tag. "

Is the HTTP header of HTML pages also considered? In other words: Can I put the directives in either the HTTP header or the HTML meta tags? What is preferred in case of conflict?

Devon said...

didn't know about the wildcards before. I like that. Thanks for the list.

Anand said...

Don't Google, Yahoo and Microsoft support NOINDEX meta differently? This post fails to mention that. See http://www.mattcutts.com/blog/google-noindex-behavior/

mike.mikowski said...

Hi, I want to ask why Google ignores X-Robots-Tag when removing a URL using the Google URL removal tool? This X-Robots-Tag header should work as meta noindex, nofollow, so I don't understand it.

Secondly, I would like to ask if Google follows a 302 redirect for a page with meta noindex, nofollow? Is meta noindex only valid for a 200 status code?

Kaj Kandler said...

I'd like to know when does any of the three major SE start to honor those directives?

With sitemaps there were announcements but it was not clear when actually the SE accepted their submission or looked at them. With this there is no way of knowing. so please be specific.

Kaj Kandler said...

I'd also like to know more specific spec of the * wildcard.

Can I use it multiple times? For example like:
/abc/*/xyz/*.php

What exactly does that mean?
Does it cover
/abc/edf/xyz/lmn.php
and
/abc/edf/ghj/xyz/lmn.php ?

A more complete spec would be much appreciated. Also, where is the official spec?

searchtools said...

I am also a bit confused about how complex a wildcard expression you will support, and in particular, how much of this is a common protocol, so what works with one robot works with the other two (and any others that come on board).

If you could address this issue in a future post, that would be great.

Susan Moskwa said...

@infrared:
You don't need to put a star at the end of a line; each disallow statement disallows all URLs that begin with that string. So

Disallow: /product/sort/

would disallow all of the following:

www.example.com/product/sort/
www.example.com/product/sort/alpha.html
www.example.com/product/sort/price/foo.html

The star is most useful in the middle of a Disallow statement. Check out the documentation for details.

Susan Moskwa said...

@Ludwig:
Yes, we do look for X-Robots-Tag in all headers (HTML as well as other filetypes). I believe that if we find both X-Robots-Tags and meta tags on a page, we'll adhere to the most restrictive one we find (e.g. if we find both nosnippet and noindex, we'll follow the noindex).

@Kaj:
All three engines are honoring these REP directives as of this announcement.

* matches any sequence of characters, so

Disallow: /abc/edf/*lmn.php

would disallow both
/abc/edf/xyz/lmn.php
and
/abc/edf/ghj/xyz/lmn.php

As the blog post mentions, REP has been a largely de facto standard (there's no one organizing body), so there's no one "official" spec; but our blog post links to our documentation, as well as to Yahoo! and Live Search's blog posts on this, which list their documentation.

@searchtools:
Everything documented in this post will work for Google, Yahoo!, and MSN. While many other robots adhere to REP, we can't guarantee the implementation details for other engines. If you're interested in how a particular engine treats these directives, I'd recommend checking out their documentation.

jeet said...

Typically, how long does it take for a new site to get crawled by google bot and indexed. I am waiting for my new tech blog - www.technewsbeats.com to get indexed by Google for quite sometime. I have also verified and submitted my sitemap via google webmaster tool. But still now no result. Please help.

Susan Moskwa said...

Also, you can use the * wildcard multiple times, e.g.

Disallow: /abc/*/xyz/*.php

Nyceane said...

the entire SEO is really complicated these days, I am just surprised how entire company's advertising is depended on google and not internet itself... lol

Susan Moskwa said...

Mike,
Google's crawler will follow redirects and will look at the target of the redirect. We won't look at any meta tags on the page that originates the redirect (because we immediately follow the redirect), but we'll see and adhere to meta tags on the redirect target page, including noindex and nofollow directives.

beddy said...

What does google recommend for the redirect? A meta redirect or the server-side 301 redirect or any others? Does it make a difference?

MRC-T said...

Ok, I have a question that I have searched every place I can find, including webmaster tools, discussion board, etc. My site has been indexed by google repeatedly, but I am puzzled by the fact that in the search results it lists an OLD meta description, but the correct new meta title. I have checked and re-checked my web page, and there is nothing in it with the old meta description. Google has indexed my site numerous times, most recently a week ago, but the search results still show the old meta description. I am concerned because the old meta description is VERY different than the new one and it has information I would really like to be updated. I have also tried to find a way to report this, but have had no luck with that either. Please please, do you have any tips for me?

KellyLynch said...

What about situation when Allow directive specifies empty value ('Allow: ')? Is such directive just ignored?

Сергей в Темноте said...

Good article, thanks.

spunk mouth said...

I dont want indexing from search engines for www.mydomain.com but I want indexing for mydomain.com

what to I do in robots.txt???

Matt747 said...

I'm interested in the no snippets order. I would like to know if using this for part of the content has any other effect on how google rates this page, e.g. will it no longer count the no snippeted content in considering its relevancy.

Kiefers Corner said...

Up to about a week ago I only had a few pages with robots.txt files on them. Now the majority of them are. The only thing I did was to add photos to my blogs and add Amazon.

Did i do something wrong ?

How do i fix this ?

Suraj Shrestha said...

Is there a way that we can access and edit robots.txt file of our blog on blogspot?

tanzy-panzy said...

My site used to be on the first page for a specific keyword, but now it shows "Disallow: /search" in the robots.txt file, and due to this it was thrown out of the search engine.

My mistake: I thought using the keyword frequently in the title of the posts would help but it did so much of harm that I am still repenting.

After two months and my constant efforts. I see my site on 38th position. The fact is that my website provides more information on the topic than any other site that is on first page.

My effort to provide more and concise information has gone to waste.

I have removed the keyword from the title and labels now but I don't see much result.

Can anyone advise what can be done for it to be listed on the first page again?

Thanks for reading this.

Cheers!
T.

Nits said...

quite well described blog, before reading this i was totally confused, but now i m clear. Thanks http://ansblog.com

Maile Ohye said...

Hi everyone,

Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Forum.

Thanks and take care,
The Webmaster Central Team