Official Google Webmaster Central Blog: Improving on Robots Exclusion Protocol

Improving on Robots Exclusion Protocol

Tuesday, June 03, 2008

Written by Prashanth Koppula, Product Manager

Common REP Directives

1. Robots.txt Directives

| DIRECTIVE | IMPACT | USE CASES |
|---|---|---|
| Disallow | Tells a crawler not to crawl your site or the listed parts of it. Your site's robots.txt file still needs to be crawled to find this directive, but disallowed pages will not be crawled. | 'No Crawl' pages from a site. In its default syntax, this directive prevents specific path(s) of a site from being crawled. |
| Allow | Tells a crawler which specific pages on your site may be crawled; it is used in combination with Disallow. | Particularly useful in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it. |
| $ Wildcard Support | Tells a crawler to match the pattern against the end of a URL, so one rule can cover a large number of directories without listing specific pages. | 'No Crawl' files with specific patterns, for example, files of a type that always has a certain extension, such as pdf. |
| * Wildcard Support | Tells a crawler to match any sequence of characters. | 'No Crawl' URLs with certain patterns, for example, URLs with session IDs or other extraneous parameters. |
| Sitemaps Location | Tells a crawler where it can find your Sitemaps. | Point to other locations where feeds exist to help crawlers find URLs on a site. |
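To make the robots.txt directives in the table above concrete, here is a minimal Python sketch of how a crawler might evaluate them. The sample file, the `example.com` URL, and the helper names `parse`, `to_regex`, and `is_allowed` are all hypothetical. The sketch ignores User-agent grouping and assumes that, when several rules match, the longest pattern wins and Allow wins a tie. That precedence is an assumption of this example, not something the table states.

```python
import re

# Hypothetical robots.txt content illustrating each directive in the table.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /private/press/
Disallow: /*.pdf$
Disallow: /*?sessionid=
Sitemap: http://www.example.com/sitemap.xml
"""


def parse(text):
    """Collect (kind, pattern) rules and Sitemap URLs; User-agent groups are ignored."""
    rules, sitemaps = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field in ("allow", "disallow") and value:
            rules.append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return rules, sitemaps


def to_regex(pattern):
    """Translate a robots.txt path pattern: '*' matches any run, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if the path may be crawled (assumed rule: longest match wins)."""
    best_len, allowed = -1, True  # no matching rule means the path is crawlable
    for kind, pattern in rules:
        if to_regex(pattern).match(path):
            longer = len(pattern) > best_len
            tie_for_allow = len(pattern) == best_len and kind == "allow"
            if longer or tie_for_allow:
                best_len, allowed = len(pattern), kind == "allow"
    return allowed


rules, sitemaps = parse(ROBOTS_TXT)
```

With these rules, `is_allowed("/private/press/launch.html", rules)` is true because the Allow pattern is more specific than the Disallow that covers `/private/`, while `is_allowed("/docs/manual.pdf", rules)` is false because the `$` rule matches URLs ending in `.pdf`.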

2. HTML META Directives

| DIRECTIVE | IMPACT | USE CASES |
|---|---|---|
| NOINDEX META Tag | Tells a crawler not to index a given page. | Don't index the page. This allows pages that are crawled to be kept out of the index. |
| NOFOLLOW META Tag | Tells a crawler not to follow links to other content from a given page. | Prevent publicly writable areas from being abused by spammers looking for link credit. By using NOFOLLOW you let the robot know that you are discounting all outgoing links from this page. |
| NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page. | Present no snippet for the page in search results. |
| NOARCHIVE META Tag | Tells a search engine not to show a "cached" link for a given page. | Do not make a copy of the page from the search engine cache available to users. |
| NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page. | Do not use the ODP (Open Directory Project) title and snippet for this page. |
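As an illustration of the META directives in the table above, the following Python sketch reads the robots META tags out of a page and reports which actions remain permitted. The sample page and the names `RobotsMetaParser` and `robots_directives` are hypothetical, and the choice to honor both a generic `robots` tag and a crawler-specific tag name is an assumption of this sketch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and a crawler-specific meta tag."""

    def __init__(self, crawler="googlebot"):
        super().__init__()
        self.names = {"robots", crawler.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in self.names:
            # Directives are comma-separated and case-insensitive.
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html, crawler="googlebot"):
    """Return the set of lowercase robots directives found in the page."""
    parser = RobotsMetaParser(crawler)
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for illustration.
PAGE = """<html><head>
<meta name="robots" content="NOINDEX, NOFOLLOW">
<meta name="googlebot" content="noarchive">
</head><body>Hello</body></html>"""

directives = robots_directives(PAGE)
may_index = "noindex" not in directives
may_follow_links = "nofollow" not in directives
may_show_cached_link = "noarchive" not in directives
may_show_snippet = "nosnippet" not in directives
```

For the sample page, indexing, link following, and the cached link are all switched off, while snippets stay enabled because no NOSNIPPET directive is present.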

Related posts: the X-Robots-Tag post and our series of posts.

Other REP Directives

UNAVAILABLE_AFTER Meta Tag: tells a crawler when a page should "expire"
NOIMAGEINDEX Meta Tag: tells a crawler not to index images for a given page
NOTRANSLATE Meta Tag: tells a crawler not to translate the content on a page into different languages

Learn more: see the documentation and Google's Webmaster help center.