Saturday, August 19, 2006 at 11:51 AM
I've seen a lot of questions lately about robots.txt files and Googlebot's behavior. Last week at SES, I spoke on a new panel called the Bot Obedience course. And a few days ago, some other Googlers and I fielded questions on the WebmasterWorld forums. Here are some of the questions we got:
If my site is down for maintenance, how can I tell Googlebot to come back later rather than to index the "down for maintenance" page?
You should configure your server to return a status of 503 (Service Unavailable) rather than 200 (successful). That lets Googlebot know to try the pages again later.
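As a rough illustration of the 503 advice, here is a minimal sketch using Python's standard http.server module. The handler name, the make_server helper, the port, and the one-hour Retry-After value are all assumptions made for this example and are not something the post prescribes. Retry-After is an optional HTTP header that suggests when to try again. In practice you would more likely set the status in your existing web server's configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed length of the maintenance window, in seconds (hypothetical value).
RETRY_AFTER_SECONDS = 3600


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answer every GET and HEAD request with 503 instead of 200."""

    def _send_unavailable(self, include_body):
        body = b"Down for maintenance. Please check back soon.\n"
        self.send_response(503)  # status line: 503 Service Unavailable
        # Optional hint telling clients how long to wait before retrying.
        self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._send_unavailable(include_body=True)

    def do_HEAD(self):
        self._send_unavailable(include_body=False)

    def log_message(self, format, *args):
        # Suppress the default per-request logging to stderr.
        pass


def make_server(host="127.0.0.1", port=8080):
    """Build (but do not start) a server that serves the maintenance page."""
    return HTTPServer((host, port), MaintenanceHandler)
```

To try it locally, call make_server().serve_forever() and request any URL on that port. Every path should come back with a 503 status rather than a 200.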
What should I do if Googlebot is crawling my site too much?
You can contact us -- we'll work with you to make sure we don't overwhelm your server's bandwidth. We're experimenting with a feature in our webmaster tools for you to provide input on your crawl rate, and have gotten great feedback so far, so we hope to offer it to everyone soon.
Is it better to use the meta robots tag or a robots.txt file?
Googlebot obeys either, but meta tags apply to single pages only. If you have a number of pages you want to exclude from crawling, you can structure your site in such a way that you can easily use a robots.txt file to block those pages (for instance, put the pages into a single directory).
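To make the single-directory idea concrete, here is a small sketch that uses Python's standard urllib.robotparser module. The /private/ directory and the page paths are hypothetical names chosen for this example. Python's parser is not Googlebot's own parser, so treat the result as an approximation rather than a guarantee of crawler behavior. The per-page alternative would be a tag such as <meta name="robots" content="noindex"> in each page's head.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every page to be excluded lives under /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# One Disallow line covers every page inside the directory...
blocked = parser.can_fetch("Googlebot", "/private/report.html")
# ...while pages outside it stay crawlable.
allowed = parser.can_fetch("Googlebot", "/public/index.html")

print("private page crawlable:", blocked)
print("public page crawlable:", allowed)
```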
If my robots.txt file contains a directive for all bots as well as a specific directive for Googlebot, how does Googlebot interpret the line addressed to all bots?
If your robots.txt file contains a generic or weak directive plus a directive specifically for Googlebot, Googlebot obeys the lines specifically directed at it.
For instance, for this robots.txt file:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Googlebot will crawl everything in the site other than pages in the cgi-bin directory.
For this robots.txt file:
User-agent: *
Disallow: /
Googlebot won't crawl any pages of the site.
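The two files above can also be checked locally. The following is a sketch using Python's standard urllib.robotparser module, which is an assumption of this example and is not Googlebot's own parser. For these two files it applies the same precedence: a group addressed to a specific user agent wins over the group addressed to all bots. The page paths and the "SomeOtherBot" name are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# First file from the post: a generic group plus a Googlebot-specific group.
FIRST = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
"""

# Second file from the post: only the generic group.
SECOND = """\
User-agent: *
Disallow: /
"""

first = parse_robots(FIRST)
second = parse_robots(SECOND)

# Googlebot follows only its own group in the first file.
print(first.can_fetch("Googlebot", "/index.html"))       # True
print(first.can_fetch("Googlebot", "/cgi-bin/search"))   # False
# A bot without its own group falls back to the generic one.
print(first.can_fetch("SomeOtherBot", "/index.html"))    # False
# With no Googlebot group, the generic block applies to Googlebot too.
print(second.can_fetch("Googlebot", "/index.html"))      # False
```

A local check like this is only a quick sanity test. The robots.txt analysis tool remains the reference for how Googlebot itself reads a file.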
If you're not sure how Googlebot will interpret your robots.txt file, you can use our robots.txt analysis tool to test it. You can also test how Googlebot will interpret changes to the file.
For complete information on how Googlebot and Google's other user agents treat robots.txt files, see our webmaster help center.


19 comments:
I really appreciate the 503 status code tip and the robots answers.. I learned a lot in a few short paragraphs!
Robots.txt examples
There is an example of implementing the "503 Service Temporarily Unavailable" at
Instruct Search Engines to come back to site after you finish working on it
Check it out!
Dear Vanessa
Thank you for the educationful info
very much appreciated
regards
Frank
What happens if you have two meta tags in a web page and they conflict? Will Googlebot listen to the first or second meta tag?
Thanks,
Rourke
Every time I look in the Google Webmaster tools, it says googlebot last looked at my site on May 21. I am posting new content at least weekly, and don't understand why it hasn't come back to my site.
I have a blog on blogspot, and have given google the atom feed as a sitemap, which appears to be valid.
Thoughts?
Where do I go to find out why our 12 yo city website www.palmdesert.com was dropped back in Jan 07 and all that I see is,
Home page crawl: Googlebot last successfully accessed your home page on Jan 14, 2007.
Index status: No pages from your site are currently included in Google's index. Indexing can take time. You may find it helpful to review our information for webmasters and webmaster guidelines. [?]
Please let me know what to do.
Art Davis
In my webmaster tools I see Home page crawl:Googlebot last successfully accessed your home page on May 1, 2007
However the pages are being updated on a regular basis. Is there some reason why the date isn't being updated?
Does G ever ban a web site? Do they tell the webmaster of the issue? Why would our site www.palmdesert.com which always had great indexing get dropped completely for over 10 months?
Is there a way of telling Googlebot to only visit once a day, for example?
Does it recognise Crawl-Delay or Visit-Time commands in the robots.txt file, which would enable this control?
I am trying to figure out why google will not visit my site. It's been 2 and half months since a googlebot has been by.
http://www.atmospheretv.com
Mike
So, the webmaster dashboard says my home page has not been crawled since 7/31. I look at the list of bad URLs, and the first one is the homepage (www.magazine-agent.com) and the result is a 400 Bad Request. I immediately open a few browsers just to make sure it's just Google. It is.
Upon reading the SPECIFICATION for the INTERNET, a properly constructed request to the server appears to be:
GET / HTTP/1.1
Host: www.magazine-agent.com
And lo and behold, if you make this request, you are greeted with the homepage.
Google documentation specifically states your web server must support the If-Modified-Since header. So I try that one:
GET / HTTP/1.1
If-Modified-Since: Tue, 18 Sep 2007 01:01:01 GMT
Host: www.magazine-agent.com
And again, since we recompile daily, it works.
Some people don't pass the HTTP/1.1. Or pass a bad version. Those will cause our site to fail.
Some people use get instead of GET. Those will not cause our site to fail.
In short, our server is following the specification. The crawler appears to be ignoring it. I can create a 400 Bad Request 100 different ways, but none of them follow the specification for HTTP.
If someone can tell me what's wrong with our server, or confirm that the problem lies with Google's crawler incorrectly forming a request which works fine for other configurations, but not our own, that would be great.
Also, since I don't know if responses get mailed, I would greatly appreciate an email as well to shawn@magazineagent.com.
Thanks,
Thanks for the insights. One question:
I am planning to use mod_rewrite to hide ugly urls
e.g. /places/coffee-house points to /index.php?pid=43
Each such mod_rewrite url has a unique content. How do I make sure that the googlebot indexes these pages?
Do I need to physically create the url path (e.g. mkdir places/coffee-house)?
Thanks.
unmesh@sadakmap.com
Google has suddenly started showing older cache of my website. I checked the site with Google Webmasters tools and found that "robots.txt file is unreachable" due to which Googlebot has not crawled my website after Aug 15, 2007.
I've verified the robots.txt file and it is correct. I've not blocked a single page in this file. Can anyone tell me why this error is coming up? What about the sites that do not have a robots.txt file and still get indexed easily? Please tell me how I can rectify this error and make Googlebot crawl my website.
Hi folks--
If you have site-specific questions (Why is X happening with my site/in my webmaster tools account?), please search our Webmaster Help Group and post your question there—along with your site's URL—if it hasn't already been answered.
Regarding Richard's crawl rate question: if you feel Googlebot is crawling you too fast you can use our crawl rate tool to ask us to crawl you slower.
A few months ago a robots.txt file with a disallow command was maliciously planted on the root of my website. The website (www.flowers-israel.net) has been totally removed from the Google index. Since then I have tried several things to get listed again, including the "reconsider tool". Google Sitemaps shows that there are no URLs restricted by Googlebot and that 181 URLs are submitted. The problem is that nothing is listed in the index.
Any other ideas on how to get listed again?
Thanks in advance,
Eyal.
Our website (www.mingloo.com) doesn't have all its pages indexed, and our site has been up for over 6 months now. Site:www.mingloo.com on Yahoo shows over 800 pages, but on Google it only returns 154 pages indexed. Please advise what I am supposed to do to ensure the site is indexed in a timely manner.
Thanks
Deepak
Thank you for the 503 status tip.
People having problems with Google Indexing, some quick tips:
1) Check there is a 200 status being returned in the header.
2) Check your html - mingloo, I note you have html code outside of the html tags. Run your site through a html validator, and fix as many errors as you can.
3) Take a look at sitemaps. Google may not actually be able to find your content through your website. Sitemaps can help show what you've got available.
My forum http://PS3gamechat.forumotion.com has a bot on it. What does it do? Does it submit it to Google or what?
Hi everyone,
Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.
Thanks and take care,
The Webmaster Central Team