Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Debugging blocked URLs

Tuesday, September 19, 2006 at 4:36 PM

Vanessa's been posting a lot lately, and I'm starting to feel left out. So here's my tidbit of wisdom for you: I've noticed a couple of webmasters confused by "blocked by robots.txt" errors, and I wanted to share the steps I take when debugging robots.txt problems:

A handy checklist for debugging a blocked URL

Let's assume you are looking at crawl errors for your website and notice a URL restricted by robots.txt that you weren't intending to block:
http://www.example.com/amanda.html URL restricted by robots.txt Sep 3, 2006

Check the robots.txt analysis tool
The first thing you should do is go to the robots.txt analysis tool for that site. Make sure you are looking at the correct site for that URL, paying attention that you are looking at the right protocol and subdomain. (Subdomains and protocols may have their own robots.txt file, so https://www.example.com/robots.txt may be different from http://example.com/robots.txt and may be different from http://amanda.example.com/robots.txt.) Paste the blocked URL into the "Test URLs against this robots.txt file" box. If the tool reports that it is blocked, you've found your problem. If the tool reports that it's allowed, we need to investigate further.
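If you want to reproduce this check outside the tool, here is a minimal sketch in Python. It assumes the standard library's urllib.robotparser, whose matching rules are simpler than Googlebot's and may disagree with the analysis tool in edge cases. The helper names and the sample robots.txt content are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the same protocol and host as page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_txt: str, page_url: str, user_agent: str = "Googlebot") -> bool:
    """Test page_url against the given robots.txt text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Each protocol and subdomain maps to its own robots.txt file.
print(robots_url_for("http://www.example.com/amanda.html"))
print(robots_url_for("https://amanda.example.com/amanda.html"))

# Test URLs against the sample rules.
print(is_allowed(SAMPLE_ROBOTS_TXT, "http://www.example.com/amanda.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "http://www.example.com/private/notes.html"))
```

Pasting the wrong site's robots.txt into a check like this gives a misleading answer, which is why it matters to derive the robots.txt location from the exact protocol and host of the blocked URL.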

At the top of the robots.txt analysis tool, take a look at the HTTP status code. If we are reporting anything other than a 200 (Success) or a 404 (Not found) then we may not be able to reach your robots.txt file, which stops our crawling process. (Note that you can see the last time we downloaded your robots.txt file at the top of this tool. If you make changes to your file, check this date and time to see if your changes were made after our last download.)
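The status code rule above can be summarized in a few lines of Python. This is only a sketch of the rule as described in this post, not of Googlebot's actual implementation; the function name and the returned labels are hypothetical.

```python
def robots_fetch_outcome(status_code: int) -> str:
    """Classify the HTTP status returned for robots.txt, per the rule above."""
    if status_code == 200:
        # The file was downloaded; its rules decide what is blocked.
        return "rules-apply"
    if status_code == 404:
        # No robots.txt file exists, so nothing is blocked by it.
        return "no-robots-file"
    # Anything else: the file may be unreachable, which stops crawling.
    return "possibly-unreachable"


for code in (200, 404, 500):
    print(code, robots_fetch_outcome(code))
```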

Check for changes in your robots.txt file
If these look fine, you may want to check and see if your robots.txt file has changed since the error occurred by checking the date to see when your robots.txt file was last modified. If it was modified after the date given for the error in the crawl errors, it might be that someone has changed the file so that the new version no longer blocks this URL.
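The two date comparisons mentioned so far (file modified after the error date, and file modified after the last download) can be sketched as follows. The function name and the timestamps are made up for illustration; only the Sep 3, 2006 error date comes from the example above.

```python
from datetime import datetime


def explain_timing(error_seen: datetime,
                   robots_modified: datetime,
                   last_download: datetime) -> list:
    """Return notes on how robots.txt timing may explain a blocked-URL report."""
    notes = []
    if robots_modified > error_seen:
        # The error may have been caused by an older version of the file.
        notes.append("robots.txt changed after the error was recorded")
    if robots_modified > last_download:
        # The crawler has not yet downloaded the current version.
        notes.append("robots.txt changed after the last download")
    return notes


# Hypothetical timestamps, for illustration only.
notes = explain_timing(
    error_seen=datetime(2006, 9, 3),
    robots_modified=datetime(2006, 9, 10),
    last_download=datetime(2006, 9, 12),
)
print(notes)
```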

Check for redirects of the URL
If you can be certain that this URL isn't blocked, check to see if the URL redirects to another page. When Googlebot fetches a URL, it checks the robots.txt file to make sure it is allowed to access the URL. If the robots.txt file allows access to the URL, but the URL returns a redirect, Googlebot checks the robots.txt file again to see if the destination URL is accessible. If at any point Googlebot is redirected to a blocked URL, it reports that it could not get the content of the original URL because it was blocked by robots.txt.

Sometimes this behavior is easy to spot because a particular URL always redirects to another one. But sometimes this can be tricky to figure out. For instance:
  • Your site may not have a robots.txt file at all (and therefore, allows access to all pages), but a URL on the site may redirect to a different site, which does have a robots.txt file. In this case, you may see URLs blocked by robots.txt for your site (even though you don't have a robots.txt file).
  • Your site may prompt for registration after a certain number of page views. You may have the registration page blocked by a robots.txt file. In this case, the URL itself may not redirect, but if Googlebot triggers the registration prompt when accessing the URL, it will be redirected to the blocked registration page, and the original URL will be listed in the crawl errors page as blocked by robots.txt.
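The redirect behavior described in this section can be simulated without touching the network. The sketch below is an assumption-laden illustration, not Googlebot's code: the redirect table, the per-site robots.txt contents, the host members.example.net, and the function names are all hypothetical, and Python's urllib.robotparser stands in for the real matching logic.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content per protocol and host ("" means no rules).
ROBOTS_BY_ORIGIN = {
    "http://www.example.com": "",
    "http://members.example.net": "User-agent: *\nDisallow: /register\n",
}

# Hypothetical redirects: source URL -> destination URL.
REDIRECTS = {
    "http://www.example.com/amanda.html": "http://members.example.net/register",
}


def origin_of(url: str) -> str:
    """Return the protocol and host that decide which robots.txt applies."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Check url against the robots.txt of its own protocol and host."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_BY_ORIGIN.get(origin_of(url), "").splitlines())
    return parser.can_fetch(user_agent, url)


def first_blocked_hop(start_url: str, max_hops: int = 10):
    """Follow redirects from start_url; return the first blocked URL, or None."""
    url = start_url
    for _ in range(max_hops):
        if not allowed(url):
            return url
        next_url = REDIRECTS.get(url)
        if next_url is None:
            return None  # reached the final page without hitting a block
        url = next_url
    return None  # gave up after max_hops redirects


print(first_blocked_hop("http://www.example.com/amanda.html"))
```

In this simulation the starting site has no robots.txt rules at all, yet the original URL would still be reported as blocked, because the redirect lands on a URL that another site's robots.txt disallows.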

Ask for help
Finally, if you still can't pinpoint the problem, you might want to post on our forum for help. Be sure to include the URL that is blocked in your message. Sometimes it's easier for other people to notice oversights you may have missed.

Good luck debugging! And by the way -- unrelated to robots.txt -- make sure that you don't have "noindex" meta tags at the top of your web pages; those also result in Google not showing a web site in our index.
The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

7 comments:

ATW said...

In Sri Lanka, most of the ISPs appear to have blocked the website www.tamilnet.com, but we gather information from that site. So I need to know whether I can access the site in some alternative way. Your response is very much appreciated.

Loan said...

This problem can be resolved by using a "secure proxy server" located outside Sri Lanka, preferably run by someone you can trust, and who also has a secure 'https' server constantly running.

The following is the safest and best solution, but it requires the co-operation of a friend with a good "always on connection" (e.g. Cable Broadband) located outside Sri Lanka installing these two applications on their PC:
1. (See - http://www.jmarshall.com/tools/cgiproxy/ for info' and download of proxy)
"...Through it, you can retrieve any resource that is accessible from the server it runs on. This is useful when your own access is limited, but you can reach a server that in turn can reach others that you can't. In addition, the user is kept as anonymous as possible from any servers..."
2. (For an easy self-installing "server combi" which works with the "MS Windows platform" - see: http://sourceforge.net/project/showfiles.php?group_id=93507 and http://apache2triad.net/ )

Alternatively, if you can't do this (as above) there are many proxies available by google-ing "nph-proxy", however -
"Some Proxy Servers are very Bad!
Not all proxy servers do as they claim and in fact, there are a ton of junk proxy servers out there that give people a false sense of security or worse, record everything you do in hopes to score a password or two!" (paragraph from http://www.auditmypc.com/anonymous-surfing.asp )
Another problem with most of these proxies is that they're not using SSL encryption (the URL needs to begin with "https"), possibly allowing them to 'fall foul' of an ISP's "banned words and phrases" filtering.

If you're still stuck, I run a "secure proxy" which I allow trusted friends and family to use on the condition that they do not divulge its location (on the web) to others (because if the whole world tried to use the thing it could get overwhelmed and crash!)
Contact me ("Editor") through my website ( https://loanranger.no-ip.org/crisis%20loan/ ) if you wish to surf via my proxy.

Everymatter said...

The following URLs of my blog are restricted by robots.txt.

Please help me out.


http://everymatter.blogspot.com/search/label/Corruption
http://everymatter.blogspot.com/search/label/Internet
http://everymatter.blogspot.com/search/label/Nuclear

sandeep

Loan said...

To sandeep: Help is available at "http://www.google.com/support/webmasters/bin/answer.py?answer=35237". However, I've looked at your robots.txt file and can tell you that "search", along with the files and directories within the directory named "search", is being restricted by the lines -
"User-agent: *
Disallow: /search" - and more specifically by the line "Disallow: /search".
I.e. URLs beginning with "http://everymatter.blogspot.com/search" cannot be crawled by Googlebot. You can fix this by removing these 2 "problem" lines from your robots.txt file. After you have done this, "Google Webmaster Tools" can take a few weeks or so to reflect the change, because that could be how long it takes to re-index your entire site (page by page).
Nonetheless, you can use "Analyze robots.txt" (Tools menu) to check whether the previously restricted URLs can now be accessed. The page which does this has 2 text boxes; the first is your sitemap, the second is where you can paste in (1 per line) the URLs to check. MAKE SURE THAT THE SITEMAP SHOWN IS EXACTLY THE SAME AS YOUR REVISED ONE; this may require you to edit it in the text box.
Click "Check" and the result will tell you whether the URLs are now allowed and, if they are not, it will indicate which lines in your "robots.txt" are the restrictive ones.

I hope this helps you, Loan Ranger. Home: http://loanranger.no-ip.org/crisis%20loan/

Susan Moskwa said...

Actually, Sandeep, every blog hosted on blogspot.com has the /search directory blocked by robots.txt. You can find more information about why here (see my answer near the bottom of the page). It's not something you need to worry about.

Loan said...

Hi sandeep, please refer to the excellent (previous) post from "susan" because it is specific to blogspot.com sites, whereas my answer is applicable to webmasters in general.
Loan Ranger -
Home

Google Webmaster Central said...

Hi everyone,

Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.

Thanks and take care,
The Webmaster Central Team