Wednesday, November 24, 2010 at 1:16 PM
Webmaster level: All
Do you know how Google's crawler, Googlebot, handles conflicting directives in your robots.txt file? Do you know how to prevent a PDF file from being indexed? Do you know Googlebot’s favorite song? The answers to these questions (except for the last one :)), along with lots of other information about controlling the crawling and indexing of your site, are now available on code.google.com:
Controlling crawling and indexing

Now site owners have a comprehensive resource where they can learn about robots.txt files, robots meta tags, and X-Robots-Tag HTTP header directives. Please share your comments, and if you have questions you can post them in our Webmaster Help Forum.
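As a rough illustration of the PDF question above: a robots meta tag cannot be placed inside a PDF, so the X-Robots-Tag HTTP header is the usual way to ask for a non-HTML file to stay out of the index. The following is a minimal sketch using only Python's standard library; the names `robots_headers_for`, `NoIndexPDFHandler`, and `serve`, as well as the port number, are hypothetical and chosen for this example only.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
from urllib.parse import urlsplit


def robots_headers_for(request_path):
    """Return extra response headers for a request path.

    Assumption for this sketch: every URL whose path ends in .pdf
    should carry "X-Robots-Tag: noindex".
    """
    path = urlsplit(request_path).path  # drop any query string
    if path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPDFHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the header to PDF responses."""

    def end_headers(self):
        # Add our headers just before the header section is closed.
        for name, value in robots_headers_for(self.path).items():
            self.send_header(name, value)
        super().end_headers()


def serve(port=8000):
    """Serve the current directory (hypothetical helper, not called here)."""
    HTTPServer(("", port), NoIndexPDFHandler).serve_forever()
```

Calling `serve()` from a directory containing a PDF and requesting it with a tool such as `curl -I` should show the extra header. Note that a crawler can only see the header if the URL is not blocked in robots.txt.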


25 comments:
Googlebot's favorite song appears to be:
I Will Derive! (LOL)
http://www.youtube.com/watch?v=P9dpTTpjymE
(Based on search "favorite song" under videos.)
Well, it's like the same thing we knew before.
Great News... One more update from GWT team.
Thanks for finally providing the much needed documentation.
We have previously identified the 'Order of precedence for group-member records', which chooses the directive with the greatest number of characters in the directive path.
http://blog.semetrical.com/googles-secret-approach-to-robots-txt/
We don't think this makes much sense, and it is not intuitive to anyone except the person who programmed it.
Surely having Allow always beat Disallow, or using the most specific match, would make more sense.
@Christopher Great timing :-). "Allow" always beating "Disallow" would make some site structures impossible, so it's important that there is something more than that. It's good to see that you have covered this as well.
Hi everyone
I started a thread at http://www.google.com/support/forum/p/Webmasters/thread?tid=16803f4e0cc16716&hl=en for questions and comments.
Brilliant :) Shame this wasn't around yesterday, when I needed it more, though.
Can you clarify the following issue please?
In the robots.txt specification: http://code.google.com/web/controlcrawlindex/docs/robots_txt.html you mentioned that the following URL:
http://www.domain.com/page.htm
is *undefined* if robots.txt contains the following rules:
User-agent: *
Disallow: /page
Allow: /*.htm
When you test this example in the Webmaster Tools “Test robots.txt” tool, though, the results are slightly different and clearly give the upper hand to Allow in this case. See results below:
* Allowed by line 4: Allow: /*.htm
Can you confirm which one is correct: the specification or the Webmaster Tools “Test robots.txt” tool?
@momo If the outcome is undefined, robots.txt evaluators may choose to either allow or disallow crawling. Because of that, it's not recommended to rely on either outcome being used across the board.
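The exchange above can be sketched in code. This is a minimal illustration, not Googlebot's implementation: it assumes the precedence rule described earlier in the comments (the matching directive with the longest path wins), reports "undefined" when conflicting matches involve a wildcard as stated in the reply above, and, as an assumption of this sketch, lets Allow win an exact tie. The function names are hypothetical.

```python
import re


def _pattern_to_regex(path):
    """Translate a robots.txt path into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    body = ".*".join(re.escape(part) for part in path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def evaluate(rules, url_path):
    """Return 'allow', 'disallow', or 'undefined' for url_path.

    rules is a list of (kind, path) pairs for one user-agent group,
    where kind is 'allow' or 'disallow'.
    """
    matches = [
        (kind, path)
        for kind, path in rules
        if path and _pattern_to_regex(path).match(url_path)
    ]
    if not matches:
        return "allow"  # nothing matched, so crawling is permitted
    kinds = {kind for kind, _ in matches}
    if len(kinds) == 1:
        return kinds.pop()  # no conflict between matching rules
    if any("*" in path for _, path in matches):
        return "undefined"  # conflicting matches involving a wildcard
    # Plain-prefix conflict: the longest matching path wins.
    longest = max(len(path) for _, path in matches)
    winners = {kind for kind, path in matches if len(path) == longest}
    # Assumption of this sketch: Allow wins an exact tie.
    return "allow" if "allow" in winners else "disallow"


rules = [("disallow", "/page"), ("allow", "/*.htm")]
print(evaluate(rules, "/page.htm"))  # prints "undefined"
```

Because an evaluator is free to pick either outcome in the undefined case, a tool reporting "Allowed" for this URL does not contradict the specification; it simply shows one evaluator's choice.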
Great thanks! I'm not alone !
Is this new...
I have a lot of love for the googlebot.
I have learned about X-Robots-Tag. That's a new concept.
This line:
" but in fact if you want all your content to be crawled, you don't need a robots.txt file at all"
Surely that contradicts adding an XML Sitemap reference to the robots.txt file?
hello
hai,,
This is a good thing, because we can learn a lot about how the robot runs later =.=
Hi dear friends,
I have a question about the keyword detection used by Webmaster Tools.
Google detects invalid keywords on my site like "to", "for", "and", etc.
Those words are used repeatedly on all internet sites. It seems that Google ignores my keywords meta tag. How can I fix this problem?
Dude... Google does not index or read the keywords meta tag anymore. This is due to keyword stuffing done by webmasters.
Focus your on-page optimization on other SEO factors like title tags, h1-h6 tags, bold and strong tags, img alt attributes, and relevant content.
Why do all the new changes take a year to become active worldwide...?
GoogleBOT > http://youtu.be/vIi6Nxzoyp4
Glad to see this all in a single place :)
Is it safe to allow the ActiveX control on my computer for msn2go?
Excellent, GWT team, this is a good thing. I'm not alone.
Hey peeps, I wonder if you can help me. For some reason my website hasn't been indexed, and the only link or address that comes up when you search on Google using our keywords is a strange address that brings up a component link, which makes people think our website is crap or broken. I don't know a lot about this stuff and would really appreciate some help.
Hi everyone,
Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Central Help Forum.
Thanks and take care,
The Webmaster Central Team