Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Controlling crawling and indexing now documented on code.google.com

Wednesday, November 24, 2010 at 1:16 PM

Webmaster level: All

Do you know how Google's crawler, Googlebot, handles conflicting directives in your robots.txt file? Do you know how to prevent a PDF file from being indexed? Do you know Googlebot’s favorite song? The answers to these questions (except for the last one :)), along with lots of other information about controlling the crawling and indexing of your site, are now available on code.google.com:

Controlling crawling and indexing



Now site owners have a comprehensive resource where they can learn about robots.txt files, robots meta tags, and X-Robots-Tag HTTP header directives. Please share your comments, and if you have questions you can post them in our Webmaster Help Forum.
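As a rough illustration of the PDF question above: a page can carry a robots meta tag, but a PDF cannot, so the X-Robots-Tag HTTP header is the usual way to ask for it not to be indexed. The sketch below is a minimal Python WSGI example under stated assumptions. The names `robots_headers_for` and `app` are hypothetical, and a real site would more likely set this header in its web server configuration.

```python
def robots_headers_for(path):
    """Return extra response headers for a request path (hypothetical helper)."""
    if path.lower().endswith(".pdf"):
        # "noindex" asks crawlers that honor X-Robots-Tag not to index the file.
        return {"X-Robots-Tag": "noindex"}
    return {}


def app(environ, start_response):
    """Toy WSGI app; a real one would serve the actual file contents."""
    path = environ.get("PATH_INFO", "/")
    # Assumption: a generic content type, just to keep the sketch short.
    headers = [("Content-Type", "application/octet-stream")]
    headers.extend(robots_headers_for(path).items())
    start_response("200 OK", headers)
    return [b""]
```

Note that the crawler has to be able to fetch the PDF to see the header, so the file should not also be blocked in robots.txt.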

The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

24 comments:

John Fenton said...

Googlebot's favorite song appears to be:

I Will Derive! (LOL)

http://www.youtube.com/watch?v=P9dpTTpjymE

(Based on search "favorite song" under videos.)

Profesionales said...

Well, it's the same thing we knew before.

Sankar Datti said...

Great news... One more update from the GWT team.

Christopher Evans said...

Thanks for finally providing the much-needed documentation.

We have previously identified the 'Order of precedence for group-member records', which chooses the directive with the greatest number of characters in the directive path.

http://blog.semetrical.com/googles-secret-approach-to-robots-txt/

We don't think this makes much sense, and it is not intuitive to anyone except the person who programmed it.

Surely having Allow always beat Disallow, or using the most specific match, would make more sense.

John Mueller said...

@Christopher Great timing :-). "Allow" always beating "Disallow" would make some site structures impossible, so it's important that there is something more than that. It's good to see that you have covered this as well.
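The precedence rule discussed in the two comments above can be sketched in a few lines of Python. This is only an illustration, not Google's implementation: it handles plain path prefixes (no wildcards), the function name `is_allowed` and the example paths are made up, and returning `None` for a tie between equally long Allow and Disallow rules is an assumption of this sketch.

```python
def is_allowed(rules, path):
    """Decide crawling for `path` using the longest matching rule.

    `rules` is a list of (directive, prefix) tuples, where directive is
    "allow" or "disallow". Returns True, False, or None for a tie.
    """
    # Keep only rules whose (non-empty) prefix matches the start of the path.
    matches = [(len(prefix), directive)
               for directive, prefix in rules
               if prefix and path.startswith(prefix)]
    if not matches:
        return True  # no rule applies, so crawling is allowed

    longest = max(length for length, _ in matches)
    winners = {directive for length, directive in matches if length == longest}
    if len(winners) > 1:
        return None  # assumption: treat equal-length conflicts as undefined
    return winners == {"allow"}


rules = [("allow", "/folder"), ("disallow", "/folder/private")]
print(is_allowed(rules, "/folder/page.html"))       # True
print(is_allowed(rules, "/folder/private/a.html"))  # False
```

With an "Allow always wins" rule, the second URL would be crawlable, and there would be no way to open up /folder while keeping /folder/private closed.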

John Mueller said...

Hi everyone
I started a thread at http://www.google.com/support/forum/p/Webmasters/thread?tid=16803f4e0cc16716&hl=en for questions and comments.

andy said...

Brilliant :) Shame this wasn't around yesterday, when I needed it more, though.

momo said...

Can you clarify the following issue please?

In the robots.txt specification: http://code.google.com/web/controlcrawlindex/docs/robots_txt.html you mentioned that the following URL:
http://www.domain.com/page.htm

is *undefined* if robots.txt contains the following rules:
User-agent: *
Disallow: /page
Allow: /*.htm

When you test this example in the Webmaster Tools “Test robots.txt” tool, though, the results are slightly different and clearly give the upper hand to Allow in this case. See results below:
* Allowed by line 4: Allow: /*.htm

Can you confirm which one is correct: the specification or the Webmaster Tools “Test robots.txt” tool?

John Mueller said...

@momo If the outcome is undefined, robots.txt evaluators may choose to either allow or disallow crawling. Because of that, it's not recommended to rely on either outcome being used across the board.
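To see why the example in question is a conflict at all, it helps to check that both rules match the URL path. The Python sketch below translates a robots.txt path pattern into a regular expression, assuming the two wildcards covered in the documentation (`*` for any run of characters and a trailing `$` for end of path). The helper name `rule_matches` is hypothetical, and the sketch deliberately does not decide which rule wins.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches the URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each "*" match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which mirrors prefix matching.
    return re.match(regex, path) is not None


path = "/page.htm"
print(rule_matches("/page", path))   # True: Disallow: /page applies
print(rule_matches("/*.htm", path))  # True: Allow: /*.htm applies too
```

Since both rules apply to the same path, an evaluator has to pick one, which is why relying on either outcome is not recommended.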

Jean Marc said...

Great, thanks! I'm not alone!

craig said...

Is this new...
I have a lot of love for the Googlebot.

JohnGeogie said...

I have learned about X-Robots-Tag. That's a new concept.

David said...

This line:

" but in fact if you want all your content to be crawled, you don't need a robots.txt file at all"

Surely this contradicts the practice of adding an XML sitemap reference in the robots.txt file?

anand kumar said...

hello

lagi belajar ngeblog said...

Hi.

Yabin Lee said...

This is a good thing, because now we can learn much more about how the robot works =.=

Mostatil said...

Hi dear friends,

I have a question about keywords detection used by webmaster tools.

Google detects invalid keywords on my site, like "to", "for", "and", etc.

Those words are used repeatedly on all internet sites. It seems that Google ignores my keywords meta tag. How can I fix this problem?

R.Baylon said...

Dude... Google does not index or read the keywords meta tag anymore. This is due to keyword stuffing done by webmasters.

Focus your on-page optimization on other SEO factors like Title tags, h1-h6 tags, bold and strong tags, Img alt tags, and relevant content.

Tal said...

Why do all the new changes take a year to become active worldwide...?

Aposition said...

GoogleBOT > http://youtu.be/vIi6Nxzoyp4

Jonathan Becker said...

Glad to see this all in a single place :)

Marcio said...

Is it safe to enable ActiveX controls on my computer for msn2go?

www.masthideals.com said...

Excellent, GWT team. This is a good thing. I'm not alone.

hh mag said...

Hey peeps, I wonder if you can help me. For some reason my website hasn't been indexed, and the only link or address that comes up when you search on Google using our keywords is a strange address that brings up a component link, which makes people think our website is crap or broken. I don't know a lot about this stuff and would really appreciate some help.