Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

New robots.txt feature and REP Meta Tags

Wednesday, August 15, 2007 at 4:01 PM



We've improved Webmaster Central's robots.txt analysis tool to recognize Sitemap declarations and relative URLs. Earlier versions weren't aware of Sitemaps at all, and understood only absolute URLs; anything else was reported as "Syntax not understood". The improved version now tells you whether your Sitemap's URL and scope are valid. You can also test against relative URLs with a lot less typing.

Reporting is better, too. You'll now be told of multiple problems per line if they exist, unlike earlier versions which only reported the first problem encountered. And we've made other general improvements to analysis and validation.

Imagine that you're responsible for the domain www.example.com and you want search engines to index everything on your site, except for your /images folder. You also want to make sure your Sitemap gets noticed, so you save the following as your robots.txt file:

disalow images

user-agent: *
Disallow:

sitemap: http://www.example.com/sitemap.xml

You visit Webmaster Central to test your site against the robots.txt analysis tool using these two test URLs:

http://www.example.com
/archives
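
The same kind of check can be approximated locally. The following is a minimal sketch using Python's standard urllib.robotparser module; it is not the Webmaster Central tool and reports no syntax errors. The /images/logo.png test path, the helper names, and the corrected file are assumptions added for illustration. This parser silently skips the line it cannot understand, so the /images folder is not blocked by the file as saved.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the example above, including the misspelled first line.
ROBOTS_AS_SAVED = """\
disalow images

user-agent: *
Disallow:

sitemap: http://www.example.com/sitemap.xml
"""

# Assumption: one possible corrected version that blocks the /images folder.
ROBOTS_CORRECTED = """\
user-agent: *
Disallow: /images

sitemap: http://www.example.com/sitemap.xml
"""


def load(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="Googlebot"):
    """Return True if user_agent may crawl url according to parser."""
    return parser.can_fetch(user_agent, url)


saved = load(ROBOTS_AS_SAVED)
corrected = load(ROBOTS_CORRECTED)

# Hypothetical test path inside the folder that was meant to be blocked.
image_url = "http://www.example.com/images/logo.png"

print(is_allowed(saved, image_url))        # the misspelled rule is skipped
print(is_allowed(corrected, image_url))    # the corrected rule applies
print(is_allowed(corrected, "/archives"))  # relative test URL
print(saved.site_maps())                   # needs Python 3.8 or later
```

Comparing the first two results shows why the tool's error reporting matters: a single misspelled directive changes what crawlers are allowed to fetch, and nothing in the file itself signals the problem.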

Earlier versions of the tool would have reported this:



The improved version tells you more about that robots.txt file:





We also want to make sure you've heard about the new unavailable_after meta tag announced by Dan Crow on the Official Google Blog a few weeks ago. This allows for a more dynamic relationship between your site and Googlebot. On www.example.com, for instance, any time you have a temporarily available news story, a limited-offer sale, or a promotion page, you can specify the exact date and time you want specific pages to stop being crawled and indexed.

Let's assume you're running a promotion that expires at the end of 2007. In the <head> section of the page www.example.com/2007promotion.html, you would use the following:

<META NAME="GOOGLEBOT"
CONTENT="unavailable_after: 31-Dec-2007 23:59:59 EST">
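
If many pages expire at different times, the tag can be generated rather than typed by hand. Here is a minimal Python sketch that reproduces only the date format shown in the example above; the function names are hypothetical, and the time zone label is passed through as plain text with no conversion.

```python
from datetime import datetime

# English month abbreviations, spelled out so the result does not depend on
# the system locale (strftime("%b") would).
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def unavailable_after_value(expires, tz_label):
    """Format a datetime like the example above: 31-Dec-2007 23:59:59 EST."""
    stamp = "{:02d}-{}-{:04d} {:02d}:{:02d}:{:02d}".format(
        expires.day, MONTHS[expires.month - 1], expires.year,
        expires.hour, expires.minute, expires.second)
    return "unavailable_after: {} {}".format(stamp, tz_label)


def googlebot_meta_tag(expires, tz_label="EST"):
    """Build the META tag for a page that expires at the given local time."""
    return '<META NAME="GOOGLEBOT" CONTENT="{}">'.format(
        unavailable_after_value(expires, tz_label))


promotion_ends = datetime(2007, 12, 31, 23, 59, 59)
print(googlebot_meta_tag(promotion_ends))
```

The datetime is treated as local time in whatever zone the label names, so the caller is responsible for keeping the two consistent.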


The second piece of exciting news is the new X-Robots-Tag directive, which adds Robots Exclusion Protocol (REP) META tag support for non-HTML pages! Finally, you can have the same control over your videos, spreadsheets, and other indexed file types. Using the example above, let's say your promotion page is in PDF format. For www.example.com/2007promotion.pdf, you would send the following in the HTTP response header:

X-Robots-Tag: unavailable_after: 31 Dec 2007 23:59:59 EST
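
Because a PDF has no HTML head section, the directive has to travel in the HTTP response itself. The sketch below shows where that header lives, using a tiny WSGI application built on Python's standard library. The app, the response_headers helper, and the placeholder file contents are all assumptions for illustration; on a real site the header would more typically be set in the web server's own configuration.

```python
from wsgiref.util import setup_testing_defaults

# Header value copied from the example above.
PDF_ROBOTS_TAG = "unavailable_after: 31 Dec 2007 23:59:59 EST"


def app(environ, start_response):
    """Tiny WSGI app that tags PDF responses with X-Robots-Tag."""
    path = environ.get("PATH_INFO", "/")
    if path.lower().endswith(".pdf"):
        headers = [("Content-Type", "application/pdf"),
                   ("X-Robots-Tag", PDF_ROBOTS_TAG)]
        body = b"%PDF-1.4 placeholder"  # stand-in for real file contents
    else:
        headers = [("Content-Type", "text/html; charset=utf-8")]
        body = b"<html><body>Promotion</body></html>"
    headers.append(("Content-Length", str(len(body))))
    start_response("200 OK", headers)
    return [body]


def response_headers(path):
    """Call the app directly and return its response headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured.update(headers)

    app(environ, start_response)
    return captured


print(response_headers("/2007promotion.pdf").get("X-Robots-Tag"))
print(response_headers("/2007promotion.html").get("X-Robots-Tag"))
```

To try it in a browser or with a command-line HTTP client, the app can be served locally with wsgiref.simple_server.make_server("", 8000, app).serve_forever().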


Remember, REP meta tags can be useful for implementing noarchive, nosnippet, and now unavailable_after tags for page-level instruction, as opposed to robots.txt, which is controlled at the domain root. We get requests from bloggers and webmasters for these features, so enjoy. If you have other suggestions, keep them coming. Any questions? Please ask them in the Webmaster Help Group.
The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

25 comments:

Vahid Chaychi said...

Hi,

Thanks Google!

I really needed all of these features.

Keep on your good work.

Best regards,
Vahid

Ghosty said...

As wonderful as these things are, it would be really wonderful if you folks could give more usability to us BlogSpot users. This is all great stuff for domain owners, but we get left behind a lot. Even the sitemaps thing is not like it should be.

Jennifer Mathews Somogyi said...

How effective is the sitemap in getting pages indexed? I haven't found that pages are indexed any faster when I submit a sitemap than when I don't.

I think it is great that we can test against relative URLs in the webmaster tools, but how is this going to help?

I do think the <META NAME="GOOGLEBOT" CONTENT="unavailable_after: 31-Dec-2007 23:59:59 EST"> tag is an awesome idea. The larger websites I have worked on run promotions that only last for a limited time, and we create landing pages for them that we want to stop ranking once they are pulled.

capono said...

That should be 'disallow images' instead of 'disalow images'?

Sebastian said...

@Capono That's an intended typo to demonstrate the error reporting.

@Jennifer Depends, sometimes you want to 301 these pages to a similar offering, saving the asset (indexed page) instead of writing it off.

@John Yup, X-Robots-Tags are really sexy :) I've a few pedantic details though:

For www.example.com/2007promotion.pdf, you would use the following in the file's HTML headers:

Typo. Doesn't work because a PDF file has no HTML header, but here is a simple procedure to set the REP tags in the HTTP header.


REP META tags can be useful for implementing noarchive, nosnippet, and now unavailable_after tags for page-level instruction

I thought one can use everything valid in a robots meta tag, including index, noindex, follow, nofollow, all, none, noodp, and noydir if Yahoo joins the party.

BTW there is no way to tell you that you shouldn't offer the "View as HTML" preview, hence I suggest adding the "nopreview" value.

asylumet said...

Speaking of robots.txt: this is in response to the tool for URL removal. I would have commented elsewhere or contacted Google, but the more time passes, the harder it is to find good contact info for them outside of snail mail. It seems that the URL removal tool does not work properly because it is not able to read the robots.txt. Below is the message Google offers.

It states to do "one" of the following. I chose the third option, blocking with robots.txt, which did not work: the tool states that Google cannot access the robots.txt even though it is viewable in a browser and returns proper response headers. I found this on this site and on one other, each with a robots.txt set up as follows.

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

I then decided that maybe I would just password protect the whole thing, making it inaccessible. This returns 403 as opposed to 404, but just the same, the content is surely not accessible.

Not sure what the cause is now so I thought maybe someone else would have an idea or some thoughts about it. Also google may want to take note of the first option below "Requests for the page or imate" should read "Requests for the page or image"

Why was my URL removal request denied?

Your request may have been denied because your content did not meet the eligibility requirements for removal.

To block a page or image from your site, do one of the following, and then submit your removal request:

* Make sure the content is no longer live on the web. Requests for the page or imate you want to remove must return an HTTP 404 (not found) or 410 status code.
* Block the content using a meta noindex tag.
* Block the content using a robots.txt file.

To remove your entire site, or a complete directory, use a robots.txt file.

Ian said...

I find that my sitemaps are constantly coming back with errors about not being valid, but I submit them and test them with W3C and whatever and they seem fine. Only Google has a problem with them. I just can't work out why.

enigma4ever said...

I am sorry to bother you here, but I am confused about what to do. My blog, http://watergatesummer.blogspot.com/, has been listed on Google and usually comes up first; it has been listed that way for over 2 years. Tonight I went to find it and it does not come up first. Please help me figure this out (enigma4ever@earthlink.net). I am fearful that, because of the political issues I blog about, Google has censored me. It happened right after I blogged a very personal post regarding voting, and immediately after that my blog vanished from Google. PLEASE SOMEONE HELP me figure this out. Many people google my blog, but this is going to confuse them. So please, I hope no one has requested that my wonderful blog be removed from the top of the Google list (I have over 1 million hits total in 2 years, so I don't think I should have been removed from the top). My blog is polite and sensitive; it is by a mom/nurse. I hope Google is NOT censoring. If they are, I will write to Keith at Countdown. PLEASE HELP.

Tyler Dockery, www.dockerydesign.com said...

Great News! As a webmaster for larger sites like www(dot)Aspetech(dot)com, I look forward to using this on my new sites!

kd6lvw said...

I too am having problems with my sitemap declaration in the robots.txt file. I have a relative URL listed, and with your recent change last week, you're now flagging this as invalid. Sitemaps.org does not say that the URL must be an absolute one (only that it "should" be), thus relative URLs are valid - and googlebot does seem to be fetching them properly. ...So why are the tools flagging it as an error?

calamity said...

I have a problem with the webmaster tool. I had put all the meta on my page to forbid indexing, I have a robot file at the root of my website.
Once, I used the automatic page to remove my website from google indexing (since it didn't seem to care about my correct meta forbidding), and indeed after a while my website was removed.
But, I discovered today that it is back again.
I tried the webmaster tool, but because of my robots.txt file, the tool cannot work, which would be fine, but the problem is that I cannot access the URL removal link.
So, I don't know how to really forbid my website to google index...

Paula said...

Well, I would be thrilled about these new tools, except that I now have a brand new problem with my site. It seems that one of the pages (http://www.geocities.com/rpcv.geo/intelligence.html) has been listed as a page that can harm a computer, which now makes it pretty hard for anyone to access it via Google. I have no idea why. There is nothing on the page aside from text, a few photos and some links that go to perfectly respectable places like the official CBC site and other parts of my site. No other pages on my site have this problem. No one has ever reported a problem with the page and it's been up for over a year, with a fair number of hits. So, what gives? And who do I go to in order to find out what the heck is going on?

aletsch said...

Is it also possible to point Google to my sitemap using a relative URL?:

Sitemap: sitemap.xml

zammy said...

Regarding the "sitemap" syntax support in the robots.txt, I am not sure if the impact of this has been thought about before.

We actually worry that revealing the filename of our sitemaps file to others (via the robots.txt) would make it easier for scrapers to locate our pages and for our competitors to know more about our important pages than we care to share.

Does anyone share this concern?

rilwis said...

Thanks for your great news!

I have a small question: what's the difference between the classical way to submit a sitemap and the new way to submit a sitemap inside robots.txt?

benry said...

Minor typo in the examples. Robots.txt should be 'disallow' not 'disalow' -- missing an 'L'

Susan Moskwa said...

Actually, benry, that misspelling is intentional (to demonstrate the error reporting feature that was added).

ctbanni said...

Hi All

Need help with robot.txt. Plain English please as I am a complete novice.

I logged on to Analyze robots.txt and it says 404 not found. It also says that the robots.txt is in the box below, but the box is blank. What do I need to do to rectify this problem?

Susan Moskwa said...

This means you don't have a robots.txt file yet.

It's fine not to have one; but if you want one, here are some articles on how to make and upload one. You can also check robotstxt.org for more information.

Alan Doherty said...

considering the unavailable_after:
meta tag and header option
would you consider extending the sitemaps protocol to not just track last modified and changefreq
but to allow the addition of expires-on or unavailable-after as well
so it allows the easy tracking/removal of expired content without googlebot having to GET and receive the 404 from each url when it can simply find the intentionally removed content in the sitemap file

{just notice that i have repeated gets long after content is removed as some eejit elsewhere links to a long dead url}

Roseate said...

Hi,

I am a bit confused about how should I edit my robots.txt for a wordpress blog? ... currently the following appears on it:

User-agent: *
Disallow:

What does that mean, and should I change it or not?

Thanks,
Roseate

Alan Doherty said...

Roseate said...

stuff
your robots.txt is currently fine, it just means all robots {*}
are allowed anywhere
the Disallow: {followed by no uri}
is thus not banning anyone

if you have a sitemap you can add it {per instructions}
if there are some regions you would rather the robots didn't read then you can add them, otherwise it's all fine

Larry Lang said...

Do you need to use robot text files if you want to have all pages indexed? Google has indexed all my pages, but has not given them any rank, just the homepage. Just wondering if this has any bearing. I do have a Google sitemap.
Thanks,
Larry
South Florida Homes

Alan Doherty said...

Larry Lang said...
Do you need to use robot text files if you want to have all pages indexed?

no, robots.txt is for telling robots which pages you don't want indexed

{also to say where your sitemap is with the new extensions}

Google has indexed all my pages, but has not given them any rank, just the homepage. Just wondering if this has any bearing. I do have a Google sitemap.

pagerank is a function of who/how many links there are to a page from around the internet
minus how many obviously paid links it has to it

i would suspect therefore people only link to your frontpage as none of the other pages happens to be interesting enough for anyone to link to it

thus no pagerank

Google Webmaster Central said...

Hi everyone,

Since several months have passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.

Thanks and take care,
The Webmaster Central Team