Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Remove your content from Google

Wednesday, January 09, 2008 at 11:39 AM



Confused about the best uses of robots.txt, nofollow, and the URL removal tool? Wondering how to keep some of your pages off the web? Our webspam lead, Matt Cutts, talks about the best ways to stop Google from crawling your content, and how to remove content from the Google index once we've crawled it.



We love your feedback. Tell us what you think about this video in our Webmaster Help Group.

* Note from Matt: Yes, robots.txt has been around since at least 1996, not 2006. It's hard for me to talk for 12-13 minutes without any miscues. :)



Update: for more information, please see our Help Center articles on removing content.
The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

41 comments:

JLH said...

Good discussion. I'm glad that Matt covers the fact that a page with a noindex meta tag on it that isn't allowed to be crawled via robots.txt does not get read. That seems to elude some people.
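
To make the point above concrete, here is a minimal Python sketch (not Googlebot's actual logic) of why a robots.txt block hides a noindex meta tag: a crawler that honors the Disallow rule never downloads the page, so it never reads the tag. The robots.txt rules, the /private/ path, the sample HTML, and the helper name crawler_sees_noindex are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page; neither comes from the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
PAGE_HTML = '<html><head><meta name="robots" content="noindex"></head></html>'


def crawler_sees_noindex(path, robots_txt=ROBOTS_TXT, html=PAGE_HTML, agent="Googlebot"):
    """Return True only if a polite crawler would fetch the page AND find noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, path):
        # Blocked by robots.txt: the page is never downloaded,
        # so the meta tag inside it is never read.
        return False
    # Crude substring check standing in for real HTML parsing.
    return 'content="noindex"' in html


if __name__ == "__main__":
    for path in ("/private/page.html", "/public/page.html"):
        print(path, crawler_sees_noindex(path))
```

With these sample rules, the tag is only seen on the page under /public/. If the goal is to get a page out of the index via noindex, the page has to stay crawlable.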

beussery said...

Thanks for the great tips Matt and by the way, I really like the happy faces!

Vinz said...

Nice video presentation. Very interesting and informative.
Thanks for the great tips, Matt.
By the way, I have my Web Hosting Provider for my websites. It looks good and very helpful. I hope Google will index my sites.

http://search-engines-web.com/ said...

"We love your feedback. Tell us what you think about this video in our Webmaster Help Group."

Obviously, from that remark, Google is trying to push the Webmaster Help Forum.

But think of how rude that is, to use someone's video, then tell readers to visit another site to comment on it. :-o

Why would they? Wouldn't they want to visit the same site of the owner of that video to give and get direct feedback?!

The Webmaster Help Group just does not have the charisma to be alluring.

Let's not divide and conquer. Let's build and invest in what is working the best and the resource the public has chosen as their favorite!!

After all this time, if they had found the help group more alluring - they would be visiting it.

Ian M said...

It's a real pain that the search engines treat the meta noindex tag differently. This would be a good area for Google to push for harmonization with the other major search engines.

Actually, I was thinking that a 'noindex' option in robots.txt would be the ideal complement - you could order spiders not to crawl the site, and also order search engines not to index the content, all in one location, e.g. something like:

User-agent: *
Disallow: /sekrit/
Noindex: /sekrit/

Jon said...

Is YouTube having problems? Seems like every embedded video I try to play says "We're sorry, this video is no longer available".

Michael Clark said...

(Cross posted in the Forum and here).

You didn't mention sitemap.xml and how that affects a page in the Google Index. If we don't list a page in our sitemap, will it still be included in the Google Index? It's very possible we'll have links to a page not in our sitemap.

I am splitting my web site into two sites. I've moved the content, now I'm using .htaccess to rewrite the URL from PlanetMike.com to the new URL at MichaelClark.name. I'm giving out server code 301, and updating my sitemap.xml. What will happen when the Googlebot next sees one of my pages that has been moved? I assume the 301 will tell Googlebot that the page isn't at PlanetMike.com any more, effectively removing that page from Google. And at the same time the 301 is telling GoogleBot to add the new page to its records for MichaelClark.name. I've been watching both domains pretty closely in the Webmaster Tools area, and it looks like everything is working smoothly.

Thanks, Mike
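
The redirect setup described above can be sketched in Python as a simple lookup that answers "moved permanently" with the new address. This is only an illustration of the 301 logic, not Mike's actual .htaccess rules: the "www." host variant, the /personal/ prefix, and the function name redirect_for are hypothetical, since the comment does not say which paths moved.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOSTS = {"planetmike.com", "www.planetmike.com"}  # "www." variant is an assumption
NEW_HOST = "michaelclark.name"
MOVED_PREFIXES = ("/personal/",)  # hypothetical: which paths moved is not stated


def redirect_for(url):
    """Return (301, new_url) for a moved page, or None if the page stays put."""
    parts = urlsplit(url)
    # hostname is lowercased by urlsplit, so PlanetMike.com matches too.
    if parts.hostname not in OLD_HOSTS or not parts.path.startswith(MOVED_PREFIXES):
        return None
    # Keep scheme, path, and query; swap only the host; drop any fragment.
    new_url = urlunsplit((parts.scheme, NEW_HOST, parts.path, parts.query, ""))
    return 301, new_url


if __name__ == "__main__":
    print(redirect_for("http://PlanetMike.com/personal/bio.html"))
    print(redirect_for("http://planetmike.com/weather/"))
```

Keeping the path and query intact means each old URL maps to exactly one new URL, which is what lets a crawler transfer its record of the page rather than treat it as gone.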

Sebastian said...

@Matt
Thanks for the great explanations!

As for password protected contents, are you sure that you don't index those based on 3rd party signals like ODP listings or strong inbound links?

You totally forgot to mention the neat X-Robots-Tag that allows outputting REP tags like "noindex" even for non-HTML resources like PDFs or videos in the HTTP header. That's an invention Google can be very proud of. :)


@Ian M
Actually, Google experiments with Noindex: in robots.txt, but that's "improvable".


@Google

Currently Google interprets Noindex: in robots.txt as (Disallow: + Noindex:). I think that's completely wrong, because:

1. It's not compliant with the Robots Exclusion Standard.

2. It confuses Webmasters because "noindex" in robots.txt means something completely different than "noindex" in meta tags or HTTP headers.

3. Mixing crawler directives and indexer directives this way is a plain weak point that will produce misunderstandings resulting in traffic losses for Webmasters and less compelling contents available to searchers. All indexer directives (noindex, nofollow, noarchive, noodp, unavailable_after etc.) do require crawling when put elsewhere. I've done Webmaster support for ages and I assure you that Webmasters will not get it. If nobody understands it and adopts it, it's as useless as Yahoo's robots-nocontent class name that only 500 sites on the whole Web make use of.

4. The REP's "noindex" tag has an implicit "follow" that Google ignores in robots.txt for technical reasons (it's impossible to follow links from uncrawled pages). When I put a robots meta tag with a "noindex" value, then Google rightly follows my links, passes PageRank and anchor text to those, and just doesn't list the URL on the SERPs. When I do the same in robots.txt Google behaves totally differently, for no apparent reason. (Of course there's a reason but I want to keep this statement simple.)

Having said all that, I appreciate it very much that Google works on improvements to robots.txt. Kudos to Google! However, please don't assign semantics of crawler directives to established indexer directives; that doesn't work out. I see the PageRank problem, and I think I know a better procedure to solve that. If you're interested, please read my "RFC" linked above. ;)

@all

Do not make use of experimental robots.txt directives unless you really know what you're doing, and that includes monitoring Google's experiment very closely. If you have the programming skills, then better make use of X-Robots-Tags to steer indexing or deindexing of your resources on site level. X-Robots-Tags work with HTML contents as well as with all other content types.


Thanks for your time and have a nice day!
Sebastian
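
As a rough illustration of the X-Robots-Tag approach recommended above, the following Python sketch decides per response whether to add the header. It is a sketch under stated assumptions, not a drop-in server module: the /drafts/ prefix, the .pdf rule, and the function name response_headers are hypothetical policy choices made up for the example.

```python
# Hypothetical policy: which paths and file types to keep out of the index.
NOINDEX_PREFIXES = ("/drafts/",)
NOINDEX_SUFFIXES = (".pdf",)


def response_headers(path, content_type):
    """Build HTTP response headers, adding X-Robots-Tag where the policy says so."""
    headers = {"Content-Type": content_type}
    if path.startswith(NOINDEX_PREFIXES) or path.lower().endswith(NOINDEX_SUFFIXES):
        # Same meaning as <meta name="robots" content="noindex">,
        # but it also works for non-HTML resources such as PDFs.
        headers["X-Robots-Tag"] = "noindex"
    return headers


if __name__ == "__main__":
    print(response_headers("/docs/manual.pdf", "application/pdf"))
    print(response_headers("/index.html", "text/html"))
```

As with the meta tag, the header is only seen if the resource can be crawled, so these URLs should not also be disallowed in robots.txt.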

google said...

This comment has been removed by the author.

Greg said...

Hi, we have about 500 "Not Found" errors in Google Webmaster Tools. Almost all are due to bad incoming links. Do we need to find ways to eliminate these, or does it not matter? We were thinking about building pages to make the broken pages valid and doing 301s. Thanks.

MJK said...

Great information, Matt - But what should webmasters use when trying to sculpt PageRank on their sites? Ex. I don't want PR to be passed to my "Privacy Policy" page, which might be linked to in my global navigation, and don't necessarily want/need it indexed. I just want to push some of the PR that might be going to that page to my other, more relevant pages. What should I use?

Net Mom said...

Really good video. Matt presented the information very clearly and in a way so that even newbies could easily understand without feeling overwhelmed by terminology. Thanks!

Angelo said...

Using the .htaccess file to password protect a portion of your web site only applies to Apache-hosted web sites. For other web servers, you can achieve the same thing without a .htaccess file; just refer to the server's documentation.

Another useful way to restrict access to all or a portion of a web site is to only allow certain IP addresses. This is useful for beta web sites where you want to limit who can see the beta site.

Apache example (could use absolute path on server and replace Location with Directory):
<Location /path/>
    Order Deny,Allow
    Deny from all
    Allow from 192.168.2.123
    Allow from 192.168.2.125
</Location>

Lighty example:
$HTTP["remoteip"] !~ "192.168.2.123|192.168.2.125" {
    $HTTP["url"] =~ "^/path/" {
        url.access-deny = ( "" )
    }
}

--angelo
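
For readers whose site runs on an application server rather than Apache or Lighty, the same allowlist idea can be sketched in Python. This is a minimal illustration, assuming the two addresses and the /path/ prefix from the examples above; the function name is_allowed is made up, and a real deployment would also need to account for proxies that change the apparent client address.

```python
import ipaddress

# The two addresses from the Apache and Lighty examples above.
ALLOWED_IPS = {
    ipaddress.ip_address("192.168.2.123"),
    ipaddress.ip_address("192.168.2.125"),
}
PROTECTED_PREFIX = "/path/"


def is_allowed(remote_ip, url_path):
    """Return True if the request may proceed, False if it should be denied."""
    if not url_path.startswith(PROTECTED_PREFIX):
        return True  # only the protected area is restricted
    # Comparing parsed addresses avoids string-matching pitfalls.
    return ipaddress.ip_address(remote_ip) in ALLOWED_IPS


if __name__ == "__main__":
    print(is_allowed("192.168.2.123", "/path/beta.html"))
    print(is_allowed("10.0.0.7", "/path/beta.html"))
    print(is_allowed("10.0.0.7", "/index.html"))
```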

Zolodoco said...

Wth guys? Pulling Jon Swift's (http://jonswift.blogspot.com) content from searches for 'Liberal Fascism' was low class and smacks of first rate caving-in. This is yet another reason I have to question Google's integrity and continue using multiple search engines to get the least biased results.

AjiNIMC said...

Hey Matt,

Thanks for the video, but can you please explain a bit about removing https pages using the webmaster tool, as the webmaster URL removal console starts with http? Is there a help file I can read for it?

Thanks, looking for an answer; we've been discussing it at WMW for a few days without a proper answer.

Thanks,
AjiNIMC

spencer said...

Thanks for the terrific tips. When will we get a lesson on how to fix it so that *other people's* content doesn't appear in Google searches?

(And yes, this is a Jon Swift reference. That was very, very badly played, Google, and you should be ashamed of yourselves.)

Richard said...

I'm more concerned with Google itself removing content. If fair comment on a book like "Liberal Fascism" is removed, are there plans to remove all reviews of books from searches? Does this not take away a considerable part of the usefulness of book searches? Or is Google just overly sensitive to being pressured?

Артемий said...

Thank you, very interesting. It looks very smart, for me and my resource www.drevolife.ru

James said...

Nice and brief! Smiley faces are good too. ;)

LarryE said...

Ditto on the poor performance on the Jon Swift post. However, I notice it is now back in the rankings, turning up at #3.

Good. Glad to see you can undo missteps.

Ben said...

Who do I need to talk to at Google if I messed up my nameservers for 2 days and my domain dropped out of Google? My site was ranked #4 and now I can't find it at all. I am in big trouble!!! Anyone know what I need to do in order to fix my site's positioning?

Clicksharp Marketing said...

I'd love it if someone came up with an application to graphically display the effect of robots.txt -- you see the entire folder structure, and the files/folders excluded by robots.txt are just grayed out or boldfaced or something. Anyone need a project for the weekend?

Domenico said...

Hello everybody! :D

I've installed a rewrite mod for my phpBB board. Recently I updated it from the simple mod to the mixed mod. Everything works fine, but now my pages indexed on Google include duplicates that point to the same page, for example:

*mysite.com/forum/viewtopic.php?p=49&sid=d6edb4ec3eed0885874a3cc2b6ba316e (old version with sid, you can reach content)
*mysite.com/forum/viewtopic.php?p=49 (old version without sid, link works)
*mysite.com/forum/post49.html (old simple mod version, link works)
*mysite.com/forum/topic40.html#49 (current OK version)

Same thing for forum and subforum pages, e.g.:

*mysite.com/forum/viewforum.php?f=43&sid=38bea68463f83640d37f50ec51a31b0e (old version with sid, link works)
*mysite.com/forum/viewforum.php?f=43 (old version without sid, link works)
*mysite.com/forum/forum43.html
*mysite.com/forum/catXX.html (old simple mod version, now obviously gives 404)
*mysite.com/forum/name-of-forum-f43.html (current OK version)

Now I've eliminated sids from the board and created a robots.txt that also hides:

/forum/viewtopic.php
/forum/viewforum.php
/forum/post
/forum/cat
/forum/forum

I know that I may have to wait weeks (or more?) for a total recrawl of my site, but I'd like to remove the URLs and old cached pages from the Google index using the official URL removal tool.

Before doing this, I want to be sure that removing the .php URL (or one of the others) with this tool will not also remove, in the future, the newest .html URL that leads to the same page content!

What do you suggest? Is it better to wait for Google to update these URLs automatically over the next months? Or can I use this tool without any problem for my .html URLs?

Thanks!
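
Before submitting removals, it can help to check which URLs the prefixes listed above actually cover. The Python sketch below uses the standard library's robots.txt parser for that. It assumes each listed path is written as a Disallow line under "User-agent: *", which the comment does not show explicitly; the helper name blocked and the sample URL /forum/cats-and-dogs-f12.html are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed form of the robots.txt described above (paths are from the comment).
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/viewtopic.php
Disallow: /forum/viewforum.php
Disallow: /forum/post
Disallow: /forum/cat
Disallow: /forum/forum
"""


def blocked(path, agent="Googlebot"):
    """Return True if the rules above disallow crawling this path."""
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())
    return not rules.can_fetch(agent, path)


if __name__ == "__main__":
    for path in (
        "/forum/viewtopic.php?p=49",
        "/forum/post49.html",
        "/forum/forum43.html",
        "/forum/topic40.html",
        "/forum/name-of-forum-f43.html",
        "/forum/cats-and-dogs-f12.html",  # hypothetical new-style URL
    ):
        print("blocked" if blocked(path) else "allowed", path)
```

Because these rules match by prefix, the old URL forms are blocked while /forum/topic40.html and /forum/name-of-forum-f43.html stay crawlable. The same prefix matching means a new-style URL whose forum name happens to begin with "cat", "post", or "forum" would be blocked as well, so the rules are worth testing against real forum names.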

Susan Moskwa said...

Hi, AjiNIMC:

Regarding your https question, we consider http://www.example.com and https://www.example.com to be two different sites. Once you've added and verified the https version of your site in your Webmaster Tools account, you should be able to submit URL removal requests for that site.

AjiNIMC said...

Thanks, Susan, for the reply. I have added that, but how do I find the total number of pages listed with https in Google? I tried site:idealwebtools.com:443 but it did not work.

Also, if a page does not exist in Google's listing and we submit that page for removal from the webmaster tool, does it show a denied status? I see that the pages I am adding to the webmaster console are getting denied. I am assuming that is the reason.

I need a way to know all the pages that are listed with https. It will be helpful if you can show me a way.

Thanks,
AjiNIMC

Susan Moskwa said...

I think this discussion needs to be moved into the Help Group (too detailed for the blog!). I've followed up in this thread.

Fevzi Çakır said...

Great information, but I wonder whether returning a 404 error header will be enough for Google to remove a page from its index. When I look through my indexed pages, I see pages crawled a year ago, and unfortunately these belong to the previous owner of the domain.

Tugacari said...

Hi all, I had a video sitemap with the option "Yes" on "allow_embed". I decided to change that to "No", so users would only see the video directly on my site.
After a couple of days, the videos can still be played on Google's video website.
Is it just a question of time before the videos no longer play there?
Should I completely remove the "video:player_loc" option?
Any thoughts or experience with this?

Regards

Pedro Lobito

Europe Trip 2005 said...

I am having the exact same problem with video sitemaps. I had indexed a video months back, since removed it and removed it entirely from my video sitemap. now there are 3 versions of the video still in Google video and they are broken. How do I remove a video submitted via sitemap?

reenata said...

Thanks for the great explanations!

colnector said...

Very useful information nicely presented. Thanks.

George said...

Dear Friend,
My hosting account has been locked and I have access to nothing: no robots.txt and no meta tags. The host provider is returning a 400 HTTP error for my pages and they say they can't change it. I want to move to a different host. Will Google remove my content, given the 400 HTTP error, if I ask Google to remove my site in its Webmaster Tools?
What can I do to remove my content from Google? Thanks.
Best regards

Roe said...

Okay, it was an informative video, but I haven't a clue what to do overall. I finally went to Yahoo, bought a domain and hosting, and am on Wordpress.com but with a CC license of 3.0; now I've bought another domain with another host, Bluehost.
This is under construction as Constitunionalert, so I suppose I will use a meta tag of noindex and hope I don't forget to remove it.
It's a shame: I asked for help from Google for 5 months and nobody helped, not once. Thanks for the pointers. I so wish Google cared, because I never would have left, but you all are a part of each other, so I suppose it's no worries for Google and the rest of the Corporacracy's.
Peace
Rhoda VFP

Roe said...

Sorry I have been awake all night, it should say "Constitutionalert", please forgive me, very sleepy.

Roe said...

Some person with an inane username sent me an email in some Asian language. I do not ever want a person on this entire site to send me a personal email again, or I'll contact EFF.org, ACLU.org, ccrjustice.org and send international email blasts to every pro-peace grassroots group around. How in the world is this allowed?
thank you
Mrs. Rhoda Ozen

Costa Maya said...

Dear Roe, that person with an inane username didn't send YOU an email in some Asian language, but published his spam here, so an email arrived for all of us. In a world of spam, your surprise and outrage are harder to believe and understand than this spam message... where do you live?

la nena said...

Hi,
Sounds good. I would use the removal tool since there is a lot of trash in my cached pages; however, I really don't see how to access this tool. I have authenticated my site but don't see anything like removal in my webmaster section.

Jonathan Simon said...

@la nena See the "Remove URLs" page under the Tools section within Webmaster Tools.

Vovka said...

Please tell me, how can I keep only part of a page from being crawled?
With a < noindex >...< /noindex > tag, or something else?

Crypto said...

On my site:
http://www.successful-candle-making.com
I incorrectly used the noindex nofollow tags in my html code and it looked like google didn't index the site because of this which i have read is correct. However, I want my site to be followed and indexed so, I have corrected the no index and nofollow to index and follow, but it still seems to show authentication problems in webmaster tools. Will this likely be corrected in future searches or will i have to resubmit and wait three weeks or so for the site to begin to be indexed?

Also, there was an old .htaccess file hanging around in the public_html folder, and Google originally crawled the site while it was there blocking it. Could this affect anything?
Thanks in advance.

Maile Ohye said...

Hi everyone,

Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Forum.

Thanks and take care,
The Webmaster Central Team