Wednesday, January 09, 2008 at 11:39 AM
Confused about the best uses of robots.txt, nofollow, URL removal tool? Wondering how to keep some of your pages off the web? Our webspam lead, Matt Cutts, talks about the best ways to stop Google from crawling your content, and how to remove content from the Google index once we've crawled it.
We love your feedback. Tell us what you think about this video in our Webmaster Help Group.
* Note from Matt: Yes, robots.txt has been around since at least 1996, not 2006. It's hard for me to talk for 12-13 minutes without any miscues. :)


41 comments:
Good discussion. I'm glad that Matt covers the fact that a noindex meta tag on a page that robots.txt blocks from crawling never gets read. That seems to elude some people.
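The point about a blocked page's noindex tag can be sketched in a few lines of Python. This is only an illustration, assuming a polite crawler that checks robots.txt before fetching anything; the robots.txt rules, the page HTML, and the helper name `visible_robots_meta` are all hypothetical and say nothing about how Googlebot is actually built.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page; neither comes from the video.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
PAGE_HTML = '<html><head><meta name="robots" content="noindex"></head></html>'


class RobotsMetaParser(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append(attrs.get("content") or "")


def visible_robots_meta(url, robots_txt=ROBOTS_TXT, html=PAGE_HTML):
    """Return the robots meta directives a polite crawler would see,
    or None if robots.txt stops it from fetching the page at all."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch("*", url):
        return None  # page is never fetched, so its noindex tag is never read
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

Calling `visible_robots_meta` with a URL under `/private/` returns `None`, while the same page under any other path yields its `noindex` directive.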
Thanks for the great tips Matt and by the way, I really like the happy faces!
Nice video presentation..Very interesting and informative...
Thanks for the great tips Matt.
By the way, I have my Web Hosting Provider for my websites. It looks good and very helpful. I hope Google will index my sites.
"We love your feedback. Tell us what you think about this video in our Webmaster Help Group."
Obviously, from that remark, Google is trying to push the Webmaster Help Forum.
But think of how rude that is, to use someone's video, then tell readers to visit another site to comment on it. :-o
Why would they? Wouldn't they want to visit the same site of the owner of that video to give and get direct feedback?!
The Webmaster Help Group just does not have the charisma to be alluring.
Let's not divide and conquer - let's build and invest in what is working the best and the resource the public has chosen as their favorite!!
After all this time, if they had found the help group more alluring - they would be visiting it.
It's a real pain that the search engines treat the meta noindex tag differently. This would be a good area for Google to push for harmonization with the other major search engines.
Actually, I was thinking that a 'noindex' option in robots.txt would be the ideal complement - you could order spiders not to crawl the site, and also order search engines not to index the content, all in one location, e.g. something like:
User-agent: *
Disallow: /sekrit/
Noindex: /sekrit/
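For comparison, here is a rough Python sketch of how a standard robots.txt parser treats that proposal today. It uses the standard library's `urllib.robotparser`, which only understands the established directives, so the proposed `Noindex` line is silently ignored. The file contents are the proposal above, written with the colons that robots.txt syntax requires; the helper name `blocked` and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The commenter's proposal. "Noindex" is a hypothetical directive here.
PROPOSED = """\
User-agent: *
Disallow: /sekrit/
Noindex: /sekrit/
"""

# The same file without the proposed line.
WITHOUT_NOINDEX = """\
User-agent: *
Disallow: /sekrit/
"""


def blocked(robots_txt, path):
    """True if a generic crawler ("*") may not fetch the given path."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch("*", path)
```

Both files give the same answers for crawling, which is the gap the proposal is pointing at: any indexing behavior for `Noindex` would have to be added by the search engines themselves.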
Is YouTube having problems? Seems like every embedded video I try to play says "We're sorry, this video is no longer available".
(Cross posted in the Forum and here).
You didn't mention sitemap.xml and how that affects a page in the Google Index. If we don't list a page in our sitemap, will it still be included in the Google Index? It's very possible we'll have links to a page not in our sitemap.
I am splitting my web site into two sites. I've moved the content, now I'm using .htaccess to rewrite the URL from PlanetMike.com to the new URL at MichaelClark.name. I'm giving out server code 301, and updating my sitemap.xml. What will happen when the Googlebot next sees one of my pages that has been moved? I assume the 301 will tell Googlebot that the page isn't at PlanetMike.com any more, effectively removing that page from Google. And at the same time the 301 is telling GoogleBot to add the new page to its records for MichaelClark.name. I've been watching both domains pretty closely in the Webmaster Tools area, and it looks like everything is working smoothly.
Thanks, Mike
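The redirect setup described in this comment could look roughly like the following in Python. This is a sketch rather than the commenter's actual .htaccess rules: it assumes every moved page keeps the same path and query string on the new host, and the function name `permanent_redirect` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "planetmike.com"
NEW_HOST = "michaelclark.name"


def permanent_redirect(url):
    """Return (status, location) for a requested URL.

    Assumption: moved pages keep their path and query string on the new host.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in (OLD_HOST, "www." + OLD_HOST):
        # 301 tells crawlers the move is permanent and where the page now lives.
        target = urlunsplit((parts.scheme, NEW_HOST, parts.path, parts.query, ""))
        return 301, target
    return 200, url
```

For a site that is only partly moving, the host check would need an extra test on the path so that pages staying on the old domain are not redirected.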
@Matt
Thanks for the great explanations!
As for password protected contents, are you sure that you don't index those based on 3rd party signals like ODP listings or strong inbound links?
You totally forgot to mention the neat X-Robots-Tag that allows outputting REP tags like "noindex" even for non-HTML resources like PDFs or videos in the HTTP header. That's an invention Google can be very proud of. :)
@Ian M
Actually, Google experiments with Noindex: in robots.txt, but that's "improvable".
@Google
Currently Google interprets Noindex: in robots.txt as (Disallow: + Noindex:). I think that's completely wrong, because:
1. It's not compliant to the Robots Exclusion Standard.
2. It confuses Webmasters because "noindex" in robots.txt means something completely different than "noindex" in meta tags or HTTP headers.
3. Mixing crawler directives and indexer directives this way is a plain weak point that will produce misunderstandings resulting in traffic losses for Webmasters and less compelling contents available to searchers. All indexer directives (noindex,nofollow,noarchive,noodp, unavailable_after etc.) do require crawling when put elsewhere. I do Webmaster support for ages and I assure you that Webmasters will not get it. If nobody understands it and adapts it, it's as useless as Yahoo's robots-nocontent class name that only 500 sites on the whole Web make use of.
4. The REP's "noindex" tag has an implicit "follow" that Google ignores in robots.txt for technical reasons (it's impossible to follow links from uncrawled pages). When I put a robots meta tag with a "noindex" value, then Google rightly follows my links, passes PageRank and anchor text to those, and just doesn't list the URL on the SERPs. When I do the same in robots.txt Google behaves totally different, for no apparent reason. (Of course there's a reason but I want to keep this statement simple.)
Having said all that, I appreciate it very much that Google works on robots.txt evolvements. Kudos to Google! However, please don't assign semantics of crawler directives to established indexer directives, that doesn't work out. I see the PageRank problem, and I think I know a better procedure to solve that. If you're interested, please read my "RFC" linked above. ;)
@all
Do not make use of experimental robots.txt directives unless you really know what you do, and that includes monitoring Google's experiment very closely. If you've the programming skills, then better make use of X-Robots-Tags to steer indexing respectively deindexing of your resources on site level. X-Robots-Tags work with HTML contents as well as with all other content types.
Thanks for your time and have a nice day!
Sebastian
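As a rough illustration of the X-Robots-Tag approach, here is a small Python WSGI wrapper that adds the header to responses for chosen file types. It is a sketch under assumptions: the wrapper name `add_x_robots_tag`, the `.pdf` suffix default, and `demo_app` are all hypothetical, and a real site might set the header in the web server configuration instead.

```python
def add_x_robots_tag(app, suffixes=(".pdf",), value="noindex"):
    """Wrap a WSGI app so that matching paths get an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def tagged_start_response(status, headers, exc_info=None):
            # Only tag the resource types we want kept out of the index.
            if path.lower().endswith(tuple(suffixes)):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, tagged_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for a real application."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"..."]
```

Wrapping an application with `add_x_robots_tag(demo_app)` leaves HTML responses untouched and tags only the PDF responses. The crawler still has to be allowed to fetch the resource, otherwise it never sees the header.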
Hi, we have about 500 "Not Found" errors in Google Webmaster tools. Almost all are due to bad incoming links. Do we need to find ways to eliminate these or does it matter? We were thinking about building pages to make the broken pages valid and doing 301s. Thanks.
Great information, Matt - But what should webmasters use when trying to sculpt PageRank on their sites? Ex. I don't want PR to be passed to my "Privacy Policy" page, which might be linked to in my global navigation, and don't necessarily want/need it indexed. I just want to push some of the PR that might be going to that page to my other, more relevant pages. What should I use?
Really good video. Matt presented the information very clearly and in a way so that even newbies could easily understand without feeling overwhelmed by terminology. Thanks!
Using the .htaccess file to password protect a portion of your web site only applies to Apache hosted web sites. For other web servers, you can achieve the same thing without a .htaccess file; just refer to the server's documentation.
Another useful way to restrict access to all or portion of a web site is to only allow certain IP addresses. This is useful for beta web sites where you want to limit who can see the beta site.
Apache example (could use absolute path on server and replace Location with Directory):
<Location /path/>
Order Deny,Allow
Deny from all
Allow from 192.168.2.123
Allow from 192.168.2.125
</Location>
Lighty example:
$HTTP["remoteip"] !~ "192.168.2.123|192.168.2.125" {
$HTTP["url"] =~ "^/path/" {
url.access-deny = ( "" )
}
}
--angelo
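The same allowlist idea can be sketched at the application level in Python. This is a minimal illustration, not a replacement for the server configuration above: the function name `is_allowed` is made up, and the addresses and `/path/` prefix are simply reused from the examples. It compares parsed addresses exactly instead of matching them with a regular expression.

```python
import ipaddress

# The two addresses from the Apache and Lighty examples.
ALLOWED = {
    ipaddress.ip_address("192.168.2.123"),
    ipaddress.ip_address("192.168.2.125"),
}
PROTECTED_PREFIX = "/path/"


def is_allowed(remote_ip, url_path):
    """Deny requests under PROTECTED_PREFIX unless the client IP is allowlisted."""
    if not url_path.startswith(PROTECTED_PREFIX):
        return True  # everything outside the protected area stays public
    return ipaddress.ip_address(remote_ip) in ALLOWED
```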
Wth guys? Pulling Jon Swift's (http://jonswift.blogspot.com) content from searches for 'Liberal Fascism' was low class and smacks of first rate caving-in. This is yet another reason I have to question Google's integrity and continue using multiple search engines to get the least biased results.
Hey Matt,
Thanks for the video, but can you please explain a bit about removing https pages using the webmaster tool, as the webmaster URL removal console starts with http? Is there a help file I can read for it?
Thanks, looking for an answer, we been discussing it at WMW for few days without a proper answer.
Thanks,
AjiNIMC
Thanks for the terrific tips. When will we get a lesson on how to fix it so that *other people's* content doesn't appear in Google searches?
(And yes, this is a Jon Swift reference. That was very, very badly played, Google, and you should be ashamed of yourselves.)
I'm more concerned with Google itself removing content. If fair comment on a book like "Liberal Fascism" is removed, are there plans to remove all reviews of books from searches? Does this not take away a considerable part of the usefulness of book searches? Or is Google just overly sensitive to being pressured?
Saink u, wery intresting, it's look wery smart, for me and my resurs www.drevolife.ru
Nice and brief! Smiley faces are good too. ;)
Ditto on the poor performance on the Jon Swift post. However, I notice it is now back in the rankings, turning up at #3.
Good. Glad to see you can undo missteps.
Who do I need to talk to at Google if I messed up my nameservers for 2 days and my domain dropped out of Google? My site was ranked #4 and now I can't find it at all. I am in big trouble!!! Anyone know what I need to do in order to fix my site's positioning?
I'd love it if someone came up with an application to graphically display the effect of robots.txt -- you see the entire folder structure, and the files/folders excluded by robots.txt are just grayed out or boldfaced or something. Anyone need a project for the weekend?
hallo everybody! :D
i' ve installed rewrite mod for my phpbb board. Recently i've updated it from simple mod to mixed mod. Everything works fine but now my indexed pages on google presents duplicates that are linked to the same page, example:
*mysite.com/forum/viewtopic.php?p=49&sid=d6edb4ec3eed0885874a3cc2b6ba316e (old version with sid, you can reach content)
*mysite.com/forum/viewtopic.php?p=49 (old version without sid, link works)
*mysite.com/forum/post49.html (old simple mod version, link works)
*mysite.com/forum/topic40.html#49 (actual ok version)
same thing for forums and subforums pages, ex.:
*mysite.com/forum/viewforum.php?f=43&sid=38bea68463f83640d37f50ec51a31b0e (old version with sid, link works)
*mysite.com/forum/viewforum.php?f=43 (old version without sid, link works)
*mysite.com/forum/forum43.html
*mysite.com/forum/catXX.html (old simple mod version, now obviously give 404)
*mysite.com/forum/name-of-forum-f43.html (actual ok version)
now i've eliminate sids from board and create a robots.txt that hide also
/forum/viewtopic.php
/forum/viewforum.php
/forum/post
/forum/cat
/forum/forum
i know that i can wait weeks (or more?) for a total recrawl of my site but i'd like to eliminate urls and old cached pages from google index using the 'official url eliminate tool'.
Before doing this, i want to be sure that eliminating the .php url (or one of the others), with this tool, i will not delete for the future also the newest .html url that link to the same page content!
What do you suggest? is better wait that google automatically update these urls in next months? or can i use this tool without any problem for my .html urls?
thanks!
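One way to check rules like these before relying on them is to run the URLs through a robots.txt parser. The Python sketch below assumes the five paths listed in the comment are `Disallow` lines under `User-agent: *` (the comment does not show the actual file), and the helper name `is_blocked` is made up. It only shows which URLs the rules stop a crawler from fetching; it says nothing about how the URL removal tool treats them.

```python
from urllib.robotparser import RobotFileParser

# Assumed form of the commenter's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/viewtopic.php
Disallow: /forum/viewforum.php
Disallow: /forum/post
Disallow: /forum/cat
Disallow: /forum/forum
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def is_blocked(path):
    """True if a generic crawler ("*") may not fetch the path."""
    return not rules.can_fetch("*", path)


# Old-style URLs from the comment that should no longer be crawled.
OLD_URLS = [
    "/forum/viewtopic.php?p=49",
    "/forum/post49.html",
    "/forum/viewforum.php?f=43",
    "/forum/forum43.html",
]
# Current URLs that must stay crawlable.
NEW_URLS = [
    "/forum/topic40.html",
    "/forum/name-of-forum-f43.html",
]

for url in OLD_URLS + NEW_URLS:
    print(url, "blocked" if is_blocked(url) else "allowed")
```

Because the rules match by prefix, the new `.html` URLs are unaffected as long as none of them starts with one of the blocked prefixes.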
Hi, AjiNIMC:
Regarding your https question, we consider http://www.example.com and https://www.example.com to be two different sites. Once you've added and verified the https version of your site in your Webmaster Tools account, you should be able to submit URL removal requests for that site.
Thanks Susan for the reply. I have added that but how do I find total number of pages listed with https in google. I tried site:idealwebtools.com:443 but it did not work.
Also if a page is not existing in google listing and we add that page for removal from webmaster tool, does it show denied status? I see that pages that I am adding to webmaster console is getting denied. I am assuming that to be the reason.
I need a way to know all the pages that are listed with https. It will be helpful if you can show me a way.
Thanks,
AjiNIMC
I think this discussion needs to be moved into the Help Group (too detailed for the blog!). I've followed up in this thread.
just great information what is given but I wonder if giving out a 404 error header will be enough for google to remove it from its indexes because when I look through my indexes I see pages crawled one year ago and unfortunately these indexes belong to the previous owner of the domain.
Hi all, I had a video sitemap with option "Yes" on "allow_embed" . I decided to change that to "No", so users would only see the video directly on my site.
After a couple of days videos still can be played on google's video website.
Is it just a question of time before the videos no longer play there?
Should I remove completely the option "video:player_loc"?.
Any thoughts or experience about this ?
Regards
Pedro Lobito
I am having the exact same problem with video sitemaps. I had indexed a video months back, since removed it and removed it entirely from my video sitemap. now there are 3 versions of the video still in Google video and they are broken. How do I remove a video submitted via sitemap?
Thanks for the great explanations!
Very useful information nicely presented. Thanks.
Dear Friend,
My hosting account has been locked and I have access to nothing: no robots.txt and no meta tags. The host provider is returning a 400 HTTP error for my pages and they say they can't change it. I want to move to a different host. Does Google remove my content with a 400 HTTP error if I ask Google to remove my site in its webmaster tools?
What can i do to remove my content form google? thanks
best regard
okay, it was an informative video but i haven't a clue of what to do in total. i finally went to Yahoo bought a domain and host and am on Wordpress.com but with a CC license of 3.0, now bought another domain, another host at Bluehost
This is under Construction as Constitunionalert so I suppose i will use a metatag of no index and hope i don't forget to remove it.
it's a shame i asked for help from Google for 5 months and nobody helped, not once. thanks for the pointers i so wish Google cared because i never would have left but you all are a part of each other so i suppose its no worries for Google and the rest of the Corporacracy's.
Peace
Rhoda VFP
Sorry I have been awake all night, it should say "Constitutionalert", please forgive me, very sleepy.
some person with an inane username sent me an email in some Asian language, i do not ever want a person on this entire site to send a personal email again, or i'll contact EFF.org, ACLU.org, ccrjustice.org and international email blasts to every pro-peace grassroots group around, how in the world is this allowed?
thank you
Mrs. Rhoda Ozen
dear Roe, that person with an inane username didn't send YOU an email in some Asian language, but published his spam here, so an email arrived for all of us.. in a spam's world it is more difficult to believe and understand your surprise and outrage than this spam message...where do you live?
Hi,
sounds good, I would use the removal tool since there is a lot of trash in my cached pages, however I really don't see how to access this tool. I have authenticated my site but don't see anything like removal in my webmaster section.
@la nena See the "Remove URLs" page under the Tools section within Webmaster Tools.
Please, tell me why dont crawling only part of page?
tag < noindex >...< /noindex >, or something else?
On my site:
http://www.successful-candle-making.com
I incorrectly used the noindex nofollow tags in my html code and it looked like google didn't index the site because of this which i have read is correct. However, I want my site to be followed and indexed so, I have corrected the no index and nofollow to index and follow, but it still seems to show authentication problems in webmaster tools. Will this likely be corrected in future searches or will i have to resubmit and wait three weeks or so for the site to begin to be indexed?
Also there was an old .htaccess file hanging around in the public_html folder, and Google originally crawled the site while it was there blocking it. Could this affect anything?
Thanks in advance.
Hi everyone,
Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Forum.
Thanks and take care,
The Webmaster Central Team