Tuesday, August 05, 2008 at 1:27 PM
When Googlebot crawls the web, it often finds what we call an "infinite space". These are very large numbers of links that usually provide little or no new content for Googlebot to index. If this happens on your site, crawling those URLs may use unnecessary bandwidth, and could result in Googlebot failing to completely index the real content on your site.
Recently, we started notifying site owners when we discover this problem on their web sites. Like most messages we send, you'll find them in Webmaster Tools in the Message Center. You'll probably want to know right away if Googlebot has this problem - or other problems - crawling your sites. So verify your site with Webmaster Tools, and check the Message Center every now and then.
Examples of an infinite space
The classic example of an "infinite space" is a calendar with a "Next Month" link. It may be possible to keep following those "Next Month" links forever! Of course, that's not what you want Googlebot to do. Googlebot is smart enough to figure out some of those on its own, but there are a lot of ways to create an infinite space and we may not detect all of them.
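To make the calendar case concrete, here is a minimal sketch in Python. The `/calendar/YYYY/MM` URL pattern and the function names are assumptions made up for illustration, not something any particular site or Googlebot uses. It shows that a "Next Month" link always leads to a URL that has not been seen before, so a crawler that simply follows it never runs out of pages.

```python
from itertools import islice


def next_month(year, month):
    """Return the (year, month) pair that a "Next Month" link points to."""
    return (year + 1, 1) if month == 12 else (year, month + 1)


def calendar_urls(year, month):
    """Yield calendar page URLs forever, following "Next Month" each time.

    The URL pattern is hypothetical; the point is only that the
    generator never terminates on its own.
    """
    while True:
        yield f"/calendar/{year:04d}/{month:02d}"
        year, month = next_month(year, month)


# A crawler has to impose its own limit; here we stop after six pages.
first_urls = list(islice(calendar_urls(2008, 8), 6))
print(first_urls)
```

Without the `islice` cap, the loop would keep producing new, distinct URLs indefinitely, which is exactly what makes the space "infinite".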
Another common scenario is websites which provide for filtering a set of search results in many ways. A shopping site might allow for finding clothing items by filtering on category, price, color, brand, style, etc. The number of possible combinations of filters can grow exponentially. This can produce thousands of URLs, all finding some subset of the items sold. This may be convenient for your users, but is not so helpful for the Googlebot, which just wants to find everything - once!
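As a rough illustration of how quickly filter combinations add up, the sketch below counts the distinct filtered listing URLs a site could expose. The facet names come from the example above, but the number of options per facet is invented for illustration. It assumes each filter is either unset or set to exactly one value, and that parameter order in the URL is fixed; sites that allow multiple values per filter or any parameter order would produce even more URLs.

```python
from math import prod


def count_filter_urls(options_per_facet):
    """Count distinct filtered URLs when each facet is unset or has one value.

    Each facet contributes (number of options + 1) choices, the extra one
    being "filter not applied", and the choices multiply across facets.
    """
    return prod(n + 1 for n in options_per_facet.values())


# Hypothetical option counts for a clothing shop.
facets = {"category": 10, "price": 5, "color": 12, "brand": 30, "style": 8}
total = count_filter_urls(facets)
print(total)
```

With these made-up numbers, five filters already yield a few hundred thousand URLs, each showing some subset of the same items.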
Correcting infinite space issues
Our Webmaster Tools Help article describes more ways infinite spaces can arise, and provides recommendations on how to avoid the problem. One fix is to eliminate whole categories of dynamically generated links using your robots.txt file. The Help Center has lots of information on how to use robots.txt. If you do that, don't forget to verify that Googlebot can find all your content some other way. Another option is to block those problematic links with a "nofollow" link attribute. If you'd like more information on "nofollow" links, check out the Webmaster Help Center.
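As a sketch of the robots.txt approach, the example below blocks two hypothetical sections of a site (`/calendar/` and `/search`, both made up for illustration) and uses Python's standard `urllib.robotparser` module to check which URLs would still be allowed. This only shows how prefix-based `Disallow` rules behave under that parser; it is not a statement of how Googlebot itself evaluates robots.txt, and `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the dynamically generated calendar and
# filtered-search pages, and leave everything else crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="Googlebot"):
    """Return True if the rules above let the given user agent fetch url."""
    return parser.can_fetch(user_agent, url)


for url in (
    "https://example.com/calendar/2008/09",
    "https://example.com/search?color=red&brand=acme",
    "https://example.com/products/red-shirt",
):
    print(url, "allowed" if is_allowed(url) else "blocked")
```

After blocking a section this way, it is worth confirming that the real content (here, the product pages) is still reachable through links that are not blocked.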




14 comments:
Where is the elusive message center link? I can't find it within webmaster tools.
When you sign in to Webmaster Tools "Dashboard" it's down the right hand side... roughly below the bit saying my account....
Excellent addition to webmaster tools - I have actually run into this a couple times, but ran my own bot on the site to see how the pages were getting crawled. When I found the infinite crawling of these dynamically generated pages with content of no value, I added the directory to the robots.txt and pulled it from the index through the webmaster tools.
Very good info. We changed over to a dynamic system that may be causing this issue. We want quality links too and not infinity links in the index. I did not even think of the robots.txt file.
Should have :)
thanks
I need help! I can't verify my site because it is built on Vignette (don't ask :-( )... How can I actually verify it with Vignette when there are only portlets pulling together the homepage? I actually don't have a real home page...
Thanks for your help!
I may agree with the endless calendar issue. But I certainly disagree that the endless combination of items is an issue webmasters should solve.
A combination may be the relevant information a user may search for.
For example if a user searches for "red male t-shirts" we can assume that the user wants to find every red t-shirt for men that is available (and that fit him or the person he has in mind etc.). As I understand, you now recommend that the user only finds one product for each domain - since Google groups results and only shows one/two of the same domain - but this way the user might miss all the other red male t-shirts a shop might offer (if he doesn't think of using the "more from this domain" link). And to miss that information is not the intention of the user. And furthermore to skip that information is to lose information.
So if a website already offers the ability to specify groups, why should you want to lose this specific information in favor of easier work for your bots?
I think that would be the wrong end to work on the problem of collecting a huge index.
I really need to start using the Google Webmaster Tools. Just kind of hard to set it up with a shopping cart.
Webmaster tools does show this message for my site. But...
Sorry to say that, but according to the URLs attached to the message, Googlebot's behavior is quite stupid.
* Most of the URLs listed have "noindex,follow" meta tag
* There is even a rel="nofollow" attribute on internal links to these URLs (maybe not on all links, but I cannot affect external linking anyway)
* My site does provide a sitemap, which does not list these URLs
* There are even some URLs that are blocked by robots.txt! The same type of URL is listed in "URLs blocked by robots.txt"
Please tell me what else I can do to prevent Googlebot from crawling unnecessary pages.
robots.txt is a very bad option in my opinion - that standard is very old and cannot simply be used on dynamic URLs as it does not support regexps.
Reading your post I have the feeling that you want webmasters to create websites for robots, not for humans...
Great post-- this is very valuable feedback for site owners. Thanks!
I am seeing this message in Webmaster tools for my site. Only thing is... the pages it lists as examples are all unique content.
Has anyone else received this message in (I think) error?
Seemingly redundant pages may be a problem for folks who search from outside the site, but it makes google search an unusable tool on sites like one for which I'm webmaster.
I can't use it because we have one hard page per product and although, in many cases the pages will look similar (with the greatest change being the image of the product), google doesn't index most of those pages. (Our site map has 170 pages, but only
We seem to have been running into the same issue as Jaimiec in the last 2 weeks. Our site is also getting this message, but for URLs that are also unique. Any comment on what happens if a site legitimately has lots and lots of unique URLs of public content that may seem similar but is actually unique? Can we get these messages in error, and will Google eventually realize they are legit, or do I need to initiate something on my end?
Hello, I recently had a virus issue. The websites affected are okfmcservices.com and shortsalefundamentals.com. I verified the sites, but I still get an error that comes back from Google. Also, I have cleaned both sites but Google still shows them as having a virus. Can you please help? I'm really having customer challenges due to this. Thank you.
Yeah, I remembered this one, including the nice title "To infinity and beyond" :-)
See here:
Websites penalized due to developper mistake
-luzie-
Hi everyone,
Since some time has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Forum.
Thanks and take care,
The Webmaster Central Team