Sunday, August 09, 2009 at 10:40 PM
Webmaster Level: Intermediate to Advanced
Many questions about website architecture, crawling and indexing, and even ranking issues can be boiled down to one central issue: How easy is it for search engines to crawl your site? We've spoken on this topic at a number of recent events, and below you'll find our presentation and some key takeaways on this topic.
The Internet is a big place; new content is being created all the time. Google has a finite number of resources, so when faced with the nearly-infinite quantity of content that's available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we've crawled, we're only able to index a portion.
URLs are like the bridges between your website and a search engine's crawler: crawlers need to be able to find and cross those bridges (i.e., find and crawl your URLs) in order to get to your site's content. If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organized and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs.
In the slides above you can see some examples of what not to do—real-life examples (though names have been changed to protect the innocent) of homegrown URL hacks and encodings, parameters masquerading as part of the URL path, infinite crawl spaces, and more. You'll also find some recommendations for straightening out that labyrinth of URLs and helping crawlers find more of your content faster, including:
- Remove user-specific details from URLs.
URL parameters that don't change the content of the page (like session IDs or sort order) can be removed from the URL and put into a cookie. By putting this information in a cookie and 301 redirecting to a "clean" URL, you retain the information and reduce the number of URLs pointing to that same content.
- Rein in infinite spaces.
Do you have a calendar that links to an infinite number of past or future dates (each with its own unique URL)? Do you have paginated data that returns a status code of 200 when you add &page=3563 to the URL, even if there aren't that many pages of data? If so, you have an infinite crawl space on your website, and crawlers could be wasting their (and your!) bandwidth trying to crawl it all. Consider these tips for reining in infinite spaces.
- Disallow actions Googlebot can't perform.
Using your robots.txt file, you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole functionality is something that a crawler can't perform. (Crawlers are notoriously cheap and shy, so they don't usually "Add to cart" or "Contact us.") This lets crawlers spend more of their time crawling content that they can actually do something with.
- One man, one vote. One URL, one set of content.
In an ideal world, there's a one-to-one pairing between URL and content: each URL leads to a unique piece of content, and each piece of content can only be accessed via one URL. The closer you can get to this ideal, the more streamlined your site will be for crawling and indexing. If your CMS or current site setup makes this difficult, you can use the rel="canonical" link element to indicate the preferred URL for a particular piece of content.
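As a rough illustration, the first three recommendations might be sketched in Python along these lines. This is only a sketch under stated assumptions, not how any particular site or crawler implements them: the parameter names (sessionid, sort), the disallowed paths (/login, /cart, /contact), the example.com URLs, and all function names are hypothetical, and storing the stripped values in a cookie is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Assumed parameter names; substitute whatever your site actually uses.
NON_CONTENT_PARAMS = {"sessionid", "sort"}


def clean_url(url):
    """Drop query parameters that do not change the page content."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if cleaning changes nothing."""
    cleaned = clean_url(url)
    return cleaned if cleaned != url else None


def page_status(page, total_pages):
    """Answer 404 instead of 200 for pages outside the real range of data."""
    return 200 if 1 <= page <= total_pages else 404


# Assumed paths for action-only pages; adjust to your own site layout.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /cart
Disallow: /contact
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="Googlebot"):
    """Check a URL against the robots.txt rules above."""
    return robots.can_fetch(user_agent, url)
```

With these assumptions, clean_url("http://www.example.com/shoes?sessionid=abc123&color=red&sort=price") returns "http://www.example.com/shoes?color=red", page_status(3563, 12) returns 404, and may_crawl("http://www.example.com/cart") returns False. Note that robots.txt rules match by path prefix, so a rule like Disallow: /cart would also cover a path such as /cartoons.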
If you have further questions about optimizing your site for crawling and indexing, check out some of our previous writing on the subject, or stop by our Help Forum.


25 comments:
Finally I get a clear overview of Google crawling and indexing. Thanks for the info.
I would like to know how the Googlebot crawls comments, and how it weighs the links to other websites through the comment section. I work for the payday lender http://www.savvysponds.co.uk
Thanks for the great slide show on crawling and indexing
Thanks
Sankar
Does "Remove user-specific details from URLs" also apply to the campaign code (&utm_campaign=) added to the end of URLs for Google Analytics?
What is the best way to "Disallow actions Googlebot can't perform"? Should I use the meta noindex, block it with robots.txt, or nofollow all the links on my website pointing to these pages?
Quick question...
Does URL length play a large role in rankings? Say, www.x.com/y/z/99 vs. www.x.com/99: will the second rank higher than if I used the first URL structure now?
thanks
all done !!
If we block paginated listings from Google in robots.txt, won't that cause important detail pages not to get indexed?
For example, if I had 200 pages of hotel listings, how can I be certain Google will find all Hotel Detail pages if I only let Google crawl and index the first page?
Ignore my comment (or delete it), I skimmed too fast. Thanks for the post!
"nearly-infinite quantity"
Nowhere even near. It is simply an absurd comment that makes no sense. Had to point it out: Google or any other man-made technology will never be close to near-infinite. How do you call something near-infinite?
@G-Force: When I said "Disallow" I was referring to the Disallow: command in a robots.txt file.
(Fine print: robots.txt controls crawling, meta noindex controls indexing. If your goal is to not have crawlers waste bandwidth crawling those URLs, you'll want to use robots.txt; but if your goal is to be 100% sure that those URLs don't get indexed, you'll want to use meta noindex.)
@Dan: if it's just a couple directories' difference, it shouldn't make a difference. In general, URLs that are human-friendly (relatively easy to read and share) are usually search-friendly too.
We would like Turkish versions of the posts and presentations. Please!
Great tips!
I think that many sites around the globe do not even use a robots.txt; this could dramatically improve your job! Good presentation! I'll write a post about it.
Hi,
my personal website http://djalil.chafai.net/ has completely lost its ranking in Google for the natural keywords "djalil chafai". I don't know the reason. I regret the absence of relevant information in Google Webmaster Central. Is it a crawling optimization problem? Any ideas?
Best.
Thanks for sharing; I will follow your instructions.
we want Turkish versions of these posts
Hi, I have a question about this. Say I have a site like this:
www.example.com is my corporate home page, and I start building www.example.com/argentina and www.example.com/mexico.
Would those other sites under the same core URL be optimized on their own, or will they always depend on the main one?
And finally, would you recommend using subdomains?
thanks
I started a blog ten days ago and have posted information daily since then, but Google still hasn't crawled my blog.
I don't know why. Can anybody suggest something?
If you think this is cool, check out bing.
Even though my site follows these guidelines, I have had a PR of 0 for the last two and a half years. I don't get it.
Can someone tell me why?
my site is
www.computerstar.ca
And don't forget that even Google may have bugs!
http://www.google.com/support/forum/p/Webmasters/thread?fid=09eb7e9ff7b35d21000471993a72c8fe&hl=en
Very fine-tuned crawling info, but there are more things to consider: how do CSS, images, links, and other elements play a role in crawling, and how does the behavior of these elements differ?
Great tips!! But what is the use?? Even though I followed them all, my blog is still not indexed. So sad, you know!!!!!
It's so nice that Google gives each and every bit of help about blogging. Thank you, Google.
Hi everyone,
Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Central Help Forum.
Thanks and take care,
The Webmaster Central Team