Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

The number of pages Googlebot crawls

Thursday, November 09, 2006 at 4:19 PM

The Googlebot activity reports in webmaster tools show you the number of pages of your site Googlebot has crawled over the last 90 days. We've seen some of you asking why this number might be higher than the total number of pages on your sites.


Googlebot crawls pages of your site based on a number of things including:
  • pages it already knows about
  • links from other web pages (within your site and on other sites)
  • pages listed in your Sitemap file
More specifically, Googlebot doesn't access pages; it accesses URLs. And the same page can often be accessed via several URLs. Consider the home page of a site that can be accessed from the following four URLs:
  • http://www.example.com/
  • http://www.example.com/index.html
  • http://example.com
  • http://example.com/index.html
Although all URLs lead to the same page, all four URLs may be used in links to the page. When Googlebot follows these links, a count of four is added to the activity report.
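To make the counting concrete, here is a minimal Python sketch that groups the four URLs above by the page they lead to. It assumes that the www and non-www hosts serve the same content and that index.html is the directory index; both are assumptions about this example site, and the helper name page_key is hypothetical. It is not a description of how Googlebot itself works.

```python
from urllib.parse import urlsplit

def page_key(url, index_names=("index.html",)):
    """Collapse URL variants of the same page into one key (site-specific assumptions)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: www.example.com and example.com serve identical content.
    if host.startswith("www."):
        host = host[4:]
    # An empty path ("http://example.com") means the root path.
    path = parts.path or "/"
    # Assumption: /index.html is the same page as its directory.
    for name in index_names:
        if path.endswith("/" + name):
            path = path[: -len(name)]
    return host + path

urls = [
    "http://www.example.com/",
    "http://www.example.com/index.html",
    "http://example.com",
    "http://example.com/index.html",
]

keys = {page_key(u) for u in urls}
print(len(urls), "URLs ->", len(keys), "page(s)")  # prints "4 URLs -> 1 page(s)"
```

Each URL counts separately in the activity report, even though the grouping shows they all correspond to a single page.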

Many other scenarios can lead to multiple URLs for the same page. For instance, a page may have several named anchors, such as:
  • http://www.example.com/mypage.html#heading1
  • http://www.example.com/mypage.html#heading2
  • http://www.example.com/mypage.html#heading3
And dynamically generated pages often can be reached by multiple URLs, such as:
  • http://www.example.com/furniture?type=chair&brand=123
  • http://www.example.com/hotbuys?type=chair&brand=123
As you can see, when you consider that each page on your site might have multiple URLs that lead to it, the number of URLs that Googlebot crawls can be considerably higher than the total number of pages on your site.

Of course, you (and we) only want one version of the URL to be returned in the search results. Not to worry -- this is exactly what happens. Our algorithms select a version to include, and you can provide input on this selection process.

Redirect to the preferred version of the URL
You can do this using a 301 (permanent) redirect. In the first example, which shows four URLs that point to a site's home page, you may want to redirect index.html to www.example.com/. You may also want to redirect example.com to www.example.com so that any URLs that begin with one version are redirected to the other version. Note that you can also express this latter preference with the Preferred Domain feature in webmaster tools. (If you also use a 301 redirect, make sure that this redirect matches what you set for the preferred domain.)
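As an illustration only, the redirects described above could be sketched as a small Python WSGI application. The names PREFERRED_HOST, preferred_location and app are hypothetical, the sketch treats every host other than www.example.com as non-preferred, and most sites would configure this in their web server rather than in application code.

```python
PREFERRED_HOST = "www.example.com"  # assumption: the www version is preferred

def preferred_location(host, path, query=""):
    """Return the preferred URL if the request should be redirected, else None."""
    new_path = path or "/"
    # Redirect /index.html (in any directory) to the directory URL itself.
    if new_path.endswith("/index.html"):
        new_path = new_path[: -len("index.html")]
    if host == PREFERRED_HOST and new_path == (path or "/"):
        return None  # already the preferred version
    location = "http://" + PREFERRED_HOST + new_path
    if query:
        location += "?" + query
    return location

def app(environ, start_response):
    """Minimal WSGI app: 301 to the preferred URL, otherwise serve a placeholder."""
    host = environ.get("HTTP_HOST", PREFERRED_HOST)
    path = environ.get("PATH_INFO", "/")
    query = environ.get("QUERY_STRING", "")
    target = preferred_location(host, path, query)
    if target is not None:
        start_response("301 Moved Permanently",
                       [("Location", target), ("Content-Length", "0")])
        return [b""]
    body = b"home page\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() can serve the app. With this sketch, a request for /index.html on example.com receives a 301 whose Location is http://www.example.com/.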

Block the non-preferred versions of a URL with a robots.txt file
For dynamically generated pages, you may want to block the non-preferred version using pattern matching in your robots.txt file. (Note that not all search engines support pattern matching, so check the guidelines for each search engine bot you're interested in.) For instance, in the third example that shows two URLs that point to a page about the chairs available from brand 123, the "hotbuys" section rotates periodically and the content is always available from a primary and permanent location. In that case, you may want to index the first version, and block the "hotbuys" version. To do this, add the following to your robots.txt file:

User-agent: Googlebot
Disallow: /hotbuys?*

To ensure that this directive will actually block and allow what you intend, use the robots.txt analysis tool in webmaster tools. Just add this directive to the robots.txt section on that page, list the URLs you want to check in the "Test URLs" section, and click the Check button. For this example, the results would show the "hotbuys" URL as blocked and the "furniture" URL as allowed.
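For a rough local check of the same directive, the following Python sketch translates a Disallow pattern into a regular expression. It handles only the * wildcard and a trailing $ end-of-URL anchor, ignores Allow rules and rule precedence, and uses hypothetical helper names (rule_to_regex, is_blocked). It approximates the pattern matching described above and is no substitute for the analysis tool.

```python
import re

def rule_to_regex(rule):
    """Translate a Disallow path using * (and an optional trailing $) into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape literal text (including "?") and let each * match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_blocked(url_path, disallow_rules):
    """True if the path (with query string) matches any Disallow rule from its start."""
    return any(rule_to_regex(rule).match(url_path) for rule in disallow_rules)

rules = ["/hotbuys?*"]
test_paths = [
    "/furniture?type=chair&brand=123",
    "/hotbuys?type=chair&brand=123",
]

for path in test_paths:
    print(path, "->", "blocked" if is_blocked(path, rules) else "allowed")
```

Under these assumptions the "furniture" URL is reported as allowed and the "hotbuys" URL as blocked. Because the rule includes the literal "?", a plain /hotbuys URL without a query string is not matched.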

Don't worry about links to anchors, because while Googlebot will crawl each link, our algorithms will index the URL without the anchor.
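The effect of dropping the anchor can be pictured with Python's standard urllib.parse.urldefrag, applied here to the three named-anchor URLs from earlier in the post. This is only a sketch of the idea, not a description of how our algorithms are implemented.

```python
from urllib.parse import urldefrag

links = [
    "http://www.example.com/mypage.html#heading1",
    "http://www.example.com/mypage.html#heading2",
    "http://www.example.com/mypage.html#heading3",
]

# urldefrag splits off the "#..." part; .url is the address without the anchor.
indexed = {urldefrag(link).url for link in links}
print(indexed)  # prints {'http://www.example.com/mypage.html'}
```

Three crawled links collapse to a single URL once the anchors are removed.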

And if you don't provide input such as that described above, our algorithms do a really good job of picking a version to show in the search results.
The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

9 comments:

Info4BeingRich said...

For my blog at http://info4beingrich.blogspot.com, Google webmaster tools shows that the blog has 701 external links. However, when I try to see the sites linking to me by using links:info4beingrich.blogspot.com, it says there are no sites linking to my blog. Why does this happen, and what can I do on my side to make things better?

Eve said...

Hi, I hope I am not straying too far from the main theme here, as my question is about a specific page not showing rather than the number of pages indexed, but I have searched and searched to find the answer to my question without success. The question is this... my site is indexed and a lot of inner pages appear for given search terms and in the site:command in webmasters tools, but the site's home page does not. In the Webmaster Tools it says the home page was last accessed on a given date but then if I run the site:command the home page is not there. This has happened before, and usually the home page comes back again within a few days - this time it has been missing for over two weeks!! and I am worried that it may be gone for ever.

hans said...

Dear Vanessa,
I came to this message because with the site www.privesafari.com I have the opposite problem: Googlebot always crawls fewer pages than Google has indexed.
The Google activity report shows: Indexed pages: 49, crawled pages: average 6, min 1, max 40.
I have issued a sitemap with 49 URLs, which is read frequently.
I update my pages once every two weeks.
Can you help me with this problem?

Warm greetings,
Hans de Bats

Deepak Bhasin said...

Our website (www.mingloo.com) doesn't have all of its pages indexed, and our site has been up for over 6 months now. Site:www.mingloo.com on Yahoo shows over 800 pages, but on Google it only returns 154 pages indexed. Please advise what I am supposed to do to ensure the site is indexed in a timely manner.

Thanks
Deepak

D said...

Why has the number of pages indexed in Google decreased so much in this period?
www.farmaceuti.com had 18000+ pages indexed by Google lately, but over the last couple of weeks that has decreased dramatically to 2900+ pages today.

Anuj said...

I am a little confused about how Google is indexing my site. The site is here: www.anujgakhar.com. Google webmaster tools keeps showing that Googlebot is crawling my website, but under the Links section it doesn't show any internal or external links at all. It looks like it's crawling only the homepage. Any ideas on what could be wrong?

Şahin said...

How can we understand that we made google friendly link direction?
Our redirection is
Firefly Türkiye to Çocuk Cep Telefonu

Jyoti said...

How does Googlebot crawl dynamic pages, and why doesn't it index all of them?
And what about the pages resulting from a search query form: do they get indexed by Google?

Google Webmaster Central said...

Hi everyone,

Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.

Thanks and take care,
The Webmaster Central Team