Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Google, duplicate content caused by URL parameters, and you

Wednesday, September 12, 2007 at 1:13 AM



How can URL parameters, like session IDs or tracking IDs, cause duplicate content?
When user and/or tracking information is stored through URL parameters, duplicate content can arise because the same page is accessible through numerous URLs. It's what Adam Lasnik referred to in "Deftly Dealing with Duplicate Content" as "store items shown (and -- worse yet -- linked) via multiple distinct URLs." In the example below, URL parameters create three URLs which access the same product page.

[Diagram: URL parameters create three distinct URLs that access the same product page]
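To make the duplication concrete, here is a minimal Python sketch that collapses three parameterized URLs into one. The host www.example.com, the parameter names in NON_CONTENT_PARAMS, and the function name canonicalize are assumptions made for illustration; this is not a description of how Google processes URLs.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters that do not change the page content.
NON_CONTENT_PARAMS = {"affiliateid", "trackingid", "sessionid"}

def canonicalize(url):
    """Return the URL with the non-content parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in NON_CONTENT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

# Three distinct URLs that all serve the same product page (made-up values).
urls = [
    "http://www.example.com/product.php?item=swedish-fish",
    "http://www.example.com/product.php?item=swedish-fish&trackingid=1234",
    "http://www.example.com/product.php?item=swedish-fish&sessionid=5678&affiliateid=ABCD",
]
canonical_urls = {canonicalize(u) for u in urls}
print(canonical_urls)
```

All three inputs reduce to the same string, which is why a crawler that treats every distinct URL as a separate page sees three copies of one product page.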

Why should you care?
When search engines crawl identical content through varied URLs, there may be several negative effects:

1. Having multiple URLs can dilute link popularity. For example, in the diagram above, rather than 50 links to your intended display URL, the 50 links may be divided three ways among the three distinct URLs.

2. Search results may display user-unfriendly URLs (long URLs with tracking IDs or session IDs), which:
* Decreases the chance of a user selecting the listing
* Offsets branding efforts


How we help users and webmasters with duplicate content
We've designed algorithms to help prevent duplicate content from negatively affecting webmasters and the user experience.

1. When we detect duplicate content, such as through variations caused by URL parameters, we group the duplicate URLs into one cluster.

2. We select what we think is the "best" URL to represent the cluster in search results.

3. We then consolidate properties of the URLs in the cluster, such as link popularity, to the representative URL.

Consolidating properties from duplicates into one representative URL often provides users with more accurate search results.
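The three steps above can be pictured with a toy Python sketch. It is only an illustration under stated assumptions, not Google's algorithm: the grouping key (the URL minus a hypothetical set of tracking parameters), the "shortest URL wins" rule, and the link counts are all invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

# Hypothetical tracking parameters; real duplicate detection looks at content.
TRACKING_PARAMS = {"affiliateid", "trackingid", "sessionid"}

def cluster_key(url):
    """Group key: host, path, and the sorted non-tracking parameters."""
    parts = urlsplit(url)
    params = sorted((key, value) for key, value in parse_qsl(parts.query)
                    if key.lower() not in TRACKING_PARAMS)
    return (parts.netloc.lower(), parts.path, tuple(params))

def consolidate(inbound_links):
    """Map each cluster's representative URL to the cluster's total link count."""
    clusters = defaultdict(list)
    for url in inbound_links:
        clusters[cluster_key(url)].append(url)            # step 1: cluster
    result = {}
    for members in clusters.values():
        # step 2: pick a representative (assumed rule: shortest URL)
        representative = min(members, key=lambda u: (len(u), u))
        # step 3: consolidate link popularity onto the representative
        result[representative] = sum(inbound_links[u] for u in members)
    return result

# Made-up link counts: 50 links split across three URLs for one page.
inbound_links = {
    "http://www.example.com/product.php?item=swedish-fish": 20,
    "http://www.example.com/product.php?item=swedish-fish&trackingid=1234": 18,
    "http://www.example.com/product.php?item=swedish-fish&sessionid=5678": 12,
}
consolidated = consolidate(inbound_links)
print(consolidated)
```

In this toy data the 50 links that were divided three ways end up credited to the single representative URL.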


If you find you have duplicate content as mentioned above, can you help search engines understand your site?
First, no worries: many sites on the web use URL parameters, and for valid reasons. But yes, you can help reduce potential problems for search engines by:

1. Removing unnecessary URL parameters -- keep the URL as clean as possible.

2. Submitting a Sitemap with the canonical (i.e. representative) version of each URL. While we can't guarantee that our algorithms will display the Sitemap's URL in search results, it's helpful to indicate the canonical preference.
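As a rough sketch of the second suggestion, the Python below writes a Sitemap that lists only canonical URLs. The function name build_sitemap and the example URLs are assumptions for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(canonical_urls):
    """Return Sitemap XML (as a string) with one <url><loc> per canonical URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in canonical_urls:
        # ElementTree escapes characters such as & in the URL text.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Hypothetical canonical URLs: no tracking, session, or affiliate parameters.
sitemap_xml = build_sitemap([
    "http://www.example.com/product.php?item=swedish-fish",
    "http://www.example.com/product.php?item=gummy-bears",
])
print(sitemap_xml)
```

The point of the design is simply that each page appears once, under the URL you would prefer to see in search results.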


How can you design your site to reduce duplicate content?
Because of the way Google handles duplicate content, webmasters need not be overly concerned with the loss of link popularity or loss of PageRank due to duplication. However, to reduce duplicate content more broadly, we suggest:

1. When tracking visitor information, use 301 redirects to redirect URLs with parameters such as affiliateID, trackingID, etc. to the canonical version.

2. Use a cookie to set the affiliateID and trackingID values.

If you follow this guideline, your webserver logs could appear as:

127.0.0.1 - - [19/Jun/2007:14:40:45 -0700] "GET /product.php?category=gummy-candy&item=swedish-fish&affiliateid=ABCD HTTP/1.1" 301 -

127.0.0.1 - - [19/Jun/2007:14:40:45 -0700] "GET /product.php?item=swedish-fish HTTP/1.1" 200 74

And the session file storing the raw cookie information may look like:

category|s:11:"gummy-candy";affiliateid|s:4:"ABCD";

Please be aware that if your site uses cookies, your content (such as product pages) should remain accessible with cookies disabled.
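The redirect-plus-cookie approach can be sketched in Python as a function that decides what to send back for a request. This is a minimal sketch, not the code behind the log lines above: the name handle_request, the COOKIE_PARAMS set, and the choice to set plain cookies (rather than a server-side session file) are assumptions for illustration.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Assumption: these parameters carry tracking or navigation state, not content.
COOKIE_PARAMS = {"category", "affiliateid", "trackingid"}

def handle_request(path_and_query):
    """Return (status, headers); tracked URLs get a 301 to the canonical URL."""
    parts = urlsplit(path_and_query)
    pairs = parse_qsl(parts.query)
    tracked = [(k, v) for k, v in pairs if k.lower() in COOKIE_PARAMS]
    if not tracked:
        return 200, []                  # already canonical: serve the page
    kept = [(k, v) for k, v in pairs if k.lower() not in COOKIE_PARAMS]
    location = parts.path + ("?" + urlencode(kept) if kept else "")
    headers = [("Location", location)]
    # Move the tracked values into cookies so they survive the redirect.
    # Real code should validate and encode these values first.
    headers += [("Set-Cookie", f"{k}={v}; Path=/") for k, v in tracked]
    return 301, headers

status, headers = handle_request(
    "/product.php?category=gummy-candy&item=swedish-fish&affiliateid=ABCD")
print(status, headers)
```

A request for the canonical URL itself returns 200 with no cookies required, which matches the advice that pages stay accessible with cookies disabled.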


How can we better assist you in the future?
We recently published ideas from SMX Advanced on how search engines can help webmasters with duplicate content. If you have an opinion on the topic, please join our conversation in the Webmaster Help Group (we've already started the thread).

26 comments:

Mike said...

Interesting. I have a site that uses an alias; the main URL is news.motorbiker.org and the alias is blogs.motorbiker.org. In the meantime, Google has also indexed www.motorbiker.org/blogs.nsf (the platform is Lotus Domino).

All three point to the exact same page.

I can't redirect (no htaccess available), so I'm stuck in limbo land.

The urls themselves are clean.

What would you suggest? It's confusing for my readers and for Google's crawlers, and it plays havoc with my PR factor.

dslr said...

My main issue with fighting duplicate pages in the index is that there is no easy way to stop Google from using certain query string params which do not materially change the page.

For example if you have a table of data, and the URL you wish to index is /blah/datatable but there is also /blah/datatable?sort=r1&color=blue&xyz=3 you certainly don't want search engines to index every combination of these three params.

Forcing all non material params into a cookie is not a solution. Offering pages to crawlers without evidence of these links is not very good either. They can't be blocked in robots.txt, and on the fly returning 301s if we know it is a crawler is a kludge as well.

If robots.txt is the "robot exclusion protocol" then it needs to be extended to allow nomination of query string arguments that are to be stripped out before concluding a page has been crawled already or not.

So something like
Disallow: /badpath
Avoid: /path arg,arg,arg
Include: /path arg,arg,arg

Or if that can't be done then a sitemap extension to list URLs + Args that should be ignored.

Jennifer Mathews Somogyi said...

I have run into the same issues with duplicate content in the past with some of the larger sites I have worked on. We had used tracking similar to a session ID and tried to get different files in different parts of the site indexed and ranked with different IDs. The problem is that the IDs were dynamic, so that when a bot hit one section with an ID and crawled the rest of the site, it held onto the ID no matter where it went. The result was a list of 4-5 URLs for the exact same page. The cleanup was a big mess, and is still an issue to this day.

I have also run into issues with duplicate content in serving up dynamic landing pages for natural SEO. We wanted the user searching a specific term to land on the page that was most relevant for that term, in order to provide a more user-friendly process than just dumping them at the home page and making them find what they were looking for all over again. The problem was that we had one project with over 200,000 dynamically generated pages, and another with 40 million pages.

I ran a case study on duplicate content to find out what is considered duplicate content so that we wouldn't get pegged for it. The results were very interesting...

Jennifer Mathews Somogyi said...

Another response to Mike's comment above -
I had issues with the multiple domains for one of my sites. I usually try to avoid using multiple domains, but in this case there was no way around it.

The solution (for Google) was to remove the domain from the Google index through the webmaster tools.

You can read more about it and how to correct the issue in my blog post - Problems With Multiple Domains

Shawn K. Hall said...

DSLR,

you can use robots.txt to filter URLs based on patterns - at least as far as Google is concerned:

User-agent: Googlebot
Disallow: /*?sort*
Disallow: /*&sort*

More information here:
http://www.google.com/support/webmasters/bin/answer.py?answer=40367&topic=8846
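For readers wondering what those two patterns would catch, here is a rough Python approximation of wildcard matching, where * stands for any run of characters and a trailing $ anchors the end of the URL. It is a sketch for testing patterns locally, not Googlebot's implementation; the function names and the sample URLs are made up.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a Disallow pattern with * (and a trailing $) into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path_and_query, patterns):
    """True if any pattern matches, starting from the beginning of the path."""
    return any(robots_pattern_to_regex(p).match(path_and_query) for p in patterns)

patterns = ["/*?sort*", "/*&sort*"]

for url in ["/blah/datatable",
            "/blah/datatable?sort=r1&color=blue",
            "/blah/datatable?color=blue&sort=r1",
            "/blah/datatable?sortable=1"]:
    print(url, is_disallowed(url, patterns))
```

Note that the last sample URL is also blocked, because sort* matches any parameter name that merely begins with "sort".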


On a side note, I wrote an article describing the duplicate content issue of URL naming last year, here:
http://12pointdesign.com/advice/url_canonicalization.asp

Neyne said...

Interesting about transferring all the links to the chosen URL from all the others.

I have a question - does this also happen when the duplicate content is found on a different domain? Let's say someone is scraping my content and Google decides to dump my website and show the scraper instead; does all the link juice get transferred to the scraping site?

Klaus said...

Maile (or anyone out there)

I read with great interest your posting on duplicate content (12th Sept).
I run www.selectiveasia.com and have just launched my site in Australia (www.selectiveasia.com.au), with the vast number of pages being direct duplicates. From what I have read in various forums, this is not going to cause any problems with my rankings; however, the numbers do seem to be falling a little, and I would be interested if you had any thoughts on this (i.e., if what I have read is correct, is it safe to put another site up?). The .com.au is hosted in Australia. My priority is to protect the ranking of the .com.

Any information would be most welcome.

Nick

Spanish speaker said...

Hello. This site, http://posicionar-web.blogspot.com/2007/09/posicionar-web-con-blogs.html, has copied all of my content.

My site is http://www.mecagoenlos.com/Posicionamiento/posicionamiento-con-blogs.php

This content is private, but he copied it.

How can I get it deleted from Blogspot?
The next step will be legal action.

Adam said...

Maile,

Can you clarify what you wrote in the "Why should you care?" section:

"1. Having multiple URLs can dilute link popularity. For example, in the diagram above, rather than 50 links to your intended display URL, the 50 links may be divided three ways among the three distinct URLs."

Vs. what you said in the "How we help users and webmasters with duplicate content" section:

"3. We then consolidate properties of the URLs in the cluster, such as link popularity, to the representative URL."

If multiple coded URLs are used for tracking (for example www.site.com/story, www.site.com/story?xid=rss, www.site.com/story?xid=topstories, etc.) is the link popularity of all the URLs completely consolidated into the URL that Google deems the "best URL"? Or is link popularity dilution still an issue?

Thank you.

ZIP Drugs - Legal Discount Online Pharmacy said...

I submitted a Sitemap file to Google over 3 months ago. That file included about 4200 products. To date only 2300 have been indexed. Is there a way to get the rest of the products indexed?

SmartlikeStreetcar said...

How long does it take before Google starts using the URLs listed in the Sitemap?

I recently redesigned a site, and as part of the redesign, I renamed most pages, and added the new URLs to the sitemap. I deleted the old sitemap, and let Google know about the new one.

But I kept many of the old URLs active, to give Google time to find the new pages. (And some of the less important URLs have been deactivated.)

Yet more than six weeks into the process, Google is still showing errors, unable to find several old pages. And the Googlebot seems to be following the old, dead-end URLs, and ignoring the new ones.

I know that six weeks isn't such a long time... I didn't expect the new pages to appear in the index after only six weeks. But I'm discouraged that the Googlebot seems to be ignoring the new sitemap, and searching out old links, and old pages.

rosjules said...

Hi, I am having trouble verifying my website.
I click on "choose verification method", then select "upload an HTML file".
I then click on number 2, which reads that I've uploaded the file to my website, with my website name and the Google numbers after it. I click on this, but it times out. Am I doing this right?
Thanks, Ross. Hoping someone can reply.

Maile Ohye said...

We updated the thread in our Webmaster Help Group.

Nevyan said...

It is also important to clear up the session variables appended to the URL.

Using .htaccess one could allow only .html requests, clearing up all the remaining query parameters:
ie: example.com/page.html?id=123
will be redirected to example.com/page.html

RewriteCond %{query_string} .
RewriteRule ^([^.]+)\.html$ http://example.com/$1.html? [R=301,L]

Thanks to jdMorgan's method from webmasterworld.

More: Avoiding duplicate content

MLazarus said...

I also have an issue with a site that uses an alias.

I have www.GLComputing.com.au but it's also found at www.glcomp.bevhost.com

I have found that some links on web pages to www.GLComputing.com.au/hhc actually show up in Google's index as pointing to www.glcomp.bevhost.com/hhc and I can't explain why.

I've tried to remove the glcomp.bevhost.com site from Google, but it responds that because the site doesn't return a 404 or 410, it won't remove it.

Appreciate any suggestions...

cape said...

OK, so how can I find out if I was really banned for duplicate content?

Increase Search Engine Ranking said...

301 redirects and .htaccess access are critical. You should consider using a new hosting provider if you want to Increase Your Search Engine Ranking.

SavaS said...

Hey Nevyan, you said that,

Blogger Nevyan said...

It is also important to clear up the session variables appended to the URL.

Using .htaccess one could allow only .html requests, clearing up all the remaining query parameters:
ie: example.com/page.html?id=123
will be redirected to example.com/page.html

RewriteCond %{query_string} .
RewriteRule ^([^.]+)\.html$ http://example.com/$1.html? [R=301,L]

Thanks to jdMorgan's method from webmasterworld.

is it right?

Visit here

Dan said...

We used rel="nofollow". Specifically:

We have a page that contains a list of products (e.g. /products.aspx)

On the page we have links that allow users to filter and sort that result in various ways. Each one of those redirects to the same page with various parameters (so that users can bookmark that version. e.g. /products.aspx?sort=price, /products.aspx?sort=brand&category=10). To prevent search engines from trying to index those page versions, we put rel="nofollow" attributes on each of those sort and filter links.

Also, since our listing is long, we have it spread out over multiple pages and we have links to those pages (e.g. /products.aspx?page=2, etc.). We want those links followed and those pages indexed, so these links do not have "nofollow".

Is this good practice? If Google sees a link to a page with "nofollow" does it simply ignore that one link but still follow other links to that same page? We are concerned about possible penalties resulting from using "nofollow" when pointing to our pages.

So far, things seem to be OK in terms of what we see indexed, but our PageRank is still pretty low.

Susan Moskwa said...

Dan:

Using nofollow will only prevent Googlebot from following that particular link from page A to page B. It's still entirely possible for Googlebot to find other links to page B on the web, so page B could still be crawled and indexed.

If you want to block Googlebot from crawling certain pages, I'd recommend using a robots.txt file. You can use pattern matching to block access to URLs with specific parameters or patterns.
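The difference can be seen in a small Python sketch of link discovery. It is a toy model under stated assumptions, not a description of Googlebot: the class name LinkCollector, the helper links_to_follow, and both sample pages are invented for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hrefs, separating links marked rel="nofollow" from the rest."""
    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        rel_tokens = (attrs.get("rel") or "").lower().split()
        target = self.nofollowed if "nofollow" in rel_tokens else self.followed
        target.append(href)

def links_to_follow(pages):
    """Union of all links, across pages, that are not marked nofollow."""
    discovered = set()
    for html in pages:
        collector = LinkCollector()
        collector.feed(html)
        discovered.update(collector.followed)
    return discovered

SORTED_URL = "http://www.example.com/products.aspx?sort=price"
# Your own page nofollows the sorted view...
own_page = ('<a href="http://www.example.com/products.aspx">All products</a> '
            f'<a rel="nofollow" href="{SORTED_URL}">Sort by price</a>')
# ...but a page on someone else's site links to it normally.
other_page = f'<a href="{SORTED_URL}">Cheapest products</a>'

discovered = links_to_follow([own_page, other_page])
print(SORTED_URL in discovered)
```

The sorted URL is skipped on the first page but still discovered through the second, which is the case a robots.txt rule would cover.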

gumbi said...

Hi!

I am using Wordpress blogs for several of my sites.

Do I need to worry about duplicate content since you can find old posts via /archives/post.html or via the tags, such as /tags/post.html?

I am wondering if this is considered duplicate content or not and there is a great debate in the Wordpress community about this topic.

Thanks for any help that you can offer!

Dan said...

Susan,

Any link in our site that points to a filtered (duplicate) version of the page uses a "nofollow" attribute. Only links to the unfiltered (full/unique) version of the page are regular links without "nofollow." We think (and it appears to be confirmed by our logs) that this is sufficient to prevent indexing of these page variants.

So, it would seem necessary to use robots.txt for this instance.

Then again, we have full control of all our source code so it may not be an option for some.

Dan said...

I meant:

So, it would seem *unnecessary* to use robots.txt for this instance

Susan Moskwa said...

Hi Dan:

I understand that you're only linking to these pages using nofollowed links; but how do you know that no one else on the web is linking to them? If even one person links to a page without nofollowing the link, it's possible that Googlebot could crawl and index that page.

If you're happy with the current setup, then that's great; but I just want to make sure anyone reading this understands that nofollowing a link to page X does not necessarily prevent search engines from crawling or indexing page X.

Rockz said...

Yes, that's right. I know a Pligg CMS-based site, http://www.jeqq.com, which has a better PR, and Pligg CMS is designed in a bad way which affects SEO... I have tweaked it this way. Please check the site http://www.jeqq.com and let me know your feedback.

Google Webmaster Central said...

Hi everyone,

Since several months have passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.

Thanks and take care,
The Webmaster Central Team