Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

New parameter handling tool helps with duplicate content issues

Monday, October 05, 2009 at 12:33 PM

Duplicate content has been a hot topic among webmasters and on our blog for over three years. One of our first posts on the subject came out in December of '06, and our most recent post was last week. Over the past three years, we've been providing tools and tips to help webmasters control which URLs we crawl and index, including a) use of 301 redirects, b) www vs. non-www preferred domain setting, c) change of address option, and d) rel="canonical".

We're happy to announce another feature to assist with managing duplicate content: parameter handling. Parameter handling allows you to view which parameters Google believes should be ignored or not ignored at crawl time, and to override our suggestions if necessary.


Let's take our old example of a site selling Swedish fish. Imagine that your preferred version of the URL and its content looks like this:
http://www.example.com/product.php?item=swedish-fish

However, you may also serve the same content on different URLs depending on how the user navigates around your site, or your content management system may embed parameters such as sessionid:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678

With the "Parameter Handling" setting, you can now provide suggestions to our crawler to ignore the parameters category, trackingid, and sessionid. If we take your suggestion into account, the net result will be a more efficient crawl of your site, and fewer duplicate URLs.
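As a rough illustration of which URLs end up being treated as equivalent, here is a minimal Python sketch that drops the ignored parameters from a URL's query string. The function name and the IGNORED_PARAMS set are hypothetical, chosen only for this example; this is not how our crawler is implemented, and it says nothing about which of the equivalent URLs actually gets crawled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters marked "Ignore" (taken from the example above).
IGNORED_PARAMS = {"category", "trackingid", "sessionid"}


def strip_ignored_params(url, ignored=IGNORED_PARAMS):
    """Return the URL with every ignored query parameter removed."""
    parts = urlsplit(url)
    # Keep the remaining parameters in their original order.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "http://www.example.com/product.php?item=swedish-fish",
    "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy",
    "http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678",
]

# All three example URLs reduce to the same key, so they form one group.
keys = {strip_ignored_params(u) for u in urls}
```

In this sketch the three example URLs collapse to a single key, which is the sense in which ignoring category, trackingid, and sessionid yields fewer duplicate URLs.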

Since we launched the feature, here are some popular questions that have come up:

Are the suggestions provided a hint or a directive?
Your suggestions are considered hints. We'll do our best to take them into account; however, there may be cases when the provided suggestions may do more harm than good for a site.

When do I use parameter handling vs rel="canonical"?
rel="canonical" is a great tool to manage duplicate content issues, and has had huge adoption. The differences between the two options are:
  • rel="canonical" has to be put on each page, whereas parameter handling is set at the host level
  • rel="canonical" is respected by many search engines, whereas parameter handling suggestions are only provided to Google
Use which option works best for you; it's fine to use both if you want to be very thorough.

As always, your feedback on our new feature is appreciated.

The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

34 comments:

HollowMarkeD said...

Great tool to help SEO of a site, particularly if we can't get access to URL rewrites - one question though - what happens to links pointed at URLs with variables within them? If those variables are ignored, are the links treated like a canonical tag will behave, or are those links just not counted?

Simon Lynch said...

Great feature! Why the limit to 15 params? What about a switch to turn all params off for a site? As GoogleBot seems to be following more and more JavaScript, it would be helpful to be able to tell it to take the non-JavaScript pages and stop trying to do things which will only work for users in session with JS on. We have ended up moving a load of stuff out of having logical URLs to a directory which we disallowed with robots.txt. This not only doesn't make sense, but also means thousands of errors in Webmaster Tools.

1918 said...

Smart for all parties involved!

Sagar Kamdar said...

@HollowMarkeD, yes, if we adhere to the hint provided, all the links will be treated similarly to how we treat the canonical tag.

@Simon, can you provide your site and example on the forum so we can take a further look: http://www.google.com/support/forum/p/Webmasters?hl=en

ellipsis said...

Please clarify ... assuming the directive is followed

1) Will http://www.example.com/product.php?item=swedish-fish&category=gummy-candy be fetched as a) http://www.example.com/product.php?item=swedish-fish&category=gummy-candy or b) http://www.example.com/product.php?item=swedish-fish ?

2) Will http://www.example.com/product.php?item=swedish-fish&category=gummy-candy be indexed as a) http://www.example.com/product.php?item=swedish-fish&category=gummy-candy or b) http://www.example.com/product.php?item=swedish-fish ?

3) Will links to http://www.example.com/product.php?item=swedish-fish&category=gummy-candy be treated as links to a) http://www.example.com/product.php?item=swedish-fish&category=gummy-candy or b) http://www.example.com/product.php?item=swedish-fish ?

4) Does the answer to question 3 depend on whether the links are on http://www.example.com/ (i.e. the same domain) or on another domain or subdomain?

5) If the answer to question 3 is "b", will the full link weight be ascribed?

6) Is it worth adding all of the Google Analytics utm_ parameters to the list (costing many of the 15 entries), or are these automatically excluded anyway?

My assumed answers:
1) b
2) b
3) b
4) No
5) Probably about as much as is passed through a 301
6) Add them but cull them first if the 15 parameter limit is exceeded

Phil said...

Fantastic :)

DataPlus - Custom Data Services said...

Thanks for clearing this up. I saw the option in my Webmaster Tools but didn't know what to do with it.

Admin said...

This is what I've been waiting for: a full explanation of how to handle duplicate content... nice job

Chien-I Liao said...

@ellipsis

Once you mark a parameter "IGNORE", we treat all URLs that differ only in the value of that parameter as equivalent. Therefore:
(1) Will Google still crawl URLs containing "IGNORE" parameters?
YES! But once we crawl one URL with that parameter we won't crawl another equivalent URL, including the URL without that parameter.
For example, if your site has English and Spanish versions and you set the "language" parameter to IGNORE, there is no guarantee whether we crawl the English version or the Spanish one, even if the URL without that parameter defaults to English.
(2) We only follow links on the one URL we crawl, which is a random one. If two pages have links to different content you might lose crawl coverage (in a random way).

skiseo said...

I've got an issue with nested jsessionid's related to a 301 redirect such that

www.site-v1.com/product-detail.jsp

redirect points to

www.site-v2.com/product.jsp?product-detail

the resulting page is

www.site-v2.com/product.jsp;jsessionid=xxx.ecom105_main?product-detail

as you can see the redirect destination URL's session ID is appended to the middle of the destination path.

Question: If I define "jsessionid=*.ecom*_main" as a parameter to "Ignore" within the new tool, will this suggest the spiders consider said nested param URL as the URL less "Ignored" parameter?

ellipsis said...

@Chien-I Liao

Thanks for the advice! You raise more questions though. The reality seems more complex than the original blog post implies.

Would it be possible to give a follow-up post/article with examples of the possible treatments (because it seems there is more than one) of a simple three or four page dynamic site?

Drew said...

The Parameter Handling settings are under Site configuration -> Settings, at the bottom.

Darren said...

Hello

We use Helicon ISAPI URL which is a htaccess URL rewrite module for IIS. It converts a URL like product.asp?Category=1 to /books/horror - if we disable "?Category" as the parameter will Google still index our site - can you guys read htaccess rewrites?

roulette said...

Has anyone tried this with an HTTPS site?

None of my HTTPS sites are able to use the parameter handling feature.

webmastertools said...

There is another important difference between parameter handling and the canonical URL. The canonical URL does not influence indexing and crawling, because search engines must still visit all URLs; parameter handling, on the other hand, gives Google a directive to ignore the parameter.

Aryo Halim said...

Just wondering why my site doesn't have a Google PageRank?

Daniel said...

Thank you very much. Yesterday I defined 3 parameters and marked them "Ignore". My question: how much time does it take for the changes to take effect? I really have a lot of duplicated content due to these parameters :-(

Jonathan said...

@ellipsis

You say "once we crawl one URL with that parameter we won't crawl another equivalent URL" - ok I get that.

Then you say "including the URL without that parameter." This is where you've lost me as it sounds like other URLs NOT containing that parameter won't be crawled.

Can you explain the last comment further?

ellipsis said...

@Jonathan

I never said that ... Chien-I Liao did. But I'll attempt to answer...

Even though "category" is set to be an ignored parameter, http://www.example.com/product.php?item=swedish-fish&category=gummy-candy may still be crawled. But, if it is, http://www.example.com/product.php?item=swedish-fish will not be crawled as it will be treated as having been crawled already (as http://www.example.com/product.php?item=swedish-fish&category=gummy-candy).

This is far from ideal, IMO.

Jonathan said...

@ellipsis - Thanks. Far from ideal indeed.

@Chien-I Liao - can you confirm?

2fer said...

Hello, I have a retail music website. our content for sale is now being completely ignored by google. a possible duplicate content issue.

Question: is it possible that google is confusing our browse pages '/browse/set?category=album' for the actual content ex. '/artist/name/release/number-name' ?

the way PHP sites list content pages is very similar to the way our browse urls are structured.

James Ryddel said...

I have implemented the ignore parameter feature for several parameters including SID, but several days after implementation Google is STILL crawling the links in question.

I have also implemented canonical link refs in my header for each of the pages I want indexing.

SO how long does it take Googlebot to work through the changes? I have posted this question on Google webmaster tools help forum but no one has replied and I've been waiting for over 48 hours.

Alan said...

Hi. This tool is really great - I've been worrying about duplicate content since about 3 months back, when we introduced session ids (which we did to help individual users get consistent search results) and the ability for users to get prices in different currencies. Since then our indexed pages have rocketed up, so duplicate content issues crossed my mind. I switched on parameter handling about 3 weeks ago for these parameters, but using the "site:" command still shows multiple url entries, e.g. site:www.bluechipvacations.com/holiday-cottages/brixham/3-moorings-reach.html shows 9 urls listed, 4 with session ids appended and two with currency ids. Is there a set amount of time before the parameter handling kicks in? Thanks

Marf said...

I'm using this on my blog to remove the showComment parameter. That is the single reason why I have reports of duplicate content on my blog.

So far I have noticed no change or results, but it's only been a few days.

webmastertools said...

My site gets suggestions from Google to ignore more than 15 parameters. I would like to block these parameters, but the maximum number of parameters to ignore is 15 according to the blog post.

Is it possible to ignore more than 15 parameters if they come as suggestions from Google in Webmaster Tools?

kevin said...

Can I use Parameter handling to have Google ignore all of my duplicate https pages?

FreeRSBots.net said...

What if I want to remove my index.php?

For example, google is counting both my
http://www.wolfteamhacks.net/ and http://www.wolfteamhacks.net/index.php as different pages.

PixelShots said...

I have my urls on the default blogspot.com domain and also on a redirection domain at .co.nr

How do I stop Google from considering the co.nr domain url for my blog? It seems my original blogspot one is now being considered a duplicate of the co.nr one.

sales said...

Hi there,

My site map urls and page links are created with parameters using the '&amp;' instead of plain old simple '&'. It is my belief this is the technically compliant way to specify an url.
ie '?parameter1=value1&amp;parameter2=value2'
When you visit the site, the ampersand displays in the url bar as '&' as expected.
However...and this is a concern...
Google Webmaster tools lists my parameters as 'amp;parameter2' (for example).
This looks like a potential problem to me...can you shed any light on this issue?
Thanks in advance.

Affordable web design, graphic design and seo company from Amsterdam, The Netherlands said...

Too bad I cannot set a canonical tag in my Blogger blog, right? And what if I want to transfer one article to another website or blog: how can I tell the search engine that the article on the new website is the (new) original? Or will I not be penalized if I have removed the article from the old blog?

Clarke Adom said...

Hi,

My webmasters account shows 300 duplicate descriptions and some 200 duplicate title tags. This is primarily because of a parameter. I have set this to be ignored by Google but such pages still show in the webmasters report.

Please help
thanks.

Herve said...

Hi everybody,

I have the same issue with the homepage of my website. I added the canonical tag as soon as it existed, and set up Parameter Handling in Google Webmaster Tools more than a year ago, but the duplicate is still there.

Any idea about it ?

John said...

@Herve did you ever resolve this issue? If so, how ?

Google Webmaster Central said...

Hi everyone,

Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Central Help Forum.

Thanks and take care,
The Webmaster Central Team