Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

New parameter handling tool helps with duplicate content issues

Monday, October 05, 2009 at 12:33 PM

Duplicate content has been a hot topic among webmasters and on our blog for over three years. One of our first posts on the subject came out in December of '06, and our most recent post was last week. Over the past three years, we've been providing tools and tips to help webmasters control which URLs we crawl and index, including a) use of 301 redirects, b) www vs. non-www preferred domain setting, c) change of address option, and d) rel="canonical".

We're happy to announce another feature to assist with managing duplicate content: parameter handling. Parameter handling allows you to view which parameters Google believes should be ignored or not ignored at crawl time, and to override our suggestions if necessary.


24 comments:

HollowMarkeD said...

Great tool to help the SEO of a site, particularly if we can't get access to URL rewrites. One question though: what happens to links pointed at URLs with variables within them? If those variables are ignored, are the links treated the way a canonical tag would treat them, or are those links just not counted?

Simon Lynch said...

Great feature! Why the limit to 15 params? What about a switch to turn all params off for a site? As GoogleBot seems to be following more and more JavaScript, it would be helpful to be able to tell it to take the non-JavaScript pages and stop trying to do things which will only work for users in session with JS on. We have ended up moving a load of stuff out of having logical URLs to a directory which we disallowed with robots.txt. This not only doesn't make sense, but also means thousands of errors in Webmaster Tools.

1918 said...

Smart for all parties involved!

Sagar Kamdar said...

@HollowMarkeD, yes, if we adhere to the hint provided, all the links will be treated similarly to how we treat the canonical tag.

@Simon, can you post your site and an example on the forum so we can take a further look? http://www.google.com/support/forum/p/Webmasters?hl=en

ellipsis said...

Please clarify ... assuming the directive is followed

1) Will http://www.example.com/product.php?item=swedish-fish&category=gummy-candy be fetched as a) http://www.example.com/product.php?item=swedish-fish&category=gummy-candy or b) http://www.example.com/product.php?item=swedish-fish ?

2) Will http://www.example.com/product.php?item=swedish-fish&category=gummy-candy be indexed as a) http://www.example.com/product.php?item=swedish-fish&category=gummy-candy or b) http://www.example.com/product.php?item=swedish-fish ?

3) Will links to http://www.example.com/product.php?item=swedish-fish&category=gummy-candy be treated as links to a) http://www.example.com/product.php?item=swedish-fish&category=gummy-candy or b) http://www.example.com/product.php?item=swedish-fish ?

4) Does the answer to question 3 depend on whether the links are on http://www.example.com/ (i.e. the same domain) or on another domain or subdomain?

5) If the answer to question 3 is "b", will the full link weight be ascribed?

6) Is it worth adding all of the Google Analytics utm_ parameters to the list (costing many of the 15 entries), or are these automatically excluded anyway?

My assumed answers:
1) b
2) b
3) b
4) No
5) Probably about as much as is passed through a 301
6) Add them but cull them first if the 15 parameter limit is exceeded
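
One way to approach question 6, sketched in Python: count how often each query parameter name shows up in a sample of your own URLs, then review the most frequent names against the 15-entry limit mentioned above. The function name rank_parameters and the sample URLs are hypothetical, and nothing here reflects how Google itself picks or excludes parameters.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def rank_parameters(urls, limit=15):
    """Return up to `limit` (name, url_count) pairs, most frequent first.

    Each parameter name is counted once per URL it appears in.
    """
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts.most_common(limit)


# Hypothetical sample of URLs taken from a site's own logs.
sample_urls = [
    "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy",
    "http://www.example.com/product.php?item=swedish-fish&utm_source=newsletter",
    "http://www.example.com/product.php?item=sour-worms&utm_source=newsletter&utm_medium=email",
]

ranking = rank_parameters(sample_urls)
for name, url_count in ranking:
    print(name, url_count)
```

Parameters that change the page content (item here) should of course stay out of the ignore list; the ranking only helps decide which of the remaining ones are worth an entry.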

Phil said...

Fantastic :)

DataPlus - Custom Data Services said...

Thanks for clearing this up. I saw the option in my Webmaster Tools but didn't know what to do with it.

Admin said...

This is what I've been waiting for: a full explanation of it... always useful for handling duplicate content... nice job.

Chien-I Liao said...

@ellipsis

Once you set a parameter to "IGNORE", we treat all URLs that differ only in the value of that parameter as equivalent. Therefore:
(1) Will Google still crawl URLs containing "IGNORE" parameters?
YES! But once we crawl one URL with that parameter we won't crawl another equivalent URLs, including the URL without that parameter.
For example, if your site has English and Spanish versions and you set the "language" parameter to IGNORE, there is no guarantee whether we crawl the English version or the Spanish one, even if the URL without that parameter defaults to English.
(2) We only follow links on the one URL we crawl, which is a random one. If two equivalent pages link to different content, you might lose crawl coverage (in a random way).
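
The equivalence described above can be sketched in Python. This is an illustration under stated assumptions, not Google's implementation: equivalence_key and pick_crawl_set are hypothetical names, and "first URL seen" stands in for the unpredictable choice the crawler makes.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def equivalence_key(url, ignored):
    """Return a key shared by all URLs that differ only in ignored parameters."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    kept.sort()  # parameter order should not create a new equivalence class
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def pick_crawl_set(urls, ignored):
    """Keep one URL per equivalence class (here: the first one seen)."""
    chosen = {}
    for url in urls:
        chosen.setdefault(equivalence_key(url, ignored), url)
    return list(chosen.values())


urls = [
    "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy",
    "http://www.example.com/product.php?item=swedish-fish",
    "http://www.example.com/product.php?item=sour-worms",
]

crawled = pick_crawl_set(urls, ignored={"category"})
for url in crawled:
    print(url)
```

With category ignored, the first two URLs fall into the same class, so only one of them ends up in the crawl set. Any links that appear only on the skipped URL would not be followed, which is the coverage loss point (2) warns about.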

skiseo said...

I've got an issue with nested jsessionid's related to a 301 redirect such that

www.site-v1.com/product-detail.jsp

redirect points to

www.site-v2.com/product.jsp?product-detail

the resulting page is

www.site-v2.com/product.jsp;jsessionid=xxx.ecom105_main?product-detail

as you can see the redirect destination URL's session ID is appended to the middle of the destination path.

Question: If I define "jsessionid=*.ecom*_main" as a parameter to "Ignore" within the new tool, will this suggest to the spiders that they treat the URL with the nested parameter as the same URL minus the ignored parameter?
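
A note on the URL shape, with a small Python sketch. In the URL above, ;jsessionid=... sits in the path, before the ?, so a standard URL parser does not report it as a query parameter. Whether the parameter handling tool matches such path segments is not stated in the post. The helpers below use hypothetical names and an assumed http:// scheme, and only show how one might strip the segment when normalizing URLs on your own side.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl

# Matches a ";jsessionid=..." path parameter up to the next delimiter.
_JSESSIONID = re.compile(r";jsessionid=[^;/?#]*", re.IGNORECASE)


def query_param_names(url):
    """Return the set of query-string parameter names in a URL."""
    query = urlsplit(url).query
    return {name for name, _ in parse_qsl(query, keep_blank_values=True)}


def strip_jsessionid(url):
    """Remove any ;jsessionid=... path parameter, leaving the rest untouched."""
    parts = urlsplit(url)
    clean_path = _JSESSIONID.sub("", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, clean_path, parts.query, parts.fragment)
    )


# Scheme added as an assumption; the comment gives the URL without one.
url = "http://www.site-v2.com/product.jsp;jsessionid=xxx.ecom105_main?product-detail"

print(query_param_names(url))   # jsessionid is not among the query parameters
print(strip_jsessionid(url))
```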

ellipsis said...

@Chien-I Liao

Thanks for the advice! You raise more questions though. The reality seems more complex than the original blog post implies.

Would it be possible to give a follow-up post/article with examples of the possible treatments (because it seems there is more than one) of a simple three or four page dynamic site?

Drew said...

The Parameter Handling settings are under Site configuration -> Settings, at the bottom.

Darren said...

Hello

We use Helicon ISAPI URL, which is an htaccess URL rewrite module for IIS. It converts a URL like product.asp?Category=1 to /books/horror. If we disable "?Category" as the parameter, will Google still index our site? Can you guys read htaccess rewrites?

roulette said...

Has anyone tried this with an HTTPS site?

None of my HTTPS sites are able to use the parameter handling feature.

webmastertools said...

There is another important difference between parameter handling and the canonical URL. The canonical URL does not influence indexing and crawling, because search engines must still visit every URL; parameter handling, on the other hand, gives Google a directive to ignore the parameter.

Aryo Halim said...

Just wondering why my site doesn't have Google PageRank?

Jonathan said...

@ellipsis

You say "once we crawl one URL with that parameter we won't crawl another equivalent URLs" - ok I get that.

Then you say "including the URL without that parameter." This is where you've lost me, as it sounds like other URLs NOT containing that parameter won't be crawled.

Can you explain the last comment further?

Daniel said...

Thank you very much. Yesterday I defined 3 parameters and marked them "Ignore". My question: how long do the changes take to take effect? I really have a lot of duplicate content due to these parameters :-(

ellipsis said...

@Jonathan

I never said that ... Chien-I Liao did. But I'll attempt to answer...

Even though "category" is set to be an ignored parameter, http://www.example.com/product.php?item=swedish-fish&category=gummy-candy may still be crawled. But, if it is, http://www.example.com/product.php?item=swedish-fish will not be crawled as it will be treated as having been crawled already (as http://www.example.com/product.php?item=swedish-fish&category=gummy-candy).

This is far from ideal, IMO.

2fer said...

Hello, I have a retail music website. Our content for sale is now being completely ignored by Google, possibly because of a duplicate content issue.

Question: is it possible that Google is confusing our browse pages '/browse/set?category=album' with the actual content, e.g. '/artist/name/release/number-name'?

The way PHP sites list content pages is very similar to the way our browse URLs are structured.

Marf said...

I'm using this on my blog to remove the showComment parameter. That is the single reason why I have reports of duplicate content on my blog.

So far I have noticed no change or results, but it's only been a few days.

Alan said...

Hi. This tool is really great. I've been worrying about duplicate content since about 3 months back, when we introduced session ids (which we did to help individual users get consistent search results) and the ability for users to get prices in different currencies. Since then our indexed pages have rocketed up, so duplicate content issues crossed my mind. I switched on parameter handling about 3 weeks ago for these parameters, but using the "site:" command still shows multiple URL entries, e.g. site:www.bluechipvacations.com/holiday-cottages/brixham/3-moorings-reach.html shows 9 URLs listed, 4 with session ids appended and two with currency ids. Is there a set amount of time before the parameter handling kicks in? Thanks

James Ryddel said...

I have implemented the ignore parameter feature for several parameters including SID, but several days after implementation Google is STILL crawling the links in question.

I have also implemented canonical link refs in my header for each of the pages I want indexing.

SO how long does it take Googlebot to work through the changes? I have posted this question on Google webmaster tools help forum but no one has replied and I've been waiting for over 48 hours.

Jonathan said...

@ellipsis - Thanks. Far from ideal indeed.

@Chien-I Liao - can you confirm?