Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Reunifying duplicate content on your website

Tuesday, October 06, 2009 at 3:14 PM

Handling duplicate content within your own website can be a big challenge. Websites grow; features get added, changed and removed; content comes—content goes. Over time, many websites collect systematic cruft in the form of multiple URLs that return the same contents. Having duplicate content on your website is generally not problematic, though it can make it harder for search engines to crawl and index the content. Also, PageRank and similar information found via incoming links can get diffused across pages we aren't currently recognizing as duplicates, potentially making your preferred version of the page rank lower in Google.

Steps for dealing with duplicate content within your website
  1. Recognize duplicate content on your website.
    The first and most important step is to recognize duplicate content on your website. A simple way to do this is to take a unique text snippet from a page and search for it, limiting the results to pages from your own website by using a site: query in Google. Multiple results for the same content show duplication you can investigate.
  2. Determine your preferred URLs.
    Before fixing duplicate content issues, you'll have to determine your preferred URL structure. Which URL would you prefer to use for that piece of content?
  3. Be consistent within your website.
    Once you've chosen your preferred URLs, make sure to use them in all possible locations within your website (including in your Sitemap file).
  4. Apply 301 permanent redirects where necessary and possible.
    If you can, redirect duplicate URLs to your preferred URLs using a 301 response code. This helps users and search engines find your preferred URLs should they visit the duplicate URLs. If your site is available on several domain names, pick one and use the 301 redirect appropriately from the others, making sure to forward to the right specific page, not just the root of the domain. If you support both www and non-www host names, pick one, use the preferred domain setting in Webmaster Tools, and redirect appropriately.
  5. Implement the rel="canonical" link element on your pages where you can.
    Where 301 redirects are not possible, the rel="canonical" link element can give us a better understanding of your site and of your preferred URLs. The use of this link element is also supported by major search engines such as Ask.com, Bing and Yahoo!.
  6. Use the URL parameter handling tool in Google Webmaster Tools where possible.
    If some or all of your website's duplicate content comes from URLs with query parameters, this tool can help you to notify us of important and irrelevant parameters within your URLs. More information about this tool can be found in our announcement blog post.
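
Steps 3 to 5 above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Google or any particular server implements it: the preferred host `www.example.com`, the ignored parameter names `sessionid` and `ref`, and the helper names `preferred_url`, `redirect_for` and `canonical_link` are all hypothetical. The idea is to compute one preferred URL per piece of content, answer duplicate URLs with a 301 to it, and fall back to a rel="canonical" link element where a redirect is not possible.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical settings: adjust to your own site.
PREFERRED_SCHEME = "http"
PREFERRED_HOST = "www.example.com"
# Parameters assumed NOT to change the page content.
IGNORED_PARAMS = {"sessionid", "ref"}


def preferred_url(url: str) -> str:
    """Map a URL onto the single preferred URL for the same content."""
    parts = urlsplit(url)
    # Keep only content-relevant parameters, in a stable (sorted) order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # The path is preserved, so a redirect lands on the right specific
    # page rather than on the root of the preferred domain.
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, urlencode(query), ""))


def redirect_for(url: str):
    """Return (301, target) if the URL is a duplicate, otherwise None."""
    target = preferred_url(url)
    return (301, target) if target != url else None


def canonical_link(url: str) -> str:
    """Build a rel="canonical" link element for pages that cannot redirect."""
    href = escape(preferred_url(url), quote=True)
    return '<link rel="canonical" href="%s">' % href


if __name__ == "__main__":
    print(redirect_for("http://example.com/shoes?ref=home&color=red"))
    print(canonical_link("http://example.com/shoes?ref=home&color=red"))
```

Using the same `preferred_url` function for internal links and the Sitemap file as well keeps the site consistent, which is the point of step 3.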

What about the robots.txt file?

One item which is missing from this list is disallowing crawling of duplicate content with your robots.txt file. We now recommend not blocking access to duplicate content on your website, whether with a robots.txt file or other methods. Instead, use the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. If access to duplicate content is entirely blocked, search engines effectively have to treat those URLs as separate, unique pages since they cannot know that they're actually just different URLs for the same content. A better solution is to allow them to be crawled, but clearly mark them as duplicate using one of our recommended methods. If you allow us to crawl these URLs, Googlebot will learn rules to identify duplicates just by looking at the URL and should largely avoid unnecessary recrawls in any case. In cases where duplicate content still leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools.
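
To check whether an existing robots.txt file still blocks duplicate URLs, a small script along these lines may help. It is a rough sketch using Python's standard `urllib.robotparser` module; the robots.txt contents, the example URLs and the helper name `blocked_duplicates` are assumptions made for illustration, and the standard-library parser is not guaranteed to interpret every rule exactly as Googlebot does.

```python
from urllib.robotparser import RobotFileParser


def blocked_duplicates(robots_txt, duplicate_urls, user_agent="Googlebot"):
    """Return the duplicate URLs that the given robots.txt rules block."""
    parser = RobotFileParser()
    # Parse the rules from a string instead of fetching them over HTTP.
    parser.parse(robots_txt.splitlines())
    return [url for url in duplicate_urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical rules that block a printer-friendly directory.
    rules = "User-agent: *\nDisallow: /print/\n"
    urls = [
        "http://www.example.com/print/page",
        "http://www.example.com/page",
    ]
    for url in blocked_duplicates(rules, urls):
        print("Blocked, so it cannot be recognized as a duplicate:", url)
```

Any URL reported here is a candidate for unblocking and marking with a 301 redirect or a rel="canonical" link element instead.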

We hope these methods will help you to master the duplicate content on your website! Information about duplicate content in general can also be found in our Help Center. Should you have any questions, feel free to join the discussion in our Webmaster Help Forum.

The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

42 comments:

Matt N said...

Great article, although duplicate content has become increasingly worst from those who steal content altogether. I really wish there was a specific way to report those people. I suppose the Google report spam is the best way.

Susan Moskwa said...

There is a way: http://www.google.com/dmca.html

Rob said...

Great article, although duplicate content has become increasingly worst from those who steal content altogether. I really wish there was a specific way to report those people. I suppose the Google report spam is the best way.

Wait a tick, that's duplicate content.

;)

RM

Michael Gray said...

So if I have a printer-friendly version of my pages at http://example.com/page/print/ I would usually block the entire "print" directory in robots.txt; now you say you want the rel canonical tag and want to crawl it. I just don't see how that's a more efficient arrangement.

Ellithy said...

this is very nice
years searchin for somethin like this

John Mueller said...

@Michael Gray: If you let us crawl the printer pages & help us to find the canonical, in turn we can pass any information/signals (like PageRank) on to the canonical instead of leaving them on an unknown and uncrawled URL.

granyanella said...

Interesting comments there, particularly regarding the robots.txt file. What about issues where incoming links may or may not use the "www"? Should we use the rel canonical on the home page, for example, to show the preferred syntax of the URL?

ellipsis said...

@John Mueller Please clarify whether the printer friendly page would be crawled if the URL parameter handling tool had been used. I left a longer query on this at http://bit.ly/cAeVq

laszloberndt said...

Thanks for the great and very useful article.
I posted the link on my blog and the hungarian seo forum.

Preston said...

This is the clearest discussion of this subject I have read. But it prompts me to ask one question. How do we treat translated content? If I have a page that is presented in multiple languages do any of these tactics apply or are different languages treated as different content?

Robert said...

This is great for those people who have problems with internal duplicate content issues. What are we supposed to do if we have external issues?

People steal content. That's a fact.

Whether a link back is provided or not, Google is horrible at determining the original source. This also is a fact.

If enough articles are taken from one subsection of your site, Google will jettison the entire folder. This also is a fact.

Small, independently run sites suffer the most. This also is a fact.

What are we supposed to do? We work just as hard as, if not harder than, everyone else, and we try to create unique, relevant content. We rank well. Until people steal our content. Then our entire site gets trashed in Google for months.

The DMCA is nice, but it doesn't work for stuff hosted in foreign countries.

There also is this ridiculous presiding notion that anyone can take whatever they want as long as they provide a link back. It's a nice idea in theory, but when Google punishes the original source, it is a huge problem.

If you complain about people taking your content and request that they remove it -- or, at worst, file a DMCA notice against them -- they get all uptight and start copying MORE of your content in retaliation, putting it in places where you can't get it removed.

NONE OF THIS WOULD BE AN ISSUE IF GOOGLE WOULD GET THIS RIGHT!!! None of the other search engines have this problem.

(See http://www.seochat.com/c/a/Google-Optimization-Help/Duplicate-Content-Penalties-Problems-with-Googles-Filter/)

When will Google fix this?!?!?!

Susan Moskwa said...

@Preston: Different languages are different content. This is because they're not effectively interchangeable. If two URLs serve the same content, you'll get the same information regardless of which URL you go to. But if two URLs are in different languages, you won't get the same information from both because you won't be able to understand the one that's not in your language.

Carrie said...

What about slightly different versions of the same content? I work on software documentation, and we have a problem with Google returning results for the oldest versions of the software first. We can't delete the old versions of the documentation, but want users to visit the latest copy. Is there a way to do that?

Jonathan said...

After fixing all of these things on a site that hasn't been crawled in a couple of months, how long would one expect it to take for these changes to "show up" in increased rankings, etc.?

Ian M said...

I also really want to know the answer to ellipsis' question.

From what Chien-I Liao said in the announcement of the tool, Google doesn't strip out the URL parameters (like Yahoo!) but instead spiders anyway, and then won't spider the URL without it. This makes it useless for 80% of the potential types of URL parameter issues :(

ledona;d said...

I found one useful tool for checking duplicate content. It shows duplicate content from Google, MSN and Yahoo.

Dupeefree: you can download it and it is free to use. Avoid duplication and enjoy.

Ian M said...

Google - in light of this change of policy on robots.txt, you might want to change this page in your webmaster guidelines:

http://www.google.com/support/webmasters/bin/answer.py?answer=35769

Specifically this text:
“Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.”

Tom Low said...

Thank you for the info especially on the robots.txt file and redirect 301

I will be careful to do the right method in case I encounter any duplicate content issues.

Singapore SEO

DataPlus - Custom Data Services said...

Thanks for this. I have been trying to figure out what is going on with my site. I will come up one day on page 1 position 3 for a keyword, the next day be nowhere and again come up position 3 before disappearing again. I'm puzzled.

FlemmingLeer, denmarkonline.dk said...

Irregularity concerning robots.txt:

Hi,

I have an issue concerning the number of pages in the google index.

The number of links in one of my sites jumps from 13.000 (which I prefer) to 26.000 within an hour to 13.000 the next day.

I haven't edited the robots.txt file at all. The site is using Drupal.org CMS and I have applied strong restrictions to avoid duplicate content.

What is going on ?

Thank you.

Vipin Kumar said...

Hi,

What if we own 2 or 3 websites with the same content? How can we tell Google that these 2-3 websites are our own?

John Mueller said...

@granyanella The best solution for www/non-www would be to use a 301 redirect, which most hosters can set up for you. Alternately, if you can't do a 301 redirect, you can somewhat avoid the issue by always using absolute URLs on your pages, including for the rel=canonical link elements on your pages.

John Mueller said...

@ellipsis I would say that your assumed answers are generally correct. However, keep in mind that settings in the URL parameter handling tool will take some time to take effect and are not guaranteed to be followed. We're working on making it easier to keep the results clean of duplicate content from your own site, so we generally want to make it work in the way that you assumed!

John Mueller said...

@Preston Translated content is not seen as duplicate content. I think it's great to have translated content on your site, however there are three things I suggest watching out for:
- Don't use automatic translations unless you are blocking these from being indexed.
- Try to keep content on the pages limited to one language (make each language version a separate page).
- Make sure each language version has its own URL (don't automatically show a different language version based on the user's browser).

John Mueller said...

@Carrie If you have outdated content online, I'd suggest either blocking it from getting indexed (using the "noindex" robots meta tag) or specifying the preferred version using the rel=canonical link element. Both will let you keep the content online for your users, but will prevent it from showing up in search results.

John Mueller said...

@Ian M - the suggestion to use robots.txt for those pages in the Webmaster Guidelines still stands. It's the easiest way to prevent crawling (and generally, indexing) of these pages. One problem with search results pages is that they generally provide a source of infinite URLs, which makes crawling a site properly very difficult.

John Mueller said...

@FlemmingLeer The number of URLs shown in the "of about" count is a very rough approximation and can vary for any number of technical reasons. If you want to have a more exact count, I would suggest submitting a Sitemap file with the URLs that you really want to have indexed and then checking the indexed URL count for those URLs in your Webmaster Tools account.

S. A. Rahman said...

Although this is a very fine article, the issue is still unsolved. I am still having problems with duplicate content within my blog. I have manually compared the articles, and each article is 60% duplicate of the others within the same category.

If there is another tool/software out there which I am not aware of please do let me know!

Regards,

ditto@progressnowcolorado.org said...

I'd like to see a canonical host option or syntax. A lot of the hosted eCRM software out there uses "page wrappers" that don't allow you to see the actual URL via PHP (the $_SERVER['REQUEST_URI'] variable contains the URL of the wrapper that the CRM called by URL to build the page, and it gets cached at that, so it's basically 100% meaningless)

So in our case we have all of our pages on http://www.ourmaindomain.org that are 100% duplicated on https://secure.ourmaindomain.org, because testing shows that having "secure" in the URL increases the donor conversion rate. But donation pages are only a small part of the application.

I can't tell from PHP within our CMS what page is being served by the CRM. So I'd like to see something like:

link rel="canonicalhost" href="http://www.mymaindomain.com"

or something like that. Or maybe an extension of the current standard to include a host plus a token for whatever the current URL-path is.

nestorbentancor said...

I ended up with 3 domains (pointing to the root of the same site) ranking, some of them better than my preferred main domain. I know this is very bad and I want to use a 301 redirect, but they are all in the same site. Is creating a new account like this article says the best way to go? http://www.mcanerin.com/en/articles/301-redirect-add-domain.asp

Borzio said...

Hello
We are an electrical company with shops in four different counties in the UK. Each shop has much the same inventory, although two also cater for trade counters.
Each shop is also run independently, with each having a different county name, e.g. Yorkshire Electricals, Lancaster Electricals. Are we allowed to have four separate websites?
We are serving four separate areas, but I cannot see Google knowing this and would not want to be in trouble.

Sunayna said...

Great article

Thank you very much

elaine, nicholsonsjewellers.co.uk said...

I worry about duplicate content on my site (especially as I've just lost 1st page SERP position for one of my key terms). I have diamond wedding rings that appeal to those who just want a diamond ring, whether or not they're married! So some products are duplicated over two categories. I thought this was appropriate for visitors: those who want a diamond ring may not bother to visit the diamond wedding ring category and vice versa. Would Google punish me for this?

Vicky said...

How about duplicate content generated due to coding error.

i.e. wrong relative linking + loose mod_rewrite rules?

I think in this case; URL removal tool is the God!

advance said...

I've a confusion..in our site, we don't specify param name in most of the URLs instead we have a SEO URL mapper file to internally distinguish among the same. For example the pattern is
"/product_prodName_partNum_partType_Category"
"/product_Brake_12345_R_GRP1" something like that.

So my concern here is if the same falls under a different group say GRP2 then specifying ignoring category param will help Google to consider one out of two or not?

To summarize, I've two URLs for same prod falling under different categories and the URLs are as below..
/product_Brake_12345_R_GRP1
/product_Brake_12345_R_GRP2

Internally we map GRP1/GRP2 as categories. If I specify "category" keyword to ignore in Google parameter handling tool, will it serve the purpose?

Susan Moskwa said...

@advance: You can't use the parameter handling tool because you aren't using parameters. Parameters are indicated by name/value pairs, e.g. ?param=value&param2=value2. You're just glueing all your values together, making them look like a custom directory name. If you rewrote your URLs so that they actually use parameters (/?product=a&prodName=b&partNum=c&partType=d&Category=e) then you could use the parameter handling tool, but I'm not sure that's worth it for this type of case (Google should be able to handle a couple duplicates for each product).

Chris Thompson said...

I have a website which is translated into 32 different languages.

Some pairs of languages, such as Spanish/Portuguese and Malay/Indonesian, are similar enough to each other that the page titles and content for any particular page can be the same or similar in both languages.

I notice in my Webmaster Tools that the Indonesian version of my site has far fewer pages indexed than other languages (18 pages indexed against an average for other languages of 90-100 pages). The only reason I can think of is that it is being penalised for duplicate content as a result of its similarity to Malay.

This seems rather unfair to Indonesian users who might want to find my site.

Does Google take into account that multilingual websites might legitimately have some pages with similar content as a result of similarities between languages? If not, might I suggest that you adjust your algorithms which deal with duplicate content to take this into account?

Best wishes,
Chris

Totalclicks said...

can u plz help me my blog have a same prob now a days..it show
3 Duplicate meta descriptions
1 Short meta descriptions
13 Duplicate title tags

after that my blog is totally removed from yahoo search result

plz help me as soon as possible???

djdavidolam said...

Hello,

I have a question about duplicate content. I have a client who handles multiple cities and we want to create a new page for each city. However, we are looking at 8 new pages based off of one keyword in which we add the city at the end.

I can only come up with so much unique content on the subject so I wanted to just clone the main page and just change the city/location info.

Will this cause issues and a possible SERP slap?

Thanks!
Jeremy

Herve said...

Hi, I added the canonical tag on each of my pages and set up Parameter Handling in Webmaster Tools more than a year ago, and I still see my homepage (normal version and with parameters) in the duplicate list of Google Webmaster Tools. Any idea about it?

Universal Life Church said...

On my blog, I moved content from several old blogs onto one different one and it seems I ended up with duplicate content. There were some of the same posts on the various sites, so I need to know how to figure out which are which. Since many of the articles are on the same topics, they have some of the same subject lines, so it's hard to check. Any way to do a general run through for duplicate content?

Google Webmaster Central said...

Hi everyone,

Since over a year has passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Central Help Forum.

Thanks and take care,
The Webmaster Central Team