Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Reunifying duplicate content on your website

Tuesday, October 06, 2009 at 3:14 PM

Handling duplicate content within your own website can be a big challenge. Websites grow; features get added, changed and removed; content comes—content goes. Over time, many websites collect systematic cruft in the form of multiple URLs that return the same contents. Having duplicate content on your website is generally not problematic, though it can make it harder for search engines to crawl and index the content. Also, PageRank and similar information found via incoming links can get diffused across pages we aren't currently recognizing as duplicates, potentially making your preferred version of the page rank lower in Google.

Steps for dealing with duplicate content within your website
  1. Recognize duplicate content on your website.
    The first and most important step is to recognize duplicate content on your website. A simple way to do this is to take a unique text snippet from a page and search for it, limiting the results to pages from your own website by using a site: query in Google (for example, site:example.com followed by the snippet in quotes). Multiple results for the same content show duplication you can investigate.

  2. Determine your preferred URLs.
    Before fixing duplicate content issues, you'll have to determine your preferred URL structure. Which URL would you prefer to use for that piece of content?

  3. Be consistent within your website.
    Once you've chosen your preferred URLs, make sure to use them in all possible locations within your website (including in your Sitemap file).

  4. Apply 301 permanent redirects where necessary and possible.
    If you can, redirect duplicate URLs to your preferred URLs using a 301 response code. This helps users and search engines find your preferred URLs should they visit the duplicate URLs. If your site is available on several domain names, pick one and use the 301 redirect appropriately from the others, making sure to forward to the right specific page, not just the root of the domain. If you support both www and non-www host names, pick one, use the preferred domain setting in Webmaster Tools, and redirect appropriately.

  5. Implement the rel="canonical" link element on your pages where you can.
    Where 301 redirects are not possible, the rel="canonical" link element can give us a better understanding of your site and of your preferred URLs. The use of this link element is also supported by other major search engines such as Ask.com, Bing and Yahoo!.

  6. Use the URL parameter handling tool in Google Webmaster Tools where possible.
    If some or all of your website's duplicate content comes from URLs with query parameters, this tool can help you to notify us of important and irrelevant parameters within your URLs. More information about this tool can be found in our announcement blog post.
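
Steps 3 through 5 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the preferred scheme and host (http and www.example.com), the ignored parameter names (sessionid and utm_source), and the helper names preferred_url, redirect_for and canonical_link are hypothetical and not part of any Google tool. The idea is to compute one preferred URL per piece of content, answer requests for duplicates with a 301 to that URL, and fall back to a rel="canonical" link element where a redirect is not possible.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions for this sketch: adjust to your own site.
PREFERRED_SCHEME = "http"
PREFERRED_HOST = "www.example.com"
IGNORED_PARAMS = {"sessionid", "utm_source"}  # parameters that do not change the content


def preferred_url(url):
    """Map any URL variant of a page to the single preferred URL."""
    parts = urlsplit(url)
    # Keep only the query parameters that actually select different content.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # Keep the specific path so a redirect lands on the right page,
    # not just on the root of the preferred domain.
    return urlunsplit(
        (PREFERRED_SCHEME, PREFERRED_HOST, parts.path or "/", urlencode(query), "")
    )


def redirect_for(url):
    """Return (301, target) if url is a duplicate, or None if it is already preferred."""
    target = preferred_url(url)
    return None if url == target else (301, target)


def canonical_link(url):
    """Build the rel="canonical" link element for pages that cannot be redirected."""
    return '<link rel="canonical" href="%s">' % escape(preferred_url(url), quote=True)
```

For example, redirect_for("http://example.com/widgets?sessionid=abc&color=red") returns (301, "http://www.example.com/widgets?color=red"), while the preferred URL itself returns None and is served normally.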

What about the robots.txt file?

One item which is missing from this list is disallowing crawling of duplicate content with your robots.txt file. We now recommend not blocking access to duplicate content on your website, whether with a robots.txt file or other methods. Instead, use the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. If access to duplicate content is entirely blocked, search engines effectively have to treat those URLs as separate, unique pages since they cannot know that they're actually just different URLs for the same content. A better solution is to allow them to be crawled, but clearly mark them as duplicate using one of our recommended methods. If you allow us to crawl these URLs, Googlebot will learn rules to identify duplicates just by looking at the URL and should largely avoid unnecessary recrawls in any case. In cases where duplicate content still leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools.
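
To check whether an existing robots.txt file still blocks known duplicate URLs, a small sketch using Python's standard urllib.robotparser module might look like this. The helper name blocked_duplicates, the sample rules and the example.com URLs are assumptions for illustration, and the standard-library parser only approximates how any particular crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser


def blocked_duplicates(robots_txt, duplicate_urls, user_agent="Googlebot"):
    """Return the duplicate URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in duplicate_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules that block a printer-friendly directory.
robots_txt = "User-agent: *\nDisallow: /print/\n"
urls = ["http://example.com/print/page", "http://example.com/page"]
print(blocked_duplicates(robots_txt, urls))
```

Any URL this reports is one that crawlers cannot fetch, so a rel="canonical" link element or 301 redirect on it would never be seen.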

We hope these methods will help you to master the duplicate content on your website! Information about duplicate content in general can also be found in our Help Center. Should you have any questions, feel free to join the discussion in our Webmaster Help Forum.

30 comments:

Matt N said...

Great article, although duplicate content has become increasingly worst from those who steal content altogether. I really wish there was a specific way to report those people. I suppose the Google report spam is the best way.

Susan Moskwa said...

There is a way: http://www.google.com/dmca.html

Rob said...

Great article, although duplicate content has become increasingly worst from those who steal content altogether. I really wish there was a specific way to report those people. I suppose the Google report spam is the best way.

Wait a tick, that's duplicate content.

;)

RM

Michael Gray said...

So if I have a printer-friendly version of my pages at http://example.com/page/print/, I would usually block the entire "print" directory in robots.txt. Now you say you want the rel canonical tag and want to crawl it. I just don't see how that's a more efficient arrangement.

Ellithy said...

this is very nice
years searchin for somethin like this

John Mueller said...

@Michael Gray: If you let us crawl the printer pages & help us to find the canonical, in turn we can pass any information/signals (like PageRank) on to the canonical instead of leaving them on an unknown and uncrawled URL.

granyanella said...

Interesting comments there, particularly regarding the robots.txt file. What about cases where incoming links may or may not use the "www"? Should we use rel canonical on the home page, for example, to show the preferred form of the URL?

ellipsis said...

@John Mueller Please clarify whether the printer friendly page would be crawled if the URL parameter handling tool had been used. I left a longer query on this at http://bit.ly/cAeVq

laszloberndt said...

Thanks for the great and very useful article.
I posted the link on my blog and the Hungarian SEO forum.

Preston said...

This is the clearest discussion of this subject I have read. But it prompts me to ask one question. How do we treat translated content? If I have a page that is presented in multiple languages do any of these tactics apply or are different languages treated as different content?

Robert said...

This is great for those people who have problems with internal duplicate content issues. What are we supposed to do if we have external issues?

People steal content. That's a fact.

Whether a link back is provided or not, Google is horrible at determining the original source. This also is a fact.

If enough articles are taken from one subsection of your site, Google will jettison the entire folder. This also is a fact.

Small, independently run sites suffer the most. This also is a fact.

What are we supposed to do? We work just as hard as, if not harder than, everyone else, and we try to create unique, relevant content. We rank well. Until people steal our content. Then our entire site gets trashed in Google for months.

The DMCA is nice, but it doesn't work for stuff hosted in foreign countries.

There also is this ridiculous presiding notion that anyone can take whatever they want as long as they provide a link back. It's a nice idea in theory, but when Google punishes the original source, it is a huge problem.

If you complain about people taking your content and request that they remove it -- or, at worst, file a DMCA notice against them -- they get all uptight and start copying MORE of your content in retaliation, putting it in places where you can't get it removed.

NONE OF THIS WOULD BE AN ISSUE IF GOOGLE WOULD GET THIS RIGHT!!! None of the other search engines have this problem.

(See http://www.seochat.com/c/a/Google-Optimization-Help/Duplicate-Content-Penalties-Problems-with-Googles-Filter/)

When will Google fix this?!?!?!

Susan Moskwa said...

@Preston: Different languages are different content. This is because they're not effectively interchangeable. If two URLs serve the same content, you'll get the same information regardless of which URL you go to. But if two URLs are in different languages, you won't get the same information from both because you won't be able to understand the one that's not in your language.

Carrie said...

What about slightly different versions of the same content? I work on software documentation, and we have a problem with Google returning results for the oldest versions of the software first. We can't delete the old versions of the documentation, but want users to visit the latest copy. Is there a way to do that?

Jonathan said...

After fixing all of these things on a site that hasn't been crawled in a couple of months, how long would one expect it to take for these changes to "show up" in increased rankings, etc.?

Ian M said...

I also really want to know the answer to ellipsis' question.

From what Chien-I Liao said in the announcement of the tool, Google doesn't strip out the URL parameters (like Yahoo!) but instead spiders anyway, and then won't spider the URL without it. This makes it useless for 80% of the potential types of URL parameter issues :(

ledona;d said...

I found one useful tool for checking duplicate content. It shows duplicate content from Google, MSN and Yahoo.

Dupeefree: you can download it and it is free to use. Avoid duplication and enjoy.

Ian M said...

Google - in light of this change of policy on robots.txt, you might want to change this page in your webmaster guidelines:

http://www.google.com/support/webmasters/bin/answer.py?answer=35769

Specifically this text:
“Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.”

Tom Low said...

Thank you for the info, especially on the robots.txt file and 301 redirects.

I will be careful to use the right method in case I encounter any duplicate content issues.

Singapore SEO

DataPlus - Custom Data Services said...

Thanks for this. I have been trying to figure out what is going on with my site. I will come up one day on page 1 position 3 for a keyword, the next day be nowhere and again come up position 3 before disappearing again. I'm puzzled.

FlemmingLeer, denmarkonline.dk said...

Irregularity concerning robots.txt:

Hi,

I have an issue concerning the number of pages in the google index.

The number of links in one of my sites jumps from 13.000 (which I prefer) to 26.000 within an hour, then back to 13.000 the next day.

I haven't edited the robots.txt file at all. The site is using the Drupal.org CMS and I have applied strong restrictions to avoid duplicate content.

What is going on?

Thank you.

Vipin Kumar said...

Hi,

What if we own 2 or 3 websites with the same content? How can we tell Google that these 2-3 websites are all our own?

John Mueller said...

@granyanella The best solution for www/non-www would be to use a 301 redirect, which most hosting providers can set up for you. Alternatively, if you can't do a 301 redirect, you can somewhat avoid the issue by always using absolute URLs on your pages, including for the rel=canonical link elements on your pages.

John Mueller said...

@ellipsis I would say that your assumed answers are generally correct. However, keep in mind that settings in the URL parameter handling tool will take some time to take effect and are not guaranteed to be followed. We're working on making it easier to keep the results clean of duplicate content from your own site, so we generally want to make it work in the way that you assumed!

John Mueller said...

@Preston Translated content is not seen as duplicate content. I think it's great to have translated content on your site, however there are three things I suggest watching out for:
- Don't use automatic translations unless you are blocking these from being indexed.
- Try to keep content on the pages limited to one language (make each language version a separate page).
- Make sure each language version has its own URL (don't automatically show a different language version based on the user's browser).

John Mueller said...

@Carrie If you have outdated content online, I'd suggest either blocking it from getting indexed (using the "noindex" robots meta tag) or specifying the preferred version using the rel=canonical link element. Both will let you keep the content online for your users, but will prevent it from showing up in search results.

John Mueller said...

@Ian M - the suggestion to use robots.txt for those pages in the Webmaster Guidelines still stands. It's the easiest way to prevent crawling (and generally, indexing) of these pages. One problem with search results pages is that they generally provide a source of infinite URLs, which makes crawling a site properly very difficult.

John Mueller said...

@FlemmingLeer The number of URLs shown in the "of about" count is a very rough approximation and can vary for any number of technical reasons. If you want to have a more exact count, I would suggest submitting a Sitemap file with the URLs that you really want to have indexed and then checking the indexed URL count for those URLs in your Webmaster Tools account.

S. A. Rahman said...

Although this is a very fine article, the issue is still unsolved for me. I am still having problems with duplicate content within my blog. I have tried comparing articles manually, and each article is 60% duplicate of the others within the same category.

If there is another tool/software out there which I am not aware of please do let me know!

Regards,

ditto@progressnowcolorado.org said...

I'd like to see a canonical host option or syntax. A lot of the hosted eCRM software out there uses "page wrappers" that don't allow you to see the actual URL via PHP (the $_SERVER['REQUEST_URI'] variable contains the URL of the wrapper that the CRM called by URL to build the page, and it gets cached at that, so it's basically 100% meaningless)

So in our case we have all of our pages on http://www.ourmaindomain.org that are 100% duplicated on https://secure.ourmaindomain.org, because testing shows that having "secure" in the URL increases the donor conversion rate. But donation pages are only a small part of the application.

I can't tell from PHP within our CMS what page is being served by the CRM. So I'd like to see something like:

link rel="canonicalhost" href="http://www.mymaindomain.com"

or something like that. Or maybe an extension of the current standard to include a host plus a token for whatever the current URL-path is.

nestorbentancor said...

I ended up with 3 domains (pointing to the root of the same site) ranking, some of them better than my preferred main domain. I know this is very bad and I want to use a 301 redirect, but they are all in the same site. Is creating a new account, like this article says, the best way to go? http://www.mcanerin.com/en/articles/301-redirect-add-domain.asp