Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index

Duplicate content summit at SMX Advanced

Wednesday, June 13, 2007 at 7:59 AM

Last week, I participated in the duplicate content summit at SMX Advanced. I couldn't resist the opportunity to show how Buffy is applicable to the everyday search marketing world, but mostly I was there to get input from you on the duplicate content issues you face and to brainstorm how search engines can help.

A few months ago, Adam wrote a great post on dealing with duplicate content. The most important things to know about duplicate content are:
  • Google wants to serve up unique results and does a great job of picking a version of your content to show if your site includes duplication. If you don't want to worry about sorting through duplication on your site, you can let us worry about it instead.
  • Duplicate content doesn't cause your site to be penalized. If duplicate pages are detected, one version will be returned in the search results to ensure variety for searchers.
  • Duplicate content doesn't cause your site to be placed in the supplemental index. Duplication may indirectly influence this, however, if links to your pages are split among the various versions, causing lower per-page PageRank.
At the summit at SMX Advanced, we asked what duplicate content issues were most worrisome. Those in the audience were concerned about scraper sites, syndication, and internal duplication. We discussed lots of potential solutions to these issues and we'll definitely consider these options along with others as we continue to evolve our toolset. Here's a list of some of the potential solutions we discussed so that those of you who couldn't attend can get in on the conversation.

Specifying the preferred version of a URL in the site's Sitemap file
One thing we discussed was the possibility of specifying the preferred version of a URL in a Sitemap file, with the suggestion that if we encountered multiple URLs that point to the same content, we could consolidate links to that page and could index the preferred version.

Providing a method for indicating parameters that should be stripped from a URL during indexing
We discussed providing this either in an interface such as webmaster tools or in the site's robots.txt file. For instance, if a URL contains session IDs, the webmaster could indicate the variable for the session ID, which would help search engines index the clean version of the URL and consolidate links to it. The audience leaned towards an addition in robots.txt for this.
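
To make the idea concrete, here is a minimal Python sketch of what stripping a declared parameter could look like. It is only an illustration of the concept discussed at the summit, not a description of how any search engine works; the parameter names in IGNORED_PARAMS and the example URL are assumptions made up for this sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters a webmaster might declare as irrelevant to the content.
IGNORED_PARAMS = {"sessionid", "sid"}

def strip_params(url, ignored=IGNORED_PARAMS):
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in ignored
    ]
    # Rebuild the URL from the same pieces, with only the kept parameters.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

clean = strip_params("http://www.example.com/page.php?id=7&sessionid=abc123")
print(clean)
```

With this sketch, two URLs that differ only in their session ID reduce to the same clean URL, which is the property that would allow links to be consolidated.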

Providing a way to authenticate ownership of content
This would provide search engines with extra information to help ensure we index the original version of an article, rather than a scraped or syndicated version. Note that we do a pretty good job of this now, and not many people in the audience mentioned this as a primary issue. However, the audience was interested in a way of authenticating content as an extra protection. Some suggested using the page with the earliest date, but creation dates aren't always reliable. Someone also suggested allowing site owners to register content, although that could raise issues as well, as non-savvy site owners wouldn't know to register content and someone else could take the content and register it instead. We currently rely on a number of factors such as the site's authority and the number of links to the page. If you syndicate content, we suggest that you ask the sites that are using your content to block their version with a robots.txt file as part of the syndication arrangement to help ensure your version is served in results.

Making a duplicate content report available for site owners
There was great support for the idea of a duplicate content report that would list pages within a site that search engines see as duplicate, as well as pages that are seen as duplicates of pages on other sites. In addition, we discussed the possibility of adding an alert system to this report so site owners could be notified via email or RSS of new duplication issues (particularly external duplication).

Working with blogging software and content management systems to address duplicate content issues
Some duplicate content issues within a site are due to how the software powering the site structures URLs. For instance, a blog may have the same content on the home page, a permalink page, a category page, and an archive page. We are definitely open to talking with software makers about the best way to provide easy solutions for content creators.

In addition to discussing potential solutions to duplicate content issues, the audience had a few questions.

Q: If I nofollow a substantial number of my internal links to reduce duplicate content issues, will this raise a red flag with the search engines?
The number of nofollow links on a site won't raise any red flags, but that is probably not the best method of blocking the search engines from crawling duplicate pages, as other sites may link to those pages. A better method may be to block pages you don't want crawled with a robots.txt file.
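
As a rough illustration of the robots.txt approach, the Python sketch below uses the standard library's urllib.robotparser to check which URLs a set of rules would block. The /print/ and /archive/ paths and the example.com URLs are hypothetical stand-ins for wherever your duplicate pages live.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that block two directories of duplicate pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(url, agent="*"):
    """Return True if the rules above allow the given agent to fetch the URL."""
    return parser.can_fetch(agent, url)

for url in (
    "http://www.example.com/article1.html",
    "http://www.example.com/print/article1.html",
):
    print(url, "allowed" if is_crawlable(url) else "blocked")
```

Checking a handful of URLs this way before publishing a robots.txt change can help confirm that only the duplicate versions are blocked and the preferred version stays crawlable.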

Q: Are the search engines continuing the Sitemaps alliance?
We launched sitemaps.org in November of last year and have continued to meet regularly since then. In April, we added the ability for you to let us know about your Sitemap in your robots.txt file. We plan to continue to work together on initiatives such as this to make the lives of webmasters easier.
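
For reference, the robots.txt addition is a line of the form "Sitemap: <URL of your Sitemap>". The short Python sketch below shows such a line in context and a simple way to read it back; the example.com URL is a placeholder, and the parsing is deliberately minimal rather than a full robots.txt parser.

```python
# A robots.txt file that also announces the location of a Sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/

Sitemap: http://www.example.com/sitemap.xml
"""

def sitemap_urls(robots_txt):
    """Collect the URLs from every 'Sitemap:' line in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split only on the first colon so the URL's own "://" stays intact.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

print(sitemap_urls(ROBOTS_TXT))
```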

Q: Many pages on my site primarily consist of graphs. Although the graphs are different on each page, how can I ensure that search engines don't see these pages as duplicate since they don't read images?
To ensure that search engines see these pages as unique, include unique text on each page (for instance, a different title, caption, and description for each graph) and include unique alt text for each image. (For instance, rather than use alt="graph", use something like alt="graph that shows Willow's evil trending over time".)
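
If you have many graph pages, a small script can help you spot alt text that is reused across them. The Python sketch below is one possible way to do that with the standard library's HTML parser; the page paths and markup are invented for the example, and it only looks at img alt attributes, not titles or captions.

```python
from collections import Counter
from html.parser import HTMLParser

class AltCollector(HTMLParser):
    """Collects the alt text of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # A missing or empty alt attribute is recorded as an empty string.
            self.alts.append(dict(attrs).get("alt") or "")

def repeated_alts(pages):
    """Given a mapping of page path to HTML, return alt texts used on more than one page."""
    counts = Counter()
    for html in pages.values():
        collector = AltCollector()
        collector.feed(html)
        counts.update(set(collector.alts))  # count each alt text once per page
    return sorted(alt for alt, n in counts.items() if n > 1)

# Hypothetical pages: two reuse the generic alt text, one is descriptive.
pages = {
    "/graphs/1": '<img src="g1.png" alt="graph">',
    "/graphs/2": '<img src="g2.png" alt="graph">',
    "/graphs/3": '<img src="g3.png" alt="graph that shows Willow\'s evil trending over time">',
}
print(repeated_alts(pages))
```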

Q: I've syndicated my content to many affiliates and now some of those sites are ranking for this content rather than my site. What can I do?
If you've freely distributed your content, you may need to enhance and expand the content on your site to make it unique.

Q: As a searcher, I want to see duplicates in search results. Can you add this as an option?
We've found that most searchers prefer not to have duplicate results. The audience member in particular commented that she may not want to get information from one site and would like other choices, but for that case, other sites will likely not have identical information and therefore will show up in the results. Bear in mind that you can add the "&filter=0" parameter to the end of a Google web search URL to see additional results which might be similar.
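
As a small illustration, the Python sketch below builds a search URL with and without the "&filter=0" parameter. The base address http://www.google.com/search and the q parameter are assumptions used for this example; the only part taken from the answer above is filter=0.

```python
from urllib.parse import urlencode

# Assumed base address for a Google web search.
SEARCH_BASE = "http://www.google.com/search"

def search_url(query, show_similar=False):
    """Build a search URL, optionally adding filter=0 to include similar results."""
    params = {"q": query}
    if show_similar:
        params["filter"] = "0"
    return SEARCH_BASE + "?" + urlencode(params)

print(search_url("duplicate content"))
print(search_url("duplicate content", show_similar=True))
```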

I've brought all the issues and potential solutions that we discussed at the summit back to my team and others within Google, and we'll continue to work on providing the best search results and expanding our partnership with you, the webmaster. If you have additional thoughts, we'd love to hear about them!
The comments you read here belong only to the person who posted them. We do, however, reserve the right to remove off-topic comments.

16 comments:

JLH said...

Thank you for nailing down the duplicate content and supplemental connection.

Ross Dunn said...

Long live Buffy! It was good meeting you, Vanessa.

Kimber Cook said...

This is great information, Vanessa, thanks for sharing.

What if I have 3 different domains, all with the same exact content? Am I still safe? Will G just choose one?

Spanish speaker said...

Hello.
I believe that a good system would be to send the search engines, by means of a URL, the authentication of the content.

For example, when I have a new article, I publish it on my website and make a call to Google with a URL like this:
Google.com/?Urlautority=http://www.thedomain/article1.php
This way Google will know that I have written this article.

Initially only a few people would know about this way of doing it, but it would soon spread to all webmasters and the software they use. In less than two months it would be a common practice for all the webmaster communities.

silver said...

I would like to mention duplicated contents where it occurs between domains, e.g. example.com and example.co.uk - where the pages are the same.

For www.example.com and example.com there is the preferred domain setting but there is no way in webmaster tools to pick which is the preferred domain or any way to link www.example.com and www.example.co.uk even when both sites have exact same content and are both authenticated.

I have now added a 301 re-direct but not sure I like the outcome yet as things don't look good..

S

Susan Moskwa said...

silver:
For now, adding a 301 redirect is the best way for you to let us know that your US and UK sites contain the same content. It may take a bit of time for us to pick up the redirect, but rest assured that you're taking the right approach.

Steve said...

We run a number of regional English language domains that are targeted at different regional markets. These naturally have our product description content duplicated verbatim, with very small variations in the surrounding information due to the region. Because our .com is the oldest, most linked to domain, if you are searching in the UK or Australia our .com domain wins every time because Google is preferring this over our other duplicates. We really need a way to flag the regional relevance and our preferred domain in such situations.

Thanks in advance,

Steve

DTSL Williams said...

I've found several sites that contain massive duplicate content specifically for search engine spamming. I've also found one person who's created at least 8 variations of the same site to clog up the results in google- but there is no effective means of reporting this.

:(

Tim_Myth said...

So, if I'm reading this right, I can create a website full of billions and billions of pages that are exactly the same (oh, we might vary the keywords in our title but definitely leave the same body), and all Google will do is pick the one it likes best? Even though the officially published guidelines (http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66359) tell me otherwise? Sweet! Search Engine SPAM HERE I COME!

Seriously though, are the guidelines simply out of date with regards to this new information, or did you leave out a few bits?

John said...

This isn't a penalty, it's being outranked.
Sites outrank other sites when they scrape their content because they have more links than the original site.

Silverstall said...

Search engines have become the beauty and the beast.
The beauty being the ease and speed with which you can publish content viewable by the public.
The beast being how easy it is for that content to be stolen/copied.
I think the balance needs to shift towards making it less easy to publish and expect to be indexed. An online declaration of content ownership similar to that employed by Wikipedia, whilst requiring a huge amount of resources in the short term, would I believe drastically reduce scraped content, so that in the long term it would free up resources.

babycare said...

i have had some idiot copy every bit of text off one of my pages that is at number 1 on google. he is now at number 4. if my content is 6 years old and he copied mine 2 months ago. how comes google is listing it?

brickleyparker said...

I have a site and we duplicated our home page 5 times to use as landing pages for different ad campaigns.
We just went from 10 to 50 in the new Google search.

Is this the reason?

Ireneusz said...

Hi,
I think that if a new tag "<noindex>...</noindex>" were added (analogous to "<noscript>"), we would have no problem with duplicate content, syndication, etc. Everything inside this tag would be invisible to robots and wouldn't be indexed. Part of a page could be indexed normally, and another part could contain repeated text or whatever the webmaster wants.
Best regards
Ireneusz Dybczyński

Google Webmaster Central said...

Hi everyone,

Since several months have passed since we published this post, we're closing the comments to help us focus on the work ahead. If you still have a question or comment you'd like to discuss, feel free to visit and/or post your topic in our Webmaster Help Group.

Thanks and take care,
The Webmaster Central Team