Thursday, September 01, 2011 at 7:23 AM
Webmaster level: All
Our mission is to organize the world’s information and make it universally accessible and useful. During this ambitious quest, we sometimes encounter non-HTML files such as PDFs, spreadsheets, and presentations. Our algorithms don’t let different filetypes slow them down; we work hard to extract the relevant content and to index it appropriately for our search results. But how do we actually index these filetypes, and, since they often differ so much from standard HTML, what guidelines apply to these files? What if a webmaster doesn’t want us to index them?
Google first started indexing PDF files in 2001 and currently has hundreds of millions of PDF files indexed. We’ve collected the most frequently asked questions about PDF indexing; here are the answers:
Q: Can Google index any type of PDF file?
A: Generally we can index textual content (written in any language) from PDF files that use various kinds of character encodings, provided they’re not password protected or encrypted. If the text is embedded as images, we may process the images with OCR algorithms to extract the text. The general rule of thumb is that if you can copy and paste the text from a PDF document into a standard text document, we should be able to index that text.
Q: What happens with the images in PDF files?
A: Currently the images are not indexed. In order for us to index your images, you should create HTML pages for them. To increase the likelihood of us returning your images in our search results, please read the tips in our Help Center.
Q: How are links treated in PDF documents?
A: Generally links in PDF files are treated similarly to links in HTML: they can pass PageRank and other indexing signals, and we may follow them after we have crawled the PDF file. It’s currently not possible to "nofollow" links within a PDF document.
Q: How can I prevent my PDF files from appearing in search results, or, if they already appear, how can I remove them?
A: The simplest way to prevent PDF documents from appearing in search results is to add an X-Robots-Tag: noindex header to the HTTP response used to serve the file. If they’re already indexed, they’ll drop out over time if you use the X-Robots-Tag with the noindex directive. For faster removals, you can use the URL removal tool in Google Webmaster Tools.
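As a rough illustration of the header described in this answer, here is a minimal sketch using only Python's standard library. It assumes the PDFs are served from the current directory by a small local server; the helper name extra_headers, the class name NoIndexPdfHandler, and the port are hypothetical and not part of any Google tooling. On a production site the same header would normally be set in the web server's own configuration instead.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
from urllib.parse import urlsplit


def extra_headers(request_path):
    """Return extra response headers for a request path (hypothetical helper).

    Paths ending in .pdf get "X-Robots-Tag: noindex"; everything else
    gets no extra headers. Any query string is ignored.
    """
    path = urlsplit(request_path).path
    if path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPdfHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory, adding noindex to PDF responses."""

    def end_headers(self):
        # self.path may be missing if the request line could not be parsed.
        for name, value in extra_headers(getattr(self, "path", "")).items():
            self.send_header(name, value)
        super().end_headers()


def serve(port=8000):
    """Start the sketch server on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), NoIndexPdfHandler).serve_forever()
```

Calling serve() and requesting a PDF with a tool that prints response headers should show the X-Robots-Tag line next to the usual Content-Type header.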
Q: Can PDF files rank highly in the search results?
A: Sure! They’ll generally rank similarly to other webpages. For example, at the time of this post, [mortgage market review], [irs form 2011] or [paracetamol expert report] all return PDF documents that manage to rank highly in our search results, thanks to their content and the way they’re embedded and linked from other webpages.
Q: Is it considered duplicate content if I have a copy of my pages in both HTML and PDF?
A: Whenever possible, we recommend serving a single copy of your content. If this isn’t possible, make sure you indicate your preferred version by, for example, including the preferred URL in your Sitemap or by specifying the canonical version in the HTML or in the HTTP headers of the PDF resource. For more tips, read our Help Center article about canonicalization.
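For the HTTP header option mentioned in this answer, the canonical version is typically announced with a Link header carrying rel="canonical" on the PDF response. The sketch below only builds that header value; the function name canonical_link_header and the example.com URLs are hypothetical, and how the header is attached to the response depends on the server in use.

```python
def canonical_link_header(canonical_url):
    """Build a (name, value) pair for a rel="canonical" Link header.

    canonical_url is the preferred version of the content, for example
    the HTML page that a PDF duplicates.
    """
    # Reject characters that would break the <...> target or the header line.
    if any(ch in canonical_url for ch in "<> \r\n"):
        raise ValueError("URL contains characters not allowed in a Link target")
    return ("Link", f'<{canonical_url}>; rel="canonical"')


# Example: a PDF copy at /white-paper.pdf pointing at its HTML version.
name, value = canonical_link_header("https://www.example.com/white-paper.html")
print(f"{name}: {value}")
```

The resulting line would be sent with the response for the PDF URL, not with the HTML page, since an HTML page can declare its canonical version in its own markup.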
Q: How can I influence the title shown in search results for my PDF document?
A: We use two main elements to determine the title shown: the title metadata within the file, and the anchor text of links pointing to the PDF file. To give our algorithms a strong signal about the proper title to use, we recommend updating both.
If you want to learn more, watch Matt Cutts’s video about optimizing PDF files for search, and visit our Help Center for information about the content types we’re able to index. If you have feedback or suggestions, please let us know in the Webmaster Help Forum.
21 comments:
Yes PDF Files very important of web content.
Free translation in Spanish of this post:
http://www.miguelgf.com/2011/09/pdfs-en-resultados-de-busqueda-de.html
Thank You. Very much awaiting for this PDF feature. A great task. With Regards.
I am really happy with this kind of development. What I can think now is to upload a PDF file to my Google Docs account, publish and share it to the world.
Could you tell us more about the size limit for a PDF to be indexed? For example, it seems that, on our website, no PDF larger than 10 MB has been indexed since 2007. For example, the following book http://archimer.ifremer.fr/doc/1991/rapport-639.pdf (24 MB) was indexed before 2007 but not after.
i like it
Thanks for the info. Many of our business clients have a lot of PDF content and helping them optimize it has been challenging. We’ll certainly pass along these tips.
it is really helpful
Question: being that there are plugins that allow bloggers to turn posts into pdf files, can that improve rankings?
hi
i think is best choice and best engine .i really like it.
Regard
mohsin qureshi-23227
http://www.businessideas.pk
Does Google have a policy on whether they prefer PDFs over .DOCs or .XLSX etc?
I've been changing my .DOCs to .PDFs because some web browsers don't display .DOCs unless you have some kind of Office app.
What I did makes sense to me. But I wonder if I should've converted my .DOCs to something other than PDFs.
Do you have any insights?
Hi..
Nice to meet you all.
A new blog that is still simple, and I'm just learning
If I don't have a server and just subscribe to shared web hosting, I can't make rel=canonical for pdf.
I'd love Google add such option in webmasters tools (in other words to add an option of specifying rel=canonical or noindex for non-html files via webmaster tools).
Yes.. PDF file very important and much used by government..
I WILL TRY PDF ON MY SITE
Will having a PDF and using the "noindex" option as described above still prevent my PDF from being indexed, if another website links to it?
Thought you may also want to check out a different .pdf finding tool http://findthatpdf.com it searches millions of .pdf's from the web and ftp as well as searching the full content of them.
How do I get the URL to a PDF that appears in the search results? If I "copy URL", I get something like this:
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&ved=0CEUQFjAC&url=http%3A%2F%2Fwww.cycling-embassy.org.uk%2Fsites%2Fcycling-embassy.org.uk%2Ffiles%2Fdocuments%2Fhealth_impact_helmet_laws.pdf&ei=cAR4Uc_vAo_0qwGgpoGoBg&usg=AFQjCNHJIvkR56zGePSiVcZdsM_EvIftWA&sig2=AeMqsqiz2ZSmZh8EhZlD7w
...whereas what I *should* get is something more like this:
www.cycling-embassy.org.uk/.../cycling.../health_impact_hel...
with the ...s filled in with correct values.
I don't see how to get the link without a bunch of editing on the mess that Google returns.