




FEBRUARY 2006

Make your PDFs work well with Google (and other search engines)
by Duff Johnson, CEO, Document Solutions, Inc.


During any given business day, I use Google hourly, if not more often. I also search local and network hard drives looking for proposals, client files and so on.  Whether I think about it or not, full-text search is a big part of how I do my job.

On many of my searches, naturally enough, lots of PDF files come up in the search results. This makes sense — Google does index PDF files, and PDFs represent a large volume of the pages actually accessed online. So far, so good.

Now for the problem.  In Google's search results, and in the results of most other search engines, the listings of most PDF files appear at best unprofessional, and at worst, downright embarrassing.


How bad is the problem?

I performed a simple experiment.  Your mileage may vary, but I doubt the results will be significantly different.

I conducted 10 more-or-less random searches. Google's search results included an average of 4.3 PDF files on the first search results page of each search. Of those PDFs, an average of 60 percent were displayed with totally meaningless Titles.

Let’s look at why that happens, and how you can fix this problem with PDFs you make available for indexing and searching online.


The anatomy of Google’s search results

The blue underlined text in Google’s search results comes from one of two places in a PDF.  First, Google looks in the “Title” document information field. While it is simple for document creators to add this information to their PDFs, real-world search results demonstrate that most PDF Title fields are either empty, bogus or otherwise malformed.  To make things worse, many authoring applications place nonsensical information in, or even discard data from, the document information fields, creating a search-results “look and feel” that can range from confusing to totally meaningless.

(To check a PDF’s Title information in Acrobat, use the Control-D keyboard shortcut or go to File > Document Properties, then click the Description tab. You can add or correct the document’s title, author, and other fields as desired. But Title is essential!)

Be sure your PDF's document information fields
correctly represent your document!


If Google finds nothing in a PDF file's Title field, the second place it looks is more or less the first chunk of text it encounters in the document.  This might be the title (if that’s the first text on the page), but it’s just as likely to be unhelpful code or simply miscellaneous text from somewhere on the first page of the document.  Google uses this text as a "stand-in" for the Title in search results – an approach that fails far more often than it succeeds.

When you fail to ensure a valid Title in a PDF, search results won’t show the vital information that can assist users in choosing the correct document to open. The result is slower, less-reliable searches for every user, every time they search. 
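
The fallback described above, and the kinds of Titles that make a listing look bad, can be sketched in a few lines of Python. This only illustrates the behavior as this article describes it, not Google's actual algorithm. The function names, the 60-character cutoff and the list of "bogus" patterns are all assumptions chosen for the example.

```python
import re

# Hypothetical patterns for Titles that say nothing about the document.
# These are illustrative assumptions, not a list used by any search engine.
BOGUS_TITLE_PATTERNS = [
    re.compile(r"^\s*$"),                              # empty or whitespace only
    re.compile(r"^untitled", re.IGNORECASE),           # authoring-tool placeholder
    re.compile(r"^microsoft word - ", re.IGNORECASE),  # prefix left by some PDF makers
    re.compile(r"\.(doc|ppt|xls|qxd|indd|pdf)$", re.IGNORECASE),  # a bare file name
]


def is_bogus_title(title):
    """Return True if the Title field is missing or matches a bogus pattern."""
    if title is None:
        return True
    return any(pattern.search(title) for pattern in BOGUS_TITLE_PATTERNS)


def listing_title(title_field, first_page_text, max_chars=60):
    """Model the fallback: use the Title field if it has any content,
    otherwise use the first chunk of text found in the document."""
    if title_field and title_field.strip():
        return title_field.strip()
    first_chunk = " ".join(first_page_text.split())  # collapse runs of whitespace
    return first_chunk[:max_chars]
```

Note that `listing_title` will happily return a Title that is nothing more than a file name. That is exactly the situation a check like `is_bogus_title` is meant to catch before a file is posted.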

Other considerations for optimizing PDFs for search-engine use

PDF Specification:  As of this writing (January 2006), it appears that Google doesn't index Specification 1.6 PDF files, the latest version fully compatible with Acrobat 7.0.  To solve this problem, use PDF Optimizer in Acrobat 7.x Professional (Advanced > PDF Optimizer…) to set your PDF version to 1.5 or 1.4 and make your file’s contents available to Google’s indexing engine.


To ensure that all search engines can index your content,
be sure to Optimize your file to the 1.4 specification -
i.e., full compatibility with Acrobat 5.0.
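
The specification level of a file can also be checked outside Acrobat before posting. Here is a minimal Python sketch, assuming the version is read from the `%PDF-x.y` header at the start of the file. The function names and the 1.5 ceiling are illustrative assumptions. A PDF may also override its header with a `/Version` entry in the document catalog, which this sketch does not read.

```python
import re

# The header is expected near the very start of the file.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if not found."""
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def exceeds_version(data: bytes, max_version=(1, 5)) -> bool:
    """True if the header declares a version above max_version (an assumed ceiling)."""
    version = pdf_header_version(data)
    return version is not None and version > max_version
```

In practice the bytes would come from something like `open(path, "rb").read(1024)`, run as the last step before posting, after any save operation that might have changed the version.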


File-size limits:  Google does not index every word in every PDF. There’s a size limit – variously reported as being between 100 and 500 KB – to the text that Google will attempt to extract and index from any given file.  If you are posting large PDFs and it’s critical that Google indexes all of the content, consider posting documents by chapter or splitting them at another natural breaking point. This way, Google is less likely to stop indexing at, say, page 57 of a 112-page document.
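
To get a rough idea of where indexing might stop, add up the extracted text page by page and compare the running total against a budget. The Python sketch below assumes you already have the size of each page's text in bytes from some text-extraction step. Both the function name and the limit you pass in are assumptions, since the real figure is only "variously reported".

```python
def first_unindexed_page(page_text_bytes, limit_bytes):
    """Return the 1-based page at which cumulative text first exceeds
    limit_bytes, or None if the whole document fits within the limit."""
    total = 0
    for page_number, size in enumerate(page_text_bytes, start=1):
        total += size
        if total > limit_bytes:
            return page_number
    return None


# Hypothetical 112-page document with about 2,000 bytes of text per page,
# checked against an assumed 100 KB budget.
sizes = [2000] * 112
cutoff = first_unindexed_page(sizes, 100 * 1024)
```

If the function returns a page number, a chapter boundary before that page is a reasonable place to split the file.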

Content Reading Order:  If controlling and optimizing the way search engines index your PDFs matters to you, you’ll eventually want to get familiar with the content reading order, that is, the order in which search engines extract text from the document for indexing. Content ordering is not a casual process, but it can result in dramatically improved search results, especially for search engines that display search terms in context.

To begin defining content order in Acrobat Professional, first find out whether your file is Tagged. (Control-D keyboard shortcut, then check the “Description” tab).…


This little tell-tale is prima-facie evidence of inaccessible content.
Not only should Tags say "Yes", but the tags should be validated too.
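
For a quick first pass over many files, the same "Tagged" question can be approximated by scanning the raw bytes for the two markers a tagged PDF carries: a `/StructTreeRoot` entry and `/Marked true` in the catalog's mark information. This Python sketch is a heuristic only, and the function name is an assumption. In files that store the catalog in a compressed object stream, or that are encrypted, the markers may not be visible, so a False result is not conclusive and Acrobat's Document Properties remains the reliable check.

```python
import re

# "/Marked true" as it would appear in an uncompressed MarkInfo dictionary.
MARKED_TRUE_RE = re.compile(rb"/Marked\s+true")


def looks_tagged(data: bytes) -> bool:
    """Heuristic: True if the raw bytes contain both tagging markers."""
    has_structure_tree = b"/StructTreeRoot" in data
    is_marked = MARKED_TRUE_RE.search(data) is not None
    return has_structure_tree and is_marked
```

A True result says nothing about whether the tags are valid or whether the reading order is sensible. It only shows that the file claims to be tagged.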


If your PDF isn’t tagged, you can quickly tag it using the Advanced > Accessibility > Add Tags to Document command. To view how the content is currently ordered, open the Touch Up Reading Order Tool from the same Accessibility menu.


Get the reading-order right, and so will Google.



Conclusion

Most organizations posting documents to their intranet or Internet file servers want those documents to be found by other people. Corporate intranets rely on search engines to index and retrieve all manner of internal documents for use every day.  To the extent that PDF files comprise a meaningful volume of your searchable content (and you wouldn’t have read this far unless they do), you owe it to yourself to make sure your PDFs will look their best under the relentless gaze of the search engines.

Key Takeaways:

  • Check each PDF file's "Description" (in Document Properties) before posting.
  • Break large PDFs into chapters before posting to ensure Google indexes all the content.
  • Add structure to PDF files so Google indexes the content you want displayed in search results.

Article Feedback


OCTOBER 18, 2006
i use acrobat professional 7.0. our pdfs are all secure, but some are indexed by third party search engines like google and some are not.

these pdfs do index: compatibility set for acrobat 5.0 or earlier versions (password security – settings > compatibility: acrobat 5.0 and later; encrypt all document contents). i don’t understand how third party search engines like google read the title in the document properties when these security settings indicate "encrypt all document contents".

these pdfs do not index: compatibility set for acrobat 6.0 or earlier versions (password security – settings > compatibility: acrobat 6.0 and later; encrypt all document contents except metadata (acrobat 6 and later compatible)). all contents of the document will be encrypted but search engines will still be able to access the document’s metadata – and yet google does not index these pdfs.

i don't understand the section of this article called "other considerations for optimizing pdfs for search-engine use". i followed the steps and could not change to 1.5.
— DianaMacD

OCTOBER 19, 2006
thanks for the comment. your question prompted a little checking around... and indeed, it looks like things have changed.

first, google now appears to "respect" 1.4 security, which (as you note) claims to prevent search engines from indexing the document metadata.

cursory testing indicates that google does now index 1.5 specification pdfs. so, if you aren't seeing those pdfs included in google searches, my first question would be: are you sure that google's "spider" has been over your 1.5 specification files? it could be that google simply hasn't gotten around to indexing newly posted files on your site.

i'm guessing that your problem with the optimizer is due to a "save" operation performed after optimization. in acrobat 7, whenever you save a pdf, it "upgrades" it to 1.6 (yes, without asking). thus, to ensure 1.5 (or other) specification level, the optimizer is the last thing you need to do in the file preparation process.
— DuffJohnson

OCTOBER 23, 2006
the 1.5 files have been on our site for a year and still do not index. yes the 1.6 files happen when you make any changes to the pdf, save or optimize so we always start from the original document and make a new pdf to make sure the version is 1.5. where can i find more information on pdf version, compatibility, and encryption issues? thanks
— DianaMacD

NOVEMBER 07, 2006
there are a few other possibilities you should check out, many of which i've not put to the test in some time. but i'll be fascinated to learn the results of your research! here we go:

is it the case that _all_ of your 1.5 files are unindexed... or are _some_ of them indexed, but not others? i really can't believe that 1.5 files are, in fact, excluded for you, but not for me.

for the files that you don't believe are indexed, are you sure it is the whole file, or could it be only part of the file is unindexed? to test this, on longer files, try searching for a text string that appears at the beginning of the file vs. at the end.

see if you can determine whether fast web view is a factor. one of my intuitions (that i've yet to systematically evaluate) is that a combination of fast web view being not enabled and a large (say, >500 kb) file might "trick" the google spider into "abandoning" that pdf, and not indexing it. a fast web view file, however, is much more likely to permit the spider to start indexing the file before the download is complete. this last one is total guesswork on my part, but it makes some intuitive sense, if nothing else.

report back, and tell us what you find out!
— DuffJohnson

JANUARY 30, 2007
this is a great question. thanks to everyone who has addressed it. i spent several hours last night searching for an answer to this question, and i went to bed perplexed. this morning i opened a function restricted password protected .pdf in reader 7, and on the security window i saw, for the first time, the following: "all contents of the document are encrypted and search engines cannot access the document's metadata." i guess that is my answer. too bad.
— tomwfox

MAY 22, 2008
Is there an update on how Google now (May 2008) indexes PDF files? For example, is Google indexing 1.6 files?
— redcrew
