Make your PDFs work well with Google (and other search engines)
by Duff Johnson, CEO, Document Solutions, Inc.
During any given business day, I use Google hourly, if not more often. I also search local and network hard drives looking for proposals, client files and so on. Whether I think about it or not, full-text search is a big part of how I do my job.
On many of my searches, naturally enough, lots of PDF files come up in the search results. This makes sense: Google does index PDF files, and PDFs represent a large volume of the pages actually accessed online. So far, so good.
Now for the problem. In Google's search results, and in the results of most other search engines, the listings of most PDF files appear at best unprofessional, and at worst, downright embarrassing.
How bad is the problem?
I performed a simple experiment. Your mileage may vary, but I doubt the results will be significantly different.
I conducted 10 more-or-less random searches. Google's search results included an average of 4.3 PDF files on the first search results page of each search. Of those PDFs, an average of 60 percent were displayed with totally meaningless Titles.
Let’s look at why that happens, and how you can fix this problem with PDFs you make available for indexing and searching online.
The anatomy of Google’s search results
The blue underlined text in Google’s search results comes from one of two places in a PDF. First, Google looks in the “Title” document information field. While it is simple for document creators to add this information to their PDFs, real-world search results demonstrate that most PDF Title fields are either empty, bogus or otherwise malformed. To make things worse, many authoring applications place nonsensical information in, or even discard data from, the document information fields, creating a search-results “look and feel” that can range from confusing to totally meaningless.
(To check a PDF’s Title information in Acrobat, use the Control-D keyboard shortcut or go to File > Document Properties, then click the Description tab. You can add or correct the document’s title, author, and other fields as desired. But Title is essential!)

Be sure your PDF's document information fields
correctly represent your document!
If Google finds nothing in a PDF file's Title field, the second place it looks is more or less the first chunk of text it encounters in the document. This might be the title (if that’s the first text on the page), but it’s just as likely to be unhelpful code or simply miscellaneous text from somewhere on the first page of the document. Google uses this text as a "stand-in" for the Title in search results, an approach that fails far more often than it succeeds.
When you fail to ensure a valid Title in a PDF, search results won’t show the vital information that can assist users in choosing the correct document to open. The result is slower, less-reliable searches for every user, every time they search.
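The two-step lookup described above can be sketched in a few lines of Python. This is only a model of the behavior as observed in search results, not Google's actual code. The function names, the 60-character cutoff and the list of "suspect" patterns are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for titles that authoring tools tend to leave behind.
# These are illustrative assumptions, not a list published by any search engine.
SUSPECT_PATTERNS = [
    r"^\s*$",                          # empty or whitespace only
    r"^untitled",                      # placeholder titles
    r"\.(doc|xls|ppt|qxd|indd|pdf)$",  # a file name rather than a title
    r"^microsoft word\s*-",            # prefix left by some print-to-PDF paths
]


def is_suspect_title(title):
    """Return True if the Title field looks empty or machine-generated."""
    text = (title or "").strip().lower()
    return any(re.search(pattern, text) for pattern in SUSPECT_PATTERNS)


def listing_title(title_field, first_text, max_len=60):
    """Model the fallback: use the Title field, else the leading text."""
    title = (title_field or "").strip()
    if title:
        # A non-empty Title is used as-is, even if it is nonsense.
        return title
    # No Title: fall back to the first chunk of extracted text, whatever it is.
    return " ".join(first_text.split())[:max_len]
```

A check like `is_suspect_title` could be run over the Title values of a batch of PDFs before posting, so that a person reviews only the files it flags.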
Other considerations for optimizing PDFs for search-engine use
PDF Specification: As of this writing (January 2006), it appears that Google doesn't index Specification 1.6 PDF files, the latest version fully compatible with Acrobat 7.0. To solve this problem, use PDF Optimizer in Acrobat 7.x Professional (Advanced > PDF Optimizer…) to set your PDF version to 1.5 or 1.4 and make your file’s contents available to Google’s indexing engine.

To ensure that all search engines can index your content,
be sure to Optimize your file to the 1.4 specification,
i.e., full compatibility with Acrobat 5.0.
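If you have many files to check, the PDF version can be read from the `%PDF-x.y` header at the start of each file. The sketch below assumes the header is accurate. A PDF can also declare a newer version inside the document itself, so treat the result as a first pass, not as a substitute for PDF Optimizer. The function names and the default cutoff of 1.5 are illustrative assumptions.

```python
import re


def pdf_header_version(data):
    """Return the version from a '%PDF-x.y' header, e.g. '1.6', or None."""
    # The header is expected at the start of the file; scanning the first
    # 1 KB tolerates a few leading junk bytes.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def needs_downgrade(version, max_version="1.5"):
    """True if the version is newer than the highest one assumed to be indexed."""
    def as_tuple(text):
        return tuple(int(part) for part in text.split("."))

    return as_tuple(version) > as_tuple(max_version)


# Example with the first bytes of a hypothetical file:
sample = b"%PDF-1.6\n%\xe2\xe3\xcf\xd3\n"
version = pdf_header_version(sample)
print(version, needs_downgrade(version))
```

To check a real file, read its first kilobyte in binary mode, for example `open(path, "rb").read(1024)`, and pass the bytes to `pdf_header_version`.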
File-size limits: Google does not index every word in every PDF. There’s a size limit, variously reported as being between 100 and 500 KB, to the text that Google will attempt to extract and index from any given file. If you are posting large PDFs and it’s critical that Google indexes all of the content, consider posting documents by chapter or splitting them at another natural breaking point. This way, Google is less likely to stop indexing at, say, page 57 of a 112-page document.
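To get a rough feel for where indexing might stop, and how to group chapters, here is a small sketch. It assumes you already know roughly how many bytes of text each page or chapter contains. The limit is passed in because the real figure is only reported, not documented, and both function names are hypothetical.

```python
def last_fully_indexed_page(page_text_sizes, limit_bytes):
    """Return how many leading pages fit entirely within the text limit."""
    total = 0
    for count, size in enumerate(page_text_sizes):
        total += size
        if total > limit_bytes:
            return count
    return len(page_text_sizes)


def split_into_parts(chapter_sizes, limit_bytes):
    """Greedily group consecutive chapters so each part stays under the limit."""
    parts, current, current_size = [], [], 0
    for index, size in enumerate(chapter_sizes):
        # Start a new part when adding this chapter would exceed the limit.
        if current and current_size + size > limit_bytes:
            parts.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        parts.append(current)
    return parts
```

With 112 pages of about 1,800 bytes of text each and an assumed 100 KB limit, `last_fully_indexed_page([1800] * 112, 100 * 1024)` returns 56, meaning extraction would run out partway through page 57. Note that a single chapter larger than the limit still ends up in a part of its own; it would need a finer split.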
Content Reading Order: If controlling and optimizing the way search engines index your PDFs matters to you, you’ll eventually want to get familiar with the content reading order: the order in which search engines extract text from the document for indexing. Content ordering is not a casual process, but it can result in dramatically improved search results, especially for search engines that display search terms in context.
To begin defining content order in Acrobat Professional, first find out whether your file is Tagged. (Control-D keyboard shortcut, then check the “Description” tab).…

This little tell-tale is prima-facie evidence of inaccessible content.
Not only should Tags say "Yes", but the tags should be validated too.
If your PDF isn’t tagged, you can quickly tag it using the Advanced > Accessibility > Add Tags to Document command. To view how the content is currently ordered, open the Touch Up Reading Order Tool, in the same Accessibility menu.

Get the reading-order right, and so will Google.
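Why reading order matters is easier to see with a toy example. The sketch below models a page as text fragments stored in some order, and compares what a naive extractor would read with a simple top-to-bottom, left-to-right ordering. The `Fragment` type, the coordinate convention and the single-column assumption are all hypothetical simplifications; a real tagged PDF carries its reading order in its structure, not in coordinates alone.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge (a convention assumed for this sketch)


def extracted_text(fragments):
    """Text as a naive extractor sees it: in stored order."""
    return " ".join(fragment.text for fragment in fragments)


def visual_order(fragments):
    """Single-column approximation: top to bottom, then left to right."""
    return sorted(fragments, key=lambda fragment: (fragment.y, fragment.x))


def is_out_of_order(fragments):
    """True if the stored order differs from the approximate visual order."""
    return list(fragments) != visual_order(fragments)


# A page whose footer happens to be stored first.
stored = [
    Fragment("Page 3", 500, 750),
    Fragment("Annual Report 2005", 72, 72),
    Fragment("Revenue grew in every region.", 72, 120),
]
print(extracted_text(stored))
print(extracted_text(visual_order(stored)))
```

In this example the stored order begins with "Page 3", which is exactly the kind of stray text that can end up as a search-result snippet when the reading order has not been corrected.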
Conclusion
Most organizations posting documents to their intranet or Internet file servers want those documents to be found by other people. Corporate intranets rely on search engines to index and retrieve all manner of internal documents for use every day. To the extent that PDF files comprise a meaningful volume of your searchable content (and you wouldn’t have read this far unless they do), you owe it to yourself to make sure your PDFs will look their best under the relentless gaze of the search engines.
Key Takeaways:
- Check each PDF file's "Description" (in Document Properties) before posting.
- Break large PDFs into chapters before posting to ensure Google indexes all the content.
- Add structure to PDF files so Google indexes the content you want displayed in search results.