Duff Johnson's Blog

Posted: 2008-11-20

Google WILL index your scanned PDFs!

The lords of search over at Google recently announced an interesting new feature for PDFs created from scanned pages.

Searchable PDF files are nothing new - and neither are searchable PDF files produced from scanned pages. Simply run OCR and voilà - your scanned PDFs are now searchable.

But let's say you didn't OCR your files. Maybe you didn't want to take the time, maybe it's impractical, or maybe you didn't even WANT your files to be searchable (my legal friends should take note here).

Too bad!

Post those PDFs on a publicly accessible site and now Google will OCR and index them for you, no extra charge.

I'm sure there are some limits here.  Google isn't saying, but I'm guessing it won't download a 500 MB PDF just to discover that there's no text to index.

I'm also unsure as to the quality of the OCR. I'd have to believe that it's super-quick and therefore less than super-accurate, but then again, Google has computing resources that defy my paltry imagination, so no bets there either.

I'll be running some tests before long, but I'm curious to know what you think.

Do you WANT your scanned PDFs indexed by Google?  Are you tempted to post oceans of scanned content online?  Or is this a big yawn, something you thought Google was doing all along, so what's the big deal?

Comments

Anonymous

Rowan,

I've been toying with the question of whether or not this development will eventually make 'local' OCR obsolete. After all, with server-side OCR, the more accurate it gets, the better search results will get, and no one (except Google) has to do a thing except feed scans onto the web.

Hmm.

That's a good point about Google Webmaster. Google should make clearer exactly what their policy is on processing PDFs: size limitations (if any), text-volume limitations (if any), the extent to which structure is used, and so on. At the same time, they should tell users how to "block" their PDFs from indexing, explain their policy on PDF metadata, and so on. There are many ways in which PDFs could really be handled "right" - I'd like to see it.

Anonymous

In the past week I have seen at least one or two PDFs showing up in my search results. Generally I have found the experience to be pretty good -- the PDFs are usually white papers, academic papers, data sheets, etc., and so contain useful information.

It should be a little bit concerning for less tech-savvy people who may not be sure how Google's indexing works. It would be nice if Google were to add something to Google Webmaster that told you if any PDFs (or other file types) from your site were being indexed.
