The lords of search over at Google recently announced an interesting new feature for PDFs created from scanned pages.
Searchable PDF files are nothing new - and neither are searchable PDF files produced from scanned pages. Simply run OCR and voila - your scanned PDFs are now searchable.
But let's say you didn't OCR your files. Maybe you didn't want to take the time, maybe it's impractical, or maybe you didn't even WANT your files to be searchable (my legal friends should take note here).
Too bad!
Post those PDFs on a publicly accessible site and now Google will OCR and index them for you, no extra charge.
I'm sure there are some limits here. Google isn't saying, but I'm guessing it won't download a 500 MB PDF just to discover that there's no text to index.
I'm also unsure about the quality of the OCR. I'd have to believe that it's super-quick and therefore less than super-accurate, but then again, Google has computing resources that defy my paltry imagination, so no bets there either.
I'll be running some tests before long, but I'm curious to know what you think.
Do you WANT your scanned PDFs indexed by Google? Are you tempted to post oceans of scanned content online? Or is this a big yawn, something you thought Google was doing all along, so what's the big deal?



Comments
Rowan,
I've been toying with the question of whether or not this development will eventually make 'local' OCR obsolete. After all, with server-side OCR, the more accurate it gets, the better search results will get, and no one (except Google) has to do a thing except feed scans onto the web.
Hmm.
That's a good point about Google Webmaster. Google should make clearer exactly what their policy is on processing PDFs: size limitations (if any), text-volume limitations (if any), the extent to which structure is used, and so on. At the same time, they should tell users how to "block" their PDFs from indexing and spell out their policy on PDF metadata. There are many ways in which PDFs could really be handled "right" - I'd like to see it.
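On the "block" question: as far as I know, Google honors an `X-Robots-Tag: noindex` HTTP response header on non-HTML files such as PDFs, so one option is to have the web server attach that header to every PDF it serves. Here is a minimal sketch using only Python's standard library. The helper name `extra_headers_for`, the handler class, and the port are my own assumptions for illustration, not anything Google prescribes, and a real site would more likely set the header in its web server configuration.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


def extra_headers_for(path: str) -> dict:
    """Return extra response headers for a request path (hypothetical helper)."""
    # Drop any query string or fragment, then compare the extension
    # case-insensitively so "scan.PDF" is treated like "scan.pdf".
    clean = path.split("?", 1)[0].split("#", 1)[0]
    if clean.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPDFHandler(SimpleHTTPRequestHandler):
    """Serve files from the current directory, marking PDFs as noindex."""

    def end_headers(self):
        # self.path may be unset if the request line could not be parsed,
        # so fall back to an empty string in that case.
        for name, value in extra_headers_for(getattr(self, "path", "")).items():
            self.send_header(name, value)
        super().end_headers()


def serve(port: int = 8000) -> None:
    """Start the demo server on localhost (the port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), NoIndexPDFHandler).serve_forever()
```

Calling `serve()` from a directory of scans would serve them locally with the header attached to each PDF response. Note that this only asks search engines not to index the file; it does nothing to stop people from downloading it.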
In the past week I have seen one or two PDFs showing up in my search results. Generally I have found the experience to be pretty good -- the PDFs are usually white papers, academic papers, data sheets, etc., and so contain useful information.
It should be a little concerning for less tech-savvy people who maybe aren't sure how Google's indexing works. It would be nice if Google were to add something to Google Webmaster that told you if any PDFs (or other file types) from your site were being indexed.