Obtaining OCRs from century-old Google Book

ITGreybeard
Registered: Feb 19 2010
Posts: 5

I have downloaded some English books dating back to the 1800s. They are quite legible to me, but they were produced by the Google Books scanning process. That means the pages are images only, and those images are of pages more than 100 years old. The images therefore suffer from the book's aging, with attendant staining and blotching of pages. Likewise, the text is somewhat blurred, and may be set in fonts and styles that are no longer in common use.

Does anyone have recommendations for PDF Optimizer settings to achieve the best OCR results from these semi-ancient volumes?

Many thanks.

p.s. A link to one of these on Google Books is:

http://books.google.com/books?id=cdJwdgeKjdAC&printsec=frontcover&dq=editions:STANFORD36105014201441&lr=&as_drrb_is=q&as_minm_is=0&as_miny_is=&as_maxm_is=0&as_maxy_is=&num=30&as_brr=1#v=onepage&q=&f=false

p.p.s. Acrobat Pro version is 9.3, not the 9.2 allowed as the latest version in the provided product description dropdown.

My Product Information:
Acrobat Pro 9.2, Windows
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Hi,
Optimizer does not do anything to OCR output.
It provides selections associated with downsampling and compression of images; but these don't address staining/blotching that is reflected in the image.
Acrobat's Scan dialog provides an "Options" button. This opens the Optimization Options (for scan) dialog. Here, you can establish custom settings for Filtering. These might help.
But, as the file has already been "done" by Google, you are not doing the scan yourself, so those scan-time filters are not applied.
You could try exporting the page images to a photo editing application. Try some cleanup there, then output a new PDF.
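
If per-page editing is too slow, that kind of cleanup can in principle be scripted. Here is a minimal sketch in plain Python, assuming each page has already been exported as a grid of 8-bit grayscale values (0 = black, 255 = white). The function names `otsu_threshold` and `binarize` are made up for illustration and are not part of Acrobat; loading and saving the actual image files would need an imaging library, which is not shown. It uses Otsu's method to pick a cutoff between "ink" and "paper" and then forces every pixel to pure black or pure white, which may wash out light staining but is not guaranteed to help on every page.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for a flat list of grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # count of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the intensities of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); larger is better.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(page):
    """Map a page (list of rows of grayscale values) to pure black and white."""
    flat = [p for row in page for p in row]
    t = otsu_threshold(flat)
    # Pixels at or below the threshold are treated as ink, the rest as paper.
    return [[0 if p <= t else 255 for p in row] for row in page]


# Tiny made-up "page": dark ink (30), a brownish stain (170), clean paper (220).
page = [
    [220, 220, 30, 30],
    [220, 170, 170, 220],
    [30, 30, 220, 220],
    [170, 170, 220, 220],
]
cleaned = binarize(page)
```

In this toy example the stain pixels end up white along with the paper, and only the ink stays black. Real scans are messier; heavy blotches that are as dark as the text will survive a single global cutoff.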

Be well...

ITGreybeard
Registered: Feb 19 2010
Posts: 5
Thanks for the information. I too had come to the conclusion that the optimizer wasn't involved in the OCR portion of the procedure.

And given that the books run into the many hundreds of pages, I am not likely to take each page image and clean it up.

But actually I am quite pleasantly surprised at how well the OCR works on the text images. It's not perfect, but it's better than the OCR that I remember from the '90s. And it's great to have the batch processing function available.

Thanks again.

Mel Ivey