
Fiddly query about Batch Processing, OCR and file shrinking.

Tentacular
Registered: May 26 2008
Posts: 3

Apologies for this being long and quite possibly stupid, but I'm green, and
I've looked all over and can't find this precise information. Any help very
gratefully received.

I'm converting a bunch of my old paperbacks to ebooks (to be read
online and on an ebook reader), scanning the pages into PDF, then using
Acrobat Pro 8 on a MacBook (OS X 10.5) to OCR them so the text is
searchable, and saving them at a reduced size without too much quality
loss (closer to 'print' than 'web' quality). I'm dealing with several of these
files at a time so obviously want to use Batch Processing if possible.
However, I'm not techy or designy and don't understand the various
options available in the (far too complicated) 'PDF Optimization' menu.

The files when they emerge from initial scanning are huge, between 40MB and
100MB. When I batch-process them to OCR and put PDF Optimization on
for output, using the default settings, they (eventually) emerge as still-huge files, usually *growing*, not shrinking. Plus the fonts look awful.

However, after loooong experimentation, I've found that a good
compromise in terms of size and quality is to *first* run them through the
'Reduce File Size' command on the 'Document' menu, then OCR the
resulting file through the 'Recognize Text using OCR' subcommand on the
'OCR Text Recognition' command on the same menu. The end result of
that is it reduces their size very substantially (down to 6ish MB), they
look perfectly ok and they're text-searchable. I've found that doing the
two steps - OCR & Shrink - the other way round doesn't work so well. I've no idea
why.

I have been told that the 'Reduce File Size' command is just a simplified
version of the 'Optimize Scanned PDF' command with certain settings
inbuilt. But I cannot find what those settings are. I've also
been told that those settings are the same as the defaults in PDF
Optimization, so if I simply run that command (or run PDF Optimization
as an output option on Batch Processing), the result should be the same.
I've tried this and it just is not even close to the same: the files are huge
and the final fonts don't look the same: 'Reduce File Size' almost
always results in a better finish for me, and, frustratingly, that command is not
available in 'Batch Processing'.

So my question is: is there a way to organize a 'Batch Processing'
sequence that will take a bunch of files and for all of them exactly
mimic the result, in file-size and aesthetic results, of taking each file separately
and first i) clicking the 'Reduce File Size' command, then ii) clicking
'Recognize Text using OCR'? If there is, I would be so grateful if
someone can tell me how to create it. At the moment I can batch-process OCR a bunch of shrunk files, but cannot shrink them except one at a time.

Thanks so much for any help.

teledu
Registered: May 10 2007
Posts: 42
I sympathise with your comments: only the really techie types (not I) seem able to work out which settings will reduce the size of a specific file (given the settings it was scanned at) while keeping adequate resolution of text or graphics for the intended purpose. I can't help you either with explaining the differences in result between the two Acrobat functions...
For batch OCR, though, you really might consider using dedicated OCR software like ABBYY FineReader or Readiris. They are geared for batch work and let you define areas to OCR, remove spine and punch-hole images, correct OCR errors easily, and so on. Output can still be to PDF, with adjustable settings for dpi and JPEG quality.

...if the paperbacks weren't written by you, watch out for copyright!
dbaker
Expert
Registered: Feb 10 2006
Posts: 413
Hi --

Interesting questions. By the way, reducing file size is really different from Optimizing PDF.

Here's the short answer to what you want to do:

1. Build a batch sequence, which you have already been doing.

2. First item you want to add is Preflight. Once the Preflight option is added to the sequence, double-click it to open the Preflight:Batch Sequence Setup dialog.

3. Click the "Run Preflight check using" dropdown arrow to show the list of Preflight profiles available. Fortunately they are in alphabetical order - scroll down and select Online publishing (optimize for quality).

4. You can ignore the remainder of the dialog at this point, and click Save. The dialog closes, and takes you back to the Edit Sequence dialog.

5. Add the Recognize Text Using OCR command as you have been doing.

6. Continue with the rest of the process as before.

When you run the batch, Acrobat first applies the Preflight profile. The Online publishing (optimize for quality) profile discards content not needed for an ebook, while maintaining a decent image resolution at 144ppi. You'll find it is quite similar in its output to the Reduce File Size results you've had (possibly slightly larger depending on the image load).
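
If you'd like to check that comparison on your own files, you can compare sizes outside Acrobat. Below is a minimal Python sketch, not anything built into Acrobat. The function name `size_report` and the two folder arguments are purely illustrative, and it assumes your batch sequence saves its output to a separate folder under the same file names (matching `*.pdf`, which is case-sensitive on some systems).

```python
from pathlib import Path
import sys


def size_report(original_dir, processed_dir):
    """Return (name, original_bytes, processed_bytes, ratio) for each PDF
    that exists in both folders under the same file name."""
    original_dir, processed_dir = Path(original_dir), Path(processed_dir)
    rows = []
    for src in sorted(original_dir.glob("*.pdf")):
        out = processed_dir / src.name
        if not out.is_file():
            continue  # not processed yet, or saved under another name
        before = src.stat().st_size
        after = out.stat().st_size
        ratio = after / before if before else float("nan")
        rows.append((src.name, before, after, ratio))
    return rows


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python size_report.py scans/ batch_output/
    for name, before, after, ratio in size_report(sys.argv[1], sys.argv[2]):
        flag = "GREW" if after > before else "ok"
        print(f"{name}: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB "
              f"({ratio:.0%}) {flag}")
```

Running it once against the Preflight output and once against a few files you shrank by hand with Reduce File Size should show whether the two really land in the same range, and will flag any file that grew.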

If you want to minimize file size even more, use the Online publishing (optimize for size) Preflight profile. Again, it removes content you won't need in an ebook, and downsamples the image to 96ppi, which is perfect for onscreen use.
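
To get a feel for what those two resolutions mean, here is a rough Python sketch of the arithmetic. The 300ppi scan resolution and the 4.25 x 7 inch page are assumptions for illustration only (substitute your own scanner setting and page size), and the function names are made up for this example. Pixel count scales with the square of the resolution; actual file size also depends on compression, so treat the percentages as a guide, not a prediction.

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Pixel width and height of a page image at a given resolution."""
    return round(width_in * ppi), round(height_in * ppi)


def pixel_fraction(target_ppi, source_ppi):
    """Fraction of the original pixels left after downsampling.
    Both dimensions shrink, so the count scales with the ratio squared."""
    return (target_ppi / source_ppi) ** 2


if __name__ == "__main__":
    SCAN_PPI = 300                    # assumed scan resolution
    PAGE_W_IN, PAGE_H_IN = 4.25, 7.0  # assumed paperback page size in inches

    for label, ppi in [("original scan", SCAN_PPI),
                       ("optimize for quality", 144),
                       ("optimize for size", 96)]:
        w, h = pixel_dimensions(PAGE_W_IN, PAGE_H_IN, ppi)
        share = pixel_fraction(ppi, SCAN_PPI)
        print(f"{label}: {ppi} ppi, {w} x {h} px, "
              f"{share:.0%} of the scanned pixels")
```

Under those assumptions, the drop from 144ppi to 96ppi removes more than half of the remaining pixels, which is why the second profile gives noticeably smaller files at some cost in sharpness.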

donna.

A prolific author and writer of many Acrobat books, as well as books on graphic and Web design software.
Donna lives on a lakeshore in central Canada, where all manner of wildlife from muskrats to coyotes come to call.

Tentacular
Registered: May 26 2008
Posts: 3
Thanks so much for the advice teledu, and thanks Donna, that sounds absolutely perfect. Unfortunately, I don't have 'Online Publishing' as an option in my 'Run Preflight Check Using' dropdown menu! It's just not there at all. Goes straight from 'Newspaper Ads' to 'Remove 3D Annotations'. I am trying to work out a way to add that option to the list, but am at sea. I could use 'Digital Printing' instead but that seems like something rather different...
fwsgis
Registered: Aug 4 2010
Posts: 2
I'm using OCR on a document but keep getting this error: "Unable to process the page because the Paper Capture recognition service experienced an error. (5)" The document has 26 pages and this is the only page where I get the error. This page isn't any different from the other pages. What can I do to fix this?
teledu
Registered: May 10 2007
Posts: 42
Hi fwsgis:
In short, I don't know, but you really should post this as a new topic, as it's not related to the original poster's question (even if you did find it with a search for OCR). That way you are more likely to get a wider viewing.
You could try re-scanning and creating a new pdf of the page that is giving the problem.