OCR and Clearscan

irap
Registered: Aug 6 2008
Posts: 56

Hi,

I have two issues relating to OCRing files.

The first: is it possible to get a report of the problem files the OCR process detects? We have about 25,000 files to process, and it would be helpful to be able to track them after processing.

The second has to do with the errors we received on a small sample of 100 files. About half couldn't be OCR'ed.

One group uses embedded subset Identity-H fonts. For these files, OCR reports that the PDF has renderable text. However, when I search for that text in Acrobat, it isn't found.

The second group does not have fonts in it (according to document properties). These files generate a message that they contain "graphics other than images or text." I would categorize these PDFs as cover spreads that have text and graphics.

The first group represents 10 out of 100 files. The second group represents about 40 out of 100 files.

Any suggestions on how to deal with these issues?

Thanks.

Ira

My Product Information:
Acrobat Pro Extended 9.0, Windows

franz
Registered: Feb 17 2011
Posts: 2
Hello,

I usually prefer Clearscan, but I have encountered the following problem: when I copy and paste the OCR output into a word processor, I get different results for the same page depending on whether Clearscan was run on just that single page or on the complete (many-paged) PDF, with no other parameters changed. In the single-page version the resulting text is fine, but when I OCR the complete PDF, the resulting text is teeming with words broken up by spurious spaces. I could accept Clearscan having problems recognizing certain texts properly, but I don't understand why it should produce such divergent results for exactly the same text depending only on the length of the OCR'd PDF.


fjl

Ehrman
Registered: Jul 14 2011
Posts: 2
Hi Franz, I'm encountering the same bug with Clearscan OCR and the arbitrary spaces inside words. Do you know of any solution or official explanation for this?

Best,
Ehrman

franz
Registered: Feb 17 2011
Posts: 2
Dear Ehrman,

Sorry for the late answer. The only solution I have is to split the PDF into parts of no more than 15 pages each (Document/Split document ...), run a batch OCR on the parts, and then reassemble them into one single PDF (Combine/Merge files into a single pdf ...). And no, I haven't received any official explanation for this. I'll soon install Acrobat 10 Pro, and can only hope that it will work better.
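
For a long document, the page ranges for the split can be worked out in advance with a small script. This is only a sketch in Python, not anything Acrobat provides: the function name `chunk_page_ranges` is made up for illustration, the 15-page limit is simply the figure from the workaround above, and the script does not touch the PDF itself. It only lists the ranges to use when splitting.

```python
def chunk_page_ranges(total_pages, max_pages=15):
    """Return 1-based, inclusive (first, last) page ranges.

    Each range covers at most `max_pages` pages; the last range may be
    shorter. `max_pages=15` is an assumption taken from the workaround
    described in this thread, not a documented Acrobat limit.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (first, min(first + max_pages - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    # Example: a 40-page document split into parts of up to 15 pages.
    for first, last in chunk_page_ranges(40):
        print(f"pages {first}-{last}")
```

For a 40-page file this lists pages 1-15, 16-30 and 31-40, i.e. three parts to OCR separately and then merge again.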

Best,
fjl