Hi,
I have two issues relating to OCRing files.
First, is it possible to get a report of the problem files that the OCR process detects? We have about 25,000 files to process, and it would be helpful to be able to track the problem files after processing.
The second has to do with the errors we received on a small sample of 100 files. About half couldn't be OCR'ed.
One group uses embedded subset Identity-H fonts. This generates the message that the PDF has renderable text. However, when I try to search for the text in Acrobat, it doesn't find it.
The files in the second group do not contain any fonts (according to the document properties). These files generate a message that they contain "graphics other than images or text." I would categorize these PDFs as cover spreads that have text and graphics.
The first group represents 10 out of 100 files. The second group represents about 40 out of 100 files.
Any suggestions on how to deal with these issues?
Thanks.
Ira
I usually prefer ClearScan, but I have encountered the following problem: when I copy and paste the OCR output into a word processor, I get different results for the same page depending on whether ClearScan was run on just that single page or on the complete (many-paged) PDF, with no change to any other parameters. In the single-page version the resulting text is fine, but when I OCR the complete PDF, the resulting text is teeming with words broken up by spurious spaces. I could accept ClearScan having problems recognizing certain texts properly, but I don't understand why it should produce such divergent results for exactly the same text depending only on the length of the OCR'd PDF.
fjl