How best to view, edit, and correct text from scan and OCR?

FUBARinSFO
Registered: Apr 11 2007
Posts: 61

Hi:

What is considered the best or usual way of viewing and editing/correcting the "invisible" text behind the bitmap image after OCR on a scanned document?

It appears that you can view the underlying text via Document | Examine Document | Hidden Text (Preview), but this doesn't seem to be very practical. Here's what I see (copied into Word for pasting here):

pennittedtobeproduced,therewouldbenowaytomarkthethousandsofdocumentswithintheCDforfuturemotions.Intheeventoffuturemotions,theCDwouldneedtobeproducedaspartofthem

Not very helpful, to say the least.

What I've done in the past is to copy the whole doc (Ctrl-A Ctrl-C) and paste it into Word (Ctrl-V) and then run a spell-check on it. But while that yields a correct document from a spelling point of view (or can, if you have the patience to make the corrections), it doesn't correct the original PDF to improve its searchability.
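
Before a full spell-check, a rough first pass could flag the run-together tokens like those in the sample above. This is only a minimal sketch in Python: the function name `flag_run_together` and the 20-character threshold are my own assumptions, not anything Acrobat or Word provides, and it works on text already copied out of the PDF, so it does not fix the PDF itself.

```python
import re


def flag_run_together(text, max_len=20):
    """Return whitespace-separated tokens longer than max_len.

    Very long tokens usually mean the OCR engine dropped the spaces
    between words. The threshold is an arbitrary assumption; tune it
    for your documents.
    """
    tokens = re.findall(r"\S+", text)
    return [t for t in tokens if len(t) > max_len]


if __name__ == "__main__":
    sample = "pennittedtobeproduced,therewouldbenoway to mark the CD"
    for token in flag_run_together(sample):
        print(token)
```

A long list of flagged tokens suggests the page is worth re-scanning or re-running through OCR rather than correcting by hand.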

What I'm doing now is running the scan-OCR with ABBYY FineReader 9. It allows correction of the underlying OCR text while presenting the original scanned bitmap to the user (or vice versa, per an option). If anybody has a better suggestion, please let me know.

-- Roy Zider

My Product Information:
Acrobat Pro 8.1.2, Windows
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Roy,
Would it not be less time-intensive to have the PDF output from the source file?
If the document authors do not have Adobe products installed, then perhaps they
could obtain and use the free CutePDF product.
The output PDFs are quite basic but do contain renderable text, which supports
find/search adequately.

Be well...


FUBARinSFO
Registered: Apr 11 2007
Posts: 61
There is no source document -- this is scan and OCR. The problem is the poor OCR in Acrobat; I want to at least correct the worst of the recognition errors so that the doc can be found later.
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Roy,
Just some of my natterings here.

Having had the opportunity (or misfortune, depending on one's perspective) to OCR ream upon ream of scanned hardcopy
in days gone by, I've found that the two most significant variables affecting the output of OCR engines have been the
quality of the source paper and the resolution of the scan.

Regarding the hardcopy.
Text on/over grid lines results in less accurate OCR output.
Fold lines can block OCR recognition of text out to 0.5 to 0.75 inches on either side of the "line".
Poor black/white contrast and/or poor character density result in less accurate OCR output.
A copy of a copy of a copy... that is scanned results in poor OCR output.

Regarding resolution.
An effective resolution of less than 300 ppi results in less accurate OCR output.
The lower the scan resolution is the less accurate the OCR output.
Adobe recommends a minimum of 300 ppi for scanned textual content that is to be OCR'd
with Acrobat's OCR engine.
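
To make "effective resolution" concrete: it is the pixel count across the scanned image divided by the physical size of the original. A minimal sketch in Python, where the function names are hypothetical and the 300 ppi default simply mirrors the minimum mentioned above:

```python
def effective_ppi(pixels_wide, inches_wide):
    """Effective resolution: pixels across the image per inch of paper."""
    if inches_wide <= 0:
        raise ValueError("inches_wide must be positive")
    return pixels_wide / inches_wide


def meets_ocr_minimum(pixels_wide, inches_wide, minimum_ppi=300):
    """True if the scan reaches the given minimum resolution for OCR."""
    return effective_ppi(pixels_wide, inches_wide) >= minimum_ppi


if __name__ == "__main__":
    # A letter-size page (8.5 in wide) scanned 2550 pixels across.
    print(effective_ppi(2550, 8.5))
    print(meets_ocr_minimum(1700, 8.5))
```

The same page scanned only 1700 pixels wide works out to 200 ppi, which falls short of the recommended minimum.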

In order to obtain better OCR output I have often had to recopy the hardcopy with the contrast and toner density
settings bumped up, then scan at 400 ppi or even 600 ppi.

With Acrobat 8 you could OCR using [i]Formatted Text and Graphics[/i]; however, while this mode identifies
suspects that you can correct for spelling, the OCR output replaces the image. This may not be desired.

Aside from hardcopy quality and the resolution used, there are some other variables to consider, such as
scanner lamp intensity and platen cleanliness.

Having used a mix of consumer and production grade scanners and a mix of OCR engines, I've found that,
most times, the critical variables are the hardcopy quality/content composition and the scan
resolution.

If it starts to get too much of a brown study, consider a short getaway to 12.30 N, 69.58 W.

Be well...