First post here. :)
I am scanning a very large collection of papers, most of them 25 to 40 years old.
I am creating searchable image-exact pdf files.
My question is this:
Once OCR has been done, is there a way to "correct" the text hidden behind the image?
For example, say I have a letter where the typewriter didn't make the letter 'n' very clear. If I search for the word [i]census[/i] it won't find it, because OCR didn't know how to interpret that messed up 'n'. It will find "sus" or "ce" but not "census."
Can I correct this, and if so, how?
That, of course, is what is desired with Searchable Image (Exact).
The image is intact and can become a legal or life record.
A PDF Output of Formatted Text & Graphics basically dumps the image and leaves you with "OCR Suspects" that you can edit. As you have a large collection of legacy hard copy, any OCR will be hit or miss at best. Going with the second PDF Output mentioned and performing the manual edits would keep someone busy for a long time.
Having been involved in scanning legacy and current hard copy into PDF and doing the OCR via Acrobat's OCR engine, Adobe Capture Cluster, and AdLib, I have observed that no OCR gives a 1:1 match to what's on the hard copy. So, it is what it is. With that said, a cataloged index (or several of them) provides invaluable assistance when you need to locate information from the documents. The end-user has to supply the intelligence by using variations of the topic query and by making a study of the Acrobat Help on performing advanced searches. With that, solid results are obtained.
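If you can get the hidden text out of the PDFs as plain text (that export step is assumed here and not shown), the "variations of the query" idea can also be roughed out in a script. This is only a sketch and not an Acrobat feature: the function name fuzzy_find and the cutoff value are made up for illustration, and it uses Python's standard difflib module to find OCR words that are close to the word you wanted.

```python
import difflib
import re


def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Return words from the OCR text that closely resemble `query`.

    `fuzzy_find` and `cutoff` are hypothetical names for this sketch.
    """
    words = re.findall(r"[A-Za-z]+", ocr_text.lower())
    candidates = set(words)
    # OCR sometimes splits a word around an unreadable character
    # (e.g. "ce sus" for "census"), so also try adjacent words joined.
    candidates.update(a + b for a, b in zip(words, words[1:]))
    # get_close_matches ranks candidates by similarity, best first.
    return difflib.get_close_matches(
        query.lower(), sorted(candidates), n=5, cutoff=cutoff
    )


if __name__ == "__main__":
    # A misread 'n' and a split word, as in the example from the question.
    print(fuzzy_find("census", "The 1950 cersus figures were mailed"))
    print(fuzzy_find("census", "Please forward the ce sus forms"))
```

Lowering the cutoff catches worse misreads but also pulls in more unrelated words, so it would need tuning against your own scans.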
Be well...