
"Fixing" a searchable image (Exact)

Matt_G
Registered: Jan 16 2008
Posts: 6
Answered

First post here. :)

I am scanning a very large collection of papers, most of them 25 to 40 years old.
I am creating searchable image-exact pdf files.

My question is this:
Once OCR has been done, is there a way to "correct" the text hidden behind the image?
For example, say I have a letter where the typewriter didn't make the letter 'n' very clear. If I search for the word [i]census[/i] it won't find it, because OCR didn't know how to interpret that messed up 'n'. It will find "sus" or "ce" but not "census."
Can I correct this, and if so, how?

My Product Information:
Acrobat Pro 8.1.1, Windows
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Quote:
Can I correct this,...
With a PDF Output of Searchable Image (Exact), the answer is no.
That, of course, is what is desired with Searchable Image (Exact).
The image is intact and can become a legal or life record.

A PDF Output of Formatted Text & Graphics, basically, dumps the image and leaves you with "OCR Suspects" that you can edit. As you have a large collection of legacy hard copy, any OCR will be hit or miss at best. Going with the second PDF Output mentioned and performing the manual edits would keep someone busy for a long time.

Having been involved in scanning legacy and current hard copy into PDF and doing the OCR via Acrobat's OCR engine, Adobe Capture Cluster, and AdLib, I have observed that no OCR gives 1:1 to what's on the hard copy. So, it is what it is. With that said, a cataloged index (or several of them) provides invaluable assistance when you need to locate information from the documents. The end-user has to supply the intelligence by using variations of the topic query and by making a study of the Acrobat Help on performing advanced searches. With that, solid results are obtained.
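
To make "variations of the topic query" concrete, here is a minimal Python sketch. It assumes that OCR misread exactly one character of the word, as in the 'census' example earlier in the thread, and lists the fragments left on either side of each possible bad character. The function name ocr_fallback_queries and the min_len cutoff are hypothetical, for illustration only; nothing here is part of Acrobat.

```python
def ocr_fallback_queries(word, min_len=2):
    """Return search fragments to try if one character of `word` was misread.

    Assumption: exactly one character was garbled by OCR, so the text on
    either side of that character is still searchable.
    """
    queries = []
    for i in range(len(word)):
        # Drop the character at position i and keep what is left on each side.
        for part in (word[:i], word[i + 1:]):
            # Skip fragments too short to be useful, and skip duplicates.
            if len(part) >= min_len and part not in queries:
                queries.append(part)
    return queries


if __name__ == "__main__":
    for fragment in ocr_fallback_queries("census"):
        print(fragment)
```

For "census" the list includes "ce" and "sus", the two fragments that still matched when the 'n' was unreadable. Short fragments will also match unrelated words, so treat the results as candidates to check by eye.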

Be well...

Matt_G
Registered: Jan 16 2008
Posts: 6
Thanks for the response, even though it wasn't what I wanted to hear.
Looks like I may have to use Abbyy FineReader 8 in conjunction with Acrobat to pull this off.
In case you are wondering, the data in question is genealogical information.
Much of it is handwritten, which I knew going in that OCR would never be able to decipher.
After playing around this afternoon, it looks like Abbyy can "hide" text behind the image when it sends it to Acrobat. (Just like Acrobat does)
The difference is I can correct it before sending it to Acrobat.
I figure I can fix the [b]major[/b] stuff like the names and dates, to heck with the rest of it.
It's still going to be a long drawn out process though.

One more question:
I need to send copies of all these files to an associate. Can I still use a cataloged index?
I think I remember reading somewhere that once the catalog was made the folder structure could not change.
I obviously need to do some reading on this, but maybe you could give me a pointer or two?
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
One file or many, the PDFs can be cataloged.
For keeping collections portable (CD-ROM, DVD, moves to different network locations),
the collections folder/sub-folder layout & associated cataloged indexes would have
to be kept intact. However, it is possible to park an index or collection of indexes
separately on a LAN. You just don't want to change their location or the location of the PDF
collections once built.

I'd say the most practicable approach, for you, is to put it to CD-ROM or DVD-R.
Burn from a staging directory on your HDD. In this directory, at the root,
drop in an autorun.inf that will pull up a "start.pdf" (also at the root).
The PDF files, *.pdx, and cataloged index sub-folder are all in a folder...
say "files" (creative - eh?).
Associate the "start.pdf" to the *.pdx file in the "files" directory.
Your associate drops the disc into the drive and shortly thereafter
the start.pdf file is rendered. Commence search from there.

An autorun.inf that will work in XP:
[Autorun]
shellexecute="start.pdf"
ICON=AutoPlay\Books.ico
shell\readit\command=notepad readme.txt
shell\readit=Read &Me

Drop the "ICON" line unless you want an icon to show by the drive letter in
Windows Explorer. If you do want an icon it could just as well go to the root.

The line to the "readme.txt" &, with it, the last line can be dropped.
If you want them, then have a readme.txt (or a filename of your choice) in
the root. Using the two lines lets you right click on the drive letter and
see, on the menu, an entry for the readme file.
It might be a handy way to park some comments/observations
in an alternative manner than putting them into the PDFs. Although,
the comment functionality (making/retrieval) of Acrobat is impressive.
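
If you would rather script the staging directory than build it by hand, a rough Python sketch of the layout described above follows. It assumes the start file is named start.pdf and that your PDFs, the .pdx file and the index sub-folder already sit together in one source folder. The function name build_staging and its parameters are made up for illustration. It writes only the two required autorun.inf lines, so add the ICON and readme lines yourself if you want them.

```python
import shutil
from pathlib import Path


def build_staging(staging_dir, start_pdf, collection_dir):
    """Lay out a disc staging directory: start.pdf, autorun.inf, files/."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)

    # The start file goes at the root of the disc.
    shutil.copy2(start_pdf, staging / "start.pdf")

    # Copy the whole collection (PDFs, .pdx, index sub-folder) as one unit
    # so the layout relative to the index stays intact.
    shutil.copytree(collection_dir, staging / "files", dirs_exist_ok=True)

    # Minimal autorun.inf at the root, written with Windows line endings.
    autorun = "[Autorun]\r\nshellexecute=start.pdf\r\n"
    (staging / "autorun.inf").write_bytes(autorun.encode("ascii"))
    return staging
```

A call such as build_staging("D:/staging", "D:/work/start.pdf", "D:/work/collection") uses made-up paths. Burn the contents of the staging folder, not the folder itself, so that autorun.inf lands at the root of the disc.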

btw, you may want to consider adding some amount of metadata to your PDFs.
Some simple "flags" can help categorize your files.

Be well...

Matt_G
Registered: Jan 16 2008
Posts: 6
Thanks a bunch for all that information.
I really appreciate it.

I will mess around with catalogs, perhaps a DVD-R would be the best way for me to go.
I have some learning to do! :)

Thanks again.
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
The learning is the journey & the journey, often, is more entertaining than the destination.

Be well...

michaelejahn
Registered: Apr 26 2006
Posts: 232
@ Matt G

You might want to look into ELAN GMK's Capture product - or if the text is mostly handwritten, ELAN ELIP, which uses about 3 or 4 OCR and ICR engines - companies like Northrop Grumman use it to convert hand-drawn mechanical engineering drawings into searchable archives.

http://www.elan-gmk.com/

Good luck - if you want to ask me a direct question, feel free to call me at 805 527 8130 anytime

Michael Jahn
Application Support Specialist
Compose Systems Inc, USA.
4740 Northgate Blvd. Suite 100
Sacramento, CA 95834
Tel: (916) 920-3838 ext 102
Fax: (916) 923-6776
Email: michaelejahn [at] composeusa [dot] com
Web: www.composeusa.com