OCR problem

Balaji MG
Registered: Jul 6 2010
Posts: 22

Hi,
 
I have a PDF with vector images. Following the previous threads in this forum, I converted the PDF page to an image and then ran OCR on that image using Acrobat 8.0 Pro. The image is now rasterized and the text displays properly. My only concern is that when I export the PDF to text, most of the ligatures do not come through correctly.
E.g., "Of" is exported as "Ol" and "different" is exported as "dillerent".
 
Can anyone please suggest how to resolve this issue?
 
Thanks,
Balaji

My Product Information:
Acrobat Pro 8.0, Windows
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Balaji,
Having used OCR on the image held by the PDF, you obtained a "hidden layer" of characters.
The characters exported are the output of the OCR. No OCR process captures 100% of the characters accurately 100% of the time. So, in your case, OCR "sees" the character 'l' rather than the character 'f'.
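
If you end up cleaning the exported text outside Acrobat, here is a minimal Python sketch of one way to flag likely 'f' read as 'l' mistakes. It is not an Acrobat feature: the function names and the tiny `KNOWN` word list are made up for illustration, and it assumes the only confusion is the one reported in this thread. For each word that is not in the word list, it tries turning 'l' characters back into 'f' and reports any result that is a known word.

```python
import re
from itertools import product

def variants(word):
    """All spellings obtained by turning any subset of 'l' characters into 'f'."""
    options = [("l", "f") if ch == "l" else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*options)} - {word}

def suggest(word, known_words):
    """Return known words that `word` could be if OCR misread 'f' as 'l'."""
    lower = word.lower()
    if lower in known_words:
        return []  # already a valid word, so not treated as a suspect
    return sorted(v for v in variants(lower) if v in known_words)

def find_suspects(text, known_words):
    """Map each unknown word in `text` to its plausible corrections."""
    suspects = {}
    for word in re.findall(r"[A-Za-z]+", text):
        fixes = suggest(word, known_words)
        if fixes:
            suspects[word] = fixes
    return suspects

# Tiny illustrative word list (an assumption); a real run would load a full dictionary.
KNOWN = {"of", "different", "kinds", "all", "the", "old", "files"}
print(find_suspects("Ol all the dillerent kinds", KNOWN))
```

Note the limitation: a misread that happens to produce another real word will not be flagged, so this only narrows down what to proofread.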


As a trial, you may want to print as image at 400 ppi and even at 600 ppi. Perform OCR on each, then export each OCR result to a text file to see what the OCR produced at that resolution. Sometimes the bump in resolution helps; sometimes it does not.
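
To compare the exported text from two trials without reading both files side by side, a short Python sketch like the following can list only the words where the two exports disagree. It uses the standard-library `difflib` module; the function name `differing_words` and the two sample strings are hypothetical stand-ins, not real Acrobat output.

```python
import difflib
import re

def words(text):
    """Split exported text into whitespace-separated words."""
    return re.findall(r"\S+", text)

def differing_words(text_a, text_b):
    """Return (a_words, b_words) pairs where two OCR exports disagree."""
    a, b = words(text_a), words(text_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Made-up stand-ins for the text exported from two trials.
# In practice, read these from the exported .txt files instead.
export_400 = "Ol the dillerent options"
export_600 = "Of the different options"

for low, high in differing_words(export_400, export_600):
    print(f"400 ppi: {low!r}  600 ppi: {high!r}")
```

Words that come out the same in both trials are skipped, so whatever is printed is the short list worth checking against the page image.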


With Acrobat 8 Pro you could also try OCR of Formatted text and Graphics.
Then use Acrobat to locate 'suspects' which you can correct.
However, this mode will replace the original image with the OCR output.


Be well...

Balaji MG
Registered: Jul 6 2010
Posts: 22
Thanks for your reply.

I have tried Formatted text and Graphics in Acrobat 8 Pro, but the images are getting cut off, so I can't use this option. The main reason I'm using this process is that I need to convert the vector images to raster.

Is there any other way to get it done?

Thanks,
Balaji
DaveyB
Registered: Dec 10 2010
Posts: 70
While I don't pretend to aspire to the level you guys are working at, I can see where the problem lies, and I'm just wondering if taking a step back might help.

The problem lies in the fact that the images of the original PDF pages probably render the fonts at a non-native size before they are OCR'd. For this reason, the OCR is seeing ff as ll: when the font is reduced sufficiently, the two characters are so close in shape that it is difficult to tell them apart! I am naturally long-sighted and have had to use reading glasses for years, so I am very familiar with this problem, trust me!

Could the solution lie in changing the font in the PDF before the images are created, or alternatively in converting the text to upper case (or small caps) to increase the accuracy of the OCR? With the latter, the text would contain FF rather than LL, which the OCR should have an easier time telling apart.

Just a thought, and hopefully it is useful to you!

DaveyB

LiveCycle Designer 8.0
"Genius is one percent inspiration, ninety-nine percent perspiration." ~~ Thomas Edison
"If at first you don't succeed, get a bigger hammer." ~~ Alan Lewis
"If the conventional doesn't work, try the unconventional" ~~ DaveyB

Balaji MG
Registered: Jul 6 2010
Posts: 22
Hi DaveyB,

Thanks for your suggestions. I will try them.

Thanks,
Balaji