Hi,
I have a PDF with vector images. Following earlier threads in this forum, I converted the PDF page to an image and then ran OCR on that image using Acrobat 8.0 Pro. The page is now rasterized and the text displays properly. My only concern is that when I export the PDF to text, most of the ligatures do not come through correctly.
E.g. "Of" is exported as "Ol" and "different" is exported as "dillerent".
Can anyone please suggest how to resolve this issue?
Thanks,
Balaji
By running OCR on the image held in the PDF, you obtained a "hidden layer" of characters.
The exported characters are the output of that OCR. No OCR process captures every character accurately every time. So, in your case, OCR "sees" the character 'l' where the page shows the character 'f'.
As trials, you may want to print as image at 400 ppi and even 600 ppi. Perform OCR on each, then export each result to a text file to see what OCR produced. Sometimes the higher resolution helps; sometimes it does not.
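If you want to compare the trial exports without reading them side by side, something like the following Python sketch may help. It is not an Acrobat feature, only a way to list the words where two exported text files disagree. The function name `diff_words` and the sample strings are made up for illustration; in practice you would read in the two text files you exported yourself.

```python
import difflib


def diff_words(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where two OCR exports disagree."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing run so insertions and deletions also show up.
            changes.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return changes


if __name__ == "__main__":
    # Hypothetical samples standing in for the 400 ppi and 600 ppi exports.
    export_400 = "Ol the dillerent pages"
    export_600 = "Of the different pages"
    for old, new in diff_words(export_400, export_600):
        print(repr(old), "->", repr(new))
```

An empty result means the two exports contain the same words, so the higher resolution made no difference for that page.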
With Acrobat 8 Pro you could also try OCR with the "Formatted Text & Graphics" output style.
Then use Acrobat to locate the OCR 'suspects', which you can correct.
However, be aware that this mode replaces the original image with the OCR output.
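If you would rather leave the PDF alone and clean up the exported text afterwards, a small script can flag the likely 'f' read as 'l' cases for review. The Python sketch below is only an illustration under some assumptions: the names `ligature_candidates` and `find_suspects` are hypothetical, the word list is a tiny stand-in for a real dictionary file, and it only handles the 'l' for 'f' confusion from your examples. It does not touch the hidden text layer in the PDF.

```python
import re
from itertools import product


def ligature_candidates(word, known_words):
    """Suggest known words obtained by reading some 'l' characters as 'f'."""
    lower = word.lower()
    positions = [i for i, ch in enumerate(lower) if ch == "l"]
    # Skip words that are already valid, have no 'l', or have too many to try.
    if lower in known_words or not positions or len(positions) > 8:
        return []
    found = []
    for choice in product("lf", repeat=len(positions)):
        chars = list(lower)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != lower and candidate in known_words:
            found.append(candidate)
    return found


def find_suspects(text, known_words):
    """Map each unknown word in the text to its possible corrections."""
    suspects = {}
    for word in re.findall(r"[A-Za-z]+", text):
        options = ligature_candidates(word, known_words)
        if options:
            suspects[word] = options
    return suspects


if __name__ == "__main__":
    # Tiny illustrative word list; a real run would load a full dictionary.
    known = {"of", "the", "different", "pages"}
    print(find_suspects("Ol the dillerent pages", known))
```

The script only suggests corrections; it is safer to review each suggestion by eye than to replace automatically, since a misread word can sometimes match more than one dictionary entry.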
Be well...