Hello,
I am trying to convert a "Scanned and OCRed" PDF document into HTML.
For the conversion we have been using a product called "PDF to HTML Converter Pro".
http://www.intrapdf.com/convert_pdf_to_html.htm
It works well for most PDFs, but not for PDFs that have been scanned and then OCRed. When we try the conversion, the OCRed versions either fail to convert at all or convert with blank pages.
I believe it is related to the hidden text behind the image. We are using Acrobat 9 Standard 9.3.0 and we have tried doing the OCR as Searchable Image, Searchable Image (Exact), and ClearScan, but have had similar issues each time.
The OCR step itself runs fine; it is only the subsequent conversion to HTML that fails or produces blank pages.
Does anyone have experience with this? I could post some sample pages if that would help.
We could of course successfully convert the PDF to HTML prior to running the OCR, but that basically creates an HTML page with just an image, which makes it difficult for us to use the text in the document for hyperlinking, search engine indexing, etc.
If you have any suggestions for how we can convert the scanned, OCRed version of the file to HTML, I would much appreciate it.
Thanks.
Regards,
Andrew
http://www.anypdftools.com/pdf-to-html-converter.php#201
Just try it and hope it helps.
I will do my best to help with PDF conversion questions, objectively and neutrally.
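If no converter handles the hidden text layer, one possible workaround is to build the HTML yourself from two pieces: the page image and the recognized text. Below is a minimal Python sketch of that idea, not a tested solution for your files. It assumes you can already get the OCR text out of the PDF some other way (for example by exporting it as plain text) and that each page has been saved as an image. The function name `build_page_html`, the file name `page-001.png`, and the `ocr-text` class are made up for illustration.

```python
from html import escape


def build_page_html(image_src: str, ocr_text: str, title: str = "Scanned page") -> str:
    """Return an HTML page that shows the scan and also carries the OCR text
    as real, selectable text so it can be indexed and hyperlinked."""
    # Treat blank lines as paragraph breaks; collapse line wraps inside a paragraph.
    paragraphs = [" ".join(block.split()) for block in ocr_text.split("\n\n")]
    paragraphs = [p for p in paragraphs if p]

    # Escape the text so characters such as & and < do not break the markup.
    text_html = "\n".join(f"      <p>{escape(p)}</p>" for p in paragraphs)

    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"  <head><meta charset=\"utf-8\"><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        f"    <img src=\"{escape(image_src)}\" alt=\"{escape(title)}\">\n"
        "    <div class=\"ocr-text\">\n"
        f"{text_html}\n"
        "    </div>\n"
        "  </body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical sample text standing in for one page of exported OCR output.
    sample = "Annual Report\n\nRevenue & costs are\nsummarised below."
    print(build_page_html("page-001.png", sample, title="Page 1"))
```

The text ends up under the image rather than behind it, so the layout will not match the original page, but search engines can read it and you can add links to it by hand or with a further script.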