
How to convert scanned and OCRed PDFs into HTML

green_mountain
Registered: Feb 21 2010
Posts: 2

Hello,

I am trying to convert a "Scanned and OCRed" PDF document into HTML.

For the conversion we have used a product called "PDF to HTML Converter Pro".
http://www.intrapdf.com/convert_pdf_to_html.htm

It works well for most PDFs, but not with PDFs that are scanned and then OCRed. When we try to do the conversion, the OCRed versions either fail to convert at all or convert with blank pages.

I believe it is related to the hidden text behind the image. We are using Acrobat 9 Standard 9.3.0 and we have tried doing the OCR as Searchable Image, Searchable Image (Exact), and ClearScan, but have had similar issues each time.

The OCR conversion runs well, but when we try to convert to HTML the process fails, or produces blank pages.

Does anyone have experience with this? I could post some sample pages if that would help.

We could of course successfully convert the PDF to HTML prior to running the OCR, but that basically creates an HTML page with just an image and makes it difficult for us to use the text in the document for hyperlinking, search engine indexing, etc.

If you have any suggestions for how we can convert the scanned, OCRed version of the file to HTML, I would most appreciate it.

Thanks.

Regards,
Andrew

TonyPotter
Registered: Feb 1 2010
Posts: 85
It really is a problem. I know how to convert an OCRed PDF to text, but not to HTML. I have used a good PDF to HTML converter before, which may help you.
http://www.anypdftools.com/pdf-to-html-converter.php#201
Just try it and hope it helps.

I will try my best to help you with PDF conversion, objectively and neutrally.
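
Since getting plain text out of an OCRed PDF is the part that works, one possible workaround is to extract the hidden text layer page by page and wrap it in HTML yourself. Below is a minimal sketch in Python, not a tested fix for the converter mentioned above. It assumes the third-party pypdf library is installed and that its text extraction can read the OCR layer Acrobat produced (this may vary, especially with ClearScan). The function names pages_to_html and extract_page_texts are made up for this example, and the page layout and the scanned image are not preserved, only the text.

```python
from html import escape


def pages_to_html(pages, title="Converted document"):
    """Wrap already-extracted page texts (a list of strings) in a simple HTML document."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{escape(title)}</title>",
        "</head>",
        "<body>",
    ]
    for number, text in enumerate(pages, start=1):
        # One <div> per PDF page, so pages can be linked to as #page-N.
        parts.append(f'<div class="page" id="page-{number}">')
        # Treat blank lines as paragraph breaks. Escaping keeps stray
        # '<' or '&' characters from the OCR from breaking the markup.
        for block in text.split("\n\n"):
            block = block.strip()
            if block:
                parts.append("<p>" + escape(block).replace("\n", "<br>\n") + "</p>")
        parts.append("</div>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def extract_page_texts(pdf_path):
    """Read the text layer of each page. Assumption: pypdf is installed."""
    # Imported here so pages_to_html can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Example usage (file names are placeholders):
#   html_doc = pages_to_html(extract_page_texts("scanned_ocr.pdf"))
#   with open("scanned_ocr.html", "w", encoding="utf-8") as out:
#       out.write(html_doc)
```

If the extracted text comes back empty for every page, the problem is probably in how the OCR layer was written rather than in the HTML step, and trying a different OCR output setting would be the next thing to check.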