Hi,
I have many files which are PDF Normal, meaning they have a text layer. Most of the time, this text layer does not include all of the text on the page, only a logo or similar. I need to OCR these files, but OCR will not run when the file is PDF Normal.
I would like to only save the image with no text layer, as PDF, so that I can batch OCR these files.
I have tried the Examine Document function in Acrobat Pro 9 and removed everything, which I confirmed by re-opening the file. The text is still selectable, though, and the OCR process fails because the file is still "PDF Normal".
Any ideas would be great.
Thanks
(PS: I have access to Acrobat Pro 7.0.5 and 9.3.1.)
A 'PDF Normal' (older terminology) is a PDF whose page content came from an authoring application's file that was converted to PDF.
The text is not a 'layer'; rather, it is an inherent part of the PDF page content.
As such, Examine Document cannot remove it.
Nor can OCR process a PDF page containing such renderable text.
With Acrobat 8 or 9 Pro you could use the redaction tool to completely remove desired PDF page content.
You could print the PDF to the Adobe PDF printer and, in the Print dialog, use the Advanced button to open a dialog in which you can select Print As Image and choose a resolution.
The text content of the original PDF will still be visible, but only as part of the image held in the new PDF.
You could now OCR that text.
But why? You already have renderable, searchable text.
Now, if all you want are the logos, then redaction of text could give you that.
However, logos are typically graphic objects, and those usually give OCR nothing it can recognize as characters.
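If the print-as-image route has to be repeated for many files, it could be scripted outside Acrobat. The sketch below is only one possible approach and rests on assumptions not mentioned in this thread: that Ghostscript is installed, that its `pdfimage24` output device is available, and that the executable is called `gs` (on Windows it is usually `gswin64c`). The function names `rasterize_command` and `rasterize_folder` are made up for illustration. By default the script only builds the commands; pass `run=True` to actually execute them.

```python
import subprocess
from pathlib import Path


def rasterize_command(src, dst, dpi=300, gs_exe="gs"):
    """Build a Ghostscript command that re-renders every page of src
    as a single image and writes the result to dst as a PDF.
    Assumes Ghostscript's pdfimage24 device (24-bit colour image PDF)."""
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        "-sDEVICE=pdfimage24",
        f"-r{dpi}",               # rendering resolution in dots per inch
        f"-sOutputFile={dst}",
        str(src),
    ]


def rasterize_folder(in_dir, out_dir, dpi=300, run=False):
    """Build (and optionally run) one command per PDF found in in_dir."""
    out = Path(out_dir)
    commands = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        cmd = rasterize_command(src, out / src.name, dpi)
        commands.append(cmd)
        if run:
            out.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)  # raises if Ghostscript fails
    return commands
```

The resulting files should contain page images only, so they could then go through Acrobat's batch OCR. Try it on a few files first and check the resolution and file size before converting a whole folder.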
Be well...