
Tagging an OCR'd document

dafinga
Registered: Sep 11 2007
Posts: 6

Hi folks,

I am trying to write a script that will search through a pdf and if the pdf was created via OCR, the script does nothing. Are there any Adobe attributes that can be searched for to determine the provenance of a document?

Thanks,

Pete

gkaiseril
Expert
Registered: Feb 23 2006
Posts: 4307
I just ran a test on a scanned text document, with the following results. Prior to performing OCR with Acrobat, no text could be found and JavaScript could not count any words. After performing OCR with Acrobat, words could be located with "Find" and JavaScript counted 181 words.
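
A word count like the one described above could be sketched as follows. It assumes the Acrobat JavaScript Doc object, which exposes `numPages` and `getPageNumWords()`; the function name `countWords` is only illustrative and is not taken from the original test.

```javascript
// Count the words Acrobat can extract from every page of a document.
// `doc` is expected to expose numPages and getPageNumWords(pageIndex),
// as the Acrobat JavaScript Doc object does (page indexes are zero-based).
function countWords(doc) {
  var total = 0;
  for (var p = 0; p < doc.numPages; p++) {
    total += doc.getPageNumWords(p);
  }
  return total;
}

// Inside Acrobat (for example, in the JavaScript console), `this` refers to
// the open document, so the count can be printed with:
//   console.println("Words found: " + countWords(this));
```

A scan with no text layer should report zero words here, while the same file after OCR should report a positive count.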

The above leads me to wonder if the PDF had been OCR'd and not just scanned.

George Kaiser

dafinga
Registered: Sep 11 2007
Posts: 6
You are saying that I could use the Find command as a test, and if JavaScript can find words then I can assume that the file has been OCR'd rather than just scanned?
gkaiseril
Expert
Registered: Feb 23 2006
Posts: 4307
One cannot get an OCR'd PDF without first scanning the document. To become OCR'd, the scanned document then requires additional processing. If you are using a product other than Acrobat, that product might not output an OCR'd PDF but just the image of the document as a PDF. Acrobat actually adds an invisible layer containing the text, so only the document image is seen.

So your scanned image may need additional processing to create an OCR'd PDF.

I should note that Acrobat Professional can perform this task, but a high-quality image will be needed.

George Kaiser

dafinga
Registered: Sep 11 2007
Posts: 6
Here is the scenario:

Some documents have been scanned to non-OCR PDFs. Others will have undergone an Acrobat OCR process. I need to write a script that will look at these documents and skip any that have already been OCR'd. If a document has not been OCR'd, we need to run the OCR process.

It appears from your comment that we can use the JavaScript word count to see if it finds any words. If we run the word count on a non-OCR'd document, we should see a zero result, and if we run it on an OCR'd document, we should get a nonzero word count.

Does that sound right?

Thanks

Pete
gkaiseril
Expert
Registered: Feb 23 2006
Posts: 4307
That is the easiest way I have found.
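
A minimal sketch of that skip-or-OCR decision, assuming the Acrobat JavaScript Doc object (`numPages`, `getPageNumWords()`). The helper name `needsOcr` and the `minWords` threshold are hypothetical and are not part of Acrobat. The OCR step itself is left out, since how it gets triggered depends on your setup.

```javascript
// Hypothetical helper: returns true when the document should be sent to OCR,
// i.e. when fewer than `minWords` words can be extracted from it.
// `minWords` is an assumed tuning knob and defaults to 1 (any word at all).
function needsOcr(doc, minWords) {
  var threshold = (minWords === undefined) ? 1 : minWords;
  var found = 0;
  for (var p = 0; p < doc.numPages; p++) {
    found += doc.getPageNumWords(p);
    if (found >= threshold) {
      return false; // enough extractable text: treat as already OCR'd
    }
  }
  return true; // little or no extractable text: treat as a plain scan
}

// Inside Acrobat, `this` refers to the open document:
//   if (needsOcr(this)) { /* run the OCR step here */ }
```

The loop stops at the first page where the running total reaches the threshold, so documents that are already OCR'd are not scanned in full.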

George Kaiser