OCR

JBM1982
Registered: Apr 3 2007
Posts: 37
Answered

Is there any way to search a large collection of PDFs and find all of the documents that haven’t been OCR’d?

My Product Information:
Acrobat Pro 8.1.2, Windows
gkaiseril
Expert
Registered: Feb 23 2006
Posts: 4307
You could create a Batch Process that uses JavaScript to count the words in each PDF and then prints the PDF name or path to the JS console, or to a PDF report that lists which PDFs have or have not been OCR'd. An OCR'd file will contain text, either visible or hidden.

George Kaiser

JBM1982
Registered: Apr 3 2007
Posts: 37
I am not very familiar with JavaScript. Could you provide some code to start with?
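
A minimal sketch of the word-count idea suggested above, not a tested batch sequence. It assumes the Acrobat JavaScript document members numPages, getPageNumWords() and path, plus console.println(); the helper names countWords and ocrReportLine are made up for illustration. Note that a word count only shows whether a file contains text at all, so a PDF created directly from a word processor would also be reported as having text.

```javascript
// Sketch only: counts words on every page of a document-like object.
// In Acrobat, `doc` would be the document the batch sequence is processing.
function countWords(doc) {
  var total = 0;
  for (var p = 0; p < doc.numPages; p++) {
    total += doc.getPageNumWords(p);
  }
  return total;
}

// Hypothetical helper: builds one report line for the JavaScript console.
function ocrReportLine(doc) {
  var words = countWords(doc);
  var status = words > 0 ? "has text" : "NO TEXT (probably not OCR'd)";
  return doc.path + ": " + words + " word(s), " + status;
}

// Inside an Acrobat batch sequence, `this` is the current document.
// The guard keeps the snippet harmless when it runs outside Acrobat.
if (typeof app !== "undefined" && typeof this.getPageNumWords === "function") {
  console.println(ocrReportLine(this));
}
```

Pasted into the "Execute JavaScript" command of a batch sequence, this should print one line per file to the JavaScript console; files reported with zero words are the candidates for OCR.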
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
G'day JBM1982,

You can accomplish this with a Batch Sequence that runs a Preflight "Custom check" for "Invisible text objects".
The custom check finds text objects which use text rendering mode 3 (invisible text).
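
To illustrate what "text rendering mode 3" means at the file level: in a PDF page content stream the mode is set with the Tr operator, so invisible OCR text shows up as "3 Tr". The sketch below is only an illustration of that idea, not how Preflight works internally. The function name usesInvisibleText is hypothetical, and it assumes you already have a decoded (uncompressed) content stream as a string. It is a naive pattern match and could be fooled by the same characters appearing inside a text string.

```javascript
// Returns true if a *decoded* PDF content stream sets text rendering mode 3
// (invisible text), i.e. contains the operand 3 followed by the Tr operator.
function usesInvisibleText(contentStream) {
  // "3", whitespace, "Tr", delimited by whitespace or the ends of the stream.
  return /(^|\s)3\s+Tr(\s|$)/.test(contentStream);
}

// Example streams (hand-written for illustration):
var ocrLayer = "BT /F1 12 Tf 3 Tr (hello) Tj ET";   // hidden OCR text
var imageOnly = "q 612 0 0 792 0 0 cm /Im0 Do Q";   // scanned image, no text
```

With these samples, usesInvisibleText(ocrLayer) is true and usesInvisibleText(imageOnly) is false.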

First you build the Preflight Profile.
Second you build the Batch Sequence.
Configure the Batch Sequence for success/error report generation as desired.
Run the Batch Sequence.
The Summary report will identify errors, warnings, and notifications.
In the Preflight Profile you'll have a choice of which of these alerts you want.
Note that a Preflight Profile "Information" alert is a "Notification" in the summary report.

For a PDF whose pages contain only an imported scanned image, zero notifications indicates that OCR has not been applied.

For a tutorial on Preflight Profiles see Donna Baker's article.
[url]http://www.acrobatusers.com/tutorials/2007/digging_into_preflight/[/url]

For tutorials about Batch Processing go to
[url]http://www.acrobatusers.com/tutorials/[/url]
Under Advanced search, for List tutorials by areas of interest, select Batch Processing from the drop-down menu.

To play right away...
Links to test files, batch sequence file, and preflight profile file are below.
Because the batch sequence file is set up for the trials on my Windows box you'll have to create some folders on yours if you want this to work "out of the box".
Create a folder "_01a" in the root of c:\
In "_01a", create "JBM1982".
In "JBM1982", create folders "success_folder" and "error_folder".
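
If you would rather script the folder setup than click through Windows Explorer, here is a small sketch for Node.js (this is ordinary JavaScript run outside Acrobat, which is an assumption on my part, not something the batch sequence needs). The function name createTrialFolders is made up; the folder names are the ones listed above.

```javascript
const fs = require("fs");
const path = require("path");

// Creates <root>/_01a/JBM1982/success_folder and .../error_folder.
// Returns the two folder paths. Existing folders are left untouched.
function createTrialFolders(root) {
  const base = path.join(root, "_01a", "JBM1982");
  return ["success_folder", "error_folder"].map(function (name) {
    const dir = path.join(base, name);
    fs.mkdirSync(dir, { recursive: true }); // also creates missing parents
    return dir;
  });
}

// Example (Windows): createTrialFolders("C:\\");
```

Creating the folders by hand works just as well; the script only saves a few clicks if you repeat the trial on several machines.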

Download the file "Locate_OCR_Text.kfp" and place it in folder "JBM1982".
You can have the *.kfp file anywhere (you'll be importing it into Acrobat) but, for convenience, drop it into "JBM1982".

Download the following files to folder "JBM1982":
summary_01.pdf
--| Note that this file has links to the PDFs in the "success_folder".
--| Once all is assembled (and batteries installed) view this file.
ScanNoOCR.pdf
ScanYesOCR.pdf

Download the following files to folder "success_folder":
ScanYesOCR_report.pdf
ScanNoOCR_report.pdf
--| Look over each. In the Layers pane you will see empty square(s). Click in them.

Download the Batch Sequence file (Preflight_Htext3.sequ) to:
C:\Documents and Settings\[user name]\Application Data\Adobe\Acrobat\8.0\Sequences\
When you open Acrobat this file will be "known" to the application.

From the command menu,
Advanced > Document Processing > Batch Processing
Scroll down to select the "Preflight_Htext3.sequ" sequence.

BUT, don't run the sequence just yet. You'll need to import the Preflight Profile first.
Select Advanced > Preflight
From the Options drop-down, select "Import Preflight Profile..."
Browse to where you downloaded the *.kfp file (Locate_OCR_Text.kfp)
(In c:\_01a\JBM1982\ ...no?)
Click Open.
The profile will be in the "Imported profiles" in the Preflight dialog window.
Close the Preflight dialog window.

After looking over the PDF files you downloaded, copy some of your own PDFs into JBM1982.
Run the Batch Sequence, select these files, check out the summary report.

Play with the Batch Sequence configuration to customize where reports go, etc.

To get the files:
summary_01.pdf
[url]http://daka630.tripod.com/pdf/jbm1982/summary_01.pdf[/url]
ScanNoOCR.pdf
[url]http://daka630.tripod.com/pdf/jbm1982/ScanNoOCR.pdf[/url]
ScanYesOCR.pdf
[url]http://daka630.tripod.com/pdf/jbm1982/ScanYesOCR.pdf[/url]
Locate_OCR_Text.kfp
[url]http://daka630.tripod.com/pdf/jbm1982/Locate_OCR_Text.kfp[/url]
Preflight_Htext3.sequ
[url]http://daka630.tripod.com/pdf/jbm1982/Preflight_Htext3.sequ[/url]
ScanNoOCR_report.pdf
[url]http://daka630.tripod.com/pdf/jbm1982/ScanNoOCR_report.pdf[/url]
ScanYesOCR_report.pdf
[url]http://daka630.tripod.com/pdf/jbm1982/ScanYesOCR_report.pdf[/url]

Whew! I think I've got it all there.

Be well...

JBM1982
Registered: Apr 3 2007
Posts: 37
Thanks for the helpful information provided. I will give it a go and see how I make out with this.

J
JBM1982
Registered: Apr 3 2007
Posts: 37
daka630,

Upon importing "Locate_OCR_Text.kfp" I received an error: “unable to import file”. Is that because, during the download, it wanted to save as an XML file? I believe I saved it as XML and then changed the extension after it finished downloading.
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Hi JBM1982,
*.kfp/*.xml...

As my grandson says, "oops"...
Sorry, I failed to mention that this would be expected behavior.

Go ahead and, on download, save the *.kfp as *.xml; then, with Windows Explorer, browse to the file and rename it with the *.kfp extension.

Be well...

JBM1982
Registered: Apr 3 2007
Posts: 37
I will give it another shot and let you know the outcome.
Thanks
JBM1982
Registered: Apr 3 2007
Posts: 37
daka630,

I think I know why the import error message shows up: I do not have full rights to this workstation. Is that correct? If so, I'll have to talk to my System Administrator. I am still getting that error message, and that is the only thing I can think of that could be the reason.
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
JBM1982,
I suspect that you are correct. It is fairly typical for companies to have some means of preventing indiscriminate downloading of files inside the company firewall.

With that said, let's try this -
[url]http://daka630.tripod.com/pdf/jbm1982/Locate_OCR_Text.zip[/url]
The kfp compressed into a zip file.

Some alternatives for obtaining the kfp.
--| Download at home, put on a USB stick. Bring that in and copy the file over.
--| See if IT will download/scan/put on your desktop.

Fall back:
I'll cobble together a step-by-step for building the preflight profile & post it to this thread.

Be well...

daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Hello JBM1982,
I suspect it has rough edges but perhaps this PDF will help (it is a "how-to").

[url]https://share.acrobat.com/adc/document.do?docid=05f2aaa7-b542-4edd-8a80-1e12e0a414b4[/url]

Be well...

JBM1982
Registered: Apr 3 2007
Posts: 37
Thanks for the instructional guide. I will give this a try and let you know how it works out. Have a nice weekend.
JBM1982
Registered: Apr 3 2007
Posts: 37
That seemed to do the trick.
Thanks so much for your help!!!
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
That is good news!

[url=http://img337.imageshack.us/my.php?image=face02biscb3.png][img]http://img337.imageshack.us/img337/7809/face02biscb3.th.png[/img][/url]

Be well...

techiewriter_H
Registered: Sep 19 2008
Posts: 5
Hi,

I'm new to using Acrobat. Is there a spell check feature in Acrobat for OCR'd files (PDF image over text / PDF with hidden text), and can you correct the hidden text?

Thank you.