
How to RUN OCR programmatically using vb.net or c#.net

sarathy56
Registered: Jun 20 2010
Posts: 4

Hi to All,

I have a scanned PDF and want to make it editable, so I want to run OCR programmatically instead of running it manually through Acrobat. Kindly help me with this.

Rgds,

Parthasarathy.S

My Product Information:
Acrobat Pro 7.0.5, Windows
thomp
Expert
Registered: Feb 15 2006
Posts: 4411
The OCR capability in Acrobat is accessed through menu items that display an interactive dialog. This means it can't easily be automated from an external program through the IAC (Interapplication Communication) interface.

OCR is, however, one of the batch process commands. So if you have a lot of scanned files, you can do them all at once with a batch sequence. Other than this, it would take a lot of jumping through hoops to automate OCR. For example, since OCR is a batch command it has an AVCommand structure, so it can be automated from a plug-in. You could write a plug-in that allows you to run OCR from an external VB program.

Thom Parker
The source for PDF Scripting Info
[url=http://www.pdfScripting.com]pdfscripting.com[/url]

The Acrobat JavaScript Reference, Use it Early and Often
[url=http://www.adobe.com/devnet/acrobat/javascript.php]http://www.adobe.com/devnet/acrobat/javascript.php[/url]

The most important JavaScript development tool in Acrobat
[url=http://www.pdfscripting.com/public/34.cfm#JSIntro][b]The Console Window (Video tutorial)[/b][/url]
[url=http://www.acrobatusers.com/tutorials/2006/javascript_console][b]The Console Window(article)[/b][/url]


thomp
Expert
Registered: Feb 15 2006
Posts: 4411
A batch process will run on the files you specified when setting it up. If the input was a folder then it will run on all files in the folder. To avoid processing the same file the batch needs to save the results to a different folder. Then you move or remove the processed files from the "In" folder.
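The "move or remove the processed files" step can be scripted outside Acrobat. Below is a minimal Python sketch, not part of Acrobat or its batch feature. It assumes the batch sequence saves each result under the same file name in the output folder; the function name `archive_processed` and the third "done" folder are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path


def archive_processed(in_dir, out_dir, done_dir):
    """Move every PDF in in_dir that already has a same-named file in out_dir.

    Assumption: the batch sequence writes its result to out_dir using the
    source file's name. Files without a result stay in in_dir for the next run.
    Returns the names of the files that were moved.
    """
    in_dir, out_dir, done_dir = Path(in_dir), Path(out_dir), Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(in_dir.iterdir()):
        if not src.is_file() or src.suffix.lower() != ".pdf":
            continue  # skip subfolders and non-PDF files
        if (out_dir / src.name).is_file():
            shutil.move(str(src), str(done_dir / src.name))
            moved.append(src.name)
    return moved
```

Run it after each batch, for example `archive_processed("In", "Out", "Done")`, so the next batch only sees files that have not been processed yet.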


lacro
Registered: Jul 21 2010
Posts: 3
I am not sure I understand. According to what I read, I would have to re-ocr already ocr'd files?

Let's assume I have 5000 PDF files located in a directory with 50 subdirectories. If I were to run a batch OCR on the directory, it would OCR all of the files in the subdirectories as well. Now if I were to add some random, image-only PDF files to some of the subdirectories, I would have to re-OCR EVERYTHING?

That might be fine for 50 files, but when you might have 10000 pdf files, it seems that there must be some way to identify which files have already been OCR'd and skip them and only OCR those files that have never been OCR'd.
SpinEcho
Registered: Aug 13 2010
Posts: 1
Hi, this is my first post on this site, and I hope I've found the right place. I own a copy of Acrobat Professional and have set up batch OCR. I direct the batch process to work on all pdf files contained in a given folder and everything works great until the end of each document is reached. At this point, I get the "Recognize Text - Settings" dialogue box and have to press OK each and every time. This is hardly the automated process I'd like since I can't walk away from my PC and let the process run by itself. Anyone have any suggestions?? Thanks in advance!
thomp
Expert
Registered: Feb 15 2006
Posts: 4411
In the Batch command list there is a check box next to each of the commands for enabling interactive operation. Uncheck it.


daka630
Expert
Registered: Mar 1 2007
Posts: 1420
lacro,

Just as with a paper process, you'd use an in-box and an out-box.
Unlike the paper process you'd not have to move out-box content to the file cabinets for storage/retrieval.
The electronic file cabinet (a share on a network) would be the out-box and the storage/retrieval location.

For a large population of scanned images in PDF, a server product (rather than Acrobat) is often more cost effective.
AdLib or ABBYY FineReader server products come to mind, but there are others that work well.
Provide the scan output to the server product which outputs OCRd PDF to an "out" location.

For an existing PDF collection such as in your example, replicate the directory structure. Use Thom's example of a Batch Sequence to OCR the files and move the output PDF to the replicated directories. From then on, only new PDF lands in the original location.
For new input, run the Batch again. Each new PDF is processed by Acrobat and moved to the replicated location.
Replicated directory location becomes the storage/retrieval location.
This avoids running OCR over and over again on the same PDF files.
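As a rough illustration of the replicated-directory idea, here is a minimal Python sketch. It is not an Acrobat feature; it only mirrors the folder tree and lists which PDFs in the original location have no counterpart yet in the replicated one. The function names `mirror_directories` and `pending_pdfs` are hypothetical, and the sketch assumes OCR output keeps the same relative path and file name as its source.

```python
from pathlib import Path


def mirror_directories(in_root, out_root):
    """Recreate in_root's subdirectory tree under out_root (no files are copied)."""
    in_root, out_root = Path(in_root), Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)
    for entry in in_root.rglob("*"):
        if entry.is_dir():
            (out_root / entry.relative_to(in_root)).mkdir(parents=True, exist_ok=True)


def pending_pdfs(in_root, out_root):
    """Return relative paths of PDFs under in_root that are missing under out_root.

    Assumption: an OCR'd file is stored at the same relative path in out_root,
    so a missing counterpart means the file has not been processed yet.
    """
    in_root, out_root = Path(in_root), Path(out_root)
    pending = []
    for src in in_root.rglob("*"):
        if src.is_file() and src.suffix.lower() == ".pdf":
            rel = src.relative_to(in_root)
            if not (out_root / rel).exists():
                pending.append(rel)
    return sorted(pending)
```

Calling `pending_pdfs("scans", "scans_ocr")` (both folder names are placeholders) gives the list of files that still need a batch run, which answers the "which ones have never been OCR'd" question without opening any PDF.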

More an organization issue than a software issue.

Be well...
