I have read quite a bit about this and am truly astounded that I would have to re-OCR every file each time I run a batch OCR on a directory. I am posting here because I am hoping I missed something and that a feature that seems so fundamental has not simply been left out.
Let's assume I have 10,000 PDF files located in a directory with 50 subdirectories that have 200 files each. If I were to run a batch OCR on the main directory, it would OCR all of the files in the subdirectories as well. I know because I have done it. Now, if I were to add some random, image-only PDF files to some of the subdirectories, would I have to re-OCR EVERYTHING?
That might be fine for 50 files, but when you might have 10,000 PDF files, it seems that there must be some way to identify which files have already been OCR'd, skip them, and OCR only those files that have never been OCR'd. Otherwise, it would take far too long to OCR the files each time.
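To make the idea concrete, here is a rough sketch in Python of the kind of bookkeeping I have in mind. It is not a feature of any OCR product; it only assumes I could keep a small manifest file next to the PDFs and hand the resulting list to whatever tool does the OCR. The manifest name `.ocr_manifest.json` and the helper names are made up for illustration.

```python
import json
import os

MANIFEST_NAME = ".ocr_manifest.json"  # hypothetical file name, stored in the root directory


def load_manifest(root):
    """Return the saved {relative path: [size, mtime]} map, or {} if none exists yet."""
    path = os.path.join(root, MANIFEST_NAME)
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_manifest(root, manifest):
    """Write the manifest back to disk."""
    with open(os.path.join(root, MANIFEST_NAME), "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)


def fingerprint(path):
    """Cheap identity for a file: its size and modification time."""
    st = os.stat(path)
    return [st.st_size, st.st_mtime_ns]


def find_pending(root, manifest):
    """Walk root and all subdirectories; list PDFs that are new or changed."""
    pending = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if manifest.get(rel) != fingerprint(full):
                pending.append(full)
    return pending


def mark_done(root, manifest, path):
    """Record a file AFTER it has been OCR'd, since OCR rewrites the file."""
    manifest[os.path.relpath(path, root)] = fingerprint(path)
```

With something like this, each batch run would only need to OCR the files returned by `find_pending`, call `mark_done` on each one afterwards, and then call `save_manifest`. Files that have not changed since the last run would be skipped.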
Ideas?
Thank you.
George Kaiser