
Acrobat Professional-Removing OCR

jrempel
Registered: Feb 13 2008
Posts: 5

I have scanned several large documents and run the text recognition function on them to make them searchable. I have been very impressed with the accuracy of the OCR function; however, the nature of the project has changed slightly and I now need the "original" files back, containing only the scanned image without the OCR information. How can I remove this very sizable layer of information from my documents?

My Product Information:
Acrobat Pro 8.1.2
tplumer
Expert
Registered: Dec 1 2005
Posts: 122
There isn't any direct way that I know of. However, you can export the PDF file to TIFF files and open them back into Acrobat. The quickest way to do that would be to use Combine Files. One note: the text information should not add significantly to the size of the PDF. You might try the PDF Optimizer as a tool for shrinking the file by downsampling and compressing the image data.

I am a long-time Acrobat user, an employee of Adobe Systems, and Maine native. I have created training videos for Total Training, consulted with people to help them better use Acrobat, and developed new business for Adobe as a Business Development Manager

jrempel
Registered: Feb 13 2008
Posts: 5
I was hoping the TIFF option wasn't the only way, but I suspect you are quite correct. I have found that adding the OCR information makes my file about 10x larger. The optimizer helps a bit, but the result is still about 9x the size of the original file. However, I recently experimented with the ABBYY Find program to add the OCR information to my PDF files, and that program created an even larger file, so I suspect the file size is something I will just have to work with.
tplumer
Expert
Registered: Dec 1 2005
Posts: 122
Holy Cow! 10x larger is a big surprise to me. Open the PDF in the PDF Optimizer and click Audit Space Usage. I am really curious to know what is causing that. Are you using Searchable Image (Exact) as your output style?


jrempel
Registered: Feb 13 2008
Posts: 5
I have been using the Searchable Image, not the Searchable Image (Exact) option. I suspect the big "problem" is that these are historic documents that have been produced with all kinds of fonts, including original handwriting and signatures, not to mention the cracks and other wear that have occurred over the years. My directive is to ensure that the image of the document is in no way altered by making it searchable.
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Quote:
...ensure that the image of the document is in no way altered by making it searchable.
Then you will want Searchable Image (Exact).
Avoid downsampling and lossy compression schemes.
Both remove data from the image, thus altering it.
The image would then no longer be a 1:1 representation of the original hard copy.


With Acrobat Pro 8.1, you can remove the characters generated by OCR by using Examine Document:

Document > Examine Document
In the Examine Document dialog, leave Hidden text on pages checked.
Click Remove all checked items > OK > Save.

Be well...

croybike
Registered: Sep 4 2009
Posts: 1
I have Acrobat Pro 8.1. I can't seem to get anything to remove the renderable text/OCR from a PDF doc.

DOESN'T WORK:
-Document, Examine Document – only one box is checkable (deleted hidden page and image content); the grayed-out 'delete hidden text' line mentioned in this thread cannot be checked
-File, Print, PDF – new file still had the OCR
-File, Print, ABXPDF writer (used for US patent office to scrub PDF files) – new file still had the OCR
-File, Export, PDF/A – did not work
-File, Create PDF from file, picking the PDF doc with OCR – did not work; nothing happened after selecting this option and picking the file
-Batch Processing to remove all "Document Data" and "User Data" – this didn't create any new file when run
-TouchUp tool mentioned in another post here – that just removed all the text, so it didn't work

SOMEWHAT WORKS:
-File, Export, TIFF works to create a TIFF file which can then be viewed in Acrobat, but it creates one file for every page... that's a hassle for large docs, since everything has to be converted and merged back into one PDF.
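
If you do go the TIFF route, one practical snag when rebuilding a long document is page order: a plain alphabetical sort puts page 10 before page 2. Below is a minimal Python sketch of a number-aware sort. The file names are hypothetical (Acrobat's actual export naming may differ), and the sketch only orders the names; it does not merge anything.

```python
import re


def page_sort_key(filename):
    """Sort key that compares embedded numbers numerically, not as text."""
    # re.split with a capturing group alternates text and digit runs,
    # e.g. "scan_Page_10.tiff" -> ["scan_Page_", "10", ".tiff"].
    parts = re.split(r"(\d+)", filename)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


# Hypothetical names for per-page TIFF exports.
exported = ["scan_Page_10.tiff", "scan_Page_2.tiff", "scan_Page_1.tiff"]
ordered = sorted(exported, key=page_sort_key)
print(ordered)  # ['scan_Page_1.tiff', 'scan_Page_2.tiff', 'scan_Page_10.tiff']
```

The ordered list can then be added to Combine Files (or any other merge step) in that sequence.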

This shouldn't be so difficult I would think. Or I'm terribly Acrobat-incompetent. Any ideas?
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Hi croybike,
Just to clarify...

Renderable text does not equal OCR text.
Renderable text is text that is part of the PDF page content.
OCR text is a hidden font type that exists in a separate layer in the PDF page.
With Acrobat 8 or 9, Examine Document provides a means of removing the OCR text, not renderable text.
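
For readers who want to see the distinction at the file level: in many OCR'd scans the hidden text is ordinary page text drawn with PDF text rendering mode 3 (neither filled nor stroked), which is why it is searchable but not visible. The Python sketch below is only an illustration under that assumption. The helper names and the sample content streams are made up for the example, it expects an already decompressed content stream, and the simple pattern match does not handle every case (for instance, "3 Tr" appearing inside a text string).

```python
import re

# Matches the PDF text rendering mode operator, e.g. b"3 Tr".
_TR_OPERATOR = re.compile(rb"(?<![\w.])(\d)\s+Tr\b")

INVISIBLE_MODE = 3  # PDF text rendering mode 3: neither fill nor stroke


def text_rendering_modes(content_stream):
    """Return every text rendering mode set in a decompressed content stream."""
    return [int(m.group(1)) for m in _TR_OPERATOR.finditer(content_stream)]


def has_invisible_text(content_stream):
    """True if the stream switches to the invisible rendering mode anywhere."""
    return INVISIBLE_MODE in text_rendering_modes(content_stream)


# Made-up samples: a scanned page image followed by hidden text, and a
# page with ordinary visible text (no Tr operator, so the default mode 0).
ocr_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q BT 3 Tr /F1 10 Tf 72 700 Td (scanned words) Tj ET"
plain_page = b"BT /F1 12 Tf 72 700 Td (visible words) Tj ET"

print(has_invisible_text(ocr_page))    # True
print(has_invisible_text(plain_page))  # False
```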

fwiw, I've used this feature to remove OCR on PDFs of scanned images of textual documents provided by these OCR "engines":
Acrobat, Canon imageRUNNER units, Adobe Capture Cluster, and the AdLib Server product.

Regarding renderable text, a PDF created with a reasonable degree of fidelity to the PDF Reference (now an ISO standard) will support use of Acrobat's TouchUp Text tool.

Be well...


tal
Registered: May 19 2010
Posts: 3
Hello,

Just reading over this thread and believe I am having the same problem.

My problem is that I OCR'd many pdfs so that I could search through them. It wasn't until I was done that I realized the mistake I made:

For the output style I used 'Searchable Image.' I did not realize this would change the look of the document: Adobe actually converted the text to a new font. The documents are old scans and the result looks terrible. Font and size differ from word to word in some cases.

I don't have access to the scans/original files, so I have ruined my only copy unless there is some way to revert to the original. I cannot present the PDF in its current form to my client.

Once I have it back in the original format on output style I will use 'searchable image exact' which will allow me to search the document but not alter the look of it, correct?

Any hints/help you may have would be appreciated! Thank-you

PS I tried to use the 'previous version' function on Windows but there weren't any available so that is not an option.
rbogie
Registered: Apr 28 2008
Posts: 432
tal wrote:
Once I have it back in the original format on output style I will use 'searchable image exact' which will allow me to search the document but not alter the look of it, correct?
Answer: correct. "Searchable Image" applies fixes to the image bitmap, such as deskew. Such fixes are evidently not what you want.

Also, do not use "ClearScan" if you want to retain the original image, because it will erase all or portions of the image and deliver the OCR content in a visible font, making the document appear crisper.
twesner
Registered: Jun 21 2011
Posts: 19
Hi,

I think I have a similar problem. I have a PDF of a textbook that I wrote. I don't know what the original software was that created the material, but I had the company download the book in PDF. This was in 2004 if that makes any difference.

I had the company, now out of business, download the 2 versions of the textbook. One is the student version and the other is annotated instructor's edition (AIE). In the AIE version, some of the pages do not show the annotations, which are for the most part answers in the exercise sets.

When I use Examine Document and view Hidden text > Show preview > Show both hidden and visible text, the page is shown the way the AIE should be, other than color. What I would like to do is to save a copy of the AIE that shows the hidden text when it is opened and will also print the hidden text. There are no layers, and I have tried some different commands in Preflight, but have not been successful. I could use some guidance on how to make the hidden text show and stay shown.

If you would like to view a sample of the PDF that I am referring to, you can go to http://bjkp.com > eBooks and download the sample. Page 54 of the sample material would be a good example to look at. The document is not locked.

Thanks, Terry

twesner
Registered: Jun 21 2011
Posts: 19
Hi,

This is what I have so far. Below are a few fixes that work on some of the pages. These fixes may be all that someone reading this thread needs. I can fix the documents in the section on http://bjkp.com labeled eBooks, but the documents in the sections Educational Software and Print Textbooks will not work with any of these methods. I can see the AIE text material, but I cannot show it. Does anyone have a fix for those pages?

The most perplexing pages are clean-looking text in the Print Textbooks section. They have no buttons and, as with the others, the AIE is hidden below. You can go to http://bjkp.com > Print Textbooks > Elementary Algebra 4th Edition with Applications > El Alg 4th Textbook. Three examples to look at would be page 54 (78 of 208), 103 (127 of 208), or 127 (151 of 208). The document is not locked.

You can also go to http://bjkp.com > Educational Software > Educational Software and download the sample entitled Beginning Algebra 5.0 Chapter 3. Two examples to look at would be page 206 (22 of 46) or 215 (31 of 46). The document is not locked. If you can get the AIE to show in the Print Textbooks section, I think the same steps would work for the Educational Software section. In fact, if you can get the print section to work, I would not need to use the Educational Software to make an AIE version.

Some methods that work on the eBook section are:

Method 1 – In my situation, the AIE materials are covered with white-filled empty text annotations, so if I print the file with "Document" instead of "Document and markups" selected in the print dialog, the AIE materials all appear. If, when I go to print, I change from a printer to Adobe PDF, the file is saved without the white boxes and with the AIE showing.

Alternate to method 1 – The 'hidden' text is actually text with a visible font that is masked by a white-filled text box (a text box that has no text). Text boxes are comments. Open the comments panel, where you may examine, sort, show (filter), search and delete any or all comments. If the goal is to delete all masking comments, with careful sorting and filtering you can identify and select them. You can delete them one by one, by subsets, or all at once. With Ctrl+A followed by Delete, you can select and delete all filtered and displayed comments. You will need to experiment and get the hang of filtering and sorting masking comments for deletion.
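
To make the filtering rule concrete, here is a small Python sketch of the selection logic only. It does not read a real PDF: each comment is modeled as a plain dictionary, and the key names (Subtype, Contents, and C for the fill color) are a simplified stand-in chosen for illustration, on the assumption that the masking boxes are text box (FreeText) comments with no text and a white fill.

```python
WHITE = (1.0, 1.0, 1.0)


def is_masking_comment(annot):
    """True for an empty, white-filled text box (the assumed mask pattern)."""
    return (
        annot.get("Subtype") == "FreeText"
        and not (annot.get("Contents") or "").strip()
        and tuple(annot.get("C", ())) == WHITE
    )


def strip_masks(annots):
    """Keep every comment except the ones identified as masks."""
    return [a for a in annots if not is_masking_comment(a)]


# Made-up page contents: one mask, one real note, a link and a button.
page_annots = [
    {"Subtype": "FreeText", "Contents": "", "C": [1, 1, 1]},
    {"Subtype": "FreeText", "Contents": "Check this answer", "C": [1, 1, 0]},
    {"Subtype": "Link"},
    {"Subtype": "Widget"},
]

kept = strip_masks(page_annots)
print([a["Subtype"] for a in kept])  # ['FreeText', 'Link', 'Widget']
```

The point of testing all three conditions together is that real notes, links and buttons survive, which is the same outcome careful filtering in the comments panel is aiming for.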

Alternate to method 1 – A fix that works on some of the pages is to choose the Object Select Tool, press Ctrl+A so that the fields get selected, and press Delete; then use the TouchUp Object tool to delete the buttons. Unfortunately, this only works on some of the pages.

Method 2 – To remove all the annots from the PDF, use Preflight > single fixups > search for "annot" to pull up the "remove all annotations" fixup, and run it. This will also remove buttons and links, but you can create a new Profile that only reacts to text annotation types if you need to.

Best regards, Terry