This tutorial shows you how to work with the features in Acrobat 9.
Scan to PDF and OCR seem like a straightforward workflow, and they certainly can be. On the other hand, there are situations where a scanned document may be visually disappointing, and running the OCR process results in a confused and illegible clump of letters, symbols and strange character strings. In this article, I’ll show you the outcome of some experiments in digital capture and scan to PDF, as well as some scanning and OCR tips and techniques.
Not many people have the time to putter around with scan settings and images. You want the work done now, with perfect results. Fortunately, tinkering is what I do best, and this article takes puttering to a whole new level.
Although you’d normally use the scan optimization settings for generating a PDF file from a scanner, did you know you could use the settings for any image? Or how about a document where you want both perfect text and a perfect image without compromising either? You can do it. Have you ever considered scanning a negative or slide? Read on for some tips for that, too.
Note: For instruction and tips on how to perform a scan and choose filter and compression options, check out my January 2009 article, Troubleshoot Scanning and OCR.
Whether you’ve got a TIF image you're converting to PDF, or a photo you’re scanning to PDF, you can work with the settings separately from the scan process.
To scan a document into Acrobat in its raw state, choose Create > PDF From Scanner > Custom Scan to open the dialog box. Click Options to open the Optimization Options dialog box, choose Lossless compression, and turn off the filters (Figure 1). Then proceed with your scan.
Figure 1: Deselect the scan correction settings.
Keep in mind that if you’ve got a 2x3-inch original image at 200dpi (like my sample files), you’re not going to have outstanding PDF results. You might want to try to improve the source image in Photoshop using filters such as Unsharp Mask, Despeckle, and so on, as well as correcting features like contrast and levels. Although some of the available optimization settings in Acrobat can perform corrections such as noise removal (Despeckle), the settings aren’t as customizable as those in an image editor like Photoshop.
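To see why a 2x3-inch original at 200dpi gives the OCR engine so little to work with, it helps to run the numbers. The following is a minimal sketch, not anything Acrobat exposes: the function names are made up for illustration, and the 6-point type size is an assumption about small brochure text rather than a measurement of my sample files.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a scan of the given physical size."""
    return round(width_in * dpi), round(height_in * dpi)


def text_height_px(point_size, dpi):
    """Approximate em height in pixels of type set at point_size (1 point = 1/72 inch)."""
    return point_size * dpi / 72


# A 2x3-inch original scanned at 200dpi.
width, height = pixel_dimensions(2, 3, 200)
print(width, height)  # 400 600

# Assumed 6-point type at the same resolution.
print(round(text_height_px(6, 200), 1))  # 16.7
```

The em height includes the space above and below the letters, so the lowercase characters themselves cover noticeably fewer pixels than that figure suggests.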
The left column in Figure 2 shows the before images of a scanned paper document. The scan source was a century-old brochure proving provenance for an antique piece, so it wasn’t in pristine condition. Contrast that with the column at the right, where the uncorrected scan was adjusted and corrected in Photoshop.
Figure 2: The century-old original.
First of all, let’s check out how Acrobat deals with the original images. As you’d expect, the quality of the OCR matches that of the images—it’s not very good! In Figure 3, you’ll see the captured text using the Searchable Image Exact capture settings.
Figure 3: The poor-quality image produces poor-quality capture.
Acrobat interpreted the content on page 1, but on page 2, only the text above the cabinet image and the label below it were captured (Figure 4).
Figure 4: Composite image showing content captured from each page.
Now let’s see what happens when I capture the content from the copies of the image corrected in Photoshop (Figure 5). You’ll see most of the text has been captured (aside from the footer). However, the content on the page remains unusable.
Figure 5: Content captured from Photoshop-corrected pages.
For the final experiment, I’ll make changes to the images manually. Choose Document > Optimize Scanned PDF to open the dialog box. You’ll see the same list of filters as those offered when configuring the original scan (shown in Figure 1). The first step in optimizing involves defining the balance between file size and quality. Since my goal is to capture the maximum amount of detail, I set the slider at High Quality. Then I chose options from the listed filters one by one, applying each setting and saving the file (Figure 6).
Figure 6: Sequentially apply and evaluate optimization settings.
Tip: Once you’ve applied an optimization setting, you can’t undo it. To preserve the sequence of corrections, save the file after each attempt so you can revert to the previously saved version if necessary.
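If you’d rather not rely on remembering to use Save As, you can keep numbered copies outside Acrobat. This is only a sketch of one way to do it: `save_snapshot` and the `_stepNN` naming scheme are my own invention, not an Acrobat feature, and the script assumes you have already saved the PDF in Acrobat before running it.

```python
import shutil
from pathlib import Path


def save_snapshot(pdf_path, step):
    """Copy pdf_path to a sibling file tagged with the step number.

    Returns the path of the copy, e.g. brochure.pdf -> brochure_step03.pdf.
    The original file is left untouched.
    """
    src = Path(pdf_path)
    dest = src.with_name(f"{src.stem}_step{step:02d}{src.suffix}")
    shutil.copy2(src, dest)  # copy2 also preserves the file's timestamps
    return dest
```

Calling `save_snapshot("brochure.pdf", 1)` after the first filter, then with `2` after the second, and so on, leaves a copy of every stage that you can reopen if a later setting makes things worse. The file name here is a placeholder.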
After I ran the OCR and captured the text, the results were incomplete and the output garbled, as in previous attempts (Figure 7).
Figure 7: Manual optimization yields poor results.
So what has my experimentation shown us? First, the best appearance came from working with the correction and adjustment tools in Photoshop. And second, regardless of the file’s manipulation, the Acrobat OCR process can’t capture the content in a meaningful and useful way.
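I judged each capture by eye, but if you want a rough number to compare attempts, you can score the OCR output against a correctly typed copy of the same text. The sketch below uses Python’s standard `difflib` module; the `ocr_similarity` function and the two sample strings are hypothetical illustrations, not output from my experiments, and the ratio is a general string similarity measure rather than a formal OCR accuracy metric.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference, ocr_output):
    """Return a similarity ratio from 0.0 to 1.0 between two pieces of text.

    Case and runs of whitespace are ignored so that line breaks introduced
    by the capture do not count against the score.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    matcher = SequenceMatcher(None, normalize(reference), normalize(ocr_output))
    return matcher.ratio()


# Hypothetical strings: a correctly typed phrase and a garbled capture of it.
reference = "completeness of detail"
captured = "c0mp1etene55 of deta1l"
print(f"{ocr_similarity(reference, captured):.2f}")
```

A score of 1.0 means the captured text matches the reference after normalization. Lower scores mean more of the text was lost or garbled.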
Granted, my experiment was destined to fail, since it’s unlikely the OCR process can capture content from such small, indistinct text. If I need to save the content for indexing or searching, the solution is to correct the images in Photoshop and save them as PDF for the best display of the pages themselves. Then I have choices (and more experimentation) for associating the text with the images.
Open the file in Acrobat and run OCR. Then add tags. Open the Tags panel, and go through the tags, typing in the Actual Text (Figure 8). Unfortunately, the results from this experiment were mixed: sometimes locating a search term highlighted just the word, while other times the entire tag’s area was highlighted.
Figure 8: Inserting actual text into the tags produces mixed results.
The only way to guarantee that you have the exact content of a document, as well as the best image of your file, is to recreate the content from the pages and insert it as a new layer. There’s no doubt it’s time-consuming, but for special material, it’s worth the effort.
To start, rekey the text and save it as a PDF file. I used an InDesign file as the text source, with the page size and text location matching the layout of the document (Figure 9).
Figure 9: Reproduce the text content using the same configuration as the original image.
In Acrobat, open the Layers panel, and click Import as Layer to open the dialog box. Choose these settings:
Figure 10: Adjust the position of the new text layer.
It’s time to test the text. In the Find toolbar, type a word from the page. In the example, “completeness” shows a hit in the proper location on the page (Figure 11).
Figure 11: The search term is found at the proper location on the page.
Notice the highlight location shows slightly off the visible characters, as the hidden text layer doesn’t use the same font as the original page.
My final topic for today doesn’t involve the same exhaustive experimentation. Instead, let’s take a look at scanning transparent documents.
A flatbed scanner is designed for reflective source material, that is, a document that bounces light back to the scanner sensors. A basic scanner isn’t designed to scan transmissive objects, or those that allow light to pass through such as negatives or slides.
That doesn’t mean you can’t successfully scan a transparent source file, but you do have to make some adjustments.
Here are a few tips you can use to improve the quality of a scanned transparency:
Tip: A high-end professional drum scanner uses a high-power lens mode to increase the resolution as well as brighter lamps for better scanning. If you want to scan transparent source material such as negatives or slides on a regular basis yourself, look for a slide scanner.