ClearScan - buggy OCR - arbitrary spaces in words

2011-05-31 10:28:54

larabenic

Registered: Feb 11 2010

Posts: 5

I've got 400 pages from a document, scanned @ 600 dpi color. With Acrobat 9 I converted them to 1 PDF, internally all pages stored as JPEG2000, high quality. File size ~ 350 MB.

If I do OCR on this file (Acrobat 9 or 10) without ClearScan, everything is ok.

If I do OCR on this file (Acrobat 9 or 10) with ClearScan, the text looks perfect. But: if I copy and paste text (to the Windows Editor), most words contain spaces. I get "Abcd ef g hijklmnop" instead of "Abcdefghijklmnop". Of course, that also means I cannot find "Abcdefghijklmnop" if I try to search for it. Hence, over 90% of the OCR result is useless, because it is not searchable.

Now, for test purposes I exported 1 page (from the original 350 MB non-OCR'd PDF) to a new PDF. I did OCR on this 1-page file (with Acrobat 9) with ClearScan, and the above problem did not appear: no arbitrary spaces, text was ok and searchable.

Has nobody ever noticed this behaviour, although it affects both Acrobat 9 and 10?

My Product Information:
Acrobat Pro 10.0.2, Windows

2011-06-01 04:59:04

lkassuba

Registered: Jun 28 2007

Posts: 3636

What I believe you're experiencing is font issues when you copy/paste your text. ClearScan does not replace the font with your system fonts. Rather, a custom font is created to match the visual appearance of the pixels. Obviously this same font is not available to your Windows editor. Do you have any problems searching for words in your PDF? Have you tried doing a Save As > More Options > Text to see the result?

Lori Kassuba is an AUC Expert and Community Manager for AcrobatUsers.com.

2011-06-01 12:11:08

larabenic

Registered: Feb 11 2010

Posts: 5

----------------------------------------------------------------------------
Thank you Lori, you had a good nose... :-)) respect & thanks! I did some more experiments. Here are the answers to your questions and some amazing additional scientific findings about the perplexing behavior patterns of Acrobat:

(1)
(Q) a custom font is created to match the visual appearance of the pixels. Obviously this same font is not available to your Windows editor.
(A) I do understand rather well what ClearScan does. The characters of ClearScan's synthetic fonts should be mapped to ASCII values when I paste text into the Windows editor, as the editor doesn't even know what "font" means. The problem is NOT that ASCII characters are not displayed in the editor, or that they are displayed using a character that doesn't match what is displayed in Acrobat. But ............

(2)
............ the problem is: there are spaces in words (in the Windows editor or wherever else I paste the text). Spaces that cannot be seen originally in the pdf. In other words: the words, the expressions, the strings are split into parts. Let's say I can see a string "Abcdefg" in Acrobat, represented by the synthetic TrueType fonts that ClearScan generated. I mark "Abcdefg" with my mouse, copy it and paste it to an editor. What I can see in the editor is "Abcd e fg"; obviously there are some spaces in the word after copy+paste. And this does not only concern copy+paste ......................

(3)
(Q) ............ Do you have any problems searching for words in your PDF?
(A) Yes. When I search for "Abcdefg" in Acrobat I don't get any result. When I search for "Abcd e fg" instead, Acrobat does find it. Hence, the parsing mechanism behind Acrobat's search is the same as the mechanism behind copy+paste.
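
The search symptom can be reproduced outside Acrobat with a minimal Python sketch. It assumes the extracted text has already been pasted into a string; the helper name `spaced_pattern` is made up for this example and is not part of any Acrobat API. It builds a regular expression that tolerates whitespace between the letters of the search term, which is roughly what one would need to find a word in the broken extraction result.

```python
import re

def spaced_pattern(term):
    """Build a regex that matches `term` even if whitespace
    was inserted between any two of its letters."""
    letters = [re.escape(ch) for ch in term if not ch.isspace()]
    return re.compile(r"\s*".join(letters))

# Hypothetical text as it arrives after copy+paste from the ClearScan pdf.
extracted = "Some text with Abcd e fg in the middle."

# A plain substring search fails, just like the search in Acrobat.
print("Abcdefg" in extracted)

# The whitespace-tolerant pattern finds the fragmented word.
match = spaced_pattern("Abcdefg").search(extracted)
print(match.group(0) if match else None)
```

Note that such a pattern also matches across genuine word boundaries (two real words "Abcd" and "efg" would match too), so it is only a crude way to search text that was already extracted, not a fix for the pdf itself.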

----------------------------------------------------------------------------
I wonder what’s the source of these spaces.

(A)
I'm not sure if these spaces exist physically in the pdf (but are not visible while looking at the pdf) or if these spaces are generated because of some parsing issue in Acrobat, during the process of copy+paste or search. When I set the cursor somewhere in the text (in Acrobat) and press the cursor-right key several times, I can't be sure where (behind which letter) I am at that moment, but: I need to press it, let's say, 10 times to pass a word with, let's say, 7 letters. It seems there are physically some spaces there, but very thin spaces, not recognizable by eye.

(B)
Going deeper: if I go to the Acrobat navigation panel I can take a look at the internal contents of each page of a pdf. When I open the tree node "Text", normally I can see many sub-nodes there, each representing 1 word. In pdfs generated with ClearScan, most of these sub-nodes only contain parts of a word, which means (does it?) the words have been split into parts. I first thought that exactly explains the origin of the spaces. Actually, if a word was split into 3 parts and stored in 3 sub-nodes there, I have spaces between the 3 parts. Bingo! Bingo? Not really: often I get spaces between letters of one of those parts, too. So the suspicious arrangement of words in the content tree may be a useful finding, or maybe not.

(C)
Now, the final sensation.... Ladies and Gentlemen.... Acrobat switches from mad to not mad:
I tried your ingenious suggestion and stored the text content with Save As > More Options > Text. This delivers a .txt file without those "spaces" in words! Fine, so far. After that I went on doing some experiments and was flabbergasted: the "spaces issue" was gone completely. Copy+paste did not produce any spaces in words, search was functioning. I repeated that several times. NO JOKE: AFTER saving the text (with Save As > More Options > Text), Acrobat changes its behaviour and the suspicious spaces are gone. When I close the document and open it again, the bug is alive again. Well. Every time I want to use (copy/paste, search) a document in the future, I will first have to store a .txt file. Nice workaround. But rather ridiculous. BTW: I tried the same with Acrobat 7; same result. The spaces issue disappeared after saving the pdf as .txt!

----------------------------------------------------------------------------

---->
Remaining questions: is the pdf that was created with ClearScan somehow buggy? (Same issues using Acrobat 9.3 and 10 for creation.) Is it normal that words appear as split parts in the text tree if one uses ClearScan? Or is this inevitable in this case and not a bug? If yes, does that mean the reading/parsing mechanism of Acrobat is buggy? Why does this mechanism change its behaviour AFTER saving a .txt file once? Why does this issue not appear in single-page ClearScan pdfs?

2011-07-14 13:02:34

Ehrman

Registered: Jul 14 2011

Posts: 2

Lara, in troubleshooting the same problem I came across this post and, following your description, was able to reproduce the exact problem and results. Have you learned any other way to avoid or correct this bug in the ClearScan OCR? I cannot use searchable image OCR for my needs.

Best,
Ehrman

2011-08-08 18:41:09

larabenic

Registered: Feb 11 2010

Posts: 5

Hello Ehrman,
unfortunately I didn't find "the" final solution :-(

Nevertheless some additional technical aspects:

(1)
The sub-nodes described in (B) of my posting often contain only a part of a word. This is irrelevant concerning the "spaces" problem. Probably the separation into parts of words is just there to allow different Td and Tc operators (you have to delve into the pdf specification for a deeper understanding).

(2)
The suspect spaces definitely do not exist physically in the content stream in terms of character codes.

(3)
The cause of the spaces is the mechanism for text extraction (also used during search). This mechanism is not trivial because of characteristics of the pdf standard. Sometimes it has to guess whether a space has to be inserted or not. For some reason, this mechanism seems to fail very often in pdfs produced with ClearScan.

(4)
The problem did not appear in e.g. Foxit Reader as often as it appeared in Acrobat / Adobe Reader. Obviously the mechanism for text extraction depends on the implementation of the pdf reader used.

(5)
Storing the text content with Save As > More Options > Text actually seems to tag the pdf. Hence tagging the document has the same result as storing it as text. As far as I understood, tagging does not change the Tj/TJ elements in the pdf, so the reason the problem disappears after tagging seems to be that some internal changes in text positioning are made. I don't know.

(6)
Unfortunately, during tagging I got the following error message on some pages: "Acrobat was able to make this document accessible but found the following oddities: Some difficult pages were encountered requiring all graphics on those pages to be labeled as figures". On the affected pages all characters were invisible. Nevertheless I could copy & paste the underlying text. On some other pages some text was not visible after tagging, because it was overlaid by images.

(7)
CONCLUSION:
My recommendation after 100s of tests: If you use ClearScan at all, only do it for perfect scans with text only. No scans with speckles, only scans of the best quality, 600 dpi color/grey, only documents with 99% text, ideally 1 font only (like in novels), no scans with difficult charts. Tag the pdf after using ClearScan and check the result manually. Don't trust ClearScan; check all pages for disappeared content. Don't use ClearScan if the result has to be 101% authentic. For that case plus high compression, use other techniques like mixed raster content (implemented e.g. in LuraDocument, FineReader, Omnipage). My impression is that the idea behind ClearScan (font reconstruction) is brilliant, but the implementation in Acrobat is halfhearted and not fully developed. What a pity that no Adobe developer seems to read these postings on acrobatusers.com.
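
As a rough illustration of points (2) to (4): the following Python sketch is a toy model of how a text extractor might guess spaces. It is not Acrobat's actual algorithm, which is not documented in this thread. The `Glyph` tuple, the `extract_text` function, the glyph positions and the `gap_ratio` thresholds are all assumptions made up for this example. The "page" holds only positioned glyphs and no space character; the extractor inserts a space whenever the horizontal gap between two neighbouring glyphs exceeds a fraction of the font size.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # advance width as declared by the (synthetic) font

def extract_text(glyphs, font_size, gap_ratio):
    """Join glyphs into a string, guessing a space wherever the gap
    between two glyphs is larger than gap_ratio * font_size."""
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > gap_ratio * font_size:
            out.append(" ")  # guessed space, not present as a character code
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)

# Hypothetical layout of the word "Abcdefg": every glyph is 5 units wide,
# but "e" and "f" are positioned 3 units further right than expected.
xs = [0, 5, 10, 15, 23, 31, 36]
glyphs = [Glyph(c, x, 5) for c, x in zip("Abcdefg", xs)]

strict = extract_text(glyphs, font_size=10, gap_ratio=0.25)
loose = extract_text(glyphs, font_size=10, gap_ratio=0.4)
print(repr(strict))  # 'Abcd e fg'
print(repr(loose))   # 'Abcdefg'
```

With the stricter threshold the small positioning offsets are taken for word gaps, with the looser one they are not. Under these assumptions this would explain both why the spaces are not in the content stream and why two readers can extract different text from the same pdf.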

Best
Lara

PS: leave some comment please!

