Adobe Acrobat

2009-07-30 10:37:30

jeffb

Registered: Jul 30 2009

Posts: 11

Answered

When I try to select and copy text in the Calibri font from a PDF file and paste it into Microsoft Word or a text editor, the result is simply a series of garbled characters. I have encountered this problem with Acrobat Professional 8 on machines running Windows XP and Mac OS X (both 10.4 and 10.5). Also, if the PDF file contains the Calibri font, I am unable to save or export it as an MS Word file with Acrobat Pro. Thus far, I have encountered this issue only with Microsoft's Calibri font. Anyone else running into this problem?

My Product Information:
Acrobat Pro 8.1.2

2009-08-03 11:12:29

lkassuba

Registered: Jun 28 2007

Posts: 3636

Is this font embedded into your PDF? You can check under File > Properties > Font tab.

Lori Kassuba is an AUC Expert and Community Manager for AcrobatUsers.com.

2009-08-03 11:31:44

jeffb

Registered: Jul 30 2009

Posts: 11

Yes. The Calibri font is embedded in the problematic PDFs.

Since my original post, I've done more experimenting and discovered that the problem seems to occur only if the original PDF was created from the Macintosh version of Microsoft Word. Creating a PDF with the Calibri font using the Windows version of Word does not produce the problem.

2009-08-03 12:09:59

lkassuba

Registered: Jun 28 2007

Posts: 3636

Can you post a sample?


2009-08-03 12:32:24

jeffb

Registered: Jul 30 2009

Posts: 11

Here's a short sample: CalibriTest.pdf (http://tinyurl.com/n3pba5)

2009-08-04 06:13:06

UVSAR

Registered: Oct 29 2008

Posts: 1357

It's down to the font encoding. Acrobat has a rather esoteric way of embedding fonts, one artifact of which is that it has to rebuild the ToUnicode mapping table (the list of which outline shape is for which character code). If the original font has strange lookups or passes through a quirky PostScript stage before Acrobat Distiller gets hold of it, the PDF conversion process can lose track of the associations. You'll find the "garbled" characters are not random - they each have a 1:1 association to the correct character, so if you copy the word "Microsoft" you'll always paste nine characters, and the 5th and 7th will be the same.

On screen, you're looking at the outlines and reading them visually, so have no interest in the actual data - but when you copy or export, Acrobat takes the underlying character codes and doesn't pass them back through any lookup tables. That thing on the screen which looks like an "s" is actually stored inside your PDF as "C", so it'll paste as "C". The other characters are mapped to very obscure accented characters, so depending on the font you're pasting into they'll probably appear as squares. If you select the line of text with the touchup tool, right-click to properties and change the font to something like Arial, you'll see what the stored codes actually are.
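The 1:1 property described above can be checked on any copied sample. The following is a minimal Python sketch, not anything Acrobat itself does: `build_mapping` is a hypothetical helper, and the pasted string is invented for illustration (only the "s" stored as "C" pairing comes from this thread).

```python
def build_mapping(visible, pasted):
    """Return {visible_char: pasted_char}, or raise ValueError if the
    pair is not consistent with a 1:1 substitution."""
    if len(visible) != len(pasted):
        raise ValueError("a 1:1 mapping preserves the character count")
    forward, reverse = {}, {}
    for shown, stored in zip(visible, pasted):
        # setdefault records the first pairing and returns it afterwards,
        # so any later disagreement shows up as a mismatch.
        if forward.setdefault(shown, stored) != stored:
            raise ValueError(f"{shown!r} maps to two different codes")
        if reverse.setdefault(stored, shown) != shown:
            raise ValueError(f"{stored!r} stands for two different characters")
    return forward


visible = "Microsoft"
# Hypothetical paste result: only "s" -> "C" is taken from the thread,
# the accented characters are made up for this example.
pasted = "\u0102\u0115\u0109\u0159\u0151C\u0151\u0192\u0163"

mapping = build_mapping(visible, pasted)
print(len(pasted))             # still nine characters
print(pasted[4] == pasted[6])  # the 5th and 7th (both "o") are identical
print(mapping["s"])            # the code stored for the "s" outline
```

If the pasted text were random rather than a substitution, `build_mapping` would raise a `ValueError` on most samples of any length.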

As far as Acrobat is concerned, it's actually doing what it's supposed to - the document displays correctly on screen and prints correctly, it's just it can't be taken apart again (so what? says Acrobat.. not my problem). Adobe intentionally allows these 'broken' embedding patterns so it can cope with fonts that normally won't pass through PostScript, by rebuilding them into a subset that will.

One thing to try is to change the point where the font is embedded - rather than sending it inside the PostScript file, distill the PDF without embedding that font, then open it in Acrobat and embed the font using the Preflight fixup. Your document has passed through cgpdftops, and there's a known break in the lookup tables whenever a non-PostScript font passes through a Ghostscript-style interpreter. In essence, MacWord is exporting a native PDF (which is what Mac programs do, and it's probably OK in terms of the font at this stage), CUPS is converting it back into a PostScript file using cgpdftops, then Distiller is re-converting it into a PDF again. I'm surprised after all that you even get a file you can open, but it's certainly why the font encoding is scampering off into the bushes.

In this case you don't want it to happen, but we sometimes intentionally use 'broken' lookup tables as a security feature to prevent copy/paste operations - see http://www.acrobatusers.com/forums/aucbb/viewtopic.php?pid=39545

2009-08-04 06:49:48

jeffb

Registered: Jul 30 2009

Posts: 11

Thanks for the detailed response. It is unfortunate that Acrobat does not throw up a warning message in this situation that text in the resulting PDF file cannot be copied, especially when the problem occurs with the default Microsoft Office font of Calibri.

In my setting, users will often have access only to the PDF file and not the original MS Office document on which it was based. If Calibri font was used in the original and unavailable document (especially likely because Calibri is the default Microsoft Office font), it appears to act as a block to copying text, which is not the intention and creates problems.

At the very least, this limitation to copying with certain fonts should be included in the Acrobat Help guide and the problem should be posted in the Knowledgebase on the Adobe site. I searched in vain in these resources for an explanation before coming to the user forum.

2009-08-04 07:21:37

UVSAR

Registered: Oct 29 2008

Posts: 1357

I agree it's not well-known outside the developer community (though if you Google for PDF font embedding problems, you'll find mountains of people with the same issue, and it's not just Acrobat that does it). It's not a bug per se, but a side effect of an intentional feature (namely the ability to work with non-PostScript fonts inside a PDF, which is in effect just a PostScript document with a fancy name).

Most of the time Acrobat doesn't actually know it's happened - it has no idea what the visual letters on the page mean to a human, and the lookup table is lost in distilling before Acrobat gets to see it, so it has no clue the "garbage" isn't intentional! We've gotten into the habit of always copying a few bits of text from a PDF that has to be searchable (e.g. if it's going online) just to make sure it says what it looks like it says, but if you find it's broken and don't have access to the original Office document, there's no way to recreate the correct mapping. For most people it doesn't matter as they can read and print the PDF just fine, however as searching and bot-indexing is becoming more important, it's getting more annoying.
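The spot check described above (copy a few bits of text and confirm they say what they appear to say) can be scripted once the text has been extracted. This is a rough Python sketch under stated assumptions: `missing_phrases` is a hypothetical helper, how the text gets out of the PDF is left to whatever tool you use, and both sample strings are invented.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and ignore case before comparing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_phrases(extracted_text, expected_phrases):
    """Return the phrases you can read on the page that do not occur
    in the text copied or extracted from the PDF."""
    haystack = normalize(extracted_text)
    return [p for p in expected_phrases if normalize(p) not in haystack]


# Phrases a human can read on the page.
expected = ["Calibri font", "test"]

# Invented samples: one healthy extraction, one with a broken mapping.
healthy = "This is a  test of the Calibri\nfont."
broken = "\u0162\u0125\u012bC \u012bC \u0103 \u0167\u0113C\u0167"

print(missing_phrases(healthy, expected))  # nothing missing
print(missing_phrases(broken, expected))   # every phrase missing
```

A non-empty result only says the copied text does not match the page; it cannot recover the correct mapping.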

2009-08-04 07:41:37

lkassuba

Registered: Jun 28 2007

Posts: 3636

jeffb wrote:

Thanks for the detailed response. It is unfortunate that Acrobat does not throw up a warning message in this situation that text in the resulting PDF file cannot be copied, especially when the problem occurs with the default Microsoft Office font of Calibri.

In my setting, users will often have access only to the PDF file and not the original MS Office document on which it was based. If Calibri font was used in the original and unavailable document (especially likely because Calibri is the default Microsoft Office font), it appears to act as a block to copying text, which is not the intention and creates problems.

At the very least, this limitation to copying with certain fonts should be included in the Acrobat Help guide and the problem should be posted in the Knowledgebase on the Adobe site. I searched in vain in these resources for an explanation before coming to the user forum.

The new Community Help system allows you to add comments to the online Help so I'll add reference to this forum post.


2010-05-20 04:25:54

Ches

Registered: May 18 2010

Posts: 1

I have a problem similar to that of jeffb. UVSAR explains the nature of the problem very clearly, but I don't agree with him when he says "As far as Acrobat is concerned, it's actually doing what it's supposed to - the document displays correctly on screen and prints correctly, it's just it can't be taken apart again (so what? says Acrobat.. not my problem)".

I disagree because Adobe documentation says: "Why use Adobe PDF? ... electronic archives are difficult to search, take up space, and require the application in which a document was created. PDF files are compact and fully searchable". Being searchable is presented as a key feature, but the feature doesn't seem to work for PDF documents with embedded fonts: these documents are not searchable and, as a secondary consequence (possibly not an Adobe problem), are not indexed by desktop or internet search applications.

The funny thing is that the search capability is still present in these documents. For instance: open the CalibriTest.pdf previously mentioned by jeffb (http://tinyurl.com/n3pba5) in Adobe Reader; select the character "i" in the first word of the document ("This"); copy it to the clipboard (Ctrl+C); open the find function (Ctrl+F); and paste the clipboard into the find field (Ctrl+V). You'll see two small squares displayed instead of the character "i". Now, if you repeatedly hit the "Find next" icon (or the F3 function key), every "i" character in the document will be selected, one at a time. The problem is that, to get this result, you have to enter an esoteric glyph instead of the character "i".

I hope that Adobe will also make PDF files with embedded fonts searchable, fixing what IMHO is a bug, because the behavior doesn't match what the documentation says.

2010-05-20 05:38:32


UVSAR

Registered: Oct 29 2008

Posts: 1357

Ches:

Acrobat is processing the file perfectly correctly, according to the PDF specification - the reason some of these PDFs are acting this way is *not* an Acrobat problem, but due to something going wrong in the creation stage, where the original application outputs a PostScript file that Distiller then converts to PDF. If the font mapping tables aren't correct when Distiller gets its hands on the document, it will not only have no idea there's anything wrong, it will correctly convert the document *as-is*. The result is a mismatch between internally-stored character codes and the glyph shapes, but that's just the way fonts work. Acrobat can't detect if the glyph for "G" actually looks like a "G", nor would you want it to (think of what it'd complain about if you used Symbol or Wingdings!)

As I said in my first reply to this thread, the mapping error is always a 1:1 effect, so you can copy a chunk of characters and then search for them, just as you can in any other PDF. The fact the characters don't appear right when pasted is irrelevant to the search routine, as again it's not looking at the visible shape of the letters, only their codes.
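To illustrate why a search on the stored codes still works, here is a small Python sketch. It is a simulation of the idea only, not how Acrobat's find routine is implemented; the `mapping` table and the `encode` helper are hypothetical, with only "s" stored as "C" taken from this thread.

```python
# Hypothetical broken lookup table: outline shape -> stored character code.
mapping = {"T": "\u0162", "h": "\u0125", "i": "\u012b", "s": "C", " ": " "}


def encode(text, table):
    """Translate what you see on screen into the codes held in the PDF."""
    return "".join(table[ch] for ch in text)


# What the page shows versus what the file actually stores.
shown = "This is"
stored = encode(shown, mapping)

# Typing the visible letters finds nothing, because the search compares
# codes and the code for the "i" outline is not "i".
print(stored.find("is"))

# Searching for the copied (garbled) codes matches at the same offsets
# where "is" is visible on screen.
needle = encode("is", mapping)
print(stored.find(needle), shown.find("is"))
print(stored.count(needle), shown.count("is"))
```

Because the substitution is 1:1, every visible occurrence has exactly one garbled counterpart, which is why pasting a copied chunk into the find field locates all of its occurrences.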


Problem with Copying Calibri Text