Duff Johnson's Blog

Posted: 2007-03-25

Converting PDF to Word: Understanding the Problem

I hear some version of the following question over and over:

"Which software accurately converts PDF to Word?"

Converting PDF to Word (or other word-processing applications, HTML or whatever) is not a simple, push-button affair, as almost everyone who has ever tried it knows (thus the questions).

Even so, most people are looking for a simple, push-button way to get the contents of a PDF into a Word file. What's the typical experience? Documents with layouts even slightly more complex than plain paragraphs routinely convert into junk. End users expect this task to be pretty easy, which explains why the tone of the typical inquiry may be characterized as "pained".

Let's take a moment to understand why converting PDF to Word is so problematic.

The factors influencing the quality of conversion from PDF to Word are, in descending order of significance:

  1. The extent to which the document's logical structures are represented within the PDF (tagging)
  2. The complexity of the objects on the page (mathematics, charts, graphs, etc.)
  3. The complexity of the document layout

Factor 1 is a property of the PDF file itself, not of the software used to extract the contents to Word. If the document is properly structured and tagged, converting to Word from Adobe Acrobat gives predictable results.
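
As a rough illustration of Factor 1: a tagged PDF normally carries a /StructTreeRoot entry and a /MarkInfo dictionary with /Marked true in its document catalog. The Python sketch below simply searches the raw bytes of a file for those two markers. The function name looks_tagged is made up for this example, and the check is only a heuristic: it can miss tags when the catalog is stored in a compressed object stream, and it says nothing about whether the tags are any good.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic check for tagging markers in raw PDF bytes.

    Assumption: the document catalog is stored uncompressed, so the
    names /StructTreeRoot and /Marked are visible as plain bytes.
    A True result means "probably tagged", not "well tagged".
    """
    # A structure tree root is where the logical structure (tags) lives.
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    # /MarkInfo << /Marked true >> declares that the file claims to be tagged.
    is_marked = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and is_marked
```

To try it on a real file, pass in the file's bytes, for example looks_tagged(open("report.pdf", "rb").read()), where report.pdf is a placeholder name.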

Beyond Factor 1, different software will guess at logical structure via analysis and assessment of the layout, fonts and objects on the page.  There is no magic bullet. The more complex the document, the lower the chance of high-quality output, no matter what software is used.
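
To make the "guessing" concrete, here is a toy Python sketch of one such layout heuristic: treat any run of text set noticeably larger than the most common font size as a heading. This is not how Acrobat or any particular product works. The TextRun type, the guess_roles function and the 1.2 ratio are all assumptions invented for this illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextRun:
    """A piece of text and its font size (hypothetical input format)."""
    text: str
    font_size: float


def guess_roles(runs, heading_ratio=1.2):
    """Label each run 'heading' or 'paragraph' by font size alone."""
    if not runs:
        return []
    # Assume the body size is the size that covers the most characters.
    chars_per_size = Counter()
    for run in runs:
        chars_per_size[run.font_size] += len(run.text)
    body_size = chars_per_size.most_common(1)[0][0]
    # Anything clearly larger than the body size is guessed to be a heading.
    return [
        ("heading" if run.font_size >= body_size * heading_ratio else "paragraph",
         run.text)
        for run in runs
    ]
```

A guess like this breaks down quickly: a pull quote, a caption or a drop cap can all be set in large type without being headings, which is exactly why complex layouts convert so poorly.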

Comments

phoebe
Registered: Aug 4 2010

officeconvert.com/advanced-pdf-to-word-converter.htm

http://www.officeconvert.com/

TIMBIM
Registered: May 19 2010
Upon Further Examination.

I am really confused now.....

I split a .pdf document up with Acrobat and converted each individual page with no problem.

When I try to convert the entire document, the result is unusable...

Maybe Adobe could add a setting 'process each page individually'

DuffJohnson
Expert
Registered: May 30 2006

It's possible you have a memory limitation. Conversion of large documents is extremely taxing on the machine.

Your idea is a good one - to allow for processing of a given page-range. I will post it as a Feature Request.

Thanks!

Duff.

Appligent Document Solutions
http://www.appligent.com

TIMBIM
Registered: May 19 2010
Memory Issues

Thanks for your reply.

I do doubt that I am experiencing a memory issue. This was a 42-page document, with text at size 14 or so (so not much on each page) and a few .jpg's strewn about. The .pdf is 2.3MB. I have 12GB RAM.

The conversion process (all pages) took about 4 seconds.
The result was that the pages 'exploded', with images and text split between pages. It is as though Acrobat was trying to keep the text continuous (yep, settings were for layout over flowing text). Where text was bulleted with images, the images were left on the previous page, etc....

It would be nice if Acrobat separated each page, converted it, then stitched them back together.

Just a thought! :-)

TIMBIM
Registered: May 19 2010
Vendors vendors vendors....

I don't mean to be rude.....

If conversion from .pdf to .doc is so unreliable, why is it offered as a function of Acrobat? I am having very mixed results, most of them terrible.

:sigh:

DuffJohnson
Expert
Registered: May 30 2006

It's not that the conversion is unreliable per se - it's that the documents being processed can be (a) deeply unreliable themselves, and (b) technically very challenging to "unpick".

And yes - the software could be better. In general, the simpler the information it has to process, the better a job it will do. Sometimes, it's easiest to scan the page and OCR in order to convert to Word.

Try tagging the document before extraction, as I suggest.

Are you trying to get plain text, or are you trying to preserve the original layout? Tagging is a great strategy if your focus is plain text. Otherwise, you have to consider the complexity of the layout, how well organized the PDF file is, etc., and that's nasty stuff.

There are a number of variables, but the software is not a genius - it needs help to get optimum results.

Appligent Document Solutions
http://www.appligent.com

kevin88
Registered: Feb 3 2010
Convert pdf to word

I have downloaded a program called Advanced pdf to word 5.0. It's pretty good. It is very good at preserving the formatting. You can download it from: http://www.advancedpdfconverter.com/products/pdftoword.html

Anonymous

The biggest problems are caused by the text boxes that come with OCRing (every paragraph gets its own box in Word, and that's very hard to format).

When I was converting my website http://www.ad.com.hr, which is in Croatian, I had a similar problem: some of my testimonials were scanned into PDF, and when I converted them I couldn't use the result at all, because the graphics were mixed up with the paper background and the clients' handwriting...

Thanks for your post!

Anonymous

[...] PS For some enlightenment on one-button conversions to MS Word and the problems associated with simple exports from Acrobat, see Duff Johnson's Blog [...]