
Conversion to Word .docx format

Simon
Registered: Sep 17 2007
Posts: 7
Answered

I need to convert my publisher's 360-page book PDF into Word format. File Convert messes up the page formatting. Do I tell it not to remove tags or something? Do I use OCR?

I'm a novice and appreciate any help. How do I know if I even get a reply to this post?

Thanks, Simon

My Product Information:
Acrobat Pro 9.2, Windows
lkassuba
ExpertTeam
Registered: Jun 28 2007
Posts: 3636
The results when exporting to Word are dependent on how the PDF was created initially. If it had no structure when the PDF was created, then your results may be subpar. Your best bet is to work with the two options of "Retain Flowing Text" or "Retain Page Layout" in the Save As DOC settings dialog.

Lori Kassuba is an AUC Expert and Community Manager for AcrobatUsers.com.

Simon
Registered: Sep 17 2007
Posts: 7
Thanks. In the end I went with the Plain Text option and am rebuilding the file in Word. FYI, Adobe's phone support was a sad shadow of what it once was. After a half hour on hold, the rep came on the line, listened to my question, complained he couldn't hear me, and then suggested I call back in, placing me back at the end of the line.

That's one way to duck product support! Anyway, thanks for your helpful suggestion.

Simon
Simon
Registered: Sep 17 2007
Posts: 7
Your reply was a big help, and I converted the Adobe PDF of my manuscript to plain text, saved that as a Word file in .DOC format, and am formatting each of its 360 pages for layout, so thank you for your great guidance to get this far.

There's one critical aspect I hope you can help me with.

In the Adobe manuscript PDF, the book is laid out by the publisher ready for the printer, with a hyphen at the end of many of the lines, breaking the word. In the plain text conversion, and the Word .DOC created from it, those hyphens still appear, except of course they're no longer at the end of a line, but right in the middle of words where they don't belong.


Is there any way you know that, in Word 2007, I can search and replace those hyphens with nothing, i.e. delete them? For example, change the word "to-day" into "today".

I've tried the Special box, and selected "nonbreaking hyphen" and "Optional hyphen" but the search doesn't find these hyphens.

Do you know how I can make these many hundreds of carried-over hyphens from the Acrobat file disappear?

Thanks! My book means the world to me, and its information could help many, so really appreciate your guidance.

Simon, Los Angeles, California
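
One possible reason a Special search finds nothing is that the character on screen is not the one being searched for, since several different code points all look like a hyphen. The following is only a minimal sketch for checking that, assuming the plain text export can be loaded as a Python string; the function names `count_hyphen_like` and `describe` are made up for illustration and are not part of Acrobat or Word.

```python
import unicodedata
from collections import Counter

# Code points that can all look like a hyphen on screen.
HYPHEN_LIKE = {
    "\u002d",  # HYPHEN-MINUS (the ordinary keyboard hyphen)
    "\u00ad",  # SOFT HYPHEN (an optional, line-breaking hyphen)
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
}


def count_hyphen_like(text):
    """Return a Counter of the hyphen-like characters found in text."""
    return Counter(ch for ch in text if ch in HYPHEN_LIKE)


def describe(counts):
    """Format the counts as lines such as 'U+00AD SOFT HYPHEN: 2'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n}"
        for ch, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    # Hypothetical sample: two soft hyphens and one real hyphen.
    sample = "to\u00adday is a well-known exam\u00adple"
    for line in describe(count_hyphen_like(sample)):
        print(line)
```

If most of the unwanted hyphens turn out to be one code point and the legitimate ones (as in "well-known") another, they can be removed without touching the hyphens that belong in the text.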
rbogie
Registered: Apr 28 2008
Posts: 432
simon, plz let the forum know that (if) we solved your question.
Simon
Registered: Sep 17 2007
Posts: 7
Hello everyone,

I'm not that experienced with how forums work, so wanted to let rbogie and everyone in the forum know that my question is now solved with your help! "rbogie", and therefore this forum, enabled me to convert my book file to Word, and I'm now working as quickly as I can to build it for the audiobook company to record after my book comes out this summer.

I only had one remaining question. I wondered if Adobe allows for a search for a given font style such as italics or bold? I'm now doing this by simply checking through the pages, and I'm pretty sure the answer is "no" as I've explored the advanced Boolean search choices, but don't see a way to specify a font.

But this forum solved my main and critical question perfectly.

THANK YOU, especially "rbogie" and apologies it took this long to post. I wasn't sure how to get back to this forum page..



Simon




Many thanks to everyone
afablac
Registered: Mar 9 2010
Posts: 2
I need help converting a PDF document in Acrobat 9 Reader to a Word doc. I have read all of the instructions; however, I do not have an "export" tab under File. Please help as I need instructions rather quickly. Thank you.
Simon
Registered: Sep 17 2007
Posts: 7
I'm not knowledgeable, and am under a deadline too, but here are my rough notes on how this forum helped me to get my pdf into Word.

Summary:

First I made several backups!! This was to make certain I had a safe copy of the pdf, and played only with the backups, not my original.

In Acrobat I saved the backup as a plain text file. For some reason, hyphens that appeared at the ends of lines came through into the plain text file as hyphens in the middle of lines. With great assistance from this forum's member, I went through Notepad and WordPad in order to search and replace all occurrences of "\-" with nothing at all, in order to strip out all the hyphens.

My document was 360 pages; if yours is short you might just go through deleting the hyphens.
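
For anyone comfortable running a script, here is a rough sketch of a shortcut that works on the plain text file directly. It rests on an assumption the thread does not confirm: that the stray hyphens are soft hyphens (U+00AD), which is the character RTF writes as "\-". The `cp1252` default is also an assumption (a common meaning of "ANSI" on Western Windows), and the function names are hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # assumed to be the character behind the stray hyphens


def strip_soft_hyphens(text):
    """Delete every soft hyphen; ordinary '-' characters are left alone."""
    return text.replace(SOFT_HYPHEN, "")


def clean_file(src_path, dst_path, encoding="cp1252"):
    """Read src_path, strip soft hyphens, and write the result to dst_path.

    The encoding is a guess; change it if the export used something else.
    Writing to a separate dst_path keeps the original file untouched.
    """
    with open(src_path, encoding=encoding) as src:
        cleaned = strip_soft_hyphens(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(cleaned)
```

A call might look like `clean_file("book.txt", "book_clean.txt")`, with both file names being placeholders. As with the manual route, only try it on a backup.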

Here are the detailed notes of what I did, but please understand I'm not an expert and am under a deadline, so only try this if you're skilled or have someone skillful to answer questions that relate to your own configuration.

Good luck!

Simon

Details:

1, 2. File > Save As, select plain text so it becomes a .txt file. Accept the default settings. It takes a very long time to make it, but should come up with no formatting at all in that folder with a .txt extension. It's all single lined, and all end-of-line hyphens get pulled in; there are no italics, page headers, bold, etc. any more. Those will have to be rebuilt in Word.

3. In MS Word, select File > Open. It knows it's a txt file (stored in Notepad) and automatically opens a file conversion window. Accept the default (Windows OS), where in the lower window the file will be legible.

4. Save as a .docx file.


[If you see lots of hyphens and want to automate their removal]

II. Remove the Hyphens

A. Make a safety copy of the file ("SO2") in *DOC* format (not DOCX; in DOCX format, the encoding shows up in WordPad and messes up this process. It must be in Word's compatible DOC.)


1. Open SO2.DOC using WordPad (Start > All Programs > Accessories > WordPad > filename).

2. Save SO2 in RTF on the Desktop. Make sure the desktop icon (a "W") is followed by .rtf, namely SO2.rtf. I may need to do this step twice in order to get SO2 into RTF.




3. Right-click (RMB) the SO2.rtf file > Open With > WordPad (not Notepad, which has very little formatting; WordPad has more). In WordPad, the hyphens should disappear.

4. Save the file as "SO3" in RTF.

5. Open SO3 in Notepad. Do this by Start > All Programs > Accessories > Notepad, and when you have that empty window open, drag SO3 into it. The title bar at the top of the window should read "SO3.rtf - Notepad", and I see all the text with many backslashes.

6. In Notepad:

7. Select the dropdown menu Edit > Replace (or Ctrl+H). Search for "\-" [use the hyphen key to the right of the zero, but I think either hyphen key would work];

Leave the Replace box empty, in order to replace with nothing at all;

8. Select Replace All

9. When done, click "Cancel" to close the window.


10. If it's worked, the next step is File > Save As SO4.txt, accepting "format" as txt and "encoding" as ANSI.

11. On the desktop, RMB SO4.txt > Open with > Microsoft Office Word, and with luck your file is open and correct in Word. Presto!

Now you have to add headers, italics, bold and other heading formats, and line breaks will likely occur in the wrong places.
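
The Replace All in steps 7 and 8 could also be scripted. This is a sketch only, not something from the thread: it treats the RTF as a Python string and deletes the "\-" control symbol (RTF's optional hyphen), while skipping the case where the backslash is itself escaped as "\\" and merely happens to sit before a real hyphen. A plain Notepad replace does not make that distinction. The name `strip_optional_hyphens` is hypothetical.

```python
import re

# Scanning left to right, match either an escaped backslash (two backslashes)
# or the optional-hyphen control symbol (backslash followed by a hyphen).
# Trying the escaped backslash first prevents it from being misread.
_TOKEN = re.compile(r"\\\\|\\-")


def strip_optional_hyphens(rtf):
    """Remove every optional-hyphen control symbol from RTF source text."""
    def keep_or_drop(match):
        # Drop the optional hyphen; put an escaped backslash back unchanged.
        return "" if match.group() == "\\-" else match.group()

    return _TOKEN.sub(keep_or_drop, rtf)


if __name__ == "__main__":
    print(strip_optional_hyphens(r"to\-day and to\-morrow\par"))
```

Other control words such as "\par" are left as they are, so the result is still RTF and would be saved and opened the same way as in steps 10 and 11.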


Hope this helps, but this is all the help I've time to offer..

Best wishes, Simon
afablac
Registered: Mar 9 2010
Posts: 2
Simon,
Thank you for taking the time to reply. I will give it a try.
Simon
Registered: Sep 17 2007
Posts: 7
Good luck, and feel free to ask this forum if you need input. Someone helped me, after all. It's just that I'm not very knowledgeable, and am under a deadline.

Only try it with a backup..

Simon
nlee
Registered: Mar 11 2010
Posts: 3
Hi~

I created a Word{2003} doc with transparent backgrounds in some of the shape boxes and text boxes...and they show up as opaque white when I convert the doc to a pdf file...and even tho they show up as transparent in my print preview they also print as opaque white...

Can anyone help with this???

thank you~
~Nlee
TonyPotter
Registered: Feb 1 2010
Posts: 85
afablac wrote:
I need help converting a pdf doct in Acrobat 9 Reader to a word doc. I have read all of the instructions however, I do not have an "export" tab under file. Please help as I need instructions rather quickly. Thank you.
Acrobat Reader is designed for viewing PDFs, so you cannot export PDF to DOC. If you use Windows, I can point you to a free desktop [url=http://www.anypdftools.com/pdf-to-word.php#201]PDF to Word Converter[/url], with which you can easily convert PDF to Word with the original text, hyperlinks, graphics and layouts preserved perfectly. Hope it helps!

I will try my best to help you in PDF conversion fields, objectively and neutrally.

redcrew
Registered: Nov 7 2006
Posts: 83
I have a PDF that was originally converted from Word. No one has the original Word document. The PDF file has text, numbered lists, bulleted lists, and headings.

When I view File > Properties, it shows the file is tagged.

In Windows Acrobat Professional 9.3.1, I've tried the following steps to convert the PDF to Word:

1. File > Save As > Word document > Settings > Retain flowing text
2. File > Save As > Word document > Settings > Retain page layout
3. File > Save As > RTF > Settings > Settings > Retain flowing text
4. File > Save As > RTF > Settings > Settings > Retain page layout
5. File > Export > Word document > Settings > Retain flowing text
6. File > Export > Word document > Settings > Retain page layout
7. File > Export > RTF > Settings > Retain flowing text
8. File > Export > RTF > Settings > Retain page layout
9. File > Export > Word > Settings > Retain flowing text
10. File > Export > Word > Settings > Retain page layout

All resulted in a Word file with text boxes, not straight text that can easily be formatted. The formatting for the headings is maintained. The bullets on the list items seem to be converted to checkboxes that aren't clickable.

Is there some option I'm missing that will keep the text in the PDF from converting to a text box?
Each heading has a text box around it.
Each list item has a text box around it.
Each paragraph has a text box around it.

Advice?
daka630
Expert
Registered: Mar 1 2007
Posts: 1420
Hi redcrew,

That a PDF shows "tagged" in its document properties provides little confidence that the PDF actually has a well-formed structure tree.
You may want to do a complete top down walk of all entries in the PDF's Tags panel.
From the Options menu, select "Highlight ..."

As you walk the tree observe where the highlight goes/shows.

From what you have described I suspect you have, at best, a poorly tagged PDF.

If it is something that can be shared via an acrobat.com link it'd be interesting to give the PDF a look-see.

Be well...


redcrew
Registered: Nov 7 2006
Posts: 83
Hi daka630,

Thanks for the comment about the Tags panel. I didn't view the Tags panel, just was curious if I missed a setting/option when trying to save a PDF to Word.
redcrew
Registered: Nov 7 2006
Posts: 83
I did additional testing with a Word document that was:

1. created with styles
2. converted to PDF using Word 2007 PDFMaker
3. Tags panel checked
4. Exported as Word document in Acrobat 9.3.1

When I exported from Acrobat 9.3.1 to Word and selected "Retain flowing text", the resulting document did not retain the correct style settings for all the styles, but did retain the text as:

1. individual paragraphs
2. bulleted list items
3. numbered list items
4. headings

The document has a "squished" look: spacing between text has been deleted. The font type and font styles (bold) are retained in the text, but styles associated with headings, lists, and paragraphs are no longer part of the document.

When I exported from Acrobat 9.3.1 to Word and selected "Retain page layout", the resulting Word file displayed the text exactly as it displayed in the original Word 2007 file. Headings, paragraphs, footer, header, and lists displayed correctly with spacing, font styles and font type.

However, all the text:

1. headings
2. paragraphs
3. individual bulleted list items
4. individual numbered list items
5. footer
6. header

are in text boxes. Additional styles, with names beginning with the letters "CM", were added to the styles for the document, and used for the styling of the text boxes.

Interesting experiment, but not the results I had hoped for.
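
To compare the two export settings without opening each result by eye, something like the sketch below could count text boxes and list the added "CM" styles. It assumes the export was saved as .docx (a zip archive containing word/document.xml and word/styles.xml) and that the XML uses the usual "w:" prefix. The function names and the tiny stand-in file built by `make_sample_docx` are hypothetical and only there for illustration.

```python
import io
import re
import zipfile


def count_text_boxes(docx):
    """Count <w:txbxContent> elements in the main document part.

    Some files store a fallback copy of each text box, so treat the
    number as a rough indicator rather than an exact count.
    """
    with zipfile.ZipFile(docx) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    return len(re.findall(r"<w:txbxContent[\s>]", xml))


def style_names(docx, prefix=""):
    """Return the sorted style names that start with the given prefix."""
    with zipfile.ZipFile(docx) as zf:
        xml = zf.read("word/styles.xml").decode("utf-8")
    names = re.findall(r'<w:name\s+w:val="([^"]*)"', xml)
    return sorted(name for name in names if name.startswith(prefix))


def make_sample_docx():
    """Build a tiny in-memory stand-in for a .docx (illustration only)."""
    box = "<w:p><w:txbxContent><w:p/></w:txbxContent></w:p>"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml",
                    "<w:document><w:body>" + box + box + "</w:body></w:document>")
        zf.writestr("word/styles.xml",
                    '<w:styles><w:style><w:name w:val="CM12"/></w:style>'
                    '<w:style><w:name w:val="Heading 1"/></w:style></w:styles>')
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(count_text_boxes(make_sample_docx()))
    print(style_names(make_sample_docx(), prefix="CM"))
```

With real files, a path such as "flowing.docx" or "layout.docx" (placeholder names) can be passed in place of the in-memory sample.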