These forums are now Read Only. If you have an Acrobat question, ask questions and get help from one of our experts.

Using MS Access VBA to save PDF as Text

Ancalima
Registered: Oct 8 2008
Posts: 2

Ok well what I am trying to do is use VBA code to save an Acrobat Version 7.0 PDF into a text file so it can be pulled into a database using monarch. The actual saving of the file as text works, and here is the code I have used to accomplish that:

Dim AcroXApp As Object
Dim AcroXAVDoc As Object
Dim AcroXPDDoc As Object

Set AcroXApp = CreateObject("AcroExch.App")
AcroXApp.Hide

Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
AcroXAVDoc.Open PDF_PATH & filename, "Acrobat"

AcroXAVDoc.BringToFront

Set AcroXPDDoc = AcroXAVDoc.GetPDDoc

Dim jsObj As Object
Set jsObj = AcroXPDDoc.GetJSObject

jsObj.SaveAs OUTPUT_PATH & OutputFile, "com.adobe.acrobat.plain-text"

AcroXAVDoc.Close False
AcroXApp.Hide
AcroXApp.Exit

Now as I said the code itself runs perfectly and saves the pdf as a plain text file. The problem I am having is that the original pdf is in neat column format, but the resulting text file loses all the whitespace and creates only a single space between column fields. This is insufficient for my needs as some of the fields have spaces in them, so i cannot use a space-delimited method to import them.
My current work-around is that rather than save the file as text, i save it as a html version 3.20 file by changing "com.adobe.acrobat.plain-text" to "com.adobe.acrobat.html-3-20 "and i use monarch to anaylze that file. This works mostly, but it has had many bugs to work out because of inconsistencies in the way acrobat saves the file as html.
The application i've built does work for the most part, but i am concerned that if it receives some odd data that it will break because as it is built it is somewhat volatile. I need 7 monarch models as it is just to gather all of the data correctly.
I was wondering if there was any way to save the pdf file as a text file but still retain the column format and whitespace from the original pdf. This would alleviate most all of my problems. Also I use Acrobat version 6.0 standard, Monarch Pro 6.00 and MS Access 2000. I appreciate any help anyone could give me.

thomp
Expert
Registered: Feb 15 2006
Posts: 4411
Very clever code!! But you are running up against a limitation caused by the internal organization of data in the PDF. You see, the PDF format has no concept of text formatting. Internally there are no columns, paragraphs, etc. PDF places strings of text at exact coordinates. In fact, the ordering of the text is completely arbitrary. So when you save the PDF as text it simply writes out the individual words as it sees them. which doesn't include formattting.

There is hope. You have a couple of options. First, there are several 3rd party tools for converting PDF into other formats. I'm sure that you could find one that extracts table data.

Second, as you've already figured out, Acrobat will save to other formats, perhaps "Rich Text", or the Word format would be better suited to your purpose?

Third, and this is the fun one. You could write an Acrobat automation script with JavaScript to extract the row and colum values and then pass this information back to your VBA script for writing directly into the DB. This technique is the most direct, and contained, of all the strategies, since data flows directly from the PDF into your Acess DB.


Thom Parker
The source for PDF Scripting Info
[url=http://www.pdfScripting.com]pdfscripting.com[/url]

The Acrobat JavaScript Reference, Use it Early and Often
[url=http://www.adobe.com/devnet/acrobat/]http://www.adobe.com/devnet/acrobat/[/url]

Thom Parker
The source for PDF Scripting Info
www.pdfscripting.com
Very Important - How to Debug Your Script