Extract selected text from a PDF document.

abhijeetz
Registered: Jul 9 2008
Posts: 11

I want to extract the currently selected text from a PDF document. I have checked the material that is available online, but the examples there first create the text selection themselves (by passing hard-coded values) and then retrieve it.

Here is the code I am using:

Dim nElement As Integer
Dim gAcroRect As Acrobat.CAcroRect
Dim pdTextSelect As Acrobat.CAcroPDTextSelect

' gApp, gAvDoc, gPdDoc and docNo are declared elsewhere (not shown)
Try
    gApp = CreateObject("AcroExch.App")
    gAvDoc = gApp.GetAVDoc(docNo)
    gAvDoc.BringToFront()
    gPdDoc = gAvDoc.GetPDDoc()
    gAcroRect = CreateObject("AcroExch.Rect")

    ' AS YOU CAN SEE, THE FOLLOWING COORDINATES ARE HARD-CODED
    gAcroRect.bottom = 380
    gAcroRect.Top = 400
    gAcroRect.Left = 100
    gAcroRect.right = 500

    ' The first argument is the zero-based page number
    pdTextSelect = gPdDoc.CreateTextSelect(2, gAcroRect)
    gAvDoc.SetTextSelection(pdTextSelect)
    gAvDoc.ShowTextSelect()

    ' Show each text element contained in the selection
    For nElement = 0 To pdTextSelect.GetNumText() - 1
        MessageBox.Show("Text # " & nElement & " ---> '" & pdTextSelect.GetText(nElement) & "'")
    Next

Catch ex As Exception
    ' Report the failure instead of silently swallowing it
    MessageBox.Show(ex.Message)
End Try

In the code above, the gAcroRect object holds the hard-coded coordinates of the area to select. Instead, I want to extract the text when the user selects an area on the PDF document, so these coordinates need to come dynamically from the selection the user makes.

I also tried copying the selected text from the PDF, but I need to retain and save the word number and page number that belong to the current selection.

Your suggestions will be highly appreciated.

-Abhijeet

My Product Information:
Acrobat Standard 8.1.2, Windows
thomp
Expert
Registered: Feb 15 2006
Posts: 4411
What you want is a "GetTextSelection()" function. Unfortunately, no such function exists. You cannot extract an existing text selection from an open PDF using the Acrobat IAC. Only an Acrobat plug-in can do this sort of thing, so you could write a plug-in to provide your VB app with this info.

Thom Parker
The source for PDF Scripting Info
http://www.pdfScripting.com

The Acrobat JavaScript Reference, Use it Early and Often
http://www.adobe.com/devnet/acrobat/

abhijeetz
Registered: Jul 9 2008
Posts: 11
Hi Thom,

I have seen the function "AVDocGetSelection" in the Adobe SDK; however, I did not find a way to call it from C#.NET using the Adobe library.

It would be great if you could shed some light on this.
thomp
Expert
Registered: Feb 15 2006
Posts: 4411
The Acrobat SDK functions are only available to plug-ins. They are not visible to the IAC :( so they cannot be used from an external application.

Wouldn't it be nice if they were.

You write plug-ins using C++. A plug-in is a DLL that Acrobat loads when it starts up, so the plug-in operates inside the Acrobat application space. It's very tightly integrated with Acrobat, so it has access to all kinds of functionality.

The IAC is an interface Acrobat presents to the outside world. It looks similar to the SDK in a lot of ways, but it's really a completely different animal. There is no relationship between the two. The IAC only supports functionality listed in the IAC Reference.

Thom Parker
The source for PDF Scripting Info
http://www.pdfScripting.com

The Acrobat JavaScript Reference, Use it Early and Often
http://www.adobe.com/devnet/acrobat/javascript.php

rgagnon
Registered: Aug 31 2010
Posts: 1
I would like to use the PDETextSelect.GetText call in a plugin, as described in the SDK 9 help file. A prototype for this function, or anything similar, does not appear in any of the header files. Another supporting call, PDETextSelect.GetTextNum, seems to be missing as well, as does the sample program named TextExtraction that the help file mentions.

Can anyone help with extracting single words or sentences from a PDF using a plugin? The SDK documentation of function calls does not seem to match the library. I have the plugin up and running fine built with VS2005 and MFC.

Roger

brettville
Registered: Dec 20 2010
Posts: 2
I'm not sure about getting the selected text, but you can easily read text from a PDF file:

' Requires a project reference to the Acrobat type library.
' Path must hold the full path of the PDF to open.
Dim app As New AcroApp
Dim avDoc As New AcroAVDoc
Dim pdDoc As AcroPDDoc
Dim jso As Object
Dim word As Variant
Dim iCount As Integer
Dim iWord As Integer
Dim iPage As Integer

iPage = 0 'First page
app.Show
avDoc.Open Path, ""
avDoc.BringToFront
Set pdDoc = avDoc.GetPDDoc
Set jso = pdDoc.GetJSObject

' Walk every word on the page; the third argument strips punctuation and whitespace
iCount = jso.getPageNumWords(iPage)
For iWord = 0 To iCount - 1
    word = jso.getPageNthWord(iPage, iWord, True)
    If VarType(word) = vbString Then
        Debug.Print word
    End If
Next

Let me know if there are any other questions.
Brett Sanders
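
A possible extension of the snippet above, offered only as a sketch: since the original question also asks about keeping the page number and word number, the same JSObject calls can be looped over every page, printing both indices next to each word. This does not read the user's live selection (as noted earlier in the thread, the IAC exposes no call for that); it only enumerates words. The procedure name ListWordsWithPositions, the pdfPath parameter, and the output format are assumptions made for illustration, and the code needs a reference to the Acrobat type library plus an installed copy of Acrobat.

```vb
' Sketch only: ListWordsWithPositions and pdfPath are hypothetical names.
' Prints every word in the document with its zero-based page and word index.
Sub ListWordsWithPositions(pdfPath As String)
    Dim app As New AcroApp
    Dim avDoc As New AcroAVDoc
    Dim pdDoc As AcroPDDoc
    Dim jso As Object
    Dim word As Variant
    Dim iPage As Long, iWord As Long
    Dim nPages As Long, nWords As Long

    app.Show
    ' Open returns False when the file cannot be opened
    If Not avDoc.Open(pdfPath, "") Then
        Debug.Print "Could not open " & pdfPath
        Exit Sub
    End If

    Set pdDoc = avDoc.GetPDDoc
    Set jso = pdDoc.GetJSObject

    nPages = pdDoc.GetNumPages
    For iPage = 0 To nPages - 1
        nWords = jso.getPageNumWords(iPage)
        For iWord = 0 To nWords - 1
            ' Third argument True strips punctuation and whitespace
            word = jso.getPageNthWord(iPage, iWord, True)
            If VarType(word) = vbString Then
                Debug.Print "page " & iPage & ", word " & iWord & ": " & word
            End If
        Next iWord
    Next iPage

    ' True closes the document without saving
    avDoc.Close True
End Sub
```

The page and word indices printed here are the same ones getPageNthWord accepts, so they can be stored and used later to look a word up again.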