
Regular Expression Search and create link

EricFan
Registered: Sep 1 2008
Posts: 13
Answered

Hi gurus.

I need to search thousands of PDF files with a regular expression and dynamically create a link on each match.

To use a regular expression, I convert each PDF to plain text by extracting every word into a string. After this step, however, the index of each word is lost, so I cannot call getPageNthWordQuads to add a link.

Can anybody help me with this?

Thanks
Eric

thomp
Expert
Registered: Feb 15 2006
Posts: 4411
Don't convert the file to text. Use "getPageNthWord" to find the text to link. There's no reason you can't use a regular expression this way. If the pattern needs to match across multiple words, then stack the words into an array and join them for the pattern testing. Pop words off the stack to keep it large enough for the match, but small enough not to cause problems.
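The stacking idea above could be sketched roughly as follows. This is only an illustration under assumptions, not code from the thread: findWindowMatches, getWord, and maxWords are hypothetical names, and getWord(i) stands in for a call such as this.getPageNthWord(page, i) inside Acrobat. The sketch drops the oldest word with shift() when the array grows past maxWords, and it reports a match only when the match ends in the newest word, so the same hit is not reported again as the window slides.

```javascript
// Hypothetical helper: scan words through a bounded window and report
// regex matches together with the word indexes they cover.
// getWord(i) is assumed to return the i-th word of the page without spaces.
function findWindowMatches(getWord, wordCount, pattern, maxWords) {
    var stack = [];
    var results = [];
    // Rebuild the pattern with the global flag so exec() can iterate.
    var flags = "g" + (pattern.ignoreCase ? "i" : "") + (pattern.multiline ? "m" : "");
    for (var i = 0; i < wordCount; i++) {
        stack.push(getWord(i));
        if (stack.length > maxWords) {
            stack.shift(); // drop the oldest word to keep the window small
        }
        var firstIndex = i - stack.length + 1; // page index of stack[0]
        var text = stack.join(" ");
        var lastWordStart = text.length - stack[stack.length - 1].length;
        var re = new RegExp(pattern.source, flags);
        var m;
        while ((m = re.exec(text)) !== null) {
            if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
            var end = m.index + m[0].length;
            if (end > lastWordStart) {
                // Spaces before the match start = words skipped inside the window.
                var skipped = text.substring(0, m.index).split(" ").length - 1;
                results.push({ text: m[0], startWord: firstIndex + skipped, endWord: i });
            }
        }
    }
    return results;
}

// Usage outside Acrobat, with a plain array standing in for the page:
var sampleWords = ["See", "RFC", "2616", "and", "RFC", "822", "here"];
var hits = findWindowMatches(function (i) { return sampleWords[i]; },
                             sampleWords.length, /RFC \d+/, 3);
```

The window size is a trade-off: maxWords has to be at least the number of words the longest expected match can span, otherwise that match is never seen whole.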

Thom Parker
The source for PDF Scripting Info
[url=http://www.pdfScripting.com]pdfscripting.com[/url]

The Acrobat JavaScript Reference, Use it Early and Often
[url=http://www.adobe.com/devnet/acrobat/]http://www.adobe.com/devnet/acrobat/[/url]


EricFan
Registered: Sep 1 2008
Posts: 13
Hi Thomp.

Thanks for your reply. Based on your hint, I eventually worked it out.

The only small issue left is that I can't make the link I added look like a normal http hyperlink. Doc.addLink only draws a box around the string. addAnnot can highlight it, but it still looks strange.


The following is my solution for adding links from regex matches.

1. Get the words one by one with getPageNthWord.
2. Put all the words of one page into an array, then join them into a string with a separator (a single space) between each word.
3. Apply the regex to the string.
4. Use string.search to get the start position of each match.
5. Count the whitespace before the start position; consecutive whitespace characters are counted as one.
6. That count gives the index of the first word. Get the index of the last word the same way if the match spans several words.
7. Get the quads of the start word and the end word with getPageNthWordQuads.
8. Combine the quads.
9. Call addLink.
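The offset-to-word-index part of these steps, and the quad combining, might look something like the sketch below. It is a minimal illustration under stated assumptions rather than Eric's actual script: matchWordRanges, countSpaces, and boundingRect are hypothetical names, it uses RegExp exec in a loop instead of string.search so that every match is found, and it assumes the words contain no spaces and are joined with exactly one space. The quads are assumed to be arrays of eight numbers (four x, y pairs), and the result is returned as [left, top, right, bottom]; on rotated pages the coordinates would probably need converting before being passed to addLink.

```javascript
// Count the spaces in text before character offset `end`.
// With single-space joins this equals the index of the word at that offset.
function countSpaces(text, end) {
    var n = 0;
    for (var i = 0; i < end; i++) {
        if (text.charAt(i) === " ") { n++; }
    }
    return n;
}

// Steps 2-6: join the words, run the regex, map offsets to word indexes.
function matchWordRanges(words, pattern) {
    var text = words.join(" ");
    var re = new RegExp(pattern.source, "g" + (pattern.ignoreCase ? "i" : ""));
    var ranges = [];
    var m;
    while ((m = re.exec(text)) !== null) {
        if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
        ranges.push({
            text: m[0],
            firstWord: countSpaces(text, m.index),
            lastWord: countSpaces(text, m.index + m[0].length - 1)
        });
    }
    return ranges;
}

// Step 8: bounding box over several quad lists, each quad being
// [x1, y1, x2, y2, x3, y3, x4, y4]. Returns [left, top, right, bottom].
function boundingRect(quadLists) {
    var left = Infinity, bottom = Infinity, right = -Infinity, top = -Infinity;
    for (var i = 0; i < quadLists.length; i++) {
        for (var j = 0; j < quadLists[i].length; j++) {
            var q = quadLists[i][j];
            for (var k = 0; k < 8; k += 2) {
                left = Math.min(left, q[k]);
                right = Math.max(right, q[k]);
                bottom = Math.min(bottom, q[k + 1]);
                top = Math.max(top, q[k + 1]);
            }
        }
    }
    return [left, top, right, bottom];
}

// Usage with plain data standing in for one page of a PDF:
var pageWords = ["Call", "555", "1234", "or", "555", "9876", "now"];
var ranges = matchWordRanges(pageWords, /\d{3} \d{4}/);
var rect = boundingRect([
    [[10, 20, 30, 20, 10, 10, 30, 10]],   // quads of the first matched word
    [[35, 20, 60, 20, 35, 10, 60, 10]]    // quads of the last matched word
]);
// Inside Acrobat, the inputs would come from this.getPageNthWord(p, i) and
// this.getPageNthWordQuads(p, i), and rect would go to this.addLink(p, rect).
```

A single bounding box works well when the match sits on one line. If a match wraps onto the next line, the box covers both lines in full, so adding one link per line may look better.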


Eric

thomp
Expert
Registered: Feb 15 2006
Posts: 4411
Excellent Solution!!
