Text matching with regular expressions

Thom Parker – February 12, 2009

Scope: Acrobat 5.0 and later
Category: Automation
Skill Level: Intermediate and Advanced
Prerequisites: Basic Acrobat JavaScript Programming

Regular expressions are an ancient and powerful technique for finding patterns in text. Ancient, that is, in computer years. They are so useful that practically every computer language in use today provides support for using regular expressions. And, of course, this includes JavaScript. In fact, in Core JavaScript, regular expressions are a built-in data type, and are employed in several of the string operations.

Creating and using regular expressions is a large and complex topic. There are several books and websites devoted to it. This article provides an introduction to using regular expressions in Acrobat JavaScript and pointers to information for further study.

What’s a regular expression?

A regular expression is simply a string of characters that represents a text pattern. It is a mixture of ordinary text characters and characters with special meaning. In JavaScript's literal notation, a regular expression is enclosed in forward slashes, “/”. Here’s a simple example:

/dog/

This regular expression matches the word “dog.” The expression does not contain any special characters (only standard-text characters), and it matches these verbatim. It will not match “Dog” or “DOG.” Also, it will only match the first occurrence of “dog” in the text it’s used against. For example, the following sentence has two occurrences of “dog” in it. The regular expression above will find only the first one.

My dog smells worse than your dog.

We can modify the original expression to match all variations mentioned. We can set it up to be case insensitive, match special variations on the word “dog,” and match all occurrences in a piece of text. This is the power of regular expressions, i.e., the ability to tailor the expression to match exactly what we want in a string of text. We’ll explore several variations here, but first we need to see how regular expressions are used in JavaScript.



Regular expressions in Core JavaScript

Regular expressions don’t have anything to do specifically with Acrobat. They are a feature of Core JavaScript, which means the examples shown here will work in all JavaScript environments.

In JavaScript, a regular expression is represented with the “RegExp” object. There are two ways to create a new regular expression variable: with the literal notation and with the object notation. The two lines of code below create equivalent objects:

var myRegExp = /dog/;             // Literal Notation
var myRegExp = new RegExp("dog"); // Object Notation
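As a quick check on that equivalence, here is a minimal sketch that compares the two notations. The variable names are my own, chosen only for illustration. It also shows the one practical difference I'd watch for: in the object notation the pattern is an ordinary string, so a backslash has to be doubled.

```javascript
// Two ways to build the same pattern (variable names are illustrative only).
var literalRe = /dog/;             // literal notation
var objectRe = new RegExp("dog");  // object notation

var sample = "My dog smells worse than your dog.";

// Both objects carry the same pattern text and give the same test result.
var samePattern = (literalRe.source === objectRe.source);
var sameResult = (literalRe.test(sample) === objectRe.test(sample));

// In object notation the pattern is a string, so "\d" must be written "\\d".
var literalDigit = /\d/;
var objectDigit = new RegExp("\\d");
var sameDigitPattern = (literalDigit.source === objectDigit.source);
```

All three comparison variables come out true. In the Acrobat JavaScript Console you can simply evaluate a variable name to see its value.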

I find the literal notation much easier and more intuitive to work with than the object notation, so all the examples in this article will use the literal notation. Also, the examples are intended to be run in the Acrobat JavaScript Console. If you haven’t used the JavaScript Console before, then you need to. It is a vital tool for script development and debugging in Acrobat.

One of the most common uses of regular expressions in JavaScript is for testing a string for the existence of a pattern using the test() function. The test() function is a member of the RegExp Object and we use it like this:

var myRegExp = /dog/;
var myText = "My dog smells worse than your dog";
if(myRegExp.test(myText)) app.alert("Found a dog!!",2);

Try this code out in the JavaScript Console. It will display a popup-alert box because the regular expression finds a match in the text. Keep this script up in the console window. It’s a great one for experimenting with expressions. We’ll run through several variations on this script using different modifications of both the regular expression and the text.

For the first variation, change the string to:

var myText = "My doggie smells worse than your pooch";

Even though both occurrences of “dog” have changed, the test will still return true. That’s because the regular expression doesn’t care what’s in front of, or behind, the three letters “d”, “o”, “g.” It’s just looking for the three letters. To find the individual word “dog,” the expression needs to be modified to look for the non-word characters that surround it.

var myRegExp = /\Wdog\W/;

Now things are starting to look cryptic. That’s regular expressions. Remember, they date to the stone age of computing. The “\” character is called an “escape,” and it signals that the next character has a special meaning. The escape is used a lot. In the code above, “\W” (with a capital “W”) matches any single non-word character, such as a space, a new line, or a punctuation mark.

The current string and regular expression, as we’ve just modified them, will fail the test because the word “dog” does not exist by itself anywhere in the text. Now let’s change the text to:

var myText = "My pooch smells worse than your dog.";

This text will pass the test and display the alert, because the word “dog” is preceded by a space, and followed by a period. The period is a non-word character.
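One caveat worth knowing: “\W” has to match an actual character, so the expression above can't find “dog” when it sits at the very start or end of the text. Core JavaScript also provides “\b”, which matches a word boundary position without consuming a character. The sketch below compares the two; the sample strings and variable names are my own.

```javascript
// \W must consume a real character, so it fails when "dog" touches
// the start or end of the string. \b matches the boundary position itself.
var nonWordRe = /\Wdog\W/;
var boundaryRe = /\bdog\b/;

var samples = [
    "My pooch smells worse than your dog.",   // dog in the middle of the text
    "My pooch smells worse than your dog",    // dog at the very end
    "dog days are here",                      // dog at the very start
    "My doggie smells worse than your pooch"  // dog inside a longer word
];

var nonWordResults = [];
var boundaryResults = [];
for (var i = 0; i < samples.length; i++) {
    nonWordResults.push(nonWordRe.test(samples[i]));
    boundaryResults.push(boundaryRe.test(samples[i]));
}
// nonWordResults:  true, false, false, false
// boundaryResults: true, true,  true,  false
```

Both expressions reject “doggie,” but only the “\b” version finds “dog” at the edges of the text.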

Let’s make this more complex. Change the text to capitalize “Dog.”

var myText = "My pooch smells worse than your Dog!";

The test will now fail, because the upper case “D” in “Dog” doesn’t match the lower case “d” used in the regular expression. To make the expression match both “dog” and “Dog” change it like this:

var myRegExp = /[Dd]og/;

This expression uses square brackets “[ ]” to enclose a list of acceptable characters for a single-character match. The brackets can hold as many characters as the match requires. For example:

var myRegExp = /[Ddlgm]og/;

This expression matches “Dog,” “dog,” “log,” “gog” and “mog.”
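To see the bracket list at work, here is a small sketch that runs the expression against a handful of words. The word list and variable names are just for illustration.

```javascript
// Test a list of candidate words against the bracket expression.
var classRe = /[Ddlgm]og/;
var words = ["Dog", "dog", "log", "gog", "mog", "fog", "DOG"];

var matched = [];
for (var i = 0; i < words.length; i++) {
    if (classRe.test(words[i])) {
        matched.push(words[i]);
    }
}
// matched: Dog, dog, log, gog, mog
// "fog" fails because "f" is not in the list.
// "DOG" fails because "og" must still be lower case.
```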

But to get back on track, let’s say the match must be completely case insensitive. We don’t care which letters are capitalized. In this case, use this variation:

var myRegExp = /dog/i;

The “i” following the end of the expression is called an attribute (you will also see it called a flag). There are only a few of these attributes and they are generally for more advanced features. But this one is easy and makes the match case insensitive. Try it with this text:

var myText = "My pooch smells worse than your DOG!";
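Another attribute worth a quick look is “g” (global), which asks for all occurrences rather than just the first. The sketch below uses the string match() function to show the effect of “i” and “g” alone and combined. The sample text and variable names are mine.

```javascript
var text = "My dog smells worse than your DOG!";

var firstLower = text.match(/dog/);    // no attributes: first lower-case "dog"
var allLower = text.match(/dog/g);     // "g": every lower-case "dog"
var allAnyCase = text.match(/dog/gi);  // "g" and "i": every "dog", any case

// firstLower[0] is "dog"
// allLower holds one entry: "dog"
// allAnyCase holds two entries: "dog" and "DOG"
```

With “g” set, match() returns a plain array of the matched substrings, or null when nothing matches.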

For the final example, we’ll change the expression to match multiple characters.

var myRegExp = /do+g/;

The “+” symbol means match one or more occurrences of the preceding item. In this case, the “+” is preceded by the single “o” character, so the expression will match “dog,” “doog,” “dooog” or any other spelling with one or more “o”s. Try it with this sentence:

var myText = "My pooch smells worse than your doooog!";
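Here is a short sketch of how “+” behaves across several spellings. Keep in mind that this expression has no “i” attribute, so it is still case sensitive. The candidate words are my own.

```javascript
var plusRe = /do+g/;
var candidates = ["dg", "dog", "doog", "doooog", "Doooog"];

var plusResults = [];
for (var i = 0; i < candidates.length; i++) {
    plusResults.push(plusRe.test(candidates[i]));
}
// plusResults: false, true, true, true, false
// "dg" fails because "+" needs at least one "o".
// "Doooog" fails because of the capital "D".

// Adding the "i" attribute lets the capitalized spelling match too.
var anyCaseResult = /do+g/i.test("Doooog");
```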

A short reference

It would be impossible to provide a complete reference for using regular expressions here. They are just too rich for one article. Tables 1 and 2 below show a short list of commonly used special pattern-matching characters.

Table 1 - Character Matching

Special Character   Meaning
\d                  Matches a digit, 0-9
\D                  Matches anything but 0-9
\s                  Matches white space, including spaces, tabs, and new lines
\S                  Matches anything but white space
\w                  Matches word characters: a-z, A-Z, 0-9, and the underscore
\W                  Matches anything but a word character
.                   Matches any character except a new line
^                   Matches the beginning of the text (or of each line when the “m” attribute is set)
$                   Matches the end of the text (or of each line when the “m” attribute is set)
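The entries in Table 1 can be tried out with a few one-line tests. This is only a sketch with sample strings of my own; each comment states what the expression is checking.

```javascript
var hasDigit = /\d/.test("Page 12");           // true: contains a digit
var threeDigits = /^\d\d\d$/.test("123");      // true: exactly three digits
var fourDigits = /^\d\d\d$/.test("1234");      // false: one digit too many
var hasSpace = /\s/.test("no_spaces_here");    // false: no white space
var threeWordChars = /^\w\w\w$/.test("a_1");   // true: letter, underscore, digit
var anyMiddle = /d.g/.test("d#g");             // true: "." accepts the "#"
var startsWithMy = /^My/.test("My dog");       // true: text begins with "My"
var endsWithDog = /dog$/.test("My dog");       // true: text ends with "dog"
var endsWithDot = /dog$/.test("My dog.");      // false: the period comes last
```

Combining “^” and “$” as in the three-digit test is a handy way to require that the whole text fit the pattern, not just part of it.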

Table 2 - Character Repetition

Special Character   Meaning
?                   Match 0 or 1 occurrence of the previous item
*                   Match 0 or more occurrences of the previous item
+                   Match 1 or more occurrences of the previous item
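To compare the three repetition characters side by side, the sketch below applies each one to the “o” in “dog” and tests the same three spellings. The helper function and its name are hypothetical, written just for this comparison.

```javascript
// Hypothetical helper: run one expression against every input string.
function testAll(re, inputs) {
    var results = [];
    for (var i = 0; i < inputs.length; i++) {
        results.push(re.test(inputs[i]));
    }
    return results;
}

var spellings = ["dg", "dog", "doog"];

var optional = testAll(/do?g/, spellings);   // "?": zero or one "o"
var zeroOrMore = testAll(/do*g/, spellings); // "*": zero or more "o"s
var oneOrMore = testAll(/do+g/, spellings);  // "+": one or more "o"s

// optional:   true,  true, false
// zeroOrMore: true,  true, true
// oneOrMore:  false, true, true
```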

The special characters in Table 2 and the last three in Table 1, as well as other special characters such as the square brackets and parentheses (which weren’t discussed), can’t be used directly to match their own characters in a text string because they are themselves special characters. The way around this limitation is to prefix them with the escape character, “\”. Here’s an example that matches dollar amounts:

var myRegExp = /\$\d?\d\.\d\d/;
var myText = "The hot dog cost $1.75!";

This expression will match the dollar sign, followed by one or two digits, followed by the decimal point (i.e., period), followed by two digits. From Table 2, you can see the “?” character means match 0 or 1 of the preceding item. In this expression, it makes the first digit optional.
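Beyond a yes/no test, the same expression can pull the amount out of the text. Here is a minimal sketch using the string match() function; the second sample string and the variable names are my own.

```javascript
var priceRe = /\$\d?\d\.\d\d/;
var receipt = "The hot dog cost $1.75!";

// match() returns an array whose first entry is the matched text,
// or null when there is no match.
var found = receipt.match(priceRe);
var firstPrice = found ? found[0] : null;   // "$1.75"

// With the "g" attribute, match() returns every amount in the text.
var menu = "Hot dog $1.75, soda $0.99, combo $12.50";
var allPrices = menu.match(/\$\d?\d\.\d\d/g);
// allPrices: $1.75, $0.99, $12.50

// No dollar amount in the text means a null result.
var noPrice = "The hot dog was free".match(priceRe);
```

Checking for null before using the result, as done for firstPrice, avoids an error when the text contains no amount.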

More information

What I’ve shown in this article represents the simplest and most common usage of the regular expression. There is much, much more. For instance, the “String.match()” function can be used to find and extract multiple substrings from a piece of text. The “String.search()” and “String.replace()” functions use regular expressions as input to do advanced searches and string replacement. You’ll find more information on these functions in any JavaScript reference. My favorite is “JavaScript: The Definitive Guide” from O’Reilly. The official Core JavaScript web reference is here:

http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Reference
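As a small taste of the string functions mentioned above, here is a sketch that runs search(), replace(), and match() against the sentence used earlier in this article. The variable names are mine, and this only scratches the surface of what these functions can do.

```javascript
var sentence = "My dog smells worse than your dog.";

// search() returns the position of the first match, or -1 if none.
var position = sentence.search(/dog/);   // 3
var missing = sentence.search(/cat/);    // -1

// replace() changes only the first match unless the "g" attribute is set.
var firstOnly = sentence.replace(/dog/, "cat");
var everyOne = sentence.replace(/dog/g, "cat");
// firstOnly: "My cat smells worse than your dog."
// everyOne:  "My cat smells worse than your cat."

// match() with "g" collects every occurrence.
var hits = sentence.match(/dog/g);       // two entries, both "dog"
```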

There are entire books covering the subject of regular expressions and there’s a vast library of information available on the web. Just do a search for “Regular Expression.” One of the best sites is this one:

http://regexlib.com/

It has a library of cut-and-paste regular expressions, as well as tools for building and testing regular expressions.

Topics: JavaScript
