Text matching with regular expressions using JavaScript

Learn how to code Acrobat JavaScript to support using regular expressions.

By Thom Parker – October 29, 2013

 

Scope: All Acrobat versions
Skill Level: Intermediate
Prerequisites: Familiarity with the Acrobat JavaScript Console

Regular expressions are an ancient and powerful technique for finding patterns in text. They have been tested by the ages and been found to be so useful that practically every computer language in use today provides support for using regular expressions. And, of course, this includes Acrobat JavaScript. In fact, in Core JavaScript, regular expressions are a built-in data type and are employed in several of the standard string operations.

Regular expressions are a large and complex topic. There are several books and websites devoted to it. However, they don't have to be difficult. With just a little knowledge it is easy to create some very useful pattern matching expressions. This article will ease you into the rich and powerful world of regular expressions for Acrobat JavaScript through some surprisingly simple examples, and will also point you to resources for further study.

What is a Regular Expression?

A regular expression is simply a string of characters that represents a text pattern. It is a mixture of ordinary text characters and characters with special meaning, enclosed in forward slashes, "/". The forward slashes are the syntax that delimits a regular expression. Here's a simple example:
/dog/ 

This regular expression matches the word "dog." The expression does not contain any special characters (only standard-text characters). It is case sensitive and it matches the specified characters verbatim, nothing more and nothing less. It matches them in the order and case in which they are written. It will not match "Dog" or "DOG" or "doog." Also, it will only match the first occurrence of "dog" in the text to which it is applied. For example, the following sentence includes two occurrences of "dog." The regular expression above will find only the first one.

My dog smells worse than your dog.
The original regular expression can easily be modified to be case insensitive and to match all occurrences by adding two special letters after the closing slash.
/dog/ig

In the following text, we'll discuss the details of this and many other simple variations that can be made to tailor the base expression to match nearly any criteria. The great power of regular expressions is that they are flexible, i.e., they can match a wide range of strings, from specific words to general patterns. To understand how this is done, we first need to see how regular expressions are used in JavaScript.

Regular expressions in Core JavaScript

Regular expressions don't have anything to do specifically with Acrobat. They are a feature of Core JavaScript, which means the regular expressions shown here will work in all JavaScript environments. In JavaScript, a regular expression is represented with the "RegExp" object. There are two ways to create a new regular expression variable: with the literal notation and with the object notation. The two lines of code below create equivalent objects:
var myRegExp = /dog/;             // Literal Notation 
var myRegExp = new RegExp("dog"); // Object Notation

I find the literal notation much easier and more intuitive to work with than the object notation, so all the examples in this article will use the literal notation. Also, the examples are intended to be run in the Acrobat JavaScript Console. If you haven't used the JavaScript Console before, then you need to read the linked article. It is a vital tool for script development and debugging in Acrobat.
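That said, the object notation has one practical use worth knowing: the pattern is an ordinary string, so it can come from a variable. Here is a minimal sketch of that idea. The variable names and the value of "word" are only illustrative.

```javascript
// Literal notation: the pattern is fixed when the script is written.
var literalRe = /dog/ig;

// Object notation: the pattern is an ordinary string, so it can come
// from a variable. The trailing letters are passed as a second string.
var word = "dog";   // illustrative value
var objectRe = new RegExp(word, "ig");

// Both objects describe the same pattern with the same settings.
var samePattern = (literalRe.source === objectRe.source);            // true
var sameSettings = (literalRe.ignoreCase === objectRe.ignoreCase) &&
                   (literalRe.global === objectRe.global);           // true
```

Type a variable name such as "samePattern" into the console to see its value.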

One of the most common uses of regular expressions in JavaScript is testing a string for the existence of a pattern with the "test()" function. The "test()" function is a member of the RegExp Object and we use it like this:

var myRegExp = /dog/;
var myText = " My dog smells worse than your dog";
if(myRegExp.test(myText))
    app.alert("Found a dog!", 2);

Try this code in the JavaScript Console. It will display a popup-alert box because the regular expression finds a match in the text. Keep this script displayed in the console window. It's a useful piece of code for experimenting with expressions. We'll run through several variations on this script using different modifications of both the regular expression and the text.
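It helps to know that "test()" simply returns true or false, so the result can also be stored in a variable instead of driving an alert. The sketch below shows this with a few strings of my own choosing; the variable names are illustrative.

```javascript
var myRegExp = /dog/;

// test() returns true when the pattern occurs anywhere in the string
var hasDog = myRegExp.test(" My dog smells worse than your dog");   // true

// ...and false when it does not occur at all
var hasCat = myRegExp.test(" My cat smells fine");                  // false

// The match is case sensitive, so "Dog" is not found
var hasCapitalDog = myRegExp.test(" My Dog smells");                // false
```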

For the first variation, change the string to:

var myText = " My doggie smells worse than your pooch";

Even though both occurrences of "dog" have changed, the test will still return true. That's because the regular expression doesn't care what's in front of or behind the pattern. It's just looking for the three letters, exactly as they are written in the expression. To find the individual word "dog," the expression needs to be modified to look for the non-word characters that surround a word.

var myRegExp = /\Wdog\W/; 

Now things are starting to look cryptic. That's one of the main characteristics of regular expressions: they can look scary. Remember, regular expressions date to the stone age of computing, but they are not as bad as they look. With a little knowledge, writing these expressions will soon seem easy. For example, the "\" character in the expression above is called an "escape," and it tells us the next character has a special meaning. The escape is used a lot. It gives regular characters special meaning and turns special characters into regular characters. The special meaning of the "W" is to match any single non-word character, such as a space, a new line, or a punctuation mark.

The current string and regular expression, as we've just modified them, will fail the test because the word "dog" does not exist by itself anywhere in the text. Now let's change the text to:

var myText = " My pooch smells worse than your dog.";

This text will pass the test and display the alert because the word "dog" is preceded by a space, and followed by a period. The period is a non-word character.
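One caveat with "\W" is that it has to match an actual character, so the expression above will not find "dog" when the word is the very last thing in the string. Core JavaScript also provides "\b", which matches the position at the edge of a word without needing a character there. The sketch below compares the two; the variable names are illustrative.

```javascript
var withNonWord = /\Wdog\W/;    // needs a real character on each side
var withBoundary = /\bdog\b/;   // only needs a word edge on each side

var endsWithPeriod = " My pooch smells worse than your dog.";
var endsWithDog = " My pooch smells worse than your dog";
var doggie = " My doggie smells worse than your pooch";

var a = withNonWord.test(endsWithPeriod);   // true
var b = withNonWord.test(endsWithDog);      // false, nothing follows "dog"
var c = withBoundary.test(endsWithDog);     // true
var d = withBoundary.test(doggie);          // false, "dog" is part of "doggie"
```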

Let's make this more complex. Change the text to capitalize "Dog."

var myText = " My pooch smells worse than your Dog!";

The test will now fail, because the upper case "D" in "Dog" does not match the lower case "d" used in the regular expression. To make the expression match both "dog" and "Dog" change it like this:

var myRegExp = /[Dd]og/; 

The square brackets "[ ]" enclose a list of acceptable variations in a single-character match. As many characters can be put in square brackets as needed to cover all the variations required for the match. For example:

var myRegExp = /[Ddlgm]og/; 

This expression matches "Dog," "dog," "log," "gog" and "mog."
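A quick way to check a claim like this is to run the expression over a list of candidate words. The sketch below does that; the word list, including the two words that should fail, is my own.

```javascript
var myRegExp = /[Ddlgm]og/;
var words = ["Dog", "dog", "log", "gog", "mog", "fog", "DOG"];

// Collect only the words the expression accepts
var matched = [];
for (var i = 0; i < words.length; i++) {
    if (myRegExp.test(words[i]))
        matched.push(words[i]);
}
// matched now holds "Dog", "dog", "log", "gog" and "mog".
// "fog" fails because "f" is not in the brackets, and "DOG" fails
// because the "og" part is still case sensitive.
```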

But to get back on track, let's say the match must be completely case insensitive. We don't care which, if any, letters are capitalized. In this case, use:

var myRegExp = /dog/i; 

The "i" following the end of the expression is called an attribute (many references call it a flag). There are only a few attributes and they are generally for more advanced features. But this one is easy: it makes the match case insensitive. Try it with this text:

var myText = " My pooch smells worse than your DOG!";

For the next example, we'll change the expression to match multiple characters.

var myRegExp = /do+g/i;

The "+" symbol means match one or more occurrences of the preceding thing. In this case, the "+" is preceded by the single "o" character, so it will match "dog," "doog," "dooog" or any other spelling with one or more "o"s between the "d" and the "g." Try it with this sentence:

var myText = " My pooch smells worse than your Doooog!";
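The sketch below checks the "+" against a few spellings, and also shows why the "i" attribute matters for the sentence above, which starts the word with a capital "D." The word list and variable names are illustrative.

```javascript
var words = ["dg", "dog", "doog", "dooooog"];

// Test each spelling against "d", one or more "o", then "g"
var results = [];
for (var i = 0; i < words.length; i++) {
    results.push(/do+g/.test(words[i]));
}
// results holds false, true, true, true: "dg" has no "o" at all

var myText = " My pooch smells worse than your Doooog!";
var caseSensitive = /do+g/.test(myText);      // false, because of the capital "D"
var caseInsensitive = /do+g/i.test(myText);   // true
```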

Detecting No Text

Now let's take a small diversion and look at one of the most common regular expressions that I use, the empty test. I use it mostly to detect empty form field values and empty string variables.

var rgEmpty = /^\s*$/;

This expression looks very cryptic because it is composed entirely of special characters, but it is much simpler than it first appears. The caret symbol "^" matches the beginning of the text and the dollar sign "$" matches the end of the text. Using both means the rest of the pattern must account for the entire text, from the beginning to the end. The rest of the pattern is composed of two elements, the "\s" special character and the asterisk "*" special character. The "\s" matches any white space. White space is anything you can't actually see but has an effect on the text, such as spaces, tabs, and new lines. The "*" symbol means match zero or more occurrences of the preceding thing. So this pattern matches either nothing (an empty string) or a string made up entirely of white space.
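Here is a minimal sketch of how the empty test might be wrapped up for reuse. The function name "isBlank" is hypothetical, not something built into JavaScript or Acrobat.

```javascript
var rgEmpty = /^\s*$/;

// Hypothetical helper: true when the text is empty or only white space
function isBlank(text) {
    return rgEmpty.test(text);
}

var r1 = isBlank("");          // true, an empty string
var r2 = isBlank("   \t\n");   // true, only spaces, a tab and a new line
var r3 = isBlank("  dog  ");   // false, there is visible text
```

In a form script, the text would typically come from a field, for example something like this.getField("Name").valueAsString, where the field name "Name" is an assumption for illustration.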

Text Replacement

Regular Expressions are used in many different ways within the Core JavaScript model, but one of the most useful applications is text replacement. In the following example, the expression replaces the word "dog" with the word "pooch".

var myText = " My dog smells worse than your dog";
var myNewText = myText.replace(/dog/,"pooch");

Notice that the "replace()" function is a member of the String Object, not the Regular Expression Object. The regular expression is the first argument to this function. When this code is run, the result is placed in the variable "myNewText." Try it, and you'll see that only the first occurrence of "dog" is replaced. To replace all occurrences, the regular expression needs to be modified like this:

myNewText = myText.replace(/dog/g,"pooch");  

Notice the "g" attribute added to the expression. It means global, so the pattern is applied globally to the text string.
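The sketch below puts the variations side by side, and also shows that "replace()" returns a new string rather than changing the original. The variable names and the second test string are my own.

```javascript
var myText = " My dog smells worse than your dog";

// Without "g", only the first occurrence is replaced
var firstOnly = myText.replace(/dog/, "pooch");
// " My pooch smells worse than your dog"

// With "g", every occurrence is replaced
var allDogs = myText.replace(/dog/g, "pooch");
// " My pooch smells worse than your pooch"

// "i" and "g" together replace every occurrence regardless of case
var anyCase = "Dog, dog, DOG".replace(/dog/ig, "pooch");
// "pooch, pooch, pooch"

// The original string is never modified
var unchanged = (myText === " My dog smells worse than your dog");   // true
```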

A short reference

It would be impossible to provide a complete reference for using regular expressions here. They are just too rich for one article. Tables 1 and 2 below show a short list of commonly used special pattern-matching characters.

Table 1 - Character Matching

Special Character    Meaning
\d                   Matches a digit, 0-9
\D                   Matches anything but a digit
\s                   Matches white space, including spaces, tabs, and new lines
\S                   Matches anything but white space
\w                   Matches word characters: a-z, A-Z, 0-9, and the underscore
\W                   Matches anything but a word character
.                    Matches any character except a new line
^                    Matches the beginning of the text
$                    Matches the end of the text

Table 2 - Character Repetition

Special Character    Meaning
?                    Match 0 or 1 occurrence of the previous item
*                    Match 0 or more occurrences of the previous item
+                    Match 1 or more occurrences of the previous item
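To get a feel for how the entries in the two tables combine, here are a few small expressions of my own, each with strings that pass and fail. None of them come from a library; they are just illustrations.

```javascript
// \d with "+": one or more digits, and nothing else
var digitsOnly = /^\d+$/;
var t1 = digitsOnly.test("2013");     // true
var t2 = digitsOnly.test("20a13");    // false, "a" is not a digit

// "?": the final "s" is optional
var plural = /^dogs?$/;
var t3 = plural.test("dog");          // true
var t4 = plural.test("dogs");         // true
var t5 = plural.test("dogss");        // false, only one "s" is allowed

// \w with "*": a "d" followed by zero or more word characters
var startsWithD = /^d\w*$/;
var t6 = startsWithD.test("d");       // true
var t7 = startsWithD.test("dog_2");   // true
var t8 = startsWithD.test("d og");    // false, a space is not a word character

// ".": exactly one character of any kind between "d" and "g"
var anyMiddle = /^d.g$/;
var t9 = anyMiddle.test("d-g");       // true
var t10 = anyMiddle.test("dg");       // false, the "." needs a character
```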

The special characters in Table 2 and the last three in Table 1, as well as other special characters like the square brackets and parentheses (which weren't discussed), can't be used directly to match their respective characters in a text string because, of course, they are themselves special characters. The way to get around this limitation is to prefix them with the escape character, the backslash "\". Here's an example that matches dollar amounts:

var myRegExp = /\$\d?\d\.\d\d/;
var myText = " The hot dog cost $1.75!";

This expression will match the dollar sign, followed by one or two digits, followed by the decimal point (i.e., period), followed by two digits. From Table 2, you can see the "?" character means match 0 or 1 of the preceding item. In this expression, it means match zero or one digits.
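The "test()" function only says whether an amount is present. To pull the matched text out, one option is the "match()" function of the String Object, which returns an array whose first element is the matched text, or null when nothing matches. The sketch below is one way to use it; the second sentence and the variable names are my own.

```javascript
var myRegExp = /\$\d?\d\.\d\d/;
var myText = " The hot dog cost $1.75!";

// match() returns an array on success, or null when nothing matches
var found = myText.match(myRegExp);
var amount = found ? found[0] : "";     // "$1.75"

// The optional first digit allows two-digit dollar amounts as well
var found2 = " The dog food cost $12.50.".match(myRegExp);
var amount2 = found2 ? found2[0] : "";  // "$12.50"

// Without the escape, "$" means "end of text" and the test can never pass
var withoutEscape = /$\d?\d\.\d\d/.test(myText);   // false
```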

More information

What is shown in this article represents the simplest and most common usage of the regular expression. There is much, much more. For instance, the "String.match()" function can be used to find and extract multiple substrings from a piece of text. The "String.search()" and "String.split()" functions use regular expressions as input to do advanced searches and string splitting. You'll find more information on these functions in any JavaScript reference. My favorite is "JavaScript: The Definitive Guide" by David Flanagan, published by O'Reilly. The official Core JavaScript web reference is here:

http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Reference
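As a brief taste of the three String functions mentioned above, here is a minimal sketch. The sample text and variable names are made up for illustration.

```javascript
var myText = "Hot dog $1.75, corn dog $2.50, soda $0.99";

// match() with the "g" attribute returns every match in an array
var prices = myText.match(/\$\d?\d\.\d\d/g);
// prices holds "$1.75", "$2.50" and "$0.99"

// search() returns the position of the first match, or -1 if none
var firstDog = myText.search(/dog/);    // 4
var firstCat = myText.search(/cat/);    // -1

// split() breaks the text apart wherever the pattern matches,
// here a comma with optional white space around it
var items = myText.split(/\s*,\s*/);
// items holds "Hot dog $1.75", "corn dog $2.50" and "soda $0.99"
```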

There are entire books covering the subject of regular expressions and there's a vast library of information available on the web. Just do a search for "Regular Expression." One of the best sites is this one:

http://regexlib.com/

It has a library of cut-and-paste regular expressions for all kinds of common tasks (such as validating a telephone or social security number), as well as tools for building and testing regular expressions.



Products covered:

Acrobat XI, Acrobat X, Acrobat 9
