Acrobat User Community

Extracting pages from a PDF with Acrobat JavaScript

By Thom Parker – February 12, 2009

Scope: Acrobat 5.0 and later
Category: Automation
Skill Level: Intermediate and Advanced
Prerequisites: Basic Acrobat JavaScript Programming

Imagine receiving a large, automatically generated report in PDF that needs to be sliced and diced so different parts can be sent to clients or other departments. Not an uncommon activity, and one that’s possible to do manually with Acrobat Professional. Now imagine having to do this every week to a document that needs to be split 100 different ways. That’s a big task, one prone to human error. Fortunately, this can be easily automated with Acrobat JavaScript.

About page extraction

Page extraction is performed with the doc.extractPages() function. This function takes three input arguments: nStart and nEnd, the page numbers for the beginning and end of the extraction, and cPath, a path to a PDF file where the extracted pages are saved.

This is a simple function to use, especially since all the input arguments are optional. But it does have a couple of restrictions. First, page extraction cannot be done in the free Adobe Reader; this can only be done with Acrobat Professional or Standard. Second, due to security restrictions in Acrobat scripting, the path input can only be used if this function is called from a privileged context. This means the path input cannot be used if this function is run from a script in a PDF file. Extracting pages is for automation, not document interactivity. Automation scripts include JavaScript code run from the JavaScript Console, a Batch Process, or a Folder Level Script.

All the examples in this article will be run from the Acrobat Console Window, which is a privileged context and also very handy for running quick cut-and-paste automation scripts. I’ve made up an example file for testing. Download this file and save it to a local folder on your system:

Example file
NelsonsInc_Employee1040s.pdf

This file was generated from the accounting mainframe at Nelson’s Buggy Whips. In 1864, Nelson’s provided all its employees with filled-out 1040s to make it easier for them to file taxes. The sample above is a single file with all employees’ 1040s included. It was generated for print, but now needs to be split and e-mailed to the individual employees.

We’ll start off with some simple examples before getting into the full automation script.

Open the example file in Acrobat Professional, then open the JavaScript Console by pressing Ctrl+J on Windows, or Command+J on Mac.

To extract a single page from the document, specify only the nStart input. Run the following code in the JavaScript Console:

this.extractPages({nStart:5});

If your screen isn’t large enough to accommodate both the Console Window and Acrobat, close the Console Window. Notice Acrobat has created a new temporary file with a single page (page six) from the original document. It’s very important to remember that page numbers in JavaScript are zero-based, i.e., page zero in JavaScript is page one in the Acrobat viewer.

Notice also that Acrobat created a temporary document to hold the extracted page. This is because the path input, cPath, was not specified. Look back in the Console Window (Figure 1). Running the code printed the return value as the text [object Doc]. If you are using Acrobat 7 or earlier, the output will be slightly different. For Acrobat 7, the output will be [object Global].

Figure 1 – Document object returned from extractPages() function.

The extractPages() function returns a pointer to the newly created document object with the extracted pages. If this code were part of a larger script, the document pointer would be critical for actually doing something with the extracted pages. We’ll get to this in a later example.

Delete the temporary PDF. Note: be sure to do this for every example that creates a temporary PDF so you don’t get mixed up about which document you are working on.

Let’s do this again, using a simple path argument:

this.extractPages({nStart:5, cPath: "TestExtract1.pdf"});

This time, the extractPages() function returns null, and no temporary PDF is created. Look in the folder where you saved the example file. There will be a new file in that folder named “TestExtract1.pdf.” Acrobat saved the extracted page, so there was no need to return a document pointer.

Before we move to the next example, it’s worthwhile to point out the notation used to pass the arguments into the function. This “Object Style” notation is an Acrobat DOM feature, not a core JavaScript feature. It only works on functions that are part of the Acrobat JavaScript Model. It’s useful because it eliminates having to specify the other optional arguments, but it’s not necessary. The first example could have been run like this:

this.extractPages(5);

Or the second example like this:

this.extractPages(5, 5, "TestExtract1.pdf");

Which leads into the next example, using the nEnd input. Using nEnd by itself extracts all pages from the beginning of the document to the page value specified by nEnd. Run this code in the Console Window:

this.extractPages({nEnd:5});

This code extracts pages one through six. It is exactly the same as running this code:

this.extractPages(0,5);

To extract the pages from page six (zero-based page number 5) to the end of the document, use this code:

this.extractPages(5, this.numPages-1 );

where this.numPages is a document property that returns the number of pages in the document. So, (this.numPages-1) is the page number for the last page in the file.
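
Because the viewer shows one-based page numbers while extractPages() expects zero-based ones, a small conversion helper can guard against off-by-one mistakes. The following is only a sketch: the function name viewerRangeToArgs and its argument names are made up for this illustration and are not part of the Acrobat API.

```javascript
// Hypothetical helper (not part of the Acrobat API): converts the one-based
// page numbers shown in the viewer into zero-based extractPages() arguments.
function viewerRangeToArgs(nFirstPage, nLastPage, nNumPages) {
    // Treat a missing last page as "through the end of the document"
    if (nLastPage === undefined || nLastPage === null) {
        nLastPage = nNumPages;
    }
    if (nFirstPage < 1 || nLastPage > nNumPages || nFirstPage > nLastPage) {
        throw new Error("Page range " + nFirstPage + "-" + nLastPage +
                        " is outside 1-" + nNumPages);
    }
    return { nStart: nFirstPage - 1, nEnd: nLastPage - 1 };
}

// In the Console, with the example file open:
//   this.extractPages(viewerRangeToArgs(6, null, this.numPages));
```

The returned object uses the same nStart and nEnd names as the object-style notation shown earlier, so it can be passed straight to extractPages().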

Creating a cut-and-paste automation script

Now we’re ready to create the script to split all the 1040s and e-mail them to the right people. Let’s start with breaking out the individual forms for the employees.

Each 1040 form has four pages. Forms were simpler in 1864 (although the tax calculations were still incomprehensible), with no schedules or related forms, so we can write a single loop to both extract the pages and e-mail the documents.

for(var i=0; i<this.numPages; i+=4) {
	var oNewDoc = this.extractPages({nStart: i, nEnd: i + 3});
	oNewDoc.mailDoc( … );
	oNewDoc.closeDoc(true);
}

This script walks through the document extracting four-page blocks. The extractPages() function returns a pointer to the newly created document object, which is then used to e-mail the document, and finally to close it before moving on to the next extraction. You can look up the mailDoc() and closeDoc() functions in the Acrobat JavaScript Reference.
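
The loop assumes the page count is an exact multiple of four. If that might not hold, the block boundaries can be computed first and the last block clamped to the final page. This is a sketch under that assumption; buildBlockRanges is a hypothetical helper name, not an Acrobat function.

```javascript
// Hypothetical helper: splits nNumPages into consecutive blocks of
// nBlockSize pages and returns zero-based {nStart, nEnd} pairs.
function buildBlockRanges(nNumPages, nBlockSize) {
    if (nBlockSize < 1) {
        throw new Error("Block size must be at least 1");
    }
    var aRanges = [];
    for (var i = 0; i < nNumPages; i += nBlockSize) {
        // Clamp the last block so nEnd never points past the final page
        var nEnd = Math.min(i + nBlockSize - 1, nNumPages - 1);
        aRanges.push({ nStart: i, nEnd: nEnd });
    }
    return aRanges;
}

// In the Console, with the example file open:
//   var aRanges = buildBlockRanges(this.numPages, 4);
//   for (var k = 0; k < aRanges.length; k++) {
//       var oNewDoc = this.extractPages(aRanges[k]);
//       // e-mail and close oNewDoc here
//   }
```

Separating the range calculation from the extraction also makes it easy to print the ranges in the Console and check them before any pages are extracted.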

One thing is missing from this script: Where do the e-mail addresses come from? For simplicity, we’ll modify the code to use a list of names and e-mail addresses.

var aEmailList = ["[email protected]","[email protected]","[email protected]"];
for(var i=0,j=0; i<this.numPages && j<aEmailList.length; i+=4,j++) {
	// Extract one four-page form into a temporary document
	var oNewDoc = this.extractPages({nStart: i, nEnd: i + 3});
	// Build file name and path for new file
	var cFlName = aEmailList[j].split("@").shift() + "_1040.pdf";
	var cPath = oNewDoc.path.replace(oNewDoc.documentFileName,cFlName);
	oNewDoc.saveAs(cPath);
	// Send without showing the mail dialog, then close without saving again
	oNewDoc.mailDoc(false, aEmailList[j]);
	oNewDoc.closeDoc(true);
}

A second variable is added to the for statement for walking through the array of e-mails, and a saveAs command is included. Copy and paste the above code into the Console Window. Make sure to select all lines in the script before running it, so all the code is executed at the same time. Acrobat will go out to lunch for a short time. When it returns, you should have three new e-mails in your out folder, each with a PDF attachment.

Unfortunately, the name of the temporary file created by extracting the pages is a bit cryptic, and it ends with “.tmp” instead of “.pdf.” Files should have sensible names so it’s easier to tell a bit about the contents from the name. But we have a potentially bigger problem because of the “.tmp” extension. It’s possible an e-mail server will block an attachment with this extension. The code for creating a new file name and the doc.saveAs() function were added to the script to fix these issues. It saves the temporary file to a name derived from the e-mail address. For example, the first set of extracted pages will be saved to “HBabner_1040.pdf.” The file is saved to a temporary file folder, so it can be cleaned up easily later.
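
The file-naming step can be pulled out into small functions so it can be checked on its own before any e-mail is sent. This is a sketch only: fileNameFromEmail and replaceFileName are hypothetical helper names, the set of characters treated as unsafe is an assumption, and the address and temporary path in the usage comments are placeholders rather than values from the example file.

```javascript
// Hypothetical helper: derives a PDF file name from the part of an
// e-mail address before the "@".
function fileNameFromEmail(cEmail, cSuffix) {
    var cUser = cEmail.split("@").shift();
    // Replace characters outside an assumed, conservative safe set
    cUser = cUser.replace(/[^A-Za-z0-9._-]/g, "_");
    return cUser + cSuffix + ".pdf";
}

// Hypothetical helper: swaps the file name at the end of a
// device-independent path, which uses "/" as its separator.
function replaceFileName(cPath, cNewName) {
    return cPath.substring(0, cPath.lastIndexOf("/") + 1) + cNewName;
}

// Usage with placeholder values:
//   fileNameFromEmail("jdoe@example.com", "_1040")
//   replaceFileName("/C/Temp/Acr1234.tmp", "jdoe_1040.pdf")
```

Replacing everything after the last "/" avoids relying on the temporary file name appearing only once in the path.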

This is a pretty simple script that can make our job a lot easier. But, what if the individual 1040s varied in page length, or the document was so huge it wasn’t practical to set up the e-mail addresses to match the extraction order? How do we make a more flexible automation script?

All these issues can be handled with Acrobat JavaScript. For example, we could use the doc.getPageNthWord() function to both find the page ranges and extract the employee’s name. This information could then be used to look up the e-mails on a local list, or even the company’s server. But, that is a much more complex script, so it will have to wait for another day.
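
As a rough idea of how the page ranges might be found, the sketch below scans the first few words of every page for a marker word and treats each page where it appears as the start of a new form. It relies on the doc.getPageNumWords() and doc.getPageNthWord() functions. The helper names, the choice of marker and the number of words scanned are all assumptions made for illustration; a real report would need a marker that reliably appears only on the first page of each form.

```javascript
// Sketch: returns the zero-based page numbers on which cMarker appears
// among the first nMaxWords words. oDoc is an Acrobat Doc object.
function findFormStarts(oDoc, cMarker, nMaxWords) {
    var aStarts = [];
    for (var p = 0; p < oDoc.numPages; p++) {
        var nWords = Math.min(oDoc.getPageNumWords(p), nMaxWords);
        for (var w = 0; w < nWords; w++) {
            // Third argument true strips punctuation and white space
            if (oDoc.getPageNthWord(p, w, true) === cMarker) {
                aStarts.push(p);
                break;
            }
        }
    }
    return aStarts;
}

// Turns the start pages into {nStart, nEnd} ranges for extractPages().
// Each form runs up to the page before the next start page.
function startsToRanges(aStarts, nNumPages) {
    var aRanges = [];
    for (var k = 0; k < aStarts.length; k++) {
        var nNext = (k + 1 < aStarts.length) ? aStarts[k + 1] : nNumPages;
        aRanges.push({ nStart: aStarts[k], nEnd: nNext - 1 });
    }
    return aRanges;
}

// In the Console (the marker word here is an assumption):
//   var aStarts = findFormStarts(this, "1040", 20);
//   var aRanges = startsToRanges(aStarts, this.numPages);
```

Any pages before the first marker are left out of the ranges, so it is worth printing aRanges in the Console and comparing it against the document before extracting anything.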

Using the example scripts

In this article, we ran the example code by copying and pasting the scripts into the JavaScript Console Window. In fact, for doing simple automation tasks, it’s a good idea to place all your favorite scripts into a plain-text document from which you can copy and paste.

To extract pages from a group of files, you would use a Batch Sequence. Batch Sequences are a privileged context, so all the example code can be copied directly into a Batch Sequence.

A more interesting and useful way to run an automation script is with an Acrobat toolbar button or menu item. However, using one of these options requires that the code be enclosed in a trusted function. Code for creating toolbar buttons and trusted functions can be found in this article, Applying PDF security with Acrobat JavaScript.
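
For orientation only, a folder-level script along these lines might wrap the extraction in a trusted function and attach it to a menu item. This is a minimal sketch, not the code from the article mentioned above: the names extractFirstPagesImpl, extractFirstPages and "ExtractFirstPages", the menu placement and the output file name are all made up for illustration, and nCount is assumed to be at least 1. The typeof checks are there only so the file can be loaded outside Acrobat without errors.

```javascript
// Sketch of a folder-level script; all names here are hypothetical.
function extractFirstPagesImpl(oDoc, nCount, cPath) {
    // Raise privilege only when running inside Acrobat
    if (typeof app !== "undefined") { app.beginPriv(); }
    // Assumes nCount >= 1; never ask for more pages than exist
    var nEnd = Math.min(nCount, oDoc.numPages) - 1;
    var oResult = oDoc.extractPages({ nStart: 0, nEnd: nEnd, cPath: cPath });
    if (typeof app !== "undefined") { app.endPriv(); }
    return oResult;
}

if (typeof app !== "undefined") {
    // Wrapping the function lets it use cPath when run from a menu item
    var extractFirstPages = app.trustedFunction(extractFirstPagesImpl);
    app.addMenuItem({
        cName: "ExtractFirstPages",
        cUser: "Extract First Four Pages",
        cParent: "File",
        cExec: "extractFirstPages(event.target, 4, 'FirstForm.pdf');",
        // Enable the item only when a document is open
        cEnable: "event.rc = (event.target != null);"
    });
}
```

Keeping the privileged work inside one small function limits how much code runs with raised privilege.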

For more information on functions used in this article, see the Acrobat JavaScript Reference and the Acrobat JavaScript Guide.

https://www.adobe.com/devnet/acrobat.html

Click on the Documentation tab and scroll down to the JavaScript section.