PDF::API2 with DBI
Documentation by example
Webfluency was recently tasked with a project to take the stories from a content database and convert them to PDF files. Since the database held about 5000 stories, we looked for a programmatic solution to the problem. Once we stumbled upon the Perl module PDF::API2, it seemed that our prayers were answered. However, the complete lack of documentation for this incredibly useful module gave us an additional hurdle on the way to the working program we were seeking. With help from a partial example at http://www.printaform.com.au/clients/pdfapi2/ we were able to put together a solution which we feel may help others understand how to use PDF::API2.
This page breaks down the program we wrote step by step so you can better understand our use of PDF::API2 and how we integrated it with DBI to extract the content and lay it out in usable PDF files. The program can easily be modified to run as a CGI script on the fly, depending on how you would like to use it. Part of the script uses the text_block subroutine provided by Printaform.com under the LGPL licence. You can download the full code here. Please note that the code in this example and in the download has been modified from the original version to protect the original client's databases, but the functionality of the program is unchanged. If you find these instructions or the script useful, please drop us an e-mail of thanks so we know and can continue to provide functional examples of truly helpful and dynamic programming.
Printing to PDF
Printing to a PDF document is more akin to making pages in Quark than it is to traditional print structures in Perl. The idea is that you define a page, then you define content zones to fill, then you fill those zones. Graphics can be added, as well as boxes and lines, in the middle of the printing structure. Finally you commit the changes and print the page. If you need another page, you append it and start all over again. It is not a linear print structure and you can keep adding to it until you commit the changes.
In essence this is exactly how PDF::API2 works. You define a page, then define content zones for text and graphics, then define any drawn lines, and finally produce the page.
The layout of the example PDFs
For this example we will be creating a 3 column layout with 2 different styles. The first page will have an opening layout involving 2 graphics (top and bottom), the headline for the article, a deck, the author, and the beginning of the article in 3 columns. All subsequent pages will only have the bottom graphic, a page number, and a taller 3 column layout to fill the gap created by the missing content. [image of the example]
Let's get started
To get rolling we need to do our standard Perl declaration and pull in the modules for data collection and PDF creation. Since we will be pulling content from a Microsoft SQL database, we need to include the DBI module.
#!/usr/bin/perl
use PDF::API2;
use DBI;
Getting Content from the database
This is a standard use of the DBI module to read content from a database. In this example all of the stories have a headline, deck, byline, full article, and a unique id number. We will use the unique number as part of the file name, and we will filter the stories so we only get the articles that are reviews. First things first, let's open a connection to the local Microsoft SQL database.
## get data
#open Database connection
my $dbh = DBI->connect("DBI:ADO:Provider=SQLOLEDB;Data Source=local;Initial Catalog=cmsi");
Then we create the SQL query to extract the data and execute it.
# prepare SQL
my $sth = $dbh->prepare(<<SQL);
select CMSIheadline, CMSIdeck, CMSIbyline, CMSIfullarticle, CMSIarticleid
from CMSINewsArticles
where CMSIArticleType LIKE '%Reviews%'
SQL
# get data
$sth->execute;
Each call to fetchrow_array returns the next article as an array of column values, so we simply loop over the result set one row at a time and make one PDF file per row.
# get data
while (my @row = $sth->fetchrow_array()) {
For each article row, we pull the columns out into separate variables. There are probably more efficient ways of doing this, and we could keep the data in the array, but this works for me and makes it easier to show you how the data is used in the script. Also, you don't need to pull data from a database to make this work. You can just hard code the variables and the layout part will work just the same.
# Pull items out into individual variables
$headline_text = $row[0];
$deck_text = $row[1];
$author = $row[2];
$paragraph = $row[3];
$article_id = $row[4];
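As one of the "more efficient ways" hinted at above, the five assignments can be collapsed into a single list assignment. This is only a sketch: the @row contents below are made-up sample values standing in for a row returned by fetchrow_array, not data from the real database.

```perl
use strict;
use warnings;

# Stand-in for one row from $sth->fetchrow_array (sample values, an assumption)
my @row = ('Sample headline', 'Sample deck', 'A. Writer', '<p>Body text</p>', 42);

# A list assignment unpacks the row in the same order as the SELECT columns
my ($headline_text, $deck_text, $author, $paragraph, $article_id) = @row;

print "$article_id: $headline_text by $author\n";
```

The variable order has to match the column order in the SQL query, so the two should be edited together.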
Since the full article was copied and pasted from a bunch of MS Word docs into the database, you need to massage the article to remove some unwanted coding which would affect the layout when we put it in the PDF file. The following removes all unwanted code and replaces each closing </p> with two UNIX returns for proper display.
# massage text for insertion into the PDF
$paragraph =~ s/<\/p>/[break]/ig; # swap in a paragraph break placeholder
$paragraph =~ s/&nbsp;/ /ig; # swap HTML space for space
$paragraph =~ s/\r//ig; # remove MS returns (Known as ^M in emacs/vi)
$paragraph =~ s/\n//ig; # remove unix returns
$paragraph =~ s/\s+/ /ig; # remove extra whitespace
$paragraph =~ s/<[^>]*>//ig; # remove all tags
$paragraph =~ s/\[break\]/\n\n/ig; # swap the break holder for two unix returns
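To see what the massaging does, here is a small sketch that wraps the same substitutions in a subroutine and runs them on a made-up snippet. The name massage_article and the sample HTML are ours for illustration, and we assume the "HTML space" is the &nbsp; entity and the "MS return" is the \r character.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the clean-up steps described above
sub massage_article {
    my ($html) = @_;
    $html =~ s/<\/p>/[break]/ig;     # mark paragraph ends with a placeholder
    $html =~ s/&nbsp;/ /ig;          # assumed entity: non-breaking space
    $html =~ s/\r//g;                # assumed MS return: carriage return
    $html =~ s/\n//g;                # remove unix returns
    $html =~ s/\s+/ /g;              # collapse runs of whitespace
    $html =~ s/<[^>]*>//g;           # strip remaining tags
    $html =~ s/\[break\]/\n\n/g;     # placeholder becomes a blank line
    return $html;
}

my $sample  = "<p>First&nbsp;paragraph.</p>\r\n<p>Second   <b>bold</b> one.</p>";
my $cleaned = massage_article($sample);
print $cleaned;
```

The placeholder has to be swapped in before the tags are stripped and swapped out after the whitespace is collapsed, otherwise the paragraph breaks would be lost along with everything else.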
Since we are using "thin" columns in the layout, any word that is wider than the defined space will cause an infinite loop in the text_block subroutine. To fix this, we force spaces into long words like URLs and e-mail addresses so we can be sure they will not send the program into a loop. You can do this any way you see fit; the following are just examples.
# insert spaces
$paragraph =~ s/(.*)@(.*)/$1 @ $2/ig; # space around the @ symbol in an e-mail address
$paragraph =~ s/(.*)\/(.*)\//$1 \/ $2 \//ig; # forcing spaces around multiple / items
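Another way to do this, sketched here under our own assumptions, is to break up any run of non-space characters that is longer than a fixed number of characters, whatever it contains. The subroutine name break_long_words and the limit of 10 are hypothetical. Note that it counts characters rather than the rendered width, so the limit needs to be chosen conservatively for the column and font size in use.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a space after every $max_chars characters
# inside any unbroken run of non-whitespace longer than $max_chars
sub break_long_words {
    my ($text, $max_chars) = @_;
    $text =~ s/(\S{$max_chars})(?=\S)/$1 /g;
    return $text;
}

my $broken = break_long_words('see http://www.example.com/path now', 10);
print "$broken\n";
```

Short words are left alone, and existing spaces and returns are never touched because \S only matches non-whitespace.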
Next we will use the unique id number to set the name of the PDF file we will create.
# set article name
my $article_name = "$article_id.pdf";
Making the PDF with PDF::API2
Let's get down to the nitty gritty of PDF::API2. So far we have collected the content from the database and set the naming convention for the file. We just need to set a few more constants to make the program really user friendly. Since PDF::API2 uses points (1/72 of an inch) to define measurements, we set up constants so we can write positions in normal measurements like inches and millimeters. We will also set a counter for the page numbers and for some later customizations. Since page 1 will have a different layout than the subsequent pages, this counter will control which positions are set for the layout.
# defining constants
use constant mm => 25.4/72;
use constant in => 1/72;
use constant pt => 1;
my $page_number = 1;
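The constants read a little backwards at first: each one is the size of a point expressed in that unit, so dividing a measurement by the constant gives points. A short worked example, using the same three constants (the variable names are ours):

```perl
use strict;
use warnings;

use constant mm => 25.4/72;   # one point is 25.4/72 millimeters
use constant in => 1/72;      # one point is 1/72 of an inch
use constant pt => 1;         # points pass through unchanged

my $page_width  = 8.5/in;     # US Letter width in points
my $page_height = 11/in;      # US Letter height in points
my $left_margin = 25/mm;      # the 25 mm margin used throughout the layout

printf "Letter page: %.0f x %.0f points\n", $page_width, $page_height;   # 612 x 792
printf "25 mm margin: %.2f points\n", $left_margin;                      # 70.87
```

Writing 25/mm instead of a bare 70.87 keeps the layout numbers readable and makes it easy to mix units in one expression, as in 190/mm - 7/pt later on.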
Now we will take the name we defined before and create a new empty PDF file.
# Open PDF file for Writing my $pdf = PDF::API2->new(-file => $article_name);
Next we check to see if there is content for the PDF file. Since this is the beginning of the file and the first time we have run through the loop, the answer is of course yes. However, when we get done with this page, there may or may not be more content to be placed. If there is we will append a new page with the subsequent page layout.
# add a page if there is content for it
while ($paragraph ne "") {
Since we now have content to put on the page, let's make the page and set the size. The mediabox element sets the width and then the height of the new page. If this is a sub page, the new page will be appended to the currently open document.
#create a new page
my $page = $pdf->page;
# define size of page
$page->mediabox(8.5/in, 11/in);
Next we will define the fonts to be used on the page. Most standard fonts can be used as long as they are installed on the system where the script is run. You are still responsible for the legality of any fonts used in the PDF creation so sticking with standard open fonts is probably best.
# defining fonts
my $verdana = $pdf->corefont('Verdana',-encoding => 'latin1');
my $verdanabold = $pdf->corefont('Verdana-Bold',-encoding => 'latin1');
Placing an image
Now that we have the formalities out of the way, let's start defining the page. First we will place the top and bottom graphics. But remember that we will only place the top graphic if it is page one.
# placing images
if ($page_number == 1) { # Top Graphic
my $photo = $page->gfx;
$photo_file=$pdf->image_jpeg('imagesp.jpg');
$photo->image($photo_file, 25/mm, 9.25/in, 50/pt, 69/pt);
}
That's all well and good, but let's break down the placement line by line for the footer graphic so you can understand it. First you set a variable to define that you will be placing a graphic.
my $photo_eweek_footer = $page->gfx;
Next you define a variable containing what type of graphic you will be using and where to find the graphic file. image_jpeg is what we will use for this example, but it is just one of the image types you can use. Other options include image_gif, image_tiff, image_png, image_pnm, and image_gd. image_gd is probably the most powerful of these calls because you can send dynamic graphics to a dynamic PDF, but we will not be covering that here. Each call takes one option, which is the location of the graphic file. If you use image_gif and try to load a JPG file, it will not work because the encoding of the file is different from what the call expects. However, in our experience the program will not report an error. It will just ignore the call and leave a blank space where you had intended the graphic to be.
$photo_file=$pdf->image_jpeg('images/bot.jpg');
Finally you use the variable which holds the image file information to feed the graphic placement. image takes five options for the placement of the image: the file variable you defined in the previous line, the x position of the lower left corner of the image measured from the left side of the document, the y position of that corner measured from the bottom of the document, the width of the image, and the height of the image.
One thing to note here is that with GIF and JPG files the points don't always match the pixel dimensions perfectly. If you have an image that is 10x10, you may need to fiddle with the width and height of the placement on the page to get it just right.
$photo_eweek_footer->image($photo_file, 183/mm, 8/mm, 27/pt, 34/pt);
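If you are used to measuring from the top of the page, the bottom-up y value takes some getting used to. The following sketch converts a "distance from the top" into the y value to pass along, assuming the placement point is the lower left corner of the image. The helper name y_from_top and the sample numbers are ours, not part of PDF::API2.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a top-down offset into a bottom-up y position.
# All values are in points.
sub y_from_top {
    my ($page_height, $top_offset, $image_height) = @_;
    return $page_height - $top_offset - $image_height;
}

my $page_height = 792;   # 11 inches at 72 points per inch
my $y = y_from_top($page_height, 72, 34);   # image 34 points tall, 1 inch below the top edge
print "place the image at y = $y\n";
```

The image height has to be subtracted as well, because the placement point is the bottom edge of the image rather than the top edge.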
Finally we set up the object to create a line so we can use it later.
# Define Vertical Lines for later drawing
my $black_line = $page->gfx;
Placing text
Let's start by laying out the headline, deck, and author information. This only applies to page number one, so first we do a quick check to make sure this is page one.
# page text
#
# Do the headlines if this is page one.
if ($page_number == 1) {
In the headline we would like to fit as many words as possible. Since the field for the headline has a fixed width, we need to reduce the size of the font for longer headlines. First we do a character count. Depending on how many characters there are, we set the font size accordingly. The maximum number of characters for the headline is 47, so the last thing we do is clip any characters after 47.
# Headline
$_ = $headline_text;
$head_count = tr/a-zA-Z0-9_\+\-\&\'\"\%\$\@ //;
SWITCH: {
if ($head_count <= 27) {$font_size = 26; last SWITCH;}
if (($head_count > 27) and ($head_count <= 29)) {$font_size = 24; last SWITCH}
if (($head_count > 29) and ($head_count <= 35)) {$font_size = 20; last SWITCH}
if (($head_count > 35) and ($head_count <= 39)) {$font_size = 18; last SWITCH}
$font_size = 16;
$headline_text = substr($headline_text,0,47);
}
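The same size ladder can be written as a small subroutine, which makes the thresholds easier to read and test. This is a sketch rather than the code we ran: the names headline_font_size and fit_headline are made up, and it counts with length instead of the tr character list used above, so headlines containing characters outside that list would be counted slightly differently.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a font size from the headline length
sub headline_font_size {
    my ($char_count) = @_;
    return 26 if $char_count <= 27;
    return 24 if $char_count <= 29;
    return 20 if $char_count <= 35;
    return 18 if $char_count <= 39;
    return 16;
}

# Hypothetical helper: return the (possibly clipped) headline and its size
sub fit_headline {
    my ($headline) = @_;
    my $size = headline_font_size(length $headline);
    # only the smallest size can overflow, so only then clip to 47 characters
    $headline = substr($headline, 0, 47) if $size == 16;
    return ($headline, $size);
}

my ($text, $size) = fit_headline('A short headline');
print "$size point: $text\n";
```

Because each test returns as soon as it matches, the lower bound of every range is implied by the test before it and does not need to be spelled out.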
With all the elements set, it is now a 5 part process to create and place the text field. First, you create the text object. Second, you tell the object which font and what font size to use. Third, you set the color of the text you will be placing. This can be given either by common name or by hex code. Fourth, you tell the object where to place the cursor to start the text. Like all other objects, this is measured as x from the left and y from the bottom of the page. Finally, you tell the object which text to display.
my $txt = $page->text;
$txt->font($verdanabold, $font_size);
$txt->fillcolor('black');
$txt->translate(25/mm, 9.25/in);
$txt->text($headline_text);
We could place the deck the same way as we did the headline, but a more elegant solution is possible. Ideally we would like the deck to be one line if it fits and two lines if it doesn't. Seeing how decks are usually longer, this seems to be a better solution. Also, we don't want to clip the line in the middle of a word, which could happen in the headlines. The text_block subroutine can create this effect. The version of the code shown here takes the simpler route: it prints a "REVIEW:" label, clips the deck to 61 characters, and places it on one line.
# Deck First
my $txt = $page->text;
$txt->font($verdanabold, 13);
$txt->fillcolor('#9E3025');
$txt->translate(25/mm, 225/mm);
$txt->text("REVIEW:");
# Deck Main
$deck_text = substr($deck_text,0,61);
my $txt = $page->text;
$txt->font($verdana, 13);
$txt->fillcolor('black');
$txt->translate(50/mm, 225/mm);
$txt->text($deck_text);
Since the author line is small enough, we will just place this with a standard five part text object declaration. After this, we close our check for page one and start on the main body of the text.
# Author
my $txt = $page->text;
$txt->font($verdanabold, 10);
$txt->fillcolor('black');
$txt->translate(25/mm, 216/mm);
$txt->text($author);
}
Main body text with text_block: Column One
The main portion of the story will use the text_block subroutine to define the columns. In between the columns we will draw lines to separate them if they are needed. Also, each column will be a little different depending on whether this is page one or a sub page. The main difference is that sub pages have longer columns that start higher on the page because there is no headline, deck, or author to worry about.
To start the whole process, we set up the page objects to store text, define the font, and set the font size.
# Body Text
#placing text
my $left_column_text = $page->text;
$left_column_text->font($verdana, 9/pt);
$left_column_text->fillcolor( 'black' );
Now for column one we use text_block. First we check to see if this is page one or not so we know which type of columns to use. Then we call text_block, feeding it a bunch of parameters. $endw, $ypos, and $paragraph are holding variables for the results of the subroutine. The only one we are really concerned with is $paragraph because when the subroutine is done, it will contain the remaining text to be placed.
After the initial call there is a series of variables we pass to define certain things about the text block. First is the text object to define the text zone. Second is the variable containing the article text we extracted from the database and massaged. Third, we define the x placement of the text field off of the left hand side of the page. Fourth, we define the y placement of the text field off the bottom of the page. Next comes the width of the text block we want to define. That is followed by the height of the text field going downward from the x/y position. The lead is the leading, that is, the vertical distance from one line of text to the next. It should be at least the size of the font you defined before; here we use 10 points for 9 point text. Parspace is the amount of extra space you want to place between paragraphs. Finally comes the text alignment, which can be "left", "right", "center", or "justify".
After we do all that, it will set the first column, place enough text to fill the defined space without chopping off a word in the middle, and return the remaining text into the $paragraph variable.
# Column One
if ($page_number == 1) {
($endw, $ypos, $paragraph) = text_block(
$left_column_text,
$paragraph,
-x => 25/mm,
-y => 208/mm,
-w => 50/mm,
-h => 190/mm - 7/pt,
-lead => 10/pt,
-parspace => 0/pt,
-align => 'left',
);
Now let's say that this is not page one. We basically do the same thing except we change the y and height definitions to create a column that starts higher and goes longer.
} else {
# sub pages have columns 35mm bigger and no headline
($endw, $ypos, $paragraph) = text_block(
$left_column_text,
$paragraph,
-x => 25/mm,
-y => 243/mm,
-w => 50/mm,
-h => 225/mm - 7/pt,
-lead => 10/pt,
-parspace => 0/pt,
-align => 'left',
);
}
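The -h and -lead values passed to text_block above decide how many lines of text fit in a column. Reading the loop condition in text_block, with -parspace at 0 a column of height h holds the whole number of leads that fit into h. The sketch below works that out for both column types; the subroutine name lines_in_column is ours.

```perl
use strict;
use warnings;

use constant mm => 25.4/72;
use constant pt => 1;

# Hypothetical helper: lines that fit when each line advances by $lead points.
# Assumes no extra paragraph spacing (-parspace of 0), as in this layout.
sub lines_in_column {
    my ($height, $lead) = @_;
    return 0 if $height < $lead;
    return int($height / $lead);
}

my $page_one = lines_in_column(190/mm - 7/pt, 10/pt);
my $sub_page = lines_in_column(225/mm - 7/pt, 10/pt);
print "page one: $page_one lines per column, sub pages: $sub_page lines per column\n";
```

With these settings a page one column holds 53 lines and a sub page column holds 63, so the extra 35 mm buys ten more lines per column.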
Column Two
Columns two and three are defined the same way; however, we do a little extra checking and add some more design items. First we check to see if we even need to make column two. Since $paragraph now holds the remaining text of the article, if it is empty then we have printed the entire article and do not need to make column two. If it has a value, we go forward and make the column.
# Column Two
if ($paragraph ne "") {
Then we do our standard page one check to see if we are making short columns under the headline, deck and author or long ones.
if ($page_number == 1) {
Since this is column 2, we need to draw a line between column one and column two. Earlier in the program we created the line object and now we will use it. The best way to think of this is like a pen over the page. We define what color the pen is, move the pen to a spot over the page, tell the pen where it will go on the page, then put the pen down on the page and draw the line. strokecolor sets the pen color. move tells the pen at what x/y point to start the line, defined from the left and bottom of the page respectively. line tells the pen to draw a 1px line to the x/y point for the end of the line. Finally, stroke puts the pen to the page and draws the line.
# Vertical Line
$black_line->strokecolor('black');
$black_line->move(81/mm, 211/mm);
$black_line->line(81/mm, 24/mm);
$black_line->stroke;
Ideally the line tool has a definition to set the width of the line being drawn. In this example we really want a 3px wide line but in the testing we were doing I could not find a way to define this which worked consistently. Instead, we just repeat the line tool two more times pushing the pen over slightly to get the 3px wide line we are looking for.
$black_line->move(81.15/mm, 211/mm);
$black_line->line(81.15/mm, 24/mm);
$black_line->stroke;
$black_line->move(81.25/mm, 211/mm);
$black_line->line(81.25/mm, 24/mm);
$black_line->stroke;
Next we do our standard text_block declaration for column two, with the x position changed from the column one settings so this zone is in the middle of the page.
($endw, $ypos, $paragraph) = text_block(
$left_column_text,
$paragraph,
-x => 88/mm,
-y => 208/mm,
-w => 50/mm,
-h => 190/mm - 7/pt,
-lead => 10/pt,
-parspace => 0/pt,
-align => 'left',
);
That handles column two for page one. If this is a sub page, we do the same thing as we did for page one just with longer line and column definitions. First we draw the line between column one and two.
} else {
# sub pages have columns 35mm bigger and no headline
# Vertical Line
$black_line->strokecolor('black');
$black_line->move(81/mm, 246/mm);
$black_line->line(81/mm, 24/mm);
$black_line->stroke;
$black_line->move(81.15/mm, 246/mm);
$black_line->line(81.15/mm, 24/mm);
$black_line->stroke;
$black_line->move(81.25/mm, 246/mm);
$black_line->line(81.25/mm, 24/mm);
$black_line->stroke;
Then we define and place the text for column two.
($endw, $ypos, $paragraph) = text_block(
$left_column_text,
$paragraph,
-x => 88/mm,
-y => 243/mm,
-w => 50/mm,
-h => 225/mm - 7/pt,
-lead => 10/pt,
-parspace => 0/pt,
-align => 'left',
);
}
}
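The three-stroke divider is written out in full for each column and each page type. One way to tidy that up, sketched here, is a small subroutine that takes the pen object and the positions in millimeters. The name draw_rule is ours, and so is the RecordingPen package, which is only a stand-in that records calls so the sketch runs without PDF::API2. In the real script you would pass $black_line instead.

```perl
use strict;
use warnings;

use constant mm => 25.4/72;

# Stand-in pen (an assumption for this sketch): records calls instead of drawing
{
    package RecordingPen;
    sub new         { my $class = shift; return bless { calls => [] }, $class }
    sub _record     { my ($self, @call) = @_; push @{ $self->{calls} }, [@call]; return $self }
    sub strokecolor { my $self = shift; return $self->_record('strokecolor', @_) }
    sub move        { my $self = shift; return $self->_record('move', @_) }
    sub line        { my $self = shift; return $self->_record('line', @_) }
    sub stroke      { my $self = shift; return $self->_record('stroke') }
}

# Hypothetical helper: draw one thick vertical rule as three strokes side by side
sub draw_rule {
    my ($pen, $x_mm, $top_mm, $bottom_mm) = @_;
    $pen->strokecolor('black');
    for my $offset (0, 0.15, 0.25) {          # same offsets as the hand-written version
        $pen->move(($x_mm + $offset)/mm, $top_mm/mm);
        $pen->line(($x_mm + $offset)/mm, $bottom_mm/mm);
        $pen->stroke;
    }
    return;
}

my $pen = RecordingPen->new;
draw_rule($pen, 81, 211, 24);                 # divider between columns one and two on page one
my @strokes = grep { $_->[0] eq 'stroke' } @{ $pen->{calls} };
printf "%d strokes recorded\n", scalar @strokes;
```

With a helper like this, the sub page divider becomes a single call with 246 in place of 211, and the column three divider only changes the x value to 144.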
Column three
Column three is identical to column two except it is placed further along the page. First we check the $paragraph variable to see if we even need to make the third column.
# Column Three
if ($paragraph ne "") {
Next we check to see if this is page one.
if ($page_number == 1) {
Then we place the shorter page one line between column two and three.
# Vertical Line
$black_line->strokecolor('black');
$black_line->move(144/mm, 211/mm);
$black_line->line(144/mm, 24/mm);
$black_line->stroke;
$black_line->move(144.15/mm, 211/mm);
$black_line->line(144.15/mm, 24/mm);
$black_line->stroke;
$black_line->move(144.25/mm, 211/mm);
$black_line->line(144.25/mm, 24/mm);
$black_line->stroke;
Then we place the remaining text into column three adjusting the x placement to place the column off to the right of the page.
($endw1, $ypos1, $paragraph) = text_block(
$left_column_text,
$paragraph,
-x => 149/mm,
-y => 208/mm,
-w => 50/mm,
-h => 190/mm - 7/pt,
-lead => 10/pt,
-parspace => 0/pt,
-align => 'left',
);
If this is a sub page, we define the line and column with the higher start and the longer length.
} else {
# sub pages have columns 35mm bigger and no headline
# Vertical Line
$black_line->strokecolor('black');
$black_line->move(144/mm, 246/mm);
$black_line->line(144/mm, 24/mm);
$black_line->stroke;
$black_line->move(144.15/mm, 246/mm);
$black_line->line(144.15/mm, 24/mm);
$black_line->stroke;
$black_line->move(144.25/mm, 246/mm);
$black_line->line(144.25/mm, 24/mm);
$black_line->stroke;
($endw1, $ypos1, $paragraph) = text_block(
$left_column_text,
$paragraph,
-x => 149/mm,
-y => 243/mm,
-w => 50/mm,
-h => 225/mm - 7/pt,
-lead => 10/pt,
-parspace => 0/pt,
-align => 'left',
);
}
}
Final Actions
That does it for the placement of the text. If there is still a remainder in the $paragraph variable, when the loop is checked it will go through this whole process again appending a new sub page to the open PDF. Before we get to the check, we need to print on this page what the page number is then increment it. Printing is done with the standard text field declaration. First we define the text for the page number then we run through the five step process to place it.
# Page Number Display
$page_num = "Page $page_number";
my $txt = $page->text;
$txt->font($verdana, 8);
$txt->fillcolor('black');
$txt->translate(25/mm, 10/mm);
$txt->text($page_num);
Before we move onto the next page, we increment the page counter for the next page, then go back to the top of the while ($paragraph ne "") loop.
$page_number++;
}
Once we have completed all pages, we close the PDF and produce the page. Then we move on to the next story and repeat the process again until all stories extracted from the database have had PDFs created for them. For 2000 articles, this process will roughly take 20-30 minutes.
#close and save PDF file
$pdf->save;
$pdf->end( );
}
exit();
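With all the layout detail stripped away, the control flow of the page loop is quite small. The sketch below shows just that flow. fill_column is a made-up stand-in for text_block that "places" a fixed number of words and hands back the rest, and the word counts of 40 and 50 per column are arbitrary.

```perl
use strict;
use warnings;

# Stand-in for text_block (an assumption): place up to $capacity words,
# return the placed text and the remaining text
sub fill_column {
    my ($text, $capacity) = @_;
    my @words  = split ' ', $text;
    my @placed = splice(@words, 0, $capacity);
    return (join(' ', @placed), join(' ', @words));
}

my $paragraph   = join ' ', map { "word$_" } 1 .. 200;   # sample article text
my $page_number = 1;
my @pages;

while ($paragraph ne "") {
    # page one columns are shorter because of the headline, deck, and author
    my $capacity = $page_number == 1 ? 40 : 50;
    my @columns;
    for my $column (1 .. 3) {
        last if $paragraph eq "";             # nothing left, skip the remaining columns
        (my $placed, $paragraph) = fill_column($paragraph, $capacity);
        push @columns, $placed;
    }
    push @pages, \@columns;
    $page_number++;
}

printf "%d pages, %d columns on the last page\n", scalar @pages, scalar @{ $pages[-1] };
```

The important detail is that every column call overwrites $paragraph with whatever did not fit, so the same variable drives both the "do we need another column" checks and the "do we need another page" loop.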
This is the text_block subroutine which we use in the program to lay out the multiple text blocks. It can be reused under the LGPL.
############################## Sub Programs ##############################
sub text_block {
my $text_object = shift;
my $text = shift;
my %arg = @_;
# Get the text in paragraphs
my @paragraphs = split(/\n/, $text);
# calculate width of all words
my $space_width = $text_object->advancewidth(' ');
my @words = split(/\s+/, $text);
foreach (@words) {
next if exists $width{$_};
$width{$_} = $text_object->advancewidth($_);
}
$ypos = $arg{'-y'};
my @paragraph = split(/ /, shift(@paragraphs));
$first_line = 1;
$first_paragraph = 1;
# while we can add another line
while ( $ypos >= $arg{'-y'} - $arg{'-h'} + $arg{'-lead'} ) {
unless (@paragraph) {
last unless scalar @paragraphs;
@paragraph = split(/ /, shift(@paragraphs));
$ypos -= $arg{'-parspace'} if $arg{'-parspace'};
last unless $ypos >= $arg{'-y'} - $arg{'-h'};
$first_line = 1;
$first_paragraph = 0;
}
$xpos = $arg{'-x'};
# while there's room on the line, add another word
my @line = ();
my $line_width =0;
if ($first_line && exists $arg{'-hang'}) {
my $hang_width = $text_object->advancewidth($arg{'-hang'});
$text_object->translate( $xpos, $ypos );
$text_object->text( $arg{'-hang'} );
$xpos += $hang_width;
$line_width += $hang_width;
$arg{'-indent'} += $hang_width if $first_paragraph;
}
elsif ($first_line && exists $arg{'-flindent'}) {
$xpos += $arg{'-flindent'};
$line_width += $arg{'-flindent'};
}
elsif ($first_paragraph && exists $arg{'-fpindent'}) {
$xpos += $arg{'-fpindent'};
$line_width += $arg{'-fpindent'};
}
elsif (exists $arg{'-indent'}) {
$xpos += $arg{'-indent'};
$line_width += $arg{'-indent'};
}
while ( @paragraph and $line_width + (scalar(@line) * $space_width) + $width{$paragraph[0]} < $arg{'-w'} ) {
$line_width += $width{ $paragraph[0] };
push(@line, shift(@paragraph));
}
# calculate the space width
if ($arg{'-align'} eq 'fulljustify' or ($arg{'-align'} eq 'justify' and @paragraph)) {
if (scalar(@line) == 1) {
@line = split(//,$line[0]);
}
$wordspace = ($arg{'-w'} - $line_width) / (scalar(@line) - 1);
$align='justify';
} else {
$align=($arg{'-align'} eq 'justify') ? 'left' : $arg{'-align'};
$wordspace = $space_width;
}
$line_width += $wordspace * (scalar(@line) - 1);
if ($align eq 'justify') {
foreach my $word (@line) {
$text_object->translate( $xpos, $ypos );
$text_object->text( $word );
$xpos += ($width{$word} + $wordspace) if (@line);
}
$endw = $arg{'-w'};
} else {
# calculate the left hand position of the line
if ($align eq 'right') {
$xpos += $arg{'-w'} - $line_width;
} elsif ($align eq 'center') {
$xpos += ($arg{'-w'}/2) - ($line_width / 2);
}
# render the line
$text_object->translate( $xpos, $ypos );
$endw = $text_object->text( join(' ', @line) );
}
$ypos -= $arg{'-lead'};
$first_line = 0;
}
unshift(@paragraphs, join(' ',@paragraph)) if scalar(@paragraph);
return ($endw, $ypos, join("\n", @paragraphs))
}
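The heart of text_block is a greedy line fill: keep adding words to the current line while they fit, then start a new line. Here is a stripped-down sketch of that idea, with two simplifications that are ours and not part of the original subroutine. It measures width in characters instead of calling advancewidth, and it always puts at least one word on a line, so a word that is wider than the column is placed on its own line instead of stalling the loop.

```perl
use strict;
use warnings;

# Hypothetical helper: greedy word wrap measured in characters
sub wrap_words {
    my ($text, $max_width) = @_;
    my @words = split ' ', $text;
    my @lines;
    while (@words) {
        # always take one word, even if it is wider than the column
        my @line  = (shift @words);
        my $width = length $line[0];
        # keep adding words while the next one (plus a space) still fits
        while (@words and $width + 1 + length($words[0]) <= $max_width) {
            $width += 1 + length($words[0]);
            push @line, shift @words;
        }
        push @lines, join(' ', @line);
    }
    return @lines;
}

my @lines = wrap_words('the quick brown fox jumps', 10);
print "$_\n" for @lines;
```

An oversize word will still overflow the column on the page, so forcing spaces into long words beforehand remains worthwhile.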
Well, that's it. If you have any questions or would like to say thanks for the help, drop us a line at contact @ webfluency.com.
