Generating Thousands of PDFs on EC2 with Ruby

Dec 23, 2009 by Sean Cribbs

Note: this is a repost of an entry I wrote on the RailsDog blog.

The Problem

For about two months, we’ve been working on a static website that exposes the results of a complicated economics model to non-economists. We decided to make the site static because of the overhead involved in computing the results and the proprietary nature of the model. We would simply pre-generate the output for all valid permutations of the inputs. The visitor could then select her inputs from a questionnaire, click a button and immediately be shown the results.

The caveat of this decision is that in addition to the numerical outputs, three graphs and a summary (both in HTML and PDF) would need to be generated for each permutation. Since there were 3600 permutations, this would amount to 18000 files in total. Initial local runs of our generation process took about 30 seconds for each permutation, mostly due to embedding the graph images into the PDF. On a single machine, that would take 30 hours of uninterrupted processing! Clearly, this was a job for “the cloud”.
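The arithmetic behind those numbers fits in a few lines of Ruby. This is only a back-of-the-envelope sketch: the constant and method names are made up for illustration, and the multi-machine estimate assumes the work splits evenly with no overhead, which is an idealization.

```ruby
# Numbers taken from the paragraph above.
PERMUTATIONS = 3600
FILES_PER_PERMUTATION = 3 + 2   # three graphs, plus the HTML and PDF summaries
SECONDS_PER_PERMUTATION = 30    # rough timing from the initial local runs

total_files = PERMUTATIONS * FILES_PER_PERMUTATION

# Wall-clock hours if the work is split evenly across `workers` identical
# machines (hypothetical helper; ignores startup and queueing overhead).
def wall_clock_hours(permutations, seconds_each, workers = 1)
  (permutations * seconds_each) / 3600.0 / workers
end

puts "#{total_files} files in total"
puts "#{wall_clock_hours(PERMUTATIONS, SECONDS_PER_PERMUTATION)} hours on a single machine"
```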

The Tools

Before we get into a discussion of the process of configuring and running the jobs, here’s an overview of the tools we used to tackle the problem.

We initially considered using Amazon’s Elastic MapReduce to run the generation jobs, but it requires Java and Hadoop, and we had already invested a lot of time in our Ruby tool chain. It is also nigh impossible to automatically install Ruby and ImageMagick on an EMR node. Thus, we decided to use vanilla EC2 with the tools shown below.

Prawn

Prawn is the new kid in town for generating PDF in Ruby. Prawn is pretty well-written and easy to start using, and greatly improves on PDF::Writer.

Gruff

Gruff was not the most obvious choice for this project. We liked the flexibility and hackability of Scruffy, but translating its output to PDF was a nightmare and there were some strange inconsistencies in it. In the end, Gruff proved fast, reliable, and simple. The major caveat, as described above, is that embedding images in Prawn is orders of magnitude slower than simply drawing on the canvas.

Haml, Sass, Compass

Haml has been around for 3 years now. Many people cringe at the indentation-sensitive syntax, but it prevents so much frustration that it was a good fit for the project. Naturally, we also used its cousin Sass, and the new-ish CSS/Sass meta-framework Compass. The combination of these three made it really quick to get started with the static site and make design changes as we iterated.

Chef

You may have already heard of the awesome configuration management tool, Chef. Chef allows you to ensure consistent configuration of your servers using a nice Ruby DSL and a huge library of community-developed “cookbooks” that covers many common use-cases. We were given the chance to try out an alpha of their “Chef Platform”, which is essentially a scalable, hosted, multi-tenant version of the server component of Chef and uses the pre-release version of Chef 0.8. With that, “knife”–the new CLI tool for interacting with the Chef server API–and the custom Opscode AMI, we were well-equipped to quickly deploy a bunch of EC2 nodes. We’ll talk more about the details of the Chef recipes below.

AMQP and RabbitMQ

What’s the best way to distribute a bunch of one-time jobs to a slew of independent machines? A message queue, of course! Despite the version packaged with Ubuntu 9.04 being pretty old, we chose RabbitMQ, having used it on another project. AMQP is also well supported in Ruby.

The Process

Preparing

The first step to start our processing job was to get the data up to S3. You could do this any number of ways, but we created a bucket solely for the data and uploaded all 3600 CSV files with a desktop client.

Next, we created the scripts for the workers and the job initiator. We would potentially need to run the process multiple times, so we chose Aman Gupta’s EventMachine-based AMQP client.

Here’s the worker script, which was set up as a daemon using runit:

#!/usr/bin/env ruby

$: << File.expand_path(File.join(File.dirname(__FILE__),'..','lib'))
require 'rubygems'
require 'eventmachine'
require 'mq'
require 'custom_libraries'

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM'){ AMQP.stop{ EM.stop } }

AMQP.start(:host => ARGV.shift) do
  MQ.prefetch(1)
  MQ.queue('jobs').bind(MQ.direct('jobs')).subscribe do |header, body|
    GenerationJob.new(body).generate
  end
end

Basically, it connects to the RabbitMQ host specified on the command line, subscribes to the job queue, and starts processing messages.

The job initiation script is almost as simple:

#!/usr/bin/env ruby

$: << File.expand_path(File.join(File.dirname(__FILE__),'..','lib'))
require 'rubygems'
require 'eventmachine'
require 'mq'

AWSID = (ENV['AMAZON_ACCESS_KEY_ID'] || 'XXXXXXXXXXXXXXXXXXXX')
AWSKEY = (ENV['AMAZON_SECRET_ACCESS_KEY'] || 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

Signal.trap('INT') { AMQP.stop{ EM.stop } }
Signal.trap('TERM'){ AMQP.stop{ EM.stop } }

host = ARGV.shift
input_bucket = "custom-data"
output_bucket = "custom-output"
output_prefix = Time.now.strftime("/%Y%m%d%H%M%S")
count = 0

AMQP.start(:host => host) do
  exchange = MQ.direct('jobs')

  STDIN.each_line do |file|
    count += 1
    $stdout.print "."; $stdout.flush
    payload = {
      :input => [input_bucket, file.strip],
      :output => [output_bucket, output_prefix],
      :s3id => AWSID,
      :s3key => AWSKEY
    }
    exchange.publish(Marshal.dump(payload))
  end
  AMQP.stop { EM.stop }
end
puts "#{count} data enqueued for generation."

It reads from STDIN the names of the files to enqueue, each of which is stored in the S3 input bucket. Before running the job, we created a text file that listed each of the 3600 files, one per line, which could then be piped to this script on the command line. For each file, the script publishes a message carrying all the information a worker needs to find the data and to know where to put the output when it finishes. We scoped the output by the time the job was enqueued, making it easier to discern older runs from newer ones.
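On the worker side, the message body has to be turned back into that hash before any work can start. The sketch below shows roughly what that decoding might look like. `SketchJob`, its `output_key` naming scheme, and the sample file name are all hypothetical; they stand in for the real `GenerationJob`, which isn't shown in this post.

```ruby
# Sketch only: a stand-in for the worker's job class, not the real thing.
class SketchJob
  attr_reader :input_bucket, :input_key, :output_bucket, :output_prefix

  def initialize(body)
    # Marshal.load is only reasonable here because both ends of the
    # queue are under our control.
    payload = Marshal.load(body)
    @input_bucket, @input_key = payload[:input]
    @output_bucket, @output_prefix = payload[:output]
  end

  # Hypothetical naming scheme: <prefix>/<input name without extension>.<ext>
  def output_key(ext)
    File.join(@output_prefix, File.basename(@input_key, ".*") + ".#{ext}")
  end
end

# Build a message the same way the initiator script does, with made-up values.
payload = {
  :input  => ["custom-data", "scenario-0001.csv"],
  :output => ["custom-output", "/20091223120000"],
  :s3id   => "XXXX",
  :s3key  => "XXXX"
}
job = SketchJob.new(Marshal.dump(payload))
puts job.output_key("pdf")
```

Because every key starts with the enqueue timestamp, listing the output bucket groups the files from each run together.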

Configuring the cloud

Now that the meat of the job was ready, we dived into configuring the servers with Chef. We created a Chef repository, added the Opscode cookbooks as a submodule, and uploaded these default cookbooks to the server:

apt
build-essential
erlang
imagemagick
runit
ruby

We created some additional cookbooks to fill out the generic setup:

rabbitmq - Installs and configures RabbitMQ
gemcutter - Upgrades Rubygems, installs Gemcutter and makes gemcutter.org the default gem source

Lastly we created our custom cookbook, which sets up all the libraries we need, downloads the code, and sets up the worker process as a runit service. Let’s walk through the default recipe in that cookbook:

%w{haml gruff fastercsv activesupport prawn prawn-core prawn-format prawn-layout eventmachine amqp aws-s3}.each do |g|
  gem_package g
end

This simply installs all of the gems that we need to run the job.

# Find the node that has the job queue
q = search(:node, "run_list:role*job_queue*")[0].first

Here we use Chef’s search feature to find the node that has RabbitMQ installed and running so we can pass it to the worker script.

# Create directory to put the code in
directory "/srv"

# Unzip the code if necessary
execute "Unpack code" do
  command "tar xzf generationjobs.tar.gz"
  cwd "/srv"
  action :nothing
end

# Download the code
remote_file "/srv/generationjobs.tar.gz" do
  source "generationjobs.tar.gz"
  notifies :run, resources(:execute => "Unpack code"), :immediate
end

# Create the directory where output goes
directory "/srv/generationjobs/tmp" do
  recursive true
end

In these four resources, we set up the working directory for the worker process, download the project code (stored on the Chef server as a tarball), and unpack it. The interesting thing about this sequence is that we don’t automatically unpack the tarball. Since the Chef client runs periodically in the background, we don’t want to be unpacking the code every time, but only when it has changed. We use an immediate notification from the remote_file resource to tell the unpacking to run when the tarball is a new version; remote_file won’t download the tarball unless the file checksum has changed.
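The underlying idea, acting only when a file's checksum changes, can be illustrated in plain Ruby outside of Chef. This is a sketch of the pattern, not of Chef's internals: the `on_change` helper and the stamp file are invented for the example, and the "unpack" step is just a counter.

```ruby
require 'digest/sha2'
require 'tmpdir'

# Run the block only when the file's checksum differs from the one recorded
# on the previous run. Returns true if the block ran. (Hypothetical helper.)
def on_change(path, stamp_path)
  current  = Digest::SHA256.file(path).hexdigest
  previous = File.exist?(stamp_path) ? File.read(stamp_path) : nil
  return false if current == previous
  yield
  File.write(stamp_path, current)
  true
end

unpacks = 0

Dir.mktmpdir do |dir|
  tarball = File.join(dir, "generationjobs.tar.gz")
  stamp   = File.join(dir, "generationjobs.sha256")

  File.write(tarball, "version 1")
  on_change(tarball, stamp) { unpacks += 1 }   # first run: unpack
  on_change(tarball, stamp) { unpacks += 1 }   # unchanged: skipped
  File.write(tarball, "version 2")
  on_change(tarball, stamp) { unpacks += 1 }   # new tarball: unpack again
end

puts "unpacked #{unpacks} times"
```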

# Create runit service for worker
runit_service "generationworker" do
  options({:worker_bin => "/srv/generationjobs/bin/worker", :queue_host => q})
  only_if { q }
end

The last resource is a pseudo-resource defined in the “runit” cookbook that creates all the pieces of a runit daemon for you; we only had to create the configuration templates for the daemon and put them in our cookbook. The additional options passed to the runit_service tell the templates the location of the worker code and the RabbitMQ host. We also take advantage of the “only_if” option so the service won’t be created if there’s no host with RabbitMQ on it yet.
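To give a feel for how those options end up in the daemon's run script, here is a sketch that renders a small ERB template by hand. The template contents are hypothetical (ours isn't shown in this post), the queue hostname is a stand-in, and the sketch assumes the options hash reaches the template as `@options`.

```ruby
require 'erb'

# Hypothetical run-script template; a real one may differ.
RUN_TEMPLATE = <<-'TEMPLATE'
#!/bin/sh
exec 2>&1
exec <%= @options[:worker_bin] %> <%= @options[:queue_host] %>
TEMPLATE

# Stand-in values; in the recipe, :queue_host comes from the search result.
@options = {
  :worker_bin => "/srv/generationjobs/bin/worker",
  :queue_host => "queue.example.internal"
}

run_script = ERB.new(RUN_TEMPLATE).result(binding)
puts run_script
```

The rendered script simply execs the worker with the queue host as its only argument, which matches how the worker script reads the host from `ARGV`.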

The last step in the Chef configuration was to create two roles, one for the queue and one for the worker. Naturally, the node that has the queue can also act as a worker. Here’s what the role JSON documents look like:

// The queue role
{
  "name": "job_queue",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {

  },
  "description": "Provides a message queue for sending jobs out to the workers.",
  "recipes": [
    "erlang",
    "rabbitmq"
  ],
  "override_attributes": {

  }
}

// The worker role
{
  "name": "job_worker",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "default_attributes": {

  },
  "description": "Processes the data from a queue into the PDF, PNG and HTML output.",
  "recipes": [
    "apt",
    "build-essential",
    "ruby",
    "gemcutter",
    "imagemagick::rmagick",
    "runit",
    "custom"
  ],
  "override_attributes": {

  }
}

Running the jobs on EC2

Now comes the fun (and easy) part! Armed with an AWS account, an EC2 certificate, and knife, we began firing up nodes to run the job. With Opscode’s preconfigured Chef AMI, you can pass a JSON node configuration in the EC2 initial data. First we generated the configuration for the job queue node:

$ knife instance_data --run-list="role[job_queue] role[job_worker]" | pbcopy

With the JSON configuration in the clipboard, we could paste it into ElasticFox (or the AWS Management Console) and fire up the first EC2 node. Several minutes later, the node was ready to go. Next, we created a similar configuration, but with only the worker role:

$ knife instance_data --run-list="role[job_worker]" | pbcopy

Then we fired up nine more nodes with that configuration and proceeded to initiate the job from the queue node:

$ ssh -i ~/ec2-keys/my-ec2-cert.pem root@ec2-public-hostname
[root@ec2-public-hostname]$ cd /srv/generationjobs
[root@ec2-public-hostname]$ bin/startjobs localhost < manifest.txt

After all the preparation, that’s all there was to it! A little over an hour later, we had generated PNG graphs, PDF, and HTML from all 3600 datasets.

Conclusion

It’s no mystery why “cloud computing” is so popular. The ability to quickly and cheaply access computational power, utilize it, and then dispose of it is really appealing, and tools like Chef and EC2 make it really easy to accomplish. What can you cook up?