Streaming Data with Rails and Heroku

A reasonably common feature in Rails is to export data to CSV. I find this is often a quick-and-easy solution to many problems, since it’s very easy to analyze and filter the data with Excel.

The problem with exporting this data is that it can take quite a while to generate the CSV file, especially if you’re dealing with large datasets. Heroku will terminate any request that takes longer than 30 seconds to return its first byte. The exception to this rule is streaming – as specified by the Heroku documentation:

An application has an initial 30 second window to respond with a single byte back to the client. However, each byte transmitted thereafter (either received from the client or sent by your application) resets a rolling 55 second window. If no data is sent during the 55 second window, the connection will be terminated.

So how do we do this streaming? This is something that has changed quite a bit with the different versions of Rails – the example I’m providing here has been tested on Rails 3.2.

In order to stream data you will need a multi-threaded or multi-process web server – like Puma or Unicorn – otherwise the single worker is tied up serving the streaming response and cannot handle any other requests until all the data is generated.
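If you go with Puma instead of Unicorn, a roughly equivalent configuration is a sketch along these lines (the WEB_CONCURRENCY and MAX_THREADS environment variable names are conventions, not requirements – tune the counts for your dyno size):

```ruby
# config/puma.rb -- illustrative sketch, not a drop-in file
workers Integer(ENV["WEB_CONCURRENCY"] || 2)    # multi-process
threads 1, Integer(ENV["MAX_THREADS"] || 5)     # multi-threaded within each worker
port    Integer(ENV["PORT"] || 3000)            # Heroku sets PORT automatically
```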

Configuration for Unicorn

We are using Unicorn, so here is what your configuration file – config/unicorn.rb – will look like.

worker_processes 3
timeout 120

# Enable streaming (for CSV downloads)

port = (ENV["PORT"] || 3000).to_i
listen port, :tcp_nopush => false

Notice that I am setting an explicit timeout of 120 seconds – so Heroku will allow your streaming to go on indefinitely, but Unicorn will kill the worker if the request goes longer than 120 seconds. I think this is a good limit – if your requests are taking longer than 2 minutes you probably want to either pre-generate the response or generate it in the background and email the result.

The last 2 lines are what actually enable streaming: with tcp_nopush disabled, output is flushed to the client as it is written rather than buffered. In your Heroku environments you will automatically have a PORT environment variable; in development we simply fall back to port 3000.

Controller action

Since Rails 3.1 we can assign any object that responds to each – such as an Enumerator – to the response_body, so we simply need to yield our CSV rows to it.

def users
  self.response.headers["Content-Type"] ||= 'text/csv'
  self.response.headers["Content-Disposition"] = 'attachment; filename="users.csv"'
  self.response.headers["Content-Transfer-Encoding"] = "binary"
  self.response.headers["Last-Modified"] = Time.now.ctime.to_s

  self.response_body = Enumerator.new do |yielder|
    yielder << ['first_name', 'last_name', 'email'].to_csv
    User.find_each do |user|
      yielder << [
        user.first_name,
        user.last_name,
        user.email
      ].to_csv
    end
  end
end

Notice that I am using ActiveRecord’s find_each method - this loads the records in batches of 1000 by default, which keeps memory usage flat no matter how large the table is. Also note that Array#to_csv comes from Ruby’s csv standard library, so it needs to be required somewhere in your app (for example, require 'csv' in an initializer).
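The streaming pattern itself is plain Ruby and can be exercised outside Rails. Rails calls each on the response_body and writes every chunk to the socket as it is produced, so nothing is generated until a consumer iterates. A self-contained sketch (the user data is made up for illustration):

```ruby
require 'csv'

# Stand-in for User records -- purely illustrative data.
USERS = [
  { first_name: 'Ada',  last_name: 'Lovelace', email: 'ada@example.com' },
  { first_name: 'Alan', last_name: 'Turing',   email: 'alan@example.com' }
]

# Same shape as the controller's response_body: a lazy Enumerator.
body = Enumerator.new do |yielder|
  yielder << %w[first_name last_name email].to_csv
  USERS.each do |user|
    yielder << [user[:first_name], user[:last_name], user[:email]].to_csv
  end
end

# No CSV is generated until something iterates -- here we consume it all:
chunks = body.to_a
# chunks.first => "first_name,last_name,email\n"
```

In the Rails case the consumer is the web server, which writes each chunk to the client as soon as it is yielded.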

Keep in mind that we still have the 2 minute limit from Unicorn - if a request goes over this limit the worker is killed and your CSV file will simply be incomplete.
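For reports that regularly exceed that limit, pre-generating the file is the simple alternative mentioned earlier: build the whole CSV ahead of time (typically from a background job) and have the request serve or link to the finished file. A minimal plain-Ruby sketch of the generation step – the rows and the Tempfile destination are illustrative; a real app would likely write to S3 or similar:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical rows standing in for a database query.
ROWS = [
  %w[first_name last_name email],
  ['Grace', 'Hopper', 'grace@example.com']
]

# Generate the entire file up front; serving a finished file later
# is fast, so no request-time limit applies to the generation.
report = Tempfile.new(['users', '.csv'])
ROWS.each { |row| report.write(row.to_csv) }
report.flush

contents = File.read(report.path)
```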

Happy coding.