Patterns for Data Migrations in Rails

In my experience with developing Rails applications I have found migrations to be one of the most error-prone areas. As I result I have come up with a few loose patterns on what to do (and what not to do) when doing migrations.

Rule #1 – Never modify an existing migration

This is probably obvious to anyone who has worked with Rails past a cursory glance, but it is also the first rule I broke as a new Rails developer. Once your migration has been pushed it cannot be changed – environments where your migration has already run will simply ignore the change. This will lead to data inconsistencies (since some environments will run the older version of your migration and some the newer) and general pain in your development cycle.

Rule #2 – Migrations should not interact with models

Let’s look at a sample migration.

class AddStatusToUsers < ActiveRecord::Migration
  def up
    add_column :users, :status, :string
    User.find_each do |user|
      user.update_attribute(:status, 'active')
    end
    change_column :users, :status, :string, null: false
  end

  def down
    remove_column :users, :status
  end
end

This migration simply adds a new ‘status’ column to the ‘users’ table, sets all the existing users to be ‘active’ and then changes the column to not allow null values. This will also run without any issues.

The reason why you don’t want to do this is because of the dependency on the User model. At a later point we might rename the ‘User’ model to something else (‘Person’ for example) in which case our migration will simply blow up.

What then is the alternative?

Pattern #1 - Execute custom SQL statements

Let’s change the previous example to not depend on the User model.

class AddStatusToUsers < ActiveRecord::Migration
  def up
    add_column :users, :status, :string
    db.execute "UPDATE users SET status = 'active'"
    change_column :users, :status, :string, null: false
  end

  def down
    remove_column :users, :status
  end
  
  private
  
  def db
    ActiveRecord::Base.connection
  end
end

This is the most common pattern I use for migrations - write custom SQL and use the various methods on ActiveRecord::Base.connection (see the API docs) to query and manipulate the data.

Remember to use the quote method to properly escape any input you’re passing into SQL, for example we could rewrite this example like this:

db.execute "UPDATE users SET status = #{db.quote(:active)}"

Pattern #2 - Using migration models

Let’s take a look at a more complicated problem, which illustrates the need for migration models. Let’s assume our User model has many comments and we now want to add a last_comment_id field on the User. What would this migration look like?

class AddLastCommentIdToUsers < ActiveRecord::Migration
  def up
    add_column :users, :last_comment_id, :integer
    db.select_values("SELECT id FROM users").each do |user_id|
      last_comment_id = db.select_value("SELECT id FROM comments WHERE user_id = #{db.quote(user_id)} ORDER BY created_at DESC")
      if last_comment_id
        db.update("UPDATE users SET last_comment_id = #{db.quote(last_comment_id)} WHERE user_id = #{db.quote(user_id)}")
      end
    end
  end

  def down
    remove_column :users, :last_comment_id
  end
  
  private
  
  def db
    ActiveRecord::Base.connection
  end
end

Even with this simple example that’s quite a bit of custom SQL we need to write. If we were using STI (single table inheritance) this example would have gotten even more complicated. Of course, with the built-in ActiveRecord models this would have been really simple (and readable).

This is where migration models come in - we basically declare a model purely for the purpose of the migration. Let’s modify our example to use a migration model.

class AddLastCommentIdToUsers < ActiveRecord::Migration
  class Comment < ActiveRecord::Base; end
  class User < ActiveRecord::Base
    has_many :comments,  -> { order("created_at ASC") }
  end

  def up
    add_column :users, :last_comment_id, :integer
    User.find_each do |user|
      last_comment = user.comments.last
      user.update_attribute(:last_comment_id, last_comment.id) unless last_comment.nil?
    end
    
    Comment.reset_column_information
    User.reset_column_information
  end

  def down
    remove_column :users, :last_comment_id
    
    Comment.reset_column_information
    User.reset_column_information
  end
end

This migration will run without any issues. Unfortunately it can also have some unintended side-effects - if our User model has timestamps (meaning created_at and updated_at columns) Rails will automatically update those columns as part of the update, effectively doing a reset on all the timestamp information on our users.

Another very serious issue that can show up with this pattern is when the name of the migration model’s class is used - for example, in polymorphic associations or in single table inheritance. Our migration models are namespaced to the migration class, so you end up seeing values like ‘AddLastCommentIdToUsers::Comment’ in the type field.

This is all to say that we cannot use this pattern in this form. The best way to get around all of these issues is to use ActiveRecord for doing queries, but manual SQL for the updates. So instead of doing update_attribute in this case, we can simply write a manual SQL statement.

class AddLastCommentIdToUsers < ActiveRecord::Migration
  class Comment < ActiveRecord::Base; end
  class User < ActiveRecord::Base
    has_many :comments,  -> { order("created_at ASC") }
  end

  def up
    add_column :users, :last_comment_id, :integer
    User.find_each do |user|
      last_comment = user.comments.last
      execute "UPDATE users SET last_comment_id = #{last_comment.id}" unless last_comment.nil?
    end
    
    Comment.reset_column_information
    User.reset_column_information
  end

  def down
    remove_column :users, :last_comment_id
    
    Comment.reset_column_information
    User.reset_column_information
  end
end

(The ‘execute’ method is always available in migrations - it simply executes the given SQL and doesn’t return any values.)

Summary

I think it’s important to keep in mind that data migrations are complicated - even when the code is very simple there is often hidden complexities that we are unaware of. So which of these 2 patterns should we choose? I usually end up writing custom SQL statements unless the query part is really complicated, in which case I will use migration models but still do the update portion with manual SQL.

If there are other patterns for writing data migrations in Rails I would love to hear about them. Happy coding.