It is possible to do database migrations without incurring any downtime as long as we follow certain patterns. These patterns are not really specific to Rails - they apply to pretty much any web framework - but there are a few nuances that are specific to Rails which I will cover here.
Here is an example of a simple migration that will cause problems unless we put the application in maintenance mode during the deployment.
(Note that even though I’m removing the column I’m still specifying the full type in order to make the migration reversible. You can test that your migration is reversible by running
This change seems innocent enough, but once we deploy it to production errors will start to pop up.
Of course it doesn’t exist - we just removed it! So who is still trying to access this column? The answer, it turns out, is ActiveRecord. Specifically, ActiveRecord in the old web processes. ActiveRecord caches table columns and uses this cache when it builds the
UPDATE SQL statements. Also keep in mind that ActiveRecord by default will update all columns when saving a record, even when that column has not been modified - so in this case we could be modifying a column like
last_seen_at for the user, but ActiveRecord will update all the columns, including the non-existing status column.
So how do we structure our database migrations to avoid these types of errors while still allowing zero downtime deployments? The answer lies in backwards-compatible migrations.
Backwards Compatible Migrations
The rule we have to adhere by is that all migrations need to be compatible with the code that is already running. This involves deploying your changes in multiple steps:
- Make your code compatible with the migration you need to run
- Run the migration
- Remove code specifically written to allow the migration to run
This means that instead of using a single deployment to make our code and database changes at once, we will use multiple deployments to ensure each change is compatible with the code that is already running.
Removing a column
Let’s go back to our example - removing the status column from the users table. Instead of dropping this column, our first change is to tell ActiveRecord to stop writing to this column.
Once this change is deployed, we can go ahead and drop the column from the database, since we’re already guaranteed that the code that is already running will not write to this column. We can also clean up the code we wrote to accomodate the database change.
Renaming a column
In order to rename a column we need to do 3 deployments. As an example, let’s imagine we have a created_date column on our users table which we would like to rename to created_at (the idiomatic Rails convention).
The first step is to add the new column and tell ActiveRecord to write to both the old and the new columns, but keep reading from the old column.
Once this is deployed, we can make sure these 2 columns are in sync and tell ActiveRecord to stop writing to the old column (and start reading from the new column).
Once this is deployed we can remove the old column as well as the code that specifically excludes that column. Optionally we can also add a not-null constraint to this column, if desired (this is assuming the old column had a not-null constraint, otherwise our data wouldn’t adhere to this contraint).
Creating indexes is surprisingly problematic - if you are dealing with large tables creating the index can take minutes or even hours to complete. If you are using PostgreSQL this means your table will allow reads but will be locked against writes during this time. The alternative is to create the index concurrently:
disable_ddl_transaction! tells Rails not to run our migration within a transaction, which is required for creating indexes concurrently.
Given the wide range of changes we can make to the database it’s difficult to have a set pattern for every change (and also why it’s difficult to build this into the ActiveRecord or Rails framework) but the key is to stick to the rule: all migrations need to be compatible with the code that is already running. I find that approaching the problem from this angle makes it easier to reason about the different steps needed in order to make a database change.
Here is a very simplistic list of common changes and the steps needed:
- Adding a column - Completely safe if the column is nullable. If the column is not-nullable (and doesn’t have a default) we need to first add a nullable version of the column and then follow the steps for adding NOT NULL constraints.
- Adding NOT NULL contraints - First make sure that the application layer is never writing NULL values to this column, basically treating it as not-nullable even though the database doesn’t enforce this constraint. Then update all existing records to make sure there are no null values and add the database contraint. The same concept applies to other types of constraints like UNIQUE.
- Creating tables - Completely safe.
- Removing tables - First remove all references to the table from the application layer, then remove the database table.
- Removing contraints/indexes - Completely safe.
That’s all there is to it! Happy coding.