This post started as a polemic against some views presented by Rob Conery in his latest blog post, but it has grown a bit. Sorry. These things happen with blog posts!
Database migrations
Ruby on Rails has an ORM called ActiveRecord, which introduced me to the concept of “database migrations”. This was in the days of Rails 1.2, around the year 2007. Yes, I am that old, apparently. The DSL you use to implement migrations has changed slightly since that prehistory, as has the versioning scheme (integers vs timestamps), but the general concept has remained intact.
Database migrations in Rails (and in other libraries) provide a way to manipulate the database schema incrementally. In a single migration, you specify some commands, for example:
- create table “users” with given fields
- add unique index to “users.email” and “users.id”
- add column “posts.author_id” to table “posts” that references “users.id”
This is a recipe to build a schema for your database, one conceptual change at a time.
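A minimal sketch of what such a migration could look like with the ActiveRecord DSL; the class, table and column names below are just an example:
class CreateUsersAndPosts < ActiveRecord::Migration
  def change
    create_table :users do |t|
      t.string :email, null: false
      t.timestamps null: false
    end

    add_index :users, :email, unique: true

    create_table :posts do |t|
      t.references :author, index: true
      t.timestamps null: false
    end

    # Point the foreign key on posts.author_id back at users.id explicitly.
    add_foreign_key :posts, :users, column: :author_id
  end
end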
How people often use migrations
I have seen Ruby developers writing migrations with almost religious care, making sure they are all reversible and can be applied from any version to any version at any time. When the DSL for writing migrations changes, those people would even go back to their old migration files and update them, so that they can be run without any errors to recreate the full database schema.
Rob points out in his blog post that after the “3rd or 4th year of life [of a project], however, those 40 or so migration files start to look rather ridiculous”. I just ran this command on a Rails project I have been working on for the last year or so:
ls db/migrate | wc -l
329
…and yeah, this is ridiculous. So I removed the 320 oldest migrations without batting an eye. Heresy.
Oh no, the migrations will fail on a brand new install. Yeah, I don’t care. I only care that I have all the migrations between the last release to production and now; the rest is irrelevant.
Ruby on Rails, after applying the migrations, creates a schema dump in either db/schema.rb or db/structure.sql, depending on your configuration. This is the file that is meant to be used when setting up a project on a new development machine, staging or production system. Instead of rake db:migrate, you should always run rake db:schema:load to re-create your database.
How I treat migrations
For me, database migrations are mostly useful when working in a larger team of developers. Migrations created by other developers appear on my master branch after I git pull. I run the migrations command and, in theory, I should be good to go, with the database changes made by others in place. This process is very useful in my opinion, but it has some shortcomings.
One slightly annoying thing in ActiveRecord’s implementation of migrations is that when two people add migrations in parallel, and those migrations add/remove fields on the same table, weird things start happening to db/schema.rb. Every other time a person involved in such a migration conflict runs the migration command, the fields in db/schema.rb appear in a different order, and Git shows a change in code. ActiveRecord in Rails ignores the order of columns, so I guess that’s not a big deal until you start writing raw SQL that depends on the order of those columns.
I do not care to write “down” migrations; that is a waste of time. I only write one if I need to roll back a migration myself, before I have even pushed my changes to the repository. “Oh, but then you can’t roll back your production system if you need to.” True, but that’s rarely possible anyway. If you remove some columns from the database and later recreate them in the down part of the migration, rolling back will not re-populate the foreign keys anyway.
The correct way to handle situations where you need to roll back a bad deploy is to snapshot the database just before the deployment. If it breaks immediately, you restore the database from the backup. If the breaking change was not discovered immediately, you should write a correcting migration rather than rolling back migrations.
Other problems with migrations
When you use more advanced features of your database, the migrations may not be up to the task. The most common problem I have is the need to drop and re-create views when I attempt to change one of the tables the views depend on. I think the manual approach to writing migration scripts does not leave much space for improvement here. If the system generated the structure updates dynamically, somehow based on the schema defined in models (Django does that, right?), this could be handled automatically too.
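A sketch of how that usually ends up looking, assuming a hypothetical order_reports view that depends on the orders table:
class ChangeOrdersTotalPrecision < ActiveRecord::Migration
  def up
    # The dependent view blocks the column change, so drop it first...
    execute "DROP VIEW IF EXISTS order_reports"

    change_column :orders, :total, :decimal, precision: 12, scale: 2

    # ...and re-create it by hand afterwards.
    execute <<-SQL
      CREATE VIEW order_reports AS
        SELECT user_id, SUM(total) AS total_spent
        FROM orders
        GROUP BY user_id
    SQL
  end
end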
If you run migrations as part of your release to production, and there is an error you did not anticipate (a constraint violation, for example), some of your migrations may be applied while others will not. This is a problem if your code has been deployed and runs on the most recent version, but the schema hasn’t caught up. I think the only good solution here is to re-deploy the previous version of the app and recover from a database backup (or run migrations down to a given version if you can).
Individual migrations are wrapped in a transaction, so each is applied in full or not at all. This may not be the case if you are using MySQL/MariaDB (or whatever it’s called nowadays), since DDL statements are not transactional there. Partially applied migrations are a pain to recover from.
Data in your database needs migrations too
Your database consists of schema and data, and you can’t take care of one without taking care of the other.
This way of thinking led me to the conclusion that we should actually modify the data in the database in migrations. Many people discourage that sort of behavior because of the nasty side effects and the lack of class reloading between migrations: ActiveRecord models would not pick up the changes in their schema if you attempt to use a model in the very migration that modifies it.
I often use execute("SOME RAW SQL") in a migration file to update the data if we need to. When some data needs to be moved between tables, a good pattern is (a sketch of the middle step follows the list):
- create a new database table in first migration
- copy over / modify the data in second migration
- remove the old table or columns in third migration
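The middle step can be a plain execute, something like this; the profiles and user_details tables are made up for illustration:
class CopyProfilesToUserDetails < ActiveRecord::Migration
  def up
    # Copy the rows from the old profiles table into the new user_details table.
    execute <<-SQL
      INSERT INTO user_details (user_id, bio, created_at, updated_at)
      SELECT user_id, bio, created_at, updated_at
      FROM profiles
    SQL
  end
end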
If I feel like using an ActiveRecord model in a migration to migrate my data, I do not attempt to use my models from app/models. Instead, I create temporary models in the migration file itself:
class TempUser < ActiveRecord::Base
self.table_name = "users"
end
My migration files often consist of scripts other people would put in the lib/ directory and run post-release. The difference is that I don’t really care about cleaning them up afterwards. They will not clutter my lib/ directory, and they will not be autoloaded and eat up any memory either. At any point in the future, I will simply clean up the db/migrate directory from the old garbage.
Migrations give you a framework to work with
Migrations can be a tool that allows updating your database schema and contents with a single command. This is particularly useful when performing continuous deployments. I use and like Codeship very much. When my build on master passes, the app is put into maintenance mode, a database snapshot is created, the code is deployed to the staging server, and migrations are run. If the migrations finish without an error, the web app is unblocked and started. If the migrations fail for some reason, the database snapshot is restored and the app is reset to the last known working version.
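As a rough sketch, that sequence could be driven by a single rake task; the maintenance, snapshot and rollback helpers below are hypothetical placeholders for whatever your hosting or CI tooling actually provides:
namespace :deploy do
  task migrate_safely: :environment do
    # enable_maintenance!, snapshot_database!, restore_snapshot!,
    # rollback_release! and disable_maintenance! are hypothetical helpers.
    enable_maintenance!
    snapshot = snapshot_database!
    begin
      Rake::Task["db:migrate"].invoke
    rescue StandardError => e
      restore_snapshot!(snapshot)
      rollback_release!
      raise e
    ensure
      disable_maintenance!
    end
  end
end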
Before I embraced this way of treating migrations as a tool to modify both schema and data, I often had to manually stop the app, run migrations, run a rake task to modify data, run some more migrations. The automated process is generally more reliable.
When we are working on a large release, I often do a dry run of the migrations before releasing to production. I download the most recent production backup and load it into my local PostgreSQL. After running the migrations on the copy of the production database, I can check whether the data looks correct. If the data is not correct and requires changes, I do not hesitate to fix the migration that incorrectly handled the data updates. Oh no, another heresy. I find it’s actually okay to do so, unlike with migrations that modify the schema. Those should be fixed by adding another migration, unless they are truly destructive to data.
JSONB and the lack of schema
The second point where I have a slightly different view than Rob is JSONB. At some point in the past, I briefly jumped on the NoSQL bandwagon. “There’s no schema, it’s all documents” turned out to be bullshit sooner rather than later.
The truth is, there’s always a schema. Your code depends on the JSONB documents being structured in some manner. If you update your code to generate different JSONB documents, you need to somehow take care of existing documents. One could say that you have to migrate the old documents to newer versions.
The process of updating “schema-free” documents from the previous de facto schema to the next one was painful. This was one of the main reasons why I decided to re-think the idea of NoSQL and schema-less databases in general. The other reason was that MongoDB was randomly losing some documents back in the day.
The idea behind JSONB and MongoDB documents is basically the same. You have a (possibly deeply nested) hash where you set some values, and it’s all shiny since you don’t have to modify any schema. Yes, in development that is very convenient. As soon as you have production data to take care of, any form of schema-less documents becomes a liability. You need to take care of updating the documents yourself, often not in bulk, but one by one in a script. If your database has millions of records, that can be time consuming, and no one likes unnecessary downtime.
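Those one-by-one updates tend to end up as something like this; the Account model and the keys inside its settings column are made up for illustration:
# Rename a key inside every settings document, one record at a time.
Account.find_each do |account|
  settings = account.settings || {}
  next unless settings.key?("notify")

  settings["notifications"] = settings.delete("notify")
  account.update_columns(settings: settings)
end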
When working in a team of 4 or 5 developers on such schema-less database structures, you often find it easier just to delete and re-create stuff in the database rather than write a piece of code that would update the documents. “Guys, please erase everything in column X”. Again, as soon as you have production data, this practice starts to be problematic: you find yourself writing data migration scripts just before the release. That can be weeks after you or your colleague made the silent schema change, and you may no longer remember some edge cases that need to be taken care of. If you had written a migration file at the time of making the change, all you would have to do is run that file.
I guess the problems above can be solved with self-discipline and additional practices to ensure all goes well and your schema remains consistent. In my experience, though, people take shortcuts wherever they can, so having no framework for updating schema and data will result in a mess.
JSONB and atomic updates
JSONB documents, by default, do not allow you to do atomic updates of individual fields in a document. Either the whole document is updated, or nothing is changed.
This can be, and often is, a problem if you have an application with more than one user allowed to write. Even with a single user, if he or she attempts to perform operations on fields of a JSONB document in parallel, only one of the operations will be successful: the other one will be silently overwritten when the whole JSONB field is updated.
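The lost update looks roughly like this; the Profile model and its preferences JSONB column are hypothetical:
# Two processes load the same document...
a = Profile.find(1)
b = Profile.find(1)

# ...each changes a different key...
a.preferences = a.preferences.merge("theme" => "dark")
b.preferences = b.preferences.merge("language" => "pl")

# ...and each writes the whole document back.
a.save!
b.save! # silently overwrites the first update: "theme" => "dark" is lost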
Sure, you can use some sort of locking mechanism to prevent that. But if you do want to allow multiple users to interact with the same records, this stops being fun instantly.
The upcoming version of PostgreSQL (9.5) will have a built-in jsonb_set function to take care of that. For now, you are limited to third-party extensions if you need to do that. If you use an ORM like ActiveRecord, chances are that you’ll need to do some heavy gymnastics to use those custom functions. You may even find yourself writing SQL by hand (oh no!).
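For the record, this is roughly what jsonb_set will allow; the table and key names are made up:
# Change a single key without rewriting the rest of the document.
ActiveRecord::Base.connection.execute(<<-SQL)
  UPDATE profiles
  SET preferences = jsonb_set(preferences, '{theme}', '"dark"')
  WHERE id = 1
SQL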
When to use JSONB
Having said all the above, I use and like JSONB very much. I have found at least two cases where my team and I use JSONB column types:
- configurations/settings/permissions hashes
- data backing a JavaScript part of the app
The first case is a pretty obvious use of the JSONB column type. The second one I often find handy if I want to avoid writing a complicated API to back some data that the JavaScript part of the app operates on.
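The first case usually boils down to something like this; the column and accessor names are illustrative:
# A migration adding a JSONB settings column with a sane default...
class AddSettingsToUsers < ActiveRecord::Migration
  def change
    add_column :users, :settings, :jsonb, null: false, default: {}
  end
end

# ...and a model exposing a few known keys as regular attributes.
class User < ActiveRecord::Base
  store_accessor :settings, :theme, :newsletter_opt_in
end
Keeping the known keys listed in store_accessor preserves the convenience of a hash while documenting its de facto schema in one place.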
As an example, a project I currently work on has a “compensation calculator”. This is a fairly complex piece of JavaScript UI that can be embedded on different records: employment records, job offers, candidate info etc. We don’t really do anything with the data server-side, except maybe rendering it to HTML. We simply allow the JavaScript component to read the data and save an updated version. The decision not to create normalized tables, and not to create API endpoints for them, probably saved us a lot of time. That said, we have had to migrate the structure of the JSONB documents once already, with a migration that goes through every single document, re-calculates some stuff using Ruby code, and saves the document. If we were to do that sort of re-calculation more often, we would be better off with a set of 4 or 5 normalized tables in the database, performing the calculations dynamically.
Summary
The bottom line is: be careful with JSONB, use it where it makes sense and in conditions that make sense to you and your co-workers. It’s not a silver bullet.
Migrations are not perfect, but they help a lot when working in a team.
Post by Hubert Łępicki
Hubert is a partner at AmberBit. Rails, Elixir and functional programming are his areas of expertise.