This post started as a polemic against some views presented by Rob Conery in his latest blog post, but it has grown a bit. Sorry. These things happen with blog posts!

Database migrations

Ruby on Rails has an ORM called ActiveRecord, which introduced me to the concept of “database migrations”. This was in the days of Rails 1.2, around the year 2007. Yes, I am that old, apparently. The DSL you use to implement migrations has changed slightly since that prehistory, as has the versioning scheme (integers vs timestamps), but the general concept has remained intact.

Database migrations in Rails (and in other libraries) provide a way to manipulate the database schema in an incremental way. In a single migration, you specify some commands, for example:

  • create table “users” with given fields
  • add unique index to “users.email” and “users.id”
  • add column “posts.author_id” to table “posts” that references “users.id”

This is a recipe to build a schema for your database, one conceptual change at a time.
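
A minimal sketch of what such a migration could look like (the exact columns are made up for illustration):

class CreateUsersAndPosts < ActiveRecord::Migration
  def change
    # create table "users" with given fields
    create_table :users do |t|
      t.string :email, null: false
      t.timestamps
    end

    # add unique index on "users.email"
    add_index :users, :email, unique: true

    # add "posts.author_id" referencing "users.id"
    create_table :posts do |t|
      t.references :author
      t.timestamps
    end
    add_foreign_key :posts, :users, column: :author_id
  end
end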

How people often use migrations

I have seen Ruby developers writing migrations with almost religious care, making sure they are all reversible and can be applied from any version to any version at any time. When the DSL for writing migrations changes, those people would even go back to their migration files and update them, so that they can be run without any errors to recreate the full database schema.

Rob in his blog post points out that after the “3rd or 4th year of life [of a project], however, those 40 or so migration files start to look rather ridiculous”. I just ran this command on a Rails project I have been working on for the last year or so:

ls db/migrate | wc -l
329

…and yeah, this is ridiculous. So I removed the 320 oldest migrations without a blink of an eye. Heresy.

Oh no, the migrations will fail on a brand new install. Yeah, I don’t care. I only care that I have all the migrations between the last release to production and now; the rest is irrelevant.

Ruby on Rails, after applying the migrations, creates a schema dump in either db/schema.rb or db/structure.sql, depending on your configuration. This is the file that is meant to be used when setting up the project on a new development machine, or on staging or production systems. Instead of rake db:migrate, you should always run rake db:schema:load to re-create your database.

How I treat migrations

For me, database migrations are mostly useful when working in a larger team of developers. Migrations created by other developers would appear on my master after I git pull. I run the migrations command, and in theory I should be good to go, with database changes made by others.

This process is very useful in my opinion, but it has some shortcomings. One slightly annoying thing in ActiveRecord’s implementation of migrations is that when two people add migrations in parallel, and those migrations add/remove some fields on the same table, weird things start happening to db/schema.rb. Every other time a person involved in such a migrations conflict runs the migration command in the future, the fields in db/schema.rb will appear in a different order, and Git will show a change in the code. ActiveRecord in Rails ignores the order of columns, so I guess that’s not a big deal until you start writing some raw SQL that may depend on the order of those columns.

I do not care to write “down” migrations; this is a waste of time. I only write one if I need to roll back a migration myself, before I have even pushed my changes to the repository. “Oh, but then you can’t roll back your production system if you need to.” True, but that’s rarely possible anyway. If you remove some columns from the database, and recreate them in the down part of the migration later on, rolling back will not re-populate the foreign keys anyway.
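
In practice, I either rely on the reversible change syntax or I declare the migration irreversible outright. A sketch, with a made-up column name:

class RemoveLegacyTokenFromUsers < ActiveRecord::Migration
  def up
    remove_column :users, :legacy_token
  end

  def down
    # no point pretending this can be undone with the data intact
    raise ActiveRecord::IrreversibleMigration
  end
end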

The correct way to handle situations where you need to roll back a bad deploy is to snapshot the database just before the deployment. If it breaks immediately, you restore the database from the backup. If the breaking change was not discovered immediately, you should write a correcting migration rather than rolling back migrations.

Other problems with migrations

When you use more advanced features of your database, the migrations may not be up to the task. The most common problem I have is the need to drop and re-create views when you attempt to change one of the tables the views depend on. I think the manual approach to creating migration scripts does not leave much space for improvement here. If the system was generating updates to the structure dynamically, somehow based on the schema defined in the models (Django does that, right?), this could be done automatically too.
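
For illustration, a sketch of the drop-and-recreate dance with a hypothetical “user_stats” view:

class WidenUsersName < ActiveRecord::Migration
  def up
    # the view has to go away before the column it selects can change
    execute "DROP VIEW user_stats"

    change_column :users, :name, :string, limit: 255

    execute <<-SQL
      CREATE VIEW user_stats AS
        SELECT users.id, users.name, COUNT(posts.id) AS posts_count
        FROM users
        LEFT JOIN posts ON posts.author_id = users.id
        GROUP BY users.id, users.name
    SQL
  end
end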

If you run migrations as part of your release to production, and there is an error you did not anticipate (a constraint violation, for example), some of your migrations may be applied, others will not. This is a problem if your code has been deployed and runs on the most recent version, but the schema hasn’t caught up. I think the only good solution here is to re-deploy the previous version of the app and recover from a database backup (or run migrations down to a given version if you can).

Individual migrations are wrapped in a transaction, so they are applied in full or not at all. This may not be the case if you are using MySQL/MariaDB (or whatever it’s called nowadays). Partially applied migrations are a pain to recover from.

Data in your database needs migrations too

Your database consists of schema and data, and you can’t take care of one without taking care of the other one.

This way of thinking led me to the conclusion that we should actually modify the data in the database in migrations. Many people discourage that sort of behavior, because of the nasty side-effects and the lack of class reloading between migrations. ActiveRecord’s models will not pick up the changes to their schema if you attempt to use the same model in the migrations that modify it.

I often use execute("SOME RAW SQL") in a migration file to update the data if we need to. When some data needs to be moved between tables, a good pattern is (see the sketch after this list):

  • create a new database table in the first migration
  • copy over / modify the data in the second migration
  • remove the old table or columns in the third migration
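
A sketch of that pattern, with a hypothetical “bio” column moving from “users” into a new “user_profiles” table (each class would live in its own migration file):

class CreateUserProfiles < ActiveRecord::Migration
  def up
    create_table :user_profiles do |t|
      t.references :user, index: true
      t.text :bio
    end
  end
end

class CopyBiosToUserProfiles < ActiveRecord::Migration
  def up
    # plain SQL, no models involved
    execute <<-SQL
      INSERT INTO user_profiles (user_id, bio)
      SELECT id, bio FROM users
    SQL
  end
end

class RemoveBioFromUsers < ActiveRecord::Migration
  def up
    remove_column :users, :bio
  end
end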

If I feel like using an ActiveRecord model in a migration to migrate my data, I do not attempt to use my models from app/models; instead, I create temporary models in the migration file itself:

class TempUser < ActiveRecord::Base
  self.table_name = "users"
end
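
The migration can then use this temporary model freely; a hypothetical backfill could look like this:

class BackfillUserNames < ActiveRecord::Migration
  def up
    # TempUser is defined above, in the same migration file
    TempUser.where(name: nil).find_each do |user|
      user.update_column(:name, user.email.split("@").first)
    end
  end
end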

My migration files often consist of scripts other people would put in the lib/ directory and run post-release. The difference is that I don’t really care about cleaning them up afterwards. They will not clutter my lib/ directory, and they will not be autoloaded and eat up any memory either. At any given point in the future, I will simply clean the old garbage out of the db/migrate directory.

Migrations give you a framework to work with

Migrations can be a tool that allows updating your database schema and contents with a single command. This is particularly useful when performing continuous deployments. I use Codeship often and like it very much. When my build on master passes, the app is put into maintenance mode, a database snapshot is created, the code is deployed to the staging server, and the migrations are run. If the migrations finish without an error, the web app is unblocked and started. If the migrations fail for some reason, the database snapshot is restored and the app is reset to the last known working version.

Before I embraced this way of treating migrations as a tool to modify both schema and data, I often had to manually stop the app, run migrations, run a rake task to modify data, run some more migrations. The automated process is generally more reliable.

When we are working on a large release, I often do a dry run of the migrations before releasing to production. I download the recent production backup and load it into a local PostgreSQL. After running the migrations on the copy of the production database, I can test whether the data looks correct. If the data is not correct and requires changes, I do not hesitate to fix the migration that incorrectly handled the data updates. Oh no, another heresy. I find it’s actually okay to do so, unlike with migrations that modify the schema. Those should be fixed by adding another migration, unless they are truly destructive to data.

JSONB and the lack of schema

The second point on which I have a slightly different view than Rob is JSONB. At some point in the past, I briefly jumped on the NoSQL bandwagon. “There’s no schema, it’s all documents” turned out to be bullshit sooner rather than later.

The truth is, there’s always a schema. Your code depends on the JSONB documents being structured in some manner. If you update your code to generate different JSONB documents, you need to somehow take care of existing documents. One could say that you have to migrate the old documents to newer versions.

The process of updating “schema-free” documents from the previous de facto schema to the next was painful. This was one of the main reasons why I decided to re-think the idea of NoSQL and schema-less databases in general. The other reason was that MongoDB was randomly losing some documents back in the day.

The idea behind JSONB and MongoDB documents is basically the same. You have a (possibly deeply nested) hash, where you set some values, and it’s all shiny since you don’t have to modify any schema. Yes, in development that is very convenient. As soon as you start to have production data to take care of, any form of schema-less documents becomes a liability. You need to take care of updating the documents yourself, often not in bulk, but one by one in a script. If your database has millions of records, that can be time consuming, and no one likes having unnecessary downtime.
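
The kind of one-by-one update script I have in mind, assuming a hypothetical “settings” JSONB column whose “theme” key changed shape:

User.find_each do |user|
  settings = user.settings || {}
  next if settings["theme"].is_a?(Hash)

  # old documents stored a plain string, new code expects a hash
  settings["theme"] = { "name" => settings["theme"], "dark" => false }
  user.update_column(:settings, settings)
end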

When working in a team of 4 or 5 developers on such schema-less database structures, you often find it easier just to delete and re-create stuff in the database rather than write a piece of code that would update the documents. “Guys, please erase all in column X”. Again, as soon as you have production data, this practice starts to be problematic: you find yourself writing data migration scripts just before the release. It can be weeks after you or your colleague made the silent schema change. You may not remember some edge cases that need to be taken care of by then. If you had a migration file written at the time of making the change, all you would have to do is run that file.

I guess the problems above can be solved with self-discipline and additional practices to ensure all goes well and your schema remains consistent. In my experience, people take shortcuts wherever they can, so having no framework for updating schema and data will result in a mess.

JSONB and atomic updates

JSONB documents, by default, do not allow you to do atomic updates onto individual fields in a document. Either the whole document is updated, or nothing is changed.

This can be, and often is, a problem if you have an application that allows more than one user to write. Even if there is just one user, if he or she attempts to perform operations on fields of a JSONB document in parallel, only one of the operations will be successful: the other one will be silently overwritten when updating the JSONB field.
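
A sketch of the problem, with a hypothetical “settings” JSONB column: two processes read the same document, each changes a different key, and the last write wins.

# process A
user = User.find(1)
user.settings = user.settings.merge("theme" => "dark")
user.save!

# process B, running at the same time with a stale copy of the document
user = User.find(1)
user.settings = user.settings.merge("notifications" => false)
user.save!  # writes the whole document back, A's "theme" change is lost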

Sure, you can use some sort of locking mechanism to prevent that. But if you do want to allow multiple users to interact with the same records, this stops being fun instantly.

The upcoming version of PostgreSQL (9.5) will have the jsonb_set built-in function to take care of that. For now, you are limited to third party extensions if you need to do that. If you use an ORM like ActiveRecord, chances are that you’ll need to do some heavy gymnastics to use those custom functions. You may even find yourself writing SQL by hand (oh no!).
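
For what it’s worth, with 9.5 this should boil down to raw SQL along these lines (still the hypothetical “settings” column):

User.connection.execute(<<-SQL)
  UPDATE users
  SET settings = jsonb_set(settings, '{notifications}', 'false')
  WHERE id = 1
SQL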

When to use JSONB

Having said all the above, I use and like JSONB very much. I have found at least two cases where my team and I use JSONB column types:

  • configurations/settings/permissions hashes
  • data backing a JavaScript part of the app

The first case is a pretty obvious use of the JSONB column type. The second one I often find handy when I want to avoid writing a complicated API to back some data that the JavaScript part of the app operates on.
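
The first case in practice might look roughly like this (a sketch; column and key names are made up), with store_accessor giving attribute-style access to keys inside the document:

class AddSettingsToUsers < ActiveRecord::Migration
  def change
    add_column :users, :settings, :jsonb, null: false, default: {}
  end
end

class User < ActiveRecord::Base
  store_accessor :settings, :locale, :newsletter
end

# user.locale = "en"
# user.newsletter = true
# user.save!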

As an example of the second case, a project I currently work on has a “compensation calculator”. This is a fairly complex piece of JavaScript UI that can be embedded on different records: employment records, job offers, candidate info, etc. We don’t really do anything with the data server-side, except maybe rendering it to HTML. We simply allow the JavaScript component to read the data and save the updated version. The decision not to create normalized tables, and not to create API endpoints for them, probably saved us a lot of time. That said, we had to migrate the structure of the JSONB documents once already, with a migration that goes through every single document, re-calculates some stuff using Ruby code, and saves the document. If we were to do those sorts of re-calculations more often, we would be better off having a set of 4 or 5 normalized tables in the database and performing the calculations dynamically.

Summary

The bottom line is: be careful with JSONB, use it where it makes sense and in conditions that make sense to you and your co-workers. It’s not a silver bullet.

Migrations are not perfect, but they help a lot when working in a team.

Post by Hubert Łępicki

Hubert is a partner at AmberBit. Rails, Elixir and functional programming are his areas of expertise.