PostgreSQL awesomeness for Rails developers

PostgreSQL is the true awesome SQL database, that you should probably be using for your relational data in your Ruby on Rails applications. In fact, if you are using Heroku, you are quite likely using PostgreSQL already. If you are not using PostgreSQL, or just starting, you have no idea how awesome it is. In this post, I will walk you through some basics that all Ruby on Rails developers who use PostgeSQL should know about.

Installation

PostgreSQL is easy to install, esp. if you are running Linux, but it might be outdated. In Ubuntu 12.04 (which I use for development in Vagrant and production deployments), most recent version is 9.1. Still awesome, but we can do better and fetch ourselves version 9.3 (or soon-to-be-released 9.4) and enjoy improved JSON data types, materialized views etc.

You need to add official APT repository and install PostgreSQL from it:

$ echo "deb http://apt.postgresql.org/pub/repos/apt/ precise-pgdg main" | sudo tee -a /etc/apt/sources.list
$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install postgresql-9.3 libpq-dev

I will not describe basic configuration of your Rails application here, or setting up PostgreSQL on MacOS X, instead please have a look at this Railscast by Ryan Bates.

First things first, creating user in PostgreSQL

PostgreSQL has it’s own users and roles system. As most things in Postgres, it’s also quite awesome ;). You can create read-only users, grant access for individual tables, databases or actions… To set up PostgreSQL for comfortable Ruby on Rails development, all we need is an admin user. I will use ‘vagrant’ as this is the user I use in virtual machine to run my application.

$ sudo bash
$ su - postgres
$ createuser -a vagrant

The “-a” flag makes our user admin, so be sure you do it only in safe environment, i.e. development virtual machine and not production servers.

If we now connect to PostgreSQL as user ‘vagrant’ from shell/psql client we should have permissions to create, drop and drop tables.

$ psql
postgres=#

Using psql

psql is command-line client for PostgreSQL. If you are migrating from MySQL, there might be some commands that you miss, esp. listing databases/tables, changing output format. Here are some tips:

List all databases:

postgres=# \list; 

Switch to database ‘foo’:

postgres=# \connect foo; 

Show tables in given database:

postgres=# \dt; 

Show slightly more info about tables in your database:

postgres=# \dt+; 

Show table details and associated indexes:

postgres=# \d foo; 

Enable extended view, so it’s a bit clearer to read output of commands:

postgres=# \x; 

See more on psql docs page.

Getting dirty: configuring PostgreSQL

Main configuration file of your PostgreSQL installation is called ‘postgresql.conf’. On my installation it is located under ‘/etc/postgresql/9.3/main/postgresql.conf’. You should edit the file and amend some settings, especially on your production and staging environments, but it can also help speed up the tests.

There are a few settings with you might consider changing, let’s go through them and describe in simple words:

listen_addresses = '*' # To which interface we should bind. '*'
                       # makes your PostgreSQL visible to the Internet

max_connections = 200  # How many connections we should allow from
                       # our app, workers, delayed_jobs etc. combined
shared_buffers = 16GB  # How much memory our PostgreSQL can use for
                       # buffers. Default value is insanely small.
                       # If PostgreSQL is the only thing we run on
                       # the machine, set it to 1/4 of available RAM
work_mem = 10MB        # Increase the small value so the
                       # sorts perform better.
maintenance_work_mem = 128MB 

synchronous_commit = off # Speed up writes in exchange for possible
                         # loss of data (be careful here!)

wal_buffers = 16MB # Basically how much data we can loose. But
                   # increasing makes things faster. Choose wisely.
                   # Also applies to settings below.
wal_writer_delay = 400ms
checkpoint_segments = 32
checkpoint_timeout = 900s 
checkpoint_completion_target = 0.9

random_page_cost = 2.0 # Make planner use indices a bit more often
                       # vs. sequential table scans.

effective_cache_size = 32GB  # How much memory in total our
                             # PostgreSQL can use. Twice of
                             # shared_buffers seems good.

Comments in this file are very helpful, and you can also consult PostgreSQL docs for more in-depth overview of each configuration option and possible values.

If you don’t edit the file, you will not achieve even a fraction of awesome speed your database can perform, so go back and edit it, don’t just skip the section.

Data types awesomeness

PostgreSQL has some pretty awesome data types.

Arrays

You can use them with Rails 4 now, they should work mostly out of the box. In your migrations you need to do:

t.string :tags, array: true, default: []
...
# and use your model
Post.create!(body: "AmberBit", tags: ["ruby", "rails", "development", "company"])
...
# and query it:
Post.where('tags @> ARRAY[?]', ['ruby', 'development'])

Read more about arrays in Rails here or here.

If you add ‘postgres_ext’ gem to your Gemfile, you can query it with nicer syntax.

Don’t just jump to using arrays instead of associations, it is a neat tool but performance implications and bad database designs are likely to happen if you overuse it.

Hashes

Hstore is data type that behaves like Ruby’s hash.

execute 'CREATE EXTENSION hstore'
...
t.hstore :dict
...

# And use it in your model
Post.create!(body: "foo", dict: {author: "Hubert Łępicki", topic: "Rails web development"})
...
# And query it
Post.where("dict -> 'author' = 'Hubert Łępicki'")

For available query interface, have a look at the docs again.

Others

Date and time types are great. If you have ‘time’ column, it will not be mapped nicely to Ruby class (because there is no default equivalent), but it will still work. ‘interval’ is also worth noticing.

INET/CIDR for storing IP addresses and network info. Use it with postgres_ext gem.

UUID is the random ID you can give your models instead of sequential numeric IDs used by default by Rails. Using it with Rails 4 is pretty straightforward.

PostGIS

PostGIS is geospatial extension for PostgreSQL. What it means, is that you can have latitude/longitude in your models, and query them by distance etc. If you are still using Rails 3, you are in better position because you can quite easily use activerecord-postgis-adapter but it’s support for Rails 4 is still work in progress. Instead you can use manual approach for now.

Scaling up

When you build and launch your application, things can go smooth for some time, but then you might be hit with massive amount of data when you are really successful. Or, you can analyse big amounts of data (famously known as ‘big data’) from the start.

PostgreSQL is an old-school player here. Unlike ElasticSearch, it does not do sharding by default, it will not distribute/rearrange data between nodes. But it will also be consistent from start (not ‘eventually-consistent’), and you can be pretty sure it won’t loose large amounts of data.

You can scale your database vertically, by adding faster disks, more RAM, more CPUs etc., but you will hit the wall in the end. Thankfully, PostgreSQL can, somehow, scale horizontally. But it will require more effort on your end.

For an overview of horizontal scaling approaches in PostgreSQL, I recommend you this book by Zoltan Böszörmenyi and Hans-Jürgen Schönig. It is available on Safari Bookshelf if you are subscriber.

Master-slave replication

This approach to scaling is simplest one. You set up master instance that will handle all writes, and one or more slave instances that you can read from. The configuration is somehow complicated if you don’t know what you are doing, and PostgreSQL’s docs are not very helpful in this case.

Streaming replication is my choice, which means that updates to write-ahead-log (WAL) are streamed from master to slaves. Setting it up this way, gives you unexpected benefit: you are getting possibility to do point-in-time recovery, which is great because you can restore your system to exact state just before major disaster happened. It won’t happen often, but it saved my life (okay, my job probably) once already.

Setting up streaming replication is not that complicated if you find right tutorial online.

On our Rails web application, we have to support it somehow. We need to point our application to perform some operations on master and some on slave.

Master and slave are in fact two different PostgreSQL instances, and we need to configure and use different connections. There is a nice gem called ‘octopus’ that allows us to do just that.

To use octopus, you need to put config/shards.yml configuration file into your Rails application root, to point it to individual servers (a.k.a shards):

octopus:
  replicated: true
  production:
    slave1:
      adapter: postgresql
      host: some-host
      database: app_db
    slave2:
      adapter: postgresql
      host: other-host
      database: app_db
  development:
  ...

Database configured in config/database.yml will be used as master in such case.

In your Ruby code, you can specify which server to use:

Company.using(:master).count

Company.using(:slave1).count

Octopus.using(:slave2) do 
  Company.count
end

If you want all write to be done on master, and reads on slaves, you can enforce that by using ‘replicated_mode’ method in your models:

class Company < ActiveRecord::Base
  replicated_model
end

Read wiki page on replication on Github for more info.

Using octopus, we can have different configurations for different environments, so that we can configure more slaves on production, a bit less on staging, and use just on database for local web development of our Rails application.

Sharding

Sharding is a technique that allows you to distribute data between multiple PostgreSQL instances. Mind my words: multiple instances. This means, that they will not appear as one database for your application and you will not be able to do join queries between them. But, if your web application is writing some log files, or you can consider some denormalization, this can be, and often is, good enough.

For sharding, octopus will work just fine. You need to create multiple databases on multiple servers and you can select which one to use with ‘using(:shard1)’ syntax as above. The only difference is that you have to specify which shard to write to.

Partitioning

Another scaling option is partitioning. It is good choice if you want to create applications that you will need to perform joins/queries on whole tables, but you can speed things up by physically distributing records between partitions (which are normal PG tables).

Currently there is no good gem that I know of that will let you do partitioning, so we need to do some hacking to get it working under Ruby on Rails application.

How it works, is that we will create main tables for records we want to partition, and use PostgreSQL inheritance to create child tables. Inheritance allows us to change structure of the main table, and the changes will be done also on the child tables (hint: migrations will work just fine). When we create tables we are enforcing “check” conditions on them. If the checks are separate, and we provide conditions in our SQL queries that will allow planner to figure out which partition it should query, it retrieve results much faster. We can also query individual tables if we really want to, but this transparent approach is usually much better and easier.

When you insert data into PostgreSQL, you need to insert to proper partition instead of main table. This is a problem, because Rails will not do it by default. We can work around this situation with trigger, that will determine proper destination table and insert data into it instead.

I will use 32 partitions and a ‘partition_id’ key to identify which partition the data should go to. We have ‘images’ table and we will partition data by user:

create_table :images do |t|
  t.integer :user_id
  t.integer :partition_id
  t.integer :url
end

You have also to create partitions in a migration for your table:

partitions = 32

shards.times do |i|
  execute "CREATE TABLE images_p#{i} (
      CHECK ( partition_id = #{i} )
      ) INHERITS (images)"
end

And triggers:

tbl = 'images'
execute "CREATE OR REPLACE FUNCTION #{tbl}_insert_trigger()
               RETURNS TRIGGER AS
               $$
               DECLARE
                  _tablename text;
               BEGIN
                   _tablename := '#{tbl}_p' || NEW.\"partition_id\";

                   EXECUTE 'INSERT INTO ' || quote_ident(_tablename) || ' SELECT ($1).*'
                   USING NEW;
                   RETURN NEW;
               END;
               $$
               LANGUAGE plpgsql;"

      execute "CREATE TRIGGER insert_#{tbl}_trigger
               BEFORE INSERT ON #{tbl}
               FOR EACH ROW EXECUTE PROCEDURE #{tbl}_insert_trigger();"

      execute "CREATE OR REPLACE FUNCTION #{tbl}_delete_master()
               RETURNS trigger
               AS $$
               DECLARE
                   r #{tbl}%rowtype;
               BEGIN
                   DELETE FROM ONLY #{tbl} where id = NEW.id returning * into r;
                   RETURN r;
               END;
               $$
               LANGUAGE plpgsql;"

      execute "create trigger after_insert_#{tbl}_trigger
               after insert on #{tbl}
               for each row
                   execute procedure #{tbl}_delete_master();
               end;"

Yeah, that’s PL/SQL ;).

In our User model we need to assign each user a partition_id. We can probably do it modulo 32 to distribute images (not users) across all partitions:

class User < ActiveRecord::Base
  def partition_id
    self.id % 32
  end
end

We also need to add association that allows us to query / create records in proper partitions:

class User < ActiveRecord::Base
  has_many :images, ->(u) { where( partition_id: u.partition_id ) }
...
end 

We need also to set ‘constraint_exclusion’ in our postgresql.conf:

constraint_exclusion = on

Since now, the following code will use partitions:

 u = User.first
 u.images.create! :url => "http://www.amberbit.com/assets/amberbit_logo_big.png"
 u.images.all

But queries to retrieve images by ID, or all updates will not get any faster (in fact they will get slower) because PostgreSQL does not know to which partition it has to look to find the record, so it queries all.

To avoid that, I am afraid we have to monkey-patch ActiveRecord. As the code is quite longish, I will just give you a link to GitHub here.

Put it somewhere in your lib/ and require from ‘application.rb’ just after requiring Rails frameworks.

What the monkey patch actually does is adding ‘where partition_id=XX’ to each update/delete query to trigger constraint exclusion mechanism to kick in, and perform operations on just one partition.

In my tests, performance improvements are significant when using indices on partitions, and are just insanely huge when querying data that requires sequential scanning.

More about partitioning in docs.

Summary

That turned out to be pretty long, but you get the idea: PostgreSQL is quite awesome. Before you jump on the (often sinking) NoSQL boat, you should definitely check if your problem cannot be solved by using PostgreSQL better. Support for advanced features in Rails is limited, but you can still use it when developing your Rails application with help of external gems (postgres_ext) or your own hacks (like partitioning above). Hopefully we will see a bit more Postgres-specific features in future versions of Rails.

by Hubert Łępicki, twitter: @hubertlepicki

comments powered by Disqus