The misunderstanding of “let it crash”

If you program in Erlang, Elixir or any other of the growing number of languages that run on top of the BEAM VM, you have probably heard about the “let it crash” philosophy.

There are myths and misunderstandings about the concept, both within and outside of the community of BEAM developers, as you can clearly see in this thread I stumbled upon recently:

this slide looks interesting to me - I never bought into this erlang/elixir narration of better dealing with errors - why not prevent them?
— Andrzej Krzywda (@andrzejkrzywda) July 18, 2019

“Let it crash” hurts BEAM community?

The debate goes way back, and this post by the authors of Cowboy argues that the slogan is probably harmful to the community, precisely because it’s so commonly misunderstood.

People see “let it crash” and think of this:

Default Rails error page

Does the “let it crash” philosophy mean not handling errors and simply crashing the web requests issued by the user? Not at all!

Before we explore what it actually means, let’s think about why you wouldn’t want to handle errors.

You can (and should) expect errors

Errors in computer programs happen all the time. They are unavoidable, not only because of programmer mistakes, but also (mostly?) because we deal with external systems that can be unreliable, communicate over an unreliable network, suffer hardware failures, or fall victim to a classic case of PEBKAC by one of the system operators, such as misconfiguration.

There are a number of ways to deal with errors. Many programming languages provide a way to handle exceptions, and this is also available on the BEAM. In fact, I blogged about detecting all kinds of errors and exceptions that can happen while serving web requests in Phoenix, and the framework itself comes with ways to handle errors. There is no reason our web application users should see a random, default, ugly error page. There are better ways of handling that at the UI level, and Elixir and Phoenix provide good mechanisms for implementing just that.

The languages themselves can also provide additional safety for the programmer, eliminating whole classes of errors at compile time. I have blogged about how much of an improvement Elixir was for me in this area when I migrated from Ruby a while ago. The Gleam language goes even further than Elixir into static typing territory to deal with errors at compile time!

But we still have “let it crash” in Elixir, Erlang and Gleam alike, and programmers use the technique all the time. How so?

Why you want to “let it crash” …sometimes

While handling errors and exceptions is very much possible, and done all the time on the BEAM, writing defensive code to protect ourselves from errors can be tricky.

If errors can happen and you want to handle them, you may be used to writing try-catch clauses and responding to different kinds of errors in different ways. This adds quite a lot of code in many different places in your code base, growing the project in size, impeding performance and increasing the complexity of the system. It is definitely easier to read an algorithm that consists of steps we assume will succeed than one that assumes each and every step can fail and has to branch out to handle such cases.

Building software components that do not concern themselves with detailed error handling is the core of the “let it crash” philosophy. This is not the same as building software that does not concern itself with handling these errors at all.

Let it crash, the proper way

The BEAM allows us to write programs that consist of many processes. The “let it crash” philosophy applies to these individual processes. Your business process may start multiple BEAM processes to perform some complicated task, and I will show you some examples where doing so is justified.

You don’t need to do multiple things at the same time in order to start a BEAM process. The possibility of letting it crash, and the ease of recovering from that, may be a good enough incentive to spawn processes in the first place.

Let’s consider a system that allows users to specify the URL of their avatar. When they submit the form, the avatar is fetched from the external web site, resized, converted to PNG format, trimmed to the right aspect ratio, and then appears on the web site.

There are multiple points of failure in the above scenario. The external web site that hosts the user’s avatar may be temporarily unavailable or have a hiccup due to heavy server load. Or the virtual switches at the VPS provider where our app is deployed may be reconfigured at that very moment, and our connection gets interrupted. Possibly the web server returned a piece of HTML saying it is “offline for maintenance” instead of the intended JPEG file, and our “convert” command fails because it expects a valid image file and didn’t get one. Maybe the file on the external server is still being processed and we didn’t receive 100% of it? Maybe our NFS storage ran out of disk space and we can’t save the file to disk at the moment. I am sure there are more things that can, and given enough time will, go wrong with this simple scenario.

Now, this is a perfect example of where you want to employ the “let it crash” philosophy, no matter if you are on the BEAM or another platform. The BEAM just makes it simpler, and I’ll tell you why in a bit.
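To make the idea concrete, here is a minimal sketch of the avatar scenario written as happy-path code and run in its own monitored process. The module name AvatarSketch, the fetch/1 and convert/1 functions and the example URL are all hypothetical stand-ins for the real download and conversion steps, not code from any actual project.

```elixir
defmodule AvatarSketch do
  # Hypothetical stand-ins for the real download and conversion steps.
  def fetch("https://ok.example/avatar.jpg"), do: {:ok, <<0xFF, 0xD8, 0xFF>>}
  def fetch(_url), do: {:error, :unavailable}

  def convert(<<0xFF, 0xD8, _rest::binary>>), do: {:ok, "png-bytes"}
  def convert(_other), do: {:error, :not_an_image}

  # Happy path only: if a step returns anything other than {:ok, _},
  # the match fails and the calling process crashes.
  def process!(url) do
    {:ok, body} = fetch(url)
    {:ok, png} = convert(body)
    png
  end

  # Runs the happy path in a separate, monitored process, so a crash
  # reaches the caller as a message instead of taking the caller down.
  def run(url) do
    parent = self()

    {pid, ref} =
      spawn_monitor(fn -> send(parent, {:avatar, self(), process!(url)}) end)

    receive do
      {:avatar, ^pid, png} ->
        Process.demonitor(ref, [:flush])
        {:ok, png}

      {:DOWN, ^ref, :process, ^pid, reason} ->
        {:error, reason}
    end
  end
end
```

Calling AvatarSketch.run/1 with the example URL returns {:ok, "png-bytes"}, while any other URL makes the worker crash and the caller gets {:error, reason} back. The point of the design is that process!/1 contains no error handling at all; the decision about what to do with a failure lives in one place, outside the worker.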

An easy way to “handle” errors in the above scenario would be to just show an error to the user. But is it that simple? If the error happened after we had already opened a local file descriptor to write the file to, we need to remember to close it. If we opened a database transaction, it’s likely we want to close it too. Then we want to remove the temporary file if “convert” failed …and there are probably other tasks a programmer has to think about to clean up after themselves if an exception happens.

And what if users get annoyed by the high rate of errors, and your Product Owner decides to add a feature stating that we should retry downloading and/or converting the image if it failed along the way? Maybe you put it in a background job, which has some built-in retry mechanism. But you still have to think about some of the error handling in order to clean up after yourself, which is why programs on other platforms often can’t just “let it crash” and require a good amount of defensive programming to remain fault tolerant at a bigger scale. You don’t want your program to run out of memory, or to exceed the limit of database connections or open file descriptors. If you skip the cleanup, you are trading simplicity in part of your code for the possibility of the system crashing as a whole. Not a good deal IMHO. A total, catastrophic, cascading failure of the system can be caused by forgetting to close a file you opened for writing when a minor exception happened! You generally don’t want this to ever happen, I assume.

Here comes Erlang, destroyer of defensive programming

Erlang brought supervision trees and :gen_server to the table, along with the ability to start, link and monitor spawned processes. It also came with the ability to retry processes that failed to do their job, in the form of different supervisor restart strategies.

Using these building blocks of Erlang/OTP, you can fairly easily write programs that start sub-processes to execute some tasks, monitor their success or failure, retry, defer a retry, and react to permanent failure.
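As a rough illustration of these building blocks, the sketch below puts a deliberately flaky task under a supervisor and lets the supervisor do the retrying. The FlakyJob and RetrySketch modules are hypothetical, and the failure is simulated with a counter kept in an Agent so that the first two attempts crash and the third succeeds.

```elixir
defmodule FlakyJob do
  # restart: :transient means "restart only after an abnormal exit".
  use Task, restart: :transient

  def start_link({counter, notify}) do
    Task.start_link(__MODULE__, :run, [counter, notify])
  end

  # No error handling here: the first two attempts simply crash.
  def run(counter, notify) do
    attempt = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
    if attempt < 3, do: raise("simulated network hiccup on attempt #{attempt}")
    send(notify, {:fetched, attempt})
  end
end

defmodule RetrySketch do
  def demo do
    # The attempt counter lives outside the crashing process.
    {:ok, counter} = Agent.start_link(fn -> 0 end)

    {:ok, sup} =
      Supervisor.start_link(
        [{FlakyJob, {counter, self()}}],
        strategy: :one_for_one,
        max_restarts: 5,
        max_seconds: 5
      )

    receive do
      {:fetched, attempt} ->
        Supervisor.stop(sup)
        attempt
    after
      1_000 -> :timeout
    end
  end
end
```

RetrySketch.demo/0 returns 3: the supervisor restarted the crashed task twice, and the third run finished normally, so it was not restarted again. The max_restarts and max_seconds options bound how persistent the supervisor is. Once that limit is exceeded, the supervisor itself gives up and exits, which is how a permanent failure propagates up the supervision tree.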

This is all good, and it helps to build programs where you start some process, “let it crash” if it fails at any given point in time, and simply retry the failed task in a bit. Hooray, this simple strategy can go a very long way in my experience, for example when dealing with unreliable network resources. Try adding alias 'npm install'='(npm install || npm install)' for immediate improvement of your Node.js experience ;).

But in Erlang, you surely have to remember to clean up after yourself when you crash a process, right?

In many cases: yes, but in many more cases, such as open file descriptors, TCP sockets and linked processes your original process may have spawned, the answer is nope. Erlang will do the heavy lifting for you, since the BEAM ties its processes to the resources they consume. It is likely to close file descriptors for you, shut down open network connections and kill any of the processes your original process started. This allows you to simplify your code by not having to think about the many things that can go wrong, and to focus on implementing the “happy path” instead.
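The following sketch shows this cleanup in action, under a few assumptions: the CleanupSketch module is hypothetical, the file is opened in the default (non-raw) mode so that the file handle is itself a BEAM process tied to its owner, and the crash is simulated with exit/1.

```elixir
defmodule CleanupSketch do
  def demo do
    parent = self()

    path =
      Path.join(
        System.tmp_dir!(),
        "let_it_crash_#{System.unique_integer([:positive])}.txt"
      )

    # The owner opens a file, starts a linked helper, reports both to
    # the caller and then crashes without closing or stopping anything.
    spawn(fn ->
      {:ok, file} = File.open(path, [:write])
      helper = spawn_link(fn -> Process.sleep(:infinity) end)
      send(parent, {:resources, file, helper})
      exit(:simulated_failure)
    end)

    receive do
      {:resources, file, helper} ->
        result = %{file: await_down(file), helper: await_down(helper)}
        File.rm(path)
        result
    after
      1_000 -> :no_resources_reported
    end
  end

  # Waits until the given process is gone (or already was).
  defp await_down(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :down
    after
      1_000 -> :still_alive
    end
  end
end
```

CleanupSketch.demo/0 returns %{file: :down, helper: :down}: the file handle was closed and the linked helper was terminated, even though the crashing process never ran a single line of cleanup code.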

It’s not just the BEAM and Erlang’s or Elixir’s standard library that follow this principle. If you use a little library called “briefly” to conveniently create temporary files and directories on your hard disk, you don’t have to worry about removing these temporary files. You can just “let your process crash” and the library will take care of removing the temporary files from disk if this happens.

This is a very common mechanism in the Erlang/Elixir world, precisely because both OTP and the BEAM give us the means to implement these smart and automatic ways of reacting to unexpected software failures. Of course, you can always stumble upon a half-finished library that does not expect your processes to crash, but the established ones often expect user processes to crash at any given time, and perform the required cleanup themselves.

“Let it crash” is a way of enhancing software fault tolerance. It’s not about showing more error pages to the user, but about showing fewer.

Post by Hubert Łępicki

Hubert is a partner at AmberBit. Rails, Elixir and functional programming are his areas of expertise.