Alchemy 101: Part 3 - Fault Tolerance doesn't come out of the box

by Thomas Hutchinson

One of the biggest selling points of Elixir is the means it gives you to write fault-tolerant applications through its concurrency model. Processes can broadcast their failure to dependent processes, which can then take appropriate action. You decide how processes should respond to failure based on your use case; there is no single solution. Here is an example in which not handling failure led to, you guessed it, more failure.
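The "broadcasting" of failure mentioned above is built on links and exit trapping. Here is a minimal sketch using only the standard library (no RabbitMQ involved). The module `LinkDemo` and its functions are hypothetical names for illustration. A worker linked to a process exits with `:boom`. If that process traps exits, it receives the failure as an ordinary message. If it does not, it dies along with the worker.

```elixir
defmodule LinkDemo do
  # Starts a worker linked to the current process; the worker exits with :boom.
  # `trap` decides whether the current process traps exit signals.
  def linked_crash(trap) do
    Process.flag(:trap_exit, trap)
    worker = spawn_link(fn -> exit(:boom) end)

    receive do
      # Only reachable when trapping: the exit signal arrives as a message.
      {:EXIT, ^worker, reason} -> {:notified, reason}
    after
      1_000 -> :timeout
    end
  end

  # Runs linked_crash/1 in a separate, monitored process so the caller is never
  # linked to the crashing worker. The result is passed back through the exit
  # reason, which the :DOWN message carries to the caller.
  def observe(trap) do
    {pid, ref} = spawn_monitor(fn -> exit({:done, linked_crash(trap)}) end)

    receive do
      {:DOWN, ^ref, :process, ^pid, {:done, result}} -> result
      {:DOWN, ^ref, :process, ^pid, reason} -> {:crashed, reason}
    end
  end
end
```

`LinkDemo.observe(true)` should return `{:notified, :boom}`, while `LinkDemo.observe(false)` should return `{:crashed, :boom}`. In the second case, the intermediate process died because of its link.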

First install and start RabbitMQ.

Then create a new project.

mix new rabbit --module Rabbit
cd rabbit

Then add the amqp dependency to mix.exs.

defp deps do
  [{:amqp, "~> 0.1.5"}]
end

Now add the following to lib/rabbit.ex.

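defmodule Rabbit do
  use GenServer
  require Logger

  def start_link do
    GenServer.start_link(__MODULE__, nil)
  end

  # Open a connection and a channel to the local broker;
  # the channel becomes the GenServer's state.
  def init(_) do
    {:ok, connection} = AMQP.Connection.open("amqp://localhost")
    {:ok, channel} = AMQP.Channel.open(connection)
    {:ok, channel}
  end

  # Fire-and-forget: returns :ok immediately without waiting for the publish.
  def publish(pid, payload) do
    GenServer.cast(pid, {:publish, payload})
  end

  # Publish to the default exchange ("") with an empty routing key.
  def handle_cast({:publish, payload}, channel) do
    AMQP.Basic.publish(channel, "", "", payload)
    Logger.info("Published #{payload}")
    {:noreply, channel}
  end
end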

That’s all you need. Fetch the dependencies and test Rabbit:

mix deps.get
iex -S mix
iex(1)> {:ok, pid} = Rabbit.start_link()
{:ok, #PID<0.139.0>}
iex(2)> for i <- 1..3, do: Rabbit.publish(pid, "message #{i}")
19:21:31.471 [info]  Published message 1
19:21:31.471 [info]  Published message 2
19:21:31.471 [info]  Published message 3

Open the default exchange, listed as "(AMQP default)" on the Exchanges tab of the RabbitMQ management UI. You should see some activity on the Message rates chart. Keep this IEx shell open; you will need it again soon.

Time to introduce a problem. Restart RabbitMQ (the command below assumes it was installed with Homebrew).

brew services restart rabbitmq

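Go back to the IEx shell and notice the OTP error report. It looks serious; the key part is below. The socket to RabbitMQ closed, causing a process to crash. Note that the crashed process (#PID<0.142.0>) is not Rabbit itself (#PID<0.139.0>). It is a process belonging to the AMQP client.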

19:22:57.994 [error] GenServer #PID<0.142.0> terminating
** (stop) :socket_closed_unexpectedly
Last message: :socket_closed

Yet it still appears that you can publish messages. Try it.

iex(3)> for i <- 1..3, do: Rabbit.publish(pid, "message #{i}")
19:21:31.471 [info]  Published message 1
19:21:31.471 [info]  Published message 2
19:21:31.471 [info]  Published message 3

However, if you check the default exchange, no new messages appear to be coming in. Why?

First of all, we don’t see any failure in the Rabbit process. Nothing links Rabbit to the crashed process, so it keeps running. Also, AMQP.Basic.publish/4 ultimately calls :gen_server.cast/2 (in amqp_channel.erl from the amqp_client library) with channel.pid. :gen_server.cast/2 always returns :ok, even if the process behind the PID (here channel.pid) no longer exists, so the failure is silent. Now imagine this happening in an application running in the background, where it would be even harder to spot.
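To see the fire-and-forget behaviour in isolation, here is a small sketch that needs no RabbitMQ. It uses Elixir's GenServer.cast/2, which, like :gen_server.cast/2, never reports a missing process. The module name `CastDemo` is hypothetical. For contrast, a synchronous GenServer.call/2 to the same dead PID exits with `:noproc`.

```elixir
defmodule CastDemo do
  # Returns the PID of a process that has already terminated.
  def dead_pid do
    {pid, ref} = spawn_monitor(fn -> :ok end)

    # Wait for :DOWN so we know the process is really gone.
    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> pid
    end
  end

  def demo do
    pid = dead_pid()

    # cast is fire-and-forget: it returns :ok whether or not pid is alive.
    cast_result = GenServer.cast(pid, {:publish, "message 1"})

    # call waits for a reply, so it exits when pid is dead; catch that exit.
    call_result =
      try do
        GenServer.call(pid, :ping)
      catch
        :exit, {reason, _call_info} -> {:exit, reason}
      end

    {cast_result, call_result}
  end
end
```

`CastDemo.demo()` should return `{:ok, {:exit, :noproc}}`. The cast "succeeds" even though nobody received the message, which is exactly why the lost publishes above went unnoticed.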

Here comes the good part: handling the failure. We want Rabbit to receive a message when the socket to RabbitMQ closes. To do this, we need to link to the connection process we opened, so that we receive an exit signal if it stops. We also need to trap exits, i.e. turn exit signals into ordinary {:EXIT, pid, reason} messages instead of crashing. To do this, update init/1 and add a handle_info/2 clause in lib/rabbit.ex:

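  def init(_) do
    {:ok, connection} = AMQP.Connection.open("amqp://localhost")
    # Turn exit signals from linked processes into {:EXIT, pid, reason} messages.
    Process.flag(:trap_exit, true)
    {:ok, channel} = AMQP.Channel.open(connection)
    # Link to the connection process so that its crash is reported to us.
    Process.link(connection.pid)
    {:ok, channel}
  end

  def handle_info({:EXIT, from, :socket_closed_unexpectedly}, channel) do
    Logger.warn("Received :EXIT from #{inspect(from)} for :socket_closed_unexpectedly")
    # handle_info/2 must return {:stop, reason, state} to stop the GenServer.
    {:stop, :lost_rabbitmq_connection, channel}
  end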
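If you want to try the link-and-trap pattern without a broker, here is a minimal sketch under stated assumptions. `ConnectionWatcher` is a hypothetical stand-in for Rabbit. An idle process plays the role of connection.pid and is killed with Process.exit/2 using the same `:socket_closed_unexpectedly` reason. The stop uses the three-element `{:stop, reason, state}` return that handle_info/2 requires.

```elixir
defmodule ConnectionWatcher do
  use GenServer

  # Hypothetical stand-in for Rabbit: instead of an AMQP connection, it links
  # to the PID passed in, which plays the role of connection.pid.
  def start(fake_connection) do
    GenServer.start(__MODULE__, fake_connection)
  end

  @impl true
  def init(fake_connection) do
    # Trap exits before linking, so the link cannot take this process down.
    Process.flag(:trap_exit, true)
    Process.link(fake_connection)
    {:ok, fake_connection}
  end

  @impl true
  def handle_info({:EXIT, conn, :socket_closed_unexpectedly}, conn) do
    # Stop with a descriptive reason; the 3-tuple form is required.
    {:stop, :lost_rabbitmq_connection, conn}
  end

  # Simulates the broker dropping the socket; returns the watcher's exit reason.
  def demo do
    # An idle process standing in for the AMQP connection process.
    fake_connection = spawn(fn -> Process.sleep(:infinity) end)
    {:ok, watcher} = start(fake_connection)
    ref = Process.monitor(watcher)

    # Kill the stand-in with the same reason the AMQP client reported.
    Process.exit(fake_connection, :socket_closed_unexpectedly)

    receive do
      {:DOWN, ^ref, :process, ^watcher, reason} -> reason
    after
      1_000 -> :still_running
    end
  end
end
```

`ConnectionWatcher.demo()` should return `:lost_rabbitmq_connection`. Expect Logger to print a "GenServer ... terminating" error report as well, because the stop reason is not `:normal`. The watcher is started with GenServer.start/2 rather than start_link, so its stop does not take the caller down with it.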

Quit the IEx shell and start it again with iex -S mix. Start Rabbit with {:ok, pid} = Rabbit.start_link(), and then restart RabbitMQ once more:

brew services restart rabbitmq

Jump to the IEx shell and observe the output.

19:27:33.497 [warn]  Received :EXIT from #PID<0.142.0> for :socket_closed_unexpectedly

Perfect: now, when the socket closes, the dependent process, Rabbit, is informed. Next comes responding to the failure, and that decision is yours. Neither I nor library and framework developers can tell you the right response for your application. This is also why Elixir (and Erlang) has such a good reputation for building fault-tolerant systems: you decide how to respond to failure, and there is no silver bullet. For this demo, I have chosen to simply stop the Rabbit GenServer when it receives an :EXIT for :socket_closed_unexpectedly.
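Stopping is one possible response. A common next step is to let a supervisor restart the stopped process so that it reconnects. Here is a sketch of that idea, not taken from the article. `StandInPublisher` and `RestartDemo` are hypothetical; the stand-in receives a made-up `:connection_lost` message in place of the real :EXIT. To be listed as a child this way, the article's Rabbit would need a one-argument start_link, because the child spec generated by `use GenServer` calls start_link/1.

```elixir
defmodule StandInPublisher do
  use GenServer

  # Hypothetical stand-in for Rabbit, registered under its module name so the
  # restarted process can be found again.
  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(nil), do: {:ok, nil}

  # Simulates the article's :EXIT handler by stopping with a non-normal reason.
  @impl true
  def handle_info(:connection_lost, state), do: {:stop, :lost_rabbitmq_connection, state}
end

defmodule RestartDemo do
  # Starts the stand-in under a supervisor, stops it once, and returns the PIDs
  # before and after the supervisor restarts it.
  def demo do
    {:ok, sup} = Supervisor.start_link([StandInPublisher], strategy: :one_for_one)
    old_pid = Process.whereis(StandInPublisher)

    send(old_pid, :connection_lost)
    new_pid = wait_for_restart(old_pid, 50)

    Supervisor.stop(sup)
    {old_pid, new_pid}
  end

  # Polls the name registry until a different process holds the name, or gives up.
  defp wait_for_restart(old_pid, attempts) do
    case Process.whereis(StandInPublisher) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ when attempts > 0 ->
        Process.sleep(10)
        wait_for_restart(old_pid, attempts - 1)

      _ ->
        nil
    end
  end
end
```

With RabbitMQ, note a caveat. If the broker is still down when the supervisor restarts Rabbit, the `{:ok, connection} = ...` match in init/1 fails, and the process crashes again. After more than the default 3 restarts within 5 seconds, the supervisor itself gives up. Whether to retry, back off, or fail loudly is again your decision.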

I hope you enjoyed reading; please give me your feedback.

Tags: Elixir