gen_statem Unveiled

gen_statem and protocols

This blog post is a deep dive into some of the concepts discussed in my recent conference talk at FOSDEM. The presentation explored basic theoretical concepts of Finite State Machines, and some special powers of Erlang’s gen_statem in the context of protocols and event-driven development. Building upon that insight, this post delves into harnessing the capabilities of the gen_statem behaviour. Let’s jump straight into it!

Protocols

The word protocol comes from the Greek “πρωτόκολλον”, from πρῶτος (prôtos, “first”) + κόλλα (kólla, “glue”), used in Byzantine Greek for the first sheet of a papyrus roll, bearing the official authentication and date of manufacture of the papyrus. Over time, the word describing the first page became a synecdoche for the entire document.

The word protocol was then used primarily to refer to diplomatic or political treaties, until the field of Information Technology overloaded it to describe “treaties” between machines which, as in diplomacy, govern the manner of communication between two entities. As the entities communicate, a given entity receives messages describing the interactions its peers are establishing with it, creating a model where an entity reacts to events.

In this field, much of the job of a programmer is implementing such communication protocols, which react to events. The protocol defines the valid messages, their valid order, and any side effects an event might have. You know many such protocols: TCP, TLS, HTTP, or XMPP, just to name some good old classics.

The event queue

As a BEAM programmer, implementing such an event-driven program is an archetypal paradigm you’re well familiar with: you have a process with a mailbox, and the process reacts to those messages one by one. It is the actor model in a nutshell: in response to a message it receives, an actor can:

  • send a finite number of messages to other Actors;
  • create a finite number of new Actors;
  • designate the behaviour to be used for the next message it receives.

It is ubiquitous to implement such actors as a gen_server, but pay attention to the last point: designate the behaviour to be used for the next message it receives. When a given event (a message) carries information about how the next event should be processed, there is implicitly a transformation of the process state. What you have is a State Machine in disguise.

Finite State Machines

Finite State Machines (FSM for short) are defined by a transition function 𝛿 from an input state and an input event to an output state, where the function can be applied again. This is the idea of the actor receiving a message and designating the behaviour for the next one: it chooses the state that will be input together with the next event.

FSMs can also define output; in that case they are called Finite State Transducers (FST for short, though often still called FSMs). Their definition adds another alphabet of output symbols, and the function 𝛿 that defines the machine returns the next state together with the output symbol for the current input.

gen_statem

When the function’s input is the current state and an input symbol, and the output is a new state and a new output symbol, we have a Mealy machine. And when the output alphabet of one machine is the input alphabet of another, we can then intuitively compose them. This is the pattern that gen_statem implements.
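To make the composition concrete, here is a toy model of two composed Mealy machines. It is a sketch in Python rather than Erlang, purely to keep the focus on the maths, and every name in it is hypothetical: each machine is a function from (state, input) to (next state, output), and composition feeds the first machine’s output alphabet into the second’s input alphabet.

```python
def run(delta, state, inputs):
    """Fold a sequence of input symbols through delta, collecting outputs."""
    outputs = []
    for symbol in inputs:
        state, out = delta(state, symbol)
        outputs.append(out)
    return state, outputs

def switch(state, event):
    """A light switch: moves to the requested state and emits it."""
    if event in ("on", "off"):
        return event, event
    return state, state  # ignore anything else, re-emit the current state

def lamp(state, event):
    """A lamp driver whose input alphabet is the switch's output alphabet."""
    power = {"on": "powered", "off": "dark"}[event]
    return power, power

def composed(states, event):
    """Compose the two machines: the switch's output feeds the lamp."""
    s1, s2 = states
    s1, out1 = switch(s1, event)
    s2, out2 = lamp(s2, out1)
    return (s1, s2), out2

final, outs = run(composed, ("off", "dark"), ["on", "on", "off"])
# final == ("off", "dark"), outs == ["powered", "powered", "dark"]
```

The composed machine is itself a Mealy machine, which is why the pattern nests so naturally.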

gen_statem has three important features that are easily overlooked, taking the best of pure Erlang programming and state machine modelling: it can simulate selective receives, offers an extended mailbox, and allows for complex data structures as the FSM state.

Selective receives

Imagine the archetypical example of an FSM: a light switch. Say the switch is digital and translates requests to a fancy light-set over an analogue cable protocol. The code you’ll need to implement will look something like the following:

handle_call(on, _From, {off, Light}) ->
    on = request(on, Light),
    {reply, on, {on, Light}};
handle_call(off, _From, {on, Light}) ->
    off = request(off, Light),
    {reply, off, {off, Light}};
handle_call(on, _From, {on, Light}) ->
    {reply, on, {on, Light}};
handle_call(off, _From, {off, Light}) ->
    {reply, off, {off, Light}}.

But now imagine the light request were asynchronous; your code would then look like the following:

handle_call(on, From, {off, undefined, Light}) ->
    Ref = request(on, Light),
    {noreply, {off, {on, Ref, From}, Light}};
handle_call(off, From, {on, undefined, Light}) ->
    Ref = request(off, Light),
    {noreply, {on, {off, Ref, From}, Light}};

handle_call(off, _From, {on, {off, _, _}, _Light} = State) ->
    {reply, turning_off, State};  %% ???
handle_call(on, _From, {off, {on, _, _}, _Light} = State) ->
    {reply, turning_on, State}; %% ???
handle_call(off, _From, {off, {on, _, _}, _Light} = State) ->
    {reply, turning_on_wait, State};  %% ???
handle_call(on, _From, {on, {off, _, _}, _Light} = State) ->
    {reply, turning_off_wait, State}; %% ???

handle_info(Ref, {_State, {Request, Ref, From}, Light}) ->
    gen_server:reply(From, Request),
    {noreply, {Request, undefined, Light}}.

The problem is that now the order of events is not defined: reorderings of the user requesting a switch and the light system announcing that it finalised a request are possible, so you need to handle these cases. When the switch and the light system had only two states each, you had to design and write four new cases: the number of new cases grows as the product of the number of cases on each side. And each case is a computation of the previous cases, effectively creating a user-level call stack.

So we now try migrating the code to a properly explicit state machine, as follows:

off({call, From}, off, {undefined, _Light}) ->
    {keep_state_and_data, [{reply, From, off}]};
off({call, From}, on, {undefined, Light}) ->
    Ref = request(on, Light),
    {keep_state, {{Ref, From}, Light}, []};
off({call, _From}, _, _) ->
    {keep_state_and_data, [postpone]};
off(info, {Ref, Response}, {{Ref, From}, Light}) ->
    {next_state, Response, {undefined, Light}, [{reply, From, Response}]}.

on({call, From}, on, {undefined, _Light}) ->
    {keep_state_and_data, [{reply, From, on}]};
on({call, From}, off, {undefined, Light}) ->
    Ref = request(off, Light),
    {keep_state, {{Ref, From}, Light}, []};
on({call, _From}, _, _) ->
    {keep_state_and_data, [postpone]};
on(info, {Ref, Response}, {{Ref, From}, Light}) ->
    {next_state, Response, {undefined, Light}, [{reply, From, Response}]}.

Now the key lies in postponing events: this is akin to Erlang’s selective receive clauses, where the mailbox is explored until a matching message is found. Events that arrive out of order can this way be handled when the order is right.

This is an important difference between how we learn to program in pure Erlang and how we learn to program in OTP. Pure Erlang gives us the power of selective receives, where we choose which message to handle; generic behaviours like gen_server force us to always handle the first message, albeit in different clauses depending on the semantics of the message (handle_cast, handle_call and handle_info). With the power to postpone a message, we effectively choose which message to handle without being constrained by the code location.

This section is really inspired by Ulf Wiger’s fantastic talk, Death by Accidental Complexity. So if you’ve faced the challenge he explained, this section hopefully serves as a solution for you.

Complex Data Structures

Much of this was explained in the previous blog post on state machines. By using gen_statem’s handle_event_function callback, apart from all the advantages explained in the aforementioned post, we can also reduce the implementation of 𝛿 to a single function called handle_event, which lets the previous code take advantage of a lot of code reuse. See the following equivalent state machine:

handle_event({call, From}, State, State, {undefined, _Light}) ->
    {keep_state_and_data, [{reply, From, State}]};
handle_event({call, From}, Request, _State, {undefined, Light}) ->
    Ref = request(Request, Light),
    {keep_state, {{Ref, From}, Light}, []};
handle_event({call, _}, _, _, _) ->
    {keep_state_and_data, [postpone]};
handle_event(info, {Ref, Response}, _State, {{Ref, From}, Light}) ->
    {next_state, Response, {undefined, Light}, [{reply, From, Response}]}.

This section was extensively described in the previous blog post, so to learn more about it, please enjoy your read!

An extended mailbox

We saw that the FSM’s function 𝛿 is called when a new event is triggered. When implementing a protocol, this is modelled by messages to the actor’s mailbox. In a pure FSM, a message that has no meaning within a state would crash the process; in practice, however, while the order of messages is not defined, it might be a valid computation to postpone them and process them once we reach the right state.

This is what a selective receive would do, by exploring the mailbox and looking for the right message to handle for the current state. In OTP, the general practice is to leave the lower-level communication abstractions to the underlying language features, and code in a higher and more sequential style as defined by the generic behaviours: in gen_statem, we have an extended view of the FSM’s event queue.

There are two more things we can do with gen_statem actions: one is to insert ad-hoc events with the construct {next_event, EventType, EventContent}, and the other is to insert timeouts, which can be restarted automatically on any new event, on any state change, or not at all. These may seem like different event queues for our eventful state machine, alongside the process’s mailbox, but really there is only one queue, which we can see as an extended mailbox.

The mental picture is as follows: there is only one event queue, which is an extension of the process mailbox, and this queue has three pointers:

  • A head pointing at the oldest event;
  • A current pointing at the next event to be processed;
  • A tail pointing at the youngest event.

This model is meant to be practically identical to how the process mailbox is perceived.

  • postpone causes the current position to move to the next younger event, so the previously current event is still in the queue, reachable from head.
  • Not postponing an event, i.e. consuming it, causes the event to be removed from the queue and the current position to move to the next younger event.
  • NewState =/= State causes the current position to be set to head, i.e. the oldest event.
  • next_event inserts event(s) at the current position, i.e. as just older than the previously current event.
  • {timeout, 0, Msg} inserts a timeout, Msg event after tail, i.e. as the new youngest received event.
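As a mental model, these pointer rules can be simulated with a toy queue. This is a Python sketch with hypothetical names, not gen_statem’s real implementation, which lives inside the OTP engine: one list, a head at index 0, a tail at the last element, and a movable current index.

```python
class EventQueue:
    """Toy model of gen_statem's extended mailbox: one queue, three
    pointers (head = index 0, tail = last element, current = an index)."""

    def __init__(self, events):
        self.queue = list(events)  # head is queue[0], tail is queue[-1]
        self.current = 0           # index of the next event to be processed

    def consume(self):
        """Handle the current event: drop it from the queue; current now
        indexes the next younger event."""
        return self.queue.pop(self.current)

    def postpone(self):
        """Keep the event in the queue, reachable from head, and move
        current to the next younger event."""
        self.current += 1

    def next_event(self, event):
        """Insert an event at the current position, just older than the
        previously current event; it will be processed next."""
        self.queue.insert(self.current, event)

    def state_change(self):
        """NewState =/= State: retry all postponed events from head."""
        self.current = 0

# Replaying the picture walkthrough below: events 1..4 in the mailbox.
q = EventQueue([1, 2, 3, 4])
q.postpone()                     # 1 postponed, current -> 2
q.consume(); q.state_change()    # 2 handled with next_state: retry from head
q.consume(); q.next_event("A")   # 1 handled, ad-hoc event A inserted
q.postpone()                     # A postponed, current -> 3
q.consume()                      # 3 handled normally, current -> 4
q.postpone(); q.next_event("B")  # 4 postponed, B inserted at current
```

After this sequence the queue is ["A", 4, "B"], with head at A and current at B, matching the final picture of the walkthrough.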

Let’s see the event queue in pictures:

handle_event(Type1, Content1, State1, Data) ->
    {keep_state_and_data, [postpone]};

When the first event to process is 1, after any necessary logic we might decide to postpone it. In that case, the event remains in the queue, reachable from HEAD, but Current is moved to the next event in the queue, event 2.

handle_event(Type1, Content1, State1, Data) ->
    {keep_state_and_data, [postpone]};
...
handle_event(Type2, Content2, State1, Data) ->
    {next_state, State2, Data};

When handling event 2, after any necessary logic, we decide to transition to a new state. In this case, 2 is removed from the queue, as it has been processed, and Current is moved to HEAD, which points again to 1, as the state is now a new one.

handle_event(Type1, Content1, State1, Data) ->
    {keep_state_and_data, [postpone]};
...
handle_event(Type2, Content2, State1, Data) ->
    {next_state, State2, Data};
...
handle_event(Type1, Content1, State2, Data) ->
    {keep_state_and_data, [{next_event, TypeA, ContentA}]};

After any necessary handling for 1, we now decide to insert a next_event called A. Then 1 is dropped from the queue, and A is inserted at the point where Current was pointing. HEAD is also updated to the next event after 1, which in this case is now A.

handle_event(Type1, Content1, State1, Data) ->
    {keep_state_and_data, [postpone]};
...
handle_event(Type2, Content2, State1, Data) ->
    {next_state, State2, Data};
...
handle_event(Type1, Content1, State2, Data) ->
    {keep_state_and_data, [{next_event, TypeA, ContentA}]};
...
handle_event(TypeA, ContentA, State2, Data) ->
    {keep_state_and_data, [postpone]};

Now we decide to postpone A, so Current is moved to the next event in the queue, 3.

handle_event(Type1, Content1, State1, Data) ->
    {keep_state_and_data, [postpone]};
...
handle_event(Type2, Content2, State1, Data) ->
    {next_state, State2, Data};
...
handle_event(Type1, Content1, State2, Data) ->
    {keep_state_and_data, [{next_event, TypeA, ContentA}]};
...
handle_event(TypeA, ContentA, State2, Data) ->
    {keep_state_and_data, [postpone]};
...
handle_event(Type3, Content3, State2, Data) ->
    keep_state_and_data;

3 is processed normally and then dropped from the queue. No other event is inserted or postponed, so Current simply moves to the next event, 4.

handle_event(Type1, Content1, State1, Data) ->
    {keep_state_and_data, [postpone]};
...
handle_event(Type2, Content2, State1, Data) ->
    {next_state, State2, Data};
...
handle_event(Type1, Content1, State2, Data) ->
    {keep_state_and_data, [{next_event, TypeA, ContentA}]};
...
handle_event(TypeA, ContentA, State2, Data) ->
    {keep_state_and_data, [postpone]};
...
handle_event(Type3, Content3, State2, Data) ->
    keep_state_and_data;
...
handle_event(Type4, Content4, State2, Data) ->
    {keep_state_and_data,
     [postpone, {next_event, TypeB, ContentB}]};

And 4 is now postponed and a new event B is inserted, so while HEAD still points at A, 4 is kept in the queue and Current now points to the newly inserted event B.

This section is in turn inspired by this comment on GitHub.

Conclusions

We’ve seen how protocols are governances over the manner of communication between two entities, and that these governances define how messages are transmitted and processed, and how they relate to each other. We’ve seen how the third clause of the actor model dictates that an actor can designate the behaviour to be used for the next message it receives, how this essentially defines the 𝛿 function of a state machine, and that Erlang’s gen_statem behaviour is an FSM engine with a lot of power over the event queue and the state data structures.

Do you have protocol implementations that have suffered from extensibility problems? Have you had to handle an exploding number of cases when events might be reordered in any possible way? If you’ve suffered death by accidental complexity, or your code hides state machines in disguise, or your testing isn’t comprehensive enough by default, the tricks and points of view in this post should help you get started, and we can always help you keep moving forward!
