A guide to tracing in Elixir

Greetings! I’m Gabor, an Erlang programmer. I like tracing, and I use it whenever I need it. When I started working with Elixir, I asked the obvious question, how do I do tracing in Elixir-land? The Elixir community have improved the user experience for so many things (mix vs pre-rebar3, macros vs parse-transforms, etc, etc, ad-infinitum). Naturally, I was looking forward to using all the great tools for tracing in Elixir. Although there are some good tools, namely Rexbug and Tracer, I was a bit surprised to see that there is very little talk about tracing. I couldn’t find all the great blog posts, tutorials explaining the battle stories, and the wonders of tracing. There are some, but this powerful feature of the BEAM virtual machine seems to be somewhat underrepresented in the Elixir community.

In this blog post, I will discuss “what is tracing”, and “when is it appropriate to use tracing?” and provide a high-level overview of the available tracing tools for the BEAM.

What is tracing?

Tracing is defined by Wikipedia as the following: “tracing involves a specialised use of logging to record information about a program’s execution.” This definition is not only useful but also provides an idea of where tracing fits in the spectrum of logging and debugging. Logging is an invaluable tool to see what is happening in our system, it can give insights to errors and bugs. But this narrow definition has a big flaw, the places where we choose to emit log messages are hard-coded into our program logic.

Choosing where to insert log messages is entirely subjective, dependent upon assumptions and prior experience. Some of the time the logging statement is inserted into a useful location, but if not, you will have to recompile your code and insert extra log statements, this can be a very tedious activity (and is the norm in the vast majority of computer languages). In the pathological scenario, gratuitous use of log statements can even become a significant drain on system IO (process 500 records, generate 50000 lines of logging statements).

Logging can be useful in order to explain what the system is doing, but is less helpful for understanding system behaviour in an unexpected failure state.

At the other end of the spectrum is debugging with the traditional step-by-step debugger such as the GNU debugger (gdb). This requires no modification of your code, we observe the program as it executes and the data as it evolves one step at a time. This is a fine way to observe how an algorithm works or how a data structure changes. But it is very problematic in a distributed system because many operations are handled by transient processes with short timeouts, taking a single example, gen_server has a call timeout of 5000 milliseconds, and there are very few Elixir or Erlang systems which are not making significant use of OTP to implement system functionality.

The key here is observing the system while it is working without stopping the world. Does that sound impossible?

Here comes tracing to the rescue. Tracing is a window through which we can observe the machinations of the system. Starting with top-level function calls and message passing with configurable stack traces, right the way through to monitoring the behaviour of the scheduler, port interaction, and garbage collection. In this regard, tracing is very close to a step-by-step debugger, but it turns the debugging steps into a stream of log messages which we can analyse at our leisure in order to determine what was happening in the system at a given point in time.

Myths

If you have good monitoring, then you don’t need tracing.

Let us assume that diligent monitoring is already in place. Tracing is by no means a substitute for logging. But monitoring has the same problems as logging. If you know beforehand what are the parts to monitor for every possible event, then you will probably never need tracing. But for any sufficiently complex application, you cannot monitor everything. So there will be many parts of the system, which are not measured. This is where tracing comes very handy.

It is not safe to trace in production.

Doing anything with production carries a certain amount of risk. This is also true for tracing. You should be careful, as it has a minimal amount of overhead. You need to understand what tracing does and how it affects certain parts of the system. Probably you want to test the tracing activity in a test environment first, but saying it is not safe is not true. With due diligence, you get a lot of ad-hoc logging for free by leveraging the capabilities of the BEAM. Elixir’s success is partly because it successfully leveraged the concurrency and distribution features of the BEAM. This is another free meal once you learn to utilise it.

You should change the architecture so that you don’t need tracing.

If you have the power to do so, you have probably found the right solution. But there are many systems where upgrading the system isn’t straightforward. Just think about a system that needs to be certified before it can be put into production. “Being fast and breaking things” is not allowed everywhere. These are usually the environments where observation tools like tracing are most utilised. But you don’t need to be in a highly regulated environment to enjoy the benefits. Starting a trace takes less time than deploying a change, it can save time when it is needed most… when things are on fire.

I use OpenTracing, I’m covered.

OpenTracing is a really good way to keep track of requests in your whole product including front-end, back-end and infrastructure. But it requires you to instrument your code. It is good to track the business specific part of the code, but again, it has the same problem of logging. It is not ad-hoc. You can’t just easily add extra deeper layers, or see the system internals of the BEAM. If you use sampling, then capturing the event you are looking for is not even guaranteed. Unfortunately, it shares the same name, but the use-case is not exactly the same.

How to use tracing

Let us outline a strategy that makes tracing a non-painful experience regardless of whether it is a development, test or production system.

1. Form a theory

Using tracing without a clear plan is a recipe for disaster. So first and foremost, you need an idea what are you tracing for and what output you expect. To have that you need to read the code you are about to trace and form some understanding of how that code interacts in the system.

Do you expect the function to be called only a few times? Do you expect the function arguments to be large? Do you expect the process to exist, or will it be created after starting tracing? Do you expect the function not to crash? Do you expect the process not to be scheduled out between two function calls? Do you expect the garbage collection to be quick?

These are some questions that you can answer by using tracing, but if your expectations are far off from the truth, then the performance characteristics of tracing can cause some headaches. If the function is called many times, instead of every once in a while, you need an automatic mechanism that stops the tracing before it causes any problems, or trace into a file with wrapping in place. Or you expect a function to return properly, but it crashes and doesn’t generate a return trace entry. In this case, you need to capture the crash by setting the right trace flags. These are considerations you should make before starting the trace, it might be your only chance for a rare case to collect evidence.

You can also use tracing to understand the code better. It might not seem to be intuitive, but in this case, I still recommend you try to form and prove a theory of system behaviour. Don’t trace for everything, but one thing at a time.

Remember, Elixir (and Erlang) applications are built with resiliency in mind. Some errors are expected and not necessarily reported by the application. Catching these cases might be more difficult than anticipated.

2. Choose the right tool

There are many tracing tools out there. Yes, even for Elixir. The higher level tools will cover 99% of your use case, but you can always defer to the lower level tools that might be a bit more difficult to control.

The tracing facilities are built into the VM, so Elixir developers have a harder time to translate their code to the one the BEAM understands. This is a price paid for the nice and efficient syntax, but in reality, it is not that far off.

A good starting point is rexbug which is a wrapper for Redbug. It not only gives a convenient interface to use with Elixir (it translates the Elixir syntax to Erlang, and calls Redbug), but it also provides a lot of safeguards to protect us from making a big mistake and prevents the trace messages overloading the node. It is also possible to gather the trace entries from more than one node if we have a distributed application. Imagine a cluster where a certain function is executed on a random node as a means of load balancing. You can log into one node, start the tracer there and see what is happening regardless of which node executes it.

Although the matching syntax of rexbug is convenient and easy to use, sometimes it is necessary to fine tune the tracer for better performance or safer operation. In this case, dbg comes to the rescue. It has no built-in safeguards but has the ability to use all the mighty powers of match specifications to restrict what can be traced. You can trace messages sent or received with optional filters, you can also specify which specific processes you want to trace. You can also choose an output file or a network port as a destination for the trace entries. Some of this functionality is exposed to rexbug, but not all. The number of problems you can solve with this tool is so vast that you hardly ever need to use the built-in trace BIFs from the erlang module.

3. Analyse the result

It is possible to customise the output of the tracer tools to ease the analysis phase. This is the time when you verify your initial assumptions and find evidence in the trace logs. You can inspect the function arguments, return values, scheduling logic, garbage collection cycles, the interaction between processes and port drivers.

Sometimes it is a good idea to visualise the events by post-processing the output by e.g. python and matplotlib. Sometimes the frequency of events is the clue you are looking for.

If you are searching for a needle in the haystack, then it is a good idea to trace to a file (or more files with wrapping) and log even gigabytes of data. Later you can use a custom tool to filter through all that noise. You can create your own trace analysing tool by writing a trace client that reads from a file. You can also do this filtering during tracing, by writing a tracer module (in Erlang or Elixir) and use that as an on-the-fly filtering mechanism.

If you are into graphical interfaces, then erlyberly is another option. Written in Java, it provides a GUI which you can use for tracing. erlyberly can connect to a remote Erlang node or cluster and direct the output messages to the GUI. It is quite convenient if you want your trace commands to survive node restarts.

Lastly, if you have a commercial BEAM-focused monitoring system like WombatOAM deployed, you can execute the tracing commands in WombatOAM so you don’t even need to manually connect to your node.

Learn more

You can learn about tracing and debugging systems built on the BEAM from the excellent book “Erlang in Anger” by Fred Hebert. I recommend checking out the tools I mentioned in this post.