Guess Less with Erlang Doctor

Contents

1 Introduction
2 Tracing with Erlang Doctor
3 Getting Started
4 Basic usage
- 4.1 Trace collection
- 4.2 Trace analysis
5 Quick profiling
- 5.1 Call statistics
- 5.2 Call tree statistics
6 ETS table manipulation
7 Summary

BEAM languages, such as Erlang and Elixir, offer a powerful tracing mechanism, and Erlang Doctor is built on top of it. It stores function calls and messages in an ETS table, which lowers the impact on the traced system, and enables querying and analysis of the collected traces. Being simple, always available and easy to use, it encourages you to pragmatically investigate system logic rather than guess about the reason for its behaviour.
This blog post is based on a talk I presented at the FOSDEM 2024 conference.

Introduction

It is tough to figure out why a piece of code is failing, or how unknown software is working. When confronted with an error or other unusual system behaviour, we might search for the reason in the code, but it is often unclear what to look for, and tools like grep can give a large number of results. This means that there is some guessing involved, and the less you know the code, the less chance you have of guessing correctly. BEAM languages such as Erlang and Elixir include a tracing mechanism, which is a building block for tools like dbg, recon or redbug. They let you set up tracing for specific functions, capture the calls, and print them to the console or to a file. The diagram below shows the steps of such a typical tracing activity, which could be called ad-hoc logging, because it is like enabling logging for particular functions without the need for adding log statements to the code.

The first step is to choose the function (or other events) to trace, and it is the most difficult one, because usually, we don’t know where to start – for example, all we might know is that there is no response for a request. This means that the collected traces (usually in text format) often contain no relevant information, and the process needs to be repeated for a different function. A possible way of scaling this approach is to trace more functions at once, but this would result in two issues:

Traces are like logs, which means that it is very easy to get overwhelmed with the amount of data. It is possible to perform a text search, but any further processing would require data parsing.
The amount of data might become so large, that either structures like function arguments, return values and message contents become truncated, or the messages end up queuing up because of the I/O bottleneck.

The exact limit of this approach depends on the individual case, but usually, a rule of thumb is that you can trace one typical module, and collect up to a few thousand traces. This is not enough for many applications, e.g. if the traced behaviour is a flaky test – especially if it fails rarely, or if the impact of trace collection makes it irreproducible.

Tracing with Erlang Doctor

Erlang Doctor is yet another tool built on top of the Erlang tracer, but it has an important advantage – by storing the traces in an ETS table, it reduces the impact on the traced system (by eliminating costly I/O operations), while opening up the possibility of further processing and analysis of the collected traces.

Being no longer limited by the amount of produced text, it scales up to millions of collected traces, and the first limit you might hit is the system memory. Usually it is possible to trace all modules in an application (or even a few applications) at once, unless it is under heavy load. Thanks to the clear separation between data acquisition and analysis, this approach can be called ad-hoc instrumentation rather than logging. The whole process has to be repeated only in rare situations, e.g. if wrong application was traced. Of course tracing production nodes is always risky, and not recommended, unless very strict limits are set up in Erlang Doctor.

Getting Started

Erlang Doctor is available at https://github.com/chrzaszcz/erlang_doctor. For Elixir, there is https://github.com/chrzaszcz/ex_doctor, which is a minimal wrapper around Erlang Doctor. Both tools have Hex packages (erlang_doctor, ex_doctor). You have a few options for installation and running, depending on your use case:

If you want it in your Erlang/Elixir shell right now, use the “firefighting snippets” provided in the Hex or GitHub docs. Because Erlang Doctor is just one module (and ExDoctor is two), you can simply download, compile, load and start the tool with a one-liner.
For development, it is best to have it always at hand by initialising it in your ~/.erlang or ~/.iex.exs files. This way it will be available in all your interactive shells, e.g. rebar3 shell or iex -S mix.
For easy access in your release, you can include it as a dependency of your project.

Basic usage

The following examples are in Erlang, and you can run them yourself – just clone erlang_doctor, compile it, and execute rebar3 as test shell. Detailed examples for both Erlang and Elixir are provided in the Hex Docs (erlang_doctor, ex_doctor). The first step is to start the tool:

1> tr:start().
{ok,<0.86.0>}

There is also tr:start/1 with additional options. For example, tr:start(#{limit => 10000}) would stop tracing, when there are 10 000 traces in the ETS table, which provides a safety valve against memory consumption.

Trace collection

Having started the Erlang Doctor, we can now trace selected modules – here we are using a test suite from Erlang Doctor itself:

2> tr:trace([tr_SUITE]).
ok

The tr:trace/1 function accepts a list of modules or {Module, Function, Arity} tuples. Alternatively, you can provide a map of options to trace specific processes or to enable message tracing. You can also trace entire applications, e.g. tr:trace_app(your_app) or tr:trace_apps([app1, app2]).

Let’s trace the following function call. It calculates the factorial recursively, and sleeps for 1 ms before each step:

3> tr_SUITE:sleepy_factorial(3).
6

It’s a good practice to stop tracing as soon as you don’t need it anymore:

4> tr:stop_tracing().
ok

Trace analysis

The collected traces are accumulated in an ETS table (default name: trace). They are stored as tr records, and to display them, we need to load the record definitions:

5> rr(tr).
[node,tr]

If you don’t have many traces, you can just list all of them:

6> tr:select().
[#tr{index = 1, pid = <0.175.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
     data = [3], ts = 1559134178217371, info = no_info},
 #tr{index = 2, pid = <0.175.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
     data = [2], ts = 1559134178219102, info = no_info},
 #tr{index = 3, pid = <0.175.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
     data = [1], ts = 1559134178221192, info = no_info},
 #tr{index = 4, pid = <0.175.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
     data = [0], ts = 1559134178223107, info = no_info},
 #tr{index = 5, pid = <0.175.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
     data = 1, ts = 1559134178225146, info = no_info},
 #tr{index = 6, pid = <0.175.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
     data = 1, ts = 1559134178225153, info = no_info},
 #tr{index = 7, pid = <0.175.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
     data = 2, ts = 1559134178225155, info = no_info},
 #tr{index = 8, pid = <0.175.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
     data = 6, ts = 1559134178225156, info = no_info}]

The index field is auto-incremented, and data contains an argument list or a return value, while ts is a timestamp in microseconds. To select specific fields of matching records, use tr:select/1, providing a selector function, which is passed to ets:fun2ms/1.

7> tr:select(fun(#tr{event = call, data = [N]}) -> N end).
[3, 2, 1, 0]

You can use tr:select/2 to further filter the results by searching for a specific term in data. In this simple example we search for the number 2:

8> tr:select(fun(T) -> T end, 2).
[#tr{index = 2, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
     data = [2], ts = 1705475521744690, info = no_info},
 #tr{index = 7, pid = <0.395.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
     data = 2, ts = 1705475521750454, info = no_info}]

This is powerful, as it searches all nested tuples, lists and maps, allowing you to search for arbitrary terms. For example, even if your code outputs something like “Unknown error”, you can pinpoint the originating function call. There is a similar function tr:filter/1, which filters all traces with a predicate function (this time not limited by fun2ms). In combination with tr:contains_data/2, you can get the same result as above:

9> Traces = tr:filter(fun(T) -> tr:contains_data(2, T) end).
[#tr{index = 2, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
     data = [2], ts = 1705475521744690, info = no_info},
 #tr{index = 7, pid = <0.395.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
     data = 2, ts = 1705475521750454, info = no_info}]

There is also tr:filter/2, which can be used to search in a different table than the current one – or in a list. As an example, let’s get only function calls from Traces returned by the previous call:

10> tr:filter(fun(#tr{event = call}) -> true end, Traces).
[#tr{index = 2, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
     data = [2], ts = 1705475521744690, info = no_info}]

To find the tracebacks (stack traces) for matching traces, use tr:tracebacks/1:

11> tr:tracebacks(fun(#tr{data = 1}) -> true end).
[[#tr{index = 3, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
      data = [1], ts = 1705475521746470, info = no_info},
  #tr{index = 2, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
      data = [2], ts = 1705475521744690, info = no_info},
  #tr{index = 1, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
      data = [3], ts = 1705475521743239, info = no_info}]]

Note, that by specifying data = 1, we are only matching return traces, as call traces always have a list in data. Only one traceback is returned, starting with a call that returned 1. What follows is the stack trace for this call. There was a second matching traceback, but it wasn’t shown, because whenever two tracebacks overlap, the longer one is skipped. You can change this with tr:tracebacks/2, providing #{output => all}) as the second argument. There are more options available, allowing you to specify the queried table/list, the output format, and the maximum amount of data returned. If you only need one traceback, you can call tr:traceback/1 or tr:traceback/2. Additionally, it is possible to pass a tr record (or an index) directly to tr:traceback/1.

To get a list of traces between each matching call and the corresponding return, use tr:ranges/1:

12> tr:ranges(fun(#tr{data = [1]}) -> true end).
[[#tr{index = 3, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
      data = [1], ts = 1705475521746470, info = no_info},
  #tr{index = 4, pid = <0.395.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
      data = [0], ts = 1705475521748499, info = no_info},
  #tr{index = 5, pid = <0.395.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
      data = 1, ts = 1705475521750451, info = no_info},
  #tr{index = 6, pid = <0.395.0>, event = return, mfa = {tr_SUITE,sleepy_factorial,1},
      data = 1, ts = 1705475521750453, info = no_info}]]

There is also tr:ranges/2 with options, allowing to set the queried table/list, and to limit the depth of nested traces. In particular, you can use #{max_depth => 1} to get only the top-level call and the corresponding return. If you only need the first range, use tr:range/1 or tr:range/2.

Last but not least, you can get a particular trace record with tr:lookup/1, and replay a particular function call with tr:do/1:

13> T = tr:lookup(1).
#tr{index = 1, pid = <0.175.0>, event = call, mfa = {tr_SUITE,sleepy_factorial,1},
    data = [3], ts = 1559134178217371, info = no_info}
14> tr:do(T).
6

This is useful e.g. for checking if a bug has been fixed without running the whole test suite, or to reproduce an issue while capturing further traces. This function can be called with an index as the argument: tr:do(1).

Quick profiling

Although there are dedicated profiling tools for Erlang, such as fprof and eprof, you can use Erlang Doctor to get a hint about possible bottlenecks and redundancies in your system with function call statistics. One of the advantages is that you already have the traces collected from your system, so you don’t need to trace again. Furthermore, tracing only specific modules gives you much simpler output, that you can easily read and process in your Erlang shell.

Call statistics

To get statistics of function call times, you can use tr:call_stat/1, providing a function that returns a key by which the traces will be aggregated. The simplest use case is to get the total number of calls and their time. To do this, we group all calls under one key, e.g. total:

15> tr:call_stat(fun(_) -> total end).
#{total => {4,7216,7216}}

The tuple {4,7216,7216} means that there were four calls in total with an accumulated time of 7216 microseconds, and the “own” time was also 7216 μs – this is the case because we have aggregated all traced functions. To see different values, let’s group the stats by the function argument:

16> tr:call_stat(fun(#tr{data = [N]}) -> N end).
#{0 => {1,1952,1952}, 1 => {1,3983,2031}, 2 => {1,5764,1781}, 3 => {1,7216,1452}}

Now it is apparent that although sleepy_factorial(3) took 7216 μs, only 1452 μs were spent in the function itself, and the remaining 5764 μs were spent in the nested calls. To filter out unwanted function calls, just add a guard:

17> tr:call_stat(fun(#tr{data = [N]}) when N < 3 -> N end).
#{0 => {1,1952,1952}, 1 => {1,3983,2031}, 2 => {1,5764,1781}}

There are additional utilities: tr:sorted_call_stat/1 and tr:print_sorted_call_stat/2, which gives you different output formats.

Call tree statistics

If your code is performing the same operations very often, it might be possible to optimise it. To detect such redundancies, you can use tr:top_call_trees/0, which detects complete call trees that repeat several times, where corresponding function calls and returns have the same arguments and return values, respectively. As an example, let’s trace a call to a function which calculates the 4th element of the Fibonacci sequence recursively. The trace table should be empty, so let’s clean it up first:

18> tr:clean().
ok
19> tr:trace([tr_SUITE]).
ok
20> tr_SUITE:fib(4).
3
21> tr:stop_tracing().
ok

Now it is possible to print the most time-consuming call trees that repeat at least twice:

22> tr:top_call_trees().
[{13, 2, #node{module = tr_SUITE,function = fib, args = [2],
               children = [#node{module = tr_SUITE, function = fib, args = [1],
                                 children = [], result = {return,1}},
                           #node{module = tr_SUITE, function = fib, args = [0],
                                 children = [], result = {return,0}}],
               result = {return,1}}},
 {5, 3, #node{module = tr_SUITE,function = fib, args = [1],
              children = [], result = {return,1}}}]

The resulting list contains tuples {Time, Count, Tree} where Time is the accumulated time (in microseconds) spent in the Tree, and Count is the number of times the tree repeated. The list is sorted by Time, descending. In the example, fib(2) was called twice, which already shows that the recursive implementation is suboptimal. You can see the two repeating subtrees in the call tree diagram:

The second listed tree consists only of fib(1), and it was called three times. There is also tr:top_call_trees/1 with options, allowing customisation of the output format – you can set the minimum number of repetitions, maximum number of presented trees etc.

ETS table manipulation

To get the current table name, use tr:tab/0:

23> tr:tab().
trace

To switch to a new table, use tr:set_tab/1. The table need not exist.

24> tr:set_tab(tmp).
ok

Now you can collect traces to the new table without changing the original one. You can dump the current table to file with tr:dump/1 – let’s dump the tmp table:

25> tr:dump("tmp.ets").
ok

In a new Erlang session, you can load the data with tr:load/1. This will set the current table name to tmp. Finally, you can remove all traces from the ETS table with tr:clean/0. To stop Erlang Doctor, just call tr:stop/0.

Summary

Now you have an additional utility in your Erlang/Elixir toolbox, which you can try out whenever you need to debug an issue or learn about unknown or unexpected system behaviour. Just remember to be extremely cautious when using it in a production environment. If you have any feedback, please provide it on GitHub, and if you like the tool, consider giving it a star.

Pawel Chrzaszcz

Paweł started his journey with Erlang back in 2007. After graduating from the AGH University of Science and Technology in Kraków, Poland, he worked as an Erlang developer at Klarna AB, Stockholm, Sweden, and was teaching computer science at a university for seven years. He has a PhD in Computer Science, and his dissertation was about automatic creation of semantic dictionaries. Since 2012 he has been working at Erlang Solutions, and for the last few years he has been a Software Architect and Team Lead of the MongooseIM project. Paweł enjoys mountain biking and sometimes even wins a race.