Performance testing the JIT compiler for the BEAM VM

In early September, our colleague Lukas Larsson announced the pull request for BEAMAsm, a just-in-time compiler for the Erlang Virtual Machine. Lukas worked on this feature together with the OTP core team at Ericsson. The JIT compiler is a huge addition for the BEAM community. Two of its most impressive features are the speed improvements it delivers, with the JIT offering anywhere from a 30% to 130% increase in the number of iterations per second, and its perf integration, which allows developers to profile where the bottlenecks are in production.

In this blog, we’ll take you through how we conducted the benchmarking and performance testing of the JIT using a RabbitMQ deployment. Find out more about our RabbitMQ services here. If you’d like to see the talk from Code BEAM V, watch the video below.


RabbitMQ benchmarks

Like any performance comparison, the first thing we need to do is benchmark RabbitMQ. To do this, we build a docker image that contains Erlang/OTP with both the JIT and the traditional interpreter, so we can switch between the two when starting the container. Below is what such a Dockerfile could look like.

FROM docker.pkg.github.com/erlang/otp/ubuntu-base

RUN apt-get update && apt-get install -y rabbitmq-server git linux-tools-generic

## Enable us to connect from outside the docker container without auth
RUN echo "loopback_users = none" >> /etc/rabbitmq/rabbitmq.conf

ENV MAKEFLAGS=-j4 \
        ERLC_USE_SERVER=yes \
        ERL_TOP=/buildroot/otp \
        PATH=/otp/bin:$PATH

WORKDIR /buildroot

## Download Erlang/OTP with JIT support
RUN git clone https://github.com/erlang/otp && cd otp && \
        git checkout 4d9f947ea71b05186d25ee346952df47d8339da6

WORKDIR /buildroot/otp/

## Build Erlang/OTP with JIT support
RUN ./otp_build setup -a --prefix=/otp/ && make install

## Build Erlang/OTP without JIT support
RUN make install FLAVOR=emu

USER rabbitmq

CMD "rabbitmq-server"

Then we build the image:

docker build -t rabbit .

Then we can start RabbitMQ with the JIT:

docker run -d -e ERL_FLAGS="" -p 5672:5672 -p 15672:15672 rabbit

We can also start it with the interpreter:

docker run -d -e ERL_FLAGS="-emu_flavor emu" -p 5672:5672 -p 15672:15672 rabbit
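
The only difference between the two is the ERL_FLAGS environment variable; both flavors live in the same image. If you want to double-check which flavor a container actually ended up with, one option (a small aside, not part of the original benchmark) is to ask the runtime itself. The emu_flavor key of erlang:system_info/1 is available in OTP builds that include the JIT; on older releases the call simply fails with badarg.

%% Open an Erlang shell inside the container, e.g.:
%%   docker exec -it <container-id> erl
%% and ask which emulator flavor is running:
1> erlang:system_info(emu_flavor).
jit    %% returns 'emu' when started with -emu_flavor emu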

RabbitMQ PerfTest

We then use RabbitMQ PerfTest to measure the difference in the number of messages per second:

Single Queue Performance

> docker run -d -e ERL_FLAGS="-emu_flavor emu" -p 5672:5672 -p 15672:15672 rabbit
476fba6ad56c5d8b34ceac9336b035737c021dee788f3f6c0d21b9309d67373e
> bin/runjava com.rabbitmq.perf.PerfTest --producers 1 --consumers 1 --queue q1 --flag mandatory --qos 300 --confirm 100
id: test-104559-735, starting consumer #0
id: test-104559-735, starting consumer #0, channel #0
id: test-104559-735, starting producer #0
id: test-104559-735, starting producer #0, channel #0
id: test-104559-735, time: 1,000s, sent: 13548 msg/s, returned: 0 msg/s, confirmed: 13449 msg/s, nacked: 0 msg/s, received: 13433 msg/s, min/median/75th/95th/99th consumer latency: 404/5648/7770/11609/14864 µs, confirm latency: 241/3707/5544/9435/11787 µs
id: test-104559-735, time: 2,000s, sent: 20558 msg/s, returned: 0 msg/s, confirmed: 20558 msg/s, nacked: 0 msg/s, received: 20569 msg/s, min/median/75th/95th/99th consumer latency: 919/4629/6245/9456/16564 µs, confirm latency: 440/3881/4493/6971/9642 µs
id: test-104559-735, time: 3,000s, sent: 26523 msg/s, returned: 0 msg/s, confirmed: 26530 msg/s, nacked: 0 msg/s, received: 26526 msg/s, min/median/75th/95th/99th consumer latency: 1274/3689/3970/4524/5830 µs, confirm latency: 495/3617/3842/4260/4754 µs
id: test-104559-735, time: 4,000s, sent: 25835 msg/s, returned: 0 msg/s, confirmed: 25827 msg/s, nacked: 0 msg/s, received: 25834 msg/s, min/median/75th/95th/99th consumer latency: 1946/3830/4124/4451/5119 µs, confirm latency: 1852/3760/4051/4415/5275 µs
id: test-104559-735, time: 5,000s, sent: 25658 msg/s, returned: 0 msg/s, confirmed: 25658 msg/s, nacked: 0 msg/s, received: 25659 msg/s, min/median/75th/95th/99th consumer latency: 556/3840/4110/4646/7325 µs, confirm latency: 1611/3727/4020/4474/5996 µs
....
id: test-104559-735, time: 15,000s, sent: 25180 msg/s, returned: 0 msg/s, confirmed: 25177 msg/s, nacked: 0 msg/s, received: 25182 msg/s, min/median/75th/95th/99th consumer latency: 843/3933/4152/4543/5517 µs, confirm latency: 863/3898/4110/4495/9506 µs
^Ctest stopped (Producer thread interrupted)
id: test-104559-735, sending rate avg: 24354 msg/s
id: test-104559-735, receiving rate avg: 24352 msg/s
> docker stop 476fba6ad56c5d8b34ceac9336b035737c021dee788f3f6c0d21b9309d67373e

The above results show single-queue performance under heavy load.

We see that the average sending and receiving rates are both around 24k msg/s. If we run the same test with the JIT compiler, we get the following results:

> docker run -d -e ERL_FLAGS="" -p 5672:5672 -p 15672:15672 rabbit
993b90ab29f662b45ad4cab7d750a367ac9e5d47381812cad138e4ec2f64b2a3
> bin/runjava com.rabbitmq.perf.PerfTest -x 1 -y 1 -u q1 -f mandatory -q 300 -c 100
...
id: test-105204-749, time: 15,000s, sent: 39112 msg/s, returned: 0 msg/s, confirmed: 39114 msg/s, nacked: 0 msg/s, received: 39114 msg/s, min/median/75th/95th/99th consumer latency: 1069/2578/2759/3101/4240 µs, confirm latency: 544/2453/2682/2948/3555 µs
^Ctest stopped (Producer thread interrupted)
id: test-105204-749, sending rate avg: 36620 msg/s
id: test-105204-749, receiving rate avg: 36612 msg/s
> docker stop 993b90ab29f662b45ad4cab7d750a367ac9e5d47381812cad138e4ec2f64b2a3

That is roughly 36k msg/s: a 45% increase in the number of messages per second.

Multi-Queue Performance

With a 45% increase in single-queue performance, it’s time to test how the JIT compares on multi-queue performance. Below are the results for the Erlang interpreter.

bin/runjava com.rabbitmq.perf.PerfTest -x 150 -y 300 -f mandatory -q 300 -c 100 --queue-pattern 'perf-test-%d' --queue-pattern-from 1 --queue-pattern-to 100
...
id: test-105930-048, time: 60,396s, sent: 39305 msg/s, returned: 0 msg/s, confirmed: 39293 msg/s, nacked: 0 msg/s, received: 39241 msg/s, min/median/75th/95th/99th consumer latency: 44909/352133/503781/742036/813501 µs, confirm latency: 36217/352381/491656/714740/821029 µs
id: test-105930-048, sending rate avg: 37489 msg/s
id: test-105930-048, receiving rate avg: 37242 msg/s

Here you can see 37,489 messages sent per second and 37,242 received.

Now let’s test that benchmark against the JIT:

bin/runjava com.rabbitmq.perf.PerfTest -x 150 -y 300 -f mandatory -q 300 -c 100 --queue-pattern 'perf-test-%d' --queue-pattern-from 1 --queue-pattern-to 100
...
id: test-105554-348, time: 60,344s, sent: 50905 msg/s, returned: 0 msg/s, confirmed: 50808 msg/s, nacked: 0 msg/s, received: 50618 msg/s, min/median/75th/95th/99th consumer latency: 32479/250098/359481/565181/701871 µs, confirm latency: 16959/257612/371392/581236/720861 µs
^Ctest stopped (Producer thread interrupted)
id: test-105554-348, sending rate avg: 47626 msg/s
id: test-105554-348, receiving rate avg: 47390 msg/s

Again we see a significant boost in performance; this time around 30% more messages are sent and received.

Profiling

We can use perf to profile what is happening under the hood in RabbitMQ. First, we need to grant some extra permissions to run the perf tool. There are many ways to do that; the simplest (though not the most secure) is to add --cap-add SYS_ADMIN when starting the container. I’ve also added +S 1, because that makes the perf output a little easier to reason about, and +JPperf true to be able to disassemble the JIT:ed code.

> docker run -d --cap-add SYS_ADMIN -e ERL_FLAGS="+S 1 +JPperf true" -p 5672:5672 -p 15672:15672 rabbit
514ea5d380837328717

Then we can run perf in an exec session to the container:

> docker exec -it 514ea5d380837328717 bash
> perf record -k mono --call-graph lbr -o /tmp/perf.data -p $(pgrep beam) -- sleep 10
[ perf record: Woken up 57 times to write data ]
[ perf record: Captured and wrote 15.023 MB /tmp/perf.data (54985 samples) ]
> perf inject --jit -i /tmp/perf.data -o /tmp/perf.data.jit
> perf report -i /tmp/perf.data.jit

If you run the above while the RabbitMQ PerfTest is running, the report will look something like the image below:

The default report from perf sorts the output according to the accumulated run-time of a function and its children. From the above, we can see that most of the time seems to be spent doing bif calls, closely followed by gen_server2:loop and gen_server2:handle_msg.

It makes sense for gen_server2 to be at the top of accumulated time, as that is where most of RabbitMQ’s code is called from, but why would the accumulated time for bif be so large? The answer lies in the use of lbr (last branch records) as the call stack sampling method. Since the lbr buffer is limited, it does not always contain the full call stack of a process, so when calling bifs that have a lot of branches, perf loses track of the calling process and thus cannot attribute the time to the correct parent frame. Using lbr has a lot of drawbacks and you have to be careful when analyzing its findings; however, it is better than having no call stack information at all.

Using the information from the lbr, we can expand the call_light_bif function and see which bifs we are calling and how long each call takes:

Here we can see that a lot of the accumulated time is spent in erlang:port_command, ets:select and erlang:port_control.

So the RabbitMQ application is busy doing TCP communication and selecting from an ETS table. If we wanted to optimize RabbitMQ for this benchmark, a good start would be to see whether we could eliminate some of those calls.

Another approach to viewing the perf report is to sort by the time spent in each function without its children. You can do that by calling perf as follows:

> perf report --no-children -i /tmp/perf.data.jit

The functions that we spend the most time in here are do_minor (the part of the emulator that does garbage collection), make_internal_hash (the hash function for terms used internally, which does not have to be backwards compatible) and eq (the function used to compare terms). If we expand make_internal_hash and eq we see something interesting:


The top user of both functions is erlang:put/2, i.e. storing a value in the process dictionary. So it would seem that RabbitMQ is doing a lot of process dictionary updates. After some digging, I found that these calls come from rabbit_reader:process_frame/3 and credit_flow:send/2, which seem to deal with flow control within RabbitMQ. A possible optimization here would be to use the fact that literal keys to erlang:put/2 are pre-hashed when the module is loaded. That is, instead of using a dynamic key like this:

erlang:put({channel, Channel}, Val)

you use a literal like this:

erlang:put(channel, Val)

and the emulator will not have to calculate the hash every time we put a value in the process dictionary. This is most likely not possible in this specific case for RabbitMQ, but it’s worth keeping in mind for further investigations.
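
To make the pre-hashing point concrete, here is a minimal, hypothetical sketch; the pd_bench module below is ours for illustration only and is not RabbitMQ code, and how big the difference is (if it is measurable at all) will depend on the OTP version and the workload.

%% Illustrative only: not part of RabbitMQ or the original benchmark.
-module(pd_bench).
-export([run/1]).

%% Compare N process-dictionary writes using a tuple key built at runtime
%% against N writes using a literal atom key.
run(N) ->
    {DynUs, ok} = timer:tc(fun() -> dynamic(N, 1) end),
    {LitUs, ok} = timer:tc(fun() -> literal(N) end),
    io:format("dynamic key: ~p us, literal key: ~p us~n", [DynUs, LitUs]).

%% The {channel, Ch} tuple is constructed, and therefore hashed, on every call.
dynamic(0, _Ch) -> ok;
dynamic(N, Ch) ->
    erlang:put({channel, Ch}, N),
    dynamic(N - 1, Ch).

%% The atom key is a literal, so its hash can be pre-computed at load time.
literal(0) -> ok;
literal(N) ->
    erlang:put(channel, N),
    literal(N - 1).

Compiling this and calling pd_bench:run(1000000) from a shell gives a rough feel for the cost of rebuilding and rehashing the key on every update.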

We’re excited about the JIT compiler and its use cases. If you’ve put it to use and can share some insight, get in touch with us, as we would be happy to co-organise a webinar on the JIT compiler! If you’d like more help understanding how the JIT compiler can improve the performance of your RabbitMQ platform, Phoenix application, or any other Erlang/Elixir system, get in touch; we’re always happy to help!
