Erlang & Elixir DevOps From The Trenches - Why we felt the need to formalize operational experience with the BEAM virtual machine
by Francesco Cesarini
Let’s backtrack to the late 90s, when I was working on the AXD301 switch, one of Erlang’s early flagship products. A first line support engineer broke procedure and tried to fix a bug in a live system. They compiled the Erlang module, put it in the patches directory and loaded it in one of the nodes. The patch did not solve the problem, so they deleted the BEAM file but were unaware they had to load the old version again and purge the patched module. So the switch was still running the wrong version of the module. The issue was eventually escalated to third line support, where just to figure out the node was running a different version of the code than originally thought ended up taking a colleague of mine forty hours. All this time was wasted before they could start troubleshooting the issue itself.
Year after year, we came across similar incidents, so I started asking myself how we could formalize these experiences in reusable code. Our aim was to ensure no one would ever have to spend 40 hours figuring out that a first line engineer had not followed procedure. At the same time, I wanted to make sure no one had to reinvent the wheel every time they started on a new project. This is how the idea for WombatOAM was born, a standalone Erlang node that acts as a generic operations and maintenance node (O&M for short) for Erlang clusters. Any system with requirements on high availability should follow a similar pattern and approach. Why not formalize it in reusable code?
Much of what we do here at Erlang Solutions, is developing, supporting and maintaining systems. Every incident we ever experienced which could have been avoided by analyzing symptoms and taking action has been formalized. Take the AXD301 example. WombatOAM will generate a unique module identifier using the md5 digest of the source code in every beam file, omitting information such as compilation date and attributes which do not affect the execution. If two nodes running the same release have different md5 digests of the same module, we raise an alarm that alerts an operator. If a module is loaded or purged in the system, we log it. If something gets typed into the shell, we log it as well. So not only are we alerted that nodes running the same release have different versions of a module, we also have the audit trail which lead to that state for post mortem debugging purposes.
Every system that never stops needs mechanisms for collecting and suppressing alarms, monitoring logs and generating and analyzing metrics. It is one or more subsystem that collects functionality used for monitoring, pre-emptive support, support automation and post-mortem debugging. Applications such as exometer, folsom, elarm, eper, lager and recon will help, but only so far. In the telecom world, this functionality is put in a standalone node to minimize the impact on live traffic, both in terms of throughput and downtime. If the O&M node crashes or is taken offline, the system will still switch calls. This is the operations and maintenance node approach we believe should be adopted by other verticals, as high availability is today relevant to most server side systems. Let’s look at some stories from behind the trenches, and see how a proper O&M system would have reduced downtime and saved $$ in the form of person months of troubleshooting efforts and reduced hardware requirements.
I’ve rarely seen the concept of alarms being properly used outside of telecoms. In some cases, threshold based alarms are applied, but that is where it often stops. A threshold based alarm is when you gather a metric (such as memory consumption or requests per second) and raise an alarm if the node on which it is gathered on reaches a certain upper or lower bound. But the potential of alarms goes beyond monitoring thresholds in collected metrics. The concept is easy; if something that should not be happening is happening, an alarm is raised. When issues, maybe on their own accord, through automation (scripts triggered by the alarm) or human intervention revert back to normal, the alarm is cleared. Your database or network connectivity goes down? Raise an alarm and alert the operator as soon as the system detects it. Latency hits the limits of your SLA? Raise an alarm in your system the second it happens, not when the metrics are collected. Or a process message queue (among millions of processes) is growing faster than the process is able to consume the messages? Once again, raise the alarm. If the network or database link comes back up, latency becomes acceptable or the message queue is consumed, the active alarm is cleared.
Processes with long message queues are usually a warning of issues about to happen. They are easy to monitor, but are you doing it? We had a node crashing and restarting over a three-month period at a customer site. Some refactoring meant they were not handling the EXIT signal from the ports we were using to parse the XML. Yaws recycled processes, so every process ended up having a few thousand EXIT messages from previous requests that had to be traversed before the new request could be handled. About once a day, the node ran out of memory and was restarted by hand. The mailboxes were cleared. Our customers complained that at times, the system was slow. We blamed it on them using Windows NT, as we were not measuring latency. We occasionally saw the availability drop from 100% to 99.999% as a result of the external probes running their request right during the crash or when the node was restarting. This was rarely caught, as external probes sent a request a minute that took half a second to process, whilst the node took 3 seconds to restart. So we blamed the glitch on operations messing with firewall configurations. With triple redundancy, it was only when operations happened to notice that one of the machines was running at 100% CPU that we got called in. Many requests going through the system, we thought, but that count was only 10 requests per second. Had we monitored the message queues, we would have picked this the issue immediately. Had we had notifications on nodes crashing, we would have picked the problem up after the event, and had we been monitoring memory usage, we would have seen the cause, leading us to look at the message queue metrics.
Or what about checking for file sanity? There was a software upgrade that required changes of the Erlang application Environment variables. The DevOps team upgraded the sys.config file, copying and pasting from a word document, invisible control characters included. Thank you Microsoft! Months later, a power outage caused the nodes to be rebooted. But because of the corrupt sys.config file, the invisible control characters would not parse and the node would crash in the startup phase. Looking at the sys.config file did not make us any wiser, as the control characters were not visible. It took half a day to figure this one out. We now regularly check and parse all boot, app and config files (as well as any dependencies). If one of them got corrupted or changed manually and might prevent the node from restarting, we will raise an alarm.
I once did an ets:tab2list/1 call in the shell of a live system, and used the data to solve the problem. A few months later, when a similar issue arose, I instructed a first line support engineer to do the same, forgetting that the shell stores the result of its calls. We coped with two copies of the ets table (which happened to store subscriber data for millions of users) but the third caused the system to run out of memory (oh, had we only been monitoring memory utilization!). Today, WombatOAM monitors the actual memory utilisation of the shell process and raises an alarm should it become unproportionately large.
I could go on with war stories for every other alarm we have, describing how some of our customers, not having that visibility, caused unnecessary outages and call-outs. Every alarm we implemented in WombatOAM has a story behind it. Either an outage, a crash, a call from operations, or hours of time wasted looking for a needle in a haystack as we did not have the necessary visibility and did not know where to look. We have 20 alarms in place right now, including checking system limits (many of which are configurable, but you need to be aware you risk reaching them), sanity checks (such as corrupt files, clashing module versions, multiple versions) and the unusual shell history size alarm. I trust you get the point. Oh, and if you are using Nagios, Pagerduty, or other alarming and notification services, WombatOAM allows you to push alarms.
Metrics are there to create visibility, help troubleshoot issues, prevent failure and for post mortem debugging. How many users out there with Erlang and Elixir in production are monitoring BEAM’s memory usage, and more specifically, how much memory is being allocated and used by processes, modules, the atom table, ets tables, the binary heap and the code server? How do you know a leak in the atom table is causing the node to run out of memory? Monitor its size. Or what if the cause is long message queues? You should see the used and allocated process memory increase, which leads you to the sum of all message queues. These metrics allow you to implement threshold-based alarms, alerting the DevOps team the node has utilized 75% and 90% of its memory. It also allows you to figure out where the memory went after the node crashed and was restarted.
What about the time the distributed Erlang port hung and requests queued up, causing a net split effect despite all network tests being successful and not raising an alarm? The Distributed Erlang port busy counter would have quickly pointed us to the issue. It is incremented every time a process tries to send data using distributed Erlang, but the port is busy with another job. We hooked up WombatOAM and realised we were getting three million port busy notifications per day! Distributed Erlang was not built to carry the peak load of this particular system. Our customer migrated from distributed Erlang to gen_rpc, and everything worked smoothly ever after.
We once spent three months soak testing a system that, contractually, had to run for 24 hours handling 15,000 operations per second sustained. Each operation consisted on average of four http requests, seven ets reads, three ets writes and about ten log entries to file, alongside all of the business logic. The system was running at an average of 40% CPU with about half a gig of memory left over on each node. After a random number of hours, nodes would crash without any warning, having run out of memory. None of the memory graphs we had showed any leaks. We were refreshing the data at 10-second intervals, showing about 400mb of memory available in the last poll right before the crash. We suspected memory leaks in the VM, looked for runaway non-tail recursive functions, reviewed all the code and ended up wasting a month before discovering that seconds prior to the crash, a very rapid increase of processes with unusually high memory spikes and high activity of garbage collection. With this visibility, provided by the system trace, we narrowed this down to a particular operation which, when run on its own, caused a little spike in the memory usage graph. But when many of these operations randomly happened at the same time, the result was a monster wave that caused the VM to run out of memory. This particular issue took up to 20 hours to reproduce. It kept two people busy for a month trying to figure out what happened. When we finally knew what was going on, it took two weeks to fix it. Wombat today increments counters for processes spending too long garbage collecting, processes with unusually high memory spikes, or NIFs or BIFs hogging the scheduler, blocking other processes from executing and affecting the soft real-time properties of the system. If you suspect there is an issue, you can enable notifications and log the calls causing the issues themselves.
Already using folsom so you do not need WombatOAM I hear? Think again. Out of the box, folsom will give you a dozen metrics. Use folsom or exometer for your business metrics, measuring application specific values such as failed and successful requests, throughput and latency. And let Wombat gather over a hundred VM specific metrics, with a hundred more depending on which applications you are running in your stack. It will take anyone a few days (or hours if you are a hero programmer) to implement a few instrumentation functions to collect metrics from Cowboy, RabbitMQ, Riak Core or Mnesia. It will however take weeks (sometimes months) to figure out what metrics you actually need and optimize the calls to reduce CPU overheads, ensuring there is no impact on the throughput of the system you are monitoring. When looking for a needle in a haystack, after a crash, you never know what metric you are going to need until after the event. If we’ve missed it, be sure we’re going to add it to WombatOAM as soon as one of our customers points it out. So now, you are not only getting the benefit of our experience, but also that of all other customers.
Notifications are log entries recording a state change. They help with troubleshooting and post mortem debugging. How many times did you have to investigate a fault not knowing if any configuration changes had been made? In the AXD301 example, we would have logged all modules loaded in the nodes, and any commands executed in the Erlang Shell.
In our first release of WombatOAM, we used to store information on all system notifications, including the distributed Erlang Port Busy notifications and processes with unusually high memory spikes and log garbage collection pauses. Until a customer started complaining that they were running out of disk space. It was the very same customer who was getting three million port busy messages a day. System notifications are today turned off by default, but as soon as you detect you have a problem, you can turn them on and figure out which process has unusually high memory spikes or what function call is sending data to the distributed Erlang port. High volumes of logs, especially business logs, need to be monitored for size. As every system is different, you will have to configure and fine tune what is pushed.
We once had (well, we still do, they are now running WombatOAM) a customer with a several hundred nodes in production, running over thirty different services. The only way for them to notice a node had been restarted was to log on to the server and look for the crash dump file. The only way to know if a process had crashed was to log on to the machine, connect to the shell and start the report browser and search for crash reports. Minor problem. This was an enterprise customer, and developers were not given access to the production machines. Wombat will automatically collect node crash notifications (and rename the crash dump file), as well as pull in error, warning and crash notifications. They are all in one place, so you can browse and search all node restarts and crash reports (and twenty other standard notifications, alongside application specific ones). We handle system logs out of the box, but if you are using logging applications such as the SASL Logger, Lager and the Elixir Logger, you can pull in your business logs and use them to prove your innocence or admit guilt from one location. And if that one location is not WombatOAM, WombatOAM can push them to LogStash, Splunk, DataDog, Graylog, Zabbix or any other log aggregation service you might be using.
But it is only Erlang Code!
Yes, someone who was trialling WombatOAM actually did say this to us, and suggested they implement everything themselves! You can do all of WombatOAM yourself, and add items after outages or as you get more experience of running beam clusters. Figure out what metrics, alarms and notifications to generate, and add them for SASL, Lager, Elixir Logger, Folsom, Exometer, Mnesia, OSmon, Cowboy, Phoenix, Poolboy, Riak KV, Riak core, Yokozuna, Riak Multi-Datacenter Replication, Ecto, RabbitMQ, MongooseIM (all this at the time of writing, more are being added each month). After all, as they said, it is just code. When done, you can spend your time testing and optimizing your instrumentation functions, chasing BIFs which cause lock contention and optimize your nodes so as to reduce memory utilization and CPU overhead. And when you are done with that, you can start working on the tooling side, adding the ability to check and update application environment variables. Run etop like tools or execute expressions from the browser. Or implement northbound interfaces towards other O&M tools and SAAS providers. Or implement the infrastructure to create your own plugins, optimize connectivity between WombatOAM and the managed nodes, optimize your code to handle hundreds of nodes in a single WombatOAM node, or put in place an architecture and test that it will scale to 20,000 nodes. Honestly? If you want to reinvent the wheel, don’t bother. Just get WombatOAM or come work for us instead! (Shameless plug: The WombatOAM team is recruiting).
Let’s wrap up with a story from our own teams who pretty much caught the essence of Wombat in an area I had not thought of originally. You know a tool is successful when others start using it in ways you had never thought of. Our original focus was operations and DevOps teams, but developers and testers have since used WombatOAM for visibility and quick turnaround of bugs.
Our MongooseIM team had to stress test and optimize a chat server running on four nodes, with about 1.5 million simultaneously connected users on four machines. As soon as they started generating load, they got the long message queue alarm on one of the nodes followed by the ets table system limit alarm.
Based on this, they figured out Mnesia is the bottleneck, and more specifically, the login operation which uses transactions (hence, the ETS tables). Investigating, they discovered from the graphs that one of the nodes carries twice the load of the other three. They look at the schema and see that tables are replicated on three of the four nodes.
How did the above misconfiguration happen? They looked at the notifications and discovered a logged shell command where the table schema was created using nodes() instead of [node()|nodes()], missing out creating a table on the node they had run the operation on.
They found and solved the problem in 30 minutes, only to have to jump on to the next one, which was an unusually high memory spike occurring during a particular operation. Followed by issues with the versioning of the MongoDB driver (Don’t ask!). Similar issues, when they did not have the same level of visibility, could have taken days to resolve.
It has long been known that using Erlang, you achieve reliability and availability. But there seems to be some common myth that it all happens magically out of the box. You use Erlang, therefore your system will never fail. Sadly, there is no such thing as a free lunch. If you do not want to kiss your five nines goodbye, the hard work starts when designing your system. Monitoring, pre-emptive support, support automation and post mortem debugging are not things you can with easily just bolt on later. By monitoring, we mean having full visibility into what is going on. With pre-emptive support, we mean the ability to react to this visibility and prevent failure. Support automation allows you to react to external events, reducing service disruption resolving problems before they escalate. And post-mortem debugging is all about quickly and efficiently being able detect what caused the failure without having to stare at a screen hoping the anomaly or crash experienced in the middle of the night happens again whilst you are watching. Whilst five nines will not happen magically, Erlang takes care of a lot of accidental difficulties allowing you to achieve high levels of availability at a fraction of the effort compared to other programming languages. WombatOAM will take care of the rest.
I think the above is the beginning of the end of a blog post, and if you are still reading, trying WombatOAM out is probably worth more than another demo or walk-through. You never know when you are going to need the information and visibility until it is too late, and unless you are troubleshooting issues, the more you see the better.
If you want to read more about systems which never stop, I recommend chapter 16 in my book Designing for Scalability with Erlang/OTP, which covers monitoring and pre-emptive support. You can find it on Safari, BitTorrent or buy it on Amazon or the O’Reilly site. If you use the O’Reilly site, discount code authd gives you 50% off the digital copy and 40% off the printed one. And of course, there is also Erlang in Anger, by Fred Hebert.Go back to the blog