WombatOAM, the powerful operations, monitoring, and maintenance platform, has a new machine learning library to assist with metric prediction. Learn about the main features, the algorithms used, and how you can benefit from it together with WombatOAM’s new UI from Mohamed Ali Khechine, Tamás Lengyel and Janos Sandor.
Mohamed Ali Khechine: The theme of today’s webinar is how we created a machine learning library for metric prediction for WombatOAM, the monitoring and maintenance tool which we develop at Erlang Solutions. We built this tool using Elixir. It will be presented by the next host; for now, I just want to say a few words about the Wombat tool itself. WombatOAM is software used to monitor BEAM-based systems. What I want to show today are its main capabilities. I will go through them and explain bit by bit why WombatOAM uses them and how we came up with these tools in our software.
Wombat is software that you self-host, and you can use it to monitor systems that vary from RabbitMQ to Elixir, Phoenix, and, of course, Erlang itself. There are many ways to install Wombat: on AWS, on Kubernetes, or simply locally. I will share the documentation after the webinar is finished, but I just wanted to tell you that, whatever your system and environment look like, Wombat will be able to adapt and use all of its capabilities there.
When you install Wombat and start it, you just have to set up a new password and that’s it. Then you can access the interface.
You should be able to access the node that you wish to monitor through the internet or through a network that you have set up locally, or anything which allows distributed communication through TCP. So you just need to specify the node name and the cookie, and then Wombat is able to add it. For example, for this webinar, I have set up a RabbitMQ node and I have set up four Erlang nodes.
When you add a node, Wombat automatically starts GenServer agents on the specified node, and these GenServers are responsible for delivering the metrics and the capabilities to the Wombat node.
So for example, we have here a group of agents or GenServers that are mainly focused on bringing the BEAM specific metrics that we can, for example, see here.
By default, out of the box, Wombat offers all of the metrics that are important for monitoring the health of your node, but also a few others that are more about the health of the machine, that is, the host where the application is running. We have a few extra metrics that I will explain later, and we also have a more in-depth view of the processes that are running. For example, we can get the memory usage of all of the processes running in a specific application. Similarly, we have the reduction counts of all processes running in a specific application.
So all of this information is fine-grained to allow the user to get more information out of the node and the same thing is happening for nodes that, for example, have the RabbitMQ application.
The next part of the system is logs. By default, Wombat also collects all of the logs that are generated on the nodes that you wish to monitor. All of these logs are stored on your machine, and you can filter and explore them. If you wish to send them to another log-aggregating tool like Graylog, you can push them there as well. I also wanted to show, for example, how alarms are shown.
We can monitor specific processes, for example. Let’s say you have a very important process that you use for routing, and you want to monitor that specific process for its memory usage and its message queue length. What you should do is create a monitor for it. Let’s see how we can trigger it. Let’s try to push some metrics.
Wombat will detect that a process has an issue with its mailbox. By default, this creates an alarm with all of the information that is necessary to debug the issue as soon as it happens. Wombat first needs to check the number of messages that are stuck. If it is, for example, 100 and then it drops down, the alarm would not be triggered because the messages were quickly resolved. But if the queue just gets stuck, then Wombat will automatically create an alarm, and it will be shown within a minute of the peak of the message queue.
These are the alarms that are generated by default by Wombat. An alarm gives you the Pid of the process where the issue occurred and a sample of the messages that are stuck in the message queue, so that you can at least know what type of message is making the process get stuck.
The processes are all listed and sorted by message queue or reductions. For example, you can get information about the process that has the most messages in its mailbox.
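The difference between a short spike and a queue that stays stuck can be sketched as follows. This is only an illustration in Python, not Wombat’s actual alarm logic: the function name `queue_is_stuck`, the threshold of 100 messages, and the requirement of three consecutive samples are all assumptions made for the example.

```python
def queue_is_stuck(samples, threshold=100, min_consecutive=3):
    """Return True when the most recent samples all stay at or above the threshold.

    `samples` is a list of message queue lengths, oldest first.
    The threshold and the number of consecutive samples are hypothetical values.
    """
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(length >= threshold for length in recent)


# A spike that drains again does not count as stuck.
transient = [0, 100, 3, 0, 0]
# A queue that keeps growing does.
stuck = [0, 100, 140, 180, 220]

print(queue_is_stuck(transient))
print(queue_is_stuck(stuck))
```

Requiring several consecutive samples is one simple way to ignore bursts that resolve on their own.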
Wombat allows you also to have a remote shell towards your monitored node. You can also explore the Mnesia and ETS tables.
In case you want to send the information that you saw now to other tools, you can have a choice here of which system you already use. For example, I have already set up the Grafana dashboard with the metrics coming in from Wombat. What I did is basically, set up Wombat to report the metrics in the Prometheus format which is shown here.
All of this is configurable, of course. I didn’t speak about it now because this presentation is mainly going to be about the machine learning aspect of metric predictions. But I wanted to show you that from the documentation, you can explore, for example, all of the metric capabilities.
Please find the WombatOAM documentation here.
We also monitor and have integration with Phoenix, Elixir, RabbitMQ, Riak, MongooseIM, and so on. The full list of available metrics will be in the documentation.
The next presentation is going to be about the machine learning tool.
Arnold - the machine learning tool
Tamás Lengyel: We started developing Arnold about a year ago. First, we started building it for Wombat only, but later it became a separate component from Wombat.
We are using Arnold for time-series forecasting, anomaly detection, and analysing the incoming metric values using the Axon library.
First, I want to mention the Nx library, which is the native tensor implementation for Elixir. You can think of it as the NumPy of Elixir. Nx has GPU acceleration built on Google XLA, and it is natively implemented inside Elixir. Axon was built on top of the Nx library, so it provides an Nx-powered neural network interface, and this is the tool that we used for creating Arnold. You can think of it as the TensorFlow of Elixir. Currently, both libraries are heavily in development, and while Nx has a stable release, version 0.1.0, Axon does not have an official release yet.
What are the main features of Arnold? As I mentioned, it is a separate component. It has a REST API for communication with the outside world. So not only Elixir or Erlang nodes can communicate with it, but a Python application, or any kind of application that can make REST API calls, can communicate with Arnold. We implemented multiple prediction methods, and we’ll talk about them later. We call them simple and complex methods. We have dynamic alarms and load balancing, and inside Wombat we implemented metrics filtering as well, so as not to overload the tool.
This is a simplified view of the structure. Arnold has three main components.
One is the sensors where the metrics are stored and all the incoming metrics are gathered and preprocessed before sending them to the neural network and the training algorithm.
We are storing the incoming metrics in Mnesia with the help of a separate record library called Memento. We have three tags for each metric which are hourly, daily, and weekly. Each tag has a constant value, which is the number of values that we should gather before we start the training algorithm. Wombat sends the hourly metrics every minute. Then we make an average of the last 15 minutes of metrics and then we call that one daily metric. When we reach a certain threshold defined by the tag, we are going to send it to the neural network for training.
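As a rough illustration of how per-minute values could be rolled up into 15-minute averages, here is a small Python sketch. It is not Arnold’s code (Arnold is written in Elixir and stores its data in Mnesia); the function name `downsample` and the choice of non-overlapping windows are assumptions for the example.

```python
from statistics import mean


def downsample(values, window):
    """Average consecutive, non-overlapping windows of `window` values.

    An incomplete window at the end is dropped (an assumption of this sketch).
    """
    return [
        mean(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


# One hour of per-minute ("hourly" tag) values: 0, 1, ..., 59.
hourly = list(range(60))

# Each "daily" value is the average of 15 per-minute values.
daily = downsample(hourly, 15)
print(daily)
```

With 60 per-minute values this produces four daily values, one per quarter of an hour.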
The training algorithm takes Axon models. The prediction methods are decided on whether we have a trained neural network or not. That’s how we can determine if we should use the Simple or Complex method.
The complex ones are the neural networks and the simple ones are statistical tools and models. Mainly, we use them for instant predictions and analysis tools for the alarms.
What algorithms are used?
For forecasting, we use exponential smoothing: the single, the double, and the triple. We use the single one if we cannot detect any trend or seasonality. The double is used when we can detect the trend component only, and we use triple exponential smoothing when we detect seasonality as well. For trend detection, we use the Mann-Kendall test. For seasonality detection, we use pattern matching. We are trying to find a correlation between a predefined small seasonal value and our current values. If we have any kind of correlation, then we say that we have seasonality in that metric.
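To make the selection logic more concrete, here is a minimal Python sketch of a Mann-Kendall trend check, single exponential smoothing, and the choice between the three smoothing variants. It is an illustration under stated assumptions rather than Arnold’s implementation: the function names, the smoothing factor `alpha=0.5`, the 1.96 critical value (a 5% significance level), and the omission of the tie correction are all choices made for the example. Double and triple smoothing are not implemented here.

```python
import math


def mann_kendall_z(values):
    """Z statistic of the Mann-Kendall trend test (no tie correction)."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the difference
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def has_trend(values, z_critical=1.96):
    """True when the test rejects 'no trend' at the assumed significance level."""
    return abs(mann_kendall_z(values)) > z_critical


def single_exponential_smoothing(values, alpha=0.5):
    """Return the one-step-ahead forecast from simple exponential smoothing."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def choose_method(trend, seasonal):
    """Pick a smoothing variant from the detected components."""
    if seasonal:
        return "triple"
    if trend:
        return "double"
    return "single"


rising = list(range(20))
flat = [1, 2] * 5
print(has_trend(rising), has_trend(flat))
print(choose_method(has_trend(rising), seasonal=False))
```

In this sketch any detected seasonality selects the triple variant, which is one reading of the rule described above.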
When we have enough data, we send it to the training algorithm and then we can switch to the complex neural network-based predictions and forecasting. For alarms, we use linear correlation to see when a metric has any kind of relationship with other metrics, so that it could be easier to find the root cause of a possible problem.
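The linear correlation used for alarms can be illustrated with the Pearson correlation coefficient. The talk does not say which coefficient Arnold computes, so treating it as Pearson is an assumption, as are the function name and the sample numbers in this Python sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (spread_x * spread_y)


# Made-up sample values: the two metrics move together.
process_memory = [10, 12, 15, 19, 24]
total_memory = [100, 104, 110, 118, 128]
print(pearson(process_memory, total_memory))
```

A value close to 1 means the two metrics rise and fall together, and a value close to -1 means they move in opposite directions.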
Feeding data into Arnold
If you have a node, in our case Wombat, that uses these API calls, then you have to use the following query parameters: the node ID and the metric name. In a JSON body, we specify the type, the value, and a unique timestamp. The type can be any kind of type that is present in Wombat as well: it can be a single integer or float, or it can be a histogram, duration, meter, or spiral, and you can send it in a raw format. Arnold will handle the processing of it. In the same way as we input data into Arnold, we can fetch data as well with the prediction route.
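The shape of such a request can be sketched in Python. Only the presence of two query parameters (node ID and metric name) and three JSON fields (type, value, timestamp) comes from the talk. The URL, the port, the parameter spellings `node_id` and `metric_name`, the JSON key names, and the type label `"single"` are all hypothetical placeholders, so check the Arnold documentation for the real route and names.

```python
import json
from urllib.parse import urlencode


def build_input_request(base_url, node_id, metric_name, metric_type, value, timestamp):
    """Build the URL and JSON body for sending one metric value.

    All parameter and key names here are hypothetical placeholders.
    """
    query = urlencode({"node_id": node_id, "metric_name": metric_name})
    url = f"{base_url}?{query}"
    body = json.dumps({"type": metric_type, "value": value, "timestamp": timestamp})
    return url, body


# Hypothetical host, port, and route, for illustration only.
url, body = build_input_request(
    "http://localhost:8081/input",
    node_id="node1",
    metric_name="total_memory",
    metric_type="single",
    value=42,
    timestamp=1650000000,
)
print(url)
print(body)
```

The sketch only builds the request and does not send it, so it can be adapted to any HTTP client.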
The training process
The first step is to gather data. Arnold’s manager sends the data to the correct sensors as a two-element timestamp-value tuple. The manager is responsible for getting the correct sensor from the load balancer. One sensor is responsible for one metric. So if we send 200 metrics, then we are going to have 200 sensors. A sensor stores the data for a given tag until its threshold is reached.
For example, we start training for the hourly metrics, which are sent from Wombat every minute, when we gather 180 metrics. That’s three hours of data in total. But it can be increased to five hours, six hours. It depends on the user. These sensors are saved every five seconds by the manager. Also, the manager does the training checks. Basically when we reach the tag threshold, the manager checks or marks that sensor for training. Then we start the training algorithm or we start the training process where the first step is the data preprocessing.
First, we have to extend the features with the time of tags. So you can see here the logarithmic scale of the raw timestamp value. As you can see, it’s increasing.
It’s not usable or understandable for the neural network. We have to transform that with the sine and cosine functions. As you can see here, the blue line represents one hour. So two peaks are considered as one hour.
Let’s imagine there is a value that is sent at 11:50. After transforming the data, that value is going to be -0.84, and if we go forward in time to 12:50 or 13:50, the transformed value will always be the same. This transformation is always going to return -0.84, so that the neural network can tell whether the incoming values are following an increasing or decreasing trend, or whether they have seasonality in a time period of an hour, and so on. Of course, we did that for the hourly metrics, for the daily metrics, which is the red line, and for the weekly metrics as well, which is the yellow one.
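The periodic encoding of timestamps can be sketched in Python as follows. This follows the common sine and cosine approach from the TensorFlow time series tutorial and is an assumption about the details: the function name, the feature names, and the exact scaling are illustrative. With this scaling, a value sent ten minutes before the full hour maps to a sine of about -0.87, close to but not identical to the -0.84 quoted above, so Arnold’s exact transformation presumably differs slightly.

```python
import math
from datetime import datetime, timezone

HOUR = 60 * 60
DAY = 24 * HOUR
WEEK = 7 * DAY


def time_features(timestamp):
    """Map a Unix timestamp to sine/cosine features for three periods."""
    features = {}
    for name, period in (("hour", HOUR), ("day", DAY), ("week", WEEK)):
        angle = 2 * math.pi * (timestamp % period) / period
        features[f"{name}_sin"] = math.sin(angle)
        features[f"{name}_cos"] = math.cos(angle)
    return features


at_1150 = time_features(datetime(2022, 1, 1, 11, 50, tzinfo=timezone.utc).timestamp())
at_1250 = time_features(datetime(2022, 1, 1, 12, 50, tzinfo=timezone.utc).timestamp())

# The hourly feature repeats every hour, while the daily one keeps moving.
print(at_1150["hour_sin"], at_1250["hour_sin"])
print(at_1150["day_sin"], at_1250["day_sin"])
```

Using both sine and cosine gives every moment within a period a unique pair of values, which a sine alone would not.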
The next step is splitting the data. We use 80% for the training and as an option, we can use 20% for testing purposes to see how our training model is performing. After splitting the data, we have to normalise them. We use the mean and the standard deviation for normalisation.
The algorithm or the formula that we are using.
From each value, we subtract the mean and then divide by the standard deviation. We use Nx built-in functions for that: the Nx mean, and the standard deviation, which was not present in version 0.1.0 but will be in the next release. Currently, we are using our own implementation for standard deviation.
The next step is creating the dataset. So we are gonna zip the features with the target values. Of course, we have optional values here, like the batch size, which is 32 by default, and we can shuffle the data as well for better training. The last step is to send it to the training algorithm. But first, before sending, we need to have a model. We have SingleStep and MultiStep models, and they are easily extendable. Currently, we are using a single-step dense model, but there is a linear model as well.
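The split and the normalisation can be written out in a few lines of Python. This is a sketch, not Arnold’s Nx code: the function names are illustrative, the population standard deviation is used, and taking the mean and standard deviation from the training part only is an assumption borrowed from common practice.

```python
import math


def split(values, train_fraction=0.8):
    """Split a series into a training part and a test part, keeping order."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def mean_std(values):
    """Mean and population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalise(values, mean, std):
    """Subtract the mean from each value and divide by the standard deviation."""
    if std == 0:
        return [0.0 for _ in values]  # constant series: nothing to scale
    return [(v - mean) / std for v in values]


series = [2, 4, 4, 4, 5, 5, 7, 9, 6, 8]
train, test = split(series)
mean, std = mean_std(train)
print(mean, std)
print(normalise(test, mean, std))
```

Reusing the training statistics for the test part keeps information about the test data out of the training step.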
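The dataset step can be pictured with a short Python sketch that zips features with targets, optionally shuffles the pairs, and cuts them into batches of 32. The function name `make_batches` and the `seed` parameter are assumptions for the example, and Arnold’s own version works on Nx tensors rather than Python lists.

```python
import random


def make_batches(features, targets, batch_size=32, shuffle=True, seed=None):
    """Pair each feature with its target, optionally shuffle, and batch."""
    pairs = list(zip(features, targets))
    if shuffle:
        # Shuffle whole pairs so features and targets stay aligned.
        random.Random(seed).shuffle(pairs)
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


features = list(range(70))
targets = [f * 2 for f in features]
batches = make_batches(features, targets, seed=0)

print(len(batches))
print([len(batch) for batch in batches])
```

The last batch is simply shorter when the number of samples is not a multiple of the batch size.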
After we send our dataset, and data to the training algorithm, we use the model to start the training on the client’s machine. In this algorithm, this function will return the model state. After that, we can enter the test function as well. So finish time depends on the number of values we have and the performance of the machine it is running on. It can take from 10 minutes to an hour to train the neural network.
Here you can see what it looks like inside Wombat.
So as you can see, here we have the actual values for different metrics, then we have the expected ranges for the predictions.
You can see here what happens if one metric or two cross their expected range. Here we are using exponential smoothing, and it is calculated from all of the values before the current timestamp, including the current one as well. It’s going to adapt as it goes. As the actual values are going down, the predictions are going to follow that trend. So two alarms were raised because we have two metrics that crossed their expected ranges.
From Wombat, we can simply configure it. We have a port, a forecast horizon and, as I said, we can filter the metrics that we would like to send for training. The port is necessary for the REST API communication. The forecast horizon defines how many metrics we want to forecast after the current timestamp. In the case of Wombat, for hourly metrics, setting the horizon to five means we will have five minutes of metrics ahead of the current timestamp. For the daily ones, as they are calculated every 15 minutes, a forecast horizon of 5 results in 5 times 15 minutes, that is, 1 hour and 15 minutes worth of metrics ahead of the current timestamp. For the weekly ones, it means 5 hours ahead of the current timestamp, because the weekly metrics are calculated every hour. So we will have a lot of data for those types as well.
The resources we have used are the TensorFlow documentation and tutorials, because it’s really easy to integrate those models and concepts into Axon. Of course, I used the Axon and Nx documentation, and the book “Forecasting: Principles and Practice”. I used the second edition, but the third edition is available as well. It’s really interesting and it tells you a lot of cool stuff about how they did time series forecasting when neural networks were not available, how to do it with statistics like ARIMA, seasonal ARIMA, and exponential smoothing, and how they decompose the data into separate components.
It was very interesting to read, and I learned a lot of cool stuff that was very helpful in the creation of Arnold. I used the Erlang Ecosystem Foundation Slack and its machine learning channel, where I could contact the creators of both libraries. I also used the time series forecasting methods from InfluxData, because that’s where the idea of instant predictions came from: to combine exponential smoothing with neural networks.
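The horizon arithmetic can be checked with a few lines of Python. The step sizes per tag come from the description above, while the dictionary and function names are made up for the example.

```python
from datetime import timedelta

# Interval between two consecutive values for each tag.
STEP = {
    "hourly": timedelta(minutes=1),
    "daily": timedelta(minutes=15),
    "weekly": timedelta(hours=1),
}


def forecast_span(tag, horizon):
    """How far ahead of the current timestamp a forecast horizon reaches."""
    return STEP[tag] * horizon


for tag in STEP:
    print(tag, forecast_span(tag, 5))
```

For a horizon of 5 this gives 5 minutes, 1 hour 15 minutes, and 5 hours respectively.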
What are the plans for Arnold?
We are cleaning it up and we are implementing unit tests as well so that we can automate the testing process and the release process. We would like to have Prometheus integration to have an optional training method. So we won’t have dynamic learning, but with the help of this integration, we can instantly send a lot of data to Arnold and we don’t have to use simple predictions. We can immediately start the training and we can use neural networks.
We are open sourcing the project, which is available now on GitHub along with the documentation and a wiki guide on how you can contribute.
We have a long README where I’m trying to gather how Arnold works and how you can use it, how to compile it from source code or you can just download an already prepackaged compressed file. We have the wiki as well on how you can contribute and the structure is still under development. And, of course, we have the whole documentation for Arnold at esl.github.io/arnold.
We can see Arnold running in the background. As you can see, we have the expected ranges, the actual values, and the predictions as well. And I’m just gonna do a quick garbage collection so that it will trigger alarms and I can show you one.
If you want to use Arnold in Wombat, you just have to configure it manually on the data export page; you don’t have to download Arnold separately or start it separately. It’s all handled by the integration. So basically, no additional manual work is needed for that.
Two metrics crossed their expected ranges, and we can see that two alarms were raised. We can see that a metric is out of range, and the detailed information can be found in the additional info: process memory is currently lower than the expected range by that amount of bytes. We can see the total memory as well, and as you can see, we have a positive correlation: for the process memory, we have a positive correlation with total memory, and for the total memory, we have a positive correlation with the process memory.
Wombat’s new UI
Janos Sandor: There will be a new UI for Wombat. We’ve been working on this since the old UI became overloaded, too complex, and hard to maintain. It contained a lot of deprecated JS libraries, and it was hard to write proper E2E test cases for it. The new UI will use Angular and official Angular-based extensions. It is built almost from the ground up, and we wanted to keep only the necessary parts.
The base design is provided by the official Angular Material add-on library. It has lots of built-in, animated, and nice-looking web elements with icons. Almost everything that we used in the previous UI can be imported from its modules. And it will be easier to build custom themes later on.
We kept the “front-page” dashboard view. We can combine metrics, graphs, alarms, and statistics here.
We can move these panels, we can resize them, and can add more panels, or we can switch to different front pages.
The side menu provides the primary way of navigating between the main views. (Before we had the top toolbar, but now we have a side menu.)
On Topology we can see the information about the nodes.
Logs can be listed on another view, called ‘Logs’. We can set up an auto-refresh, and filter the logs. We can check them. If a log message is too long, it is split into several parts and the user can load them all or one by one.
We have a different page to check the raised alarms.
The new dashboard has a feature for the right-click (context) menu. The context menu is different on each page. There are some persistent menu items, like adding nodes or getting to the configurations.
Metrics page looks almost the same as before. We can visualise metric graphs just like on the old UI. There is a live-mode to update the graph continuously.
We have the tools page. We can check the process manager. We can see the process information and also we can monitor processes. Of course, we have a table visualizer. We can read the content of the ETS Tables.
In the configuration menu, we have the ‘Data export’ menu where we can configure the integrations.
The new UI has another new feature. It has a dark mode. We can switch to dark mode.
We can create new families, manage the nodes, and remove nodes or families.
Mohamed Ali Khechine: We rely on telemetry to expose certain metrics. For example, by default, Wombat gets its Phoenix metrics from telemetry events, and similarly for Ecto. We also have a telemetry plugin that creates metrics from the ones you have customised through telemetry. So basically, if you have telemetry metrics, which are events that expose values and have specific event names and so on, Wombat will create metrics based on them and they will, of course, be shown here. It is the same with exometer: when you create an exometer metric, Wombat will also pick it up and expose it as a metric that you can subsequently expose in the Prometheus format and, of course, show in Grafana or anywhere else. I hope I answered the question.