
Advanced RabbitMQ Support Part 1: Leveraging WombatOAM Alarms

by Ayanda Dube


Introduction

From simple single-node deployments to complex, mission-critical clustered or federated architectures, RabbitMQ finds itself put to use in all manner of MOM (Message-Oriented Middleware) solutions. This is mainly due to its ease of use and its adaptability to different use cases, each with distinct requirements on aspects such as high availability and latency, to name just a few. User testimonials for RabbitMQ back this up: across all its spheres of use, teams have found it very impressive.

For such a highly esteemed system, playing such a crucial role in the industry's most mission-critical solutions, you'd assume it comes fully equipped with built-in, top-of-the-range operations and maintenance sub-components, utilities and functions that enable the most effective and efficient support possible. Unfortunately, that's often not the case, particularly when we focus on the all-important aspect of alarming.

What RabbitMQ considers as alarms are notifications and error messages which it writes to its logs, coupled with internal defence mechanisms or recovery strategies that are put into effect when a subset of its common operational problems is encountered. An alarm, in its true OAM sense, can be defined as:

a system generated indicator to an external entity, consisting of as much useful information as possible, in order to trigger an aiding action to prevent, resolve and/or alleviate the reported problem’s root cause.

With this definition in mind, we come to realise that RabbitMQ in fact has (at the time of publishing this discussion) a limitation in notifying and triggering external entities to carry out any corrective or pre-emptive action(s) for resolving the problem's cause. When an alarm is raised within RabbitMQ, an action plan is decided upon, and automatic resolution or recovery attempts are executed internally by the system. See Fig 1 below (NOTE: this is for illustration purposes, as the ACTION PLAN step is carried out within RabbitMQ).

Fig 1: RabbitMQ Alarming routine illustration

Some of RabbitMQ’s native alarms come with graphical indications on the native Management UI; for example, VM high memory watermark alarms will graphically indicate that a node’s permissible memory usage has been exceeded. However, the user is not actively notified beforehand (unless they are continually watching the UI).

An ideal alarming scenario and resolution plan would not only decide on a particular action plan (internally), but would go further and warn and/or notify its user(s) of the reported problem, similar to the illustration in Fig 2. For any of the many warnings and problems reported to the user, recovery would then be much faster, depending on how quickly (and how skilled) the user participating in the recovery procedure is.

Fig 2: Expected RabbitMQ Alarming routine illustration

To cater for common (or rare) internal RabbitMQ system alarms in a near-ideal manner, custom RabbitMQ plugins or external tools which fulfil these alarming requirements need to be developed, or made use of if they already exist. This is where WombatOAM comes into play, and why it is so important to the operations and maintenance of any RabbitMQ installation.

i. Metric based alarms

WombatOAM provides RabbitMQ users with a rich set of native and AMQP-related metrics to monitor. Coupled with its alarming infrastructure, specific alarms may be defined for each RabbitMQ metric that WombatOAM supports. Currently, approximately 50 RabbitMQ-specific metrics are provided by WombatOAM, which translates into the same number of distinct alarming scenarios that support engineers can configure to help ensure optimum end-to-end service provision.

Let’s take the Connections created metric as an example. For an installed RabbitMQ node, getting a clear picture of this attribute is crucial for many reasons, such as its direct influence on the node’s service availability and memory utilisation. RabbitMQ nodes will only accept connections up to a certain point, depending on how they’ve been configured and the resource limitations of the hosts on which they’ve been installed. Typical IoT use cases, for example, tend to require dynamic connection establishment from endpoint devices interacting with some SaaS backend via RabbitMQ.

With RabbitMQ as the middleman, using WombatOAM to define and raise alarms when the number of created connections exceeds a predefined threshold becomes a crucial capability to expose to support engineers. Most importantly, it assists them in making decisions such as knowing when to scale, i.e. adding one or more nodes and directing new inbound connections there once the current nodes reach connection saturation. Such an alarm could be defined as follows (in your wombat.config file):

{set, wo_metrics, threshold_sets,
 [
  [{nodes, ["rabbit@Ayandas-MacBook-Pro", "rabbit_1@Ayandas-MacBook-Pro", "rabbit_2@Ayandas-MacBook-Pro"]},
   {rules,
    [[{name, "RABBITMQ_CONNECTIONS_CREATED_ALARM"},
    {metric, {"RabbitMQ", "Connections created"}},
      {raise_level, 75},
      {cease_level, 50},
      {unit, percentage},
      {direction, warn_above},
      {percentage_base, 100000}]]}]
 ]}.

This configuration will raise an alarm when 75,000 connections (75% of the 100,000 percentage base) have been created on each node configured in the nodes list. Fig 3 illustrates an example of this alarm when raised:

Fig 3: Connections created alarm example

As already mentioned, the range of alarming cases which WombatOAM can be configured to expose is remarkably rich. Other useful metric alarms can be configured for the following cases, or more, depending on your business criticalities; alarms for any of these go into your wombat.config file in just the same way, and a sample rule for one of them is sketched after the list:

- Total active connections
- Total active channels
- Total active consumers
- Total active exchanges
- Total active queues
- Channels created
- Channels closed
- Consumers created
- Permission created
- Queue created
- Queue deleted
- User created
- User deleted
- User password changed
- User authentication failure
- User authentication success
- Exchanges created


And with the RabbitMQ statistics database running, WombatOAM will fetch more metrics from it, extending the possible metric alarming scenarios to the following (again, a sample rule follows the list):

- Publish rate
- Deliver no ack rate
- Deliver rate
- Ack rate
- Confirm rate
- Redeliver rate
- Channel consumer count
- Queue messages
- Queue messages ready
- Queue messages unacknowledged
- Queue message bytes
- Queue message bytes ready
- Queue message bytes unacknowledged


and so forth.
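
A sketch for the Queue messages metric is shown below. The metric group name, the levels and the 1,000,000-message percentage base are assumptions; verify the exact metric names, and whether the metric is reported per node, on the Metrics page of your own WombatOAM installation.

{set, wo_metrics, threshold_sets,
 [
  [{nodes, ["rabbit@Ayandas-MacBook-Pro"]},
   {rules,
    [[{name, "RABBITMQ_QUEUE_MESSAGES_ALARM"},
      {metric, {"RabbitMQ", "Queue messages"}},
      {raise_level, 90},
      {cease_level, 70},
      {unit, percentage},
      {direction, warn_above},
      {percentage_base, 1000000}]]}]
 ]}.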

ii. VM Memory High Watermark Alarms

One of the most commonly encountered problems, when you have a couple of RabbitMQ nodes operating under reasonable load and deployed on host machines that don’t boast particularly high-end specifications, is memory alarms, known in RabbitMQ jargon as VM memory high watermark alarms. When a RabbitMQ node’s memory usage hits this limit, an internal defence mechanism is triggered which blocks any further publishing of messages from connected clients at that particular point in time. This is indeed a good ploy for limiting any further consumption of memory resources. The only problem is that your engineers tend not to be warned in good time, before service side effects such as the blocking of publishing clients kick in.

With WombatOAM installed, these memory alarms can be detected in good time: the onset of the internal RabbitMQ memory alarm can be caught before publishing clients are blocked from issuing more messages. Similar to the metric-based alarms already discussed, WombatOAM’s Total memory metric can be used to monitor for and detect RabbitMQ’s VM memory high watermark alarms in any installation.

For example, using approximately 16GB as the percentage base, you could set the threshold for the Total memory metric alarm to 40%, i.e. 6.4GB. If you ensure that this equates to a value less than the vm_memory_high_watermark you configured in the RabbitMQ configuration file (i.e. rabbitmq.config), then you’re guaranteed to get this alarm raised before RabbitMQ activates its defence mechanism of blocking publishing clients.
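
For reference, the watermark itself is set with the vm_memory_high_watermark entry in rabbitmq.config (classic Erlang-term format). A minimal sketch, assuming a watermark of 50% of host memory so that the 40% WombatOAM warning fires first:

%% rabbitmq.config
[
 {rabbit, [
   %% Block publishers once the node uses 50% of host RAM
   %% (assumed value); the WombatOAM warning below fires at 40%.
   {vm_memory_high_watermark, 0.5}
 ]}
].

The corresponding WombatOAM threshold rule on the Total memory metric then looks as follows: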

{set, wo_metrics, threshold_sets,
 [
  [{nodes, ["rabbit@Ayandas-MacBook-Pro", "rabbit_1@Ayandas-MacBook-Pro"]},
   {rules,
    [[{name, "RABBITMQ_VM_MEMORY_HIGH_WATERMARK_WARNING_ALARM"},
    {metric, {"Memory", "Total memory"}},
    {raise_level, 40},
    {cease_level, 35},
    {unit, percentage},
    {direction, warn_above},
    {percentage_base, 16000000000}]]}]
 ]}.

The other option for raising native RabbitMQ VM memory high watermark alarms is to make use of WombatOAM’s notification-based alarms. Unlike the first option, this type of memory alarm is only raised at the same time as RabbitMQ employs its defence mechanism, so it wouldn’t buy you enough time to implement an action plan and avoid disruption to service. It is still useful to have it configured, however: it notifies you not only of the initial warning alarm but also of the actual internal RabbitMQ alarm when the internal defence strategies kick in (in case you missed the first warning alarm, or were still busy working on an action plan when the node’s memory usage exceeded its permitted level):

{set, wo_alarms, log_alarms,
 [
  [{alarm_properties,
    [{alarm_id, 'RABBITMQ_VM_MEMORY_HIGH_WATERMARK_ALARM'},
     {severity, major},
     {alarm_tags, [<<"dev">>]},
     {probable_cause, <<"Rabbit node consuming high memory">>},
     {probable_repair_action, <<"Free up resources or bump up host memory">>}]},
   {match_tags, [<<"dev">>]},
   {message_pattern, <<"Publishers will be blocked until this alarm clears">>}
  ]
 ]}.

With both of these memory alarms configured, you’re fully set and equipped to detect, capture and resolve any excess memory utilisation problems which, if not catered for in this manner, could potentially have led to your node(s) crashing!

iii. Cluster Partition Alarms

With the majority of RabbitMQ installations being clustered, employing Erlang distribution under the hood, the likelihood of network partitions in a distributed environment cannot be ignored. In fact, network partitions are part and parcel of the most common problems a typical engineer supporting a RabbitMQ installation needs to be able to handle, either bringing them to rapid resolution or at least lessening some of their detrimental effects (like data inconsistencies across mirrored queues).

RabbitMQ implements a couple of automatic strategies for recovering from cluster partitions; however, full restoration of the cluster is only achieved once the previously clustered nodes have re-established visibility of each other, i.e. when node connectivity has been restored. So despite providing some extremely useful recovery attempts, these strategies lack the ability to notify external entities of any sort, as the alarming paradigm expects. This is where using WombatOAM to monitor your RabbitMQ installation becomes extremely useful, indeed of absolute importance.
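
For context, which automatic strategy RabbitMQ applies is selected via the cluster_partition_handling setting in rabbitmq.config; a minimal sketch, with autoheal chosen purely for illustration:

%% rabbitmq.config
[
 {rabbit, [
   %% Partition handling strategy: ignore, autoheal or pause_minority
   %% (pause_if_all_down also exists and takes extra arguments).
   {cluster_partition_handling, autoheal}
 ]}
].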

By configuring WombatOAM alarms to detect network partitions, support engineers not only play a big part in the recovery strategies, but also become part of the alarming communication path. In most circumstances, if not all, lingering network-related problems end up being resolved by human intervention anyway. Internally, when a cluster partition does occur, RabbitMQ writes appropriate messages to its logs (along with the automatic partition handling strategies coming into effect), and WombatOAM can make full use of these by raising corresponding notification-based alarms.

The following is an example of how you would configure a cluster partition alarm in WombatOAM. Any cluster partition detected by RabbitMQ would trigger this corresponding alarm in WombatOAM:

{set, wo_alarms, log_alarms,
 [
  [{alarm_properties,
   [{alarm_id, 'RABBITMQ_CLUSTER_PARTITION_ALARM'},
    {severity, major},
    {alarm_tags, [<<"dev">>]},
    {probable_cause, <<"Network connectivity glitch">>},
    {probable_repair_action, <<"Restore network connectivity across cluster nodes">>}]},
   {match_tags, [<<"dev">>]},
   {message_pattern, <<"Partial partition detected">>}
  ]
 ]}.

Just add this configuration to your wombat.config file and restart Wombat, and that’s your cluster partition alarm configured! Waahoo! Farewell to being caught by surprise by all those cluster partition entries in your RabbitMQ logs!

Finally, get notified for your alarms

So you’ve configured your multiple alarms as described in the previous sections, safeguarding your installation from severe problem scenarios, and finally you’d like to be notified, via email for example, whenever each of these alarms is triggered. Well, WombatOAM uses Elarm for alarm management and allows easy configuration of SMTP email delivery for configured alarms. Email notifications are also configured in the wombat.config file, as illustrated below (using the example alarms we’ve previously discussed and configured):

{set, elarm_mailer, sender, "wombat.alarms@gmail.com"}.
{set, elarm_mailer, recipients, ["ayanda.dube@erlang-solutions.com"]}.
{set, elarm_mailer, gen_smtp_options,
 [{relay, "smtp.gmail.com"},
  {username, "wombat.alarms@gmail.com"},
  {password, "password"},
  {port, 25}]}.
{set, elarm_mailer, subscribed_alarms,
 ['RABBITMQ_CONNECTIONS_CREATED_ALARM',
  'RABBITMQ_VM_MEMORY_HIGH_WATERMARK_WARNING_ALARM',
  'RABBITMQ_VM_MEMORY_HIGH_WATERMARK_ALARM',
  'RABBITMQ_CLUSTER_PARTITION_ALARM']}.

With this configuration, you will get an email notification for each of the subscribed alarms, allowing you to take appropriate action to help resolve the problem as soon as possible. And of course, in the above example you should use your preferred SMTP server under gen_smtp_options, and replace the sender address and list of email recipients with your own (or add more!).

Conclusion

This wraps up our first discussion on carrying out advanced RabbitMQ support activities with WombatOAM and making full use of its state-of-the-art alarming infrastructure. With Wombat in use, the configured RabbitMQ nodes become much easier to manage, with improved uptime stemming from the fact that most, if not all, commonly anticipated RabbitMQ problems become detectable beforehand, or near instantaneously, by the support engineer, alleviating extreme service-impacting side effects.

Without WombatOAM, the hassle of actively monitoring and continuously checking RabbitMQ log files, along with frequently being caught off guard when various unmonitored aspects of RabbitMQ reach their extremes, becomes an unnecessary and unpleasant norm!

Get in touch with Erlang Solutions, request the latest release of WombatOAM, and take your RabbitMQ support capabilities to the next level, guaranteeing smooth and much-improved service availability to your connected client applications and services!

