Monitoring Sidekiq Queue Latency Using AWS CloudWatch

When you have a Rails app that depends on Sidekiq (and Redis) for processing background jobs, it is critical to monitor queue latency to make sure there are enough Sidekiq workers running based on the volume of jobs at any given time. In Sidekiq, queue latency is the difference between when the oldest job was enqueued and the current time. To take full advantage of distributed job processing, the latency of each job queue must remain very low.
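To make the definition concrete, here is a minimal sketch of the calculation in plain Ruby. The `queue_latency` helper is hypothetical and only mirrors the definition above; it is not Sidekiq's own implementation, and it assumes the enqueue time is available as epoch seconds.

```ruby
# Latency of a queue, given the enqueue time (epoch seconds) of its oldest job.
# A queue with no jobs (nil) is treated as having zero latency.
def queue_latency(oldest_enqueued_at, now: Time.now.to_f)
  return 0.0 if oldest_enqueued_at.nil?

  now - oldest_enqueued_at
end

puts queue_latency(1_000.0, now: 1_012.5) # oldest job has waited 12.5 seconds
puts queue_latency(nil, now: 1_012.5)     # empty queue
```

The longer the oldest job has been waiting, the larger this number gets, which is why it works well as a signal that more workers are needed.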

AWS CloudWatch is a great monitoring service for these types of custom application metrics, especially if the application is hosted on AWS infrastructure. It lets you publish metric data from any source through its SDK and create alarms with notifications based on custom-defined criteria.

The application behind the code examples below runs in Docker containers on AWS. I wanted to expose the Sidekiq data via a simple web API that could be consumed by a process running in a different container or machine, without requiring a connection to Redis. This approach also makes the data available for use in an externally hosted status page. The web API is by no means required: the code that pulls the stats from Sidekiq could be combined with the code that publishes the metric data to CloudWatch. If you do not plan on doing anything else with the Sidekiq stats outside your application, I would recommend the combined approach and skipping the step of exposing the data through a Rails controller.
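For reference, the combined approach could look roughly like the sketch below. The `DirectLatencyPublisher` class is a hypothetical name, and the queue list and CloudWatch client are passed in so the example stays self-contained. The metric name, dimension name, and namespace are the same placeholders used later in this post.

```ruby
# Publishes one latency datapoint per queue straight to CloudWatch,
# without going through a web API.
class DirectLatencyPublisher
  def initialize(queues:, client:, namespace: 'YourAppName')
    @queues = queues       # objects responding to #name and #latency
    @client = client       # object responding to #put_metric_data
    @namespace = namespace
  end

  def perform
    @queues.each do |queue|
      @client.put_metric_data(
        namespace: @namespace,
        metric_data: [
          { metric_name: 'SidekiqQueueLatency',
            dimensions: [{ name: 'QueueName', value: queue.name }],
            timestamp: Time.now,
            value: Float(queue.latency),
            unit: 'Seconds' }
        ]
      )
    end
  end
end

# With the sidekiq and aws-sdk gems loaded, the wiring would be roughly:
#   DirectLatencyPublisher.new(
#     queues: Sidekiq::Queue.all,
#     client: Aws::CloudWatch::Client.new(region: 'us-east-1')
#   ).perform
```

Note that this variant needs a Redis connection wherever it runs, which is exactly what the web API approach avoids.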

However you choose to implement it, the critical thing to keep in mind is that none of the monitoring code should run from within the context of a Sidekiq job. If it depends on job processing, the metric data will not be published to CloudWatch when the job queue becomes backed up, which is precisely when you need it.

Getting Stats from Sidekiq

First we create a new Rails controller (app/controllers/statuses_controller.rb), which simply provides stats from the Sidekiq API in JSON format. No authentication is required since we are not exposing anything sensitive; this is basically the equivalent of a status page in JSON format. It should also be noted that Sidekiq already provides JSON-formatted stats at /dashboard/stats if the web dashboard is enabled. However, this endpoint only provides latency information for the default queue and, like the rest of the dashboard, it should be protected by authentication and authorization.

class StatusesController < ApplicationController
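  # Render JSON directly. Class-level respond_to and respond_with were moved
  # to the separate responders gem in Rails 4.2.
  def sidekiq
    render json: sidekiq_stats
  end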

  private

  def sidekiq_queue_stats(name)
    queue = Sidekiq::Queue.new(name)
    { name: queue.name,
      size: queue.size,
      latency: queue.latency }
  end

  def sidekiq_stats
    stats = Sidekiq::Stats.new
    { enqueued: stats.enqueued,
      retries: stats.retry_size,
      queues: stats.queues.keys.map { |key| sidekiq_queue_stats(key) } }
  end
end

You may notice that I included additional stats from the Sidekiq API (enqueued and retries). They are not required for monitoring queue latency; I exposed them for other uses, so they can be omitted if you do not need them.

We specify the controller action in our routes.rb:

resource :status, only: [] do
  get 'sidekiq', on: :collection
end

This gives us a JSON output from /status/sidekiq.json which looks something like the following:

{
  "enqueued": 0,
  "retries": 0,
  "queues": [
    {
      "name": "mailers",
      "size": 0,
      "latency": 0
    },
    {
      "name": "default",
      "size": 0,
      "latency": 0
    },
    {
      "name": "critical",
      "size": 0,
      "latency": 0
    }
  ]
}
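Since the endpoint is also meant to feed consumers such as a status page, here is a minimal sketch of reading it with nothing but the standard library. The `backed_up_queues` helper, the one-second threshold, and the sample payload values are illustrative assumptions, not output from a real application.

```ruby
require 'json'

# Returns the names of queues whose latency exceeds the threshold (in seconds).
def backed_up_queues(json, threshold: 1.0)
  queues = JSON.parse(json).fetch('queues')
  slow = queues.select { |queue| Float(queue['latency']) > threshold }
  slow.map { |queue| queue['name'] }
end

# Illustrative payload in the same shape as /status/sidekiq.json.
sample = <<~PAYLOAD
  { "enqueued": 42, "retries": 0,
    "queues": [
      { "name": "mailers",  "size": 2,  "latency": 0.4 },
      { "name": "default",  "size": 40, "latency": 17.2 },
      { "name": "critical", "size": 0,  "latency": 0 }
    ] }
PAYLOAD

p backed_up_queues(sample) # => ["default"]
```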

Using AWS SDK to Publish Metric Data

Next we need something that will fetch and parse this data at regular intervals and publish it to AWS CloudWatch.

I have this code organized into a class as part of a larger plain Ruby application which uses Clockwork as an entry point for task scheduling. I will skip over the implementation details of this app since there are many ways to run something on a scheduled basis. You could simply have a cron job run this every minute (cron cannot schedule anything more frequently than that) and make the script below self-executing by adding SidekiqStatusMonitor.perform to the end of the file. I have this task configured to run every 10 seconds.

require 'aws-sdk' # with version 3 of the SDK, 'aws-sdk-cloudwatch' alone is enough
require 'http'
require 'json'

class SidekiqStatusMonitor
  def self.perform
    SidekiqStatusMonitor.new.perform
  end

  def perform
    # Read the body into a String before handing it to the JSON parser.
    response = HTTP.get(url).body.to_s
    @stats = JSON.parse(response)
    process_stats
  end

  private

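  # Memoized so one client is reused for every queue instead of building a new one per call.
  def client
    @client ||= Aws::CloudWatch::Client.new(region: 'us-east-1')
  end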

  def publish(queue)
    options = {
      namespace: 'YourAppName',
      metric_data: [
        {
          metric_name: 'SidekiqQueueLatency',
          dimensions: [
            {
              name: 'QueueName',
              value: queue['name'],
            }
          ],
          timestamp: Time.now,
          value: Float(queue['latency']),
          unit: 'Seconds'
        }
      ]
    }
    client.put_metric_data(options)
  end

  def process_stats
    @stats['queues'].each do |queue|
      publish(queue)
    end
  end

  def url
    'https://YOUR_APP_DOMAIN/status/sidekiq.json'
  end
end

Hopefully the above code is fairly easy to follow. It uses the HTTP gem to get the JSON response from the API endpoint we created earlier, parses the response, and iterates through the array of queue stats. For each queue, an options hash is built and passed to the put_metric_data method, which publishes the metric data to CloudWatch. SidekiqQueueLatency is used as the metric name every time, with QueueName added as a dimension of the metric. The value of the dimension is set to the actual queue name configured in Sidekiq, which in this example is default, mailers, or critical. Finally, the latency value is added with a timestamp and assigned the unit of Seconds.
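The class above makes one put_metric_data call per queue. Since metric_data is an array, an alternative is to send every queue in a single request. The sketch below only builds the payload; `build_metric_payload` is a hypothetical helper, and it assumes the queue hashes have the same shape as the JSON shown earlier.

```ruby
# Builds a single put_metric_data payload covering every queue.
def build_metric_payload(queues, namespace: 'YourAppName', timestamp: Time.now)
  metric_data = queues.map do |queue|
    { metric_name: 'SidekiqQueueLatency',
      dimensions: [{ name: 'QueueName', value: queue['name'] }],
      timestamp: timestamp,
      value: Float(queue['latency']),
      unit: 'Seconds' }
  end

  { namespace: namespace, metric_data: metric_data }
end

# Inside SidekiqStatusMonitor this could replace the per-queue loop:
#   client.put_metric_data(build_metric_payload(@stats['queues']))
```

CloudWatch limits how many datapoints a single request may carry, so an application with a very large number of queues would need to slice the array into several calls.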

The metric and dimension names are not configured in CloudWatch beforehand; they simply appear under the metrics view once data is published with them. A single metric with multiple dimensions is not required. Alternatively, a separate metric name without dimensions could be published for each queue. I felt organizing the queues as dimensions under a single metric was a better fit, and the built-in AWS resource metrics seem to mostly follow this same pattern.

Create IAM User and Policy

Create an IAM policy called CloudWatchPutMetricData that restricts access to just publishing metric data:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Create an IAM user in the AWS console, assign it the CloudWatchPutMetricData policy, and store the credentials somewhere safe, such as an encrypted password database. You can then assign the credentials to the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or use any of the other methods described in the AWS SDK for Ruby documentation.
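If you go the environment variable route, a small preflight check can make a missing credential obvious at startup rather than on the first publish. This is only a sketch: `missing_aws_vars` is a hypothetical helper, and the check would not apply if you use one of the SDK's other credential sources.

```ruby
REQUIRED_AWS_VARS = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY].freeze

# Returns the names of required variables that are missing or blank.
def missing_aws_vars(env = ENV)
  REQUIRED_AWS_VARS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_aws_vars
warn "Missing AWS credentials: #{missing.join(', ')}" unless missing.empty?
```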

Deploy and Verify Metric Data

Now it’s time to run the script and verify in the CloudWatch dashboard that metric data is being published. For testing, you can simulate an increase in latency by pointing the monitor at your test environment and temporarily disabling the Sidekiq workers.
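To run the monitor on a short fixed interval (such as the 10 seconds mentioned earlier) without a scheduler library, a plain loop is one option. The `run_every` helper below is a hypothetical sketch rather than what my application uses. It logs errors and carries on, so that a single failed request does not stop the monitoring.

```ruby
# Calls the block every `interval` seconds, subtracting the time the block
# itself took from the sleep. With `iterations: nil` it loops forever.
def run_every(interval, iterations: nil)
  count = 0
  until iterations && count >= iterations
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      yield
    rescue StandardError => e
      warn "monitor run failed: #{e.class}: #{e.message}"
    end
    count += 1
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    sleep([interval - elapsed, 0].max)
  end
end

# At the end of the monitoring script:
#   run_every(10) { SidekiqStatusMonitor.perform }
```

Whatever supervises the process (Docker restart policies, systemd, and so on) should restart it if it exits, since a stopped monitor publishes nothing.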

Create Alarms

With the monitoring script deployed and data being published at regular intervals, CloudWatch alarms can now be created for each queue. My alarms are set to go into the ALARM state if the latency is greater than one second for three consecutive intervals, with each interval representing a one-minute average of the latency value. Once created, an alarm should settle into the OK state, assuming everything is running smoothly.

Receiving notifications on alarm state changes requires actions to be configured for each alarm. In the “Actions” section of the alarm details, one or more notifications can be defined for the alarm based on its state, delivering alarm notifications by email and/or SMS through SNS topics. Ensure that notifications are defined for each of the three possible states (ALARM, OK, and INSUFFICIENT_DATA). This will provide notifications when the metric goes into an alarm state, recovers from the alarm state, or stops being published because the monitoring script is no longer running. Auto Scaling actions can also be defined, providing the ability to react automatically to an increase in queue latency.
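I set my alarms up in the console, but the same settings can be expressed as parameters for the SDK's put_metric_alarm call. The sketch below builds those parameters for one queue. The `latency_alarm_params` helper, the alarm naming scheme, and the topic ARN are all hypothetical, and it assumes a single SNS topic receives notifications for all three states.

```ruby
# Parameters for an alarm that fires when the one-minute average latency of a
# queue stays above one second for three consecutive periods.
def latency_alarm_params(queue_name, topic_arn, namespace: 'YourAppName')
  {
    alarm_name: "sidekiq-#{queue_name}-latency",
    namespace: namespace,
    metric_name: 'SidekiqQueueLatency',
    dimensions: [{ name: 'QueueName', value: queue_name }],
    statistic: 'Average',
    period: 60,              # seconds per evaluation interval
    evaluation_periods: 3,
    threshold: 1.0,          # seconds of latency
    comparison_operator: 'GreaterThanThreshold',
    alarm_actions: [topic_arn],
    ok_actions: [topic_arn],
    insufficient_data_actions: [topic_arn]
  }
end

# With a CloudWatch client and a placeholder topic ARN:
#   arn = 'arn:aws:sns:us-east-1:123456789012:ops-alerts'
#   client.put_metric_alarm(latency_alarm_params('default', arn))
```

Creating alarms this way requires the cloudwatch:PutMetricAlarm permission, which the restricted policy above deliberately does not grant, so it should be run with separate credentials.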

Conclusion

CloudWatch is a great choice for monitoring custom application metrics such as Sidekiq queue latency. It makes particular sense for applications hosted on AWS infrastructure, where resource metrics, custom metrics, and logs can all be collected and monitored in a single place.