Baby's first monitoring system

Till last week, I didn’t know what a monitoring system really looked like. A week later, I’m in the process of setting one up for the Recurse Center’s shared computing cluster[1], which is community maintained. Here are some notes about the various tools I’m using and how they work together (TL;DR at the bottom).

I’m halfway through my second batch at RC, and one of my batch goals was to learn DevOps/SRE skills by contributing to this cluster. Having put it off for the first 5 weeks, I finally reached out to folks at the weekly meeting about the cluster, where I was advised to look into Prometheus.

Prometheus

The Prometheus server at its core is a database. More specifically, it is a time-series database, which means it stores values along with the timestamps they were recorded at, showing how a particular value changed over time. This data can be used to create graphs and dashboards or to trigger alerts if the values cross a certain threshold (more on both later). It has its own query language called PromQL.
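
For example, a query like the one below (my own sketch, assuming the Node Exporter metrics described in the next section are being collected) computes each machine’s CPU usage over the last five minutes:

    100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

Here rate() turns the ever-increasing idle-time counter into a per-second rate, and subtracting that idle fraction from 1 gives the busy fraction per machine.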

The server can pull and store data from multiple machines, so it only needs to run on one of the machines in the cluster. However, if that machine goes down for some reason, our monitoring system goes down with it.

node_exporter

We have a database, cool, but where does the data come from? There are a variety of tools for this[2], but the one I’m using here is Prometheus’ own tool - Node Exporter - which captures metrics from the system: things like CPU usage, memory usage, filesystem sizes, etc. It runs on each machine in the cluster.
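
For the server to know where to pull from, its configuration file (prometheus.yml) needs a list of targets. A minimal sketch, with made-up hostnames (9100 is Node Exporter’s default port), could look like this:

    # prometheus.yml (sketch) - which Node Exporters to scrape
    scrape_configs:
      - job_name: node
        static_configs:
          - targets:
              - node1.example.com:9100
              - node2.example.com:9100
              - node3.example.com:9100
              - node4.example.com:9100

In the Kubernetes setup described later this file is generated by the chart, so the static list above reflects the direct-install phase I mention towards the end.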

Alertmanager

The last piece of the puzzle is alerting. The Prometheus server takes in alert rules written in PromQL - things like low disk space, services that failed to run, high CPU or memory usage, etc.
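
As a sketch of what such a rule looks like (the threshold and names here are my own, not the cluster’s actual rules), a low-disk-space alert in a Prometheus rules file might be:

    # rules.yml (sketch) - warn when a filesystem has under 10% space left
    groups:
      - name: node-alerts
        rules:
          - alert: LowDiskSpace
            expr: node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes < 0.10
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"

The rules file itself gets listed under rule_files in prometheus.yml so the server knows to load it.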

It checks the incoming metrics against these rules. If a rule is met, it fires an alert to a tool called Alertmanager, which handles sending notifications via email or a chat platform. RC uses Zulip for communication, and Zulip has an Alertmanager integration that I’m using in this case.
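
On the Alertmanager side, notification routing lives in alertmanager.yml. The Zulip integration provides an incoming webhook URL, so a minimal sketch (with the URL left as a placeholder) would be along these lines:

    # alertmanager.yml (sketch) - route every alert to a Zulip webhook
    route:
      receiver: zulip
    receivers:
      - name: zulip
        webhook_configs:
          - url: "<webhook URL generated by the Zulip Alertmanager integration>"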

Grafana

Grafana can be integrated with Prometheus to visualize the metrics via graphs and dashboards. I’ve mainly been focused on getting alerting to work so far, so I am yet to try making a dashboard.

Adding Kubernetes to the mix

The reason I was recommended Prometheus in the first place is that a fellow Recurser had already deployed it within a Kubernetes cluster.

The above setup would have worked just fine if I ran each tool as an individual service directly on the machines. However, I went ahead with the Kubernetes option for two reasons:

  1. I preferred using something that was already deployed over re-inventing the wheel
  2. I’d been hearing a lot about Kubernetes, so this would finally be my introduction to the tool

One advantage of using Kubernetes is that it makes multiple machines operate as one big unit - you provide it a list of services to deploy, and it’ll figure out which machine’s resources to utilize and how. Except for Node Exporter, which is deployed on all machines, the other services like Grafana, Alertmanager and the Prometheus server are placed automagically.
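
Mapping that onto Kubernetes terms (this is my own reading of the concepts, not necessarily the exact objects the chart creates): “run this on every machine” is what a DaemonSet expresses, which is the natural fit for Node Exporter, while the other services are ordinary Deployments that the scheduler places wherever it likes. A stripped-down DaemonSet sketch:

    # node-exporter daemonset (sketch) - one pod per machine in the cluster
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
    spec:
      selector:
        matchLabels:
          app: node-exporter
      template:
        metadata:
          labels:
            app: node-exporter
        spec:
          containers:
            - name: node-exporter
              image: quay.io/prometheus/node-exporter:latest
              ports:
                - containerPort: 9100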

In my case they’re deployed using Kubernetes’ package manager, Helm. While this eases setup, it also adds some layers of complexity.

Accessing the Prometheus web interface locally now requires more steps than a direct install would, as the server is isolated from the main system and has its own network and IP address inside the cluster. So multiple port forwards would be required.
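
The workaround I know of is kubectl’s port forwarding, something along these lines (the service name and namespace depend on how the chart was installed, so treat them as placeholders):

    kubectl port-forward svc/prometheus-server 9090:9090 -n monitoring

and then the same again for Grafana and Alertmanager on their own ports, which is where the multiple forwards come from.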

Making changes to configuration files is also harder. Kubernetes containers don’t have persistent disk space - anything changed inside a container is lost when it restarts - so it isn’t practical to exec into the containers and change files directly[3] like I would with a direct install. Instead, I have to add the configuration to some external file and then pass that file during deployment.
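
Outside of Helm, the usual Kubernetes way to do that is a ConfigMap: the configuration lives as its own object and gets mounted into the container as a file. A sketch with made-up names:

    # configmap (sketch) - alert rules kept outside the container
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-rules
    data:
      rules.yml: |
        groups:
          - name: node-alerts
            rules:
              - alert: LowDiskSpace
                expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10

Helm charts essentially generate objects like this for you from the values file, which is where the next part comes in.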

For applications deployed using Helm, I have to modify something called a Helm chart. I don’t completely understand what the various files are for, but one of them is values.yaml, where I would add custom configuration such as the alert rules. This file is passed to the install command, which then applies the custom config.
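
As a rough sketch of what I mean (I’m assuming the prometheus-community/prometheus chart here; other charts like kube-prometheus-stack use different keys, so the exact structure may vary), the alert rule from earlier would go into values.yaml something like this:

    # values.yaml (sketch) - custom alert rules handed to the Helm chart
    serverFiles:
      alerting_rules.yml:
        groups:
          - name: node-alerts
            rules:
              - alert: LowDiskSpace
                expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
                for: 15m

The file is then passed with something like helm install prometheus prometheus-community/prometheus -f values.yaml (or helm upgrade --install when changing it later).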

This complexity was preventing me from testing stuff quickly, so I decided to break the project into two phases. The first phase was testing Prometheus and alerting using a direct install, which I’m almost done with. Once I have a working set of config files, the next phase would be to figure out the Helm install and add my configuration to the values file.

TL;DR

The following diagram is based on my limited understanding of the above concepts.

┌───────────────┐   ┌──────────────────────────────────┐
│ node 1        │   │ kubernetes cluster               │
│               │   │                                  │
├───────────────┤   │                 ┌──────────────┐ │
│ node exporter │◀──┼──────┐          │ prometheus   │ │
└───────────────┘   │      │          │ server       │ │
┌───────────────┐   │      │  pull    │              │ │
│ node 2        │   │      │ metrics  │              │ │
│               │   │      ├──────────│              │ │
├───────────────┤   │      │          │              │ │
│ node exporter │◀──┼──────┤          │              │ │
└───────────────┘   │      │          ├──────────────┤ │
┌───────────────┐   │      │          │ alert rules  │ │
│ node 3        │   │      │          └──────────────┘ │
│               │   │      │                  │        │
├───────────────┤   │      │      rule  ┌─────┘        │
│ node exporter │◀──┼──────┤      met   │              │
└───────────────┘   │      │            ▼              │
┌───────────────┐   │      │    ┌──────────────┐       │      ┌─────────┐
│ node 4        │   │      │    │              │  fire alert  │         │
│               │   │      │    │ alertmanager │───────┼─────▶│  zulip  │
├───────────────┤   │      │    │              │       │      │         │
│ node exporter │◀──┼──────┘    └──────────────┘       │      └─────────┘
└───────────────┘   └──────────────────────────────────┘

Is baby’s first monitoring system a bit complex? Yes.

Is baby learning new things and reaching the edge of their abilities thanks to the complexity? ALSO YES!

Notes

  1. I knew about the cluster in my first batch, but the thought of contributing to it came to mind only a year later, thanks to some folks from a later batch starting a meeting to discuss stuff relating to the cluster. 

  2. While working on this, I found that metrics were already being captured on the machines using another tool. That tool didn’t provide any support for alerts though, so I chose to switch to Prometheus. 

  3. Prometheus has a web interface, so I expected that I would be able to change configuration and alert rules from the interface directly. I didn’t find a way to do so though, and had to edit the file and restart the service each time.