Metrics & Monitoring

What Are Prometheus Metrics?

Prometheus is an open-source tool for collecting metrics and sending alerts. It was developed by SoundCloud. It has the following primary components:

  • The core Prometheus app – This scrapes metrics and stores them in an internal time series database, or sends them to a remote storage backend. You can then retrieve the stored metrics by querying the app.
  • Exporters – These are add-ons that ingest data from various sources and produce scrapable metrics for the Prometheus app. Exporters are external programs purpose-built for specific hardware or applications.
  • Alertmanager – This handles the alerts that Prometheus sends: it deduplicates and groups them, then routes them to receivers such as email or chat.
  • Client Libraries – These can be used to instrument your applications for monitoring by Prometheus.

Prometheus monitoring works by identifying targets. A target is an endpoint that supplies metrics for Prometheus to store. It may be a physical endpoint or an exporter that attaches to a system and generates metrics from it. Targets are either listed in a static configuration or found through a service discovery process.

When Prometheus has gathered a list of targets, it can start retrieving metrics. Metrics are retrieved via simple HTTP requests. The configuration directs Prometheus to a specific path on the target (/metrics by default) that returns plain text describing each metric and its current value.
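
To make the text a target returns more concrete, here is a minimal Python sketch that reads a small sample of scrape output. The sample payload and the parse_scrape helper are illustrative assumptions, not taken from a real target, and the parser deliberately ignores details such as timestamps and escaped label values.

```python
# Simplified reader for Prometheus-style text scrape output.
# Assumptions: no timestamps after the value, no escaped characters in labels.

SAMPLE_SCRAPE = """\
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="200"} 3
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 12
"""


def parse_scrape(text):
    """Return a dict mapping each series (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        # The value is the last space-separated field on the line.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


samples = parse_scrape(SAMPLE_SCRAPE)
for series, value in samples.items():
    print(series, "=", value)
```

Each non-comment line carries one series and its current value; the HELP and TYPE comment lines describe the metric.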

What Metrics Does Prometheus Provide?

Prometheus monitors endpoints and offers four different types of metrics:

Counter

This cumulative metric is suitable for tracking the number of requests, errors or completed tasks. It cannot decrease, and must either go up or be reset to zero.

Counters should be used for:

  • Recording a value that only increases
  • Assessing the rate of increase (later queries can show how fast the value rises)

Use cases for counters include request count, tasks completed, and error count.
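
The counter behavior described above can be sketched in a few lines of Python. This is a toy model rather than the Prometheus client library: the Counter class and the per_second_rate helper are hypothetical names, and the reset handling is a simplification of what a real rate calculation does.

```python
class Counter:
    """Toy counter: the value can only go up, or restart from zero."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount

    def reset(self):
        # Models a process restart, the only way the value drops.
        self.value = 0.0


def per_second_rate(earlier, later, elapsed_seconds):
    """Average per-second increase between two counter readings."""
    if later < earlier:
        # A drop means the counter was reset, so it restarted from zero.
        increase = later
    else:
        increase = later - earlier
    return increase / elapsed_seconds


requests = Counter()
first_reading = requests.value
for _ in range(120):
    requests.inc()
second_reading = requests.value

# 120 requests over an assumed 60-second gap between the two readings
print(per_second_rate(first_reading, second_reading, 60))
```

Because the raw value only ever grows, it is the rate between readings, not the value itself, that usually carries the useful signal.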

Gauge

This point-in-time metric can go both up and down. It is suitable for measuring current memory use and concurrent requests.

Gauges should be used for:

  • Recording a value that may go up or down
  • Cases where you don’t need to query the rate of the value

Use cases for gauges include queue size, memory usage, and the number of requests in progress.
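
A gauge can be sketched the same way. The Gauge class below is a hypothetical stand-in written for illustration, not the client library's implementation; it shows the "requests in progress" use case, where the value rises when work starts and falls when it ends.

```python
from contextlib import contextmanager


class Gauge:
    """Toy gauge: a point-in-time value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount

    def set(self, value):
        self.value = float(value)

    @contextmanager
    def track_inprogress(self):
        # Raise the gauge while the block runs, lower it afterwards,
        # even if the block raises an exception.
        self.inc()
        try:
            yield
        finally:
            self.dec()


in_progress = Gauge()
with in_progress.track_inprogress():
    during = in_progress.value  # one request is being handled
after = in_progress.value       # no requests are being handled

print(during, after)
```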

Histogram

This metric is suitable for aggregated measures, including request durations, response sizes, and Apdex scores that measure application performance. Histograms sample observations and categorize data into buckets that you can customize.

Histograms should be used for:

  • Multiple measurements of a single value, allowing for the calculation of averages or percentiles
  • Values that can be approximate
  • A range of values that you determine in advance, using either the default histogram bucket definitions or your own custom values

Use cases for histograms include request duration and response size.
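
The bucketing idea can be illustrated with a small Python sketch. The Histogram class, its bucket bounds, and the sample observations are all assumptions chosen for illustration. The quantile estimate interpolates linearly inside a bucket and treats the first bucket as starting at zero, which is why histogram-based percentiles are approximate.

```python
import math


class Histogram:
    """Toy histogram with cumulative buckets, a count, and a sum."""

    def __init__(self, upper_bounds):
        # Every histogram ends with a catch-all +Inf bucket.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        self.count += 1
        self.total += value
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1  # buckets are cumulative

    def mean(self):
        return self.total / self.count

    def estimate_quantile(self, q):
        """Estimate the q-quantile (0 < q <= 1) by linear interpolation."""
        if self.count == 0 or not 0 < q <= 1:
            raise ValueError("need observations and 0 < q <= 1")
        rank = q * self.count
        lower_bound, lower_count = 0.0, 0
        for bound, cumulative in zip(self.upper_bounds, self.bucket_counts):
            if cumulative >= rank:
                if math.isinf(bound):
                    # Cannot interpolate inside the +Inf bucket.
                    return lower_bound
                fraction = (rank - lower_count) / (cumulative - lower_count)
                return lower_bound + (bound - lower_bound) * fraction
            lower_bound, lower_count = bound, cumulative


durations = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds
for seconds in [0.05, 0.2, 0.3, 0.4, 0.8, 2.0]:
    durations.observe(seconds)

print(durations.bucket_counts)
print(durations.mean())
print(durations.estimate_quantile(0.5))
```

Only the bucket counts, the total count, and the sum are kept, so the individual observations cannot be recovered afterwards.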

Summary

This metric is suitable for accurate quantiles. A summary samples observations, provides a total count of observations and a sum of observed values, and calculates configurable quantiles over a sliding time window.

Summaries should be used for:

  • Multiple measurements of a single value, allowing for the calculation of averages or percentiles
  • Values that can be approximate
  • A range of values that you cannot determine upfront, so histograms are not appropriate

Use cases for summaries include request duration and response size.
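
For comparison with a histogram, here is a toy summary in Python. It is a deliberate simplification: the hypothetical Summary class below stores every observation and uses a nearest-rank rule, whereas a real summary estimates quantiles over a sliding time window without keeping all the raw values.

```python
import math


class Summary:
    """Toy summary: count, sum, and quantiles computed from raw values."""

    def __init__(self):
        self.observations = []

    def observe(self, value):
        self.observations.append(value)

    @property
    def count(self):
        return len(self.observations)

    @property
    def total(self):
        return sum(self.observations)

    def quantile(self, q):
        """Nearest-rank quantile for 0 < q <= 1."""
        if not self.observations or not 0 < q <= 1:
            raise ValueError("need observations and 0 < q <= 1")
        ordered = sorted(self.observations)
        rank = math.ceil(q * len(ordered))  # 1-based position in sorted order
        return ordered[rank - 1]


latency = Summary()
for seconds in [0.05, 0.2, 0.3, 0.4, 0.8, 2.0]:
    latency.observe(seconds)

print(latency.count, latency.total)
print(latency.quantile(0.5), latency.quantile(0.9))
```

No bucket bounds are needed here, which is the point of the third item in the list above.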

CPU Usage

The metric used here is “node_cpu_seconds_total”. This is a counter metric that counts the number of seconds the CPU has been running in a particular mode. The CPU has several modes such as iowait, idle, user, and system. Because the objective is to count usage, use a query that excludes idle time:

sum by (cpu)(node_cpu_seconds_total{mode!="idle"})

The sum function combines all of the selected CPU modes. The result shows how many seconds each CPU has spent in non-idle modes since the counter started. To tell if the CPU has been busy or idle recently, use the rate function to calculate the growth rate of the counter:

(sum by (cpu)(rate(node_cpu_seconds_total{mode!="idle"}[5m])))*100

The above query produces the per-second rate of increase, averaged over the last five minutes. Because the counter measures seconds, this rate is the fraction of each second that the CPU spent in non-idle modes. Multiplying by 100 expresses the result as a percentage.
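
The arithmetic behind this query can be worked through by hand. The Python sketch below uses made-up counter readings for a single CPU and a hypothetical busy_percent helper. It mirrors the idea of the query but not the exact calculation Prometheus performs, which also handles counter resets and the edges of the time window.

```python
def busy_percent(earlier_seconds, later_seconds, window_seconds):
    """Percentage of the window that the CPU spent in non-idle modes."""
    increase = later_seconds - earlier_seconds  # non-idle seconds gained
    return 100 * increase / window_seconds


# Hypothetical readings of the summed non-idle counter for one CPU,
# taken 300 seconds (five minutes) apart.
reading_start = 12000.0
reading_end = 12090.0

print(busy_percent(reading_start, reading_end, 300))
```

In this example the counter gained 90 non-idle seconds over a 300-second window, so the CPU was busy for 30 percent of that time.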

Memory Usage

The following query calculates the total percentage of used memory:

node_memory_Active_bytes/node_memory_MemTotal_bytes*100

To obtain the percentage of memory use, the query divides active memory by total memory and multiplies the result by 100.

Free Disk

You need to know how much free disk space remains to understand when the infrastructure nodes need more space. The method is the same as for memory usage, but with different metric names:

node_filesystem_avail_bytes/node_filesystem_size_bytes*100
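
Queries like the ones above can also be run from a script through the Prometheus HTTP API. The following Python sketch is written under stated assumptions: the server address is a placeholder for a local instance on the default port, the helper names are hypothetical, and the canned response only imitates the shape of an instant query result so the parsing can be shown without a live server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local server, default port
FREE_DISK_QUERY = "node_filesystem_avail_bytes/node_filesystem_size_bytes*100"


def build_query_url(base_url, promql):
    """Build the instant query URL for a PromQL expression."""
    return base_url + "/api/v1/query?" + urlencode({"query": promql})


def extract_values(response_body):
    """Turn an instant-vector response into (labels, value) pairs."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: " + str(payload.get("error")))
    return [
        (item["metric"], float(item["value"][1]))  # value is [timestamp, "number"]
        for item in payload["data"]["result"]
    ]


def run_query(base_url, promql):
    """Needs a reachable Prometheus server; not called in this sketch."""
    with urlopen(build_query_url(base_url, promql)) as response:
        return extract_values(response.read())


# Canned response with made-up labels and a made-up value.
CANNED_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"instance": "node1:9100", "mountpoint": "/"},
                "value": [1700000000.0, "42.5"],
            }
        ],
    },
})

query_url = build_query_url(PROMETHEUS_URL, FREE_DISK_QUERY)
free_disk = extract_values(CANNED_RESPONSE)
print(query_url)
print(free_disk)
```

With a running server, run_query(PROMETHEUS_URL, FREE_DISK_QUERY) would return one pair per filesystem series.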
