Notes from: Metrics, Metrics Everywhere

What we aim to provide is business value. Examples include improving old features or writing new features that users will love. It means having a fast, attractive site, and adding tests and making code cleaner for other developers. Business value is anything that makes it more likely for people to give us money.

We want to generate more business value, and that means we need to make better decisions about our code to help accomplish this.

Our code generates business value when it runs. Not when it is written, or committed, but runs. So in order to make better decisions about our code, we need to know what our code does when it runs. We need to measure it.

Our mental model is often wrong

Our perception is not the reality. There is always a gap between what we think, and what the code actually does. Mind the gap. We may think the code can’t possibly work, but it does. Or that it won’t fail. But it does.

We can’t know something until we measure it, and those metrics affect how we make decisions. Metrics give us clarity and resolve confusion. Measuring makes our decisions better, but only if we are measuring the right thing.

Production

Continuously measure code in production. Not a test environment, or a QA environment, but production. And there are two questions we need to ask when writing code: “What does this code do that affects business value?” and “How can we measure it?”

Types of metrics

  • Gauges
  • Counters
  • Meters
  • Histograms
  • Timers

A gauge is an instantaneous value of something, like the size of a database table. A counter is an incrementing and decrementing value, like the number of active sessions on a server. A meter is the average rate of events over a period of time, like the number of requests per second.
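The first two types are simple enough to sketch directly (a hypothetical illustration of the concepts, not any particular library's API; the table and session values are invented):

```python
class Counter:
    """An incrementing/decrementing value, e.g. active sessions."""
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Gauge:
    """An instantaneous value, read on demand via a callable."""
    def __init__(self, read):
        self.read = read


# A hypothetical table and a gauge reporting its current size.
table = ["row1", "row2", "row3"]
table_size = Gauge(lambda: len(table))

# Sessions open and close; the counter tracks the live total.
sessions = Counter()
sessions.inc()
sessions.inc()
sessions.dec()

print(table_size.read())  # 3
print(sessions.count)     # 1
```

The key distinction: a counter is maintained by the code as events happen, while a gauge is computed fresh each time someone asks.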

When people talk about rates, they normally mean the mean rate: the number of events over the total elapsed time. But what we really care about is recency, so the mean rate isn’t what we are looking for. To weight our rate toward recent events, we can calculate an exponentially weighted moving average.

There is something in the algorithm called a smoothing factor. By tuning that factor, we can get very clear, recent metrics that give us feedback like: “We went from 3,000 requests a second to less than 500 a second”.
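A minimal sketch of an exponentially weighted moving average rate, loosely modeled on the idea described above (the 5-second tick, 60-second window, and traffic numbers are assumptions chosen for illustration):

```python
import math


class EWMA:
    """Exponentially weighted moving average of an event rate (a sketch)."""
    def __init__(self, window_seconds, tick_seconds=5.0):
        # Smoothing factor: larger alpha weights recent ticks more heavily.
        self.alpha = 1.0 - math.exp(-tick_seconds / window_seconds)
        self.tick_seconds = tick_seconds
        self.rate = None    # current smoothed rate, in events/second
        self.uncounted = 0  # events observed since the last tick

    def update(self, n=1):
        self.uncounted += n

    def tick(self):
        # Called every tick_seconds: fold the latest instantaneous rate
        # into the moving average.
        instant = self.uncounted / self.tick_seconds
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant
        else:
            self.rate += self.alpha * (instant - self.rate)


# Simulate 2 minutes at 3,000 req/s, then 2 minutes at 500 req/s.
ewma = EWMA(window_seconds=60.0)
total, elapsed = 0, 0.0
for events_per_tick in [15_000] * 24 + [2_500] * 24:
    ewma.update(events_per_tick)
    ewma.tick()
    total += events_per_tick
    elapsed += ewma.tick_seconds

mean_rate = total / elapsed
print(f"mean rate: {mean_rate:.0f}/s, one-minute EWMA: {ewma.rate:.0f}/s")
```

The mean rate still reports 1,750/s long after traffic has dropped, while the EWMA has already fallen most of the way toward 500/s — that is the recency the mean rate can’t give us.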

A histogram is a statistical distribution of values in a stream of data, and this will contain the min, max and standard deviation.

Really what we want are quantiles – the median, and the 75th, 95th, 98th, 99th and 99.9th percentiles. But for high-performance services, we can’t possibly keep all of these values. To deal with this volume, the Metrics library uses reservoir sampling.

Reservoir sampling keeps a statistically representative sample of measurements as they happen. So instead of storing the entire stream of data, store a small sample that is representative of the whole data stream.

The canonical way of doing this is Vitter’s Algorithm R. Given a stream of data, we can use Algorithm R to estimate the median, the 75th percentile and the 95th percentile from a small sample. But the algorithm produces samples that are uniform over the entire stream, and what we really want is recency. So we use something called forward-decaying priority sampling. The gist is that it allows us to maintain a statistically representative sample of the last 5 minutes (5 minutes being the granularity the speaker cares about).
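Algorithm R itself is compact enough to sketch (the stream, sample size, and nearest-rank quantile rule below are choices made for illustration):

```python
import random


def algorithm_r(stream, k, rng=random):
    """Vitter's Algorithm R: a uniform random sample of k items from a
    stream of unknown length, using only O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k/(i+1); it replaces a
            # random existing slot, which keeps the sample uniform.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


def quantile(sample, q):
    """Nearest-rank quantile of a sample (a simple estimation rule)."""
    s = sorted(sample)
    return s[min(len(s) - 1, int(q * len(s)))]


# Sample 1,000 values from a stream of 10,000 without storing them all.
rng = random.Random(42)
sample = algorithm_r(range(10_000), 1_000, rng)
print(quantile(sample, 0.50), quantile(sample, 0.75), quantile(sample, 0.95))
```

The sample’s median lands near 5,000 even though we never held more than 1,000 values at once — the reservoir stands in for the whole stream.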

A timer is a histogram of durations plus a meter of calls – for example, the number of milliseconds taken to respond, and specifically the distribution of that data. This allows us to draw conclusions like: “At around 2,000 requests/second, our 99th-percentile latency jumps from 13 ms to 453 ms”.
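A toy timer combining a histogram of durations with a call count (a sketch only: a production implementation would sample durations via a reservoir rather than keep them all, and would track call rates with moving averages):

```python
import time
from contextlib import contextmanager


class Timer:
    """Histogram of durations plus a count of calls (a minimal sketch)."""
    def __init__(self):
        self.durations_ms = []  # a real implementation would sample these
        self.calls = 0

    @contextmanager
    def time(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations_ms.append((time.perf_counter() - start) * 1000)
            self.calls += 1

    def percentile(self, q):
        """Nearest-rank percentile of the recorded durations."""
        s = sorted(self.durations_ms)
        return s[min(len(s) - 1, int(q * len(s)))]


responses = Timer()
for _ in range(100):
    with responses.time():
        pass  # stand-in for handling one request

print(responses.calls, round(responses.percentile(0.99), 3))
```

Wrapping the work in a context manager means every code path through the block — including exceptions — gets measured.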

How to choose what to instrument

If it could affect the business value of your code, add a metric. Every one of Yammer’s codebases exports 40-50 metrics, each one comprising 1-10 measurements. Some of their largest services export around 2,000 values at a time.

Then have an automated process to collect the data. Yammer has a script that comes by every minute and requests the data via HTTP, with responses in JSON.
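The collection step can be sketched end-to-end with the standard library (the endpoint path, metric names, and values below are all invented; a real setup would poll each service on a schedule):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """A toy service endpoint that serves its metrics as JSON."""
    def do_GET(self):
        body = json.dumps({
            "sessions.active": 1382,      # invented counter value
            "requests.m1_rate": 2987.4,   # invented one-minute rate
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep request logging quiet for the example


# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The collector side: fetch and decode one snapshot of metrics.
url = f"http://127.0.0.1:{server.server_port}/metrics"
with urllib.request.urlopen(url) as resp:
    metrics = json.load(resp)
server.shutdown()

print(metrics["sessions.active"])  # 1382
```

Because the payload is plain JSON over HTTP, the same snapshot can feed a monitoring system, an aggregator, or an ad-hoc script without any special client.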

Then monitor the data. We could do this with Nagios, or Zabbix or whatever you’d like. But the critical thing is that if it affects the business value, someone gets woken up.

Then aggregate the data. We could use Ganglia, or Graphite, or Cacti or whatever you’d like. This allows you to place current values in historical context, and see long-term patterns in the behavior of your code. If you get this process set up, it allows you to go faster.

The goal is a fast decision-making cycle

The process allows us to shorten our decision-making cycle. The cycle looks a bit like this: observe the data, orient ourselves according to that information, which then allows us to decide, then act. This is known as the OODA loop.

A faster decision-making cycle is a huge competitive, cultural, and cognitive advantage. If we iterate faster, we will beat the competition. Short decision-making cycles ship fewer bugs, more features, and happier users. Happier users should mean more money, and more business value.

To sum it up: we need to generate business value. Our code does this when it runs. In order to know our code is generating business value, we need metrics. Shorten the OODA cycle. If you use Go, you can use Metrics; if you are running Ruby, you can use ruby-metrics; if you are using JavaScript, you can use this metrics library. Other languages have their own, or you can build one.