Monitoring Beyond Averages
At Teaching Strategies, we put a lot of emphasis on the observability of our platform. We have been growing our digital footprint pretty rapidly — through new development, modernization, and acquisitions — and we strive to have holistic visibility into each user’s experience.
Organizationally, we use NewRelic as our primary observability tool. Personally, I was not a big fan of the product as recently as five years ago, but the company has made great strides in improving it to accommodate the monitoring needs of large-scale software as a service (SaaS) systems. The feature that got me most excited was a change in the granularity of the data that can be stored in the product, which unlocks distributed tracing and allows for deep analysis of individual data points.
The Value of 1 Second
One of the biggest mistakes most observability platforms make is presenting metrics averaged in 1-minute increments. Why does granularity matter? Baseline traffic on just one of our products is ~300 requests/second. That is 300 unique (and potentially very different) experiences per second. Averaging these requests across one minute only tells us that a large portion of the roughly 18,000 unique requests in that minute (loosely speaking, those around the middle of the distribution) were…ok. This, however, is not the question that we are trying to answer. The question that should matter the most is: did all of our users have a pleasant experience?
Let me make a blanket statement here: averages (specifically arithmetic means) are misleading. Making a decision based on averages, especially when it comes to performance, will inevitably cause you to miss the outliers. And, as is true of most customer-focused systems, our outliers are where the poor user experiences live. Luckily, NewRelic now offers the ability to visualize collected metrics in much more useful, industry-accepted formats: percentiles and histograms.
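To make the point concrete, here is a minimal Python sketch of how a 1-minute average can swallow a bad second. The traffic rate comes from the numbers above; the latency values, the single "bad second", and the 1-second slowness cutoff are all made up for illustration and are not taken from our production data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

REQUESTS_PER_SECOND = 300  # baseline traffic quoted in the post
SECONDS = 60               # one reporting interval

# Hypothetical latencies in seconds: requests are normally fast, but during
# one bad second (second 30) every request takes roughly 6 seconds.
latencies_by_second = []
for second in range(SECONDS):
    if second == 30:
        batch = [random.uniform(5.5, 6.5) for _ in range(REQUESTS_PER_SECOND)]
    else:
        batch = [random.uniform(0.2, 0.4) for _ in range(REQUESTS_PER_SECOND)]
    latencies_by_second.append(batch)

all_latencies = [x for batch in latencies_by_second for x in batch]

# What a 1-minute average reports versus what happened second by second.
minute_mean = statistics.mean(all_latencies)
worst_second_mean = max(statistics.mean(batch) for batch in latencies_by_second)
slow_requests = sum(1 for x in all_latencies if x > 1.0)  # assumed cutoff: 1 s

print(f"requests in the minute: {len(all_latencies)}")
print(f"1-minute mean:          {minute_mean:.2f} s")
print(f"worst 1-second mean:    {worst_second_mean:.2f} s")
print(f"requests slower than 1s: {slow_requests}")
```

In this made-up minute the average stays well under half a second, even though 300 requests waited several seconds each. The finer the granularity, the harder it is for those requests to hide.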
Math Is Important
For years now, the industry has realized that averages are very poor representations of reality, mainly because our data is not normally distributed. Instead, we now focus on a percentile as the guiding benchmark, which is certainly a move in the right direction. Usually, organizations set the 99th percentile as their acceptable norm. Less frequently it is set at the 90th percentile (which blows my mind for the reasons below). But much like averages, a downside of an “industry accepted” standard is widespread acceptance of it without a true understanding of the math behind it.
The APA defines percentile as “the location of a score in a distribution expressed as the percentage of cases in the data set with scores equal to or below the score in question.” Thus, if a score is said to be in the 90th percentile, this means that 90% of the scores in the distribution are equal to or lower than that score. The easiest way I have found to explain it in layman’s terms is to start from the median, which is the 50th percentile: half of the data points sit at or below it and half sit above it. (The arithmetic mean only coincides with the median when the distribution is symmetric, which response times rarely are.) With the 90th percentile, the reported value sits at a 90/10 split: 90% of the data points are at or below it, and only the slowest 10%, out in the tail of the curve, are above it.
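The definition is easy to turn into code. Below is a small sketch using the nearest-rank method, which is one of several common ways to compute a percentile and not necessarily the one a monitoring product uses internally. The `percentile` helper and the sample response times are hypothetical.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data points are equal to or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted data
    return ordered[rank - 1]


# Hypothetical response times in seconds: nine quick requests and one outlier.
samples = [0.2, 0.2, 0.3, 0.3, 0.3, 0.4, 0.4, 0.5, 0.6, 9.0]

mean = statistics.mean(samples)
p50 = percentile(samples, 50)
p90 = percentile(samples, 90)
p99 = percentile(samples, 99)

print(f"mean: {mean:.2f} s")  # pulled upward by the single 9-second request
print(f"p50:  {p50} s")       # the median: half the requests are at or below it
print(f"p90:  {p90} s")       # nine out of ten requests are at or below it
print(f"p99:  {p99} s")       # only here does the outlier become visible
```

With these ten samples the mean (1.22 s) describes nobody's experience: nine users were faster than that and one was far slower. The percentiles at least tell you which share of users each number covers.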
Increasing percentile vs average (mean)
Looking at the example above, we see a widely different picture when evaluating the average, the 95th percentile, and the 99th percentile. The higher the percentile, the higher the reported response time, because more of the slow tail of the distribution falls under it. Using a high percentile as a representative metric instead of an average helps to tune your system for the majority of users. However, every request above the chosen percentile is potentially a user who has had an unacceptable experience. A 90th percentile threshold means that you are “discarding” 10% of user experiences, which is, arguably, better than the roughly 50% we discard when using a middle-of-the-distribution average, but this lower number is of little consolation to the unlucky. In our case, 10% is equivalent to 30 user experiences per second; 1,800 per minute; 108,000 per hour; etc. Defensive architecture and programming certainly help (granted, that’s a whole different post), but in most cases we tend to examine the 99th percentile and higher when it comes to acceptable norms.
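The arithmetic behind those numbers is worth running for other thresholds too. This sketch assumes a steady 300 requests/second (our baseline from earlier); the `requests_above_percentile` helper is a hypothetical name used only for illustration.

```python
REQUESTS_PER_SECOND = 300  # baseline traffic quoted in the post


def requests_above_percentile(percentile, seconds, rps=REQUESTS_PER_SECOND):
    """Expected number of requests that fall above the given percentile
    during a window of `seconds`, assuming a steady request rate."""
    return rps * seconds * (100 - percentile) / 100


WINDOWS = {"second": 1, "minute": 60, "hour": 3600}

for pct in (90, 95, 99, 99.9):
    parts = [
        f"{requests_above_percentile(pct, secs):,.1f} per {name}"
        for name, secs in WINDOWS.items()
    ]
    print(f"p{pct}: " + "; ".join(parts))
```

Even at the 99th percentile, a steady 300 requests/second leaves about 3 requests per second, or 10,800 per hour, outside the number you are watching.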
A cool feature in NewRelic is the ability to set custom percentiles. We have graphs showing percentiles ranging from 50 to 99.9 to help us identify the impact of production issues on the user experience at a glance.
Every User Matters
Histograms, in my opinion, are the only accurate way to look at performance—and many other metrics. They provide a true representation of the distribution of individual data points and help us understand what the data actually means. Let’s go back to granularity and dissect that 1 minute a little more to look at individual request distribution.
According to average per-second reporting, our average response time is ~2.5 seconds, with the 99th percentile hovering around 3.3 seconds. Looking at the same 1-minute timeframe in the form of a histogram gives us a slightly different picture.
1 minute of traffic
As expected, we see a bell-shaped curve distribution of our requests across the response time. In this view, we see that about 10% of our users experienced response times greater than the reported 2.5 seconds. In some cases, they experienced almost 2 times the response time. This gives us a great starting point to dive even deeper into the data and investigate defined sample sets to understand whether this is an anomaly that needs to be addressed or if it is the norm.
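If you want to play with the idea outside of a monitoring tool, a histogram is only a few lines of code. The sketch below buckets one minute of synthetic, bell-shaped response times into fixed-width buckets and reports the share of requests slower than a chosen threshold. The data, the bucket settings, the 3-second threshold, and the `build_histogram` helper are all assumptions made for illustration. Putting overflow values into the last bucket is a choice made in this sketch, not a statement about how any particular product behaves.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Synthetic response times in seconds for one minute of traffic
# (300 requests/second * 60 seconds), roughly bell-shaped around 2.5 s.
samples = [max(0.0, random.gauss(2.5, 0.4)) for _ in range(18_000)]


def build_histogram(values, ceiling, buckets):
    """Count values into `buckets` equal-width buckets covering 0..ceiling.
    Values at or above the ceiling are counted in the last bucket."""
    width = ceiling / buckets
    counts = [0] * buckets
    for value in values:
        index = min(int(value / width), buckets - 1)
        counts[index] += 1
    return counts, width


counts, width = build_histogram(samples, ceiling=5.0, buckets=20)

# Text rendering of the distribution: one row per non-empty bucket.
for i, count in enumerate(counts):
    if count:
        bar = "#" * (count // 100)
        print(f"{i * width:4.2f}-{(i + 1) * width:4.2f} s | {count:5d} {bar}")

THRESHOLD = 3.0  # assumed "too slow" cutoff in seconds
share_slow = sum(1 for s in samples if s > THRESHOLD) / len(samples)
print(f"share of requests slower than {THRESHOLD} s: {share_slow:.1%}")
```

The single share-of-slow-requests number is useful, but the rows above it are the real value: they show where the slow requests sit and how far out the tail reaches.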
To isolate sample sets even further, we write custom NRQL queries that break the traffic distribution down along different histogram dimensions, such as URL, endpoint pattern, or singleton. In the example below, we look at the top 20 requests to analyze the per-URL consistency of response times.
SELECT histogram(duration, 4, 50) FROM Transaction WHERE appName = '[YOUR APP]' FACET request.method, request.uri SINCE 36000 seconds ago LIMIT 20
Ultimately, this representation of data allows for quicker analysis and identification of problems at a glance, even more so than stacked percentile graphs.
Per-URL user perceived load time distribution
And once we isolate a potential problem, that’s where my next favorite feature comes in — using distributed tracing to profile individual requests to understand the cause and effect of the outliers.
tl;dr Know your math and don’t trust averages, even if others (like monitoring systems, for instance) tell you otherwise.
About the Author
Leon heads Engineering teams at Teaching Strategies. His two decades of experience have been concentrated on architecting and operating complex, web-based systems to withstand crushing (and often unexpected) traffic. Over the years, he has had a somewhat unique opportunity to design and build some of the most trafficked systems in the world. He’s considered a professional naysayer by his peers and is of the opinion that nothing really works until it works for at least a million people.