Unknown Unknowns
One of my all-time favorite talks to give is Production Testing through Monitoring. The talk’s central theme is the promotion of business-first monitoring, with upward correlation of supporting technical metrics. Today, I want to revisit the first part of the talk: the part where I argue that no matter how “complete” your test suite is, you will never catch 100% of your existing problems.
In 1967, three NASA astronauts tragically lost their lives in a fire during a launch pad training exercise for Apollo 1. There were technical reasons why this happened, but underlying them all was what former astronaut Frank Borman described as a “failure of imagination.” The engineers had prepared for many possible situations, but no one had thought that this exercise could go wrong in this way.
In software development, the stakes are rarely this high, but the lesson is the same. Team members need to use every tool at their disposal, along with their imaginations, to think of bugs that might occur. Strong QA engineers are particularly good at this.
At Teaching Strategies, we operate a pretty complex multi-product ecosystem that changes constantly with additional product rollouts, new capabilities, and, of course, modernization projects to bring some of our legacy components in line with modern business needs. Partially due to the complexity, and partially due to our need to ensure a healthy process, we have multiple levels of QA validation (from unit tests to manual validation to load and vulnerability testing), during which we both validate incremental changes and run full regression checks. And even with all of this coverage, we’ve often run into peculiar cases that were not, and frankly could not have been, accounted for.
Some of it we catch through negative testing. An example of this in action came out of one of our recent efforts. The team built a report for school administrators that details the number and types of interactions that occur between teachers and family members. As they built, the team’s attention was focused on the complexities of data accuracy, filter functionality, and the overall UX. What we almost missed was a bug that prevented students without family members connected to their accounts from showing up in the report. Everyone was focused on what was included, not on what was missing. If not for the imagination of a QA engineer, this bug would have been overlooked.
The reality of QA is that it is deterministic. It offers excellent coverage for known knowns (acceptance criteria, defined use cases and workflows, etc.). Once QA engineers build enough domain knowledge, they also become good at covering known unknowns (edge cases, less common workflows, etc.). But when it comes to unknown unknowns, QA alone, no matter how good the team may be, is not sufficient.
There are a number of examples I have at my fingertips, but I think this particular one offers a perfect representation of an unknown unknown.
There are never enough hours in a day
Anyone who has ever deployed a major product knows that the last day is pure chaos. Testers frantically race to finish tasks; branches must be reconciled; last-minute issues must be force-merged; everything has to be ready to go. In this particular instance, as developers were scrambling to cross all the t’s and dot all the i’s and testers were running final smoke tests, one of the QA engineers noticed that something was causing users to receive a pretty disruptive error when trying to log out. For context, logout is a pretty simple piece of functionality; moreover, it had been tested thoroughly up to that point, including just a few hours prior to the error, and no changes had been made to the code path in a while. Strange. But wait, we have error logs!
{
  "dt": "2022-04-03T20:53:51",
  "http": {
    "method": "GET",
    "script": "/logout",
    "querystring": ""
  },
  "user_id": 123456,
  "message": "Application onError",
  "extended": {
    "referral": "/abc",
    "error": {
      "type": "IdentityManager.logout",
      "detail": "An exception occurred in an Application event handler: onRequest",
      "message": "Could not logout completely",
      "errorcode": "",
      "extendedinfo": "{\"userID\":123456,\"effectiveOn\":\"2022-04-04T24:53:50Z\"}"
    },
    "name": "onRequest",
    "level": "ERROR"
  }
}
At first glance, everything looked kosher. And at a second look. And at a third. Except, under really close inspection, a team member noticed one thing. The date: 2022-04-04T24:53:50Z. Go ahead, I’ll wait while you figure it out.
Turns out, this was an objectively long day since it had at least 25 hours in it.
Looking at the code, we found the format string in question:
var t = timeFormat(utc, "kk:mm:ssZ");
Since neither I nor any of my colleagues had ever seen the kk syntax in any of the dozens of programming languages we have worked with, we went to the source (i.e., the docs). And we found a nicely documented case:
[hh] Hour in day (0-23)
[kk] Hour in day (1-24)
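To make the difference concrete, here is a minimal sketch in Java (not the language of our actual codebase; pattern letters vary slightly between libraries, so treat it purely as an illustration). In java.time’s DateTimeFormatter, HH is hour-of-day (0-23) while kk is clock-hour-of-day (1-24), so the same moment during the first hour of the day formats two different ways:

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class HourPatternDemo {
    public static void main(String[] args) {
        // A moment during hour 0 of the day -- the only hour where the two patterns disagree.
        LocalDateTime justAfterMidnight = LocalDateTime.of(2022, 4, 4, 0, 53, 50);

        // HH = hour-of-day (0-23); kk = clock-hour-of-day (1-24), which renders hour 0 as 24.
        System.out.println(DateTimeFormatter.ofPattern("HH:mm:ss").format(justAfterMidnight)); // prints 00:53:50
        System.out.println(DateTimeFormatter.ofPattern("kk:mm:ss").format(justAfterMidnight)); // prints 24:53:50
    }
}

One stray pattern letter, and every timestamp written during that hour claims the day has a 25th hour.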
To make the matter even more interesting, all of the timestamps are in UTC, and the bad value can only appear during the first hour of the UTC day (the hour that kk renders as 24). That means the team could only have observed the issue between 8 p.m. and 9 p.m. ET (after daylight saving time), times when (most) testers are home, relaxing, spending quality time with their families, and/or enjoying an adult beverage. We lucked out in this situation, as the testers had decided to run final tests late into the evening due to the complexity and criticality of the deployment. Granted, we would likely have caught the error with our monitors once the code was in production, but it certainly wouldn’t have been caught by any reasonable test suite, and in the meantime it would have degraded the user experience.
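If you want to sanity-check that window yourself, here is a small sketch (again in Java, purely illustrative) that converts the instant behind the misformatted log entry into Eastern Time:

import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class AffectedWindow {
    public static void main(String[] args) {
        // The real instant behind the "2022-04-04T24:53:50Z" entry: hour-of-day 0 UTC,
        // which the kk pattern prints as 24.
        Instant logged = Instant.parse("2022-04-04T00:53:50Z");

        // With daylight saving time in effect, Eastern Time is UTC-4, so the first hour
        // of the UTC day lands between 8 p.m. and 9 p.m. the previous evening in ET.
        ZonedDateTime eastern = logged.atZone(ZoneId.of("America/New_York"));
        System.out.println(eastern); // prints 2022-04-03T20:53:50-04:00[America/New_York]
    }
}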
As I emphasized in my talk, in no way am I preaching for less testing. But I am trying to increase awareness of the fact that not everything can—or will—be caught by testing, especially on the first go-around. Use your imagination to limit the number of unknown unknowns, then supplement your test suites with deep monitoring to detect anomalies and continuously improve your test coverage. Do both. Do them well.
About the Authors
Andy is a technology leader with experience ranging from Fortune 100 companies to VC-backed startups. With an M.Ed. in educational technology, he has spent the last eight years in technology and product leadership roles at ed-tech companies. Andy has been leading the QA team at Teaching Strategies for over four years, helping the team deliver quality products that empower our users.
Leon heads Engineering teams at Teaching Strategies. His two decades of expertise have been concentrated on architecting and operating complex, web-based systems that can withstand crushing (and often unexpected) traffic. Over the years, he has had the somewhat unique opportunity to design and build some of the most trafficked systems in the world. Peers consider him a professional naysayer, and he holds the opinion that nothing really works until it works for at least a million people.