Back in 2018, John Allspaw wrote a piece about the over-valuing of “shallow incident data” for the Adaptive Capacity Labs blog. John argued that the industry spends too much time collecting ever greater quantities of data about incidents, like Mean Time to Resolution (MTTR) and Incident Frequency, very little of which is useful for understanding how those involved in responding to an incident actually did so. While I think he was right then (and remains so today), whether or not that’s the case is ultimately an empirical question.
My background is in philosophy, so while I’m interested in that empirical question and even what sorts of data and metrics people record about incidents, I want to look at the subject of shallow data from another angle: what is it about this data that makes it “shallow”? Or to put it another way: what qualitatively distinguishes the type of data that John calls “shallow” from the “deeper and thicker” data that he discusses later in that blog post?
This question gets us into the realm of metaphysics, or the subject of what the world is ‘really made of’ and ‘how it really works’. The basic idea is that if you ask what the world is, or how something does what it does in principle, then you’ve asked a metaphysical question. In this case, the question is what in principle distinguishes “shallow data” from “deeper and thicker data.” (1)
I propose that the distinction lies in the difference between data about actual states of the world and data about potential states.
More specifically, the second type of data concerns the parameters that affect the actualization of potential states.
The former type of data, one will note, consists of metrics of events that are understood to have been completed. Such events have determinable and therefore quantifiable features like time-to-resolution. In his book Difference and Repetition, philosopher Gilles Deleuze calls this the “explication” of “extension.”
The latter type of data describes the factors that condition the production of the former: it is data about how features like time-to-resolution come to be. It therefore characterizes the different factors that affect or modulate the process of becoming (or the actualization of potential) behind any given piece of shallow data. The philosopher Brian Massumi calls these affecting factors “ontopowers” because they are forces that affect the ontogenesis of determined, extended, actual beings; “ontogenesis” is the genesis or becoming of being. In this case, that being is the incident with its various spatially and temporally extended features that are measured and collected as data.
What makes “shallow data” shallow, then, is that it is an effect. And those hoping to learn from incidents and engage in Resilience Engineering are only secondarily interested in effects; we’re interested in potential, or adaptive capacity, and how to maintain and modulate it. That’s why we want the deeper data about how that capacity was maintained and modulated in prior incidents, so that we can understand what affected that potential before and perhaps more self-consciously modulate it in the future.
To give an example of what this conception of Resilience Engineering does for us, let’s consider a canal. When a canal is first built, its design is based on certain parameters like the available sources of water, the intended destination, geographical features like regional topography, construction materials, and all-too-human considerations like local land-use policies and whether Farmer Brown decides to sell or otherwise make her property available for the canal to traverse. These parameters all coalesce to produce the particular measurable features of the water that flows through the canal, like its depth or the width of its surface, and even its qualities, like whether it’s drinkable on its own or needs to be treated before human consumption.
As these parameters shift (say the floodplain of a certain portion is below sea level and climate change now threatens to flood the area) and we want to adjust the canal to account for that, focusing on the currently measurable data about the water in the canal may be of little help. Instead, one would want to investigate those other, deeper factors and how they came to affect one another and produce that data. One could then modulate the parameters and thereby adjust the canal.
Thus, the focus of Resilience Engineering is on parameters and the forces that produce the parameters; the effect is only of secondary interest.
In Resilience Engineering for software, then, we want to make sure that we’re focusing on those parameters that keep our systems operational and how they’re produced (or learn about what was doing so, after an incident). These parameters are the results of things like communication patterns within and between teams, the timing of interactions between different technical components like auto-scaling rules and instance boot durations, organizational policies around deploying code changes, and the others that John lists in his original post. It’s through learning about these socio-technical factors, not by collecting simple output metrics like MTTR, that one can understand how the actual is produced from the potential and avoid over-valuing shallow data.
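To make this concrete for software, here’s a minimal sketch in Python. This is my own illustration rather than anything from John’s post, and the names (`ShallowIncidentRecord`, `autoscaling_cooldown`, `instance_boot_duration`, and so on) are hypothetical stand-ins for the kinds of parameters listed above. The shallow record captures only the determinable output of a completed incident; the deeper record also captures some of the parameters whose interplay produced that output.

```python
# A hypothetical sketch: "shallow" output metrics vs. the deeper parameters
# that conditioned them. Field names and values are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ShallowIncidentRecord:
    """The effect: determinable, quantifiable features of a completed event."""
    started_at: datetime
    resolved_at: datetime

    @property
    def time_to_resolution(self) -> timedelta:
        return self.resolved_at - self.started_at


@dataclass
class DeepIncidentRecord(ShallowIncidentRecord):
    """Some of the parameters that conditioned how the incident unfolded."""
    # Timing interactions between technical components (hypothetical values):
    autoscaling_cooldown: timedelta = timedelta(minutes=5)
    instance_boot_duration: timedelta = timedelta(minutes=3)
    # Socio-technical factors: who was consulted, under what policies.
    responders_consulted: list[str] = field(default_factory=list)
    deploy_freeze_in_effect: bool = False


incident = DeepIncidentRecord(
    started_at=datetime(2023, 1, 1, 2, 0),
    resolved_at=datetime(2023, 1, 1, 3, 10),
    responders_consulted=["on-call SRE", "payments team", "cloud vendor support"],
    deploy_freeze_in_effect=True,
)

# The shallow datum: a single number that says nothing about how it was produced.
print(incident.time_to_resolution)  # 1:10:00

# A glimpse of the deeper data: for instance, how long responders had to wait
# for new capacity while coordinating across teams and policies.
print(incident.autoscaling_cooldown + incident.instance_boot_duration)  # 0:08:00
```

The point of the sketch is only that the second record invites questions the first one can’t answer: why was the cooldown set that way, who decided to freeze deploys, and how did those choices interact during the response.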
(1) To answer this question I’ll draw upon thinking by the philosopher Gilles Deleuze and one particular reader of his work (though there are many to whom I’m indebted), Brian Massumi; I won’t cite particular pages or sections from their works but will just point the interested reader to Deleuze’s Difference and Repetition and Massumi’s Ontopower and note them as key texts.