In the previous post of this series I talked about how Jens Rasmussen’s model for system safety succinctly describes the different ways in which projects/organisations can fail, and how a more codified version of this model identifies certain non-failed states as still being problematic. In this post I will dig a little deeper into the structure of failure and how it pertains to the structure of possible decision states, and I will also elaborate on how thought has historically lagged behind these concepts.
On Trees and Grass
It is a question of method: the tracing should always be put back on the map.
This quote comes from the introduction of A Thousand Plateaus by Deleuze and Guattari. In this jointly written book they set out a framework of thinking (the rhizome) and act upon it throughout the rest of the book. The introduction makes a strong case against a long tradition of overcoded ontology, also known as arborescent thinking (roots and trees). This beast goes by many names: the Porphyrian tree, categorical thinking, differentiation and abstraction, genealogy, dialectical thinking (the first flavour of this dichotomous thinking that somewhat incorporates a process), the Chomsky hierarchy, etc. The need to categorise is strong in humans, since it reduces complexity to a simple model. This is not to say that categorising is bad; it is just an incomplete view (a snapshot, if you like) of what reality is. Categorising transfixes/codifies reality, whereas reality is in constant change. This is where the rhizome (crab-grass and ginger) comes in as a more process-like way of thinking (an effective anti-genealogy).
Nature doesn’t work that way: in nature, roots are taproots with a more multiple, lateral and circular system of ramification, rather than a dichotomous one. Thought lags behind nature. The point is that a rhizome never allows itself to be overcoded.
On Paradigm Shifts
Historical materialism makes this symptom of root/tree thinking tangible: it is not a single “smart” person who brings about a paradigm shift (it is not a single person who makes the system fail/question itself), but the zeitgeist (the whole context or network of decisions that makes it possible to shift gears). For example, Alonzo Church and Alan Turing attacked the Entscheidungsproblem by different means (lambda calculus and Turing machines respectively), yet tackled it within the same year. Wallace and Darwin came to the same conclusion, albeit with different interpretations. Unfortunately their theory is the pinnacle of modern arborescent thinking, and in fact it is not altogether true that the genome of different species can be traced back up a tree. Certain type C viruses can carry pieces of DNA from one host species to another, all within one generation. With bacteria it is even less clear what is what and what came from where (with all those plasmids and bacteriophages flying around, a veritable orgy of DNA).
On Root-Cause Analysis
When we talk about failure and how we humans tend to look at it (in hindsight), we often come across the term root-cause analysis. This tendency to trace failure back to its root is always a political enterprise, and a fabrication when humans are involved. We can always find a causal factor (distal or proximal) to justify our explanation. In this political enterprise we give ourselves tunnel vision: we tend to ignore or downplay other causations (or never find the relevant ones at all). For a more detailed analysis see John Allspaw’s The Infinite Hows article/presentation or Sidney Dekker’s The Field Guide to Understanding ‘Human Error’.
Let’s look at a real-world example (from Sidney Dekker’s excellent book) to make this endeavour more salient.
- First finding:
Airplanes do belly flops when landing.
- First root-cause analysis:
There are bad apples among pilots: if a pilot did a belly flop in the past, he is more likely to belly flop in the future. The pilots are to blame, so we should remove them from flying that particular plane. This did not help; the percentage of belly flops did not go down. So the investigation then tried to look at the events from the point of view of the pilot.
- Second finding:
Two hydraulic switches, one for the flaps and one for the landing gear, are placed very close together in the cockpit. The switches are designed in such a way that pilots can easily confuse one for the other, thus pulling up the landing gear instead of the flaps after landing.
- Second root-cause analysis:
The designer of the cockpit dashboard failed to take note of certain ergonomic problems in the design. The design firm is to blame.
- Third finding:
While the cockpit design firm had its designers enrolled in an ergonomics course, the audit firm that audited the final design failed to notice this flaw.
- Third root-cause analysis:
The audit firm of the airplane design firm is to blame.
Now in this simple example we can always dig deeper into what happened (basically in order to avert responsibility), and more often than not we trace a single lineage in the “decision tree”. Of course the decision to place the two hydraulic switches close together is itself influenced by multiple factors/causations, so a better analysis would move towards a more complex “decision tree”. In reality the fact that we arrive at a certain (failed) state is influenced by multiple factors that are more or less relevant and that sometimes interplay with each other.

It is also important to note that different accounts (different people) of a failing state and its causations can point to different events, distributions of probability and pathways in the causation network. This is represented in Figure 1, where we can see that not all possible decision pathways leading up to the red node (5 in total) are followed through (only the blue, green and, in more detail, the orange pathways are found and politicised). So if we only represent and rearrange the orange root-cause analysis, we see that we are tracing along a hierarchy. In fact there are standardised methodologies (FTA and Ishikawa diagrams) for identifying the best ways to reduce risk and failure. These are typically used for analysing failure without a human component. When dealing with humans, on the other hand, these models fail miserably because of the inherent mismatch between the underlying process and its petrified tree-representation.
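The gap between a single tracing and the full map can be made concrete with a toy decision network. The sketch below is purely illustrative: the node names and edges are invented, and a depth-first search enumerates every simple path into the failed (“red”) node, where a classic root-cause analysis would report only one of them.

```python
# Hypothetical decision network as an adjacency list; "F" is the failed state.
# Node names and edges are invented for illustration only.
network = {
    "design": ["switch-layout", "training"],
    "audit": ["switch-layout"],
    "training": ["crew-action"],
    "switch-layout": ["crew-action"],
    "crew-action": ["F"],
}

def all_paths(graph, start, goal, path=None):
    """Enumerate every simple path from start to goal, depth-first."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # avoid revisiting nodes (no cycles in a path)
            paths.extend(all_paths(graph, nxt, goal, path))
    return paths

# A single root-cause analysis picks one of these tracings;
# the "map" is all of them.
for p in all_paths(network, "design", "F"):
    print(" -> ".join(p))
```

Even in this five-edge toy there are already two distinct pathways from the design decision into failure; a tracing that stops at the first one found simply never sees the other.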
Analogously, a root-cause analysis should be put back on the map/network of decisions, thereby opening up possible cut-off points (lines of flight, in Deleuzian speech) at which future failure pathways can be annihilated (making the organisation more resilient). The more different explanations or tracings there are, the more cut-off points for future failures a post-mortem can surface. In other words, the more viewpoints there are on how something went wrong, the more possibilities you will find to prevent failure in the future (retracing onto that failing pathway). The most obvious application of this theory in IT is unit testing while fixing bugs. When you find a bug, the first thing to do is write a unit test for the correct behaviour, which will go green once you have fixed the bug. If you run this test on every build, you have effectively annihilated one pathway into a “red” node.
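A minimal sketch of this “annihilate one pathway” pattern, assuming an invented bug in the spirit of the cockpit example (the function, labels and the bug itself are hypothetical, not from any real system):

```python
# Hypothetical example: a lookup that once confused two cockpit switches.
def select_lever(label):
    """Map a cockpit switch label to the hydraulic lever it actuates.

    The (now fixed) invented bug: an earlier version returned the
    landing-gear lever for "flaps" because of a transposed lookup table.
    """
    levers = {"flaps": "flap-lever", "gear": "gear-lever"}
    return levers[label]

# Regression test capturing the correct behaviour. Once it is green and
# runs on every build, this particular pathway into a "red" node is closed.
def test_flaps_never_selects_landing_gear():
    assert select_lever("flaps") == "flap-lever"
    assert select_lever("gear") == "gear-lever"

test_flaps_never_selects_landing_gear()
```

The test does not prove the absence of other failure pathways; it merely memorises one known failed state so the organisation cannot silently retrace it.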
In the previous post we stated that the Operating Point (OP) mostly resides on the working plane (it had better!) and only enters the failure plane briefly. These two very distinct planes have equally complex decision networks and causal factors (rhizomes). So why do we insist that crossing into the failure plane must result from a single root cause? Deleuze has shown us that this arborescent thinking is a relic of a long history of failed causal thinking (in fact, of ontological thinking in general). It is easy to pinpoint proximal factors in space and time when a failure occurs, but this thinking blinds us to the distal factors of the failing decision. Asking what the root cause of a failure is, is just as bizarre as asking what the root cause of a success is.
Root-cause analysis is a fine exercise in mapping a single arborescent path within a complex decision network; one shouldn’t forget that it is indeed a single path. Different stories (i.e. different paths) can make this complex rhizomatic network visible where a single thread cannot. We have seen that one part of resilience is eliminating the pathways into failure (memorising failed states) in this complex mesh that is decision making. In the next post we will see that resilience entails much more than this memorisation.
- 1. & 2. Gilles Deleuze & Félix Guattari – A Thousand Plateaus (Capitalism & Schizophrenia), The Athlone Press, 1988.