Whilst attributed to Gene Kranz, Flight Director for Apollo, the quote ‘Failure is not an option’ was actually penned by the screenwriters of the 1995 film Apollo 13. Kranz just liked it so much that he claimed it back as the title of his autobiography.
In ‘To The Moon And Back’ i wrote that failure is not only an option, but possibly the most likely one. Systems tend towards both overconfidence in their stability, and a lack of imagination around the many ways that they may fail.
There are very few things that we can say with absolute certainty, but ‘all things must pass’ is one of them: all organisations fail, eventually, and all systems too. Not even beliefs are immune to the corrosive effects of evolving culture.
But how do things fail?
Today is a short reflection on that: three mechanisms of failure.
One mechanism is simply ‘Confusion’. Many things happen at once, and we do not spot them all, or we do not connect the dots, or we mis-categorise those things. Sometimes things happen in different parts of a system, causing local clarity, but organisational confusion, because the parts are not connected or, more likely, are connected in the wrong ways.
The funny thing about confusion is that we are not always aware when we have caught it: it may be masked by confidence, where we believe that we hold understanding but are, in reality, confused.
Confusion may be held at different levels, and within different systems: for example, all of our sensors may be functioning, and yet at a decision-making level we are confused. So we may have all the data, all the great people in the room, and even the correct context for assessment, and yet confusion reigns.
Sometimes because we lack imagination: the ‘answer’ is there to see, but it does not match up against expectation, or it stretches the elasticity of our belief system too far, and so we ignore it, or it is simply too wide to comprehend.
Sometimes confusion is deliberately injected into systems: to flummox or hoax, to distract or deny. Either from outside, as an act of malice, or even from inside as a matter of indolence or neglect.
Confusion is catching: we may catch it from external systems or narratives, or through the adoption of intact external narratives that then confuse internal ones. Markets work this way: as structures of belief and hope more so than logic and prediction.
Confusion may not cause failure, but may mask it. It’s not a force that can be avoided: by necessity we must be ‘confused’ to learn, and ‘confused’ to change. Confusion may be a key component of the spaces that we experiment within.
Perhaps it is best to say that we must understand our relationship with confusion, and how it relates to confidence, or overconfidence, and hence ask: what is the cost of carrying confusion?
Cascades are something else: cascade effects happen when one failure impacts a subsequent layer of the system.
Cascades are a mechanism by which seemingly minor failures end up fracturing the system itself: whilst no one failure is considered catastrophic, the result nonetheless turns out to be so.
Understanding the principle of cascading failure is one of our mechanisms of resilience: if you are able to break the links, you may avoid failure. This is one aspect of failure, complexity and control that i am particularly interested in, because it may be quite a low tech thing to do.
The rush to action, through confusion, with certainty, may cause a cascade to occur, or extend. Something as simple as introducing delay, time, space, or gateways, may help to avoid it.
Alternatively, we can consider how the system itself is engineered: to be monolithic, or inherently fragmented, or to include crumple zones.
A monolithic system may be highly controlled, efficient, and effective. A fragmented one may be interconnected, but divergent, and controlled more through influence than directly. And crumple zones can be areas that collapse easily, taking energy or momentum out of failure.
An example would be agency workers: organisations that employ some people on contracts, and others through agencies, have a built-in capability to react to the market by ‘collapsing’ the agency work. None of this is particularly clever or complex in itself, but the engineering of the organisation is something to consider with care.
In ‘The Socially Dynamic Organisation’ i talk about more diverse ecosystems and structures: Organisations that are lighter in weight but more heavily interconnected, and more resilient through design, not mass alone.
Cascade failures may run not simply through formal structures (like staffing and fixed costs) but through knowledge itself: belief in the integrity of a market, reputation damage to a key individual, or social accountability of the Organisation itself.
Worth also noting that cascades of failure typically do not run in a straight line – not least because we seek to avoid those weaknesses that are visible. Rather like a crack propagating through glass, they can zigzag wildly, and the very connection of disparate, disconnected, or seemingly unrelated elements is where the weakness lies.
Combinant effects are catalytic or additive in nature: they may be part of, or the trigger for, cascades, but the sum is more than the parts. The best way to consider combinant effects is where two elements, in isolation, are innocuous, but together produce hitherto unseen effects. There is a notion of critical mass here, or a tipping point: you cannot spot the risk when the elements are isolated – indeed we may have tested individual elements of the system and demonstrated their safety – but together they cause a failure.
Combinant effects may be entirely unpredictable, or the cost of prediction may be prohibitive enough to prevent it occurring with any regularity, or they may be predictable but hidden within confusion, or occluded within existing frames of certainty.
In retrospect, combinant effects may give an illusion of being obvious, but that is the key benefit of hindsight.
An example would be the sinking of the Oceanos cruise liner in 1991: all the mechanical systems were tested, and all the human systems were trained, but in the event, a part was missing (leading to flooding) and the captain was missing (frozen in shock) and hence the system failed – famously it was the entertainers who stepped in to save the passengers.
The failed part alone should not have caused disaster, and the Captain existed within a hierarchy which should have been resilient, but together they produced shock and failure.
Perhaps one way to understand this is as ‘cascades’ running through low level systems and ‘combinant’ effects amplifying shock through systems, so the two may run in parallel. And, of course, generate confusion.
If you are interested in Failure, Complexity and Control, i am running an Open Workshop in June, as well as the free download of the book ‘To the Moon and Back – leadership reflections from Apollo’.