
A Cascade of Sand: Complex Systems in a Complex Time

We live in a world filled with rapid change: governments topple, people rise and fall, and technology has created a connectedness the world has never experienced before. Joshua Cooper Ramo believes this environment has created an “avalanche of ceaseless change.”

In his book The Age of the Unthinkable: Why the New World Disorder Constantly Surprises Us and What We Can Do About It, he outlines what this new world looks like and offers prescriptions for how best to deal with the disorder around us.

Ramo believes that we are entering a revolutionary age that will render seemingly fortified institutions weak, and weak movements strong. He feels we aren’t well prepared for these radical shifts because those in positions of power tend to approach issues with antiquated ideologies. Generally, they treat anything complex as one-dimensional.

Unfortunately, whether they are running corporations or foreign ministries or central banks, some of the best minds of our era are still in thrall to an older way of seeing and thinking. They are making repeated misjudgments about the world. In a way, it’s hard to blame them. Mostly they grew up at a time when the global order could largely be understood in simpler terms, when only nations really mattered, when you could think there was a predictable relationship between what you wanted and what you got. They came of age as part of a tradition that believed all international crises had beginnings and, if managed well, ends.

This is one of the main flaws of traditional thinking about managing conflict/change: we identify a problem, decide on a path forward, and implement that solution. We think in linear terms and see a finish line once the specific problem we have discovered is ‘solved.’

In this day and age (and probably in all days and ages, whether people realized it or not) we have to accept that the finish line is constantly moving and that, in fact, there never will be a finish line. Solving one problem may fix an issue for a time, but it tends to illuminate a litany of new problems. (Many of which were likely already present but hiding under the old problem you just “fixed.”)

In fact, our actions in trying to solve X will sometimes have a cascade effect because the world is actually a series of complex and interconnected systems.

Some great thinkers have spoken about these problems in the past. Ramo highlights some interesting quotes from the Nobel Prize speech that Austrian economist Friedrich August von Hayek gave in 1974, entitled The Pretence of Knowledge.

To treat complex phenomena as if they were simple, to pretend that you could hold the unknowable in the cleverly crafted structure of your ideas —he could think of nothing that was more dangerous. “There is much reason,” Hayek said, “to be apprehensive about the long-run dangers created in a much wider field by the uncritical acceptance of assertions which have the appearance of being scientific.”

Concluding his Nobel speech, Hayek warned, “If man is not to do more harm than good in his efforts to improve the social order, he will have to learn that in this, as in all other fields where essential complexity of an organized kind prevails, he cannot acquire the full knowledge which would make mastery of the events possible.” Politicians and thinkers would be wise not to try to bend history as “the craftsman shapes his handiwork, but rather to cultivate growth by providing the appropriate environment, in the manner a gardener does for his plants.”

This is an important distinction: the idea that we need to be gardeners instead of craftsmen. When we are merely creating something, we have a sense of control; we have a plan and an end state. When the shelf is built, it's built.

Being a gardener is different. You have to prepare the environment; you have to nurture the plants and know when to leave them alone. You have to make sure the environment is hospitable to everything you want to grow (different plants have different needs), and after the harvest you aren’t done. You need to turn the earth and, in essence, start again. There is no end state if you want something to grow.

* * *

So, if most of the threats we face today are so multifaceted and complex that we can’t use the majority of the strategies that have worked historically, how do we approach the problem? A Danish theoretical physicist named Per Bak had an interesting view of this, which he termed self-organized criticality, and it comes with an excellent experiment/metaphor that helps to explain the concept.

Bak’s research focused on answering the following question: if you created a cone of sand grain by grain, at what point would you create a little sand avalanche? This breakdown of the cone was inevitable but he wanted to know if he could somehow predict at what point this would happen.

Much like there is a precise temperature at which water starts to boil, Bak hypothesized there was a specific point at which the stack became unstable; at that point, adding a single grain of sand could trigger the avalanche.

In his work, Bak came to realize that the sandpile was inherently unpredictable. He discovered that there were times, even when the pile had reached a critical state, that an additional grain of sand would have no effect:

“Complex behavior in nature,” Bak explained, “reflects the tendency of large systems to evolve into a poised ‘critical’ state, way out of balance, where minor disturbances may lead to events, called avalanches, of all sizes.” What Bak was trying to study wasn’t simply stacks of sand, but rather the underlying physics of the world. And this was where the sandpile got interesting. He believed that sandpile energy, the energy of systems constantly poised on the edge of unpredictable change, was one of the fundamental forces of nature. He saw it everywhere, from physics (in the way tiny particles amassed and released energy) to the weather (in the assembly of clouds and the hard-to-predict onset of rainstorms) to biology (in the stutter-step evolution of mammals). Bak’s sandpile universe was violent —and history-making. It wasn’t that he didn’t see stability in the world, but that he saw stability as a passing phase, as a pause in a system of incredible —and unmappable —dynamism. Bak’s world was like a constantly spinning revolver in a game of Russian roulette, one random trigger-pull away from explosion.

Traditionally, our thinking is very linear; if we start thinking of systems as more like sandpiles, we begin to shift into nonlinear thinking. This means we can no longer assume that a given action will produce a given reaction: it may or may not, depending on the precise initial conditions.
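To make that concrete, here is a minimal sketch (in Python) of the grid-style sandpile simulation Bak and his colleagues studied. The grid size and toppling threshold are conventional modeling choices, not details from the book; the point is simply that most grains settle quietly while the occasional grain sets off a cascade whose size could not have been guessed from the grain itself.

```python
import random

# A minimal sketch of a Bak-Tang-Wiesenfeld-style sandpile simulation.
# Assumptions: a 20 x 20 grid, a toppling threshold of 4, and grains lost
# off the edges. These are conventional modeling choices, not details
# taken from Ramo's book.

SIZE = 20
THRESHOLD = 4

def drop_grain(grid):
    """Drop one grain at a random cell; return the avalanche size (number of topplings)."""
    x, y = random.randrange(SIZE), random.randrange(SIZE)
    grid[x][y] += 1
    unstable = [(x, y)]
    topplings = 0
    while unstable:
        i, j = unstable.pop()
        if grid[i][j] < THRESHOLD:
            continue
        grid[i][j] -= THRESHOLD              # the cell topples
        topplings += 1
        for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= ni < SIZE and 0 <= nj < SIZE:
                grid[ni][nj] += 1            # and pushes grains onto its neighbours
                unstable.append((ni, nj))
    return topplings

grid = [[0] * SIZE for _ in range(SIZE)]
sizes = [drop_grain(grid) for _ in range(100_000)]
print("grains that caused no avalanche:", sizes.count(0))
print("largest single avalanche:", max(sizes))
```

Run long enough, the pile organizes itself into a critical state in which avalanches of every size occur, which is exactly the mix of long quiet stretches and sudden upheavals Ramo describes.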

This dynamic sandpile energy demands that we accept the basic unpredictability of the global order —one of those intellectual leaps that sounds simple but that immediately junks a great deal of traditional thinking. It also produces (or should produce) a profound psychological shift in what we can and can’t expect from the world. Constant surprise and new ideas? Yes. Stable political order, less complexity, the survival of institutions built for an older world? No.

Ramo isn’t arguing that complex systems are incomprehensible and fundamentally flawed. These systems are manageable; they just require a divergence from the old ways of thinking, the linear way that didn’t account for all the invisible connections in the sand.

Look at something like the Internet; it’s a perfect example of a complex system with a seemingly infinite number of connections, yet it thrives. The system is constantly bombarded with unsuspected risk, but it is so malleable that it has yet to feel the force of an avalanche. The Internet was designed to thrive in a hostile environment, and its complexity was embraced. Unfortunately, for every adaptive system like the Internet there seems to be a maladaptive one, so rigid it will surely break in a world of complexity.

The Age of the Unthinkable goes on to show us historical examples of systems that did indeed break; this helps to frame where we have been particularly fragile in the past and where the mistakes in our thinking may have been. In the back half of the book, Ramo outlines strategies he believes will help us become more antifragile; he calls this approach “Deep Security.”

Implementing these strategies will likely be met with considerable resistance; many people in positions of power benefit from the systems staying as they are. Revolutions are never easy, but, as we’ve shown, even one grain of sand can have a huge impact.

Margin of Safety: An Introduction to the Mental Model

Previously on Farnam Street, we covered the idea of Redundancy — a central concept in both the world of engineering and in practical life. Today we’re going to explore a related concept: Margin of Safety.

The margin of safety is another concept rooted in engineering and quality control. Let’s start there, then see where else our model might apply in practical life, and lastly, where it might have limitations.

* * *

Consider a highly engineered jet engine part. If the part were to fail, the engine would also fail, perhaps at the worst possible moment—while in flight with passengers on board. Like most jet engine parts, let us assume the part is replaceable over time: while we don’t want to replace it too often (creating prohibitively high costs), we don’t expect it to last the lifetime of the engine. We design the part for 10,000 hours of average flying time.

That brings us to a central question: After how many hours of service do we replace this critical part? The easily available answer might be 9,999 hours. Why replace it any sooner than we have to? Wouldn’t that be a waste of money?

The first problem is, we know nothing of the composition of the 10,000 hours any individual part has gone through. Were they 10,000 particularly tough hours, filled with turbulent skies? Was it all relatively smooth sailing? Somewhere in the middle?

Just as importantly, how confident are we that the part will really last the full 10,000 hours? What if it had a slight flaw during manufacturing? What if we made an assumption about its reliability that was not conservative enough? What if the material degraded in bad weather to a degree we didn’t foresee?

The challenge is clear, and the implication obvious: we do not wait until the part has been in service for 9,999 hours. Perhaps at 7,000 hours, we seriously consider replacing the part, and we put a hard stop at 7,500 hours.

The difference between waiting until the last minute and replacing it comfortably early gives us a margin of safety. The sooner we replace the part, the more safety we have—by not pushing the boundaries, we leave ourselves a cushion. (Ever notice how your gas tank indicator goes on long before you’re really on empty? It’s the same idea.)

The principle is essential in bridge building. Let’s say we calculate that, on an average day, a proposed bridge will be required to support 5,000 tons at any one time. Do we build the structure to withstand 5,001 tons? I'm not interested in driving on that bridge. What if we get a day with much heavier traffic than usual? What if our calculations and estimates are a little off? What if the material weakens over time at a rate faster than we imagined? To account for these possibilities, we build the bridge to support 20,000 tons. Only now do we have a margin of safety.
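As a quick back-of-the-envelope check, here is the margin implicit in the two examples so far, expressed the way an engineer would: as a factor of safety, capacity divided by expected demand. The numbers are taken straight from the text above; this is an illustration, not a design rule.

```python
# Factor of safety = capacity / expected demand; figures from the examples above.

def factor_of_safety(capacity: float, expected_demand: float) -> float:
    return capacity / expected_demand

# Jet engine part: rated for 10,000 hours, hard replacement stop at 7,500 hours.
print(f"engine part: {factor_of_safety(10_000, 7_500):.2f}x")   # ~1.33x

# Bridge: built to support 20,000 tons against an expected 5,000-ton peak load.
print(f"bridge: {factor_of_safety(20_000, 5_000):.2f}x")        # 4.00x

# Building the bridge for 5,001 tons would give a factor of roughly 1.0: no margin at all.
```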

This fundamental engineering principle is useful in many practical areas of life, even for non-engineers. Let’s look at one we all face.

* * *

Take a couple earning $100,000 per year after taxes, or about $8,300 per month. In designing their life, they must necessarily decide what standard of living to enjoy. (The part which can be quantified, anyway.) What sort of monthly expenses should they allow themselves to accumulate?

One all-too-familiar approach is to build in monthly expenses approaching $8,000: a $4,000 mortgage, $1,000 worth of car payments, $1,000/month for private schools…and so on. The couple rationalizes that they have “earned” the right to live large.

However, what if life throws some massive unexpected expenditures their way, as it so often does? What if one of them lost their job and their combined monthly income dropped to $4,000?

The couple must ask themselves whether the ensuing misery is worth the lavish spending. If they kept up their $8,000/month spending habit after a loss of income, they would have to choose between two difficult paths: Rapidly eating into their savings or considerably downsizing their life. Either is likely to cause extreme misery from the loss of long-held luxuries.

Thinking in reverse, how can we avoid the potential misery?

A common refrain is to tell the couple to make sure they’ve stashed away some money in case of emergency, to provide a buffer. Often there is a specific multiple of current spending we’re told to have in reserve—perhaps 6-12 months. In this case, savings of $48,000-$96,000 should suffice.

However, is there a way we can build them a much larger margin for error?

Let’s say the couple decides instead to permanently limit their monthly spending to $4,000 by owning a smaller house, driving less expensive cars, and trusting their public schools. What happens?

Our margin of safety now compounds. Obviously, a savings rate exceeding 50% will rapidly accumulate in their favor — $4,300 put away by the first month, $8,600 by the second month, and so on. The mere act of systematically underspending their income rapidly gives them a cushion without much trying. If an unexpected expenditure comes up, they’ll almost certainly be ready.
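A rough sketch of the two plans, using the numbers above and ignoring investment returns (which would only widen the gap), shows how quickly the cushion compounds:

```python
# Figures from the example above: $8,300 per month of after-tax income,
# compared under an $8,000 and a $4,000 monthly spending plan. Investment
# returns are ignored, which understates the benefit of the higher savings rate.

INCOME = 8_300

def months_of_runway(monthly_spending: float, months_saved: int) -> float:
    """How many months of expenses the accumulated savings would cover."""
    savings = (INCOME - monthly_spending) * months_saved
    return savings / monthly_spending

for spending in (8_000, 4_000):
    runway = months_of_runway(spending, months_saved=24)
    print(f"spending ${spending:,}/month: after two years, savings cover "
          f"{runway:.1f} months of expenses")

# spending $8,000/month: after two years, savings cover 0.9 months of expenses
# spending $4,000/month: after two years, savings cover 25.8 months of expenses
```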

The unseen benefit, and the extra margin of safety in this choice, comes if either spouse loses their income – either by choice (perhaps to care for a child) or by bad luck (health issues). In this case, not only have substantial savings accumulated in their favor, but because their spending is systematically low, they are able to avoid tapping those savings altogether! Their savings simply stop growing temporarily while they live on one income. This sort of “belt and suspenders” solution is the essence of margin-of-safety thinking.

(On a side note: Let’s take it even one step further. Say their former $8,000 monthly spending rate meant they probably could not retire until age 70, given their current savings rate, investment choices, and desired lifestyle post-retirement. Reducing their needs to $4,000 not only provides them much needed savings, quickly accelerating their retirement date, but they now need even less to retire on in the first place. Retiring at 70 can start to look like retiring at 45 in a hurry.)

* * *

Clearly, the margin of safety model is very powerful and we’re wise to use it whenever possible to avoid failure. But it has limitations.

One obvious issue, most salient in the engineering world, comes in the tradeoff with time and money. Given an unlimited runway of time and the most expensive materials known to mankind, it’s likely that we could “fail-proof” many products to such a ridiculous degree as to be impractical in the modern world.

For example, it’s possible to imagine Boeing designing a plane that would have a fail rate indistinguishable from zero, with parts being replaced 10% into their useful lives, built with rare but super-strong materials, etc.—so long as the world was willing to pay $25,000 for a coach seat from Boston to Chicago. Given the impracticability of that scenario, our tradeoff has been to accept planes that are not “fail-proof,” but merely extremely unlikely to fail, in order to give the world safe enough air travel at an affordable cost. This tradeoff has been enormously wise and helpful to the world. Simply put, the margin-of-safety idea can be pushed into farce without careful judgment.

* * *

This brings us to another limitation of the model, which is the failure to engage in “total systems” thinking. I'm reminded of a quote I've used before at Farnam Street:

“The reliability that matters is not the simple reliability of one component of a system, but the final reliability of the total control system.” — Garrett Hardin in Filters Against Folly

Let’s return to the Boeing analogy. Say we did design the safest and most reliable jet airplane imaginable, with parts that would not fail in one billion hours of flight time under the most difficult weather conditions imaginable on Earth—and then let it be piloted by a drug addict high on painkillers.

The problem is that the whole flight system includes much more than just the reliability of the plane itself. Just because we built in safety margins in one area does not mean the system will not fail. This illustrates not so much a failure of the model itself, but a common mistake in the way the model is applied.
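A back-of-the-envelope way to see Hardin's point: when components sit in series, the reliability of the whole flight is roughly the product of the reliabilities of every component, human ones included. The figures below are invented purely for illustration.

```python
# Serial-system reliability: the whole chain is roughly the product of its
# links. All figures below are invented purely for illustration.

plane_reliability = 1 - 1e-9   # a nearly fail-proof airframe
pilot_reliability = 0.99       # an impaired pilot (hypothetical figure)

flight_reliability = plane_reliability * pilot_reliability
print(f"flight reliability: {flight_reliability:.9f}")

# Roughly 0.99, not 0.999999999: the weakest serial component dominates,
# so heroic margins in one component buy little if another link is unreliable.
```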

* * *

Which brings us to a final issue with the margin of safety model—naïve extrapolation of past data. Let’s look at a common insurance scenario to illustrate this one.

Suppose we have a 100-year-old reinsurance company – PropCo – which reinsures major primary insurers in the event of property damage in California caused by a catastrophe – the most worrying being an earthquake and its aftershocks. Throughout its entire (long) history, PropCo had never experienced a yearly loss on this sort of coverage worse than $1 billion. Most years saw no loss worse than $250 million, and in fact, many years had no losses at all – giving them comfortable profit margins.

Thinking like engineers, the directors of PropCo insisted that the company maintain a financial position strong enough to safely cover a loss twice as bad as anything it had ever encountered. Given their historical losses, the directors believed this extra capital would give PropCo a comfortable “margin of safety” against the worst case. Right?

However, our directors missed a few crucial details. The $1 billion loss, the insurer’s worst, had been incurred in the year 1994 during the Northridge earthquake. Since then, the building density of Californian cities had increased significantly, and due to ongoing budget issues and spreading fraud, strict building codes had not been enforced. Considerable inflation in the period since 1994 also ensured that losses per damaged square foot would be far higher than ever faced previously.

With these conditions present, let’s propose that California is hit with an earthquake reading 7.0 on the Richter scale, with an epicenter 10 miles outside of downtown LA. PropCo faces a bill of $5 billion – not twice as bad, but five times as bad as it had ever faced. In this case, PropCo fails.
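Here is a stylized sketch of the directors' mistake. The $1 billion worst case and the "twice as bad" capital rule come from the illustration above; the growth multipliers are invented assumptions, chosen only to reproduce the $5 billion loss in the story. Capital was sized off the historical record while the underlying exposure quietly multiplied.

```python
# Stylized sketch: the $1B historical worst loss and the "twice as bad" capital
# rule come from the illustration above; the growth multipliers are invented
# assumptions chosen to reproduce the $5B loss in the text.

worst_historical_loss = 1.0                  # in $ billions (1994 Northridge)
capital_buffer = 2 * worst_historical_loss   # the directors' "margin of safety"

building_density_growth = 2.0    # more property in harm's way (assumed)
code_enforcement_erosion = 1.25  # weaker building-code enforcement (assumed)
cost_inflation = 2.0             # higher repair cost per square foot (assumed)

loss_today = (worst_historical_loss * building_density_growth
              * code_enforcement_erosion * cost_inflation)

print(f"capital held: ${capital_buffer:.1f}B, loss faced: ${loss_today:.1f}B")
print("PropCo fails" if loss_today > capital_buffer else "PropCo survives")
```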

This illustration (which recurs every so often in the insurance field) shows the limitation of naïvely assuming a margin of safety is present based on misleading or incomplete past data.

* * *

Margin of safety is an important component of some decisions. You can think of it as a reservoir that absorbs errors or poor luck. Size matters. At least, in this case, bigger is better. And if you need a calculator to figure out how much room you have, you're doing something wrong.

Margin of safety is part of the Farnam Street Latticework of Mental Models.

An Introduction to the Mental Model of Redundancy (with examples)

“The reliability that matters is not the simple reliability of one component of a system,
but the final reliability of the total control system.”

Garrett Hardin

* * *

We learn from engineering that critical systems often require backup systems to guarantee a certain level of performance and to minimize downtime. These systems are resilient to adverse conditions, and if one component fails there is spare capacity or a backup system to take its place.

A simple example where you want to factor in a large margin of safety is a bridge. David Dodd, a longtime colleague of Benjamin Graham, observed “You build a bridge that 30,000-pound trucks can go across and then you drive 10,000-pound trucks across it. That is the way I like to go across bridges.”

Looking at failures offers many insights into redundancy.

There are many cases of failures where the presence of redundant systems would have averted catastrophe. On the other hand, there are cases of failure where the presence of redundancy caused failure.

How can redundancy cause failure?

First, in certain cases, the added benefits of redundancy are outweighed by the risks of added complexity. Since adding redundancy increases the complexity of a system, efforts to increase reliability and safety through redundant systems may backfire and inadvertently make systems more susceptible to failure. An example of how adding complexity to a system can increase the odds of failure can be found in the near-meltdown of the Fermi reactor in 1966. The incident was caused by an emergency safety device that broke off and blocked a pipe, stopping the flow of coolant into the reactor core. Luckily, this was before the plant was active.

Second, redundancy with people can lead to a diffusion of responsibility, where everyone assumes that someone else has things covered.

Third, redundancy can lead to increasingly risky behavior.

* * *

In Reliability Engineering for Electronic Design, Norman Fuqua gives a great introduction to the concept of redundancy.

Webster's defines redundancy as needless repetition. In reliability engineering, however, redundancy is defined as the existence of more than one means for accomplishing a given task. Thus all of these means must fail before there is a system failure.

Under certain circumstances during system design, it may become necessary to consider the use of redundancy to reduce the probability of system failure–to enhance systems reliability–by providing more than one functional path or operating element in areas that are critically important to system success. The use of redundancy is not a panacea to solve all reliability problems, nor is it a substitute for good initial design. By its very nature, redundancy implies increased complexity, increased weight and space, increased power consumption, and usually a more complicated system …

In Seeking Wisdom, Peter Bevelin mentioned some interesting quotes from Buffett and Munger that speak to the concept of redundancy/resilience from the perspective of business:

Charlie Munger
Of course you prefer a business that will prosper even if it is not managed well. We are not looking for mismanagement; we like the capacity to withstand it if we stumble into it….We try and operate so that it wouldn't be too awful for us if something really extreme happened – like interest rates at 1% or interest rates at 20%… We try to arrange [our affairs] so that no matter what happens, we'll never have to “go back to go.”

Warren Buffett uses the concept of margin of safety for investing and insurance:
We insist on a margin of safety in our purchase price. If we calculate the value of a common stock to be only slightly higher than its price, we're not interested in buying. We believe this margin-of-safety principle, so strongly emphasized by Ben Graham, to be the cornerstone of investment success.

David Dodd, on the same topic, writes:

You don't try to buy something for $80 million that you think is worth $83,400,000.

Buffett on Insurance:

If we can't tolerate a possible consequence, remote though it may be, we steer clear of planting its seeds.

The pitfalls of this business mandate an operating principle that too often is ignored: Though certain long-tail lines may prove profitable at combined ratios of 110 or 115, insurers will invariably find it unprofitable to price using those ratios as targets. Instead, prices must provide a healthy margin of safety against the societal trends that are forever springing expensive surprises on the insurance industry.

Confucius comments:

The superior man, when resting in safety, does not forget that danger may come. When in a state of security he does not forget the possibility of ruin. When all is orderly, he does not forget that disorder may come. Thus his person is not endangered, and his States and all their clans are preserved.

Warren Buffett talked about redundancy from a business perspective at the 2009 shareholder meeting:

Question: You've talked a lot about opportunity-costs. Can you discuss more important decisions over the past year?

Buffett: When both prices are moving and in certain cases intrinsic business value moving at a pace that's far greater than we've seen – it's tougher, more interesting and more challenging and can be more profitable. But, it's a different task than when things were moving at more leisurely pace. We faced that problem in September and October. We want to always keep a lot of money around. We have so many extra levels of safety we follow at Berkshire.

We got a call on Goldman on a Wednesday – that couldn't have been done the previous Wednesday or the next Wednesday. We were faced with opportunity-cost – and we sold something that under normal circumstances we wouldn't.

Jonathan Bendor, writing in Parallel Systems: Redundancy in Government, provides an example of how redundancy can reduce the risk of failure in cars.

Suppose an automobile had dual breaking (sic) circuits: each circuit can stop the car, and the circuits operate independently so that if one malfunctions it does not impair the other. If the probability of either one failing is 1/10, the probability of both failing simultaneously is (1/10)^2, or 1/100. Add a third independent circuit and the probability of the catastrophic failure of no brakes at all drops to (1/10)^3, or 1/1,000.
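Bendor's arithmetic generalizes to any number of independent backups. A small sketch of the same calculation, using his illustrative 1-in-10 failure probability:

```python
# With truly independent redundant components, failure probabilities multiply.

def all_fail_probability(p_failure: float, n_components: int) -> float:
    """Probability that every one of n independent components fails."""
    return p_failure ** n_components

for n in (1, 2, 3):
    print(f"{n} circuit(s): {all_fail_probability(0.1, n):.3f}")

# 1 circuit(s): 0.100
# 2 circuit(s): 0.010
# 3 circuit(s): 0.001
```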

Airplane design provides an insightful example. From the Code of Federal Regulations:

The airplane systems and associated components, considered separately and in relation to other systems, must be designed so that the occurrence of any failure condition which would prevent the continued safe flight and landing of the airplane is extremely improbable, and the occurrence of any other failure conditions which would reduce the capacity of the airplane or the ability of the crew to cope with adverse operating conditions is improbable.

* * *

Ways redundancy can fail

In The Problem of Redundancy Problem: Why More Nuclear Security Forces May Produce Less Nuclear Security, Scott Sagan writes:

The first problem with redundancy is that adding extra components can inadvertently create a catastrophic common-mode error (a fault that causes all the components to fail). In complex systems, independence in theory (or in design) is not necessarily independence in fact. As long as there is some possibility of unplanned interactions between the components leading to common-mode errors, however, there will be inherent limits to the effectiveness of redundancy as a solution to reliability problems. The counterproductive effects of redundancy when extra components present even a small chance of producing a catastrophic common-mode error can be dramatic.

This danger is perhaps most easily understood through a simple example from the commercial aircraft industry. Aircraft manufacturers have to determine how many engines to use on jumbo jets. Cost is clearly an important factor entering their calculations. Yet so is safety, since each additional engine on an aircraft both increases the likelihood that the redundant engine will keep the plane in the air if all others fail in flight and increases the probability that a single engine will cause an accident, by blowing up or starting a fire that destroys all the other engines and the aircraft itself.

In an illustration not reproduced here, I assume that 40% of the time that each engine fails, it does so in a way (such as starting a catastrophic fire) that causes all the other engines to fail as well.

Aircraft manufacturers make similar calculations in order to estimate how many engines would maximize safety. Boeing, for example, used such an analysis to determine that, given the reliability of modern jet engines, putting two engines on the Boeing 777, rather than three or more engines as exist on many other long-range aircraft, would result in lower risks of serious accidents.

In more complex systems or organizations, however, it is often difficult to know when to stop adding redundant safety devices because of the inherent problem of predicting the probabilities of exceedingly rare events.
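To see why more engines do not always mean more safety, here is a simplified model in the spirit of Sagan's example rather than his actual calculation. The 1% per-flight engine failure probability is an invented figure; the 40% catastrophic share comes from the passage above.

```python
# A simplified model in the spirit of Sagan's engine example, not his actual
# calculation. Assumptions: each engine independently fails with probability
# P_FAIL per flight (an invented figure); per the passage above, 40% of those
# failures are catastrophic and destroy the aircraft. The plane is lost if any
# engine fails catastrophically, or if every engine fails.

P_FAIL = 0.01          # per-engine failure probability (illustrative assumption)
P_CATASTROPHIC = 0.4   # share of failures that take the whole aircraft down

def accident_probability(n_engines: int) -> float:
    p_cat = P_FAIL * P_CATASTROPHIC           # an engine fails catastrophically
    p_benign = P_FAIL * (1 - P_CATASTROPHIC)  # an engine fails but is contained
    # Survival requires no catastrophic failure and at least one engine running.
    p_survive = (1 - p_cat) ** n_engines - p_benign ** n_engines
    return 1 - p_survive

for n in (1, 2, 3, 4):
    print(f"{n} engine(s): accident probability {accident_probability(n):.4%}")

# Under these assumptions two engines minimize the risk; a third or fourth
# engine adds more common-mode exposure than redundancy benefit.
```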

The second way in which redundancy can backfire is when diffusion of responsibility leads to “social shirking.”

This common phenomenon—in which individuals or groups reduce their reliability in the belief that others will take up the slack—is rarely examined in the technical literature on safety and reliability because of a “translation problem” that exists when transferring redundancy theory from purely mechanical systems to complex organizations. In mechanical engineering, the redundant units are usually inanimate objects, unaware of each other's existence. In organizations, however, we are usually analyzing redundant individuals, groups, or agencies, backup systems that are aware of one another.

The third basic way in which redundancy can be counterproductive is when the addition of extra components encourages individuals or organizations to increase production in dangerous ways. In most settings, individuals and organizations face both production pressures and pressure to be safe and secure. If improvements in safety and security, however, lead individuals to engage in inherently risky behavior—driving faster, flying higher, producing more nuclear energy, etc.—then expected increases in system reliability could be reduced or even eliminated. Research demonstrates, for example, that laws requiring “baby-proof” safety caps on aspirin bottles have led to an increase in child poisoning because parents leave the bottles outside the medicine cabinet.

* * *

Another example of people overconfident in redundant systems can be found in the Challenger disaster:

A dramatic case in point is the January 1986 space shuttle Challenger explosion. A strong consensus about the basic technical cause of the accident emerged soon afterward with the publication of the Rogers Commission report: the unprecedented cold temperature at the Kennedy Space Center at the time of launch caused the failure of two critical O-rings on a joint in the shuttle's solid rocket booster, producing a plume of hot propellant gases that penetrated the shuttle's external fuel tank and ignited its mixture of liquid hydrogen and oxygen. In contrast to the technical consensus, a full understanding of why NASA officials and Morton Thiokol engineers decided to launch the shuttle that day, despite the dangerously cold weather, has been elusive.

The Challenger launch decision can be understood as a set of individuals overcompensating for improvements in space shuttle safety that had been produced through the use of redundant O-rings. This overcompensation interpretation differs significantly from both the traditional arguments that “production pressures” forced officials to break safety rules and consciously accept an increased risk of an accident to permit the launch to take place and Diane Vaughan's more recent argument, which focuses instead on how complex rules and engineering culture in NASA created “the normalization of deviance” in which risky operations were accepted unless it could be proven that they were extremely unsafe. The production pressures explanation—that high-ranking officials deliberately stretched the shuttle flight safety rules because of political pressure to have a successful launch that month—was an underlying theme of the Rogers Commission report and is still a widely held view today. The problem with the simple production pressure explanation is that Thiokol engineers and NASA officials were perfectly aware that the resilience of an O-ring could be reduced by cold temperature and that the potential effects of the cold weather on shuttle safety were raised and analyzed, following the existing NASA safety rules, on the night of the Challenger launch decision.

Vaughan's argument focuses on a deeper organizational pathology: “the normalization of deviance.” Engineers and high-ranking officials had developed elaborate procedures for determining “acceptable risk” in all aspects of shuttle operations. These organizational procedures included detailed decision-making rules among launch officials and the development of specific criteria by which to judge what kinds of technical evidence could be used as an input to the decision. The Thiokol engineers who warned of the O-ring failure on the night before the launch lacked proper engineering data to support their views and, upon consideration of the existing evidence, key managers, therefore, unanimously voted to go ahead with the launch.

Production pressures were not the culprits, Vaughan insists. Well-meaning individuals were seeking to keep the risks of an accident to a minimum, and were just following the rules (p. 386). The problem with Vaughan's argument, however, is that she does not adequately explain why the engineers and managers followed the rules that night. Why did they not demand more time to gather data, or protest the vote in favor of a launch, or more vigorously call for a postponement until that afternoon when the weather was expected to improve?

The answer is that the Challenger accident appears to be a tragic example of overcompensation. There were two O-rings present in the critical rocket booster joint: the primary O-ring and the secondary O-ring were listed as redundant safety components because they were designed so that the secondary O-ring would seal even if the first leaked because of “burn through” by hot gases during a shuttle launch. One of the Marshall Space Center officials summarized the resulting belief: “We had faith in the tests. The data said that the primary would always push into the joint and seal . . . . And if we didn't have a primary seal in the worst case scenario, we had faith in the secondary” (p. 105).

This assumption was critical on the night of January 27, 1986, for all four senior Thiokol managers reversed their initial support for postponing the launch when a Marshall Space Center official reminded them of the backup secondary O-ring. “We were spending all of our time figuring out the probability of the primary seating,” one of the Thiokol managers later noted: “[t]he engineers, Boisjoly and Thompson, had expressed some question about how long it would take that [primary] O-ring to move, [had] accepted that as a possibility, not a probability, but it was possible. So, if their concern was a valid concern, what would happen? And the answer was, the secondary O-ring would seat” (p. 320).

In short, the Challenger decision makers failed to consider the possibility that the cold temperature would reduce the resilience of both O-rings in the booster joint since that low probability event had not been witnessed in the numerous tests that had been conducted. That is, however, exactly what happened on the night of unprecedented cold temperatures. Like many automobile drivers, these decision makers falsely believed that redundant safety devices allowed them to operate in more dangerous conditions without increasing the risk of a catastrophe.
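One way to see the decision makers' error is a toy calculation with invented probabilities (not NASA's actual figures): the redundancy argument implicitly assumed the two O-rings would fail independently, while the unprecedented cold was a common cause degrading both seals at once.

```python
# A toy calculation with invented probabilities, not NASA's actual figures.
# The redundancy argument implicitly assumed the two O-rings would fail
# independently; the cold was a common cause acting on both seals at once.

p_fail_normal = 0.01   # chance a single O-ring fails to seal (assumed)
p_fail_cold = 0.30     # the same O-ring on an unprecedentedly cold night (assumed)

independent_estimate = p_fail_normal ** 2   # what the warm-weather test data implied
common_cause_estimate = p_fail_cold ** 2    # both seals degraded by the same cold

print(f"assumed joint failure probability: {independent_estimate:.4%}")
print(f"joint failure with a common cause: {common_cause_estimate:.2%}")

# 0.0100% versus 9.00%: the shared condition, not the seal count, drives the risk.
```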

Redundancy is part of the Farnam Street latticework of mental models.