Sunday, December 29, 2013

Chain of Events, Chain of Supply and Quality of Decisions in Disasters: The Case of Piper Alpha

Sometimes looking again at past events provides a new perspective that may not have been as apparent at the time. The Piper Alpha offshore drilling platform incident in 1988 may be one of those unfortunate events that provides insight for supply chain disruptions and for better decision making during response and recovery.

I had the chance to watch the presentation by Brian Appleton, the Technical Assessor to the Cullen Inquiry into the event. You can see it here by clicking on the picture, and I highly recommend it (note: it's a little long). Nat Geo does a great job with the details: LINK to MOVIE

 Appleton Video of Piper Incident

Below are a couple of slides on the lessons already learned. They are from a presentation by David Reynolds of Clyde and Co, during a 2013 conference by Lloyds (link). Of the 226 crewmembers, 167 were killed, 30 of whose bodies were never recovered. Only 59 men survived and most of these were scarred for life, not only from horrific burns, but from the memory of the explosion and fire on the Piper Alpha on 6th July 1988 and the loss of lifelong friends and workmates.

But there are three areas that may be worth looking at again. These have to do with the effects of the supply chain (in this case multiple platforms) and the decision quality of those in charge:

1) The effect of "chaining supplies": 
First is communication across separate entities in a "chain". What caused the real damage at Piper, and made the situation astronomically worse, was that the supply from the other two platforms (which was being routed to Piper to consolidate and send to shore) was never stopped! The continuous supply fed the fire, which weakened the gas lines, which eventually caused the major gas line to rupture. Reports of the disaster note:

"The Tartan (11 miles away) and Claymore (21 miles away) platforms continued to supply oil and gas, despite the flames from Piper being visible to them. If they had shut down the supplies to Piper, the fire and subsequent explosions would have been much less severe and may have been limited to the Gas Module. Although the explosion and fire caused by the escape of gas from the PSV blinds was the initial cause of the disaster, the failure and rupture of the gas risers were responsible for Piper's destruction and preventing the crewmembers' evacuation." (Source).

Here's what is written about the accident: "Despite the fire on Piper being visible from both these platforms, gas was still being supplied from Claymore and Tartan, and would continue for some time". With the gas supply from Tartan "there was no way of going back", as increasing fire and explosions caused more of the supply piping to fail and add fuel to the fire... the result was explosion after explosion.

What does this mean in supply chain language? To start, this is the equivalent of a supplier providing parts that are causing downtime downstream for the manufacturer, while the manufacturer doesn't know the cause. Since the manufacturer is unable to communicate, and the supplier doesn't know (and doesn't realize the ramifications of his action), the problem escalates. The supplier keeps sending parts that cause more damage to the manufacturer. How Claymore and Tartan reacted is essentially the same: they never thought about their fresh fuel adding to the inferno.

In a supply chain context, this is the equivalent of an oblivious supplier providing parts even if the manufacturer cannot use them. Except in this case the extra supply extended the problems being faced by the manufacturer. One way to look at it is below:

2) The Butterfly effect: The massive inquiries on this disaster by the British authorities highlight the primary cause of failure as paperwork. Their findings are quite sobering: two work permits, one on a pump and the other on its safety valve, were the cause, simply because they were not kept together. The pump was back online, but without a safety valve, and those who needed to know did not know. The operators used it. This caused the first explosion, which then caused a loose-fitting metal disk to cause a second explosion. The second explosion caused the firewalls to break up and shatter into piping. One of the pipes was carrying condensate, which caused a larger explosion, which caused fuel to leak onto rubber matting that divers had left on the rig. This provided a long-lasting fire, which caused high pressure pipes providing fuel from Tartan to burst... from there, other pipes blew up one after another in a sequence as the rig became hotter and hotter.

We could tell the cause of this regrettable story this way:

Faulty paperwork 
Caused the pump without the safety valve to 
Cause an explosion to 
Cause the loose-fitting metal disk to 
Cause a larger explosion to 
Cause the firewalls to burst, and fly around like bullets to 
Cause rupture to an oil pipe that dripped onto the improperly placed diving rubber mats to 
Cause a pool of crude oil, to 
Cause a hot enough fire to 
Cause a high pressure gas line to burst to 
Cause hotter and more ongoing fire to 
Cause other pipes carrying fuel from other platforms to burst, to 
Cause Piper Alpha and 167 of its crew to fall to the bottom of the ocean.

In short, it was a tightly linked chain of events in a complex system that caused the failure in Piper Alpha. 
A break anywhere in the chain could have made the damage much less. 
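
To put rough numbers on that last point, here is a minimal sketch. It assumes each link in the chain passes the failure on to the next with some probability, and that the links are independent. The probabilities are made up for illustration and are not taken from the Piper Alpha inquiry; `cascade_probability` and `with_barrier` are hypothetical helper names.

```python
from math import prod

# Hypothetical, illustrative probabilities that each link in a chain of
# events "passes the failure on" to the next link. Not Piper Alpha data.
link_probabilities = [0.9, 0.8, 0.9, 0.7, 0.9]


def cascade_probability(links):
    """Probability that every link fails in turn, assuming independence."""
    return prod(links)


def with_barrier(links, index, new_probability):
    """Return a copy of the chain with one link made less likely to propagate."""
    improved = list(links)
    improved[index] = new_probability
    return improved


baseline = cascade_probability(link_probabilities)
# Strengthen a single link (for example, a working communication channel).
one_barrier = cascade_probability(with_barrier(link_probabilities, 3, 0.07))

print(f"baseline: {baseline:.3f}")
print(f"with one barrier: {one_barrier:.3f}")
```

Because the chance of the full cascade is a product, cutting any single link's probability by a factor of ten cuts the chance of the whole cascade by the same factor, no matter where in the chain that link sits.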

3) Decision quality, preparedness of decision makers: 
Many of those killed died from asphyxiation because they decided to stay in the accommodation section of the platform. Most of those who took the risk of jumping off the platform actually survived. Those who stayed apparently were never told that they had better odds if they left the quarters. Of course, no safety manual would suggest that a worker jump off the equivalent of 11 stories into a sea covered in thick black smoke. However, if those in charge had been able to read the situation better, perhaps they would have told others to follow those who jumped, of whom 59 survived.

What do we learn? As systems become more complex, the probability of small events causing larger ones becomes more real. This is already known. A rogue trader can bring down a banking empire (Nick Leeson and Barings Bank). A faulty supplier using bad paint can cause major damage to a major toy manufacturer (Mattel). A momentary lapse by a train operator kills people and shuts down the New York rail system for an entire day (see a post below), and issues with batteries used for backup and start can ground an entire fleet of airplanes (Boeing) for months. So, it should not be surprising to see small matters cause major issues. Of course, if we could predict these small issues, we would not have to deal with aftermaths like these. So, it may be best to raise vigilance and the ability to respond, or the cliche word: Resilience. Resilience at the individual level, at the work group level, at the organizational level and at the supply chain level can help - not just to avoid mishaps, but to be better prepared to deal with the aftermath.

Brian Appleton's report and presentation on Piper Alpha purposefully mention how "The details of an industrial accident don't repeat themselves". Indeed, it is these details that will be difficult to predict and control to the full extent. Rather, a bit of situational awareness on the part of the suppliers and managers may have helped limit the damage here.

Thursday, December 5, 2013

New York Train Derailment, “loss of awareness argument” and the topic of near-misses

Learning from “Near Miss” Incidents and the “loss of awareness argument”
The “crying wolf” Story:
Maybe the Villagers should have asked more questions. 

Arash Azadegan, PhD and Andryi Petronchak, MBA

This week’s train derailment in New York City caused us to refocus on the topic of near misses. Reports suggest that the driver of the Metro-North Railroad commuter train “experienced loss of awareness”. Labelled as an episode of “highway hypnosis”, this particular accident killed four passengers, injured many others and created havoc in the area’s rail system for days[i].
We have all had a similar situation happen to us. While driving a car, the cell phone starts to ring. The number on the cell phone screen is too important to not answer the call. We know that it’s wrong to answer the call while driving at high speeds; it may be unlawful, and it clearly is dangerous!

The car may have drifted into the other lane - slightly. We may have gotten scared for a few milliseconds, put the phone away, slowed down, and felt a moment of guilt. This was our version of a “momentary loss of awareness”. But it ended up as a faultless near-miss. Nothing happened, right? Even if it did, it would be causing damage to our own lives. We forgive ourselves, and go back to the norms of our lives. We may soon forget about the whole thing.
Now consider this: We are driving a vehicle at three hundred miles per hour. There are a couple of hundred people in our vehicle. Here, a “momentary loss of awareness” may have more serious consequences. It may mean derailment. If it does end up there, it would cause damage to others’ lives and livelihoods. In that case, things are different. We can’t forgive ourselves, and we can’t go back to the norm. We may never forget about it – ever.

The hypothetical “distance” between guilt and no guilt, lost lives and kept lives, is decided during those few seconds. These types of critical seconds don’t just happen on Sunday mornings on New York’s rail cars. They are what commercial pilots, cruise ship captains, high-speed train drivers and other professionals with people’s lives in their hands operate in every hour of every work day.  Added up, the social structure of our transportation system, service delivery system, and the chain of events supporting them deals with many hundreds of these critical seconds every week. And why is it that we do not learn from these many hundreds and thousands of hours of experience? Why is it that near-misses do not provide enough of a basis to avoid the real catastrophe?
A near miss is an incident that does not cause physical harm, sickness, or property damage, but has a high probability of serious impact on either people or material assets. In other words, a “near miss” is a variation in a normal process that, if continued, could have a negative effect on people or valuables of a different kind. Some call it a “false alarm”. Often a lucky interruption in the sequence of mishaps during a near miss keeps the damage from taking hold. Near misses can be an effective source of input for organizational learning, because the downside costs of the damage are not present. We have a near-miss each time we reach over to re-program the GPS device and the car swerves over to the next lane.
In “The psychology of the Near Miss” R.L. Reid [ii] gives an interesting definition of the term from the perspective of the game and gambling industry.
“A near miss is a special kind of failure to reach a goal, one that comes close to being successful. A shot at a target is said to hit the mark, or to be a near miss, or to go wide. In a game of skill, like shooting, a near miss gives useful feedback and encourages the player by indicating that success may be within reach. By contrast, in games of pure chance, such as lotteries and slot machine games, it gives no information that could be used by a player to increase the likelihood of future success”.

Reid suggests that how useful a “near miss” is can depend on how the information is collected, processed and interpreted. Too many “near misses” go undetected because the systems are not in place to look for them. Often we consider them as a (un)lucky turn of events that, as Reid suggests, “give no information to increase the likelihood of success”. But running an operation is not about pure chance. Operating a car, a train or a company is more like a game of skill than a game of chance. So learning from near-misses should be part of the process of getting better.
So the question comes up again: Why do we have an issue learning from near-misses or even from false alarms? Why do companies (just like operators, pilots and even auto drivers) fail to incorporate these learnings into their processes? There are several possible reasons:

First, reporting any problem (personal or system related) can be difficult. Thinking, picturing, talking and exchanging hypothetical scenarios about near misses tends to raise the emotional flare-ups of past accidents and the uncomfortable memories that accompanied them. Of course, there is also the blame-game. Often the person raising an issue is among those who are blamed for the cause or volunteered to fix it. For many over-worked members of an organization, it’s best to not report anything!
Second, there is more to learning from near-misses than reporting them. By emphasizing the importance of “near misses”, we don’t just encourage reporting all the “near misses” that take place in the organization. That information also has to make sense from an operational and occupational safety perspective.
Third, how to learn from near-misses, and how to allocate resources to reporting them, can be debatable. How to properly interpret the incidents, and incorporate the conclusions into risk management practices, can be complicated.  “Digesting” the information is the ultimate purpose of almost every business process, but each business has to find its own way of processing the information.
Fourth, by reporting too many minor near-misses we risk devaluing such information. Processing their real message becomes harder as we “overload” the system.  With too much noise, the real possible message gets lost. Ironically, we create another type of “loss of awareness” of what causes accidents. But this time, instead of loss of awareness by the operator, the entire system is lost in the details. 
Remember the boy from the folktale, screaming “Wolf!” to fool the villagers and amuse himself? When the real wolf attacked the boy, nobody took his scream for help seriously because the people had grown immune to his many annoying jokes. In his blog titled “Public warnings--why crying wolf is downright bad” [iii], Gerald Baron writes about the potential harm from over-reporting.  He notes how the public were weary of the “crying wolves” in Italy over-predicting the possibility of earthquakes. A spokesperson is quoted as saying: “If the risk is between zero and 40%, today they will tell us it’s 40, even if they think it is closer to zero. They’re protecting themselves, which is perfectly understandable.” In other words, “crying wolf” is merely a way for authorities to protect their jobs rather than people’s welfare.  The risk here is when there is a real 40% possibility of an earthquake. If the public translates the message as one with a close to zero likelihood, based on the previous false alarms, large numbers of people will be caught off-guard and surprised by the actual event while in their state of denial.
So if it is so difficult to learn from near-misses, maybe we are better off forgetting about them. Let’s look at this option: 
What would ignoring near-misses do to the organization’s culture, behavior and learning? Could it be that by ignoring near misses we create a breeding ground for becoming desensitized to recognizing causes of accidents?  Maybe the result of not paying enough attention to near misses would be a “loss of awareness” by the entire organization?
Back to the “crying wolf” story, the villagers’ strategy was to “not” learn from possible false alarms, because they were not providing any useful information. Interestingly, what the crying wolf story doesn’t include [in the versions we could get our hands on] is the villagers asking any questions about details that may have suggested whether the boy was telling the truth or not – in any of the events. There was no investigation of the near-miss, the boy or the wolf!
So was the fault with the boy or with the villagers? While the wolf was clearly harmful, the boy was actually helpful – admittedly some of the time! So, maybe we need to give the “boy who cried wolf” a break, and consider the downsides of not listening to him, even if the message was sometimes incorrect.
These days, the problem with handling near-misses might not be in dealing with “too much reporting”, but in having static, permanent, prone-to-bias “processing centers” responsible for delivering solutions and making decisions. These “processing centers” (safety committee, activist group, flight crew, members of a household, etc.) could have their leadership and authority built on a rotational basis to avoid biased decisions. Every member of the organization should feel responsible for the safety of others.
A healthy combination of active safety-related data collection and intelligent filtering of this information inflow may create an adaptive yet sensitive mechanism for responding to early warning signs in each organization that is concerned about safety matters.
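
As a rough illustration of what such filtering could look like, here is a minimal sketch. Everything in it is an assumption made for the example: the `NearMiss` record, the 1-to-5 severity scale, the thresholds and the sample log entries are all hypothetical and not drawn from any real safety system. The idea is to escalate a hazard either when it could have been severe or when it keeps recurring, so that minor one-off reports do not drown out the real message.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class NearMiss:
    hazard: str              # short label for what almost went wrong
    potential_severity: int  # 1 (minor) to 5 (could cost lives); hypothetical scale


def reports_to_escalate(reports, severity_threshold=4, repeat_threshold=3):
    """Pick hazards worth investigating: severe ones, or ones that keep recurring."""
    counts = Counter(r.hazard for r in reports)
    worst = {}
    for r in reports:
        # Track the worst potential severity seen for each hazard.
        worst[r.hazard] = max(worst.get(r.hazard, 0), r.potential_severity)
    return sorted(
        hazard
        for hazard in counts
        if worst[hazard] >= severity_threshold or counts[hazard] >= repeat_threshold
    )


# Made-up log entries, for illustration only.
log = [
    NearMiss("speeding into curve", 5),
    NearMiss("wet floor in depot", 1),
    NearMiss("missed signal check", 2),
    NearMiss("missed signal check", 2),
    NearMiss("missed signal check", 3),
]

print(reports_to_escalate(log))
```

With this made-up log, the single high-severity report and the thrice-repeated signal check are both escalated, while the one-off minor report is kept on file without demanding attention. Where to set the thresholds is exactly the judgment call each organization has to make for itself.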
Let’s go back to our “crying wolf” example one more time. This time we can try to retell the story from the “near miss” perspective. The villagers verified the boy’s first call for help, found out where the wolf came from, and reinforced the young shepherd with a guard dog with all the canine’s keen senses (!). In a modern version of the tale, there would even be wolf-sensing electronic devices placed around the area!
What we are trying to say is that each concerned organization must establish a system that is sensitive to warnings and is able to distinguish a false alert from a legitimate one. The organizational structure must foster responsible reporting of near misses by showing the benefits of such a process, in the form of displaying days without accident and/or monetary savings to the company budget that materialized because of accident-preventing “near miss” reporting. We think that communicating lessons that were learned “the hard way” (the powerful “Remember Charlie?” safety video, for example) would deliver a clear message about the unparalleled value of accident prevention and “near miss” reporting.
This blog suggests a reorientation: Events like the one on New York’s Metro-North Railroad have almost always put the focus on what DID happen. Case in point: A few dozen NTSB professionals are looking into the “cause” of the accident as we speak. But maybe the event should make us think about the times when it did “NOT” happen – or when we had the “near-miss”. 
Often there is a fine line between a near miss and the “real-miss”, between what DID and did NOT happen. Often, that fine line also tells us about passengers saved and passengers lost. More thorough collection, analysis and reporting of “near misses” is necessary. After all, it is too late to start looking for clues after the derailment if we really want to save lives.