By Ian Nimmo, User Centered Design Services, Inc.
"Why I believe that our current HMIs are dangerous, are not working for us as intended; they impact the performance of operators daily. They overstate alarm management and ignore control room design problems, which are resulting in catastrophic human error- caused accidents. As an industry we have pointed the finger solely at alarm management and missed the valuable lessons from major accidents, and we will never prevent these accidents until we acknowledge this error."
When there are problems in the processing industry, the industry responds to them with sound engineering solutions. This has been done by different disciplines for decades and has helped maintain a good safety record. However in recent years, some accidents have some very common threads, and engineers are having difficulty designing these errors out of systems.
The industry has tried to learn from major catastrophes such as Bhopal, Piper Alpha, Three Mile Island and Chernobyl, but it has been introduced to a new type of failure, which was categorized by Professor James Reason in his book Managing the Risks of Organizational Accidents. He stated in the first chapter that "organizational accidents are rare, but often catastrophic, events that occur within complex modern technologies," and that "organizational accidents are a product of recent times or, more specifically, altered the relationship between systems and their human elements. Organizational accidents may be truly accidental in the way in which the various contributing factors combine to cause the bad outcome, but there is nothing accidental about the existence of these precursors, nor in the conditions that created them."
The most important question to ask after the events that led to the large loss of life on the Piper Alpha Platform must be: How can an accident such as this be prevented from ever happening again? [Editor's note: a Google search of Piper Alpha will provide several pages of information, including reports, contemporary coverage of the disaster and videos.] One of the investigators clearly answered that question. He said the requirements for safe operation are
- Hazards must be recognized and understood.
- Equipment must be fit for purpose.
- Systems and procedures must be in place to maintain plant integrity.
- Competent staff must be employed.
- There must be emergency preparedness.
- Performance must be monitored.
As many of the more recent major accidents are reviewed, it is clear these lessons have not been applied, and the lessons learned in other industries, such as aviation, which has many parallels to the process industry, have been ignored. If the industry would only apply the lessons learned about situation awareness, it would have a different perspective on the alarm problem. It would better understand that alarms are just one of the tools that can be used for situation awareness.
Texaco Pembroke started as a simple instrument failure caused by an electrical storm. This was not the root cause of the accident that came later as the operators attempted to restart the refinery. During this period, one of the console operators on the cat cracker experienced a series of problems that get rolled into a term called human error. This terminology is partly to blame for the inability to fix the problem; a better term would be design-induced errors. The simplest instrumentation system design failed at Pembroke, as a level in a knock-out drum initiated an alarm to which the operator failed to respond. This is pretty basic. However, the designers did not consider the reliability of the operator and the consequences of a failure to respond.
The UK Health Safety Executive (HSE) analysis of the incident determined that the DCS displays conveyed limited information, and the operator became overwhelmed with alarm data as raw alarms were being presented at a rate of 20 to 30 per minute. The HSE determined that "the flood of alarms greatly decreased the chances of the operator's restoring control in the plant." In the final 11 minutes, the operator received 275 alarms, of which 87% were categorized as high priority requiring an operator response within 3 minutes for each alarm! Critical alarms were overlooked in the midst of other alarms.
So here the DCS HMI and the alarm system can be clearly identified as major causal factors behind this incident. To walk away from this incident and have a focus purely on the alarm system would be a grave mistake that many other plants would make.
Even the HSE's actions after the incident in some ways directed the industry to this conclusion, as it focused attention on alarm management as they called their Research Report 166/1998 "The Management of Alarm Systems." The industry responded and brought out a new guideline on alarm management through the work of the Abnormal Situation Management Consortium (www.asmconsortium.com) and EEMUA (www.eemua.org) with the EMMUA 191 document. Both documents were very effective and were really needed by the industry worldwide. However, nothing was done to raise the standards of the HMI (EMMUA 201Human Computer Interface Desktop Guidelines was not considered helpful), and this made it difficult to fully resolve the alarm issues, leaving the industry wide open to repeated incidents.