Stan: I remember the alarm panels in the good old days of analog controllers. You had to be selective as to the type and number of alarms because there were so few windows available due to cost and space. Each alarm demanded a specific action and was well documented. “First Out” sequences were common, so the operator knew the initiating event. Then the DCS came along, and we now had Lo-Lo, Lo, Hi-Hi, Hi and deviation alarms sitting there with default settings. They were waiting to be set. In the process of doing so, configuration and process engineers started to come up with creative uses when the real thought should have been how not to use them.
Greg: Suddenly, our control rooms went from 60 alarms to 6,000 alarms. I tried to make alarms smarter, but the pervasive problem required much more than an ad hoc interest. I remember being impressed in a benchmarking visit to an even bigger chemical company that their goal was to make the alarm system smart enough to give the operator a single alarm identifying the root cause of the incident. Of course, we did not get details on the methodology and degree of success, but this started us thinking.
Stan: To the great benefit of the automation community and the process industry, Nick Sands, DuPont's manufacturing technology fellow and ISA Fellow, has a passion for not only addressing and fixing fundamental alarm management problems, but enabling alarm systems to improve process safety and performance more than ever envisioned. Throughout his career, Nick has been seeking to advance our profession by involvement in the International Society of Automation. It’s no surprise that his endeavors resulted in him being the co-chair of the ANSI/ISA-18.2-2016, “Management of Alarm Systems for the Process Industry,” standard, as well as the DuPont Global Alarm Management Leader.
Greg: I've had the pleasure of knowing Nick for many years. We share a common heritage and appreciation of taking advantage of fundamental chemical engineering principles as they apply to process control. One evening at ISA Automation Week 2011, Nick, Terry Tolliver and I did a tribute session to Greg Shinskey, acknowledging him as the source of most of the deep understanding of the relationship of process knowledge to process control, and realizing that advanced process control is really putting process knowledge into the control system. Nick has a great sense of humor as well as an incredible array of technical and practical skills. Here, we seek to make the most of an opportunity to learn from supporting documents and the extensive effort and expertise of the contributors to the ISA Standard 18.2, initiated in 2003 and updated in 2016.
Stan: We start with the ISA 18.2 definition of an alarm as being “an audible and/or visible means of indicating to the operator an equipment malfunction, process deviation or abnormal condition requiring a timely response.”
Greg: What really is alarm management?
Nick: While software programs can significantly help with the metrics and analysis of alarm systems including rationalization (review, selection, setting and prioritization of alarms), it is important to realize there's a life cycle with many activities to maximize performance.
Similar to safety instrumented systems (SIS), alarm systems require ongoing activities like training and testing. While some alarms are intended for safety, there are some differences between alarm functions and the safety instrumented functions (SIF) in a SIS. Every alarm requires an operator to take action, so activities are more weighted toward training. SIFs are usually independent of the operator, so the activities are more weighted toward testing. The way consequences are identified is often different as well. The consequence used to identify a SIF, perhaps in a hazard and operability study (HAZOP), would be the worst possible consequence. The consequence in alarm rationalization is the operator-preventable consequence. These can be very different. For example, when a SIF trips, it mitigates the safety consequence. In some cases—not necessarily often, but in some cases—there may be no operator-preventable consequence, so the trip does not generate an alarm, but an alert or message. Alerts and messages should be displayed separately from alarms, separating the urgent from the important.
ISA 18.2 Alarm Management Lifecycle includes practices to solve common alarm problems for new and existing plants, and builds on the work done by the Abnormal Situation Management (ASM) Consortium (www.asmconsortium.com) and the Engineering Equipment and Materials Users Association.
EEMUA 191, “Alarm Systems—A Guide to Design, Management and Procurement,” states the characteristics of a good alarm as being one that is relevant (not spurious or of low operational value), unique (not duplicating another alarm), timely (not too long before a response is needed or too late to do anything), prioritized (indicating the importance that the operator deal with the problem), understandable (having a message that is clear and easy to understand), diagnostic (identifying the problem that has occurred), advisory (indicative of the action to be taken) and focusing (drawing attention to the most important issues).
ISA 18.2 Alarm Management Lifecycle details the major stages of activities being executed sequentially as Philosophy, Identification, Rationalization, Detailed Design, Implementation, Operation and Maintenance, with simultaneous Monitoring and Assessment leading to Management of Change and reentry into the Identification stage. In reality, there's a learning experience and an interaction between stages that should be used to improve the performance in each stage. The Audit process and review of incidents and performance metrics should lead to continuous improvement in the Philosophy document and all subsequent stages.
Stan: Anyone involved in the safety and operability of plants, whether a supplier or a user, needs to read and implement ISA 18.2. One of the benefits of ISA membership is the ability to view standards and technical reports for free on the ISA website by simply logging in with your ISA username and password.
Since as engineers we pride ourselves on being particularly rational, what are the high points of the Rationalization stage?
Nick: In Rationalization, you start with the alarm justification, asking questions such as, "Is it abnormal and do the operators need to respond?" If yes, you proceed to setpoint determination, prioritization, classification and documentation. You need to know how much time there is available to prevent consequences from occurring and the maximum time it takes an operator to respond. Is there enough time for the operator to complete all the actions required? If not, instrumentation may need to be faster, and the alarm setting may need to be moved closer to the operating point. If it's too close to the operating point, noise or inconsequential conditions can cause nuisance alarms.
Often, a pre-trip alarm is set for an interlock or SIF to give the operator a chance to prevent the trip. This only works if there's action the operator can take and enough time for the operator to do it.
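The timing question Nick describes can be written as a simple margin check. The sketch below is only an illustration, not a method taken from ISA 18.2: the class name, field names and the split into a detection delay plus an operator response time are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AlarmTiming:
    """Hypothetical timing inputs for one candidate alarm, all in minutes."""
    time_to_consequence: float  # from reaching the alarm setting until the consequence occurs
    detection_delay: float      # assumed instrument and system lag before the alarm annunciates
    operator_response: float    # maximum time for the operator to complete all required actions


def response_margin(t: AlarmTiming) -> float:
    """Minutes left after detection and operator action; negative means the alarm comes too late."""
    return t.time_to_consequence - t.detection_delay - t.operator_response


def is_actionable(t: AlarmTiming) -> bool:
    """True when the operator can finish every action before the consequence occurs."""
    return response_margin(t) >= 0.0


# Example: 10 minutes to consequence, 1 minute of lag, 6 minutes of operator work.
timing = AlarmTiming(time_to_consequence=10.0, detection_delay=1.0, operator_response=6.0)
print(response_margin(timing), is_actionable(timing))
```

A negative margin points to the two remedies mentioned above: faster instrumentation (a smaller detection delay) or an alarm setting closer to the operating point (a longer time to consequence), with the nuisance-alarm risk that follows.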
Greg: SIS activation not only seriously affects process capacity and efficiency, but in the resulting shutdown and restart, equipment is stressed and hazards are introduced. More effort is generally beneficial to make sure the automation system and operator can prevent SIS activation despite practical limitations. An extremely large chemical intermediates plant went from two shutdowns per year to less than one per five years by using three transmitters and middle-signal selection, as well as better redundant valves.
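Middle-signal selection itself is a small idea: with three transmitters on the same measurement, use the median so that one failed or drifting transmitter cannot drive the value. The sketch below shows only that generic principle. How the plant Greg mentions implemented it is not described, and the function name is hypothetical.

```python
def middle_signal(a: float, b: float, c: float) -> float:
    """Return the median of three transmitter readings."""
    return sorted((a, b, c))[1]


# One transmitter fails low, then high; the selected value stays with the healthy pair.
print(middle_signal(50.1, 49.9, 0.0))
print(middle_signal(50.1, 49.9, 100.0))
```

A real implementation would also have to deal with signal status and with two transmitters failing in the same direction, which this sketch ignores.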
Stan: Do you need an alarm as to whether the SIS is doing its job?
Nick: Each alarm is rationalized individually, so there's often an alarm for a trip, but sometimes there is not. But if there's an indication of a failure in a trip—for example, a valve not closing or a pump not stopping—those are serious conditions that require a response and are the highest priority alarms. I’ve been in an incident where the SIS failed, and luckily we had time to shut another valve.
Greg: What are some important aspects of Classification and Prioritization?
Nick: Classification is putting the alarm in a group based on a set of requirements, like training or reporting. It usually doesn't impact the operator. Prioritization is only for the operator (remember the definition of alarm), indicating which alarm to respond to first when there's more than one alarm. The ranking of urgency may go from Low to Medium to High (shown in different colors and with different symbols) depending upon the matrix of time to respond (e.g., from less than three minutes to more than 30 minutes) and severity of consequence (e.g., minor to catastrophic). When prioritization is done right, a high-priority alarm is a scary thing.
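A priority matrix of this kind can be sketched as a lookup table. Only the end points come from the discussion above (less than three minutes to more than 30 minutes, minor to catastrophic, Low to High). The middle severity level, the middle time band and every cell assignment are illustrative assumptions; a site's Philosophy document would define the real matrix.

```python
# Severity levels; "major" is an assumed middle level.
SEVERITIES = ("minor", "major", "catastrophic")

# Rows: severity. Columns: time-to-respond band (under 3 min, 3 to 30 min, over 30 min).
# All cell values are illustrative assumptions.
PRIORITY_MATRIX = {
    "minor": ("Medium", "Low", "Low"),
    "major": ("High", "Medium", "Low"),
    "catastrophic": ("High", "High", "Medium"),
}


def urgency_band(minutes_to_respond: float) -> int:
    """Map the time available to respond onto a column index of the matrix."""
    if minutes_to_respond < 3:
        return 0
    if minutes_to_respond <= 30:
        return 1
    return 2


def prioritize(severity: str, minutes_to_respond: float) -> str:
    """Look up the alarm priority from consequence severity and time to respond."""
    if severity not in PRIORITY_MATRIX:
        raise ValueError(f"unknown severity: {severity!r}")
    return PRIORITY_MATRIX[severity][urgency_band(minutes_to_respond)]


print(prioritize("catastrophic", 2))
print(prioritize("minor", 45))
```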
Stan: Philosophy wasn’t part of the chemical engineering curriculum. Tell us more about the Philosophy document.
Nick: The Philosophy document is the guide for alarm management at the plant, through the whole lifecycle. It details the roles and responsibilities, the performance metrics and targets, as well as the methods for rationalization, classification and prioritization. There's also guidance on basic and advanced alarm design and the use of suppression. We have a series of technical reports, four issued and three more on the way to help. The ongoing tasks of the organization are to look at metrics each month, and institute periodic refresher training on prioritization and rationalization and the many aspects of alarm management to ensure proficiency. There must be an alarm system owner, leader or guardian, whose tasks include management of change and training, making sure the alarm system performance is maintained, especially in terms of consequence prevention, despite turnover, with retirement of experienced operators being a particular problem.
Greg: Management of change in my life is alarming. What does it involve here?
Nick: The guardian should be involved in approval of any new or modified alarms. Often, plants and projects readily revert to the old way of doing things. All of the activities and people responsible for alarms must work together. Ideally, the guardian is in the operations group, where they have a better understanding and appreciation of what the alarm system means to the operators. It's most important that this person connect passionately with the Alarm Management Lifecycle. Support from management and engineering is needed.
Stan: How is alarm suppression done effectively and intelligently?
Nick: Suppression by design to keep alarms relevant involves logic based on process, equipment and automation system knowledge.
Shelving involves the operator temporarily suppressing an alarm with engineering controls to ensure the alarm is unsuppressed after a specified time. Shelving can be dangerous if people aren't trained on how to do it. For example, operators may shelve the alarm for eight hours (entire shift), resulting in the alarm never appearing on the alarm summary report for the shift. When I asked in a course how many people used shelving, 80% of the hands went up. When I asked how many had been trained in its proper use, only a few hands were raised. Intelligent initiation and time limitation must prevent lack of protection for probable consequences. Periodic training is critical.
Out-of-service is used for instruments that have to be repaired to be functional. The notification and maintenance schedule must be clear as to when the repairs will be done.
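The engineering control behind shelving, a time limit that brings the alarm back automatically, can be sketched as follows. The class, the 60-minute cap and the tag name are assumptions for illustration only. They are not taken from ISA 18.2 or from any particular control system.

```python
from dataclasses import dataclass, field


@dataclass
class ShelvedAlarms:
    """Hypothetical shelf that caps how long any alarm may stay suppressed."""
    max_minutes: float = 60.0                   # assumed site limit, well short of a full shift
    _expiry: dict = field(default_factory=dict)  # tag -> time (minutes) at which it unshelves

    def shelve(self, tag: str, now: float, minutes: float) -> float:
        """Shelve `tag` at time `now`; return the time at which it is unshelved."""
        if minutes <= 0:
            raise ValueError("shelving duration must be positive")
        duration = min(minutes, self.max_minutes)  # enforce the cap
        self._expiry[tag] = now + duration
        return self._expiry[tag]

    def is_shelved(self, tag: str, now: float) -> bool:
        """True while the alarm is suppressed; expired entries are unshelved automatically."""
        expiry = self._expiry.get(tag)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expiry[tag]
            return False
        return True


shelf = ShelvedAlarms(max_minutes=60.0)
# An operator asks for eight hours; the cap cuts it to one.
print(shelf.shelve("TI-101.HI", now=0.0, minutes=480.0))
print(shelf.is_shelved("TI-101.HI", now=30.0), shelf.is_shelved("TI-101.HI", now=60.0))
```

The cap addresses the eight-hour example above, where an alarm shelved for the entire shift never appears on that shift's alarm summary. It does not replace the training Nick calls critical.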
Greg: What are some lessons learned?
Nick: I worked on alarm management from the ground up for 18 years, and made some progress. But for the last eight years, I had support from the top down, and that has impacted all our sites in a sustainable way. The best approach is through a company operational excellence program, safety program or quality program. Be proactive. You don’t want to wait until you have an incident to motivate you.
You need a management leader to provide the drive. It can be quite challenging to quantify the financial benefit. The correlation with “uptime” takes time to establish, and the effect on “first pass yield” may be lost in the noise. These process metrics and alarm rates can also improve due to process control improvement (PCI). The benefit is clear, but the business case is difficult. Happier operators are common. People can really see and feel the difference on a daily basis. Operators are better able to focus on what really needs attention.
Top 10 places I have or haven't been
- 10. I've been in many places, but never in Kahoots. Like many engineers, I can function quite well without other people. You have to be with someone to be in Kahoots.
- 9. I've been in Cognito. I've attended a lot of parties where no one recognized me. This is great because I don’t have to explain what I do. It’s tough enough to do that to my boss.
- 8. I've not been in Sane, although some may think so when reading my humorous books.
- 7. I'm always looking to go to Conclusions, but if you have to jump to them, I can’t do this due to my age and being an engineer for so long.
- 6. I have been in Doubt. This is a natural result of the scientific method of not avoiding what proves you're wrong.
- 5. Sometimes I'm in Capable, and I go there more often as I find fewer and fewer people who appreciate my expertise.
- 4. I don’t like to be in Flexible. Technology increases knowledge and there are exceptions to every rule. My career is one big learning experience looking for new perspectives.
- 3. Fortunately, I am in Tuitive. Tools and solutions are getting more and more complex.
- 2. I am in Coherent based on the glazed eyes I see when I explain my profession.
- 1. With regard to the future, I'm planning on being in Suspense. Who knows whether there will be any management left controlling purse strings, who will understand what I do.