
How to perform alarm rationalization

April 13, 2005
If you don’t deal with your computer-based control alarms in a disciplined manner, uncontrolled alarm growth can result, which can lead to out-of-control alarm systems. Find out the top 10 reasons to rationalize your alarm management system.

By William L. Mostia Jr., PE

MODERN CONTROL SYSTEMS have brought us many benefits, but along with the benefits have come problems. One of the major benefits has been the increased information available to the operator, while one of the problems is what to do with all that information.

Computer-based control systems also have increased the level of abstraction of the process. The operator has more and more information, but with a smaller and smaller window to look through, resulting in a higher and higher level of abstraction. Increased complexity and sophistication, increased automation, control concentration and separation, and additional layers of control have further increased the level of abstraction. Some systems are so abstract they approach the complexity of a video game.

To compensate for this abstraction, control systems have provided additional operator interface functions, and system designers have increased the number of alarms and alerts to help keep the operator informed. Alarms increase the amount of information going directly to the operator but they often are a source of operator overload and confusion.

In older control systems, hardwired panels were used to provide alarm annunciation. The panels were large but limited in capacity, and so by their very nature tended to limit the number of alarms. In modern control systems, alarms are generally software-driven and are essentially "free" for existing process variables. Little incentive to limit their creation has led to a laissez-faire attitude toward alarms. We can configure a new alarm at the flick of a finger, and there has been a lot of flicking going on.

Also, regulations from OSHA and the EPA as well as voluntary programs such as ISO 9000 and 14000 have led to the addition of alarms, sometimes with little consideration of the effect on alarm loads at the system level.

Some notable examples of alarms causing problems include the Three Mile Island accident in 1979, where important alarms were missed; the Texaco refinery explosion at Milford Haven in 1994, where, in the 10 minutes prior to the explosion, two operators had to respond to 275 alarms, peaking at three per second; and the Esso Longford gas plant explosion in Australia in 1998, where some experts concluded that operators routinely ignored alarms leading up to the explosion because, in the past, ignoring them had no negative impact.

Alarming Growth
Alarm growth is a natural outcome of the increased information load and abstraction of the modern control system. However, if alarms are not dealt with in a disciplined manner, uncontrolled alarm growth can result, which can lead to out-of-control alarm systems. If your alarm system has one or more of these characteristics, it may be out of control:

  1. Many alarms during abnormal situations
  2. Many alarms on during normal operation
  3. High alarm loading rates (alarms per unit time, alarms per operator, alarms per event, etc.)
  4. Incidents or near-incidents where operators missed key data provided (or not) by the alarm system
  5. A large number of high-priority alarms
  6. Alarms that are on for long periods of time
  7. Alarms going off and on regularly or intermittently (chattering or transient)
  8. Lost count of the number of alarms
  9. Lost track of alarm setpoints or why they were set there in the first place
  10. Don't know which alarms are safety, operational, environmental, informational, etc.
  11. Operators don't know what particular alarms mean or that there may be inappropriate alarms
  12. Operators don't know what to do when a particular alarm occurs
  13. Don't know when the alarms were tested last
  14. Alarms that are not useful and even confusing or obscuring
  15. A large number of defeated alarms
  16. No procedure or policy on alarm creation; i.e., anyone can create an alarm or change the limits on his or her own authority
  17. Alarm documentation is out of date or nonexistent
  18. No written procedures or policies on alarms.

Elements of Alarm Management
Driven by hazardous events, out-of-control alarm systems, and desires to optimize alarm systems, effective alarm management has moved to the front of the minds of many users and manufacturers. There can be both problems and treasures buried in the alarm system. Alarm management can be viewed as part of a larger scheme of normal operations and critical or abnormal condition management.

Alarm management is the management of alarms throughout their lifecycles. Some of the things required of alarm management are:

  • An alarm philosophy
  • Alarm ownership
  • Controls on the creation of alarms and their functionality (from alarm activation to operator detection, operator action to perform, time in which to perform the action, etc.)
  • Pre-alarm management
  • Alarm change management
  • Alarm prioritization
  • Alarm definitions
  • Alarm organization and presentation
  • Alarm filtering and suppression
  • Informational alert management
  • Alarm operating procedures
  • Training
  • Alarm maintenance and testing
  • Alarm rationalization for existing systems and as part of continuous improvement.

It should be noted that the control system equipment's capabilities, as well as any third-party add-ons you have or will have, can affect what you can do with an alarm system and thus can impact alarm management procedures and practices and alarm rationalization.

Rationalization Is Key
Alarm rationalization is the systematic process of optimizing the alarm database for the safe and efficient operation of the facility. This process normally results in a reduction in the total number of alarms, the prioritization of alarms, the validation of alarm parameters, the evaluation of alarm organization and presentation, the evaluation of alarm functionality, etc. (see Table I).

Table I: Top 10 Reasons to Rationalize

Rationalization also can, in some cases, identify the need for new alarms or changes in the process, equipment, or instrumentation. It can be done to fine-tune an existing good alarm system, but it is more commonly done where the alarm system has gotten out of control.

Note that alarm rationalization is not a one-shot process. The forces of chaos are out there looking for any opportunity to take control of any complex system, and alarm systems are no exception. Over time, people will come and go, the process will change, operating philosophies will change, marketing will stick its nose in things, the hardware system will change, improve, degrade, etc. All these are opportunities for changes or lack of changes to the alarm system, and they indicate the need for periodic alarm rationalization. Training, procedures, procedural controls, and auditing are some of the tools used to maintain an optimum alarm system for effective and safe operation of a plant.

How to Rationalize Alarms
Alarm rationalization is a structured process that generally involves an approach similar to that of a HAZOP team, with representatives from operations, maintenance, engineering, and safety. It is important to have operator input on this team. It is also important to have an organized plan to perform the alarm rationalization, with an established procedure and practices.

While alarm rationalization will vary from company to company and plant to plant, the methodology generally consists of eight basic steps (Figure 1). These steps are presented serially but in fact can overlap or run in parallel in some cases.

Figure 1: An Eight-Step Program

The steps on the path to alarm rationalization are shown in series, but in some cases may overlap or be performed in parallel.

1.  Develop an Alarm Management Procedure
A consistent, comprehensive alarm management procedure/philosophy is necessary before beginning alarm rationalization.

The procedure typically contains a plant alarm philosophy, alarm type identification methodology (operational, safety, environmental, etc.), risk identification methodology and method of prioritization of alarms, alarm functionality requirements (presentation, organization, pre-alarms, operational time requirements, etc.), alarm filtering or suppression methods, identification of undesirable alarm types and how to handle them, methods of setpoint determination, testing requirements, alarm sequences, acceptable alarm metrics, documentation requirements, etc.

While alarm rationalization is only a part of the overall alarm management, it is absolutely necessary to have defined consistent alarm practices to apply to the alarm database.

2.  Develop Alarm System Metrics
It is hard to measure your progress if you don't know where you start and where you end up. To measure the progress of alarm rationalization, you need to develop alarm system metrics for your system. Typical metrics can be total number of alarms, alarms per operator, alarms per hour, alarms per identified abnormal situation, fraction of unacknowledged alarms, average time for an alarm to return to normal, average number of active acknowledged alarms, number of chattering alarms, number of standing alarms, number of nuisance alarms, number of disabled or shelved alarms, etc.
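As a rough illustration of how a few of these metrics might be computed, the sketch below works from an exported alarm journal. The record layout (activation time, tag, console), the tag names, and the function name alarm_metrics are all hypothetical; a real export from your control system will look different.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alarm journal: (activation time, tag, operator console).
events = [
    (datetime(2005, 4, 1, 8, 0), "FI-101.LO", "console_A"),
    (datetime(2005, 4, 1, 8, 5), "TI-205.HI", "console_A"),
    (datetime(2005, 4, 1, 8, 7), "FI-101.LO", "console_A"),
    (datetime(2005, 4, 1, 9, 30), "PI-310.HI", "console_B"),
]

def alarm_metrics(events):
    """Return a few baseline metrics for a list of alarm activations."""
    times = [t for t, _, _ in events]
    # Length of the journal in hours; never less than one hour so that
    # a very short journal does not inflate the rate.
    span_hours = max((max(times) - min(times)) / timedelta(hours=1), 1.0)
    per_console = Counter(console for _, _, console in events)
    return {
        "total_alarms": len(events),
        "alarms_per_hour": len(events) / span_hours,
        "alarms_per_console": dict(per_console),
        "distinct_tags": len({tag for _, tag, _ in events}),
    }

metrics = alarm_metrics(events)
```

The same function can be run on the journal before and after rationalization so that both sets of numbers are calculated the same way.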

Operational metrics also can be used, such as total production rate, off-spec production, number of upsets, uptime, workload, etc.

Some safety and environmental metrics that could be used are the number of plant shutdowns, number of incidents/near misses, releases to the atmosphere, releases to flare, etc.

Operator and operating staff interviews also should be used as metrics. These people are on the front line and have good firsthand information on how the plant is operating.

3.  Benchmark the Existing Alarm System
This is a two-part process. First is system benchmarking, which consists of an alarm operational analysis (statistical analysis of the existing system) and identification of the current values or states of other alarm rationalization benchmarks.

Statistical analysis of the current alarm activity of the existing system should look at alarm frequency, duration, occurrence, bursts (floods), correlation (or not) with process or equipment activity, correlation with other alarms, acknowledgement time, etc. This type of analysis can be a static analysis of the alarm and operational logs, a dynamic online analysis via resident software that monitors alarm activity in real time, or both.
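One piece of a static analysis, finding alarm bursts in a log, can be sketched with a sliding time window. The 10-minute window and the threshold of 10 activations are illustrative assumptions rather than values from this article; set them from your own alarm management procedure.

```python
from datetime import datetime, timedelta

def find_floods(times, window=timedelta(minutes=10), threshold=10):
    """Return (window_start, count) for each activation that begins a
    window containing at least `threshold` activations."""
    times = sorted(times)
    floods = []
    end = 0
    for start, t0 in enumerate(times):
        # Advance `end` to the first activation outside [t0, t0 + window).
        while end < len(times) and times[end] < t0 + window:
            end += 1
        count = end - start
        if count >= threshold:
            floods.append((t0, count))
    return floods

# Hypothetical data: 12 alarms five seconds apart, then one isolated alarm.
base = datetime(2005, 4, 1, 14, 0)
times = [base + timedelta(seconds=5 * i) for i in range(12)]
times.append(base + timedelta(hours=2))
floods = find_floods(times)
```

Each flood window found this way can then be compared with the operating log to see which process or equipment event, if any, it lines up with.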

Static analysis typically has the advantages of working on all systems; working with existing prior history to determine metrics; being flexible in the metrics used; correlating with past or current operational, process, or equipment events; and using a human to review the data who may (or may not) catch trends or issues that would escape an automatic system.

The disadvantages are that the analysis is typically done by hand and can be cumbersome, and that some trends may be missed.

Dynamic analysis advantages typically include statistical alarm activity detected online with built-in alarm metrics calculated both in snapshot and over time, built-in alarm-system expertise that can complement the user's capabilities, automatic documentation to meet regulatory requirements such as 21 CFR Part 11, and management of alarm parameters. Dynamic analysis can be used to monitor the system after rationalization to detect downstream issues, provide alarm and process event analysis, and, in some cases, analyze manually input alarm data.

The disadvantages are that long-term online analysis requires long online time; automatic correlation with operational, process, or equipment events may not be available or may have to be entered manually; and some desired alarm metrics may not be available from the system.

It should be noted that not all aspects of alarm analysis/rationalization may be provided by one method, so a combination of static and dynamic methods may be required.

Benchmarking can include identification of the values or states of the non-alarm system attributes such as operational, environmental, safety, and commercial goals. This benchmarking should also include interviews with operators and operating staff before rationalization.

The second part of benchmarking is to analyze individual alarm operational performance to see what the alarm is doing on a day-to-day basis and to find individual bad actors or groups of bad actors.

This benchmarking step should not only give alarm system metrics but should identify alarm conditions such as alarm floods, operator alarm overload, standing alarms, transient alarms, nuisance alarms, chattering alarms, redundant alarms, obscuring alarms, alarm (symptom) vs. process or equipment activity, ignored alarms, defeated (shelved) alarms, process issues related to alarms, faulty field devices, etc.
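A simple way to start looking for bad actors is to rank tags by activation count and to flag tags that repeat quickly. The sketch below assumes a log of (time, tag) pairs; the chattering rule used here (three activations within one minute) is an assumption chosen for the example, not a standard definition.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def bad_actors(events, top_n=3):
    """Rank tags by activation count; `events` is a list of (time, tag)."""
    return Counter(tag for _, tag in events).most_common(top_n)

def chattering_tags(events, repeats=3, within=timedelta(minutes=1)):
    """Return tags that activate at least `repeats` times inside `within`."""
    by_tag = defaultdict(list)
    for t, tag in events:
        by_tag[tag].append(t)
    result = set()
    for tag, times in by_tag.items():
        times.sort()
        for i in range(len(times) - repeats + 1):
            if times[i + repeats - 1] - times[i] <= within:
                result.add(tag)
                break
    return result

# Hypothetical log: one level alarm repeating every 10 seconds.
base = datetime(2005, 4, 1, 6, 0)
events = [(base + timedelta(seconds=10 * i), "LI-400.HI") for i in range(4)]
events += [(base, "TI-205.HI"), (base + timedelta(hours=1), "TI-205.HI")]
events += [(base + timedelta(hours=2), "PI-310.HI")]

worst = bad_actors(events, top_n=1)
chatter = chattering_tags(events)
```

A ranked list like this only points to where to look; whether the cause is a poor setpoint, a missing deadband, or a faulty field device still has to be settled by the team.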

4.  Identify and Analyze Individual Alarms
In alarm rationalization, an important step is identifying all the alarms in the system and their parameters (setpoints, directions, deadbands, test intervals, priorities, etc.).
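To show why a parameter such as the deadband matters, here is a minimal sketch of a high alarm that activates at the setpoint and clears only after the process value falls below the setpoint minus the deadband. The function names and the sample values are made up for the example, and real control systems implement deadband in their own ways.

```python
def high_alarm_states(values, setpoint, deadband):
    """Return the alarm state after each sample of a high alarm."""
    active = False
    states = []
    for pv in values:
        if not active and pv >= setpoint:
            active = True          # process value reached the setpoint
        elif active and pv < setpoint - deadband:
            active = False         # clears only below setpoint - deadband
        states.append(active)
    return states

def activations(states):
    """Count transitions from inactive to active."""
    previous = [False] + states[:-1]
    return sum(1 for prev, cur in zip(previous, states) if cur and not prev)

# Hypothetical noisy measurement hovering around a setpoint of 100.
values = [99.0, 100.2, 99.8, 100.1, 99.7, 100.3, 97.5]
without_deadband = activations(high_alarm_states(values, 100.0, 0.0))
with_deadband = activations(high_alarm_states(values, 100.0, 2.0))
```

With no deadband the noisy signal produces three separate activations; with a deadband of 2 it produces one.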

Identifying individual alarms is only a beginning step because we also need to know the functionality of the alarm, the alarm's relationship to other alarms (correlation), and its relationship to process and equipment conditions. An alarm isn't simply exceeding a limit, then providing a visual and audible response. Defining an alarm like this can get you into trouble fast.

Analysis of the alarm functionality includes determining what the purpose of the alarm is, what the consequences and risk associated with the alarm are, how important the alarm is, what the alarm indicates, how the alarm is organized and presented, what other alarms will be active at the same time (situations where the alarm and other alarms will activate), alarm correlation, alarm patterns, what action is expected of the operator, how the operator will know what to do, how much time there is to detect and accomplish the action (and whether that is reasonable), what the expected action to bring the process to the desired state is, what process or equipment conditions affect the desired performance of the alarm, how the alarm fits into troubleshooting, etc.
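One way to keep track of these questions is a record per alarm that the team fills in as it goes. The sketch below is only an illustration: the field names, the AlarmRecord class, and the example tags are hypothetical, and the list of fields is much shorter than a real rationalization worksheet would be.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class AlarmRecord:
    """Hypothetical worksheet entry for one alarm."""
    tag: str
    purpose: str = ""
    consequence: str = ""
    operator_action: str = ""
    response_time_min: Optional[float] = None
    related_alarms: tuple = ()

def open_questions(record):
    """Return the names of fields the team has not yet filled in."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in ("", None)
    ]

record = AlarmRecord(
    tag="LI-400.HI",
    purpose="Prevent tank overflow",
    operator_action="Close inlet valve XV-401",
)
missing = open_questions(record)
```

An alarm whose purpose or operator action cannot be filled in is a candidate for removal or for reclassification as an informational alert.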

Data gathered on the alarm in the initial benchmarking step will be used in this step to characterize individual alarm performance and correlation with other alarms, the process, and equipment for this analysis.

5.  Prioritize Alarms
Here you determine the importance or significance of the alarm through a ranking scheme. This is normally done by risk analysis to determine how important it is that the operator detect the alarm and perform the expected action when it occurs. Prioritization helps ensure the operator knows the importance of the alarm itself as well as the importance of the alarm in relationship to other alarms.
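A ranking scheme of this kind is often written down as a lookup table. The sketch below assumes a matrix of consequence severity against the time available to respond; the categories, the 10-minute cutoff, and the resulting priorities are invented for illustration and would come from your own risk analysis.

```python
# Hypothetical matrix: (consequence severity, time available) -> priority.
PRIORITY_MATRIX = {
    ("minor", "long"): "low",
    ("minor", "short"): "low",
    ("major", "long"): "low",
    ("major", "short"): "medium",
    ("severe", "long"): "medium",
    ("severe", "short"): "high",
}

def assign_priority(severity, minutes_to_respond, short_limit=10.0):
    """Look up a priority from severity and the time available to respond."""
    urgency = "short" if minutes_to_respond <= short_limit else "long"
    return PRIORITY_MATRIX[(severity, urgency)]

# Hypothetical alarms with the team's severity and response-time estimates.
examples = {
    "LI-400.HI": assign_priority("severe", 5),
    "TI-205.HI": assign_priority("major", 30),
    "FI-101.LO": assign_priority("minor", 60),
}
```

Keeping the matrix in one place means every alarm is ranked by the same rule, which makes the result easier to defend and to audit later.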

Figure 2: Prioritize in Proportion
According to EEMUA Guideline 191 (see sidebar, "Few Solid Guidelines"), the number of alarms of each priority--high, medium, low--should be proportioned as shown.

The prioritization scheme is generally limited to the control system's capabilities and any third-party alarm management software on the system. The number of prioritization levels should be kept to a minimum to minimize operator confusion. The number of alarms prioritized in each category (high, medium, low) generally can be visualized as a triangle (Figure 2) with EEMUA Guideline 191 alarm proportions (see sidebar, "Few Solid Guidelines"). Informational alerts should be kept out of the prioritization scheme if possible.
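Checking the configured priorities against a target split is simple arithmetic. In the sketch below the target of 5% high, 15% medium, and 80% low is the split commonly attributed to EEMUA 191; treat it as an assumption and confirm the figures against the guideline itself. The 5-point tolerance and the alarm counts are made up for the example.

```python
from collections import Counter

# Assumed target split; verify against the guideline before relying on it.
TARGET = {"high": 0.05, "medium": 0.15, "low": 0.80}

def priority_shares(priorities):
    """Return the fraction of alarms at each priority level."""
    counts = Counter(priorities)
    total = len(priorities)
    return {level: counts[level] / total for level in TARGET}

def deviations(shares, target=TARGET, tolerance=0.05):
    """Return levels whose share differs from the target by more than tolerance."""
    return {
        level: round(shares[level] - target[level], 3)
        for level in target
        if abs(shares[level] - target[level]) > tolerance
    }

# Hypothetical database of 100 configured alarms, top-heavy with high priority.
configured = ["high"] * 30 + ["medium"] * 30 + ["low"] * 40
shares = priority_shares(configured)
off_target = deviations(shares)
```

A positive deviation means too many alarms at that level; here the database has far too many high-priority alarms and too few low-priority ones.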

6.  Rationalize the Alarm Database
This workhorse step is where the alarm management procedure and other methods are applied to the information gathered in the previous steps to optimize the alarm system. Other methods can be statistical analysis; alarm filtering and suppression; bad actor(s) identification; logic, event, and fault trees; alarm organization and presentation; operator dialogs; adaptive alarm techniques; and expert systems.

This step considers alarms not only individually but on a system level as well as correlated to process, equipment, and system events and abnormal conditions. This step is further complicated by the fact that the process is not static--the process that the operator and the alarm system see may change based on operating conditions, and case scenarios may need to be developed against which you can analyze the alarm database.

7.  Implement the Rationalization
This is where the alarm rationalization is implemented on the control system. While one might assume this is a simple step of modifying the control system as the result of the previous steps, it is not that simple. Since the rationalization may add or remove alarms; change presentation, organization, or setpoints; add or modify procedures; modify or add training; etc.; it is necessary to have an implementation plan that involves the operators and operating staff and other appropriate personnel.

Failure to involve the operating staff can lead at the least to some upset people or loss of some of your expected gains with alarm rationalization. At worst it can result in a hazardous event due to the operators not understanding the new system (one of the things that alarm rationalization is supposed to improve).

Remember that the system on which you are implementing the alarm rationalization can affect the outcome. Such things as prioritization scheme, alarm filtering, screen organization, alarm presentation, documentation capabilities, alarm procedure capabilities, and other alarm capabilities of the control system will impact some of what you can do in your alarm rationalization.

Also, third-party software products for some control systems can enhance your alarm capabilities, but poor application can give you a worse system than you started with.

8.  Benchmark the New Alarm System
Once the alarm rationalization is implemented, benchmark the final product to determine the success of the rationalization effort. (It also makes management happy to see the results of a successful project--they like numbers.) Do this by determining the alarm system benchmark metrics identified in step 2 using the same methods as used in step 3.
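The before-and-after comparison can be as simple as the sketch below, which reports the percentage change for each metric measured in both benchmarks. The metric names and the numbers are hypothetical.

```python
def compare(before, after):
    """Return {metric: (before, after, percent change)} for shared metrics."""
    report = {}
    for name, old in before.items():
        if name not in after:
            continue
        new = after[name]
        # Percent change is undefined when the starting value is zero.
        change = (new - old) * 100 / old if old else None
        report[name] = (old, new, change)
    return report

# Hypothetical benchmark results from step 3 and step 8.
before = {"alarms_per_hour": 40.0, "standing_alarms": 25, "configured_alarms": 2000}
after = {"alarms_per_hour": 10.0, "standing_alarms": 5, "configured_alarms": 1500}
report = compare(before, after)
```

Only metrics gathered the same way in both benchmarks are worth comparing, which is why the same methods should be used in both steps.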

Don't forget the operator and operating staff interviews. These people will know about improvements that the mathematical benchmarking will not tell you.

Life After Rationalization
A good audit process is necessary to ensure your alarm system stays manageable and in control. Online dynamic alarm monitoring systems can assist in this. If you are not following a comprehensive alarm management procedure, your alarm system may go out of control again in the future.
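One audit check that lends itself to automation is comparing the alarm settings in the control system with the approved values from the rationalization. The sketch below assumes both can be exported as dictionaries keyed by tag; the tags, parameter names, and values are hypothetical.

```python
def audit(master, live):
    """Return (tag, finding) pairs where the live settings differ from the master."""
    findings = []
    for tag in sorted(set(master) | set(live)):
        if tag not in master:
            findings.append((tag, "not in master database"))
        elif tag not in live:
            findings.append((tag, "missing from control system"))
        else:
            for key, approved in master[tag].items():
                actual = live[tag].get(key)
                if actual != approved:
                    findings.append(
                        (tag, f"{key}: approved {approved}, found {actual}")
                    )
    return findings

# Hypothetical approved settings and a current control system export.
master = {
    "LI-400.HI": {"setpoint": 90.0, "priority": "high"},
    "TI-205.HI": {"setpoint": 150.0, "priority": "low"},
}
live = {
    "LI-400.HI": {"setpoint": 95.0, "priority": "high"},
    "FI-999.LO": {"setpoint": 10.0, "priority": "low"},
}
findings = audit(master, live)
```

Each finding is a question for the alarm owner: either the change went through change management and the master database is out of date, or someone changed the alarm on his or her own authority.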

There are great opportunities for improvement in our alarm systems. Because these systems are complex, alarm rationalization requires careful planning and organization and is generally a team effort.

The market for alarm management and rationalization products is still underdeveloped but the marketplace is growing with improvements by control system manufacturers and third-party products. This is being driven by more and more people realizing that there are issues and problems with their alarm systems as well as opportunities leading to the desire to optimize their alarm systems.

Few Solid Guidelines

There are not a lot of solid guidelines in the area of alarm management/rationalization. Probably the best known is the British Engineering Equipment and Materials Users Assn. (EEMUA), which has a guideline, "Alarm Systems, a Guide to Design, Management and Procurement, EEMUA Publication No. 191." This guideline has some metrics for good performance of an alarm system.

The British Health & Safety Executive (HSE) also has a couple of guidelines that may be of some assistance (see "Better Alarm Handling," and the Technical Measure Document, "Control Systems").

ISA offers ISA TR91.00.02, "Criticality Classification Guideline," which provides guidance in defining types of systems, as well as ANSI/ISA S18.1 1992, "Annunciator Sequences and Specifications."

IEC has IEC 61508, "Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems," and IEC 61511, "Functional Safety: Safety Instrumented Systems for the Process Industry Sector," which address safety-related and safety instrumented systems that may have some application where credit is taken for alarms during risk assessment.

The Abnormal Situation Management (ASM) consortium of major companies has done considerable work in abnormal situation management and has a number of good articles.

Another good presentation on alarm systems is "Useful and Usable Alarm Systems: 43 Recommendations."

Suppliers Exist

There are a number of suppliers, and the market is growing for alarm management and rationalization products and related services. The key is to find the products, services, and experience that most closely match your in-house capabilities and philosophy.

Some of the companies that supply alarm management and rationalization products and/or services (in alphabetical order) are Control Arts, Exida, Honeywell, Matrikon, Process Automation Services, Process Systems Consultants, and TiPS.

There also are some companies that provide expert systems for improving the operator decision process and detecting developing problems before the alarm stage, thus reducing the alarm load. Examples are Gensym and Nexus Engineering.

Many DCS and SCADA systems and vendors provide alarm management features or products. It generally is more cost-effective to use whatever features your system provides rather than a third-party add-on. However, some of these are somewhat primitive and don't provide the necessary functionality by themselves, though they are generally improving.

There also are some add-on alarm products on the market that enhance a control system's basic alarm capabilities by providing online alarm management (alarm controls, alarm parameter management, alarm documentation, alarm system auditing, change control, etc.), alarm filtering, cause-and-effect analysis, alarm patterns, and dynamic reconfiguration of the alarm system for varying operating conditions.

References:

1. Johannes Koene & Hiranmayee Vedam, "Alarm Management and Rationalization," Third Annual Conference on Loss Prevention, 2000.

2. A. Nochur, H. Vedam, & J. Koene, "Alarm Performance Metrics," IFAC 2001.

3. Edward Marszal, "The Longford Gas Plant Explosion: Could Alarm Management Have Prevented This Accident?" Exida, 2003.

4. W.H. Smith, C.R. Howard, & A.G. Foord, "Alarms Management--Priority, Floods, Tears, or Gain?" 4-sight Consulting, 2003.

5. Yoshitaka Yuki & Kimikazu Takahashi, "Event Analysis Based on Causal Relation of Events, Alarms, and Operator Actions," ISA, 1999.

6. E.H. Bristol, "Improved Process Control Alarm Operations," ISA, 1999.

7. Yoshitaka Yuki & Jim Parks, "Alarm and Event Analysis for Detecting Productivity Bottlenecks," ISA, 1999.

8. "Use Critical Condition Management to Improve Your Bottom Line," ARC Strategies, ARC Advisory Group, April 2002.

9. Donald Campbell Brown & Manus O'Donnell, "Too Much of a Good Thing? Alarm Management Experience in BP Oil, Part I: Generic Problems With DCS Alarm Systems," www.asmconsortium.com.

10. Dick Perry, "Alarm Systems and Their Role in Abnormal Situation Management, Part II of IV," Instrument and Controls, SAIMC, July 2000, www.instrumentation.co.za.

11. C.T. Mattiasson, "The Alarm System From the Operator Perspective," ASM Consortium, www.asmconsortium.com.

12. D. Shook, "Alarm Management White Paper," Matrikon.

William L. (Bill) Mostia Jr., PE, of Exida, League City, Texas, has more than 25 years' experience applying safety, instrumentation, and control systems in process facilities. He may be reached at [email protected].