A lasting plan for managing alarms

Alarm management seems to be a never-ending task: company X hires consultant Y to repair their badly designed and overloaded alarm management system for $1 million, and then 18 months later the same company is searching for a different vendor to help them with their poorly designed and overloaded alarm management system.

Many companies start out on the right track. They hear that standards and guidelines like EEMUA 191 describe a lifecycle model, and that the first step in that model is an alarm philosophy document. So they pay someone to write one, but the finished document is useless to the alarm rationalization process because it doesn't explain how to address the common issues whose resolution saves time and has the biggest impact.

An alarm philosophy is a policy with rules and guidelines that can be enforced. An effective philosophy document will guide the rationalization process and describe procedures that will keep alarms under control on a continuing basis. The following are the key elements.

Define and prioritize alarms

An alarm philosophy document must define the difference between what is and what isn’t an alarm. In rationalization meetings, someone often convinces the team that it needs to keep an alarm that clearly does not meet the criteria, so this high-level information must be easily extracted, put on a wall poster and be constantly in front of the rationalization team.

Non-alarms that provide useful information for the operators are called notifications or user alerts. They are not alarms, and during an abnormal operating condition or emergency, they can be silenced and ignored.

The philosophy should have a clear and practical method for the rationalization team to prioritize alarms. It should be easy to use and help the team determine the time to respond to the alarm as well as capture the consequences of failing to respond in a timely manner.
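As an illustration of such a method, the sketch below combines the time available to respond with the consequence of failing to respond, and looks up a priority in a small matrix of the kind that could go on a wall poster. The severity categories, time bands and priority names are hypothetical and would be replaced by the values in your own philosophy.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical consequence categories for failing to respond in time."""
    MINOR = 1
    MAJOR = 2
    SEVERE = 3


def urgency(minutes_to_respond: float) -> int:
    """Map the available response time to an urgency band (assumed bands)."""
    if minutes_to_respond <= 0:
        raise ValueError("time to respond must be positive")
    if minutes_to_respond < 5:
        return 3
    if minutes_to_respond < 15:
        return 2
    return 1


# Assumed matrix: one row per severity, one column per urgency band 1..3.
PRIORITY_MATRIX = {
    Severity.MINOR: ("low", "low", "medium"),
    Severity.MAJOR: ("low", "medium", "high"),
    Severity.SEVERE: ("medium", "high", "high"),
}


def assign_priority(severity: Severity, minutes_to_respond: float) -> str:
    """Look up the alarm priority from consequence and time to respond."""
    return PRIORITY_MATRIX[severity][urgency(minutes_to_respond) - 1]
```

Keeping the matrix as plain data means the rationalization team can review and change it without touching the lookup logic.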

Set appropriate limits

Many alarm systems fail for lack of an engineering solution for selecting alarm limits (setpoints). No one method can address this issue. Some alarms are recommended by equipment providers and are designed to keep equipment working within healthy, normal operating conditions. These are difficult to change, and changing them risks the loss of warranty.

Some alarms are derived from the process operating envelope. Initially, the envelope is often undefined, so process engineers use their best judgment to determine set and trip points. After a period of operation, historical trends can help determine more accurate limits. A new analytical tool called CVE produced by PPCL does this using "Geometric Process Control" (Figure 1).

Geometry determines limits

Figure 1: PPCL’s CVE tool allows the rationalization team to see the operating envelope, how it changes based on operations and how the alarms are placed against this variability.
Credit: PPCL

The tool allows the rationalization team to see the operating envelope, how it changes based on operations, and how the alarms are placed against this variability. It allows users to identify Grade 1 product, then protect it based on tight limits. This is very powerful and not just for identifying the alarm limits—it can speed up a rationalization project significantly by clarifying alarm relationships and quickly identifying problems. The tool also opens the door to process improvements, equipment condition monitoring, problem solving and process stewardship.
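For readers without such a tool, the idea of deriving limits from historical trends can be illustrated with a much simpler stand-in. The sketch below is not the Geometric Process Control method used by CVE; it only takes percentiles of a single variable's history as candidate low and high limits. The function name and the default percentiles are assumptions, and the history passed in should come from periods of acceptable operation.

```python
from statistics import quantiles


def candidate_limits(history, low_pct=1, high_pct=99):
    """Suggest (low, high) alarm limits from historical values.

    low_pct and high_pct are whole-number percentiles (assumed defaults).
    The result is a starting point for review, not a final setpoint.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    if not (0 < low_pct < high_pct < 100):
        raise ValueError("percentiles must satisfy 0 < low < high < 100")
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]


# Example with made-up readings spread evenly from 0 to 100.
low, high = candidate_limits(list(range(101)))
```

A single-variable percentile ignores the relationships between variables that the operating envelope view exposes, so it is best treated as a first pass.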

Procedures should include a formal process to determine correct pressure, temperature, level and flow alarm setpoints for each alarm priority. The process should accommodate the need to adjust pressure and flow requirements based on the discovery of imminent integrity threats (e.g., discovery of immediate repair conditions during integrity assessments and notifications). It should also verify that field alarm setpoints are consistent with control center alarm setpoints, or document a rationale for any offset. (Some operators intentionally offset field and control room alarm setpoints so that controllers are alerted and can take action before critical field thresholds are breached.)
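The consistency check between field and control center setpoints lends itself to a simple automated audit. The sketch below flags any tag whose two setpoints differ without a recorded rationale. The record layout, field names and tag names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setpoint:
    """One alarm's setpoints as configured in both places (assumed layout)."""
    tag: str
    field_value: float
    control_center_value: float
    offset_rationale: str = ""  # free text; empty means none documented


def find_undocumented_offsets(setpoints, tolerance=0.0):
    """Return tags whose setpoints differ by more than tolerance
    and have no documented rationale for the offset."""
    return [
        sp.tag
        for sp in setpoints
        if abs(sp.field_value - sp.control_center_value) > tolerance
        and not sp.offset_rationale.strip()
    ]


# Made-up example data.
audit = find_undocumented_offsets([
    Setpoint("PI-100", 50.0, 50.0),
    Setpoint("PI-200", 50.0, 45.0, "Early warning for the controller"),
    Setpoint("PI-300", 50.0, 45.0),
])
```

In this example only PI-300 is reported, because PI-200 carries an intentional, documented offset.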

The philosophy document should describe the difference between managed alarms and unmanaged alarms. The methodologies and maintenance of safety-related alarms (managed) should be described and the minimum requirements defined. For example, the methodology may call for layers of protection analysis (LOPA), clearly defining “safety layer” or “layers of protection,” their contribution to safety, how it's guaranteed through mean time between failure (MTBF) and mean time to repair (MTTR), and what testing is required to meet the standards.

One of the least understood elements of alarm management is the time to respond to the alarm (Figure 2). Once a variable crosses the zone from "normal" into "abnormal" operations, the clock is started and the steps are sequential. The alarm parameter is set to alarm "unacknowledged." The response time for the operator to acknowledge seeing the alarm, which often involves just silencing the alarm, is the “acknowledge delay time.” The alarm state is then changed to "acknowledged" and the operator is then theoretically in the state we define as "detection."

Time to respond

Figure 2: Analysis of response must consider the maximum operator delay time. If the operator doesn't respond within this time, the consequences will be realized. This is critical for determining the required response time and setting alarm priority.
Credit: User Centered Design Services Inc.

In many cases, there is a delay between acknowledging the alarm and starting to diagnose the cause and the required correction. We call this the “operator response” delay. During this period, the operator uses the alarm name descriptor to understand the alarm, and may have to use the HMI to determine which of several potential problems has caused it. An operator who is unfamiliar with the alarm may have to refer to an alarm response worksheet, which is normally developed during the rationalization process.

Once the operator selects a course of action and makes adjustments, the process control system responds to the change request, but there is often a delay as the signal goes out to the field and, for example, operates a solenoid that causes a valve to move. This is known as the process dead time. After the valve has moved, it takes additional time for the adjusted flow, level, temperature or pressure to return to normal. This is called the process response delay or time. When the process variable crosses the normal operating line, the alarm is classified as "return to normal."

Figure 2 highlights the maximum operator delay time—if the operator does not respond within this time, the consequences will be realized. This is critical for determining the required response time and setting alarm priority. Much of this data can be obtained by reviewing historical trends and observing the alarm and the operator responses.
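The sequence described above can be reconstructed from timestamps in the alarm journal and process history. The sketch below splits one alarm event into the four delays named in the text and checks the operator's part against a maximum operator delay time. The field names are hypothetical, times are in seconds, and treating the operator's part as the interval from alarm to operator action is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmTimeline:
    """Timestamps in seconds for one alarm event (assumed field names)."""
    raised: float              # variable crosses into abnormal
    acknowledged: float        # operator acknowledges or silences
    action_taken: float        # operator makes the adjustment
    process_moving: float      # final element has actually moved
    returned_to_normal: float  # variable crosses back to normal


def response_segments(t: AlarmTimeline) -> dict:
    """Split the event into the sequential delays described in the text."""
    stamps = [t.raised, t.acknowledged, t.action_taken,
              t.process_moving, t.returned_to_normal]
    if stamps != sorted(stamps):
        raise ValueError("timestamps must be in sequence")
    return {
        "acknowledge_delay": t.acknowledged - t.raised,
        "operator_response_delay": t.action_taken - t.acknowledged,
        "process_dead_time": t.process_moving - t.action_taken,
        "process_response_time": t.returned_to_normal - t.process_moving,
    }


def operator_within_limit(t: AlarmTimeline, max_operator_delay: float) -> bool:
    """True if the operator acted within the maximum operator delay time."""
    return (t.action_taken - t.raised) <= max_operator_delay


# Made-up event: acknowledged after 20 s, action taken at 140 s.
event = AlarmTimeline(0.0, 20.0, 140.0, 170.0, 470.0)
segments = response_segments(event)
```

Running this over many historical events gives the distribution of each delay, which is the kind of evidence the rationalization team needs when setting response times and priorities.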

Rationalization procedure

Four topics are extremely important and should not be overlooked in the philosophy document:

Alarm management overview
Alarm management lifecycle
Alarm design principles
Alarm management rationalization methodology, including a risk matrix and wall posters.

The rationalization team can further develop and implement them through the following procedure:

Appoint an alarm champion in charge of enforcing the alarm philosophy and maintaining the system. Plants often record lots of data and many have invested in alarm management tools that provide statistical analysis including frequency of alarms, lists of duplicate alarms, bad actors, frequency of alarm floods, and many more interesting facts about the performance of your alarm system. Most of the systems can provide weekly, monthly and annual reports that can be analyzed to determine the quality of your alarm system and how it impacts operator performance.

Designate a responsible person to manage reporting and analysis, create action items, follow up on maintenance activities to ensure rationalization is in line with standards defined in your philosophy, and provide executive summaries to management on performance and progress. This person should ensure that alarm enforcement is working and that suppressed or shelved alarms are being managed as prescribed in the alarm philosophy. All elements of the alarm management system in the philosophy should be audited and continuously improved.
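Two of the statistics mentioned above, bad actors and alarm floods, can be sketched in a few lines for anyone who wants to check what a reporting tool is doing. The event format, function names and the flood threshold (more than 10 alarms in a 10-minute window) are assumptions for illustration; use the figures defined in your own philosophy.

```python
from collections import Counter


def bad_actors(events, top_n=10):
    """Return the most frequent alarm tags as (tag, count) pairs.

    events is an iterable of (timestamp_seconds, tag) tuples (assumed format).
    """
    counts = Counter(tag for _, tag in events)
    return counts.most_common(top_n)


def flood_windows(events, window_s=600, threshold=10):
    """Return start times of fixed windows holding more than threshold alarms."""
    buckets = Counter(int(ts // window_s) for ts, _ in events)
    return sorted(b * window_s for b, n in buckets.items() if n > threshold)


# Made-up journal: 12 alarms from one tag in under six minutes, then two more.
journal = [(i * 30.0, "FI-101") for i in range(12)]
journal += [(700.0, "TI-202"), (1300.0, "TI-202")]

worst = bad_actors(journal, top_n=1)
floods = flood_windows(journal)
```

Fixed windows are the simplest choice; a sliding window would also catch floods that straddle a window boundary.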

Train operator, engineering, safety, automation, management and HR personnel. For the philosophy to be successful, all plant personnel should be aware of the philosophy document. This includes engineering project managers, who often hire third parties to implement projects that involve adding new alarms. If not correctly managed, these projects can deliver a whole new batch of unrationalized alarms.

Companies often train only the operators or the initial team. Computer-based training (CBT) must be developed for new employees, along with a refresher training program to keep people up to date, so that when they join an alarm management project, the foundational investment you made in the philosophy and rationalization methodology is not wasted. The CBT should cover the use of the rationalization procedure, wall charts and a sample alarm rationalization exercise.

Provide tools for managing alarms, enforcing the philosophy and maintaining the system. The important considerations when selecting an alarm management analysis tool are:

Can I get my historical data into it without too much difficulty?
Does it allow me to visualize and analyze my problems?
Is it easy to generate the daily, weekly, monthly, quarterly and annual reports I need based on my philosophy?
Is the software easy to maintain as my system evolves?

Continue to have alarm review meetings with the operators for the life of the alarm system. At the end of the day, if the operators across all shifts don’t take ownership and keep up to date on what has been fixed and what progress is being made, the project will fail. They have to be able to see the benefits and get excited that this is something worth investing money and their time in.

Most projects like this fail for one of two reasons: lack of money or lack of resources to ensure the effectiveness of the rationalization team. We have been on many projects where only one or two operators are provided. Partway through the week, someone gets sick, one of those operators is pulled back to cover the night shift, and the team is left without the operator input it needs.

To ensure success, the project should be set up just like any other project that the company takes on. It should have goals; it should have identified and confirmed resources; it should have a project plan; and opportunities and potential problem areas should be identified upfront. Progress and progress reports should be tracked—it's important that individuals are given responsibility and held accountable like on any other project.

Implement dynamic alarming to manage upset conditions where alarm floods are inevitable. Many companies struggle with the concepts of dynamic alarming, and are confused about when to do it. Some do the basic alarm configuration and solve bad actors, and when they run out of steam, they turn to dynamic alarm techniques to address the more difficult issues such as alarm floods. It’s most important to first follow the rules and not skip any of the required steps: document every alarm, filling in the alarm response sheet discussed earlier. Where possible, integrate it into a pull-down menu on Level 3 of the HMI graphics. The result will be a reduction in alarm frequency.

One alarm tool manufacturer believes you should begin by grouping alarms around unit operations and setting the alarms based on operating modes; i.e., startup, normal operations, product change and shutdown. This will set up preconfigured alarm states and dynamically suppress or “auto shelve” the alarms based on plant state. However, automatically detecting plant state can be a challenge, so there has to be an operator override or operator instruction to confirm that plant state.

Auto-shelving can have a big impact. For example, if you detect that a unit operation such as a compressor has tripped, the associated alarms can be shelved as they are superfluous to the operator after the trip.
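A minimal sketch of state-based auto-shelving follows, assuming a lookup table from plant state to the alarms that become superfluous in that state, and an operator confirmation before anything is shelved. The state names, alarm tags and function name are hypothetical; a real alarm system would configure this in its own tools.

```python
# Assumed configuration: alarms that are superfluous in each plant state.
SHELVE_BY_STATE = {
    "compressor_tripped": {"C101_LOW_FLOW", "C101_LOW_DISCHARGE_PRESSURE"},
    "shutdown": {"C101_LOW_FLOW", "F200_LOW_FEED_RATE"},
}


def alarms_to_shelve(detected_state, operator_confirmed, active_alarms):
    """Return the active alarms that may be shelved for the detected state.

    Nothing is shelved unless the operator has confirmed the plant state,
    because automatic state detection can be wrong.
    """
    if not operator_confirmed:
        return set()
    return set(active_alarms) & SHELVE_BY_STATE.get(detected_state, set())


# Made-up example: the compressor has tripped and the operator confirms it.
active = {"C101_LOW_FLOW", "T205_HIGH_TEMP"}
shelved = alarms_to_shelve("compressor_tripped", True, active)
```

Here only C101_LOW_FLOW is shelved; T205_HIGH_TEMP is not listed for that state and stays visible to the operator.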

Put the philosophy to work

The alarm philosophy document is a policy, with rules and guidelines that should be enforced. To get the desired results:

Perform a staffing study to make sure you have the right number of operators and check to see if the workload is balanced.
Appoint an alarm champion, someone in charge of enforcing the alarm philosophy and maintaining the system.
Train the operators, engineers, safety, automation, managers and HR; they all need to know and understand the philosophy.
Provide tools for managing alarms, enforcing the philosophy, and maintaining the system.
Have continued alarm review meetings with the operators for the life of the alarm system.
Implement dynamic alarming to manage upset conditions where alarm floods are inevitable.

Make alarms manageable

The objective of an alarm philosophy is to control daily alarms and to reduce the size and frequency of alarm floods. When the system performs effectively, the operator is not burdened by the alarm system, and we can consider alarms to be within normal operations.

Our ultimate goal is to be able to demonstrate that the operator has the capacity to detect, diagnose and respond to alarms in a specified and timely manner to protect the plant, personnel and community from the consequences the alarms are designed to prevent.

Authors

Ian Nimmo is the owner and Stephen Maddox is a human factors design consultant at User Centered Design Services, Inc.