By Ian Nimmo, User Centered Design Services LLC
NO DOUBT EVERYONE who has a distributed control system (DCS) has encountered alarm management issues. The reason is simple: A DCS makes over-alarming all too easy. The DCS arrived in a marketplace that had enforced restrictions on alarms due to physical space limitations. To do anything with raw alarms was almost impossible; the only features found in the electro-mechanical alarm annunciator box were first-up alarm indication and the ability to suppress the alarm by removing the electronics from the box.
The DCS provided flexibility, allowing almost unlimited alarms, multiple alarm types, including bad-process-variable monitoring, and a host of options, from various filtering techniques to relational dynamic alarm rationalization. Unfortunately, the DCS came without any discipline or cost implications for adding alarms. To be fair to DCS manufacturers, they recognized the potential for over-alarming. Even the very first version of the Honeywell DCS, for instance, gave advice about configuring and prioritizing alarms that is not too different from current guidelines. However, no one read it or implemented this guidance.
The problem was compounded by lack of leadership and ownership, an issue that still exists at many sites. In the past, after process and equipment engineers specified alarms, the control engineers installing the DCS added alarms according to what could be done, not what should be done. Operators demanded alarms based on ease of monitoring, trying to make up for deficiencies in the human computer interface (HCI) and the loss of the big picture that occurred when the panel was replaced by a 15-in. keyhole window to the process.
Meanwhile, multidiscipline process hazard assessment teams conducting hazard and operability analyses added alarms to deal with deficiencies in the design. As a result, a process plant might well go from 150 physical alarms to 14,000 DCS alarms.
Management has been reluctant to pay for redesign of the alarm-management system and the HCI, feeling they had already paid for the design once and couldn't justify paying again. However, management knows that bad design has impacted operators' performance, even to the point of demanding extra staffing to deal with the flood of alarms that occurs during every disturbance. This has caused minor incidents because of operators missing critical information or making errors due to stress and overwork. Unfortunately, it often takes a big incident (such as the explosion and fires at the Texaco refinery in Milford Haven, Wales, which is quoted in all the alarm-management guidelines) to force companies to readdress the issue.
The Solution
How do we resolve this problem? The answer requires careful review of the causes we have just outlined and starts with clear ownership of the problem. Some sites have given a single person responsibility to manage alarms, while other sites have made a multidiscipline team responsible.
The key to success is to establish responsibility. Success should be based on performance, not just on the number of alarms eliminated but also on the improvement to operators' jobs and the impact on the running of the plant.
Any problem that involves costs, people and other resources clearly calls for project management. However, because of the lack of ownership and accountability and the absence of performance expectations, alarm-management projects rarely start with formal project management. Not surprisingly, a lot of these projects fail due to poor understanding of the scope of the problem, lack of resources and money, loss of momentum and no identifiable return on investment.
Many managers get frustrated with engineers because the engineers don't define the problem and the real cost implications of doing the project correctly. The engineers attack the problem without a plan and wonder why they don't get the support of the organization to address all the issues that surface.
Often, a frustrated manager starts an alarm-management project by engaging a control engineer. The manager thinks the problem is limited to control system configuration and that a few sit-down discussions with the operators will resolve it, an approach similar to that taken with other control system problems. The control engineer soon finds that some operators will not part with any alarms, even though the alarm system hampers their efforts during a disturbance.
The control engineer then involves a process engineer. However, they both are overwhelmed by the size and complexity of the problem. The control engineer reads recent articles about the subject and discovers that they need statistical tools to better understand what is happening within the alarm system.
So, they get trial copies of some software tools to assess the frequency of alarms and find that just 12 alarms produced 53% of the activations in the system, and one particular alarm caused 123 alarm activations in a four-hour period. They decided to focus on the Top 10 "bad actor" list, and put the remaining two on the list of the next 10.
The two engineers discovered that fixing just 10 of those 12 alarms was a challenging and time-consuming task. Some alarms required physical instrumentation modification. Some needed configuration changes that would demand a better understanding of why the alarm existed, what its limits should be and when the alarm is not useful and should be suppressed. Some of the alarms were just not necessary. Others required the alarm priority to be changed, which, in the context of the thousands of alarms, still would have little effect.
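The kind of frequency analysis the engineers ran with their trial software can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular commercial tool: the function name bad_actors, the tag names and the sample journal are all hypothetical, and a real alarm journal would be exported from the DCS rather than typed in.

```python
from collections import Counter


def bad_actors(activations, top_n=10):
    """Rank alarm tags by activation count.

    `activations` is an iterable of tag names, one entry per alarm
    activation (a hypothetical, simplified alarm journal).
    Returns a list of (tag, count, cumulative_share) tuples for the
    `top_n` most frequent tags, where cumulative_share is the fraction
    of all activations accounted for so far.
    """
    counts = Counter(activations)
    total = sum(counts.values())
    ranked = []
    running = 0
    for tag, count in counts.most_common(top_n):
        running += count
        ranked.append((tag, count, running / total))
    return ranked


# Hypothetical journal: three made-up tags, ten activations in total.
journal = ["FI101_LO"] * 6 + ["TI205_HI"] * 3 + ["PI310_HI"] * 1

for tag, count, share in bad_actors(journal, top_n=2):
    print(f"{tag:10s} {count:4d} {share:6.1%}")
```

Rerunning such a ranking after each round of fixes gives the team a simple way to see whether the worst offenders are actually shrinking.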
Critical Needs
This little exercise demonstrated to the engineers that to solve this alarm-management problem would require a multidiscipline team, with some members permanently on the team and others, such as rotating-equipment and programmable-logic-controller (PLC) specialists and other subject matter experts (SMEs), part-time as required. The experts would add knowledge of equipment, safety and environmental issues or process technology, which are the main reasons alarms exist. The software tool would prove essential throughout the project and the life-cycle of the alarm system; buying it would mean justifying a capital authorization.
The full-time team would need a process engineer, a control engineer, a supervisor with good experience of the plant and a couple of knowledgeable operators. Dealing with the first 10 alarms, the team would discover they required up-to-date documentation, including accurate piping and instrumentation diagrams (P&IDs), DCS data sheets, PLC ladder-logic diagrams, procedures, etc. In addition, the team often would need to call upon other resources either to validate the documentation or find missing information, which introduced costs, delays and resource planning issues.
The team also would discover that initially they could only get through a handful of alarms per day. After that, they realistically would be able to handle between 15 and 30 alarms daily, given the documentation checks, new documentation generation and management of change (MOC). One team found it could only do five alarms per day because of other commitments and that it would take more than three years just to get through the current alarm database.
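The arithmetic behind such a schedule estimate is simple enough to sketch. In the Python example below, the function name review_duration_years, the database size of 4,000 alarms and the figure of 250 working days per year are assumptions chosen for illustration; only the daily review rates of 5, 15 and 30 alarms come from the text.

```python
import math


def review_duration_years(alarm_count, alarms_per_day, working_days_per_year=250):
    """Estimate how long a full pass through the alarm database takes.

    Rounds up to whole working days, then converts to years using an
    assumed number of working days per year (250 is a placeholder).
    """
    if alarms_per_day <= 0:
        raise ValueError("alarms_per_day must be positive")
    days = math.ceil(alarm_count / alarms_per_day)
    return days / working_days_per_year


# Hypothetical database of 4,000 alarms at the review rates quoted above.
for rate in (5, 15, 30):
    years = review_duration_years(4000, rate)
    print(f"{rate:2d} alarms/day -> {years:.1f} years")
```

Running the numbers before the project starts makes the resource commitment visible to management rather than leaving it to be discovered later.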
Such a long-term, resource-dependent project with an initial capital expenditure requires formal project management. It is critical to success and should start with a program that addresses these questions:
- Who owns the problem?
- What is the purpose of the project?
- What results should it achieve?
- What resources are needed?
- What is the site's readiness to change existing practices?
- What other project-management issues related to time, cost, resources and other priorities might arise?
As part of this, the team should explicitly develop:
- A statement of the intent of the project
- The project's objectives and performance expectations
- The work breakdown structure
A Benchmark
Companies recently have put a lot of effort into resolving alarm-management issues, and the best of the best are doing this by first measuring a site's performance against a benchmark. This benchmark usually is derived from guidelines issued by the United Kingdom's Engineering Equipment and Materials Users Association (EEMUA) and has been broken down by one company into five classifications of performance:
- Overloaded
- Reactive
- Stable
- Robust
- Predictive.
Doing a gap analysis at the outset of the project can put the site's performance into perspective against the benchmark. It also provides the opportunity to set goals and milestones for improvement. The target depends upon how bad a problem the site has and how determined the company is to resolve it. This comes back to those initial project management questions: What's the purpose of the project and what results are to be achieved? Some firms just want to deal with the initial problem, others strive for a best-practice solution, whereas others might have a mandate from a regulator that determines the goal.
Most companies won't tolerate an overloaded system and don't want, but often have, a reactive system that is not useful during disturbances. So, the initial phase of the project usually is to achieve a stable system, which is defined as one that is reliable during normal operations and that provides some advance warning of a disturbance. It still has problems during a big disturbance, though. Achieving this may simply involve resolving alarm configuration problems, removing duplications and continuously addressing the Top 10 bad actors that contribute most to alarm floods. This phase often is the first milestone for the project team.
Attaining robust performance typically is handled as a separate phase. It requires another level of design that frequently involves dynamic alarm-management strategies and additional software, and thus capital. After all this investment in time and resources, companies want to see a return. So achieving a robust design may be the goal.
Some firms make reaching predictive performance a stretch goal, but only best-in-class companies attain it. Judging by the quality and content of its papers, BP has clearly demonstrated that predictive performance and the EEMUA-guidelines performance standard are achievable.
Realism
Estimating the cost of an alarm-management project can be difficult for an engineer who has never done it before. However, a wealth of knowledge and experience is available.
Some common omissions in estimates include:
- The cost of hiring an SME to get the project on the right track and of educating the whole team on best and adequate practices. It is very expensive to start an alarm-management project, fail and start over again.
- The cost of software tools that are necessary to analyze an alarm database and identify problems, document solutions and implement change.
- The internal cost of team members to participate in workshops, alarm objective analysis, DCS reconfiguration, modifications to PLC databases, changes to safety instrumented systems, fixing instrumentation and plant equipment, updates to training manuals, procedures and MOC systems, enhancements to the HCI, and process changes.
In addition, if the goal is to achieve predictive performance, the company may have to invest in state-estimation-prediction technology and new application software to do dynamic alarm handling.
Success depends upon getting the right people on the project; unfortunately, experts are always in high demand, especially in progressive process plants. Therefore, using people effectively should be part of the project plan. Some members of the project team may always take part, but others may be called upon for advice only as needed.
Learn From Failure
One alarm-management project failed due to lack of knowledge and the team's inability to read PLC ladder logic. A large number of alarms originated from the PLC and then went to a DCS for presentation to operators. The team was not able to determine the source of the alarms or understand their objectives. No one had identified the need for a PLC specialist on the team. This limited the ability to address the real issues, extended the time to achieve anything, and caused so much frustration that the team lost momentum and gave up.
Another team failed because of lack of correct information and knowledgeable personnel. A DCS manufacturer was employed to resolve alarm-management issues. Unfortunately, the P&IDs given to the vendor were out of date and equipment that was out of service was not flagged. The DCS database did not reflect plant changes and, as modifications to the plant occurred, tags were reused without changing the descriptors. The vendor used its budget rationalizing and resolving inconsistencies before realizing it had focused on resolving problems with out-of-service equipment. Any of the operators working in the plant would have picked up on these errors.
Most alarm-management projects have some degree of success because of the size and nature of the problems. However, few achieve the initial goal of resolving all of the alarm-management problems. This often is caused by poor estimation of the time required and by the complexity of the attendant effects of alarm management, such as the impact on procedures, training and the HCI.
Some of the classic examples of good alarm-management projects, such as the Woodside project in Karratha, Australia, have been implemented during the last five years, but there is still work to do. In alarm management, there is no such thing as a quick fix. Different companies address the problems from different angles. Some start fresh and add alarms based on objective analysis to a new, empty database. Others work with the problem database, reducing it slowly over time, first by removing duplication and then by removing unnecessary alarms. An experienced multidiscipline team usually can address about 30 alarms per day.
Too many companies do not understand that resolving alarm problems is an iterative process and involves data collection and analysis, fixing physical problems, identifying and eliminating unnecessary alarms, making enhancements to required alarms (such as filtering, suppressing or changing alarm type from deviation to physical or vice versa), creating and updating documentation, performing MOC reviews, implementing, testing and starting over.
Good alarm-management projects start by focusing on the end or life-cycle aspects, with a goal of transitioning to a new way to maintain good alarm-management practices.
Some of the reasons a project can fail are summarized in the sidebar. In addition, alarm-management projects commonly suffer from specific problems, including:
- Lack of state-of-the-art knowledge and reluctance to pay for an SME up front
- No understanding of the real problem
- Lack of ownership
- No formal project management
- Unreadiness to change existing practices
- Establishment of unrealistic goals
- Inadequate budget due to poor estimate of costs and required resources
- Lacking the right people at the right time
- Inaccurate, out-of-date information
- Unrealistic time expectations
- Focus on project, not life-cycle, aspects
- Inability to reward success and accept failure
This list certainly is not comprehensive.
A Winning Approach
Successful projects have a clear alarm philosophy that is well documented and understood by all disciplines. They also have team members with a good understanding of the EEMUA guidelines, even if they don't agree with all of them. The most successful projects have a realistic view of the size and types of problems to be encountered, a definite focus on return on investment, and a management mandate to solve the problem once and for all.
A successful project provides tangible benefits. The project manager can see a difference in the day-to-day operation of the plant. Equipment runs better, incidents are dramatically reduced and production targets are consistently achieved; the only alarm that sounds is one that requires an operator to take action.
Ian Nimmo is president and founder of User Centered Design Services, Anthem, Ariz., a firm that specializes in abnormal-situation management, alarm management and other control issues. E-mail him at [email protected].