Design for safety instrumented systems reliability

The concept of "Design for SIS Reliability" evolved from a paper, “Will the Real Reliability Stand Up?,” I gave at the 72nd annual Instrumentation Symposium for the Process Industries (2017) at Texas A&M University. It explored the concept that we've gotten off the track of good reliability engineering design by concentrating too much on designing to the ANSI/ISA 84.00.02/IEC 61511 standards: emphasizing SIS safety integrity level (SIL), selecting equipment using third-party approval and designing to the SIS standards, while not putting enough emphasis on good reliability engineering techniques and practices.

Design for Reliability is not a new concept, and it's been used in product development in other industries. In the book, The Professional's Guide to Maintenance and Reliability Terminology, authors Ramesh Gulati, Jerry Kahn and Robert Baldwin define Design for Reliability as “improving the reliability of an asset, process or product by using reliability analysis techniques to design out potential problems during the development phase,” where reliability analysis techniques include, for example, failure modes and effects analysis (FMEA), design FMEA (DFMEA) and reliability by design (RBD).

Much of the design of SISs relies on traditional instrument design and meeting the SIS standards. Designing to meet standards is a top-down design methodology, which is prone to a syndrome similar to “teach to the test,” where the standard is treated as the maximum and the design is the minimum to meet the standard: no less, no more. The basic concept of Design for Reliability as applied to SIS is a bottom-up, holistic approach that is not a minimum; it removes or minimizes the effects of failure modes, and improves reliability by using reliability engineering techniques during the design phase. In addition, due to the lifecycle nature of process industry safety systems, the concept will be extended to using Design for Reliability techniques throughout the safety system’s lifecycle to sustain and improve reliability, especially if the design proves inadequate for the application or any of the design assumptions prove to be invalid. The concepts and principles in this article can also be applied to designing instrument and control systems (I&CS) and other safety systems such as safety controls, alarms and interlocks (SCAI).

A little history

Over the past three decades or so, development of the SIS standards ANSI/ISA S84 and later IEC/ISA 61511 has driven SIS design. SIS design practices have evolved from relay logic to general-purpose programmable logic controllers (PLC), and later to specialized safety PLCs; and from pneumatic process transmitters and “dumb” electronic transmitters to “smart” digital transmitters. Along the way, we moved from pneumatic process controls to electronic and digital single-loop controllers, and onward to modern distributed control systems (DCS).

Safety systems, commonly called emergency shutdown systems (ESD) in those days, really came to the industry forefront in the late 1970s and early 1980s, when the PLC was introduced into the process industries. Once its advantages were seen, the PLC was quickly applied to discrete and sequential controls and as a programmable substitute for relay logic safety systems. Prior to that, emergency shutdown systems were fairly simple and consisted of relay and pneumatic logic. Concerns regarding the use of the PLC in safety systems rapidly became apparent, and the ISA S84 standard committee was formed in 1984 to address the matter. A driving concern was to ensure that people took care in applying this new PLC technology to safety systems. In the late 1980s and early 1990s, the ISA 84 committee realized the importance of field instrumentation in achieving reliable safety systems, and the standard was expanded to include the design of field instrumentation.

In March 1996, ISA took a major step for SIS when it issued the ISA S84 standard (later ANSI/ISA 84.01), “Application of Safety Instrumented Systems for the Process Industries.” It introduced concepts such as dividing failures into two types, random and systematic, where random failures are spontaneous failures of a component, and systematic failures are due to mistakes and errors of omission and commission in safety lifecycle activities. The new SIS standard also defined reliability as the “probability that a system can perform a defined function under stated conditions for a given period of time.” While reliability can be considered from a qualitative perspective, this definition generally implies a mathematical approach to reliability through the use of the term “probability,” rather than a more holistic reliability engineering approach of eliminating unreliability in a design. I always liked the qualitative definition I learned in the U.S. Marine Corps: “Works fine. Lasts a long time.”

ISA S84 also introduced the SIL concept, which is defined as: “One of three possible discrete integrity levels (SIL 1, SIL 2, SIL 3) of SISs. SILs are defined in terms of probability of failure on demand (PFD).” However, it's interesting to note that S84-1996 did not explicitly require a quantitative assessment of SIL (B14.1)—that came later. The PFD was calculated in the form of PFD average (PFDavg), and was based on the component architecture, statistical random failure rates of the components in the safety loop, the proof test interval, and the mean time to repair (MTTR). While there are interesting aspects in the evolution of safety system design, some that illustrate challenges in designing to the standard concept are:

Random vs. systematic. Much of the emphasis in design of SISs has involved meeting SIL, and centered on the random side of things. The systematic side was generally expected to be covered if you followed the safety lifecycle. This was probably because it's easier to calculate a relatively simple safety metric than to deal with the mostly non-random human errors that comprise systematic errors.
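To show just how simple that random-side metric can be, here is a minimal sketch of a PFDavg calculation for a single device. It assumes the commonly quoted simplified 1oo1 approximation (dangerous undetected failure rate times half the proof test interval, plus dangerous detected failure rate times MTTR) and the usual low-demand PFDavg decade bands for SIL 1 through SIL 3. The function names and the example numbers are hypothetical, chosen only for illustration; they are not taken from any standard or data set.

```python
HOURS_PER_YEAR = 8760


def pfd_avg_1oo1(lambda_du, test_interval_h, lambda_dd=0.0, mttr_h=0.0):
    """Simplified average PFD for a single (1oo1) device.

    lambda_du: dangerous undetected failure rate, per hour (assumed value)
    lambda_dd: dangerous detected failure rate, per hour (assumed value)
    Common cause, proof test coverage and systematic failures are ignored.
    """
    return lambda_du * test_interval_h / 2 + lambda_dd * mttr_h


def sil_band(pfd_avg):
    """Map PFDavg to the three SIL levels discussed in the article.

    Returns 0 when the result does not reach SIL 1. Anything better than
    1e-3 is reported as SIL 3 (this sketch caps the scale at three levels).
    """
    if pfd_avg < 1e-3:
        return 3
    if pfd_avg < 1e-2:
        return 2
    if pfd_avg < 1e-1:
        return 1
    return 0


# Hypothetical device: 2e-6 dangerous undetected failures per hour,
# 1e-6 dangerous detected failures per hour, annual proof test, 8 h MTTR.
pfd = pfd_avg_1oo1(2e-6, HOURS_PER_YEAR, lambda_dd=1e-6, mttr_h=8)
print(f"PFDavg = {pfd:.2e}, SIL band = {sil_band(pfd)}")
```

Note what is missing from the arithmetic: nothing in it accounts for a transmitter left isolated after a test, a wrong setpoint, or a plugged impulse line.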

Selection of equipment. In 1996, ISA S84 required equipment to be “user approved,” but later in the first edition of 61511, two choices were given: equipment approved to standard IEC 61508 (by third parties) or “proven-in-use” (2004-15), and later on (in IEC 61511 2nd edition-2015) by “prior use.” Because of the difficulty in documenting “proven-in-use” or “prior use,” many users and specifiers, particularly E&C firms, chose to specify equipment third-party-approved to IEC 61508.

The third-party approvers to IEC 61508 are for-profit firms, which raises a natural conflict of interest, and the question of what standards, other than the approval standards, these firms should comply with to make sure approval practices are performed correctly. Who guards the guards?

Also, we're now getting a good sample of approved equipment and their reports, and people are starting to question some failure rate numbers and some statements in the reports. One of the approval agencies has even questioned the failure rate numbers of another because there was an order of magnitude difference in the failure rates—obviously, the one complaining wasn’t the one with the best (lowest) failure rate number.

Definition of reliability. While the 1996 version of the ANSI/ISA 84 standard had a definition of reliability in it, ANSI/ISA-84.00.01-2004 Part 1 (IEC 61511-1 Mod) 1st edition and 2nd edition don't define reliability. Common use of the term “reliability” for SIS in industry evolved to refer to spurious failures and trip rates, possibly from a reference to the SIS having an overt failure and failsafe action, but it seems to have left out the covert failure aspect of reliability. Reliability covers both safety integrity and spurious failures, and SIL and spurious trip rate (STR) are simply reliability metrics.
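The point that safety integrity and spurious trips are two sides of the same reliability coin can be illustrated with a small comparison of voting architectures. The sketch below assumes the widely quoted simplified approximations for identical devices, with common cause failures, diagnostics and systematic failures all ignored. The function names and failure rates are hypothetical and serve only to show the trade-off, not to recommend an architecture.

```python
def pfd_avg(arch, lambda_du, test_interval_h):
    """Simplified PFDavg for identical devices (no common cause)."""
    x = lambda_du * test_interval_h
    formulas = {
        "1oo1": x / 2,
        "1oo2": x ** 2 / 3,
        "2oo2": x,
        "2oo3": x ** 2,
    }
    return formulas[arch]


def spurious_trip_rate(arch, lambda_s, mttr_h):
    """Simplified spurious trip rate, per hour (no common cause)."""
    formulas = {
        "1oo1": lambda_s,
        "1oo2": 2 * lambda_s,
        "2oo2": 2 * lambda_s ** 2 * mttr_h,
        "2oo3": 6 * lambda_s ** 2 * mttr_h,
    }
    return formulas[arch]


# Hypothetical rates: 2e-6/h dangerous undetected, 5e-6/h safe (spurious),
# annual proof test (8760 h), 8 h mean time to repair.
for arch in ("1oo1", "1oo2", "2oo2", "2oo3"):
    covert = pfd_avg(arch, 2e-6, 8760)
    overt = spurious_trip_rate(arch, 5e-6, 8) * 8760
    print(f"{arch}: PFDavg = {covert:.2e}, spurious trips/yr = {overt:.2e}")
```

Under these assumptions, 1oo2 improves PFDavg at the cost of doubling the spurious trip rate, 2oo2 does the opposite, and 2oo3 sits between them on both counts. Looking at only one column gives half the reliability picture.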

It's interesting that the number of books on design of SIS can probably be counted on one hand, and most are geared toward designing to meet the standards rather than the broader goal of designing a reliable SIS. There are probably even fewer books on design of reliable instrument and control systems.

This lack of emphasis on reliability has led designers to concentrate on the easier task of putting together an SIS by meeting SIL and by selecting approved equipment rather than dealing with more difficult aspects of reliability engineering. These include eliminating failure modes (both random and systematic) by using systematic methodologies such as FMEA; determination and use of more reliable components; reliability of process interface designs for low-demand service; attacking lifecycle issues (for example, many accidents involve instruments and controls not working for long periods of time); doing detailed modeling of failure modes for components used in SIS to help in the selection of equipment; minimizing failures in design; and developing more detailed SIS good design practices.

Compromise by consensus. Consensus standards are compromises because of disagreements among committee members or vested interests. Compromises are commonly reached by using “weasel” words, making the standard vague, and/or leaving out areas of disagreement. The ISA 84 committee has written technical reports that provide more guidance on the standard, provide lessons learned, and address areas the main standard has not covered, such as digital communication, wireless, fire and gas (F&G), and cybersecurity. Nevertheless, there are many areas where people can pretty much do what they want, or avoid what they don’t want to do with little justification.

Much of design is geared to the functional or success side, such as a design doing what it's designed to do. Less is geared toward the failure or unsuccessful side of design, e.g. how to design to eliminate, prevent, tolerate or minimize the effects of failures. This is where design for reliability comes in, particularly in safety systems, where a covert failure can hide for years in a low-demand system before becoming a dangerous failure when a demand occurs. For low-demand safety systems, eliminating or minimizing failure modes is key to having a reliable system.

Design for reliability engineering concepts

The definition of design for reliability given above referred to using “reliability analysis techniques” to improve the system design, which I expanded to include using these techniques for improvement over the system lifecycle. Some of these concepts and principles are:

Functional/reliability design. Engineers primarily learn functional design in college. Functional design is designing to perform a specified function. They learn less about designing to minimize failures and designing to last a long time (dependability, reliability, etc.). Once out of college, engineers initially learn from mentorship and by copying what engineers have done before them, with some training typically thrown in. If the engineers had good mentors and examples, they're on their way to being good engineers. If not, it can be an uphill battle. This is where standards based on consensus best practices can help develop reliable, consistent SIS designs.

Membership in professional societies, participation in the standard-making processes, attending symposiums, training and other professional opportunities can help engineers improve their SIS design skills. Never turn down a chance to learn something new. However, it's difficult for anyone to learn to be a good engineer based on experience alone. You need it, but you simply don't have enough time to get all the experience you need. The same goes for mistakes, which are typically some of the best learning experiences. You don't have the time to make all the mistakes, so learn from your mistakes and from the mistakes of others at every opportunity.

One of the most common failures of a functional design is when the pre-design or front end engineering design (FEED) stage fails to specify the correct or full functional design specification, which leads to an incorrect or poor design.

Failure analysis. Failure analysis methodologies include analysis of potential failure modes using structured analysis tools such as FMEA; DFMEA; failure modes, effects and diagnostic analysis (FMEDA); and failure modes, effects and criticality analysis (FMECA). FMEDA, FMECA and DFMEA are variations of the original FMEA methodology, where the technique is expanded for additional analysis coverage. The original FMEA and FMEDA methodologies can stand in good stead for SISs. FMEDA is commonly used today in the SIS world for random hardware failures. The area where we need to expand FMEA use is for systematic errors, failures and faults.
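As a rough illustration of how an FMEA worksheet can put random and systematic failure modes on the same page, here is a minimal sketch. It assumes the common FMEA convention of ranking by a risk priority number (severity × occurrence × detection, each scored 1 to 10), which is one of several ranking schemes in use. The entries and scores are hypothetical examples and not data from any real analysis.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str
    mode: str
    kind: str        # "random" or "systematic"
    severity: int    # 1 (minor) to 10 (catastrophic), assumed scale
    occurrence: int  # 1 (rare) to 10 (frequent), assumed scale
    detection: int   # 1 (always detected) to 10 (never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical worksheet rows for one safety loop.
worksheet = [
    FailureMode("Transmitter", "left in wrong range after calibration",
                "systematic", 8, 3, 8),
    FailureMode("Solenoid valve", "stuck in energized position",
                "random", 9, 4, 7),
    FailureMode("Impulse line", "plugged", "random", 8, 5, 6),
]

# Highest RPN first: these are the failure modes to design out first.
ranked = sorted(worksheet, key=lambda fm: fm.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:4d}  {fm.kind:10s}  {fm.item}: {fm.mode}")
```

The value of the exercise is less in the numbers than in forcing each failure mode, including the human-caused ones, to be written down and given a design response.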

Failure mechanisms. Once we identify the failure modes to analyze, we need to investigate how things fail (failure physics) relative to their service. Most failure analysis in the SIS world is done at the macro level. For example, for a solenoid, we might look at a “stuck” failure as a failure mode. In some cases, we may go further and consider instrument air issues (H₂O, particulates, etc.), temperature cycling, and outlet pluggage as some of the failure mechanisms. Most people would typically not look at what types of solenoid designs (plug, diaphragm, spool, etc.) are best to prevent or minimize these types of failures; what external influences act on these types of failures; which design is best for a low-demand system; or what design or technology best minimizes failures in a low-demand system that has a long test interval. Another area where exploration of failure modes and mechanisms is needed is safety block valves. We see all types of valves and actuators used in safety systems, but which ones are best for low-demand service or long proof test intervals? We need to know more specific information about these types of failure mechanisms for SIS instrumentation in low-demand service, and in low-demand service with long test intervals, where instruments fail while they just sit and wait. Remember Newton’s First Law: “Things at rest tend to stay at rest.”

Another area where failure mechanisms are important is the instrument interface to the process. For example, pluggage is a common potential problem. What are the failure mechanisms, and how do we reduce the potential for pluggage within the defined test interval (purging, heat tracing, larger interface path, remote seals, etc.)? How does the process interface affect the combined sensor failure rates, and how do we eliminate or reduce the effects of failures in these interfaces?

One way to analyze failures is to provide failure mechanism models for devices to see where we can better address the device failure modes: internal and external, random and systematic. Figure 1 is an example of a macro failure mode model for a process transmitter, which can be used to start a failure mechanism analysis.

Robustness. The ability to withstand or overcome adverse conditions is important in designing a system that will operate over long periods when the original operating conditions or assumptions may have changed or when unexpected operating conditions may occur. System failures can be illustrated by looking at a system’s strength compared to the stress a system may experience and the likelihood of failures (Figure 2).

A wise man named Ed Marszal once told me, “When in doubt, build it stout,” which illustrates the point that when we design things, we always do so under some level of uncertainty. It would follow from good engineering practice that we should reduce the uncertainty in the design and application, which will improve reliability, and design robust SISs relative to the level of uncertainty we're operating under, realizing that our system life expectancies are 20-plus years under varying operating conditions.
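The strength-versus-stress picture can be put into rough numbers with a classic stress-strength interference calculation. The sketch below assumes that strength and stress are independent and normally distributed, which is a simplification made only for illustration; real stresses and strengths are often skewed and drift over time. The function name and the example values are hypothetical.

```python
import math


def failure_probability(mean_strength, sd_strength, mean_stress, sd_stress):
    """Probability that stress exceeds strength.

    Assumes independent, normally distributed stress and strength, so the
    margin (strength - stress) is also normal and P(margin < 0) follows
    from the standard normal CDF.
    """
    margin = mean_strength - mean_stress
    spread = math.sqrt(sd_strength ** 2 + sd_stress ** 2)
    z = margin / spread
    # Standard normal CDF evaluated at -z, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical component: strength 100 +/- 5, stress 70 +/- 10 (same units).
baseline = failure_probability(100, 5, 70, 10)

# "Build it stout": raise the mean strength to 120.
stout = failure_probability(120, 5, 70, 10)

# Same stout design, but with stress uncertainty doubled.
uncertain = failure_probability(120, 5, 70, 20)

print(f"baseline  {baseline:.2e}")
print(f"stout     {stout:.2e}")
print(f"uncertain {uncertain:.2e}")
```

Two levers show up in this model: a larger margin between mean strength and mean stress, and less spread in either one. Building it stout works on the first lever, and reducing design uncertainty works on the second.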

Resilience is the ability of a system to recover quickly and continue operating even when there's an equipment failure, power outage, or other failure. Resilience can be designed into the hardware and software (e.g. fault tolerance, improved situational awareness, etc.), but in most I&CSs, resilience or its lack comes from humans, and is due to inherent human flexibility or inflexibility. Resilience is developed or increased by training, experience, ease of operation, effective procedures based on process and I&CS failure modes (FMEA is a useful technique for this), operational discipline, and the ability of the I&CS to give the operator clear situational awareness. The more reliable and effective the I&CS, the safer the plant. Operators must be taken into the safety equation because they can create safety as well as increase operating risk by their actions.

Maintainability is a measure of the ease and rapidity of restoring a system or equipment to operational status following a failure or fault. Maintainability is a design as well as maintenance property that can have a long-term effect on the reliability of SIS and I&CS. In many accidents, a common thread emerges of poor instrumentation maintenance and instruments out of service or broken. Poor maintenance can result from lack of resources, cutting corners, short stepping maintenance actions, incompetence, instrument technology level above local maintenance capability, unreliable instruments, poor design, and instruments that are hard to maintain. Several of these that stand out for our discussion are instrument technology level, unreliable instruments, and poor designs that increase the difficulty of maintenance and increase the amount of maintenance. Ease of maintenance is important to maximize the system availability, repair rate, and effective use of manpower. Instruments and SIS that are easy to maintain are more likely to be maintained successfully, and there's less chance of systematic errors. Some questions to be asked are:

• How does design affect maintainability?

• What electrical and mechanical designs improve maintainability?

• What environmental and cultural conditions affect maintainability?

• What should be done to help ensure that all the SIS and SCAI system instruments are in good repair and operating properly?

Identifying instrument bad actors and common cause failure conditions through root cause analysis is important to maintaining and improving reliability through the systems’ lifecycle. In some cases, a redesign may be necessary to maintain the SIS's required reliability (safety integrity, spurious failures, and other failures/reliability issues). Designing reliable SISs must consider maintainability at the design stage and throughout the system lifecycle.

Operability is the ease with which the operator can operate the SIS. The safety instrumented function (SIF) typically operates automatically without operator intervention. This doesn't mean that the operator does not have interactions and responsibilities through the SIS operational/status HMI and maintenance screens. The more effective the HMI screens are, the more effective the operator’s interaction with the SIS. This involves using modern HMI design practices and consideration of SIS situational awareness for the operator.

SIL and safety availability. SIL is a key safety metric for SIS, and has substantial coverage in the technical literature. One of the key parameters is failure rate. There is a wide numerical range of failure rates available (typically spanning several orders of magnitude), which allows “cherry picking” of numbers to get the results that some people may desire. There are still many types of instruments that have sparse failure rate data sets, particularly in the electrical equipment area. The Instrument Reliability Network (IRN) is attempting to address this issue by collecting failure rate data over time from the network’s membership. There still needs to be more validation that SIL-based statistical calculations adequately predict the level of reliability to provide the risk reduction we expect to achieve.
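A quick way to see why the choice of failure rate matters so much is to turn the calculation around and ask how long the proof test interval may be for a given PFDavg target. The sketch below assumes the simplified single-device relation PFDavg ≈ λDU × TI / 2 and nothing else (no common cause, no test coverage). The target and the three candidate failure rates are hypothetical and stand in for the spread one might find across data sources.

```python
HOURS_PER_YEAR = 8760


def max_test_interval_years(lambda_du_per_h, pfd_target):
    """Longest proof test interval that still meets the PFDavg target.

    Rearranges the simplified 1oo1 relation PFDavg = lambda_du * TI / 2.
    """
    return 2 * pfd_target / lambda_du_per_h / HOURS_PER_YEAR


# Hypothetical target in the middle of the SIL 2 band.
PFD_TARGET = 5e-3

# Three candidate dangerous undetected failure rates for the "same"
# device, each a decade apart (assumed values for illustration).
candidates = {"optimistic": 1e-7, "middle": 1e-6, "pessimistic": 1e-5}

for label, rate in candidates.items():
    years = max_test_interval_years(rate, PFD_TARGET)
    print(f"{label:12s} lambda_du = {rate:.0e}/h -> test every {years:.2f} years")
```

In this simplified model, every decade of optimism in the failure rate buys a tenfold longer test interval on paper, so the data source behind the number deserves as much scrutiny as the calculation itself.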

Sustainability is maintaining the reliability and functionality of the SIS throughout its useful life. Part of sustainability is a long, useful life, which is not currently a big design consideration for SIS. The designers typically accept the manufacturer’s useful life or ignore it as a design or operating parameter. Obviously, we want useful life to be as long as is reasonably possible, so it's an important consideration. Some questions on this issue are:

• What do we do when we reach the end of the manufacturer’s useful life specification, but do not appear to be in wearout?

• At or near the end of useful life, do we replace with new or rebuild?

• If the manufacturer’s useful life is three to five years, what do we do at three years?

• Which designs or technology types have longer useful lives?

• How does demand service affect useful life?

• How does a device in low-demand service “wear out”?

Testability. SIS and SCAI systems are distinctly different from most I&CSs because they are required to be periodically proof-tested. Design of the SIS and SCAI systems must be such that testing can be accomplished at the desired/required frequency with good efficiency. The ease of testing reduces the time required to accomplish the proof test. Instruments or systems that are easy to test are more likely to have all the test steps performed successfully, and there's less chance of systematic errors. If on-line testing is required, being able to safely and effectively test the system with a minimum outage is a key design consideration. Remember that successful testing is a key action to maintaining the reliability of SIS and SCAI systems.

Conclusions

For SIS designers to provide more reliable designs, we must use reliability engineering techniques and other good engineering practices. We must take the holistic perspective that reliability is not only meeting the SIS standards, but also eliminating failures and taking care of all the little things that make a system reliable day in and day out. Eliminating or minimizing the effects of hardware failures and systematic failures that occur during the pre-design and design stages, as well as downstream of the design stage, is key to improving the reliability of the SIS throughout its lifecycle. More reliability information and data are needed regarding failures and instrument designs specific to low-demand service and environments, including long-test-interval failure modes. Also needed is information on which instrument designs are more effective in low-demand service. Design and lifecycle techniques should be used to make sure that instrument reliability is maintained over the system life. The concept of design for SIS reliability using reliability engineering practices and good engineering practices can be used to help ensure that we design more reliable SISs and maintain them over the SIS lifecycle.