The concept of “Design for SIS Reliability” evolved from a paper, “Will the Real Reliability Stand Up?,” I gave at the 72nd annual Instrumentation Symposium for the Process Industries (2017) at Texas A&M University. It explored the idea that we've gotten off the track of good reliability engineering design by concentrating too much on designing to the ANSI/ISA 84.00.02/IEC 61511 standards: emphasizing SIS safety integrity level (SIL), selecting equipment based on third-party approval, and designing to the SIS standards, while not putting enough emphasis on good reliability engineering techniques and practices.
Design for Reliability is not a new concept, and it's been used in product development in other industries. In the book, The Professional's Guide to Maintenance and Reliability Terminology, authors Ramesh Gulati, Jerry Kahn and Robert Baldwin define Design for Reliability as “improving the reliability of an asset, process or product by using reliability analysis techniques to design out potential problems during the development phase,” where reliability analysis techniques include, for example, failure modes and effects analysis (FMEA), design FMEA (DFMEA) and reliability by design (RBD).
Much of the design of SISs relies on traditional instrument design and meeting the SIS standards. Designing to meet standards is a top-down design methodology, which is prone to a syndrome similar to “teach to the test,” where the standard is treated as the maximum and the design is the minimum needed to meet the standard: no less, no more. The basic concept of Design for Reliability as applied to SIS is a bottom-up, holistic approach that goes beyond the minimum: it removes or minimizes the effects of failure modes, and improves reliability by using reliability engineering techniques during the design phase. In addition, due to the lifecycle nature of process industry safety systems, the concept will be extended to using Design for Reliability techniques throughout the safety system’s lifecycle to sustain and improve reliability, especially if the design proves inadequate for the application or any of the design assumptions prove to be invalid. The concepts and principles in this article can also be applied to designing instrument and control systems (I&CS) and other SISs such as safety controls, alarms and interlocks (SCAI).
A little history
Over the past three decades or so, development of the SIS standards ANSI/ISA S84 and later IEC/ISA 61511 has driven SIS design. SIS design practices have evolved from relay logic to general-purpose programmable logic controllers (PLC), and later to specialized safety PLCs; and from pneumatic process transmitters and “dumb” electronic transmitters to “smart” digital transmitters. Along the way, we moved from pneumatic process controls to electronic and digital single-loop controllers, and onward to modern distributed control systems (DCS).
Safety systems, commonly called emergency shutdown systems (ESD) in those days, really came to the industry forefront in the late 1970s and early 1980s, when the PLC was introduced into the process industries. Once its advantages became clear, the PLC was quickly applied to discrete and sequential controls and as a programmable substitute for relay logic safety systems. Prior to that, emergency shutdown systems were fairly simple and consisted of relay and pneumatic logic. Concerns regarding the use of the PLC in safety systems rapidly became apparent, and the ISA S84 standard committee was formed in 1984 to address the matter. A driving concern was to ensure that people took care in applying this new PLC technology to safety systems. In the late 1980s and early 1990s, the ISA 84 committee realized the importance of field instrumentation in achieving reliable safety systems, and the standard was expanded to include the design of field instrumentation.
In March 1996, ISA took a major step for SIS when it issued the ISA S84 standard (later ANSI/ISA 84.01), “Application of Safety Instrumented Systems for the Process Industries.” It introduced concepts such as dividing failures into two types, random and systematic, where random failures are spontaneous failures of a component, and systematic failures are due to mistakes and errors of omission and commission in safety lifecycle activities. The new SIS standard also defined reliability as the “probability that a system can perform a defined function under stated conditions for a given period of time.” While reliability can be considered from a qualitative perspective, this definition generally implies a mathematical approach to reliability through the use of the term “probability,” rather than a more holistic reliability engineering approach of eliminating unreliability in a design. I always liked the qualitative definition I learned in the U.S. Marine Corps: “Works fine. Lasts a long time.”
ISA S84 also introduced the SIL concept, which is defined as: “One of three possible discrete integrity levels (SIL 1, SIL 2, SIL 3) of SISs. SILs are defined in terms of probability of failure on demand (PFD).” However, it's interesting to note that S84-1996 did not explicitly require a quantitative assessment of SIL (B14.1); that came later. The PFD was calculated in the form of PFD average (PFDavg), and was based on the component architecture, statistical random failure rates of the components in the safety loop, the proof test interval, and the mean time to repair (MTTR). While there are many interesting aspects in the evolution of safety system design, some that illustrate the challenges of the design-to-the-standard concept are:
Random vs. systematic. Much of the emphasis in design of SISs has involved meeting SIL, and centered on the random side of things. The systematic side was generally expected to be covered if you followed the safety lifecycle. This was probably because it's easier to calculate a relatively simple safety metric than to deal with the mostly non-random human errors that comprise systematic errors.
Selection of equipment. In 1996, ISA S84 required equipment to be “user approved,” but later in the first edition of 61511, two choices were given: equipment approved to standard IEC 61508 (by third parties) or “proven-in-use” (2004-15), and later on (in IEC 61511 2nd edition, 2015) by “prior use.” Because of the difficulty in documenting “proven-in-use” or “prior use,” many users and specifiers, particularly E&C firms, chose to specify equipment third-party-approved to IEC 61508.
The third-party approvers to IEC 61508 are for-profit firms, which raises a natural conflict of interest, and the question of what standards, other than the approval standards, these firms should comply with to make sure approval practices are performed correctly. Who guards the guards?
Also, we're now getting a good sample of approved equipment and their reports, and people are starting to question some failure rate numbers and some statements in the reports. One of the approval agencies has even questioned the failure rate numbers of another because there was an order of magnitude difference in the failure rates—obviously, the one complaining wasn’t the one with the best (lowest) failure rate number.
Definition of reliability. While the 1996 version of the ANSI/ISA 84 standard had a definition of reliability in it, ANSI/ISA-84.00.01-2004 Part 1 (IEC 61511-1 Mod) 1st edition and 2nd edition don't define reliability. Common use of the term “reliability” for SIS in industry evolved to refer to spurious failures and trip rates, possibly from a reference to the SIS having an overt failure and failsafe action, but it seems to have left out the covert failure aspect of reliability. Reliability covers both safety integrity and spurious failures, and SIL and spurious trip rate (STR) are simply reliability metrics.
It's interesting that the number of books on design of SIS can probably be counted on one hand, and most are geared toward designing to meet the standards rather than the broader goal of designing a reliable SIS. There are probably even fewer books on design of reliable instrument and control systems.
This lack of emphasis on reliability has led designers to concentrate on the easier task of putting together an SIS by meeting SIL and by selecting approved equipment rather than dealing with more difficult aspects of reliability engineering. These include eliminating failure modes (both random and systematic) by using systematic methodologies such as FMEA; determination and use of more reliable components; reliability of process interface designs for low-demand service; attacking lifecycle issues (for example, many accidents involve instruments and controls not working for long periods of time); doing detailed modeling of failure modes for components used in SIS to help in the selection of equipment; minimizing failures in design; and developing more detailed SIS good design practices.
Compromise by consensus. Consensus standards are compromises because of disagreements among committee members or vested interests. Compromises are commonly reached by using “weasel” words, making the standard vague, and/or leaving out areas of disagreement. The ISA 84 committee has written technical reports that provide more guidance on the standard, provide lessons learned, and address areas the main standard has not covered, e.g., digital communication, wireless, fire and gas (F&G), and cybersecurity. Nevertheless, there are many areas where people can pretty much do what they want, or avoid what they don’t want to do, with little justification.
Much of design is geared to the functional or success side, that is, making the design do what it's designed to do. Less is geared toward the failure or unsuccessful side of design, e.g., how to design to eliminate, prevent, tolerate or minimize the effects of failures. This is where design for reliability comes in, particularly in safety systems, where a covert failure can hide for years in a low-demand system before becoming a dangerous failure when a demand occurs. For low-demand safety systems, eliminating or minimizing failure modes is key to having a reliable system.
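To put rough numbers on how a covert failure can hide in a low-demand system, here is a minimal sketch of the PFDavg arithmetic for a single-channel (1oo1) safety loop. It assumes a common simplified approximation, PFDavg ≈ λDU × (TI/2 + MTTR) + λDD × MTTR, which is not given in this article and is only reasonable when failure rates are small relative to the proof test interval. The function names, the failure rates and the decade-wide SIL 1 to SIL 3 bands are illustrative assumptions, not vendor or certification data.

```python
HOURS_PER_YEAR = 8760


def pfd_avg_1oo1(lambda_du, lambda_dd, proof_test_interval_h, mttr_h):
    """Simplified average probability of failure on demand for one channel.

    lambda_du: dangerous undetected (covert) failure rate, per hour
    lambda_dd: dangerous detected failure rate, per hour
    Assumed approximation, valid only for small failure rates.
    """
    # A covert failure stays hidden, on average, for half the test interval,
    # plus the time needed to repair it once the proof test finds it.
    undetected = lambda_du * (proof_test_interval_h / 2 + mttr_h)
    # A detected failure is unavailable only while it is being repaired.
    detected = lambda_dd * mttr_h
    return undetected + detected


def sil_band(pfd_avg):
    """Map PFDavg to SIL 1, 2 or 3 using assumed decade-wide bands.

    Returns None when the value falls outside those three bands.
    """
    for sil, lower, upper in ((3, 1e-4, 1e-3), (2, 1e-3, 1e-2), (1, 1e-2, 1e-1)):
        if lower <= pfd_avg < upper:
            return sil
    return None


# Hypothetical loop data, chosen only for illustration.
LAMBDA_DU = 2e-6  # dangerous undetected failures per hour
LAMBDA_DD = 0.0   # no diagnostics credited in this example
MTTR_H = 8        # hours to repair

for years in (1, 5):
    pfd = pfd_avg_1oo1(LAMBDA_DU, LAMBDA_DD, years * HOURS_PER_YEAR, MTTR_H)
    print(f"Proof test every {years} yr: PFDavg = {pfd:.4f}, SIL band = {sil_band(pfd)}")
```

With these assumed numbers, stretching the proof test interval from one year to five moves the same hardware from the SIL 2 band to the SIL 1 band, simply because an undetected dangerous failure has longer to hide. Note that the calculation says nothing about systematic failures, which is the gap this article is concerned with.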
Design for reliability engineering concepts
The definition of design for reliability given above referred to using “reliability analysis techniques” to improve the system design, which I expanded to include using these techniques for improvement over the system lifecycle. Some of these concepts and principles are:
Functional/reliability design. Engineers primarily learn functional design in college. Functional design is designing to perform a specified function. They learn less about designing to minimize failures and designing to last a long time (dependability, reliability, etc.). Once out of college, engineers initially learn from mentorship and by copying what engineers have done before them, with some training typically thrown in. If the engineers had good mentors and examples, they're on their way to being good engineers. If not, it can be an uphill battle. This is where standards based on consensus best practices can help develop reliable, consistent SIS designs.
Membership in professional societies, participation in the standard-making processes, attending symposiums, training and other professional opportunities can help engineers improve their SIS design skills. Never turn down a chance to learn something new. However, it's difficult for anyone to learn to be a good engineer based on experience alone. You need it, but you simply don't have enough time to get all the experience you need. The same goes for mistakes, which are typically some of the best learning experiences. You don't have the time to make all the mistakes, so learn from your mistakes and from the mistakes of others at every opportunity.
One of the most common failures of a functional design is when the pre-design or front end engineering design (FEED) stage fails to specify the correct or full functional design specification, which leads to an incorrect or poor design.
Failure analysis. Failure analysis methodologies include analysis of potential failure modes using structured analysis tools such as FMEA; DFMEA; failure modes, effects and diagnostic analysis (FMEDA); and failure modes, effects and criticality analysis (FMECA). FMEDA, FMECA and DFMEA are variations of the original FMEA methodology, where the technique is expanded for additional analysis coverage. The original FMEA and FMEDA methodologies can stand in good stead for SISs. FMEDA is commonly used today in the SIS world for random hardware failures. The area where we need to expand FMEA use is for systematic errors, failures and faults.
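As a rough illustration of the bookkeeping behind an FMEDA for random hardware failures, the sketch below sorts failure modes into the four usual categories (safe detected, safe undetected, dangerous detected, dangerous undetected) and sums their rates. The device, failure modes and rates are invented for illustration, and the class and function names are hypothetical; a real FMEDA works from detailed component-level data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    mode: str
    rate_per_hour: float
    dangerous: bool  # True if the mode would prevent the safety action
    detected: bool   # True if diagnostics would reveal the failure


def tally_failure_rates(modes):
    """Sum failure rates into SD, SU, DD and DU categories."""
    totals = {"SD": 0.0, "SU": 0.0, "DD": 0.0, "DU": 0.0}
    for fm in modes:
        key = ("D" if fm.dangerous else "S") + ("D" if fm.detected else "U")
        totals[key] += fm.rate_per_hour
    return totals


def safe_failure_fraction(totals):
    """Fraction of all failures that are either safe or dangerous detected."""
    total = sum(totals.values())
    return (total - totals["DU"]) / total if total else 0.0


def diagnostic_coverage(totals):
    """Fraction of dangerous failures that diagnostics detect."""
    dangerous = totals["DD"] + totals["DU"]
    return totals["DD"] / dangerous if dangerous else 0.0


# Hypothetical pressure transmitter on a high-pressure trip (made-up rates).
EXAMPLE_MODES = [
    FailureMode("transmitter", "output fails high", 2e-7, dangerous=False, detected=True),
    FailureMode("transmitter", "output fails below range", 3e-7, dangerous=True, detected=True),
    FailureMode("transmitter", "output frozen in range", 1e-7, dangerous=True, detected=False),
    FailureMode("impulse line", "plugged", 2e-7, dangerous=True, detected=False),
]

totals = tally_failure_rates(EXAMPLE_MODES)
print(totals)
print(f"SFF = {safe_failure_fraction(totals):.3f}, DC = {diagnostic_coverage(totals):.3f}")
```

The same worksheet structure could in principle carry systematic failure modes as well (a wrong setpoint, a bypass left in place), but those rarely come with a meaningful failure rate, so they would need a qualitative column rather than a number.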