A false sense of resilience?

Feb. 15, 2022

The resilience of an industrial process is measured by its ability to maintain performance expectations in the face of often unpredictable developments, ranging from weather extremes and variations in raw material quality to operator errors and equipment failures. Further, if developments result in an unavoidable shutdown, how quickly can the process be brought back online? Industrial controllers themselves play a central role in ensuring resilient response to variability, but when it comes to the software applications that provide necessary oversight and visibility into these processes, many plants put their trust in redundant half-measures that do little to improve overall resilience, according to industry veteran Pete Diffley. Nowadays, as leader of global partnerships for Trihedral Engineering, he’s campaigning for a bigger-picture view of resilience that accounts for the shortcomings of simple primary-plus-backup approaches to redundancy, as well as the true cost of recovery and downtime that process control system engineers sometimes neglect in their design decisions.

Pete Diffley

Senior Manager, Global Partnerships, Trihedral Engineering

Q: You’re careful to draw a distinction between redundancy and resilience. How are the two different?

A: At its simplest, redundancy is defined as the inclusion of extra components in a system design that are not strictly necessary for basic functionality; they’re there in case of failure. Resilience, on the other hand, is the demonstrable ability to recover quickly from difficulties. Redundancy can contribute to increased resilience, but in the realm of the application software used for human-machine interface (HMI) and supervisory control and data acquisition (SCADA), system designers often settle for a parallel, primary-plus-backup arrangement that actually does little to increase resilience if the primary goes down.

Q: How does this relationship between redundancy and resilience play out in the context of industrial control systems?

A: Really, it's down to how systems react when challenged with a component failure or a communication outage. Having an extra component to account for a single failure or maintenance requirement in the system is sometimes referred to as N+1 redundancy. This is the baseline for redundancy.

Many of the most critical industrial processes have long employed high-availability, safety-instrumented-system (SIS) architectures with complete duplication of components. Each system can run independently of the other and still meet full load capacity if its counterpart is shut down, whether through failure or planned maintenance. These always-on “2N” systems increase resilience significantly when deployed correctly. Originally, distributed control systems (DCS) were the only systems to deploy this approach.

Building further on this architecture, and now a highly achievable model to employ, is 2N+1. This is the gold standard for system resilience. It is designed to bring potentially hazardous processes to a safe state if conditions venture outside a designated operations window, while also ensuring that the failure of an SIS component won’t result in a catastrophic shutdown. It may also employ two-out-of-three (2oo3) voting among three redundant sensors, for example, and is often referred to as triple modular redundancy (TMR). TMR has been around since the 1960s (used in spacecraft and airliners) but originally was considered too expensive to deploy except in the most critical processes.
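To make the two-out-of-three idea concrete, here is a minimal Python sketch of 2oo3 voting. It is an illustration only, not code from any safety system: the function names, the sensor readings and the high limit are all hypothetical. A trip is demanded only when at least two of the three channels agree, so a single faulty sensor can neither cause a spurious shutdown nor block a genuine one.

```python
def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return True when at least two of the three channels vote to trip."""
    return (int(a) + int(b) + int(c)) >= 2


def trip_demand(readings, high_limit):
    """Compare three redundant readings against a limit and vote on the result.

    `readings` is a sequence of exactly three values; `high_limit` is a
    hypothetical trip threshold chosen for this example.
    """
    if len(readings) != 3:
        raise ValueError("2oo3 voting needs exactly three readings")
    votes = [r > high_limit for r in readings]
    return vote_2oo3(*votes)


# Two of three sensors exceed the limit: the vote demands a trip.
two_high = trip_demand([101.0, 99.0, 102.5], high_limit=100.0)

# Only one sensor exceeds the limit (possibly a failed sensor): no trip.
one_high = trip_demand([101.0, 99.0, 98.0], high_limit=100.0)
```

In this sketch a single outlier is simply outvoted; a real SIS would also diagnose and alarm on the disagreeing channel.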

When properly implemented and maintained, redundant components can increase the resilience of a system; when they are not, they can leave facilities more vulnerable than their operators think.

Q: You’ve alluded to redundancy strategies that may give a false sense of security to operations personnel. Can you give some examples?

A: For HMI/SCADA application software, there’s often no backup employed, and when there is one, it may not be prepared to automatically jump into the breach in the event of a primary failure. In contrast with the schemes described previously, the backup software often is installed but not running on a backup server; it takes time to boot up, if it comes to life at all. It may not even have been tested recently. Further, it may have been under-designed relative to the primary, allowing the process to only limp along until the primary is restored.

I’ve even seen instances where the “backup” is a second virtual machine running on the same physical server as the primary. This sort of blind vulnerability to common-cause failure reminds me of the survivalist adage that “two is one, one is none.” Adventurers in inhospitable climes know that systems tend to fail when challenged, so they may tote a ferrocerium rod, waterproof matches, a magnifying glass, plus a compass with a magnifying lens. That’s four different tools for starting a fire, and totally appropriate when your life may depend on it.
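The same-server example can be put in rough numbers with standard availability arithmetic. The Python sketch below assumes independent failures and uses made-up availability figures (0.99 for a physical host and 0.99 for the software on each virtual machine) purely for illustration. Components that all must work are multiplied in series; redundant components are combined in parallel.

```python
def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(availabilities):
    """Availability of redundant components where any one being up suffices.

    Assumes the components fail independently of one another.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


HOST = 0.99  # hypothetical availability of one physical server
VM = 0.99    # hypothetical availability of the software on one VM

# "Backup" VM on the same physical server: the host is a common cause,
# so it sits in series with the redundant pair.
shared_host = series_availability([HOST, parallel_availability([VM, VM])])

# Backup on its own physical server: each host-plus-VM chain is independent.
one_chain = series_availability([HOST, VM])
separate_hosts = parallel_availability([one_chain, one_chain])
```

With these assumed figures, the shared-host arrangement can never be more available than the single host it depends on, while the separate-host arrangement is limited only by both chains failing at once.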

Q: Can you distill these observations into a set of strategic recommendations for those of our readers who are under increasing pressure to design truly resilient systems?

A: First, I think it’s important to look at one’s entire system and all its interconnections, identifying any single points of failure. For example, datacenters and cloud applications are, for the most part, highly resilient. But what about your connection to them? If your cloud datacenter goes down and you’re not connected to its backup, you’re out of luck. Remember, too, that in terms of resilience, duplicate parallel systems can't equal the performance or responsiveness of triplicate (or more) hot standbys or that of geographically independent systems.
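One simple way to hunt for single points of failure is to model the system as a graph and ask which components, if removed, cut the operator off from the process. The Python sketch below is a brute-force illustration under stated assumptions: the topology, the node names (such as isp_link and plant_gateway) and the function names are all hypothetical.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """List every intermediate node whose loss disconnects start from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(
        n for n in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=n)
    )


# Hypothetical topology: two redundant cloud sites, but one internet link
# on the operator side and one gateway on the plant side.
topology = {
    "operator": ["isp_link"],
    "isp_link": ["cloud_a", "cloud_b"],
    "cloud_a": ["plant_gateway"],
    "cloud_b": ["plant_gateway"],
    "plant_gateway": ["plc"],
}

spofs = single_points_of_failure(topology, "operator", "plc")
```

In this made-up topology the duplicated cloud sites are not the weak spots; the lone link and the lone gateway are, which mirrors the point about the connection to the datacenter.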

And if your current systems are less resilient than you’d like them to be, remember that someone, at some point, made a decision about what level of downtime was acceptable. Take another look at the cost of downtime and at the cost of recovery after a hiccup. True 2N and 2N+1 redundant solutions have come way down in cost in recent years, so chances are the math will favor investing in a more resilient approach rather than leaving it to chance.
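As a rough sketch of that math, the Python snippet below compares expected annual downtime cost before and after a redundancy upgrade and estimates a payback period. Every figure in it (outage frequency, outage duration, hourly cost, recovery cost, upgrade cost) is a hypothetical placeholder to be replaced with a plant's own numbers.

```python
def expected_annual_downtime_cost(outages_per_year, hours_per_outage,
                                  cost_per_hour, recovery_cost_per_outage):
    """Expected yearly cost of lost production plus recovery effort."""
    per_outage = hours_per_outage * cost_per_hour + recovery_cost_per_outage
    return outages_per_year * per_outage


def payback_years(upgrade_cost, annual_cost_before, annual_cost_after):
    """Years for avoided downtime cost to repay the upgrade, or None."""
    savings = annual_cost_before - annual_cost_after
    if savings <= 0:
        return None  # the upgrade never pays for itself on these numbers
    return upgrade_cost / savings


# Hypothetical "before": cold backup, two 4-hour outages a year.
before = expected_annual_downtime_cost(
    outages_per_year=2, hours_per_outage=4,
    cost_per_hour=10_000, recovery_cost_per_outage=5_000)

# Hypothetical "after": hot standby, rarer and much shorter interruptions.
after = expected_annual_downtime_cost(
    outages_per_year=0.5, hours_per_outage=1,
    cost_per_hour=10_000, recovery_cost_per_outage=1_000)

payback = payback_years(upgrade_cost=50_000,
                        annual_cost_before=before,
                        annual_cost_after=after)
```

On these assumed inputs the upgrade pays for itself in well under a year; the useful part of the exercise is writing down the downtime and recovery costs explicitly rather than leaving them implicit in an old design decision.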

Trihedral’s VTScada software, for example, has long offered continuously synchronized, error-checked redundancy among an essentially unlimited number of hot backup instantiations implemented with only a few mouse clicks. It's immensely powerful, and if a backup instantiation needs to take over, validated data is automatically backfilled into the primary application when it comes back online. There are countless examples of critical systems running for decades with zero downtime.

VTScada effectively scales that 2N+1 resilience exemplified by TMR safety systems and redundant DCS controllers to the HMI/SCADA realm, and takes it to a whole new level—from a handful of controllers and PCs to global systems with millions of datapoints.