What isn’t understood about control system cyber security can lead to catastrophic failures

I began writing this blog before I read that Dragos and GE were collaborating on three whitepapers targeting engineers and process architects. Dragos and GE hope the papers will get engineers more engaged with OT and IT security, so that they can make further contributions to that important conversation. That’s good. It’s an important conversation to have, especially since the OT and IT communities have made as much progress as they have in the direction of convergence. I’d like to use the occasion of their whitepapers as an opportunity to urge that we not forget other challenges that haven’t come as close to being addressed.

Some history

I believe the biggest overall problem the control system/operations community has with the cyber security community is that community’s focus on “protecting the network,” rather than “protecting the operational systems and processes.” But we’re in a position to make some progress here, too.

Consider some history. The electric grid is approximately 100 years old. In its first 80 years, the grid was monitored and controlled without any Internet Protocol-based networks. The grid can operate without the Internet. However, the Internet cannot operate without power. The purpose of control system cyber security is to protect the control systems and the processes they monitor and control. Networks are a support function. To take one example, the only US utility I am aware of whose control systems were targeted and compromised lost SCADA capability for 2 weeks. However, the utility did not lose power. It only lost its view into operations. Consequently, its customers were unaware anything had happened. The impact was on the utility only: it had to man every substation and reconstitute the SCADA systems that were completely overcome by malware.

The gap between protecting networks and protecting operational processes can be traced back to the move of cyber security from the operational organizations to the IT (and now the OT) organizations following the 9/11 disaster. Before then, the physical equipment such as turbines, pumps, motors, protective relays and their associated processes were the sole responsibility of the relevant technical organizations. When cyber security was transferred to IT (and now OT), the responsibility for cyber security of the control system field devices (e.g., process sensors, actuators, drives, etc.) and the engineering equipment “fell off the table.”

Taking ground truth for granted

Unfortunately, that is often the case to this day. I attended the 2018 IoT World Conference in Santa Clara, CA. The concept of cyber security of process sensors at the sensor level was foreign to everyone I met. All of those people assumed process sensors were secure, authenticated, and correct. The other problem was that IT approaches were often substituted for engineering approaches. This can be seen in approaches such as the Common Vulnerability Scoring System (CVSS), where scores are assigned to flaws affecting industrial control systems (ICS). The same can be seen in DHS ICS-CERT assigning criticality to vulnerabilities. However, neither approach addresses the impact of the vulnerability on the actual plant equipment and process: that is, the impact of a cyber vulnerability on a specific pump, valve, motor, relay, etc. Additionally, who does IT/OT notify when they find malware, and who makes the recommendations on what happens next? Facility Operations needs to be closely involved in any decision affecting facility operations.
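The gap can be made concrete with a hypothetical illustration (the vulnerability IDs, devices, and impact descriptions below are invented for the example): two vulnerabilities can carry identical CVSS base scores, so ranking by score alone tells an engineer nothing about which one threatens the physical process.

```python
# Hypothetical example: two vulnerabilities with the same CVSS base score
# but very different consequences for the plant. All data here is invented.
vulns = [
    {"id": "VULN-A", "cvss": 7.5,
     "affected_device": "plant historian server",
     "process_impact": "loss of data logging only"},
    {"id": "VULN-B", "cvss": 7.5,
     "affected_device": "boiler feedwater pump controller",
     "process_impact": "possible pump damage and unit trip"},
]

# Sorting or triaging on the CVSS score alone cannot distinguish them;
# only an engineering assessment of the affected device can.
by_score = sorted(vulns, key=lambda v: v["cvss"], reverse=True)
assert by_score[0]["cvss"] == by_score[1]["cvss"]
```

The point is not that CVSS is useless, but that the score is silent on exactly the question an engineer must answer.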

It’s true that Board-level awareness of cyber security has risen over the last few years. But that awareness, while a good thing, is generally awareness of IT network security. IT security is better known, and every organization has front-office IT networks. Cyber security experts now even sit on many Boards. However, they generally have an IT background, not an operational one.

OT and control systems aren’t necessarily the same thing

OT experts are becoming more numerous. The term OT (Operational Technology) has become a rather nebulous expression that’s been applied to all non-IT assets. Gartner’s IT Glossary defines OT as “hardware and software that detects or causes a change through the direct monitoring and/or control of physical devices, processes and events in the enterprise.” Wikipedia defines OT as the “use of computers to monitor or alter the physical state of a system, such as the control system for a power station or the control network for a rail system.” The term has become established to demonstrate the technological and functional differences between traditional IT systems and Industrial Control System (ICS) environments, the so-called “IT in the non-carpeted areas.” In reality, the OT focus has been on the control system networks as opposed to the control system equipment. Consequently, it’s not clear that engineers and process architects would consider themselves OT.

There continues to be a lack of awareness of control system cyber security among the engineers and technicians responsible for the control systems. From my experience, with some exceptions, the engineers and technicians who are responsible for turbines, motors, process sensors, relays, etc. do not consider themselves to be OT. Control systems are complex systems with complex systems interactions. Generally, detailed systems interaction studies are performed, including Failure Modes and Effects Analyses (FMEA), Hazard and Operability analyses (HAZOP), etc. What is often missing from these analyses are cyber considerations. Additionally, OT cyber security upgrades with modern communication capabilities generally do not include systems interaction studies.

A need for another convergence in security

The convergence between IT and OT is easy to understand: they’re both fundamentally concerned with networks.

There remains, however, a lack of adequate cooperation and coordination between OT, control systems, safety, and forensics organizations. This helps to explain why there continue to be so many unintentional cyber incidents from “bolting on” cyber security to legacy systems. Because of the lack of control system cyber forensics and training at the control device layer, the most dangerous control system cyber incidents, those that can cause physical damage, are not being identified.

This is a real problem, and not just a theoretical possibility. The recent Lion Air tragedy amply demonstrates what happens when sensors are incorrect for any reason. The sensors gave incorrect input to the controller, which kept trying to point the nose down. Each time the pilots tried to override the controls, the automatic controls overrode the pilots’ manual attempts until the plane crashed, killing all 189 people on board. It is irrelevant whether this was malicious or unintentional – the people are still dead.

Safety demands more than assured networks. No matter how well secured communications are, if the sensors and actuators that constitute the ground truth of any industrial process are compromised or defective, you won’t have a safe, reliable, or optimized process.

Consider one example. On December 12, 2018, Ken Crowther from GE contributed to the following thread on SCADASEC: ”I believe part of this mitigation language is to communicate the implications of CVSS, which not all readers of ICS-CERT advisories understand. We have to do this internally to translate the implications of the certain vulnerabilities (e.g., with AV:A in the score) in the context of threats we are seeing. Ultimately, all vulnerabilities need to be fixed, but not all are urgent to the point of degrading/pausing operations to fix them. If we over-hyped every vulnerability (or don't take precautions to avoid overhyping vulnerabilities) then operators could either spend themselves into oblivion to fix everything or start ignoring everything - and that is not a good strategy to manage risk. The other thing is to remember that ICS-CERT does not always write these advisories from scratch. There is typically a collaboration between ICS-CERT and the vendor. (We will frequently draft materials for ICS-CERT.) When our equipment is deployed, we provide deployment guidance. It would be pretty standard for us to recommend to ICS-CERT to begin recommendation by stating that operators should look up and follow deployment guidance. This particular advisory was for a fairly broad set of equipment, as such they can't point to any specific deployment guidance, but instead point to general deployment principles.” Without understanding the impact of the cyber vulnerability on the actual control system devices, what is an engineer supposed to do with the vulnerability information?

A plea for a distinction

Safety and security are often thought of as synonymous. They are not. You can be cyber secure without being safe. This is because safety is really dependent on the Level 0 and Level 1 devices and instrumentation networks, not the higher-level Internet Protocol/Ethernet networks. The real safety and reliability impacts come from manipulating physics, not data.

The Aurora vulnerability uses cyber means (electronic communications) to reclose breakers out-of-phase with the grid, which causes physical damage to any alternating current (AC) rotating equipment and transformers connected to the affected substation. There is no malware involved, so the attack would likely not be detected by network monitoring. Unlike opening substation breakers, which can lead to outages of hours to days, as with the 2015 and 2016 Ukrainian cyber attacks, reclosing breakers out-of-phase with the grid can lead to outages lasting months or potentially even a year.
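To see why the protection logic matters here, consider a simplified sketch of the sync-check function (ANSI device 25) that is supposed to block an out-of-phase reclose. The threshold and function names below are illustrative assumptions, not any vendor's actual relay settings; Aurora works precisely when a reclose command bypasses or defeats this kind of check.

```python
# Illustrative sync-check logic. The 20-degree threshold is an assumption
# for the example; real relays use utility-specific settings for angle,
# voltage difference, and slip frequency.
MAX_ANGLE_DIFF_DEG = 20.0

def reclose_permitted(bus_angle_deg: float, line_angle_deg: float) -> bool:
    """Permit reclosing only if the phase angle across the open breaker is small.

    Reclosing while the local generation has slipped out of phase with the
    grid subjects AC rotating equipment to mechanical torques it was never
    designed to withstand.
    """
    diff = abs(bus_angle_deg - line_angle_deg) % 360.0
    diff = min(diff, 360.0 - diff)  # shortest angular distance
    return diff <= MAX_ANGLE_DIFF_DEG

# In phase: reclose permitted.
assert reclose_permitted(10.0, 15.0)
# 120 degrees out of phase: a sync-check would block this reclose.
assert not reclose_permitted(0.0, 120.0)
```

A command path that can operate the breaker without passing through this permissive is exactly the gap Aurora exploits, and nothing on the IP network looks anomalous while it happens.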

The Stuxnet payload was actually modified control system logic, not malware. Malware can cause impacts such as the Ukrainian cyber attacks: outages or downtime lasting hours to a day. Manipulating physics causes physical damage leading to months of downtime. Moreover, there is minimal cyber forensics or training to detect these issues, which can easily be mistaken for unintentional events or “glitches.” Unless cyber threats can explicitly compromise safety and reliability, I do not believe operations will really care about security. That is not to say that IT won’t be interested, as there is substantial high-value data that can be compromised. That’s very important, too, and nothing I’ve said here should be taken as downplaying that importance. But it’s not the whole story.

Industrial deployments are designed with process sensors (e.g., temperature, pressure, level, flow, voltage, current, etc.) to monitor and feed safety interlocks, assuring that system safety is maintained. However, legacy process sensors have no cyber security or authentication, and adequate real-time process sensor forensics is generally unavailable. An example is the temperature sensors that monitor turbine/generator safety systems to prevent the generators from operating in unstable or unsafe conditions. These process sensors are integral to the operation of the system and cannot be bypassed. If the temperature sensor is inoperable for any reason, it can prevent the turbine from restarting, whether in automatic or manual. The lack of generator availability can prevent grid restart from occurring.
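The dependency described above can be sketched as a start permissive. The limit value and function name are hypothetical, invented for illustration; real turbine permissives come from the OEM and plant-specific safety analyses. The key point the sketch shows: because the sensor has no authentication, the interlock cannot tell a genuine over-temperature from a failed or spoofed reading, and either one blocks the restart.

```python
from typing import Optional

# Hypothetical interlock limit for the example only.
MAX_BEARING_TEMP_C = 120.0

def restart_permissive(bearing_temp_c: Optional[float]) -> bool:
    """A turbine start permissive that must trust the raw sensor value.

    With no cyber security or authentication at the sensor level, the
    interlock cannot distinguish a true over-temperature condition from a
    failed or spoofed reading; both keep the machine offline.
    """
    if bearing_temp_c is None:  # sensor inoperable: fail safe, block the start
        return False
    return bearing_temp_c <= MAX_BEARING_TEMP_C

assert restart_permissive(85.0)       # healthy reading: start permitted
assert not restart_permissive(None)   # dead sensor blocks the restart
assert not restart_permissive(150.0)  # over-temp, real or spoofed, blocks it
```

Fail-safe behavior is correct engineering, but it also means a compromised sensor can hold a generator offline, which is exactly how a sensor-level cyber problem becomes a grid-restart problem.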

There have been numerous catastrophic (multi-million- to multi-billion dollar) control system cyber incidents, many of which have resulted in injuries and deaths. In general, they have not come from network problems, but from compromises or problems with control system devices.

Before it’s too late, we’d do well to start addressing the existential problems in the physical world, in addition to the important data problems in cyberspace.

Joe Weiss