Control system cyber incident hunting – input for a playbook on control system cyber incident investigations

Aug. 19, 2019
It is important to train engineers and IT/OT and expand the scope from network threat hunting to include ICS incident hunting. Perhaps we as an industry could collaborate on this important, but missing, task. 

People who work in cyber security talk a good deal about “threat hunting”, that is, detecting and chasing down threats in IT and OT Internet Protocol (IP) networks. There appear to be two prevailing assumptions: that all control systems use, or are connected to, IP networks, and that whatever happens to the IP networks directly translates into actual physical impacts. Neither assumption necessarily holds. As an example of the first, Ukraine operated its grids WITHOUT IP networks for 6-9 months after the grid cyber attacks because the operators couldn’t trust their IP networks. However, even if there are no IP network connections, there are still control system device cyber vulnerabilities that need to be monitored and addressed, which is the flaw in the second assumption.

For control systems (I have deliberately not used the term “industrial - ICS” as control systems apply to all control and safety system applications), something even more basic has been overlooked: control system incident hunting. It is not easy to determine if a control system’s upset condition is a malfunction, unintentional cyber incident, or malicious cyber attack. Attacks, accidents, and failures can look very much alike. What they have in common is they produce physical effects. Identifying and understanding those physical impacts should be central to control system cyber security. What follows are some thoughts on the beginning of a framework for control system cyber incident investigation that can help untangle control system cyber incidents from other more probable causes of failure.

Cyber incident: steps toward a definition

It is important first to understand what is meant by a cyber incident. I have used the NIST definition: electronic communication between systems that impacts Confidentiality (C), Integrity (I), or Availability (A). Note that Safety (S) is missing and needs to be added. Also recognize that Availability may be different from Reliability and Productivity, which are also important for control system applications. The NIST definition does not mention the word “malicious”. For control systems, the distinction between malicious and unintentional is not as important as understanding whether systems are working as they should. People, equipment, and the environment can be at risk in a control system cyber incident.
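As a rough illustration of the definition above, here is a minimal Python sketch. The names Impact, Event, and is_cyber_incident are hypothetical and are not taken from NIST or from the ACS database; the sketch only encodes the two points made in the text: Safety is added to C, I, and A, and malicious intent plays no part in deciding whether something is a cyber incident.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Optional


class Impact(Flag):
    """Properties an event can affect: NIST's C, I, A plus Safety."""
    NONE = 0
    CONFIDENTIALITY = auto()
    INTEGRITY = auto()
    AVAILABILITY = auto()
    SAFETY = auto()  # not in the NIST definition; added as the text argues


@dataclass
class Event:
    """Hypothetical record of an upset condition under investigation."""
    description: str
    involves_electronic_communication: bool
    impact: Impact
    # None means "unknown". Deliberately ignored by is_cyber_incident().
    malicious: Optional[bool] = None


def is_cyber_incident(event: Event) -> bool:
    """True if electronic communication between systems impacted C, I, A, or S."""
    return event.involves_electronic_communication and event.impact != Impact.NONE


# Illustrative events only, not drawn from any real case history.
unintentional = Event("setpoint overwritten by a misrouted message",
                      True, Impact.INTEGRITY | Impact.SAFETY, malicious=False)
mechanical = Event("valve stuck because of corrosion",
                   False, Impact.AVAILABILITY)

print(is_cyber_incident(unintentional))  # True: no malicious intent is required
print(is_cyber_incident(mechanical))     # False: no electronic communication involved
```

The malicious field is carried along so that an investigator can record it once it is known, but it is kept out of the classification on purpose.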

A brief history of control system cyber incidents

History can tell us something about recognizing ICS cyber incidents. The Applied Control Solutions (ACS) database mirrors the history of ICS cyber security. In fact, the concept predates it. When I first got to the Electric Power Research Institute (EPRI) in the late 1980s, I managed the Nuclear Instrumentation and Diagnostics Program. While I was managing a sensor health monitoring program and attempting to reduce testing intervals, a common-cause, non-detectable failure in pressure, level, and flow measurements in nuclear safety applications was identified. To gauge the order of magnitude of the problem, I started collecting nuclear plant safety incidents that were sensor-related. This was possible because the Nuclear Regulatory Commission (NRC) requires documentation of abnormal events. However, in terms of details, the documentation was spotty at best, and I was forced to “read between the lines”, which was my first time being a “manual intrusion detection system”. I found more than 200 of these failures, including one that directly affected the Three Mile Island event and another that prevented a safety relief valve from lifting. That is, the pressure sensor could not reach its safety setpoint, so the safety system didn’t work. This amounted essentially to an unintentional early version of Triton. These cases are where my focus on sensor issues started. However, because of the lack of real-time raw sensor system monitoring capabilities, I didn’t pursue this avenue for cyber security until about 2017.

Fast forward to the start of the EPRI Cyber Security Program in 2000. I would get calls from colleagues stating they had unforeseen issues with their control systems and HMIs. When I left EPRI in 2002, I started the ICS Cyber Security Conference and began to informally document these incidents. At each ICS Cyber Security Conference through the 2017 time frame (I am no longer associated with that Conference), control system engineers provided actual control system cyber case histories. These case histories were not from IT experts as the incidents tended not to fall within the scope of the IT staffs’ responsibilities or expertise. This is worth mentioning because of the tendency of many organizations to assume that cyber security is a matter for the IT department to handle. Additionally, I would provide sample case histories when I would give control system cyber security presentations and seminars. Almost every time, at least one person would tell me they had experienced similar situations but didn’t know why.

What was also interesting was that a large percentage of the cases in the ACS database were not IP-network-related, but instead were control system issues. There were also few, if any, control system cyber forensics or control system cyber security policies to address such incidents. This is why it is so important to have IT/OT and engineering work together.

To date, the database consists of more than 1,170 incidents worldwide from power generation of all types, electric transmission and distribution, water and wastewater, oil and gas, chemicals, manufacturing, land and marine transportation, space, medical, food and beverage, and defense. There have been more than 1,500 deaths and more than US$70 billion in direct impacts. Many of the cases keep recurring; others are cases I never expected could occur. I have not tried to automate the process of detecting potential events. Consequently, the cases are as I find them, which is also why I have not tried to determine trends other than to say cases keep occurring. All of the incidents in the database are unclassified, although many are business confidential, which is why the database will not be made public.

There was an interesting footnote at a recent cyber security workshop when a control system vendor mentioned they had access to numerous incidents but couldn’t release or discuss them even to their own customers because of confidentiality considerations. How can that issue be resolved?

Recent incidents and a rush to judgment

On August 9, 2019, a significant portion of the power grid in England and Wales failed. Authorities were quick to bring power back online and to identify the cause: a near simultaneous tripping of a wind farm and a gas generation plant. They were even quicker to tell the public they were confident there had been no malicious activities or cyber attack. That’s surely possible, but I wondered in my blog and LinkedIn notification how they could have ruled out a cyber incident so quickly and with such apparent confidence (see https://www.controlglobal.com/blogs/unfettered/we-cant-detect-a-cyber-attack-that-trips-a-plant-but-we-immediately-identify-an-outage-as-not-being-a-cyber-attack/).

Idaho National Laboratory’s Andy Bochman responded to my LinkedIn notification: “The rushed public declaration of ‘not cyber related’ almost certainly confirms a lack of significant cyber forensics capabilities, IMHO. In the realm of ‘cyber due diligence’ or the lack thereof, the recent transformer explosions in Madison, Wisconsin, outages in London and NYC, and nat gas explosions north of Boston last year had regulators and other stakeholders asking whether cyber attacks were partly or fully to blame. In each there were incident investigations, but it’s also important for the right types of folks to be asking the right cyber questions to the right individuals to confirm or rule out cyber vectors ... and that without a doubt does not happen within 24 hours or even a week's time. Do you reckon a playbook of sorts could be constructed to aid investigators in situations like these? Not cyber incident response where it seems to be a given that the event is cyber related, but cyber due diligence, where owners and other interested parties are asking the question: was this or was this not caused in part or in full by a cyber attack or cyber-related accident.”

Develop a cyber incident playbook

Andy is right. There is a need for a control system cyber security incident playbook to help with control system cyber incident investigations. Sanitized cases from the ACS database can help form the input for the playbook and, in fact, have been used as such. In the 2007-2008 time frame, NIST’s Ron Ross tasked Marshall Abrams from MITRE (IT and NIST SP800-53 expertise) and me (control system expertise) to analyze three publicly identified control system cyber incidents to help non-Federal entities justify using NIST SP800-53 (recognizing NIST SP800-53 was an IT-based set of requirements). Consequently, we analyzed the 1999 Bellingham, WA Olympic Pipeline rupture, the 2000 Maroochy Shire wastewater attack, and the 2007 Browns Ferry 3 broadcast storm. The results are on the NIST website and in my book, Protecting Industrial Control Systems from Electronic Threats. We asked the following:

- What do we know now that wasn’t known at the time of the event or initial event analysis?

- What NIST SP800-53 controls were violated that enabled the event to occur?

- If the NIST SP800-53 controls were followed, could the event have been prevented or at least minimized?

- If not, what additional controls were necessary?
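The four questions above could be captured as a simple review record so that every case history in a playbook is examined the same way. The following is only a minimal sketch: the IncidentReview class, its field names, and the QUESTIONS mapping are hypothetical, and no answers from the actual NIST analyses are filled in.

```python
from dataclasses import dataclass
from typing import List, Optional

# The four review questions, keyed by hypothetical field names.
QUESTIONS = {
    "new_knowledge": "What do we know now that wasn't known at the time "
                     "of the event or initial event analysis?",
    "violated_controls": "What NIST SP800-53 controls were violated that "
                         "enabled the event to occur?",
    "preventable": "If the NIST SP800-53 controls were followed, could the "
                   "event have been prevented or at least minimized?",
    "additional_controls": "If not, what additional controls were necessary?",
}


@dataclass
class IncidentReview:
    """One case history; None means the question is not yet answered."""
    incident: str
    new_knowledge: Optional[List[str]] = None
    violated_controls: Optional[List[str]] = None
    preventable: Optional[bool] = None
    additional_controls: Optional[List[str]] = None

    def unanswered(self) -> List[str]:
        """Return the keys of the questions that still need an answer."""
        open_keys = []
        if self.new_knowledge is None:
            open_keys.append("new_knowledge")
        if self.violated_controls is None:
            open_keys.append("violated_controls")
        if self.preventable is None:
            open_keys.append("preventable")
        # The fourth question applies only when the third is answered "no".
        if self.preventable is False and self.additional_controls is None:
            open_keys.append("additional_controls")
        return open_keys


# Start a blank review for one of the cases named in the text.
review = IncidentReview(incident="1999 Bellingham, WA Olympic Pipeline rupture")
for key in review.unanswered():
    print(QUESTIONS[key])
```

Using None for "not yet answered" keeps an empty list available as a legitimate answer, for example when a review concludes that no controls were violated.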

In 2015, I did a similar study for the International Atomic Energy Agency (IAEA). The intent was to have real cases to teach nuclear plant system engineers what to look for that could be cyber-related when an upset condition occurs. Consequently, I took 3 of the more than 30 nuclear plant control system cyber incidents that had real physical impacts, occurred in non-nuclear facilities, were not identified as being cyber-related, and most importantly were not IP-network-related.

In response to Andy’s request for a cyber incident playbook: bringing together experts from multiple disciplines and relevant control system cyber incident case histories can be very valuable in getting IT, OT, system engineers, network threat hunters, and forensics experts on the same page and in “connecting the dots”. As an example of the need, from a control system cyber perspective, the 1999 Bellingham gasoline pipeline rupture was similar to the 2010 San Bruno natural gas pipeline rupture, but the similarity was missed because the focus was on piping integrity. Another example of why deep expertise is needed is the recently released Dragos report, “CRASHOVERRIDE: Reassessing the 2016 Ukraine Electric Power Event as a Protection-Focused Attack”. Dragos involved SEL and ORNL, which should have been sufficient. However, what was missing was a discussion, or even mention, of the Aurora vulnerability, that is, the reclosing of breakers out-of-phase with the grid, which can cause significant long-term equipment damage.

The commonality of control system cyber incidents across industries shouldn’t be surprising as multiple industries use similar control system equipment from common control system vendors using common control system protocols. Unfortunately, the incidents keep recurring and the “dots” are not being connected. It is important for engineers and IT/OT to work together (this is not just cross-training and IT/OT convergence) and expand the scope from network threat hunting to include ICS incident hunting. I encourage further discussion on these topics especially with ICSJWG and other ICS cyber conferences coming up. Perhaps we as an industry could collaborate on this important, but missing, task.

Joe Weiss