Thousands of deaths from control system cyber incidents, and most did not involve IP network issues

June 2, 2022
The term “cybernetics” is defined as the science of communications and automatic control systems in both machines and living things. Today, cybernetics has been shortened to “cyber,” a term that often does not address the physical nature of devices that control physics. There have been thousands of deaths from malicious and unintentional control system cyber incidents. These deadly control system cyber incidents continue to recur, and process sensors often play a role in them. However, these fatal incidents often are not Internet Protocol (IP) network-related. The continuing focus on IP networks ignores these real incidents. Yet as can be seen with Stuxnet and Triton, attackers aren’t stupid and can exploit these types of issues. Will it be possible to return to some of the early principles of cybernetics and recognize that problems can be induced in any feedback loop, whether it is networked or not? To put it harshly, how many more people are going to die before control system cyber security is treated as a systems engineering issue and the culture gap between engineering and network security is overcome? Apparently, there will be a Solarium Commission 2.0. The first Solarium Commission did not address the unique issues associated with control systems and process sensors. Will Solarium Commission 2.0 address these issues before more people die?

The term “cybernetics” is defined as the science of communications and automatic control systems in both machines and living things. Today, cybernetics has been shortened to “cyber,” a term that often does not address the physical nature of devices that control physics.

The U.S. Government Accountability Office (GAO) in GAO-21-477 defines a cyber incident as “an event that jeopardizes the cybersecurity of an information system or the information the system processes, stores, or transmits; or an event that violates security policies, procedures, or acceptable use policies, whether resulting from malicious activity or not. Cyber incidents, including cyberattacks, can damage information technology assets, create losses related to business disruption and theft, release sensitive information, and expose entities to liability from customers, suppliers, employees, and shareholders.”  The incidents being discussed affect integrity and availability.

The following cases demonstrate that deadly control system cyber incidents continue to recur and that sensors often play a role in those incidents. Moreover, because the causes of these fatal incidents often are not Internet Protocol (IP) network-related, the continuing focus on IP networks ignores these real incidents. That includes the continuing focus on ransomware. Yet as can be seen with Stuxnet and Triton, attackers aren’t stupid and can exploit these types of issues. Let’s review some of these cases and see how they fall under the GAO’s definition of “cyber incident.”

The Wenzhou train collision

This collision between two trains occurred in July 2011 in Wenzhou, China. Forty people were killed and at least 192 were injured. The first train, D3115, was stopped by the Automatic Train Protection (ATP) signal system. The driver of train D3115 worked to override the ATP and, after more than seven minutes of waiting, succeeded and got the train moving again. As train D3115 entered the next section of track, where the track circuits that indicate the presence of a train were working correctly, the control center saw that the track section was occupied. But the driver of the following train, D301, had already been given instructions to proceed onto the section of track where D3115 had been stopped. The control center issued those instructions while it had a false indication that the track was unoccupied. Despite a message from the control center that D301 should proceed with caution, less than half a minute later train D301, running at 99 km/h (62 mph), collided with train D3115. The crash was an embarrassment and a setback to China’s high-speed rail ambitions.

Compare that crash to the Washington DC Metro incident that occurred two years earlier where one train crashed into another on the same track between the Fort Totten and Takoma stations. The trains were running on automatic, with all movements controlled by a central computer. The crash killed nine and injured at least 70 others.
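
Both crashes illustrate the same systems engineering point: an automated dispatcher is only as trustworthy as the sensor data it acts on. The following is a minimal, hypothetical Python sketch, not the actual Wenzhou or Metro signaling logic, of how a movement-authority decision that trusts a failed track-circuit reading will clear a following train into an occupied block. The block, sensor, and dispatch names are illustrative assumptions.

```python
# Hypothetical sketch: a failed track circuit that reads "unoccupied" propagates
# straight into an automated movement-authority decision. Illustrative only.
from dataclasses import dataclass

@dataclass
class Block:
    block_id: str
    occupied: bool          # ground truth: is a train physically in the block?
    circuit_healthy: bool   # is the track circuit reporting correctly?

def track_circuit_reading(block: Block) -> bool:
    """What the control center 'sees'. A failed circuit here reads unoccupied."""
    return block.occupied if block.circuit_healthy else False

def movement_authority(next_block: Block) -> str:
    """Automatic dispatch logic that trusts the sensor reading as-is."""
    if track_circuit_reading(next_block):
        return "HOLD: block reported occupied"
    return "PROCEED: block reported clear"

# A stalled train sits in block B2, but the track circuit has failed.
b2 = Block("B2", occupied=True, circuit_healthy=False)
print(movement_authority(b2))   # -> "PROCEED: block reported clear" (unsafe)
```

Whether the false “unoccupied” indication comes from lightning damage, a design flaw, or deliberate manipulation of the sensor signal, the downstream logic behaves identically, and no IP network has to be touched.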

Bayer CropScience explosion

An explosion and fire occurred at the Bayer CropScience plant in Institute, West Virginia, in August 2008. The explosion was caused by a runaway reaction that created extremely high heat and pressure in the residue treater, which ruptured and flew 50 feet into the air. Two operators died.

There were significant lapses in the plant’s process safety management. This included inadequate training on new equipment and a need to override critical safety systems because the heater could not produce the required temperature for safe operations. 

Before the restart, Bayer had upgraded the computer control system for the methomyl-Larvin unit. The control screens of the Siemens system Bayer installed were completely different from those of the Honeywell system it replaced, and operators were not adequately trained on them. Operators also used a workaround to deal with the longstanding heater problem. According to a Chemical Safety Board (CSB) investigation, the system was started prematurely because of business pressures to resume production of the pesticides methomyl and Larvin. The startup took place before valve lineups, equipment checkouts, a pre-startup safety review, and computer and process sensor calibrations were complete.

CSB investigators also found that the company failed to perform a thorough Process Hazard Analysis, or PHA, as required by regulation. This resulted in numerous critical omissions, including an overly complex Standard Operating Procedure (SOP) that was not reviewed and approved, incomplete operator training on a new computer control system, and inadequate control of process safeguards. A principal cause of the accident, the report states, was the intentional overriding of an interlock system that was designed to prevent adding methomyl process residue into the residue treater vessel before filling the vessel with clean solvent and heating it to the minimum safe operating temperature. Furthermore, the investigation found that critical operating equipment and instruments were not installed before the restart and were discovered to be missing after the startup began. Bayer’s methomyl-Larvin unit MIC (methyl isocyanate) gas monitoring system was not in service when the startup took place, yet Bayer emergency personnel presumed it was functioning and claimed no toxic MIC was released during the incident.
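
The interlock the CSB describes is essentially a permissive: residue may only be fed to the treater once the vessel contains enough clean solvent and is at or above the minimum safe operating temperature. The sketch below is a hypothetical Python illustration of that kind of permissive and of what an intentional override does to it; the tag names and setpoints are assumptions, not Bayer’s actual control logic.

```python
# Hypothetical permissive-interlock sketch (illustrative names and setpoints only).
# The point: a bypass flag silently defeats the safety check.

MIN_SAFE_TEMP_F = 275.0         # assumed placeholder, not the actual setpoint
MIN_SOLVENT_LEVEL_PCT = 80.0    # assumed placeholder, not the actual setpoint

def residue_feed_permitted(treater_temp_f: float,
                           solvent_level_pct: float,
                           interlock_bypassed: bool = False) -> bool:
    """Permissive for feeding methomyl residue into the residue treater."""
    if interlock_bypassed:
        # Operator override: the feed is permitted regardless of process state.
        return True
    return (treater_temp_f >= MIN_SAFE_TEMP_F and
            solvent_level_pct >= MIN_SOLVENT_LEVEL_PCT)

# The logic correctly blocks the feed when the vessel is cold and not yet filled...
print(residue_feed_permitted(treater_temp_f=150.0, solvent_level_pct=20.0))  # False
# ...but the same unsafe conditions pass once the interlock is overridden.
print(residue_feed_permitted(150.0, 20.0, interlock_bypassed=True))          # True
```

Nothing in the override path distinguishes a well-intentioned operator workaround from a malicious bypass of the same permissive, and out-of-calibration sensors defeat it just as quietly.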

The West Virginia facility was a clone of Bhopal. Union Carbide India’s Bhopal plant was located near a large population center and sat at ground level with the surrounding population. The Bhopal disaster also released MIC. Over 500,000 people were exposed to the MIC gas. The highly toxic substance made its way into and around the small towns located near the plant, killed approximately 16,000 people, and caused more than 550,000 injuries. As in the Institute case, the control and safety systems didn’t perform as designed, and process sensors were either out of calibration or ignored even when the data were correct. In the aftermath of Bhopal, Union Carbide, which had been one of the largest chemical companies in the world, was eventually acquired by Dow Chemical Company, and the Institute facility ended up with Bayer CropScience.

There are other clones of the Institute and Bhopal MIC facilities. I used to drive by the Bayer CropScience facility just outside of Kansas City on the Missouri River while I was working on Kansas City Power & Light’s (now Evergy’s) Hawthorn power plant.

It should also be noted that both Stuxnet and Triton intentionally tried to cause these types of failures.

Angle-of-Attack sensors

These incidents come from commercial aviation. An angle-of-attack sensor indirectly measures the amount of lift generated by an aircraft’s wings. The angle-of-attack is the angle between the wing and oncoming air. The sensor’s main purpose is to warn pilots when the plane could aerodynamically stall from too little lift, leading to a loss of control.

The risks posed by a faulty angle-of-attack sensor are amplified by the increasing role of cockpit automation. It is an example of how the same technology that makes aircraft safer, automated software, can be undone by a seemingly small problem. Placing too much trust in the sensors can cause trouble. One of the most serious crashes tied to angle-of-attack sensors occurred in 2008, when XL Airways Germany Flight 888T crashed into the Mediterranean Sea, killing seven people. French authorities blamed water-soaked angle-of-attack sensors on the Airbus A320, saying they generated inaccurate readings and set off a chain of events that resulted in a stall.

In 2014, an Airbus fell from 31,000 feet before the pilots were able to regain control and raise the aircraft’s nose. A call to a ground crew determined that the plane’s angle-of-attack sensors must have been malfunctioning, causing the Airbus’s anti-stall software to force the plane’s nose down. The pilots turned off the problematic unit and continued the flight. Aviation authorities in Europe and the United States eventually ordered the replacement of angle-of-attack sensors on many Airbus models.

The angle-of-attack sensors in the more recent Boeing 737 MAX crashes in Indonesia and Ethiopia sent the wrong signals to new flight control software (which should have been treated as safety-critical but wasn’t) designed to automatically dip the plane’s nose to prevent a stall. Angle-of-attack sensors have been flagged as having problems more than 50 times on U.S. commercial airplanes over the past five years, although no U.S. accidents have occurred over millions of miles flown. That makes it a relatively unusual problem, but also one with magnified importance because of its prominent role in flight software. Federal Aviation Administration (FAA) reports include 19 cases of sensor trouble on Boeing aircraft, such as an American Airlines flight last year that declared a midflight emergency when the plane’s stall-warning system alarmed despite airspeed being normal. The affected Boeing 737-800 landed safely. Maintenance crews replaced three parts, including the angle-of-attack sensor, according to the FAA database. There have been more than 500 deaths from plane crashes involving angle-of-attack sensors. (https://www.heraldnet.com/nation-world/not-just-the-737-angle-of-attack-sensors-have-had-problems/#:~:text=Angle-of-attack%20sensors%20have%20been%20flagged%20as%20having%20problems,the%20Federal%20Aviation%20Administration%E2%80%99s%20Service%20Difficulty%20Reporting%20database)
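
The common thread in these aviation cases is a flight-control function acting on a single, unvalidated sensor input. Below is a minimal, hypothetical Python sketch contrasting a function that trusts one angle-of-attack vane with one that cross-checks redundant vanes and inhibits itself on disagreement; the thresholds, readings, and function names are assumptions, not Boeing or Airbus logic.

```python
# Hypothetical sketch: single-sensor trust vs. cross-checked redundancy for an
# anti-stall function. Thresholds and readings are illustrative assumptions.
import statistics

STALL_AOA_DEG = 15.0   # assumed stall-warning threshold
DISAGREE_DEG = 5.5     # assumed sensor-disagreement limit

def nose_down_single(aoa_deg):
    """Single-sensor design: one bad vane can trigger a nose-down command."""
    return aoa_deg > STALL_AOA_DEG

def nose_down_voted(aoa_readings):
    """Cross-checked design: disagreeing vanes inhibit the function instead."""
    if max(aoa_readings) - min(aoa_readings) > DISAGREE_DEG:
        return None  # declare the data invalid and alert the crew
    return statistics.median(aoa_readings) > STALL_AOA_DEG

faulty, healthy = 22.0, 4.0  # one vane reports a stall that is not happening
print(nose_down_single(faulty))            # True  -> unwarranted nose-down command
print(nose_down_voted([faulty, healthy]))  # None  -> function inhibited
```

The same validate-before-actuate principle applies to the process sensors that feed control and safety systems in pipelines, chemical plants, and rail signaling.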

Fatal cyber incidents often don’t involve malware

My control system cyber incident database (which is not public) documents thousands of deaths directly caused by control system cyber incidents (see the examples above). The incidents include both malicious and unintentional cases, and most of the unintentional cases could also have been caused maliciously. These fatalities have occurred globally in natural gas and gasoline pipelines, refining, chemicals, medical devices, mining, ships, rail (mass transit and interstate), planes, manufacturing, hydro, cranes, and off-shore oil platforms. Many of these incidents were directly attributable to process sensor issues.

Many of the incidents were collected from published process safety cases where cyber wasn’t addressed. Conversely, some in the cyber community haven’t paid close attention to such incidents because they didn’t involve malware. That kind of oversight may be changing. It’s widely understood, for example, that cyberattacks are at least as likely to depend upon social engineering or stolen credentials as they are to use malware. Still, the culture gap between engineering and cyber security continues unabated.

What is not included

The cases in this blog didn’t include control system cyber incidents that “only” resulted in injuries, regardless of how severe. Water/wastewater cases were not addressed, because I am not aware of any direct deaths from water/wastewater cyber incidents, even though many people have been hospitalized. Electric cases were not addressed, as there have been no direct deaths from cyber-related substation explosions, etc. I don’t consider heart attacks, loss of required medical device availability due to loss of power, etc. to be direct control system cyber impacts. Control system cyber incidents in food production that only caused hospitalizations were also not counted. In these cases, causality is too difficult to determine.

Distribution reclosers that utilize Bluetooth communications have contributed to several Northern and Southern California wildfires, resulting in hundreds of deaths and billions of dollars in damages. However, it is unclear whether the Bluetooth communications played a role in causing the fires, so I have not counted those cases (https://www.controlglobal.com/blogs/unfettered/electric-distribution-reclosers-can-be-cyber-compromised-to-cause-devastating-wildfires).

Control system cyber security, not just Operational Technology (OT) network cyber security, is a systems engineering issue. But this is not being adequately addressed by the network security community in government or private industry. Several recent events stimulated me to write this blog:

  • There was an InfraGard webinar on May 31, 2022, on water/wastewater cyber security. Andrew Ginter from Waterfall mentioned there were 9 OT cyber incidents in 2020 and 22 in 2021, and that he expects about 50 in 2022. According to Andrew, almost all were ransomware. When asked, Andrew was not aware of other cyber incidents that could have affected physical infrastructure. However, ransomware is not going to directly kill anyone or damage physical equipment.
  • The May 2022 DNV report on energy cyber security found that, when asked to specify the cyber-attack consequences they see as a top concern for their organization, respondents point first to disrupted services and operations (57%), reputational damage (42%), data breach (41%), and a corresponding hit to profits (39%). These aren’t operational or safety issues. In comparison, just 32%, 27%, 24%, and 16% of respondents, respectively, describe loss of automation systems, equipment damage, loss of life, and environmental damage as top concerns. How is it possible that more than 75% of the respondents did not consider loss of life and environmental damage from a cyber-attack to be top concerns for their organizations? These results are inexplicable for anyone responsible for reliable and safe facility operations. https://www.controlglobal.com/blogs/unfettered/the-survey-results-of-the-2022-dnv-energy-cyber-security-report-are-grossly-misleading/
  • The CISA focus continues to be on network security, including ransomware, as can be seen from its continuing guidance, including “Shields Up”.
  • The May 18, 2022 DOE/MITRE Common Attack Pattern Enumeration and Classification (CAPEC) virtual meeting did not address the lack of cyber security in process sensors.

Will it be possible to return to some of the early principles of cybernetics and recognize that problems can be induced in any feedback loop, whether it is networked or not? To put it harshly, how many more people are going to die before control system cyber security is treated as a systems engineering issue and the culture gap between engineering and network security is overcome? Apparently, there will be a Solarium Commission 2.0. The first Solarium Commission did not address the unique issues associated with control systems and process sensors. Will Solarium Commission 2.0 address these issues before more people die?

Joe Weiss