Cassandra coefficient and ICS cyber – is this why the system is broken

Chapter 9 of Richard Clarke and R.P. Eddy’s book, Warning – Finding Cassandras to Stop Catastrophes, defines the “Cassandra Coefficient”. In reading the chapter, many of the issues that have prevented industry from adequately addressing ICS cyber security became very clear to me. It will be interesting to see how people who read this blog see themselves and how that compares to Dale Peterson’s views.

The four components of the Cassandra Coefficient are (1) the warning, the threat or risk in question, (2) the decision makers or audience who must react, (3) the predictor or possible Cassandra, and (4) the critics who disparage or reject the warning.

I have chosen actual cases to validate the book’s hypotheses. The overwhelming response to almost all of the questions as to why ICS cyber security can be catastrophic and why there is a need for a Cassandra is the children’s book – The Emperor Wears No Clothes.

Warning

- Response Availability - Is this a problem that could have been mitigated with some response?

(Dale's Rating: High) The answer to this question is clearly yes, although you will find significant disagreement among credible experts on the best way to prevent or mitigate a catastrophe caused by an ICS cyber incident.

(My rating: Very High) My answer is also yes, assuming appropriate forensics, training and guidance are available. Approaches like the cyber kill chain can help. However, incidents that are not traditional malware-related may not have the appropriate guidance and training. As an example, the Bellingham, WA Olympic Pipeline rupture involved a broadcast storm that resulted in a loss-of-safety event. However, the guidance from NTSB doesn’t address the cyber and safety issues. Consequently, this guidance is insufficient and broadcast storms on control systems continue to occur, such as at Browns Ferry 3. The Aurora vulnerability is another example: the test occurred 10 years ago and hardware solutions were developed to mitigate the vulnerability. However, they are not being installed as recommended. As a result, there has already been at least one Aurora event that caused equipment damage that could have been prevented had the mitigation been installed.

- Initial Occurrence Syndrome - It has never happened before

(Dale's Rating: Medium) We have seen Stuxnet, Ukrainian Power Grids, German Steel Mills, Merck, and many others who have been impacted by ICS cyber incidents. The IoT botnet Mirai caused an economic impact. Definitions of a catastrophe vary, but compared to what is possible we have not seen a catastrophe caused by an ICS cyber incident. Many companies and organizations remain unconvinced that it can happen to them, but the larger organizations are no longer in denial that it could happen. They may still rely on the fact that it never happened before as an excuse to reduce the action and amount spent, as Richard Clarke discussed in his S4x17 Keynote.

(My rating: Very High) The lack of control system cyber forensics and the reluctance to call an event “cyber” have resulted in catastrophic events not being labeled as cyber. NERC is a major offender. As mentioned, the Bellingham pipe rupture occurred in 1999. The Maroochy Shire wastewater hack was in 2000. The 2003 Northeast Blackout had a number of cyber implications, and consequently many of the recommendations of the Final Report of the 2003 Northeast Outage were cyber. The Stuxnet vulnerability with Siemens PLCs was demonstrated at the 2008 Siemens Automation User Group in Chicago. Yet when Stuxnet was discovered in the July 2010 timeframe, people were totally surprised.

- Erroneous Consensus - Experts Mistakenly Agree Risk is Low

(Dale's Rating: Low) It's hard to say there is a consensus in the ICS security space about the likelihood and shape of a cyber incident caused catastrophe. There are scenarios, but those scenarios are highly debated. The US Electric Grid is a good example of widely varying views by experts on the likelihood of widespread and sustained outages. If I had to judge, there is more of a consensus that a catastrophe will happen, with a wide range of estimates on the where, what and when. There is not a consensus that the risk is low.

(My rating: High) When a vulnerability or event is first detected, it is generally viewed as a “one-off” specific to that application. As with Stuxnet, it takes a long time to understand the potential magnitude and risk of compromising PLCs. It is still not understood by many that the methodology for compromising the Siemens PLCs can be applied to almost any vendor’s PLC in any application in any industry. When the Rosemount oil loss failure was identified in the late 1980s (I found this as part of an EPRI project to eliminate response time testing of pressure transmitters in nuclear safety applications), the nuclear industry felt it was a one-off problem. The result was that a Rosemount transmitter with oil loss misled the nuclear plant operator and contributed to the Three Mile Island (TMI) accident.

- Magnitude Overload - The sheer size of the problem overwhelms

(Dale's Rating: Ultra High) There are multiple sectors in the ICS world that could cause a catastrophe in terms of loss of life, economic impact to companies, sectors and regions, and environmental damage. While there are commonalities, when you look at stopping potential high consequence incidents, the actions required are very sector specific and more likely even sub-sector specific. Then there are the large number of participants in each sub-sector. It's daunting. If tasked with preventing an ICS / IoT cyber incident caused catastrophe at a single Plant, the rating would be Low or at most Medium. But when you consider all ICS in all sectors it is magnitude overload. After 9/11 I expected the DHS to create a prioritized list of the most critical infrastructure and start working on addressing the risk of an ICS cyber incident caused catastrophe. I'm told there is a list, but that's where any serious and effective government effort stopped.

(My Rating: Ultra High) I agree with Dale. This problem also describes the industry response to the NERC CIP process - “We can’t eat the whole elephant at once”. The lack of security in process sensors makes this overwhelming problem even more overwhelming. In a recent article in Naked Security, Galina Antova of Claroty stated: “…it is not simple or quick to fix flaws in sensors. Engineers know it takes years to design and it can take 25 to 35 years to replace the architecture of ICS equipment.” Consequently, she is shying away from the problem.

- Outlandishness - Does it appear to be science fiction?

(Dale's Rating: Medium) It does not rate high because similar cyber incidents that have not resulted in catastrophes have happened, and a forthcoming engineer will sometimes tell you how they could cause a catastrophe. It does not rate low because a number of the scenarios presented at conferences are dreamed up by hackers / researchers lacking knowledge of the engineering, automation and safety systems. There are many outlandish scenarios.

(My Rating: High) It is my belief that, to cause major damage, an attack needs to be regarded as science fiction so that defenders never consider the threat. Before the Aurora test, ask cyber security personnel if bits and bytes could cause kinetic damage and put the grid at risk for months by simply opening and then reclosing breakers. Before Stuxnet, ask cyber security personnel if you could, via a USB, take remote control of a process, change the operator displays so the operator is unaware, and then change the control logic back so the operator is clueless. Before Volkswagen was caught with their cheat device, ask the same questions.

- Invisible Obvious - Not seen because it is too obvious

(Dale's Rating: Low) We seem to be past the surprise that control and configuration of a physical process allows you to control and alter the physical process. 

(My Rating: High) See the responses to the previous questions. Certainly, process sensors fall into this category.

Decision Makers or Audience

- Diffusion of Responsibility - Whose job is it?

(Dale's Rating: High) This could be another ultra high like Magnitude Overload. This requires large numbers of companies running ICS to all solve the problem. An argument could be made that regulation would reduce the primary responsibility to government, but even in this case there are multiple sectors that would require regulation in countries all around the world. At best regulation would reduce the rating to Medium.

(My Rating: Ultra High) Without senior management buy-in, there is no chance for ICS cyber security to succeed. Currently, cyber security is on the minds of senior leadership, but all too often it is IT security and resides at the CIO or CISO level. In many cases, the CIO and/or CISO does not have Operations under their purview. Additionally, there are still major culture gaps between IT and ICS, and between Safety and Security, not just in end-user organizations but also at the control system suppliers.

- Agenda Inertia - Focus on issues already in the plan

(Dale's Rating: Medium) ICS cybersecurity has slowly bubbled up to the C-level and board agendas. This can be seen in a variety of polls of what executives are most concerned about. Adding IoT increases this rating, and IoT on its own would be High. Yes, there is a lot of talk about it, but it is not driving IoT vendor development or purchase and deployment of IoT.

(My Rating: High) This goes back to “is it science fiction?” The reason there are so many ICS cyber incidents (malicious and unintentional) is that the vulnerabilities were not part of the plan. An example is the 2015 Ukrainian cyber attacks against the power grid, where NERC stated they would not change any requirements because the Ukrainian cyber attacks were outside existing NERC CIP scope. My project with the International Atomic Energy Agency (IAEA) was to develop scenario-based training to teach control system engineers how to think about plant upsets potentially being cyber-related.

- Complexity Mismatch - Technical understanding required by decision makers

(Dale's Rating: Medium) This should be low, but the ICS community has done a poor job. It is not difficult to tell a non-technical Board Member or COO that if a skilled attacker can take out or modify this control or safety component the Plant will be down for xx days and cost yy to get back to operations. Many ICS asset owners have these numbers from the safety/protection program, and they are easily understood.

(My Rating: High) Often a warning of catastrophe requires an explanation or interpretation by experts. As Dale mentions about safety, the safety community still does not understand the implications of cyber threats to safety systems (see my blog on the Bellingham, WA pipe rupture). The same can be said of Stuxnet before July 2010 and Aurora to this day.

- Ideological Response Rejection - Reject only available response

(Dale's Rating: Unknown) If you believe that government regulation is the only way to prevent or mitigate ICS cyber incident caused catastrophes, then this would be rated a Medium. Since I don't believe there is only one approach to addressing this problem, it is not possible to rate this sub-category. If pushed, I would put it at a Medium because any approach would likely conflict with the ideology of some portion of the decision makers.

(My Rating: High) This is an extended part of Agenda Inertia. If you don’t believe the problem is real, there are any number of reasons for rejecting a response. Aurora is a great example. For reasons only NERC can explain, NERC got industry to believe the 2007 INL test was “rigged” and therefore little has been done.

- Profiles in Cowardice - Self-explanatory

(Dale's Rating: Low) Decision makers are not held back, in my opinion, because they are afraid of looking foolish if an ICS cyber incident caused catastrophe does not occur. In fact if this catastrophe does not occur there would be pats on the back and self-congratulations.

(My Rating: High) Again, Aurora is a great example of profiles in cowardice. Aurora is a basic principle of physics taught in first-year electrical engineering courses. EVERY electrical engineer should know that reconnecting Alternating Current equipment out-of-phase with the grid will damage or destroy it. Yet, because of NERC, the utilities are afraid to come forward and do the right thing for themselves and their customers. They are leaving the country with an existential threat.

- Satisficing - Respond, but not in a sufficient way ... or declare victory and move on

(Dale's Rating: Ultra High) Government, industry organizations and asset owners are doing something in order to check off the ICS cybersecurity box. Some even believe that they have addressed preventing potential catastrophes. Unfortunately the approach is to deploy "good security practices" that do reduce the likelihood, by varying amounts, of the cyber incident caused catastrophe, but don't address the remaining possibility of a catastrophic consequence. Items like a strong security perimeter offer significant reductions in likelihood. Others, like patching insecure by design software and components, have minimal reduction in likelihood. The decision makers need to push their teams back to the process hazards analysis (PHA) and determine if a cyber attack could cause one of the high consequence incidents.

Then there is the danger that regulations like NERC CIP, which lift the base level or floor of ICS / IoT cyber security, lead decision makers to believe they have prevented a catastrophe. In general, we are seeing most of the activity in raising the ICS security floor to a higher minimal level and not in preventing catastrophes.

(My Rating: Ultra High) I agree with Dale. An example of this would be CMS Energy, which has added a cyber security expert to its Board of Directors: the Chief Information Security Officer of Comcast. Having a seasoned cyber security officer on the Board is valuable, but shouldn’t the person overseeing the cyber security of a gas and electric utility be someone who knows power plants, pipelines, substations, and SCADA systems?

- Inability to Discern the Unusual (in warnings)

(Dale's Rating: High) The ability to discern between real and non-realistic warnings is diminished often by the asset owner's entrenched engineering and operations talent. They designed, built and maintained the system, and are loath to admit in most, but not all, cases there is a serious risk of catastrophe that is not addressed. These internal experts have solved the hard problems to date, kept things running, and have a great deal of the decision makers’ trust. Then there are industry groups and sometimes government agencies that downplay most threats and warnings that arise for similar reasons. It would be encouraging if we heard occasionally, "we have analyzed warning xx, and it is not a serious issue, but this uncovered warning yy that is serious." Instead there is a large, entrenched faction playing defense and knocking down warnings.

(My Rating: High) I agree with Dale. The NERC CIP process is an example where warnings can be, and have been, blocked from appropriate decision makers. The lack of appropriate control system cyber security training also contributes, because it leads to the inability to recognize control system incidents as being cyber-related.

The Predictor or Possible Cassandra

(Dale’s response) This is one of the four main categories with seven sub-categories evaluating the sole person playing the Cassandra role for the issue. The authors document and evaluate Joe Weiss as the ICS community's Cassandra in Chapter 14. This issue of the qualities of the Cassandra may have been an issue last decade, but we are well past a single voice saying potential catastrophes loom.

(My response) I was never the only voice, but maybe the loudest and most consistent. I also had data and physics to validate my concerns. My focus has always been “what are we missing?” As Dale mentions, ICS cyber security at the network layer is being addressed more generally. However, the sensor and field device issue is not. Specifically, issues with security and safety are still not understood, much less addressed.

The Critics - Those pushing back on the warning

Scientific Reticence - push back due to lack of scientific certainty

(Dale's Rating: Medium) There is a strong predilection in the ICS community to find ways to knock down the possibility of any threat succeeding in causing serious outage or damage. If there is one error or less than 100% clear proof that the attack would succeed, it can be set aside. Rather than thinking about how the almost-successful attack could be altered to succeed and addressing the issue, victory is declared. That said, there are few remaining who are bold enough to say a catastrophe can't happen. So while there is pushback on specifics, there is acceptance in general terms.

(My Rating: High) I wish Dale were right. Aurora continues to be the poster child for industry to say it can’t happen, despite the fact that there has actually been an Aurora event that caused damage. My continuing concern is using cyber to initiate physical actions that can lead to catastrophic failures. Aurora is one example. The lack of authentication and security in process sensing makes all types of “physics” events possible. There are many ways to cause catastrophic events to occur, either directly or by changing operator displays and having the operator be his/her own “attacker”.

Personal or Professional Investment

(Dale's Rating: High) The technical team designed, built and maintained the system, and are loath to admit in most cases there is a serious risk of catastrophe that is not addressed. Senior management is ultimately responsible for risk management. Many believe it is career limiting to admit this blind spot. And there is the financial impact of solving this problem, although clearly it would be tiny compared to the cost and impact of a catastrophe.

(My Rating: High) I agree with Dale. 

Non-Expert Rejection

(Dale's Rating: Medium) The biggest issue here is industry groups, lobbyists, lawmakers and regulators. Most lack the knowledge to contribute to addressing the potential catastrophe, so they are easily swayed to satisficing solutions. This is not meant to denigrate the lawmakers and regulators. It's unrealistic to expect them to have the expertise on this and the variety of other issues they need to deal with. This brings up another challenge the book identifies, regulatory capture - the collusion of the regulator and the regulated. The people best able to write the regulations are the regulated, but there is a conflict of interest.

(My Rating: High) I agree with Dale.

 "Now is Not the Time" Fallacy

(Dale's Rating: Low with caveat) There is almost uniform agreement that now is the time to solve the ICS security problem. The caveat is the equally widespread acceptance of more palatable solutions that don't prevent or mitigate catastrophes potentially caused by ICS cyber incidents.

(My Rating: Very High) I wish Dale were right. What I have found is that each time another IT cyber event occurs, more attention goes to IT at the expense of ICS cyber security. The other common theme is “wait until something big happens or something happens to me, then we can take action”. Because there are minimal ICS cyber forensics and minimal appropriate training at the control system layer (not just the network), there are very few publicly documented ICS cyber cases. However, I have been able to document more than 950 actual cases resulting in more than 1,000 deaths and more than $50 billion in direct damages. I was recently at a major end-user where I was to give a seminar. The evening before, I had dinner with their OT cyber security expert, who mentioned he had been involved in an actual malicious ICS cyber security event that affected their facilities. For various reasons the event was not documented. Consequently, everyone from the end-user, other than the OT cyber expert involved, was unaware of a major ICS cyber event that occurred in their own company. So much for information sharing.

Overall Critics

Conclusions (Dale's Rating: Medium) The real value of the Cassandra Coefficient is not the qualitative scoring or final rating. The value is in thinking where the ICS community stands in each sub-category. How is this helping or hurting the implementation of measures to prevent and mitigate catastrophes? How can we change the sub-categories that are holding us back?

To me, the real value of the book is that it provides a basis for asking “who do you believe?” Given that there is no barrier to entry to be considered an “ICS cyber security expert”, the question of who you believe becomes very critical.

Joe Weiss