NERC's cyber security approach is preventing the electric grid from being secured

Background: In computing, a denial-of-service attack (DoS attack) is an attempt to make a machine or network resource unavailable to its intended users. One common method of attack involves saturating the target machine with communications requests so that it cannot respond to legitimate traffic, or responds so slowly as to be rendered essentially unavailable. Such attacks usually lead to a server overload. Moreover, NIST defines a cyber incident as electronic communications between systems (or systems and people) that affect confidentiality, integrity, or availability.

NERC, FERC, DHS, Congress, and the electric industry have said it is important to secure the electric grid. On September 27, the presidents of the American Public Power Association (APPA), Edison Electric Institute (EEI), Electric Power Supply Association (EPSA), Nuclear Energy Institute (NEI), and the National Rural Electric Cooperative Association (NRECA) sent a letter to Senator Rockefeller proclaiming the importance of cyber security and stating they were working to secure the electric industry. Control system cyber incidents are real and numerous. My database contains more than 75 electric industry control system cyber incidents (this does not count power plants) and the number is growing. However, the electric industry and NERC have generally been silent about control system cyber incidents, even within the industry.

There have been numerous discussions about the differences between compliance and security. The spirit of the NERC CIPs is to maintain the reliability of the electric grid in the face of cyber threats. However, the reality is the NERC CIPs fall far short of meeting that spirit. Specifically, the February 8 NERC Lessons Learned document provided four case histories that in IT would be considered denial-of-service events. Each of the four incidents has occurred elsewhere in the electric and other industries. In most cases, they were unintentional, but it was not immediately obvious they were unintentional. In addition, there have been cases where similar incidents were caused maliciously. Three of the incident descriptions did not mention the word "cyber". The fourth stated it was "not a cyber security incident". Below is a summary, in quotes, of the four cases in the February 8 NERC report:

- "Engineers identified the hard disk on the SCADA server was fully utilized which prevented the supervisory control from functioning properly. Operators had visibility of the system, but did not have control. The post event investigation identified that an automatic file purge process was not functioning correctly which caused the hard disk to exceed its maximum capacity. The problem was found to be a historian test server issuing unidentified packets to the other historian servers. The network, not able to interpret the packets, sent them back creating a loop and ultimately resulted in network traffic congestion. This had been a latent code bug which had not previously been found by the vendor or others using the software."
o A similar situation occurred several years ago at a major control center and was also not identified as cyber.
o The SCADA alarm failure during the 2003 Northeast Outage was from a latent code bug.

- "A large utility's Energy Management System (EMS) began to lose data necessary for visibility of portions of its transmission network causing functionality and/or solution interruptions. No loss of load occurred during this event and it was quickly determined to not be a cyber security event. Excessive data packets being sent on the data network resulted in heavy loading. The extreme loading created a performance degradation of the data flows between the SCADA system, EMS Supervisory Control and various supporting systems. At times during the event, the degraded data flows limited the visibility of the EMS SCADA data to several control centers and the generation operations group. To compound the problem, as the event unfolded over an eleven hour period, EMS personnel were not able to determine the root cause of the excessive data network traffic, could not accurately predict when the problem(s) would be solved and when data would be restored to operations."
o Similar to many other events that took up to 24 hours or more to identify
o Sophisticated attacks like Stuxnet took over a year to identify

- "A utility's control center experienced a SCADA failure which resulted in a loss of monitoring functionality for more than thirty minutes. During the event, the utility's Inter-Control Center Communications Protocol (ICCP) data links remained in-service. All data sent and observed were frozen at the values transmitted at the time of the failure and remained at these values for the duration of the event. The utility's EMS did not alarm or indicate any abnormalities with the data for an extended period of time."
o Similar to what happened with the Bellingham, WA gasoline pipeline rupture in 1999.

- "A control center experienced a loss of control and monitoring functionality of the EMS due to the loss of the operator's user interface application between its primary EMS computer/host server and the system operator consoles. The EMS servers run a software application that enables the system operators to view, monitor and control the transmission system via system operator consoles. Following a time of higher-than-normal system utilization of the EMS, in particular the heavy use of the study network software application, the user interface application failed while running on the primary EMS server. Contributing to the user interface application failure was a limit to the amount of memory (RAM and Virtual) available to run the ongoing and background software application processes. As a result, the failed state of the user interface application did not trigger a system failure that would have automatically switched functionality to the redundant EMS server due to a software application configuration setting."
o Similar to what happened with the Bellingham, WA gasoline pipeline rupture in 1999.

These incidents are not trivial:
- In some of the incidents, it took a significant amount of time to identify the problem.
- Some of the events did not trigger alarms.
- None were identified by intrusion detection systems.
- Redundancy considerations were not always effective.
- Operator training was not always effective and did not address the cyber issues.
- The only mention of any cyber consideration was coordination of firewall modifications.
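
Two of the points above (frozen values and the absence of alarms) suggest a simple data-integrity check. As a minimal sketch, and not something taken from the NERC report or from any vendor product, the Python below flags telemetry points whose values have not changed for longer than a threshold. The class name, the point IDs, and the 60-second threshold are all hypothetical, and a real system would need per-point thresholds, since some values legitimately hold steady for long periods.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PointState:
    last_value: float
    last_change: float  # time (seconds) at which the value last changed


class StaleDataWatchdog:
    """Hypothetical helper: flags points whose value has stopped changing."""

    def __init__(self, max_unchanged_seconds: float) -> None:
        self.max_unchanged_seconds = max_unchanged_seconds
        self._points: Dict[str, PointState] = {}

    def update(self, point_id: str, value: float, now: float) -> None:
        """Record a sample; the change time resets only when the value differs."""
        state = self._points.get(point_id)
        if state is None or state.last_value != value:
            self._points[point_id] = PointState(value, now)

    def stale_points(self, now: float) -> List[str]:
        """Return the IDs of points unchanged for longer than the threshold."""
        return sorted(
            point_id
            for point_id, state in self._points.items()
            if now - state.last_change > self.max_unchanged_seconds
        )


# Illustrative usage with made-up point names and timestamps (seconds).
watchdog = StaleDataWatchdog(max_unchanged_seconds=60)
watchdog.update("line_1_mw", 101.5, now=0)
watchdog.update("line_2_mw", 88.0, now=0)
watchdog.update("line_1_mw", 101.5, now=90)  # same value: looks frozen
watchdog.update("line_2_mw", 88.3, now=90)   # value moved: still live
print(watchdog.stale_points(now=100))
```

A check like this only detects the symptom (data that looks live but is not); it says nothing about whether the cause was a bug, a misconfiguration, or an attack.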

Each of these cases is a loss-of-view/loss-of-control cyber incident that was a direct threat to the reliability of the electric grid. Yet the NERC CIPs don't address these situations. What does that mean for the cyber security of the electric grid when most utilities' cyber security programs consist of following the NERC CIPs verbatim? The NERC CIPs have done an admirable job in making cyber security of the grid more mainstream. However, if the intent is to keep the lights on, the NERC CIP approach (regardless of version) needs to be changed and an approach such as NIST SP800-53 Appendix I (control systems) implemented.

Joe Weiss


Comments

  • NERC is not wrong to label these events as not being cyber-attacks. Unfortunately, they're not right either. The problem is that we do not have a standardized language for dealing with near misses.

    Think of driving in ice and snow and losing traction. The car slips sideways. Thankfully, control is regained before sliding into anyone or anything else, and there is no crash. The journey continues at a slower, more careful pace.

    Was this an accident? No. It easily could have been one, but it wasn't. It was a near miss incident. For comparison's sake, the Federal Aviation Administration collects Service Difficulty Reports from mechanics on abnormal or unexpected wear and tear, or abnormal function of aircraft components. They also collect reports on navigation problems or difficulties; and of course the National Transportation Safety Board investigates accidents to make recommendations for preventing similar ones in the future.

    These reports can then be assembled to determine if there is a significant problem that might need to be fixed. We don't do that in SCADA and control systems, but I think we should.

    NERC could be advocating better integrity diagnostics of the SCADA system and associated control systems. They could also attempt to standardize the language of how near-miss data is documented.

    I agree that it is not appropriate to simply label these as "not a cyber security event". While NERC is technically correct that it wasn't, this is no different from the non-accident that I cited. It is an indication that we need to slow down, put on tire chains, or park the car, whatever is most appropriate. This was the warning: it is time to do something differently. If one continues to forge ahead, sooner or later there will probably be a full-blown cyber security event, and nobody will have learned a thing from these examples.

    The warnings are out there. But unless we find a way to describe them and react to them, we might as well stick our heads in the sand and forget that they ever happened.

     

    Jake Brodsky

     
