Safety, Security and Complex Systems in Critical Infrastructure Protection

Sept. 23, 2009
The Global Critical Infrastructure Is Made Up of Complex Systems Which Are, In Turn, Made Up of Complex Systems Made Up of Simpler Systems

By Walt Boyes

Process plants, chemical plants, refineries, offshore platforms, water and wastewater utilities and power generation and distribution systems that form a large part of the global critical infrastructure are made up of complex systems which are, in turn, made up of complex systems made up of simpler systems down to the level of the sensor, controller, and final control element. Traditionally, safety systems have been designed in a vacuum and poorly integrated with the plant process and the basic process control system. So, too, have cyber and physical security been designed and implemented in a vacuum. The failure to account properly for the interaction of these complex systems in safety system, control system and alarm management system design was specifically named by the Baker Report as one of the factors in the BP Texas City disaster in 2005, and can clearly be seen in other incidents. Human factors engineering has often been disregarded, and operator training has never focused on dealing with the complicated interactions of these complex systems. The author presents a case for integrating safety systems, security systems, human factors, alarm management, and operator training in a unitized system to deal with the problem of complex systems.

We have been working on making our process plants safer for nearly 50 years now, and we've made some progress. But as we have seen, with the incidents at BP in Texas City, and others since, we are far from making our plants as safe as they can be, and as safe as they should be.

And in 2008, we had another incident, this time at a Bayer facility in West Virginia.

Many dedicated professionals have spent years working on standards and implementations to make our plants safer. But there are a large number of issues (vectors, really) that determine whether we have a safer plant or not.

There is a highly complex interaction between a large number of those vectors. Safety, security, alarm management, operations, training, and of course, your company goals all interact, and, like any complex system, simply changing one vector makes more changes than can often be visualized or calculated in advance.

No one expected the operators to have difficulty seeing both the inlet and the outlet flows to the isomerization process and the raffinate splitter tower at BP Texas City. No one expected ALL the level measurement devices on the tower to fail at the same time. No one expected the safety system to fail. No one expected that the operators would consistently make wrong decision after wrong decision as they tried to recover from the impending disaster. No one expected the diesel pickup truck to be running in the same area as the cloud of hydrocarbon vapor.

Yet all of these things happened. And people died. There have been many more accidents in the three years since the BP disaster, and there will be many more. And many more people will die.

We need to start thinking about safety, security, alarm management, operations and training as an integrated whole, and we need to have our companies agree that the safe way is the most profitable way. We have not done this yet, and until we do, people will continue to die.

Our first attempts to build safety systems took the form of dedicated systems that were stand-alone and completely separate from the basic process control system. This was done to ensure that these multiply redundant systems shared no points of failure with the control system itself.

This was both good and bad. The good news was that the safety system could only be used for one thing. It was strictly to shut down the plant if something abnormal occurred. The bad news was that it encouraged safety practitioners to develop a curious tunnel vision, so that the interactions of the safety system with the rest of the plant were often not investigated.

So, a few years ago, we began integrating safety systems and control systems…with many engineers still unwilling to do that to this day. What we learned immediately was that there were those interactions, and that a Safety Instrumented System cannot be built in a vacuum. It must be part of an overall proactive operations strategy that includes safety, security, plant operations and maintenance. Probably the best example of what I am talking about is what Dow Chemical's Levi Leathers called for almost fifty years ago: an operating discipline in which safe operation is the most important engineering rule of the company.

Safety systems are part of the control systems in the plant, and safety systems must be considered in any cyber security strategy we implement. Even a traditional standalone SIS could be penetrated and damaged if connected in any way to a control system or to the plant information network. At the 2008 ACS Cyber Security Conference, Bryan Singer, co-Chair of the ISA99 cyber security standard committee, and Dr. Nate Kube, a principal of Wurldtech Security Technologies, demonstrated a hack of an integrated safety system (one that has received TÜV approval and is being sold today). In less than 25 seconds, they were able to force the system to fail unsafely. Other hacks have been demonstrated to work against traditional stand-alone systems, too.

This illustrates the absolute fact that safety and security interrelate. And so do perimeter security, fire and gas safety controls, and personnel locating technologies. In this age of integrated systems, nothing stands truly alone.

In the 1960s and 1970s, operators were able to get an instant grasp of the operating condition of the plant or the part of the plant they were responsible for by looking at the panel wall. We gave up that viewpoint by migrating to small screens where only a part of the process could be seen. Now we are moving back to screen walls and working on visibility issues. But for years, we've had real problems giving operators the ability to see the eagle's eye view of their processes, and we wonder why their tunnel vision leads to accidents that certainly should have been prevented.

We also have a history of treating operators as less important to the operation of the plant than the engineers and managers, when, in fact, the operator of a process plant is much more like a pilot of an aircraft. They are responsible for the safe operation of the plant, and the optimization of the plant, and they should not be burdened, as so many of them are, with busy-work because some top manager noticed that they weren't busy when he entered the control room. Really, good operators earn their salary during those thirty seconds of terror when the plant is in upset, just like pilots do when something goes wrong with the plane.

If we just were able to look at operators in this way, and make improving their HMIs and training critical, we could reduce accidents and save lives.

We now know that operator response to abnormal situations is highly dependent on how the information is presented to them. The BP accident shows that fact clearly. If the operators had been able to easily see both the flow in and the flow out of the raffinate splitter tower, it is highly likely that they would have intervened long before they did.
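To make the point concrete, here is a minimal sketch of the kind of in-versus-out comparison a display could put in front of an operator. It is an illustration only, not a description of any system at Texas City: the FlowSample type, the function names, the units and every number are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FlowSample:
    """One sampling interval (hypothetical structure, for illustration only)."""
    minutes: float   # length of the interval
    flow_in: float   # feed rate into the vessel, volume units per minute
    flow_out: float  # rate leaving the vessel, same units


def net_accumulation(samples):
    """Return the running total of (inflow - outflow) after each sample."""
    total = 0.0
    running = []
    for s in samples:
        total += (s.flow_in - s.flow_out) * s.minutes
        running.append(total)
    return running


def first_imbalance(samples, limit):
    """Index of the first sample where accumulation exceeds limit, else None."""
    for i, total in enumerate(net_accumulation(samples)):
        if total > limit:
            return i
    return None


# Invented numbers: feed keeps running while less and less leaves the vessel.
samples = [
    FlowSample(10, 20.0, 20.0),
    FlowSample(10, 20.0, 5.0),
    FlowSample(10, 20.0, 0.0),
]
print(net_accumulation(samples))
print(first_imbalance(samples, limit=100.0))
```

The arithmetic is trivial, which is the point: the imbalance is only obvious when both flows are shown together.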

We understand this but we have not been able to build this into the routine engineering dogma of control systems and safety systems.

We have seen the output of the ASM consortium. We have seen the EEMUA guidelines. And we continue to overload operators with too many inputs, too many distractions, and too many jobs to do.

Properly, the only parts of a process that an operator should be seeing are the ones that aren't working properly, or that the operator is engaged in optimizing. Yet many HMIs are designed with lots of motion, pretty colors and three-dimensional effects, because it is cool and has lots of marketing sizzle.

Many HMIs are designed and installed with minimal input from the operators, because the operators are very often considered non-professional labor, to be told where to go and what to do by engineers and managers.

I encourage you to consider that operators are highly skilled technical professionals, whether they are trained engineers or not. This is the intent of the ISA's Certified Automation Professional designation: automation isn't an engineering discipline. Automation is a multidisciplinary profession that includes engineers, operators, scientists, and technicians.

The operator needs to be in charge of the process he or she is operating… and we need to provide the tools to really be in charge.

Operators cannot handle hundreds of alarms every minute and be in charge of the process.
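One way to see whether operators are being flooded is to count alarms over a sliding time window. The sketch below is a minimal illustration under stated assumptions: the AlarmRateMonitor class, the ten-minute window and the limit of ten alarms are hypothetical placeholders, not figures taken from the EEMUA guidelines or any standard, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class AlarmRateMonitor:
    """Counts alarms inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=600.0, max_alarms_per_window=10):
        # Both defaults are placeholders; a site would set its own targets.
        self.window_seconds = window_seconds
        self.max_alarms_per_window = max_alarms_per_window
        self._times = deque()

    def record(self, timestamp):
        """Record an alarm at timestamp (seconds); return the count in the window."""
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop alarms that have aged out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times)

    def overloaded(self):
        """True when the window holds more alarms than the configured limit."""
        return len(self._times) > self.max_alarms_per_window


# Invented example: ten alarms in ten seconds against a limit of five per minute.
monitor = AlarmRateMonitor(window_seconds=60.0, max_alarms_per_window=5)
for t in range(10):
    count = monitor.record(float(t))
print(count, monitor.overloaded())
```

A figure like this, trended over weeks, turns "too many alarms" from a complaint into a measurement.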

Many people have real problems with IEC61508, IEC61511 and ANSI/ISA84.01-2004. We're educated to be project-oriented. We propose projects, we get projects approved, funds are dedicated to those projects, and we do the project and then it is over. And so, with process safety projects, process optimization projects, and alarm management projects, we have first success, then gradual decay after the project people turn the "project" over to the operations staff.

But safety, alarm management, security and operator training are not amenable to project-oriented engineering thinking.

They are inherently processes, and they are best managed as continuous processes, and there are very few of us who instinctively think in those terms.

But if we are to meet the intent of the safety standards, and, more than that, if we are to begin to operate on the basis that all these topics form an inter-related, interdependent and interconnected system, we are going to have to think instinctively in process terms rather than project terms.

Since the money comes from the highest levels of management, it doesn't fundamentally matter whether we can change our own thinking about alarm management, optimization, safety and security projects and begin thinking of them as issues that require continuous process control and process optimization themselves.

If we cannot communicate the importance of this paradigm shift to higher management, nothing will change, and people will continue to die.

Physical security is another interrelated issue. How many fewer people would have died at BP if the operators had been able to detect the drivers of the diesel pickup truck as they moved into the danger zone, and stopped them? No one will ever know. But in emergency situations it is critical to have firm control of the perimeter and to know where all of your people and your assets are. Knowing where a fire truck or a staff member with first aid or CPR training is, and being able to vector them to the area of most need can make the difference between deaths and survival. And where better to have the information than in the control room, as part of the operators' gestalt?

As I said, cyber security issues must be considered in any safety implementation in any process plant, just as safety issues must be considered when administering IT security issues. I talked earlier about a hack of a safety system. That particular safety system hack was accomplished as part of a penetration test by Wurldtech as requested by the safety system manufacturer. Is your safety system secure, or just safe? Was it engineered in a vacuum, or was it designed in cooperation with, and with an understanding of, all the interacting factors in your plant? Do the operators live, breathe and eat safety and security? Does your management? Are these core values for your company, and if not, why aren't they?

We are about to lose the last generation that knows how to operate our plants manually. We are about to lose a terrific amount of institutional knowledge, and we are hard pressed to replace it. That lack of training has been shown to contribute materially to accidents like the BP Texas City accident. And if we do not provide operators the highest level of training, we will surely see people die.

Would you want to fly from here to Shanghai with a pilot with 10 years' experience, or with a pilot whose experience is 90 days in a simulator? How about running your plant?

And how good is the training you provide to your operating personnel? How good is the training you provide to your maintenance personnel?

Although we've been listening for nearly 15 years to manufacturing experts talk about the efficiencies and profitability of real time operations management (RTOM), we still have fewer than 20% of companies worldwide that practice it. And there is a direct correlation between companies who practice RTOM and companies who have been forced into it.

Somehow, we have to learn how to internalize these 21st century manufacturing principles before we are forced to the wall.

Dow Chemical's Levi Leathers described what he called "operating discipline" in which safety was paramount, and drove profits. Based on results, his operating discipline, which Dow still proudly practices, seems to have worked extremely well, and Dow expects it to continue to be the fundamental core of their operating principles.

There is a concept called the "economic calculus." In the 1960s, in North America and Western Europe, environmental pollution was rampant and unchecked. Laws and, perhaps as importantly, public opinion (the zeitgeist, if you will) expanded the economic calculus. Now it is part of the corporate balance sheet that environmental pollution is abated, plants are designed ab initio incorporating pollution control systems, and whole systems of recycling for profit have been established.

Today, the economic calculus is being widened again to include sustainability…being "green." We now see businesses in all dead seriousness assuring one another of their respective greenness, and we are already seeing companies realize substantial operating savings and even increases in profitability by controlling energy costs.

Tomorrow, we may see the proliferation of real time performance management and inherently safe operation. We may see companies in all seriousness assuring one another of their safety operations consciousness, because the economic calculus may widen once again.

The problem with safety, security, good training and alarm management and other operations issues is that fundamentally, they depend on fear for an argument. "If you don't do this, people may die, people will die!"

Financial managers are rarely interested in listening to such arguments, because it is impossible to show "lives not lost" on a cost accounting balance sheet. You cannot sell fear.

Management staffs are composed generally of very good people, but good intentions in the face of the pressure to show quarterly profits on the stock exchanges aren't good enough, and it is obvious that unless the economic calculus is widened, we will continue to have people die in industrial accidents in process plants with monotonous regularity.

We have to show how to widen that calculus because these issues are not as easily understood as the "green" issues have been.

For years, the TLA (three letter acronym) consultants have been preaching Real Time Control for business and for process. And for years, the captains of industry have been paying lip service to it. But the change is finally upon us. Even SAP has come to the public realization that you can't optimize your enterprise just with an automated balance sheet. With Microsoft and SAP supporting ISA95 and ISA88, it is a sure bet that the Great Divide between the business systems and the plant floor will be erased. But it will not be easy. Invensys' Dr. Peter Martin's pioneering studies of economic information transfer within enterprises make that clear. There's too much information buried in the plants that never makes its way to the decision makers, and when it does, it is presented in ways the decision makers don't want, or don't understand.

The financial management system is broken. Worse, we are under great stresses from technological advances, and many decision makers don't understand these, either.

There are two tsunamis: the wave of globalization we are riding out, and the wave of disruptive technologies we have been living with for the past thirty years.

In the early days, we were all concerned with sensors, measurement and final control elements: how to build the watch. Later it became imperative that we know how to tell time: how to close control loops.

Our skill sets expanded yet again when it became obvious that it wasn't enough to be able to build the watch and tell time, but now we had to know what the benefits of being on time were: expanding the benefits of control to the entire plant via distributed control systems.

Even though this is where a lot of us stopped, it just isn't enough. We know how sensors work, we know how loops work, and we know how to control a process. Unfortunately, the required skill set has expanded again. Now we have to understand scheduling. Our primary value is to see to it that information from the plant is transmitted to the enterprise. We're now working on fourth order concepts.

Lynn Craig, one of the founders of WBF and a leader in the ISA88 batch standard committee, says that we no longer can hide out in the instrument shop, depending on the fact that our knowledge is so arcane that nobody will bother us there. "We're being dragged kicking and screaming back into the real world of manufacturing," he says.

And we're finally going to have to address the issue of how to measure what we do as automation professionals.

As Dr. Peter Martin says in his book "Bottom-line Automation," "If you can't measure it in financial terms, it never happened."

We are going to have to re-frame our skill set so that we can define and describe everything we do in terms that the CFO can understand…because then the CFO and the CEO can explain to Wall Street why all of the productivity gains of the last 30 years have come from operating unit optimization strategies and automation.

And if we want to sell the concept of operating inherently safely, we will have to sell it in these terms as well.

Dr. Martin tells a story about an interview he conducted with a CFO. The CFO burst out complaining, "If one more engineer comes in my office trying to tell me all about his KPIs, and how much money he's made for my company, I'm going to fire his ass!"

This challenge is perhaps going to be harder than simply surviving the technical, economic and political changes we're being whipsawed with every day.

We know how to make our enterprises far more flexible, far more nimble than they have ever been. We know how to make our enterprises far more profitable than they've ever been. We know how to make our enterprises inherently safe and secure. We have the data, and the tools to use it.

But the managements we report to at the enterprise level have no idea that buried within their plants is the knowledge that will help them and their entire companies not only survive but grow and prosper, as far into the future as I can see.

For a hundred years, we've been applying control principles to manufacturing processes, whether continuous, batch, hybrid, or discrete. What has NOT been done is to apply those principles systematically across the entire enterprise.

Enterprises continue to be operated and reported on the basis of cost accounting principles set up when an enterprise had widely scattered cottage industries with difficult and slow communications between organizations. It is possible to make a process optimization change that saves a plant $10 million a year, and the CEO and CFO may not even notice in the financial rollups they actually see.

It is also possible to make a plant more efficient, and damage the enterprise at the same time, if the production from the plant is now far greater than the next plant downstream in the supply chain can absorb.
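A back-of-the-envelope sketch shows why a local gain can be an enterprise loss. It assumes a simple serial supply chain in which each plant feeds the next; the function names and all rates are invented for illustration.

```python
def chain_throughput(capacities):
    """A serial chain can deliver no more than its smallest stage can handle."""
    return min(capacities)


def stranded_output(upstream_rate, downstream_capacity):
    """Production per period that the next plant cannot absorb."""
    return max(0.0, upstream_rate - downstream_capacity)


# Invented numbers, units per day. The downstream plant can take 110.
before = [100.0, 110.0]  # upstream plant before the optimization project
after = [130.0, 110.0]   # upstream plant after the optimization project

print(chain_throughput(before), stranded_output(before[0], before[1]))
print(chain_throughput(after), stranded_output(after[0], after[1]))
```

In this made-up case the upstream plant's output rises by 30 units a day, but the chain delivers only 10 more, and 20 units a day pile up as inventory nobody asked for.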

It isn't about process control and engineering anymore. We MUST develop systems that communicate in real time with the financial managers, and who's going to do it, if not us? We have the data, and we understand how to use it.

Dr. Martin often shows a slide in presentations, and I've borrowed the concepts with permission. The CEO oversees two things: measurement and operations of the business. He has a CFO for measurement, and a COO for operating the business. But if we look at Dr. Martin's model, we can see that it is unbalanced; indeed, it is broken. On the operational side, we have good metrics. In fact, we have such good metrics that we've been turning out those KPIs the CFO hates for over a dozen years now, in real time.

But on the financial side, the metrics between the manufacturing resource base and the enterprise are missing.

This is our challenge. As you can see, this is a measurement problem. What do we know how to do, and have been doing for over a hundred years? We are really good at measuring things, collecting and analyzing the data, and using that data to provide operational control. We have to figure out how to provide financial management control, and, frankly, we are the only people who can do it, because we are the keepers of the processes, the measurements and the data.

The problem with expanding the economic calculus to include operating inherently safely is that there is no way in any cost accounting system of accounting for preventing costs that have not happened. The costs of prevention, the costs of changing to inherently safe operation, are accounted as unsupported costs, and they reduce the profit of the enterprise in the near term, and even sometimes in the medium term. Yet there are always millions of dollars or euros available to fix systems after disasters occur, because now there is a metric: the actual costs of lost revenue, fines and lawsuit settlements.

We must show that operating inherently safely saves money from the start, and improves profitability from Day One. We can only do that with real time performance metrics, which the CFO refuses to see because they aren't expressed in cost accounting terms.

Our challenge, then, is to determine means and methodologies to generate Real Time performance measures in terms that can be rolled up into standard financial reports. Our challenge is to produce production cycle reports that can take that real time performance data and feed a real time accounting system, which in turn can feed a financial reporting system. We've been complaining about the cost accounting system, and its bastard cousin, activity based costing or activity based measurements, forever, because they can't operate in real time, and they can't take real time data and operate on it, either. But the entire financial structure of the world runs on this kind of accounting and reporting, and we can complain forever, with less than no result.

Our challenge is to develop the metrics to feed that system data that is timely and accurate, so that better decisions can be made. It is axiomatic that if you can make better decisions, you can improve flexibility, productivity, and profitability of the enterprise, and show that operating inherently safely is more profitable, just as Levi Leathers believed. And the data and all the tools are ours.

And while we're doing that, we still have to make the plant run. An instrumentation engineer at a LyondellBasell plant once told me that he is expecting his standard technicians to know much more than they used to have to know, and pay scales and job descriptions are lagging behind what they need to know to just do their jobs.

Ian Nimmo, President of User Centered Design Services, says, "In a properly run plant, the operator should not have to intervene in the plant operation except in the case of an upset, and any time the operator has to do anything, it is an upset."

System Integrators, vendors and end users alike are asking their people to do more with less, and that means smarter people, with smarter tools, and a clear grasp of the way the process works.

This means that the market for knowledgeable automation workers is actually increasing worldwide. But this will be automation workers who understand the entire picture from the sensor to the enterprise, and can work in any part of it, interchangeably.

We will have to change our elevator speech, too.

When you try to distill down what we do so that we can explain it to the CEO, or to our wife's best friend at a party, it's a lot easier, and more intelligible, to say, "I work in manufacturing automation," or "process automation," or "I help automate the processes that make (insert whatever your plant does)." This is a lot easier than trying to explain that instrumentation doesn't mean you play in a band.

The profession we follow is changing even more. We are being pulled out from behind our manufacturing cells, production lines, our flowmeters and differential pressure transmitters, analyzers, PLCs and control valves, and our Safety Instrumented Systems and made to act as business process analysts as well as engineers and technicians. We can either fight to the death to retain our old labels, or we can willingly embrace the new responsibilities our companies have thrust upon us. One is safe, the other scary. But one will continue the cycle of layoffs and downsizing, while the other reinforces the importance we have to the conduct of successful business.

And one will continue to allow people to be killed, as the man described as a "model employee" at Bayer in West Virginia was killed last year, while the other will enable us to change the economic calculus to permit our plants to operate inherently safely.

How Safe is Safe? How Secure is Secure?

This is a three-sided tale. Safety. Security. Compliance. Engineering. Finance. Legal.

As I predicted in my keynote speech last year at the TÜV Rheinland Safety Symposium, we're beginning to see a convergence between the disciplines of functional safety and control system cybersecurity. It isn't hard to see why. Both disciplines focus on the behavior of complex systems. Both disciplines are based on risk management. Both disciplines require continuing engineering analysis and management.

Since both disciplines are about managing risk to acceptable levels, we can easily see that ultimate safety isn't a viable goal. Nor is ultimate security a viable goal. We need as much safety as we must have to eliminate or dramatically reduce the incidence of accidents in the plant. We need as much security as we must have to eliminate or dramatically reduce the incidence of cyber intrusion into the control and SCADA systems we operate. But we don't want to be hampered in operating the plant by either safety or security regulations and enforcement. So, we want just enough, but not too much, of either safety or security.

There's the engineering side of risk management, and then there's the financial side. The financial side says, we can have less safety and security than the engineers want by insuring against accidents and intrusions. That way, company profits stay protected, but company personnel and assets sometimes do not.

When, as is beginning to happen now, governments begin making regulations about either safety or cybersecurity, we find the legal side of risk management rearing its head.

While the engineers want enough safety and security to prevent accidents but not hamper production, and the beancounters want as little safety and security as they have to pay for, the lawyers want none of those things. Their job is to keep the company from being sued, and the way they do that is by instituting a risk management vehicle called compliance.

As far as the lawyers are concerned, the company has to do only as little toward functional safety or cybersecurity as it can while still being in compliance with the regulations.

In the power industry in the US, we have the NERC CIPs…and people insisting that their cyber security practices, which are manifestly unsafe to the engineers and already way too costly to the beancounters, are just fine because they are in compliance.

We are seeing this attitude spread to the water and wastewater utilities, and to some extent to the transportation sector and some of the chemical, pharmaceutical and food industries, because they are used to regulation, and compliance with regulations.

None of this, however, is making our infrastructure any safer or more cyber secure.

We must continue to focus on the idea that safety is about safely preserving people, processes and assets, not hedging with insurance policies to cover drastically unsafe practices. We must continue to focus on the idea that security is about the ability of our systems to withstand assaults from without, disaffected employees from within, and simple accidents.

I can just hear the CEO trying to explain to the Sarbanes-Oxley folks, "Well, we were in compliance. It isn't our fault that the terrorists' cyber attack killed our functional safety system and blew up our plant. We were in compliance!"

We are, as automation professionals, in a remarkably different place than we have been over the past 30 years. We are in demand.

We are scarce, and we now have the tools to prove that we are not only necessary, but irreplaceable. Imagine what would happen if all of us walked off our jobs for 60 days…but we don't have to do that.

What we MUST do is to stop thinking like instrument engineers, like control systems people, like safety systems engineers…and start thinking like real automation professionals.

We have a larger, deeper skill set that we need to learn than any other discipline. It isn't enough to be an engineer…in fact, many automation professionals aren't engineers.

We must be able to engineer, to plan, to manage projects, to understand many kinds of processes in many different industries…in a way, we're like Ginger Rogers. She could do everything Fred Astaire could do, and she did it backwards, and in high heels.

And while we are dancing backward, and in high heels, we will be able to change the economic calculus so that not only will we operate in a green fashion, but that operating "green" will mean operating inherently safely.