About security...a word from a "recovering IT person."

Dec. 10, 2007
Last week I posted on the SCADA list a response to an IT person who took exception to my statement that IT people who try to do security in process control systems can even be dangerous. In part, I said:

You are dead wrong about one thing, though. Control system engineering is NOT a subdiscipline of computer science. It is a multidisciplinary approach to control of manufacturing processes that uses computer science as well as other disciplines like electrical engineering, mechanical engineering, chemical engineering, and systems and business process analysis.

This is one of the problems when automation professionals start to talk about what they do with computer science people. Many CompSci people believe that all problems are IT problems. To a hammer, all problems are nails. Learning to become a competent automation professional is all about not being a hammer.

After Dow Chemical's Eric Cosman posted agreeing with me, Wurldtech's Bryan Singer, who is chair of SP99, posted this (reproduced with Bryan's kind permission):

The more I think on this one, the more I agree with it as well. I am a computer scientist by trade... as I often joke in presentations, I am a "recovering IT person." My early days in security involved networking and software integrity testing, as well as the "guns, guards, and gates" version of security (hat tip to the US military on that one), writing code in over 25 languages, and working almost exclusively in X.25, ATM, Ethernet, and similar networking. When I started working in manufacturing disciplines (MES, ERP, LIMS, etc.) and later moved toward automation, I must admit I was one of the "problem children" who saw everything in process control with an IT hammer.
But a number of years ago I did have to start facing the fact that process control is an engineering discipline, with the unique fusion being that we make stuff PHYSICALLY move.

Taking 300ms to get an email or a network response would almost always be considered acceptable on an IT network, for example. Even several extra seconds often doesn't cause too much angst and consternation. So IT folks tend to design networks that fall within these constraints, even on the shop floor. And I personally have many times witnessed where these design constraints failed. From my time as manager of the network and security services business at Rockwell, and my continued involvement in industrial network design and security now, I see time and time again where these commonly accepted methodologies for designing networks lead to extended and critical failures. Examples:

1- A filler that started up while someone was working on the machine. I won't touch the issue of safety systems being improperly designed here, which they were. The main issue was that too much network traffic was causing sensor messages to not be received by the controller in sufficient time. The machine started... luckily, no physical injury.

2- A light curtain message not received on a product line for carts carrying heavy metal objects... the result was an injury. Also at fault here was improper logic design in the controller (again, no specifics, to protect the identity of the customer).

3- A failing network where 3 cameras consumed 45% of the available network bandwidth: the network was improperly designed for this high amount of UDP traffic (broadcast/multicast).

4- A network deployed and tested with a lot of fiber, implemented by a local integrator that "checked" the network many times. They considered everything to be within acceptable levels for transmission rates, but the applications and controllers kept failing.
Our diagnostics found that while these times were probably OK for an IT network, they were not OK for controls, as messages were not being received. Later analysis found 45% of the fiber to be "bad" even though it had been tested several times before; outdated switching technology had insufficient processing power; the VLAN design was incorrect... the list goes on. Amazing that once those conditions were removed... everything else just worked.

What I always struggle with in dealing with IT folks is getting them to understand that we are fusing the logical and physical worlds at the controller. Causing a square wave to no longer be a square wave because the controller is faulting in some way due to network failure is a bad thing. IT will often inspect the network and say "it's fine," but it is certainly not fine for the industrial controls. A recent public example of this has been the Browns Ferry VFD failure. Having seen MANY VFD failures due to similar conditions... people need to know that converging networks and designing with an IT mindset is a risky proposition.

While not trying to give a shameless plug, this is one of the things that really attracted me to Wurldtech, and even to other device testing vendors in this space. I used to be a bit skeptical, but one of the most tremendous benefits I've seen of these types of tools is that they remove all doubt. You can show, beyond a shadow of a doubt, that when certain network or protocol conditions are met while testing a control device... physical changes are possible.

Herein lies the challenge... while I personally feel you don't have to be an expert in every industry... I've done equally effective work in discrete and process environments... you certainly DO have to understand how these systems work.
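The latency point above can be made concrete with a back-of-the-envelope sketch. The 10 ms cycle time below is an assumed figure for illustration (real scan times vary widely by application); the point is only that a delay IT users would never notice can starve a control loop of many consecutive sensor updates.

```python
# Back-of-the-envelope sketch: compare an "IT-acceptable" network delay
# against an assumed control-loop cycle time. All numbers are illustrative.

IT_ACCEPTABLE_DELAY_MS = 300.0   # a delay most IT users would never notice
CONTROL_CYCLE_MS = 10.0          # assumed scan/cycle time for a fast control loop

def missed_cycles(delay_ms: float, cycle_ms: float) -> int:
    """Number of whole control cycles that elapse while one message is in transit."""
    return int(delay_ms // cycle_ms)

n = missed_cycles(IT_ACCEPTABLE_DELAY_MS, CONTROL_CYCLE_MS)
print(f"A {IT_ACCEPTABLE_DELAY_MS:.0f} ms delay starves the controller "
      f"for {n} consecutive cycles")  # 30 cycles with these assumed numbers
```

With these numbers, one "harmless" 300 ms delay means the controller runs 30 cycles on stale sensor data, which is exactly the failure mode in the filler and light-curtain examples.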
That means understanding SCADA versus DCS versus other controller types, how safety systems work, how I/O works, how drives work, you name it...

When IT measures risk, they typically look at availability issues, time to repair, and cost to repair, with the de facto standard being to take the affected systems offline, make changes, restore from backup if necessary, and then restore service. In our space, that luxury is not available.

I always evaluate every component in the following risk categories:

1- Safety: Can failure, modification, or errant behavior cause a safety issue? (e.g., centrifuges, compressors, burners, etc.)

2- Regulatory compliance: Can this result in a reportable regulatory issue, including safety, quality, loss of regulatory data, etc.?

3- Efficiency: Can this cause an outage or a reduction in efficiency?

4- Quality: Can this affect finished goods quality or quality of service, or result in some sort of recall, boil water order, blackout, or brownout (depending, obviously, upon the industry)?

5- Asset loss: Does this result in a partial or full loss of the asset, or of other assets around it (e.g., explosions, physical damage, maintenance repairs, etc.)?

Safety is almost always the most difficult... the most common answer I hear is "oh, it's not a problem because the safety system will catch it!" I have seen this not be the case alarmingly often.

When analyzing safety, I always ask three questions:

1- Where is the safety logic? If it is in the component, or connected TO the component in question, almost always all bets are off. Yes, I know this is a bad design practice in most cases, but it is probably more common than realized.

2- What constitutes the safety "system," including hardwires, etc.?

3- What conditions would allow the physical safeguards to be bypassed (such as piping, shorts, potential overrides in safety systems, AFIs (always false instructions), problems with photo eyes, etc.)?

Usually, when this level of analysis is done, doubt is removed from the doubters.
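The five categories above can be read as an ordered checklist, worst first, with safety dominating. A minimal sketch of how such an assessment might be recorded (the class, method names, and ordering are my own illustrative framing, not a real tool's API):

```python
# Sketch of the five-category risk checklist, ordered worst-first.
# Names are illustrative assumptions, not an actual assessment tool.

CATEGORIES = ("safety", "regulatory", "efficiency", "quality", "asset_loss")

class ComponentRisk:
    def __init__(self, name: str):
        self.name = name
        self.findings = {}  # category -> (at_risk, note)

    def assess(self, category: str, at_risk: bool, note: str = "") -> None:
        """Record a yes/no finding (plus an optional note) for one category."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.findings[category] = (at_risk, note)

    def worst_category(self):
        """First flagged category in worst-first order; safety always dominates."""
        for cat in CATEGORIES:
            if self.findings.get(cat, (False, ""))[0]:
                return cat
        return None

# Hypothetical example: a drive flagged for both efficiency and safety issues.
vfd = ComponentRisk("line-3 VFD")
vfd.assess("efficiency", True, "drops out under broadcast storms")
vfd.assess("safety", True, "safety logic lives in the affected controller")
print(vfd.worst_category())  # safety
```

Even this trivial ordering captures the argument in the text: however the efficiency or quality answers come out, a single flagged safety question drives the whole evaluation.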
This is also one place where device testing can be a terrific advantage... rather than just theorize, you can actually CREATE the conditions under which the failure could occur.

So... I apologize for the length here... but this is where I see how the applied science of engineering is so very different from IT and computer science. We are now fusing what at one point were very different systems, with the obvious result of some growing pains. There are so many factors that must be considered. Even the INL demonstration of overflowing a chemical tank sometimes confuses the issue. Most engineers would rightly point out that a hardwired level sensor would most likely stop that potential problem... pretty much standard for tanks with hazardous chemicals, I would venture.

We do not serve our needs by continuing to argue about this... or by arrogantly thinking that one side is right over the other. I have worked on both sides of the fence. I hope no one takes this as bragging (hey, I'm the most humble person I know), but rather as the viewpoint of one person who has worked BOTH sides and made that transition... we aren't launching the space shuttle. But it does require a balance of many disciplines to be effective. And YES, you do need to understand not only the process control aspects, but the safety aspects, blends of other protocols and types, good electrical engineering, good mechanical engineering, good networking and software skills, solid soft skills for dealing with people... and a whole bevy of other talents. There aren't that many people out there with all of those skills, which means we have to check the attitudes at the door and put effective TEAMS together to address the challenges. They are many, and the problems are often embedded deep. All problems are not nails...
On a lighter note, I do like to remind everyone of a funny joke you often hear in militaries around the world: there is nothing in life that can't be solved by a proper application of C4 explosives :)