The case for zero-fault software systems

June 30, 2017
Best practices for implementing and auditing the integrity of information and decision systems across accounting, marketing, engineering, manufacturing, supply chain management and customer service systems.

Any company undergoing digitization and adopting the Industrial Internet of Things (IIoT) may experience up to a 10-fold increase in confusion, delays, failures and expense. The root cause will be reliance on interactions among a suite of computer programs that the company does not understand. The current lack of understanding is causing alarming financial and customer/supplier loyalty losses due to inadequate software quality and resiliency. These losses will increase more than tenfold unless remedial and preventive actions are taken now.

No company would allow strangers to walk in off the street and direct operations at odds with its policies and rules. Yet adopting open-source software and moving major chunks of information and decision processing to the cloud means relying on software that the company doesn't understand, or may not even be aware is involved.

This article is for those responsible for implementing and auditing the integrity of their company’s information and decision systems across accounting, marketing, engineering, manufacturing, supply chain management and customer service systems, as well as the computer-based capabilities delivered in their products. Integrity is not just an IT staff challenge. Those responsible span from the corporate level to system users to administrators of user access permissions, database updates and computer program patches, updates and new features.

Corporate-level involvement is required because government regulators are moving beyond punitive fines: effective March 2016, they are pursuing indictment, conviction and incarceration of executives. This means that the board of directors' audit committee must go beyond simply checking the accounting numbers in the annual 10-K reports to also confirm that the company's computers are implementing the policies and instructions that management thinks they are. Because most IT systems are not auditable today, no one knows what all those computers are really doing.

The complexity threat

The system scale and complexity crisis does not mean the software staff is incompetent. It simply reflects the principle that a system can become so large and complex that no one person can understand it. Many companies at the $1-billion revenue level rely on a montage of more than 10 platforms, four or more database management systems (DBMS), seven or more languages, at least 500 business threads, more than 100 million source lines of code (SLOC), and multiple maintenance cycles ranging from ASAP to monthly to annually during 20 years of system use. As Rudolf Starkermann has explained, no group of people can mutually understand large, complex systems because as group size increases, the members' capacity to understand one another decreases.
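Starkermann's point can be illustrated with simple arithmetic: the number of pairwise communication channels in a group grows quadratically with its size, so shared understanding gets harder and harder to maintain. The sketch below is a back-of-the-envelope illustration, not Starkermann's formal model:

```python
# Back-of-the-envelope only: counts pairwise communication channels
# in a team of n people; not Starkermann's formal feedback model.
def channels(n: int) -> int:
    return n * (n - 1) // 2

for team in (5, 10, 50, 100):
    print(f"{team:>3} people -> {channels(team):>5} channels")
```

A 10-person team has 45 channels to keep coherent; a 100-person organization has 4,950.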

Mutual understanding in the front end of projects has been helped by advances in modeling, design and simulation. However, while the current focus on agile development, DevOps and similar software engineering methods avoids futile “big bang” projects, it doesn't respond adequately to the complexity and cybersecurity threats. Agile development is essential for company adaptability and security, but doesn't ensure an agile, deployed system.

Ways are still lacking for confirming that deployed hardware components and software code faithfully implement front-end design intentions. In the current era, a company must master a) appropriate and confident use of unknown and unknowable computer programs from multiple sources, b) a faster pace and more precise orchestration of information and decision actions supply-chain-wide, c) an increased pace of configuration changes both planned and unplanned, and d) an increasing incidence of cyber-attacks, both external and internal. Whew!

Mastery will entail adding certain staff competencies and taking a new approach to software management, development and maintenance.

Three technology challenges must be addressed:

  • Ways of quickly finding all logic, arithmetic or semantic faults, called “bugs,” in any given computer program. Effective technology exists but is not widely practiced. Too many companies rely on software testing. More on that later.
  • Ways of finding all logic, arithmetic or semantic inconsistencies when any two or more bug-free programs are caused to interoperate. Inconsistencies occur because the respective programmers held different world views or because internal administrators and operators took correct actions but at the wrong times (how the Three Mile Island nuclear reactor melted down). Effective technology for this is now becoming available.
  • Making the code immune to cyber-attacks both from external sources such as real-time hackers and from sabotage embedded in open-source or cloud-based software. The current emphasis is on preventing intrusion of such threats. A complementary emphasis, making software immune to attacks by removing the exploitation opportunities, can pay off handsomely. Fault-free software is largely immune to cyber-attacks. Effective technology for this is likewise becoming available.

Overcoming complexity

Prevailing ISO, ANSI and industry-specific standards and practices for system design, development, test and deployment must be improved. Quality assurance efforts must admit that complex systems simply can't be done right the first time. Instead, the “run, break, fix” paradigm becomes the preferred method because successful large systems are composed of orchestrations of successful small systems.

System principles of dynamic and integrity limits are key. Integrity is threatened by software bugs and cyber-attacks, as well as by operator miscues, administrator patches, database changes and hardware failures. Dynamic and integrity limits are conditioned by progress and safety properties. Integrity limits can be found by proofreading the code: no test beds, test cases or regression testing are required. All code is translated to an intermediate language, transformed to Dijkstra guarded commands, then analyzed by induction for consistency and sufficiency of program predicates. This finds the weakest precondition satisfying the progress and safety properties for the postcondition associated with any error.
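For a flavor of the underlying calculus, Dijkstra's weakest-precondition rule for assignment says that wp(x := e, Q) is simply Q with e substituted for x. The toy helper below is an illustrative Python sketch of that one rule, not the assessment tooling the article refers to:

```python
# Toy sketch of Dijkstra's wp rule for assignment (illustrative only):
# wp(x := e, Q) is Q with e substituted for x.

def wp_assign(expr, post):
    """Weakest precondition of `x := expr(x)` w.r.t. predicate `post`."""
    return lambda x: post(expr(x))

# Statement: x := x + 1.  Postcondition: x > 0.
# The weakest precondition works out to x + 1 > 0, i.e. x > -1.
pre = wp_assign(lambda x: x + 1, lambda x: x > 0)
print(pre(0))   # True: starting from x = 0 the postcondition must hold
print(pre(-1))  # False: starting from x = -1 it need not
```

Any initial state satisfying the weakest precondition is guaranteed to reach the postcondition; analysis of this kind is what replaces test cases in the proofreading approach.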

Software integrity assessment accelerates system integration and complements system test. Faulty code wastes approximately 40% of system test time and cost because the testing activity must await bug fixes. Finding the bugs first by proofreading lets testing focus on discovering system dynamic limits. Further, while integrity assessment can find all bugs, testing cannot. As Prof. E.W. Dijkstra warned us years ago, “Testing shows the presence, not the absence, of bugs,” and data curated by Capers Jones shows that even the best testing finds only about 98% of bugs, so 2% lurk in deployed code.
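To see what a 2% escape rate implies at the scale cited earlier (100 million SLOC), here is a quick illustrative calculation; the pre-test defect density is an assumed figure for illustration, not one from the article:

```python
# Illustrative arithmetic; the defect density is an assumption.
sloc = 100_000_000        # code base size cited earlier in the article
defects_per_kloc = 1.0    # assumed latent defects per 1,000 lines, pre-test
test_removal = 0.98       # "even the best testing finds only about 98%"

latent = sloc / 1000 * defects_per_kloc
escaped = latent * (1 - test_removal)
print(f"{int(escaped)} defects still lurking after best-case testing")
```

Even under best-case testing, thousands of latent defects would remain in a code base of that size.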

There are two kinds of integrity assessments. One uses a catalog of known malformed algorithms, then scans subject code for their existence. The other, software integrity assessment, finds all the faults whether previously known or not, and does so across an arbitrarily large, heterogeneous suite of programs.

In practice, the first initiative is better software maintenance. The software staff can't be expected to “do it right the first time” because “right” is not known until the operational system starts to function. As in the military axiom, “No plan survives first contact,” no new software survives operational use without the need for maintenance. Today, companies spend up to 50% of their software budget on maintenance and must wait hours or days for fixes. New ways of assessing software code enable first responders to diagnose and fix operational errors in minutes rather than days or weeks, while cutting cost to less than 20% of that currently experienced.

The second initiative is to devise readiness confirmation. In large, complex, adaptive systems, anything and everything may be changing. Accordingly, just because a system was fit-for-purpose last time doesn't mean it will be fit-for-purpose this time. Just as an aircraft pilot does a pre-flight walkaround, a machine has a push-to-test button, or a microprocessor runs non-functional self-tests in the background, it's prudent to implement a readiness confirmation capability in every software system. This also delivers system auditability, which has been lacking for years.
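One minimal shape for such a capability is a table of named checks the system runs before declaring itself ready, with the results retained for audit. The check names and their contents below are hypothetical examples, not a standard API:

```python
# Minimal readiness-confirmation sketch; the checks are hypothetical examples.
def check_config_loaded() -> bool:
    return True  # e.g., confirm configuration parsed and validated

def check_schema_version() -> bool:
    return True  # e.g., compare database schema version to the expected one

READINESS_CHECKS = [
    ("config loaded", check_config_loaded),
    ("schema version", check_schema_version),
]

def confirm_readiness(checks=READINESS_CHECKS):
    """Run every check; return (ready, report), keeping the report for audit."""
    report = [(name, bool(fn())) for name, fn in checks]
    return all(ok for _, ok in report), report

ready, report = confirm_readiness()
print(ready, report)
```

The retained report is what makes the system auditable: each start-up leaves a record of what was confirmed and when.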

The third initiative is adopting a fit-for-purpose standard. We write new software, or adapt existing software, for a purpose. Unfortunately, the purpose is rarely formally stated, let alone confirmed. Beware of POSIWID: years ago, system guru Stafford Beer advised us that when a system gets sufficiently large and opaque, the purpose of the system is what it does (POSIWID), regardless of what its designers and users intend it to do.

A viable company must emphasize a clear statement of purpose for all software modules and ensembles, then confirm fitness for the intended purpose. Currently, the typical standard is “meets requirements,” even though no one previously certified that the requirements adequately described fitness for purpose. Obviously, a fit-for-purpose standard should apply company-wide, supply chain-wide, industry-wide and worldwide.

The value proposition

How much money is your company spending on a) computer software licenses, b) software development, c) software integration and d) software maintenance? How much money and customer/supplier loyalty are you losing or wasting due to downtime from system failures and cybersecurity incidents? Would it be useful to reduce all that by half while making your company run better?

The software integrity assessment value proposition, especially for early adopters, lies in the potential reduction of 90% of software maintenance costs, 50% of system integration costs and 20% of original development costs. Commercializing software integrity assessment technology entails training 5% of the software workforce as first responders. Each first responder can generate a 90% gross margin within the first year and on an ongoing basis.
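As a rough blended figure, those reductions can be applied to a hypothetical budget split; the split below is an assumption for illustration, while the percentage reductions come from the value proposition above:

```python
# The budget split is hypothetical; the reductions are the article's figures.
budget = {"maintenance": 0.50, "integration": 0.20, "development": 0.30}
reduction = {"maintenance": 0.90, "integration": 0.50, "development": 0.20}

saved = sum(budget[k] * reduction[k] for k in budget)
print(f"{saved:.0%} of the total software budget")  # prints: 61% of the total software budget
```

Under these assumptions, well over half the total software budget is in play.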

Software system-level quality is becoming the most important, albeit least recognized, business opportunity. The likelihood of latent bugs in deployed software increases with the scope, complexity and rate of change of a given software system. A decade ago, little was said about inadequate software system quality. Tomorrow, each new line of computer code introduced anywhere in your products and internal systems must be considered a cyber-attack until shown otherwise.

It's time to emphasize, nay demand, zero-fault code from suppliers and staff. Alert enterprises have time to respond. Others will fail too fast to survive.