There has been a spate of articles regarding unintended bias in machine learning algorithms and how these may lead to undesirable outcomes, such as reinforcing errors rather than unmasking insights. These biases could impact who gets hired by changing the way characteristics are scored or valued in an algorithm, for example, or they could impact the prioritization of a self-driving car’s actions, with possible life-threatening consequences.
Given the increasing role data science plays in our lives, algorithm bias is a legitimate concern. Algorithms are, after all, a function of the data they're fed for training and execution, and of the scoring of parameters by data scientists or other experts, each of which can introduce bias.
Further, the importance of machine learning-enabled analytics is only going to increase due to overwhelming and growing data volumes, access to inexpensive and elastic computing resources, and the tremendous value of improved insights. The benefits for improving business and production outcomes by leveraging machine learning innovation, versus unwieldy traditional approaches, are simply too compelling to ignore.
For example, in a recent webinar, Seeq demonstrated a statistical threshold model for asset failure, which provided a warning of a bearing failure. But a Seeq machine learning algorithm applied to the same data worked much better, detecting the imminent failure days in advance (Figure 1).
At the same time, while recent press attention has been on algorithm bias, a similar concern should be applied to the analysts who educate end users on the machine learning market. The rapid increase in interest in machine learning or artificial intelligence (AI) has meant confusion and questions, and this translates to market opportunity for consultants and analysts. Three analyst biases in particular are worth watching for.
Magic algorithm bias
The first of these biases is easily detected: look for the word “automatic.” There's a belief that if an end user takes data and iterates it through enough algorithms, they'll find a particular algorithm to be a “best fit” to the data, and insights will be achieved.
There are two issues with this approach. First, the only guaranteed outcome of lots of data and algorithms is many false positives because there's no end-user expertise applied to the algorithm or the data. Second, whether a vendor has 100 or 1,000 algorithms in their portfolio, there's no assurance they have the most appropriate algorithm.
For example, a Seeq customer recently developed a neural network algorithm for greenhouse gas emission mitigation, which is hardly an off-the-shelf use case, to solve an issue specific to their industry.
Therefore, the issue with this bias is it assumes a vendor has a sufficient or appropriate number of algorithms from which a best-fit algorithm may be determined and applied to provide the desired insight. But in a world exploding with algorithms, innovation and open source, using a list of algorithms from a single vendor is a suboptimal strategy.
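The false-positive problem raised earlier in this section can be illustrated with a minimal sketch. It assumes a hypothetical setup that is not from the webinar or any vendor tool: a target signal of pure noise is screened against 1,000 equally random candidate signals, and any candidate whose correlation exceeds roughly the 5% significance cutoff for 50 samples (about 0.279) is counted as a "match." The function name and all parameters are illustrative.

```python
import numpy as np


def count_spurious_matches(n_samples=50, n_candidates=1000,
                           r_threshold=0.279, seed=0):
    """Count random candidate signals that look correlated with a noise target.

    All inputs are hypothetical. Every signal is independent noise, so any
    candidate that passes the threshold is a false positive by construction.
    """
    rng = np.random.default_rng(seed)
    target = rng.standard_normal(n_samples)
    candidates = rng.standard_normal((n_candidates, n_samples))

    # Standardize the target and each candidate row (population std, ddof=0).
    t = (target - target.mean()) / target.std()
    c = (candidates - candidates.mean(axis=1, keepdims=True)) \
        / candidates.std(axis=1, keepdims=True)

    # Pearson correlation of every candidate with the target.
    r = c @ t / n_samples
    return int(np.sum(np.abs(r) > r_threshold))


spurious = count_spurious_matches()
print(f"{spurious} of 1000 unrelated signals pass the correlation threshold")
```

With a 5% cutoff, roughly one in twenty unrelated signals would be expected to pass by chance alone, which is why screening more data through more algorithms, without expertise to rule candidates in or out, tends to multiply apparent findings rather than insights.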
Pilotless planes
The second bias is flagged by a similar word: “automated.” It's interesting that after 20 years, most users still struggle with something as simple as AutoCorrect in Microsoft Word, whereas some analysts imagine a world where process manufacturing operations, or even whole plants, are automated and run independently of human intervention, without even continuous remote monitoring by experts.
If the first bias speaks to the lack of sufficiency in algorithms, the second is marked by a lack of data and expertise. Algorithms can only use what they're fed in terms of data, and the assumption that all relevant data is either available or has been included in the analyzed data set is a false premise.
Data science success stories are littered with examples of key factors not considered, such as ambient temperature or humidity. Equally, there are the stories of unexpected context, such as a flood, a swarm of bees or other unexpected phenomena.
Whether key factors are unconsidered, unexpected or simply overlooked, a key to success is the ability of subject experts to “see” or evaluate the situation and explore any potential factor. This could be done by using the expertise of a person familiar with the application, or it could be done by adding data or context to inform the analysis. The importance of human expertise, and of the data and context it brings to the analytics process, should not be underestimated, as it is always critical.
It has to be AI...
The first two biases may be summarized as a lack of algorithm sufficiency and a lack of data sufficiency. The third bias of some analysts is trying to pick the winners and losers among algorithms based on the cool and hip versus the uncool and classic.
Using this approach, they define some algorithms as better and others as less important, without considering the use case. They typically prefer compute-intensive algorithms, such as random forest, neural network and unstructured learning algorithms.
You see this when an analyst constantly references AI as a requirement for success in data science deployments, even as a pioneer in the field is calling for more intelligent use of the term (bit.ly/2202voices). This approach is misplaced because it doesn’t start with simply asking if the desired insight requires a particular algorithm type.
Ordinary least squares regression, for example, was invented in the early 19th century, and became a Microsoft Excel feature in 1997. It remains the leading tool for finding relationships between two signals in process manufacturing applications.
The reality is that easy access by process engineers to regression algorithms, along with cleansed, contextualized data, would be a tremendous leap forward for companies struggling to improve their analytics sophistication. The idea that end users must leap to complex algorithms accessible only by data scientists, and which may or may not be appropriate for the task at hand, is a bias to be avoided.
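To show how little machinery the ordinary least squares regression mentioned above requires, here is a minimal sketch of the closed-form fit between two signals. The signal names, the synthetic data, and the `ols_fit` helper are all assumptions made for illustration; they do not come from any Seeq product or customer case.

```python
import numpy as np


def ols_fit(x, y):
    """Closed-form ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared). Hypothetical helper for illustration.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()

    # Slope is the covariance of x and y divided by the variance of x.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean

    # R^2: share of the variance in y explained by the fitted line.
    residuals = y - (intercept + slope * x)
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y_mean) ** 2)
    return slope, intercept, r_squared


# Synthetic example: a made-up flow signal and a pressure drop that depends
# on it linearly, plus measurement noise.
rng = np.random.default_rng(1)
flow = np.linspace(10.0, 50.0, 40)
pressure_drop = 0.8 * flow + 2.0 + rng.normal(0.0, 0.5, flow.size)

slope, intercept, r2 = ols_fit(flow, pressure_drop)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

The fit itself is a few lines of arithmetic; in practice the harder work is usually the data cleansing and context that determine which two signals, over which time periods, are worth comparing.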
Algorithms ahead!
Everyone brings bias to their efforts based on their experience and education. Given the importance of machine learning, driven by the demands of data volumes and the value opportunity, it's essential to consider bias in algorithms along with the biases of those educating the market on approaches to adopting machine learning. There are some early markers for analyst bias, and these should be identified with the same focus as issues within the algorithms themselves.
For end users, we argue for prioritizing the critical attributes leading to machine learning success—data access, data quality, algorithm fit to task and subject matter expert context—as the key criteria for algorithm prioritization and value.
Michael Risse is CMO and vice president at Seeq Corp. He can be contacted at [email protected].