When real-time data isn't real

Poring over data extracted from the site’s process historian, a cadre of process engineers, operations specialists and their big-data analyst examined the seconds leading up to a curious, sudden and unexplained reversal of flow that appeared to induce a surge or stonewall of their large feed compressor. They began finding instances where flows and temperatures were ramping, increasing or decreasing steadily in the seconds before the event, and each was scrutinized as a possible precursor to the anomaly. Which was the first measurement to start deviating from the previous steady state?

After a few hours of struggling to make sense of the data, someone asked about the sampling interval of the historian. A 10-second sampling interval isn’t extraordinarily slow for most continuous processes, as big columns, vessels and tanks don’t move very fast. Within minutes, the team came to grips with the realization that the previous morning and afternoon had been spent in vain: their historian queries were automatically interpolating between stored data points. When the queries were limited to “only actual snapshots,” there was really no ramp in any of the data. What they had seen was just the historian connecting the dots, something historians do more or less by default. It was back to the drawing board (smart board?) for the root-cause analysis team.
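To see how a query can manufacture a ramp, consider a minimal sketch in Python. The stored values, the 10-second spacing and the function names `interpolated` and `snapshots_only` are invented for illustration; they don't represent any particular historian's query interface.

```python
from bisect import bisect_right

# Hypothetical stored snapshots as (seconds, flow); values are made up.
# The flow reverses somewhere between t = 10 s and t = 20 s.
stored = [(0, 100.0), (10, 100.0), (20, -20.0), (30, -20.0)]

def interpolated(t, points):
    """Linearly interpolate between the two stored points that bracket t."""
    times = [p[0] for p in points]
    i = bisect_right(times, t)
    if i == 0:
        return points[0][1]
    if i == len(points):
        return points[-1][1]
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def snapshots_only(t_start, t_end, points):
    """Return only the values that were actually stored in the window."""
    return [(t, v) for t, v in points if t_start <= t <= t_end]

# A 2-second query grid shows a smooth "ramp" that was never measured.
ramp = [(t, interpolated(t, stored)) for t in range(10, 21, 2)]
# The same window restricted to stored points shows just two values.
actual = snapshots_only(10, 20, stored)

print(ramp)
print(actual)
```

The interpolated query returns six evenly descending values, while the snapshot query returns only the two points the historian really has. Nothing in the first result says which values were measured and which were drawn in.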

Historians are being mined like never before, and those endeavoring to gain some insight have had some genuine successes. But as we tinker with tools that promise to predict failures and detect optimums, we should consider deeply the source, quality and timing of the data. Historians are by nature built as a compromise between a granular depiction of every twitch of every measurement and the available storage. Compression was a technique born of this compromise, back when hard drives were small, unreliable and expensive (some old folks may recall platters the size of a medium pizza). My DCS to this day still limits historical data to a small (by today’s standards) restricted hard disk footprint. 600 MB archives fit neatly on a CD, but is that a reason to make them that size today? While many measurements (tank levels, for instance) truly do not warrant obsessive sampling and storage, today’s disk capacities and network speeds accommodate faster rates for measurements that might truly be more interesting, maybe not today, but someday.

We encounter many artifacts of the recent but bygone era of expensive and scarce storage. For no reason except habit, we may ploddingly accept 10-second sample rates and 1% compression deviation (the minimum change required to capture a new snapshot). Compression, again, is how process data historians conserved hard drive space. When configuring a point, one tells the historian, “Here is the smallest change that I consider significant.” If the measurement doesn’t change by more than that amount, it’s recorded as flatlining, and no new points are stored until the change exceeds the threshold. When a snapshot exceeds the deviation setting, the compression algorithm checks whether the previous point is on a straight line with the new point (a steady upward trend, for example, as when filling a tank). If the prior point falls on that line within the tolerance, it’s tossed and only the new point is stored. Hence, a tank steadily filling for eight hours might be stored as only two points: the current snapshot and the point from eight hours ago when the filling began. It’s clever and effective, but when points are configured arbitrarily or too conservatively, valuable insights could be thrown in the recycle bin, their contained information lost forever.
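The straight-line test described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not any vendor's algorithm; commercial historians typically use more refined variants, such as swinging-door compression. The function name `compress`, the parameter `dev` and all sample values are assumptions made for the example.

```python
def compress(samples, dev):
    """Simplified straight-line compression (illustrative only).

    samples: list of (t, value) in time order.
    dev: compression deviation in engineering units.
    A point is archived only when the samples since the last archived
    point no longer fit a straight line within +/- dev.
    """
    if len(samples) < 3:
        return list(samples)
    archive = [samples[0]]
    pending = []  # samples seen since the last archived point
    for new in samples[1:]:
        t0, v0 = archive[-1]
        t1, v1 = new
        slope = (v1 - v0) / (t1 - t0)
        fits = all(abs(v0 + slope * (t - t0) - v) <= dev for t, v in pending)
        if fits:
            pending.append(new)          # still a straight line; keep holding
        else:
            archive.append(pending[-1])  # line broke; keep the last point that fit
            pending = [new]
    archive.append(samples[-1])          # the current snapshot is always kept
    return archive

# A tank level rising steadily for eight hours, sampled every 10 seconds.
tank = [(t, 10.0 + t / 400.0) for t in range(0, 8 * 3600 + 1, 10)]
tank_archive = compress(tank, dev=0.5)

# A flow with a small blip that is below a generous deviation setting.
flow = [(0, 100.0), (10, 100.0), (20, 100.4), (30, 100.0), (40, 100.0)]
loose = compress(flow, dev=1.0)
tight = compress(flow, dev=0.1)

print(len(tank), "samples ->", len(tank_archive), "archived")
print(loose)
print(tight)
```

In this sketch the eight-hour fill collapses to its first and last points, as intended. The same logic discards the blip in the flow signal when the deviation is set to 1.0 and keeps every sample when it is set to 0.1, which is the trade-off one accepts, knowingly or not, when configuring a point.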

The sample rate of the historian itself is likely no faster than once per second, and we’ve seen instances where a split-second phenomenon like reverse flow through a compressor is missed. And how does the historian get data from the DCS? If it’s OPC DA, for example, points configured as “advise” update only when the system’s OPC server determines a point has changed. Does the time stamp come from the DCS side, the OPC server, or the historian? Sometimes it’s configurable, but who knows which option was selected?
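A short Python sketch shows how a one-second scan can step right over a split-second event. The signal, the 300-millisecond reversal and the names `flow_signal` and `sample` are hypothetical, chosen only to illustrate the timing.

```python
def flow_signal(t_ms):
    """Hypothetical flow: steady at 100, reversed to -30 for 300 ms
    starting 5.4 seconds into the record."""
    return -30.0 if 5400 <= t_ms < 5700 else 100.0

def sample(signal, period_ms, duration_ms):
    """Scan the signal at a fixed period, as a polling collector would."""
    return [(t, signal(t)) for t in range(0, duration_ms + 1, period_ms)]

one_second = sample(flow_signal, 1000, 10_000)
tenth_second = sample(flow_signal, 100, 10_000)

saw_reversal_slow = any(v < 0 for _, v in one_second)
saw_reversal_fast = any(v < 0 for _, v in tenth_second)

print("1 s scan saw reversal:", saw_reversal_slow)
print("100 ms scan saw reversal:", saw_reversal_fast)
```

Here the one-second scan lands at 5.0 and 6.0 seconds and records nothing unusual, while the 100-millisecond scan catches the reversal. Whether a real event is caught depends on where it falls relative to the scan, so a slow historian may capture it on one occasion and miss it on the next.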

Aspiring and practicing data analysts, consider adding this to your chores: Investigate, communicate and document the sources of the “big data” you’re digesting. Before proclaiming correlation or causality, ensure the mechanisms of data collection, storage and reporting aren’t creating a false reality.

About the author: John Rezabek