# Drowning in Data, Starving for Info-Part 3

## More Chat with Randy Reiss, Who Helped Develop a Data Analytics System for Batch Processes for the New Opportunities Created by the Food and Drug Administration's Process Analytical Technology (PAT) Initiative

By Greg McMillan and Stan Weiner

*Greg McMillan and Stan Weiner bring their wits and more than 66 years of process control experience to bear on your questions, comments, and problems. Write to them at **controltalk@putman.net**.*

**Stan:** How do you quantify the effect of unmeasured disturbances?

**Randy:** The analysis is based on measurements. Unmeasured disturbances show up as variability in a quality assurance (QA) value that is uncorrelated with the measurements. Online principal component analysis (PCA) does not consider the QA value, so it gives no indication of them. For projection to latent structures, also known as partial least squares (PLS), the prediction would simply be wrong. In short, there is no quantification of unmeasured disturbances.

**Greg:** How do you set the thresholds for deviations?

**Randy:** A model has an alpha-level parameter that sets how sensitive it is. If that parameter is set to 0.01, then anything outside the central 99% of the bell curve is considered a deviation. A more common alpha level is 0.05, which flags anything outside the central 95% of the bell curve.
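The alpha-to-threshold relationship Randy describes can be sketched in a few lines. This is a minimal illustration, assuming a two-sided cutoff on a normal (bell) curve; the function name is invented for this example.

```python
from scipy.stats import norm

def deviation_threshold(alpha: float) -> float:
    """Two-sided z-score cutoff: values beyond it fall outside
    the central (1 - alpha) mass of the bell curve."""
    return norm.ppf(1.0 - alpha / 2.0)

# alpha = 0.05 keeps the central 95%: cutoff is about 1.96 standard deviations
# alpha = 0.01 keeps the central 99%: cutoff is about 2.58 standard deviations
```

A smaller alpha widens the acceptance band, so fewer points are flagged as deviations.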

**Stan:** How do you determine if a PCA or PLS result is wrong?

**Randy:** The PCA is wrong if it generates so many false positives or false negatives that operations personnel ignore it; then the analysis is not doing its job. For PLS, if the predictions do not reasonably match the lab QA analysis, then PLS is not doing its job. In short, if PCA or PLS is not adding value to your daily operation, then something is wrong.

**Greg:** How do you find the most important contributions for PCA and PLS (drill down)?

**Randy:** PCA contributions are the amount each measurement contributes to the overall statistic, either the T^2 statistic or the error statistic (Q or SPE). Thus, when a fault is detected by the statistic exceeding the upper control limit (UCL), the contributions can be used to see which tags are causing the fault. The idea of multivariate analysis is that more than one measurement may be causing the fault. Thus, when looking at contributions, the operator may not see a single measurement contribution that exceeds the UCL, but rather a combination of measurements that are larger than the rest yet still well below the UCL.

This is where process knowledge comes in. The operator or engineer needs to be able to consider the measurements that are identified as outliers and associate them to a process fault. One method may be to bring up a trend of the measurements for this equipment unit and focus on the tags identified by PCA and ask, "What is going wrong?" The idea is to assess a cause from the correlations provided by the analysis. PCA is an early fault detection tool, but does not diagnose the fault for you. PLS contributions may or may not be available in some toolsets.
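The T^2, Q, and contribution mechanics described above can be sketched with NumPy. This is an illustrative outline, not code from any commercial analytics toolset; the function and variable names are invented, and a real batch implementation would handle time alignment and more careful scaling.

```python
import numpy as np

def pca_statistics(X, n_components=2):
    """Fit PCA on reference (good-batch) data and return a scorer that
    computes T^2, Q (SPE), and per-tag Q contributions for a new sample."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sigma
    # principal directions from the SVD of the scaled reference data
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                       # loadings (tags x components)
    lam = (S[:n_components] ** 2) / (len(X) - 1)  # component variances

    def score(x):
        z = (x - mu) / sigma
        t = z @ P                          # scores inside the model plane
        t2 = np.sum(t ** 2 / lam)          # Hotelling's T^2 statistic
        residual = z - t @ P.T             # the part the model cannot explain
        q = np.sum(residual ** 2)          # Q / SPE statistic
        q_contrib = residual ** 2          # per-tag contributions to Q
        return t2, q, q_contrib

    return score
```

A sample that breaks the correlation structure of the reference data produces a large Q, and its largest contribution points at the offending tag, which is the drill-down idea described above.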

When a predicted product quality degrades, it is natural to ask what caused the prediction to indicate a decrease in product quality. However, the prediction calculation only looks at the major correlations between the measurements and the modeled QA. Thus, the contributing tags for a prediction may not portray the overall picture of the fault, but they may be useful in pointing out a major factor. Regardless, it is always important to refer to the associated PCA charts for a larger view of the problem. In short, a PLS analysis only works on data defined in the model, while PCA reports on what is in the model and what is happening outside of the model.

**Stan:** Is there anything that can be done to verify causal relationships or track down root causes?

**Randy:** The very nature of statistical modeling is correlations, and any causal relationship must be verified. To this extent, analytics are just indicators and should not be used to explicitly define causes for process characteristics.

There is no substitute for a process engineer with a good understanding of the process. The analytics should add to the engineer's process knowledge base and understanding, but should not be used as a de facto tool for defining causal relationships in the process. In practice, if a modeled process shows a relationship between certain measurements and product quality during a period in the batch, it's worth investigating. Periods when the PLS confidence interval significantly narrows may be a good indicator that a critical stage of processing is occurring with regard to end-of-batch product quality. Likewise, if the un-normalized upper control limit (UCL) in the PCA analysis dips, it means the modeled batches were very similar during this period. This information can be used as a starting point for process improvements that may result in better product consistency and quality.

**Stan:** How do you view the need for better data visualization and the prospects of using parallel coordinates?

**Randy:** I tend to see data visualization as what you do when you don't have multivariate statistics; that is, data visualization has the same intent as multivariate statistics, but lacks the mathematical rigor. Instead, it is left up to the viewer to make correlations. The data visualization is supposed to point the viewer in the right direction, but ultimately it is the viewer deciphering the meaning of multivariate data. The whole point of statistics is that humans cannot distinguish correlations as well as mathematics can. A statistic will more definitively group the most similar objects. Data visualization is useful when patterns are apparent.

However, it may be the case that patterns are not apparent or are misleading. I agree that better data visualization can help reveal patterns and outliers, but it only goes so far and relies heavily on human intuition. That said, the technique of quantized generalized parallel coordinate plots (GPCP) looks quite interesting. It is more informative than a normal parallel plot, as it projects the data onto a basis for a continuous 2D function. In this sense, it transforms the whole batch into a single line. Obviously, the basis will determine what the continuous 2D function tells us about the batch. This technique may be useful for engineers. It could prove useful for tracking down the source of disturbances and interactions via common frequencies.

However, I do not see how any of this could be applicable to operators as it is just too abstract. You will never be able to convince or train an operator to read the GPCP graph.

Maybe there is a way to use parallel plots with analytics. What about plotting the statistics on a parallel plot? Plot the PCA statistics (T^2 and Q) and the PLS statistics (prediction and confidence interval). Overlay these four variables for all time slices of a batch, or maybe look at the same time slice of many batches. Maybe that would be interesting. Is the prediction high when the T^2 and Q statistics are high? Because some of these statistics are mathematically dependent, it just might help people understand the relationships. There may be some value in using parallel plots for viewing the various statistics in analytics.
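The overlay Randy proposes can be prototyped with pandas. This is a speculative sketch of that idea, not an existing tool: the statistic values are randomly generated stand-ins, and the column names are invented. Each statistic is min-max normalized so the four very different scales can share one set of parallel axes.

```python
import numpy as np
import pandas as pd

# Invented stand-in data: one row per time slice of a batch, one column
# per statistic (T^2, Q, PLS prediction, confidence-interval width).
rng = np.random.default_rng(2)
slices = pd.DataFrame({
    "T2": rng.uniform(0, 20, 50),
    "Q": rng.uniform(0, 5, 50),
    "prediction": rng.uniform(90, 100, 50),
    "ci_width": rng.uniform(0.5, 2.0, 50),
})

# min-max normalize each statistic to [0, 1] so they share one scale
normalized = (slices - slices.min()) / (slices.max() - slices.min())
normalized["slice"] = range(50)   # label used to draw one line per time slice

# pandas.plotting.parallel_coordinates(normalized, "slice") would then
# draw one line per time slice across the four statistic axes.
```

With the lines overlaid, a time slice whose T^2 and Q are both high while the prediction drops would stand out immediately, which is exactly the relationship Randy wonders about.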

**Greg:** We conclude our visit with Randy with his latest "Top 10 List" of some awfully familiar-sounding songs.

### Top 10 Songs for the Data Analytics Project

10. Does Anybody Really Know What Batch This Is?

9. Another Batch Bites the Dust

8. We Gotta Get Data Out of this Process [If It's the Last Thing We Ever Do]

7. Good Batches, Bad Batches [You Know I've Had My Share]

6. Correlation Dreaming

5. Changes in Variables, Changes in Attitudes

4. [There Must Be] 50 Measurements to Model your Process

3. This Project's So Bright, I Gotta Wear Shades

2. Gimme All Your Data [All Your QA and ICs Too]

1. In-A-Planta-Da-Vida