Drowning in Data, Starving for Information, Part 2

March 10, 2010
McMillan and Weiner Tackled a Big Question: What Is Data Analytics?
By Greg McMillan and Stan Weiner

Greg McMillan and Stan Weiner bring their wits and more than 66 years of process control experience to bear on your questions, comments, and problems. Write to them at [email protected].

Stan: We are fortunate to be able to interview Randy Reiss, who helped develop a data analytics system for batch processes for the U.S. Food and Drug Administration's (FDA) Process Analytical Technology (PAT) initiative.

Greg: What is data analytics?

Randy: It is the use of multivariate statistics for monitoring an industrial process. The most commonly used methods are principal component analysis (PCA) and partial least squares (PLS). Both reduce the dimension of the data by abstracting it based on the predominant correlations. In doing so, the major correlations in the data are highlighted, and the effect of data redundancy is reduced.

PCA is used for early fault detection and employs two statistics to show whether the process is within control limits: Hotelling's T^2 and an error statistic called Q, SPE or DModX. T^2 is a comparison of the relationships among the measurements. For example, the model may show that pressure and temperature are highly correlated; the T^2 statistic then shows how well the on-line batch matches that modeled relationship. The error statistic is a reading of how far a measurement is from the model. The two statistics are complementary: if a deviation does not show up in the model (T^2), then it shows up in the error. In use, the operator monitors the PCA statistics to make sure the process stays within its control limits. If a control limit is exceeded, a drill-down contribution can be used to investigate which process tags are causing the statistics to report a deviation.
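To make the two statistics concrete, here is a minimal sketch (not Randy's actual system) that fits a PCA model with scikit-learn and computes Hotelling's T^2 and the Q/SPE residual for one observation. The data, variable count and component count are invented for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical training data: rows = observations, columns = process measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)  # a correlated pair, e.g. pressure/temperature

# Center and scale, then fit a reduced-dimension PCA model.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma
pca = PCA(n_components=2).fit(Z)

def t2_and_spe(x):
    """Hotelling's T^2 and the Q/SPE error statistic for one observation."""
    z = (x - mu) / sigma
    scores = pca.transform(z.reshape(1, -1))[0]
    t2 = np.sum(scores**2 / pca.explained_variance_)         # deviation within the model plane
    residual = z - pca.inverse_transform(scores.reshape(1, -1))[0]
    spe = np.sum(residual**2)                                # distance from the model plane
    return t2, spe

t2, spe = t2_and_spe(X[0])
```

In a monitoring context, both values would be compared against control limits derived from the training batches; exceeding either one would trigger the drill-down into per-variable contributions.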

PLS is used to predict end-of-batch quality. The statistics employed are the prediction and an associated confidence interval. The prediction is the calculated value of a real quality parameter at the end of the batch. For example, a complicated lab analysis that could take hours may be required to determine the acid content of a product. PLS can be used to model that QA parameter, allowing the operator to move the batch to the next processing stage based on the predicted value. Likewise, the confidence interval is useful for identifying critical periods of the process and their effect on product quality.

Stan: How do you address the 3D aspect of batch profiles?

Randy: Batch data from a set of measurements can be thought of as a stack of matrices aligned by measurement and time slice. Although 3D analysis methods are available (for example, parallel factor analysis), it is more common in industrial analytics to use 2D analysis algorithms, such as PCA and PLS.

To get the 3D data into a meaningful 2D form, the data is unfolded. Unfolding is a matter of slicing the 3D data and placing the slices side by side to form a 2D matrix. The method of unfolding affects how the analysis deciphers the relationships in the data.

Since a batch consists of a time series of data from a set of measurements, it makes sense to analyze the data across batches while retaining the information per sensor and per time slice. That method is called batch-wise unfolding and is the preferred method for process analytics.

Variable-wise unfolding retains the data per variable, but analyzes it across the batches and across the entire batch time. The result is a model that tries to represent the entire batch with a single set of linear relations. That is like using a line to characterize a curve: it may be somewhat correct part of the time, but never really fits the curve. Because batch-wise unfolding retains the information per time slice of the model, it creates a model with finer granularity that better characterizes a non-linear process. However, a batch-wise unfolded model can over-fit that non-linear behavior, whereas a variable-wise unfolded model cannot.

Some call hybrid unfolding a third method, but it is actually just variable-wise unfolding with local scaling; that is, instead of scaling each variable from all the time slices of the model, hybrid unfolding only scales the values from the same time slice. It's an improvement to variable-wise unfolding in that sense, but the overall structure and consequences of variable-wise unfolding still remain.
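The two basic unfoldings are easy to see as array reshapes. A minimal NumPy sketch, using an invented 5-batch x 3-variable x 10-time-slice data cube:

```python
import numpy as np

# Hypothetical batch data cube: I batches x J variables x K time slices.
I, J, K = 5, 3, 10
data = np.arange(I * J * K).reshape(I, J, K).astype(float)

# Batch-wise unfolding: one row per batch; columns are (variable, time-slice)
# pairs, so statistics are computed per sensor AND per time slice.
batch_wise = data.reshape(I, J * K)                       # shape (5, 30)

# Variable-wise unfolding: one row per (batch, time-slice) observation;
# columns are variables, so time-slice identity is lost and one set of
# linear relations must cover the whole batch trajectory.
variable_wise = data.transpose(0, 2, 1).reshape(I * K, J)  # shape (50, 3)
```

The shapes make the trade-off visible: batch-wise unfolding gives few rows with many columns (fine granularity, risk of over-fitting), while variable-wise unfolding gives many rows with few columns (one coarse model for the whole batch).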

Greg: How do you deal with variable batch lengths?

Randy: The analysis requires that each batch be of the same length, which is not the reality of production; thus, methods are needed to bring the data to a uniform length, such as accordion stretch/shrink, simple truncation of data, the use of an indicator variable, major event synchronization, or dynamic time warping (DTW).

DTW works very well for model building. A dynamic optimization adjusts the length of each batch to the best fit with the least change in data. The results retain the features of all the batches and do a good job of matching them in time. The benefit to the analysis is a much better batch profile from which to develop statistics. However, DTW is a time-consuming algorithm, easily accounting for 40% of the time to develop a model. When performing analytics on-line, a more complex problem presents itself: properly aligning the on-line batch with the correct time slice of the model for analysis. The major difficulties of on-line synchronization are the lack of complete data because the batch is in progress, and the time constraints of real-time processing of the analysis.
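For readers who want to see the core of DTW, here is a toy single-variable implementation of the classic dynamic-programming recursion. Real batch-alignment tools also return the warping path and handle multivariate, constrained and weighted cases; this sketch only computes the aligned distance:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D trajectories.

    D[i, j] holds the cost of the best alignment of a[:i] with b[:j];
    each cell extends the cheapest of the three neighboring alignments
    (stretch a, stretch b, or advance both)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Note that a batch stretched to twice its length still aligns perfectly (`dtw_distance([0, 0, 1, 1, 2, 2], [0, 1, 2])` is 0), which is exactly the property that makes DTW attractive for variable-length batches — and the nested loop over both trajectories is why it is so expensive.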

Stan: What guidance can you offer on how to select inputs and batches?

Randy: A single quality assurance (QA) value (e.g., end-of-batch lab quality result) should be used to rate each batch. If you want to look at multiple QA values, then you will need to do multiple analyses. The set of batches used to generate a model is called the training set.

I get the best results when a uniform distribution of QA results is used for the training set. The training set uses the same number of batches from the full range of the QA parameter. Usually, this is done by establishing sub-ranges or bins, and then populating each bin with the same number of batches to represent the full range of the QA parameter evenly in the model. Be sure to keep a test set of batches off to the side to validate your model.
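One way to build that uniform training set is to bin the QA values and sample equally from each bin. A hypothetical helper — the function name and bin logic are my own illustration, not part of any commercial analytics package:

```python
import numpy as np

def uniform_training_set(qa_values, n_bins, per_bin, seed=0):
    """Pick an equal number of batches from each QA sub-range (bin).

    Illustrative sketch; assumes every bin contains at least `per_bin`
    batches. Returns sorted indices into qa_values for the training set."""
    rng = np.random.default_rng(seed)
    qa = np.asarray(qa_values, dtype=float)
    edges = np.linspace(qa.min(), qa.max(), n_bins + 1)
    bins = np.digitize(qa, edges[1:-1])   # bin index 0 .. n_bins-1 per batch
    chosen = []
    for b in range(n_bins):
        in_bin = np.where(bins == b)[0]
        chosen.extend(rng.choice(in_bin, size=per_bin, replace=False))
    return sorted(int(i) for i in chosen)

# Usage with invented QA values for 30 batches:
qa = np.linspace(0.0, 9.0, 30)
train_idx = uniform_training_set(qa, n_bins=3, per_bin=4)
```

The batches not selected here would form the held-out test set used to validate the model.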

Selection of inputs is somewhat more difficult and subjective. Start with all the measurements for the equipment unit. Know your process and what measurements are important for that stage of processing. Strip off measurements that are obviously not applicable, and then consider each remaining one. Avoid those that are only relevant for a short period of the stage or that are shared by other equipment units. If you're unsure, keep the measurement in the model and try it out. Iteration is how you develop a good model. Try it, and look at the PCA contributions to see which variables are causing deviations. Are they real, or is the measurement problematic? Is the deviation expressed in another measurement? Try not to restrict the data coming into the analysis. Redundant measurements are OK if each measurement is not problematic in itself, but unrelated data can cause false alarms on-line.

Once a set of measurements is determined, it will usually stay the same over time. However, as a process evolves, you may need to regenerate models with newer batches in the training set. Be sure to maintain the uniform distribution of the QA values in the training set when updating the model.

You need a minimum of 30 batches to stabilize the mathematics of a reasonable process. Greater variability in the training set will require more batches. Diminishing returns start when using more than 50 batches for a process. More batches are not always better. Too many batches of little variability will create a model that is very tight and may cause false alarms.

Greg: Randy's latest Top 10 list might just have an Oscar winner. 

Top 10 Movie Titles for the Data Analytics Project

10. Honey, I Shrunk the Data
9. Analytics Now
8. The Batch Hunter
7. Lord of the [Principal] Components
6. The Which Data Project?
5. 2001: An Analytics Odyssey
4. Statistic Without a Cause
3. The Empirical Strikes Back
2. Snow White and the 70 Batches
1. The Extrapolationist