Figure 4: Correlation matrix for the data of the spinning band distillation
column. Each cell shows the linear correlation (positive or negative)
between the parameter on the left and the parameter at the top.
The correlation matrix indicates that there is a strong linear relationship between the
temperature measurements and heater power, which is expected for a
distillation column (1). Further, there is a strong relationship between
distillate and bottom product mass (2), but these features are not
considered for the pressure drop forecast as it is known from experience
that there is no significant impact on the pressure drop in the column.
The same argument applies to the liquid level in the bottom (3). In
terms of temperature measurements, the temperature in the head of the
column is retained as a feature, because it contains information on the
boiling point of the volatile component and the current concentration.
Pressure drop is kept as a feature as it describes the recent pressure
drop trend, which can be useful for the forecast. The remaining
parameters show no strong linear relationship. Furthermore, it is known
from experience and physical relationships that the liquid hold-up directly
influences the pressure drop in the distillation column. Thus, the remaining
parameters are selected as features as well. In total, 6 parameters (pressure drop,
column head temperature, band rotation speed, heater power, feed flow,
and reflux ratio) are selected and used to model the forecast.
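The correlation screening described above can be sketched as follows. This is a minimal illustration on synthetic data, not the recorded process data: the column names, the generated values, and the threshold of 0.8 for a "strong" relationship are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the process data (column names are hypothetical).
rng = np.random.default_rng(0)
n = 500
heater_power = rng.uniform(50.0, 150.0, n)
data = pd.DataFrame({
    "heater_power": heater_power,
    # Temperature follows heater power closely, mimicking relationship (1).
    "bottom_temperature": 0.4 * heater_power + rng.normal(0.0, 1.0, n),
    "feed_flow": rng.uniform(1.0, 5.0, n),
    "reflux_ratio": rng.uniform(0.5, 3.0, n),
})

# Pearson correlation matrix: entry (i, j) is the linear correlation
# between column i and column j.
corr = data.corr(method="pearson")


def strongly_correlated_pairs(corr, threshold=0.8):
    """Return (name_a, name_b, r) for each column pair with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


pairs = strongly_correlated_pairs(corr)
print(pairs)
```

Pairs flagged in this way are only candidates for removal. As in the text, the final choice of features also rests on process knowledge, for example keeping the column head temperature despite its correlation with heater power.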
The clustering step will be performed based on the pressure drop data
alone to identify flooding behavior in the distillation column. Pressure
drop is preprocessed and transformed as described for the forecast
problem in order to maintain the same data structure and facilitate the
implementation with live data. Time series data can typically be
decomposed into four components: trend, level, seasonality,
and noise. To ensure good visualization and interpretability of the
occurring clusters, two features are chosen for the clustering process.
As the flooding behavior does not occur in specific regular intervals
(seasonality) and noise has been reduced by means of EWMA, trend and
level should contain the significant information to identify meaningful
clusters and are therefore chosen as features.
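One possible way to derive the two clustering features from the pressure drop signal is sketched below. The definitions used here are assumptions for illustration: the level is taken as the mean of the EWMA-smoothed signal within a window, the trend as the least-squares slope over the same window, and the window length and EWMA span are placeholder values. The function name `trend_level_features` is hypothetical.

```python
import numpy as np
import pandas as pd


def trend_level_features(pressure_drop, window=10, ewma_span=5):
    """Smooth the series with an EWMA and return one (trend, level) row per window."""
    smoothed = (
        pd.Series(pressure_drop, dtype=float)
        .ewm(span=ewma_span, adjust=False)
        .mean()
        .to_numpy()
    )
    t = np.arange(window, dtype=float)
    features = []
    for start in range(len(smoothed) - window + 1):
        segment = smoothed[start:start + window]
        slope = np.polyfit(t, segment, 1)[0]  # trend: slope per sample
        level = segment.mean()                # level: mean within the window
        features.append((slope, level))
    return np.array(features)


# Synthetic signal: 60 s of steady operation, then a 40 s ramp (abrupt rise).
signal = np.concatenate([np.full(60, 2.0), 2.0 + 0.5 * np.arange(1, 41)])
features = trend_level_features(signal)
print(features[0], features[-1])
```

In this synthetic case the steady windows sit near zero trend at a constant level, while the ramp windows move toward a higher trend and level, which is the kind of separation a two-feature clustering can make visible in a scatter plot.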
Model training and validation
Data from the spinning band distillation column are acquired at
one-second intervals. Since flooding happens abruptly, it is important to
maintain this sampling frequency despite the large amount of data that is
collected. Therefore, scalable and computationally inexpensive models
based on regression trees, which are explained in more detail in section
1.1, are prioritized in the scope of this work. These bagging and
boosting methods will be used with regression trees as base estimators
for the pressure drop forecast, and their performance will be compared
using two metrics: root mean squared error (RMSE) and
coefficient of determination (R²). Additionally, linear regression will
be applied for the pressure drop forecast to serve as a reference model.
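The comparison described above might look like the following sketch with scikit-learn. The specific estimators are assumptions: `RandomForestRegressor` stands in for a bagging method and `GradientBoostingRegressor` for a boosting method, and the data are synthetic with 6 placeholder features rather than the recorded distillation runs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression problem with 6 placeholder features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 6))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 0.05, 600)
X_train, X_test = X[:480], X[480:]
y_train, y_test = y[:480], y[480:]

models = {
    "linear regression (reference)": LinearRegression(),
    "random forest (bagging)": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting (boosting)": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))  # root mean squared error
    r2 = float(r2_score(y_test, pred))                       # coefficient of determination
    scores[name] = (rmse, r2)
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
```

Evaluating every model on the same held-out data with the same two metrics keeps the comparison with the linear reference model direct.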
The window and response size are determined via a grid search using a
representative regression model (random forest regression) with the
default settings from the scikit-learn library in Python. Investigated
window sizes range from 5 to 20 s and response sizes from 15 to 30 s.
The goal is to use a small window size to keep the amount of data during
the transformation small (Figure 3) and a large response size for a long
forecast, while maintaining a good prediction accuracy
(R² > 0.95). Training data consists of 8 and test data of 2
recorded distillation runs, which corresponds to 54948 and 9884
measurements, respectively. The resulting accuracies for different
window and response sizes are given in Table 1 in the form of RMSE and
R².
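A grid search over window and response sizes could be organized as in the sketch below. Several points are assumptions made for illustration: the helper `make_windows` is hypothetical and uses the last `window` samples as features and the value `response` steps after the window as the target, which may differ from the transformation in Figure 3; the signal is synthetic; and only a coarse subset of the investigated sizes is scanned to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def make_windows(series, window, response):
    """Features: `window` consecutive samples; target: value `response` steps later."""
    X, y = [], []
    for end in range(window, len(series) - response + 1):
        X.append(series[end - window:end])
        y.append(series[end + response - 1])
    return np.array(X), np.array(y)


# Synthetic 1 Hz signal standing in for the pressure drop.
rng = np.random.default_rng(0)
t = np.arange(1500)
series = np.sin(2.0 * np.pi * t / 200.0) + rng.normal(0.0, 0.02, t.size)
train, test = series[:1200], series[1200:]  # windowed separately to avoid leakage

results = {}
for window in (5, 10, 20):
    for response in (15, 30):
        X_train, y_train = make_windows(train, window, response)
        X_test, y_test = make_windows(test, window, response)
        # Default settings apart from a fixed seed for reproducibility.
        model = RandomForestRegressor(random_state=0)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        results[(window, response)] = (rmse, float(r2_score(y_test, pred)))

# Combinations that meet the accuracy target stated in the text.
acceptable = [key for key, (_, r2) in results.items() if r2 > 0.95]
print(results)
print(acceptable)
```

From the acceptable combinations, one would then prefer a small window size and a large response size, following the goal stated above.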
Table 3: Root mean squared
error (RMSE) and coefficient of determination (R²) for different window
and response sizes using random forest regression.