2.4 Machine learning
GFP intensity values were scaled down by five orders of magnitude before being used for machine learning. For all machine learning algorithms except principal component analysis (PCA), the data from the E1 yeast extract were reserved for double validation calculations. The remaining data were randomly split into learning and test datasets at a ratio of 85:15. PCA, partial least squares (PLS) regression, and random forest (RF) were performed on the Python 3.6 platform using the scikit-learn library.[20] The number of components for the PLS models was set at 6. For RF, the parameters were set as follows: for estimating cell yields, max_depth = 10, max_features = 6, max_leaf_nodes = None, n_estimators = 300, and random_state = 2525; for estimating GFP yields, max_depth = 5, max_features = 169, n_estimators = 50, and random_state = 2525. These parameters were chosen by searching for the optimal values with the grid search function.
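The following is a minimal sketch of this scikit-learn workflow. The file names, variable names, the random seed for the split, and the grid-search parameter grids are placeholders and assumptions; the split ratio, PLS component count, and RF parameters follow the values stated above.

```python
# Minimal sketch of the scikit-learn workflow; file names, the split seed,
# and the grid-search grids are placeholders, not the authors' code.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor

X = np.load("yeast_extract_components.npy")  # hypothetical: 205 variables per sample
y = np.load("cell_yield.npy")                # hypothetical: measured cell yields

# Random 85:15 split into learning and test datasets (seed is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=2525)

# PCA is unsupervised and, per the text, uses all data including E1.
pca = PCA().fit(X)

# PLS regression with 6 components.
pls = PLSRegression(n_components=6).fit(X_train, y_train)

# RF with the parameters reported for cell-yield estimation.
rf = RandomForestRegressor(max_depth=10, max_features=6, max_leaf_nodes=None,
                           n_estimators=300, random_state=2525)
rf.fit(X_train, y_train)

# Grid search over candidate RF parameters (grids shown are illustrative).
grid = GridSearchCV(RandomForestRegressor(random_state=2525),
                    param_grid={"max_depth": [5, 10, 20],
                                "n_estimators": [50, 100, 300]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)
```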
Neural network (NN) and deep neural network (DNN) models were coded in Python 3.6 using TensorFlow 1.5 and the Keras library (https://keras.io/).[21] In all cases, the input shape was set to 205 parameters. To estimate the final yields, the output shape was a single parameter, either cell yield or GFP yield. For time-course estimation, the output shape was set to 5 parameters corresponding to the sampling times of each cell growth and GFP sample. The conventional NN consisted of a single hidden layer of 100 units with hyperbolic tangent (tanh) activations, built from fully connected (dense) layers. The HeNormal class was used as the kernel weight initializer, and the activation of the output layer was set to linear. The Adam algorithm was used as the optimizer with the default settings of the Keras library. Training was carried out to minimize the mean squared error (MSE) (eq. 1) and was run for 3,000 iterations. A model checkpoint function recorded the weights of the model with the minimal MSE.
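As a reference, the sketch below shows how such a network can be assembled in Keras. It reuses X_train and y_train from the earlier sketch; the checkpoint file name and the interpretation of the 3,000 training iterations as epochs are assumptions.

```python
# Sketch of the conventional NN (Keras on TensorFlow 1.x). The checkpoint
# file name and "iterations = epochs" are assumptions, not the authors' code.
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

model = Sequential([
    # Single fully connected hidden layer: 100 tanh units, HeNormal initializer.
    Dense(100, activation="tanh", kernel_initializer="he_normal",
          input_shape=(205,)),
    # Linear output: 1 unit for a final-yield target (5 for time-course models).
    Dense(1, activation="linear"),
])

# Adam optimizer with Keras defaults; training minimizes the MSE of eq. 1.
model.compile(optimizer="adam", loss="mse")

# Record the weights of the model with the minimal MSE.
checkpoint = ModelCheckpoint("nn_best.h5", monitor="loss",
                             save_best_only=True, save_weights_only=True)

model.fit(X_train, y_train, epochs=3000, callbacks=[checkpoint], verbose=0)
```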
\begin{equation}
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\bar{y}_{i}\right)^{2}
\tag{1}
\end{equation}
where \(n\) indicates the number of samples, \(\bar{y}_{i}\) indicates the measured value of the dependent variable, and \(y_{i}\) indicates the value estimated by the constructed model.
The DNN was constructed with 4 hidden layers (200, 100, 50, and 20 units) with tanh activations. The number of training iterations was set at 10,000. The other DNN parameters were identical to those of the NN.
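Under the same assumptions as the NN sketch above, the DNN differs only in its stack of hidden layers and training length:

```python
# DNN variant: four tanh hidden layers (200, 100, 50, and 20 units);
# all other settings match the NN sketch above.
dnn = Sequential([
    Dense(200, activation="tanh", kernel_initializer="he_normal",
          input_shape=(205,)),
    Dense(100, activation="tanh", kernel_initializer="he_normal"),
    Dense(50, activation="tanh", kernel_initializer="he_normal"),
    Dense(20, activation="tanh", kernel_initializer="he_normal"),
    Dense(1, activation="linear"),
])
dnn.compile(optimizer="adam", loss="mse")
dnn_checkpoint = ModelCheckpoint("dnn_best.h5", monitor="loss",
                                 save_best_only=True, save_weights_only=True)
dnn.fit(X_train, y_train, epochs=10000, callbacks=[dnn_checkpoint], verbose=0)
```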
MIE calculations were performed with reference to the MDA calculation reported by Date and Kikuchi.[19] For the MIE calculation, the values of one variable were randomly rearranged among the input data (a permutation), and the rearranged data matrices were evaluated by the constructed DNN model. The model loss (MSE) obtained with the permuted data was compared with the model loss obtained with the original data. In this comparison, a relatively small influence on the MSE means that the constructed model was rarely influenced by the variable, whereas a relatively large influence on the MSE means that the constructed model was significantly affected by it. Based on this criterion, the MIE can evaluate the importance of the variables in the constructed DNN model. In this study, permutations were repeated 60 times for each variable, and the average MSE calculated from the rearranged matrices was used as the representative importance of each variable.
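A possible implementation of this permutation procedure is sketched below, assuming the trained DNN and the arrays X and y from the sketches above; the function name and loop structure are illustrative, not the authors' code.

```python
# Sketch of the permutation-based MIE calculation; helper name and
# structure are illustrative.
import numpy as np

def mie_importance(model, X, y, n_repeats=60, seed=2525):
    """Average MSE per variable after repeated permutation of that variable."""
    rng = np.random.RandomState(seed)
    # Baseline model loss on the original (unpermuted) data, for comparison.
    baseline = np.mean((model.predict(X).ravel() - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):              # one input variable at a time
        losses = []
        for _ in range(n_repeats):           # 60 permutations per variable
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle variable j only
            losses.append(np.mean((model.predict(X_perm).ravel() - y) ** 2))
        importances[j] = np.mean(losses)     # representative importance
    return importances, baseline
```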
To evaluate the effects of the important variables, a sensitivity analysis was performed in which cell growth and GFP yields were estimated while varying only a single important variable in the yeast extract composition.
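For instance, such a scan could be set up as follows, holding every other component fixed at its mean value; the variable index, background composition, and scan range are assumptions.

```python
# Sketch of a single-variable sensitivity scan; the variable index,
# background composition, and scan range are assumptions.
var_idx = 0                                    # hypothetical important variable
grid_vals = np.linspace(X[:, var_idx].min(), X[:, var_idx].max(), 20)

base = X.mean(axis=0, keepdims=True)           # fixed background composition
scan = np.repeat(base, len(grid_vals), axis=0)
scan[:, var_idx] = grid_vals                   # vary only the chosen variable

predicted = dnn.predict(scan).ravel()          # estimated cell growth / GFP yield
```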
A personal computer (PC) equipped with a graphics processing unit (GPU) was used for the calculations. PC specifications: OS, Ubuntu 16.04 LTS; CPU, Intel Core i7-8700 (3.2-4.6 GHz, 6 cores, 12 threads, 12 MB cache); memory, DDR4-2666 32 GB; GPU, NVIDIA GeForce GTX 1080 Ti 11 GB.