4 Discussion
In this study, we evaluated the application of machine learning
algorithms as a method to examine the composition profiles of various
yeast extracts and their effect on the production of GFP, a heterologous
protein, by E. coli. The GC-MS profiling revealed a variety of
compositions among the yeast extracts (Figure 1). Across the 20 yeast
extracts of different composition, the cell yield varied between
3.05 ± 0.04 and 5.00 ± 0.23, and the GFP yield varied between
2.55×10⁴ ± 4.13×10³ and 4.86×10⁴ ± 4.17×10³
(Figures S1 and S2). The differences in GFP and
cell yields were associated with the composition profiles of the yeast
extracts. Then, we applied machine learning algorithms to determine the
relationship between the cultivation results and the yeast extract
compositions via a metabolomics approach. PCA and PLS have been
frequently applied to metabolomics
approaches.[18,19,22,23] However, the PLS
algorithm did not fit the experimental data as well as the other
algorithms, although the coefficients of determination
(R²_learn and
R²_test, the latter synonymous with Q²)
were sufficient in general.[15,16,23] To improve
the estimation of the cultivation results from the medium components,
RF, NN, and DNN were applied to the present data and compared
(Figures 2 and 3). These algorithms tended to fit the data with smaller
estimation losses than PLS, a trend that has also been observed in
previous studies.[17,19] In particular, MSE_val decreased in the case
of NN, which suggests that NN can avoid overfitting to the training
data. DNN showed smaller losses than NN, and it was the best model for
estimating the cultivation results.
The described DNN structure may not be optimal for the present data
because DNN structures can be tuned further. In addition, the amount of
experimental data is limited, and this limited dataset may affect the
DNN model. Nevertheless, the DNN-based strategy improved the model
accuracy in comparison to PLS. In general, it is difficult to identify
the important variables in a DNN model. In this study, the important
variables were estimated by DNN-MIL using a permutation algorithm. Glu,
Asp, trehalose or maltose, glycerol, and
phosphate were estimated to be the important components for GFP
production (Figure 4). Furthermore, the relationship between the
number of input variables and the estimation accuracy showed that the
top 18 and top 15 important variables dominated the estimation
accuracy for cell and GFP yields, respectively. Indeed, supplementing
Glu at 0.05 g/L increased the GFP yield by 12.9% when M4 yeast extract
was used as a component of the production medium (Figure 5). These
results demonstrate that
DNN-MIL can identify the features of yeast extract composition that are
relevant to GFP production. However, the sensitivity analyses (Figures
S4 and S5) estimated that the important variables found by DNN-MIL had
only a limited influence on the cell and GFP yields. We believe this
discrepancy was caused by the difference in input data: the important
variables were calculated by DNN-MIL using a global dataset of all
yeast extracts, while the sensitivity analyses were performed for one
specific yeast extract (M4). These results show that each individual
important
variable may only weakly influence cellular activities, such as growth
and the expression of foreign proteins, in basal yeast extracts. These
effects may vary among different brands and lots of yeast extract.
Although glycerol was estimated to increase cell and GFP yields in the
case of M4 yeast extract, the yields of cells and GFP were
significantly decreased in the experimental validation (Figure 5).
Thus, the sensitivity analysis and the experimental validation
disagreed, and the risk of false positives or false negatives in
estimations made by machine learning remains a concern.
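The permutation idea behind the importance ranking can be illustrated
with a minimal Python sketch. It is not the DNN-MIL implementation used
in this study: toy_predict is a hypothetical stand-in for a trained
model, the three input names and their weights are invented for
illustration, and the score is simply the mean increase in mean squared
error after one input column is shuffled.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when a single input column is shuffled.

    predict: any callable mapping one row (list of floats) to a float.
    Returns one score per input column; larger means more important.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            loss = mse(y, [predict(row) for row in X_perm])
            increases.append(loss - baseline)
        importances.append(mean(increases))
    return importances


def toy_predict(row):
    """Hypothetical model: the third input is deliberately unused."""
    glu, trehalose, other = row
    return 3.0 * glu + 1.0 * trehalose


# Synthetic data for 20 hypothetical samples (not the study's data).
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(20)]
y = [toy_predict(row) for row in X]

scores = permutation_importance(toy_predict, X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In this toy setting the input that the model ignores receives a score of
exactly zero, and the input with the larger weight ranks first. Because
the score is computed over all samples at once, it describes a global
property of the model rather than the response of any single extract.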
Glu, Ala, Phe, Ile, Lys, and Asp increased the cell and GFP yields, and
Leu, Ser, Thr, Asn, Val, and Tyr decreased the cell and GFP yields
(Figure 5). Chow et al. also reported that in recombinant E. coli
BLR(DE3), Asn, Asp, Gln, and Glu increased the production
of elastin-like polypeptides, which are recombinant peptide-based
biopolymers that contain repetitive sequences enriched in Gly, Val, Pro,
and Ala.[24] In this study, Glu and Asp, but not
Asn, increased the expression of GFP. These results may indicate that
the behavior of E. coli in rich medium differs from its behavior in
basal media and under standard culture conditions. Kurmar et al. also
reported that supplementing chemically defined media with a mixture of
20 amino acids increased recombinant peptide production by 40% in
E. coli BL21(DE3).[25] Generally, the addition of amino
acids to growth medium can influence E. coli protein expression.
In rich medium, E. coli cells grow faster, and expression of the
majority of the translation apparatus genes is significantly elevated.
This is consistent with known patterns of growth rate-dependent
regulation and an increased rate of protein synthesis in rapidly growing
cells. In minimal medium, by contrast, cell behavior would be
controlled by the biosynthesis of building blocks, such as de novo
biosynthesis of amino acids and nucleotides.[26,27] However, the
effects of individual amino acids in rich medium have not been
sufficiently studied, and, surprisingly, there is still no consensus.
Therefore, many engineers associated with industrial production
are forced to screen for the best raw materials, such as different
brands and lots of yeast extracts, because they have no information on
the significant components in the raw materials. In this study, we
demonstrated that the DNN-MIL algorithm can be applied to estimate the
cell growth and GFP yield of a recombinant strain of E. coli, and that
it can predict the components that are most important for cell growth
and GFP production. Part of this estimation matched the results of the
validation cultivations with added components. In particular, Glu was
estimated to be the most important variable in the DNN-MIL simulation,
and supplementing Glu increased the GFP yield by 12.9% in the
validation cultivation. These results imply that DNN-MIL analysis
relating the compositions of raw materials to cell yields and
heterologous protein production can provide promising information for
the optimization of medium components and quality control. However,
the DNN model may lead to
erroneous conclusions because of biases in the learning dataset. Based
on the sensitivity analysis, phosphate and glycerol were estimated to
increase cell and GFP yields (Figure S5), but these components reduced
the yields in the actual validation cultivation (Figure 5). Components
that could not be detected by GC-MS were ignored in the present study,
and they may affect the behaviors estimated by DNN-MIL. This weakness
of the current strategy could be addressed by enriching the datasets,
both by increasing the number of raw materials and by using additional
instrumental analyses.
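A small Python sketch can show why a one-at-a-time sensitivity analysis
around a single extract depends on that extract's starting composition.
Everything in it is hypothetical: toy_yield is an invented nonlinear
response, not the trained DNN, and the compositions and the step size
are arbitrary values chosen for illustration.

```python
def local_sensitivity(predict, baseline, index, delta):
    """Predicted change in output when one input is raised by delta.

    All other inputs are held at their baseline values, so the result
    is specific to the chosen baseline composition.
    """
    perturbed = list(baseline)
    perturbed[index] += delta
    return predict(perturbed) - predict(baseline)


def toy_yield(row):
    """Hypothetical response: linear in glu, peaked in glycerol."""
    glu, glycerol = row
    return 2.0 * glu + glycerol * (1.0 - glycerol)


# Two hypothetical extracts that differ only in their glycerol content.
low_glycerol_extract = [0.5, 0.2]
high_glycerol_extract = [0.5, 0.8]

effect_low = local_sensitivity(toy_yield, low_glycerol_extract, index=1, delta=0.1)
effect_high = local_sensitivity(toy_yield, high_glycerol_extract, index=1, delta=0.1)
```

In this toy model the same addition raises the predicted yield for the
extract with little glycerol and lowers it for the extract with much
glycerol. The sketch only illustrates that dependence on the baseline;
it does not explain the glycerol result observed in the validation
cultivation.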
To our knowledge, this is the first study to use a DNN-mediated approach
for a regression model, although Date and Kikuchi have already
demonstrated DNN-mediated metabolomics for a classification
model.[19]
In conclusion, the relationship between the GC-MS profiles of yeast
extracts and the cultivation yields of a heterologous protein was
fitted best by the DNN algorithm. The MIL
calculation based on a permutation algorithm identified the important
variables that have the potential to enhance or reduce protein
production and cell growth. The DNN-mediated omics-like analysis between
media and cultivation can be applied to new strategies for optimizing
medium compositions and for quality control of media components. In
addition, DNN-mediated metabolomics approaches are applicable to general
metabolomics.