4 Discussion
In this study, we evaluated machine learning algorithms as a method to examine the composition profiles of various yeast extracts and their effects on heterologous GFP production by E. coli. GC-MS profiling showed that the yeast extracts varied widely in composition (Figure 1). Across the 20 yeast extracts, the cell growth yield of E. coli varied between 3.05 ± 0.04 and 5.00 ± 0.23, and the GFP yield varied between 2.55×10⁴ ± 4.13×10³ and 4.86×10⁴ ± 4.17×10³ (Figures S1 and S2). The differences in GFP and cell yields were associated with the composition profiles of the yeast extracts. We then applied machine learning algorithms to determine the relationship between the cultivation results and the yeast extract compositions via a metabolomics approach. PCA and PLS have been frequently applied in metabolomics.[18,19,22,23] However, although its coefficients of determination (R²_learn and R²_test, the latter synonymous with Q²) were generally sufficient,[15,16,23] PLS did not fit the experimental data as well as the other algorithms. To improve the estimation of the cultivation results from the medium components, RF, NN, and DNN were applied to the present data and the algorithms were compared (Figures 2 and 3). These algorithms tended to fit the data with smaller estimation losses than PLS, a trend that has been observed in previous studies.[17,19] In particular, MSE_val decreased with NN, which suggests that NN was less prone to overfitting the training data. DNN showed smaller losses than NN and was the best of the tested models for estimating the cultivation results. The DNN structure described here may not be optimal for the present data, because the structure could be tuned further. In addition, the amount of experimental data is limited, and this limited dataset may affect the DNN model.
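The comparison above rests on two kinds of quantities: a coefficient of determination on the learning data and on held-out data (the latter reported as Q²), and a mean squared error on data not used for fitting. The following is a minimal sketch of how such values can be computed, assuming the standard definition R² = 1 − SS_res/SS_tot. The toy data, the function names, and the one-variable least-squares line (a stand-in for PLS, RF, NN, or DNN) are illustrative assumptions and are not taken from the study.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and estimated yields."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares for y = a * x + b (stand-in for any regressor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    return a, my - a * mx


# Hypothetical toy data (NOT from the study): one component level vs. a yield.
x_learn, y_learn = [0.0, 1.0, 2.0, 3.0], [1.0, 3.1, 4.9, 7.0]
x_test, y_test = [4.0, 5.0], [9.2, 10.8]

a, b = fit_line(x_learn, y_learn)
pred_learn = [a * xi + b for xi in x_learn]
pred_test = [a * xi + b for xi in x_test]

r2_learn = r2(y_learn, pred_learn)   # fit on the data used for learning
r2_test = r2(y_test, pred_test)      # fit on held-out data (Q2 in the text)
mse_test = mse(y_test, pred_test)    # loss on held-out data
print(f"R2learn={r2_learn:.3f}  R2test={r2_test:.3f}  MSEtest={mse_test:.4f}")
```

A large gap between the learning and test values is the usual sign of overfitting, which is why the held-out loss, rather than the learning loss, is the more informative basis for comparing algorithms.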
However, the DNN-based strategy improved model accuracy in comparison to PLS. In general, it is difficult to identify the important variables from a DNN. In this study, the important variables were estimated by DNN-MIL using a permutation algorithm. Glu, Asp, trehalose or maltose, glycerol, and phosphate were estimated to be the important components for GFP production (Figure 4). Furthermore, the relationship between the number of input variables and the estimation accuracy indicated that the top 18 and top 15 important variables dominated the estimation accuracy for cell and GFP yields, respectively. Indeed, supplementing Glu at 0.05 g/L increased the GFP yield by 12.9% when M4 yeast extract was used as a component of the production medium (Figure 5). These results demonstrate that DNN-MIL can identify the features of yeast extract composition that matter for GFP production. However, the sensitivity analyses (Figures S4 and S5) estimated that the important variables found by DNN-MIL had less influence on the cell and GFP yields. We attribute this discrepancy to the difference in input data: the important variables were calculated by DNN-MIL from the global dataset of all yeast extracts, whereas the sensitivity analyses were performed for one specific yeast extract (M4). These results suggest that each important variable, taken individually, may only weakly influence cellular activities such as growth and foreign protein expression in a given basal yeast extract. These effects may vary among different brands and lots of yeast extract. Moreover, although glycerol was estimated by the sensitivity analysis to increase cell and GFP yields in the case of M4 yeast extract, the yields of cells and GFP decreased significantly in the experimental validation (Figure 5).
Thus, the risk of false positives or negatives using estimations made by machine learning is still a concern.
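The distinction drawn above, between a permutation-based importance computed over the whole dataset and a sensitivity analysis around one specific extract, can be illustrated with a short sketch. The code below is not the DNN-MIL implementation used in this study: the function `model` is a hypothetical stand-in for a trained regressor, the three input variables and the 20 random samples are invented for illustration, and the perturbation size `delta` is an arbitrary choice.

```python
import random


def model(x):
    """Hypothetical stand-in for a trained regressor (NOT the study's DNN)."""
    return 3.0 * x[0] - (x[1] - 2.0) ** 2 + 0.0 * x[2]


def mse(y_true, y_pred):
    """Mean squared error between reference and estimated values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Global importance: mean rise in MSE when one input column is shuffled."""
    rng = random.Random(seed)
    base = mse(y, [predict(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        rises = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between variable j and the yield
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            rises.append(mse(y, [predict(row) for row in X_perm]) - base)
        scores.append(sum(rises) / n_repeats)
    return scores


def local_sensitivity(predict, x, delta=0.05):
    """Local effect: change in the estimate when one input is raised by delta."""
    base = predict(x)
    return [predict(x[:j] + [x[j] + delta] + x[j + 1:]) - base
            for j in range(len(x))]


# Hypothetical dataset: 20 samples, 3 input variables (NOT measured values).
rng = random.Random(1)
X = [[rng.uniform(0.0, 4.0) for _ in range(3)] for _ in range(20)]
y = [model(row) for row in X]

importance = permutation_importance(model, X, y)
sample = [1.0, 3.0, 1.0]  # one specific hypothetical sample
effects = local_sensitivity(model, sample)

print("global permutation importance:", [round(s, 3) for s in importance])
print("local effect of +delta on one sample:", [round(e, 4) for e in effects])
```

In this toy setting the second variable receives a clearly positive global importance, yet raising it for the chosen sample lowers the estimate, because the importance score is unsigned and averaged over all samples while the sensitivity is signed and specific to one starting composition. This is one way in which a global ranking and a per-extract sensitivity analysis can disagree.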
Glu, Ala, Phe, Ile, Lys, and Asp increased the cell and GFP yields, whereas Leu, Ser, Thr, Asn, Val, and Tyr decreased them (Figure 5). Chow et al. reported that in recombinant E. coli BLR(DE3), Asn, Asp, Gln, and Glu increased the production of elastin-like polypeptides, which are recombinant peptide-based biopolymers that contain repetitive sequences enriched in Gly, Val, Pro, and Ala.[24] In this study, Glu and Asp, but not Asn, increased the expression of GFP. These results may indicate that E. coli behaves differently in rich medium than in basal media under standard culture conditions. Kurmar et al. also reported that supplementing chemically defined media with a mixture of 20 amino acids increased recombinant peptide production by 40% in E. coli BL21(DE3).[25] In general, the addition of amino acids to the growth medium can influence protein expression in E. coli. In rich medium, E. coli cells grow faster, and expression of the majority of the translation apparatus genes is significantly elevated. This is consistent with known patterns of growth rate-dependent regulation and an increased rate of protein synthesis in rapidly growing cells. In minimal medium, by contrast, cell behavior would be governed by the biosynthesis of building blocks, such as de novo biosynthesis of amino acids and nucleotides.[26,27] However, the effects of individual amino acids in rich medium have not been sufficiently studied and, surprisingly, no consensus exists today. Therefore, many engineers in industrial production are forced to screen for the best raw materials, such as different brands and lots of yeast extract, because they have no information on which components of the raw materials are significant. In this study, we demonstrated that the DNN-MIL algorithm can be applied to estimate the cell growth and GFP yield of a recombinant E. coli strain and to predict the components that are most important for cell growth and GFP production.
Part of this estimation agreed with the results of the validation cultivations with supplemented components. In particular, Glu was estimated to be the most important variable in the DNN-MIL simulation, and Glu supplementation increased the GFP yield by 12.9% in the validation cultivation. These results imply that DNN-MIL analysis relating raw material composition to cell yield and heterologous protein production can provide promising information for the optimization of medium components and for quality control. However, the DNN model may lead to erroneous estimates because of bias in the learning dataset. Based on the sensitivity analysis, phosphate and glycerol were estimated to increase cell and GFP yields (Figure S5), but these components reduced the yields in the actual validation cultivation (Figure 5). Components that could not be detected by GC-MS were not considered in the present study, and these components may affect the behaviors estimated by DNN-MIL. This weakness of the current strategy could be addressed by enriching the datasets, both by increasing the number of raw materials and by using additional instrumental analyses.
To our knowledge, this is the first study to use a DNN-mediated approach for a regression model, although Date and Kikuchi have already demonstrated DNN-mediated metabolomics for a classification model.[19]
In conclusion, the relationship between the GC-MS profiles of yeast extracts and the cultivation yields of a heterologous protein was fitted best by the DNN algorithm. The MIL calculation based on a permutation algorithm identified the important variables that have the potential to enhance or reduce protein production and cell growth. The DNN-mediated omics-like analysis between media and cultivation can be applied to new strategies for optimizing medium compositions and for quality control of media components. In addition, DNN-mediated approaches may be applicable to metabolomics in general.