Gene design and bioinformatics prediction
It is well established that the amino acid sequence is the major determinant of the soluble expression level, folding and function of proteins in E. coli. When the tertiary structure of a protein is known, the solubility of the expressed target protein may be enhanced by rational site-directed mutagenesis. A more general approach to obtaining more soluble protein is directed evolution, in which gene libraries are generated by random or position-specific mutation (Cobb, Chao, & Zhao, 2013).
An artificial oil-body system has been developed that presents oleosin-GFP fusion proteins (Meagher, 2011). In the E. coli ribosome display system, expressed proteins are rescued from aggregation by binding them to the ribosomal protein L23 (Plückthun, 2012).
A further study concluded that amino acid length has a negative influence on protein solubility, which may be due to an increased misfolding rate with increasing length. Proteins with more than 400 amino acid residues are harder to express. Increasing net charge, either positive or negative, has a positive influence on protein solubility. Typically, disordered regions of proteins form unstable tertiary structures and dynamic conformations, which easily aggregate into inclusion bodies. The grand average of hydropathicity (GRAVY) of a protein, an indicator of average hydrophobicity, is inversely correlated with the soluble expression level of the target protein (A. K. Roy, Acharjee, Upadhyay, & Ghosh, 2017). Additionally, arginine, leucine and cysteine content proved to be inversely related to protein solubility. The negative effect of arginine may be attributable to its rare codons, whereas cysteine content has only a slightly negative effect. In contrast, isoleucine and lysine are beneficial for soluble expression, so the right substitution may improve the soluble expression level of a target recombinant protein. Asparagine, threonine and glutamine have no significant effect on protein solubility and, because they are exposed to solvent, are suitable positions for substitution. Arg to Lys and Leu to Ile or Val substitutions are therefore reasonable suggestions for mutagenesis. The removal of a signal peptide coding sequence, which is required for the export of proteins from the site of synthesis to the target site, increases the stability and expression of recombinant proteins (Chang et al., 2016).
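The sequence-level indicators discussed above (length, net charge, GRAVY and the content of individual residues) can be computed directly from a primary sequence. The following is a minimal sketch, not a validated predictor. It assumes the Kyte-Doolittle hydropathy scale for GRAVY and a crude net charge that counts only Lys and Arg as positive and Asp and Glu as negative, ignoring His, the termini and pH. The function names and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values, the scale conventionally used for GRAVY.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Substitutions suggested in the text: Arg -> Lys, Leu -> Ile or Val.
SUBSTITUTIONS = {"R": ("K",), "L": ("I", "V")}


def sequence_features(seq: str) -> dict:
    """Return simple solubility-related descriptors of a protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    n = len(seq)
    return {
        "length": n,
        # GRAVY: mean hydropathy over all residues.
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
        # Assumption: crude integer charge, (K + R) - (D + E).
        "net_charge": seq.count("K") + seq.count("R")
                      - seq.count("D") - seq.count("E"),
        # Fractions of the residues singled out in the text.
        "fraction": {aa: seq.count(aa) / n for aa in "RLCIK"},
    }


def candidate_substitutions(seq: str) -> list:
    """List 1-based positions of Arg and Leu with the suggested replacements."""
    return [
        (i + 1, aa, SUBSTITUTIONS[aa])
        for i, aa in enumerate(seq.upper())
        if aa in SUBSTITUTIONS
    ]


if __name__ == "__main__":
    example = "MKRLLDEIC"  # hypothetical sequence, for illustration only
    print(sequence_features(example))
    print(candidate_substitutions(example))
```

Whether a listed position is actually suitable for substitution still depends on solvent exposure and on the functional domains of the protein, which this sketch does not model.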
The secondary structure of a protein, including the number of turns, disulfide bonds, α-helices and β-sheets, is an important determinant of protein solubility. Sequences with a high content of Asp, Asn, Pro, Gly and Ser tend to form more turns, which is associated with difficulties in folding and decreased folding rates. Moreover, a larger number of disulfide bonds significantly decreases the rate of correct folding, because the reducing environment of the E. coli cytoplasm disfavors disulfide bond formation. It was also reported that proteins with a higher proportion of β-sheets are more prone to aggregation than those with α-helical structure (Gopal & Kumar, 2013).
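Two of the quantities mentioned above can be estimated from the sequence alone: the fraction of turn-forming residues (Asp, Asn, Pro, Gly, Ser) and an upper bound on the number of disulfide bonds given the cysteine count. The sketch below is illustrative only. The function name is hypothetical, and the disulfide figure is merely the maximum possible number of pairs, not a prediction of which bonds form.

```python
# Residues named in the text as favoring turn formation.
TURN_FORMING = frozenset("DNPGS")


def turn_and_cysteine_summary(seq: str) -> dict:
    """Summarize turn-forming residue fraction and cysteine content."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    cysteines = seq.count("C")
    return {
        "turn_fraction": sum(aa in TURN_FORMING for aa in seq) / len(seq),
        "cysteine_count": cysteines,
        # Upper bound only: each disulfide bond consumes two cysteines.
        "max_disulfides": cysteines // 2,
    }


if __name__ == "__main__":
    # Hypothetical sequence, for illustration only.
    print(turn_and_cysteine_summary("MDNPGSCACKC"))
```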
The average codon adaptation index (CAI) is used to assess how well the codon usage of a gene matches the codon usage bias of the host cell. To overcome the codon bias of a heterologous host, the gene sequence should be optimized according to the codon usage of that host. To avoid the formation of secondary structure in the mRNA and to ensure efficient translation of the gene, site-directed mutagenesis can be used to alter the nucleotide sequence without altering the amino acid sequence (Correa & Oppezzo, 2015). The GC content of the codons has been shown to be positively correlated with mRNA concentration and transcription initiation efficiency, but to have little effect on the expression level of the target protein (Ragionieri et al., 2015). It is noteworthy that the coding sequence of a target protein should be engineered without changing the functional domain of the protein.
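As a rough illustration of the CAI, the index can be computed as the geometric mean of the relative adaptiveness of each codon in a coding sequence, where the relative adaptiveness of a codon is its usage divided by the usage of the most frequent synonymous codon. The sketch below makes several assumptions: the usage table is a made-up toy covering only Lys and Arg (these are not real E. coli frequencies), codons absent from the table are skipped, and all function names are hypothetical. A GC-content helper is included because the text relates GC content to mRNA level.

```python
import math

# Hypothetical codon usage counts per amino acid. NOT real E. coli values.
TOY_USAGE = {
    "K": {"AAA": 75, "AAG": 25},
    "R": {"CGT": 40, "CGC": 40, "CGA": 6, "CGG": 8, "AGA": 4, "AGG": 2},
}


def relative_adaptiveness(usage: dict) -> dict:
    """Map each codon to usage / usage of the best synonymous codon."""
    weights = {}
    for codons in usage.values():
        top = max(codons.values())
        for codon, count in codons.items():
            weights[codon] = count / top
    return weights


def cai(cds: str, weights: dict) -> float:
    """Geometric mean of codon weights; codons not in `weights` are skipped.

    Weights must be positive, since a zero weight has no logarithm.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    w = relative_adaptiveness(TOY_USAGE)
    original = "AAGAGGAAG"   # Lys-Arg-Lys written with less-used toy codons
    recoded = "AAACGTAAA"    # same amino acids, preferred toy codons
    print(cai(original, w), cai(recoded, w))
    print(gc_content(original), gc_content(recoded))
```

The recoded sequence encodes the same amino acids with synonymous codons, which is the sense in which a gene can be optimized without altering the protein. A real analysis would use a complete usage table for the chosen host.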
Bioinformatics is widely used to select the domains and regions of a protein that are the most promising candidates for the manipulation of solubility, immunogenicity and other desirable characteristics (Hesaraki et al., 2013; Khalili et al., 2018; Malaei et al., 2019; Malaei, Rasaee, Paknejad, Latifi, & Rahbarizadeh, 2018). Bioinformatics prediction tools can be used effectively to investigate and improve the solubility of a protein through genetic engineering of its sequence before the time-consuming and laborious experimental steps (Chang, Song, Tey, & Ramanan, 2014; Hebditch, Carballo-Amador, Charonis, Curtis, & Warwicker, 2017; Rawi et al., 2018). Previous studies developed statistical correlations between protein solubility and primary structure characteristics or sequence-based features (variables), which include the total number of residues (length), molecular weight (MW), counts of buried amino acids, counts of hydrogen bonds, counts of nitrogen atoms, secondary structures, isoelectric point (pI), hydrophobicity, the content of each amino acid (AA), net charge, negative charge, turn-forming residue fraction, proline fraction and cysteine fraction (Bertone et al., 2001; Habibi et al., 2014; Idicula‐Thomas & Balaji, 2005; Trainor, Broom, & Meiering, 2017).
The majority of sequence-based bioinformatics prediction tools have a machine learning backbone. These tools, including PROSO (Smialowski et al., 2007), SOLpro (Magnan, Randall, & Baldi, 2009), PROSO II (Smialowski, Doose, Torkler, Kaufmann, & Frishman, 2012), CCSOL (Agostini, Vendruscolo, & Tartaglia, 2012), the scoring card method (SCM) (Huang et al., 2012) and RPSP (Wilkinson & Harrison, 1991), use a support vector machine (SVM)-based model (Suykens & Vandewalle, 1999), a multiple linear regression model, the Wilkinson-Harrison prediction model or a solubility index-based model to distinguish between soluble and insoluble proteins. Some of these tools, such as PROSO (whose training data set was drawn from previously published experimental information in the TargetDB database), RPSP (Recombinant Protein Solubility Prediction) and SOLpro, offer acceptable prediction performance with a user-friendly interface (Habibi et al., 2014; Magnan et al., 2009; A. Roy, Nair, Sen, Soni, & Madhusudhan, 2017; Smialowski et al., 2012). Periscope (Periplasmic Expression for Soluble Protein Expression), a computational approach with a two-stage architecture, was used for quantitative prediction of the soluble expression of heterologous proteins in the periplasm of E. coli (Chang et al., 2016).
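Of the models named above, the Wilkinson-Harrison discriminant is simple enough to sketch in a few lines. The version below follows the two-parameter revised form as it is commonly quoted: a canonical variable built from the turn-forming residue fraction (Asn, Gly, Pro, Ser) and the average charge, compared against a threshold. The coefficients 15.43 and -29.56, the 0.03 offset and the threshold 1.71 are reproduced from memory of that revised model and should be treated as assumptions to be checked against the original publications. They are exposed as parameters for that reason, and the function name is hypothetical.

```python
def wilkinson_harrison(
    seq: str,
    lambda1: float = 15.43,      # assumed coefficient, verify before use
    lambda2: float = -29.56,     # assumed coefficient, verify before use
    cv_threshold: float = 1.71,  # assumed discriminant CV', verify before use
) -> dict:
    """Two-parameter solubility discriminant (sketch of the revised model).

    CV = lambda1 * (N + G + P + S) / n
         + lambda2 * |((R + K) - (D + E)) / n - 0.03|
    A positive margin (CV - CV') is read as "predicted insoluble".
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    turn_fraction = sum(seq.count(aa) for aa in "NGPS") / n
    charge_average = (
        seq.count("R") + seq.count("K") - seq.count("D") - seq.count("E")
    ) / n
    cv = lambda1 * turn_fraction + lambda2 * abs(charge_average - 0.03)
    margin = cv - cv_threshold
    return {
        "cv": cv,
        "margin": margin,
        "predicted_soluble": margin <= 0,
    }


if __name__ == "__main__":
    # Hypothetical sequence, for illustration only.
    print(wilkinson_harrison("MKKDEELLAAGNPS"))
```

Note that this discriminant uses a slightly different turn-forming set than the Asp, Asn, Pro, Gly and Ser list given earlier. The SVM-based tools cited above rely on trained models and much larger feature sets, so they cannot be reproduced this compactly.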