Gene design and bioinformatics prediction
It is well established that the amino acid sequence is the major determinant of
the soluble expression level, folding and function of proteins in
E. coli. When the tertiary structure of a protein is known, the
solubility of the expressed target protein may be enhanced by
rational site-directed mutagenesis. A more general approach to obtaining
more soluble protein is to generate gene libraries by
directed evolution, introducing mutations in a random or position-specific manner
(Cobb, Chao, & Zhao, 2013).
An artificial oil-body system was developed by presenting oleosin-GFP
fusion proteins (Meagher, 2011). Expressed proteins can be rescued from
aggregation using the E. coli ribosome display system by binding
them to the ribosomal protein L23 (Plückthun, 2012).
A further study concluded that sequence length has a
negative influence on protein solubility, which may be due to an
increased misfolding rate with increasing length; proteins with more
than 400 amino acid residues are harder to express. Increasing net
charge, either positive or negative, has a positive influence on protein
solubility. Typically, disordered regions of proteins form unstable
tertiary structures and dynamic conformations which easily aggregate
into inclusion bodies. The grand average of hydropathicity (GRAVY) of
proteins, an indicator for average hydrophobicity, is inversely
correlated with the soluble expression level of target proteins (A. K.
Roy, Acharjee, Upadhyay, & Ghosh, 2017). Additionally, arginine,
leucine, and cysteine content proved to be inversely related to
protein solubility. The negative effect of arginine may be
attributable to its rare codons, whereas that of cysteine is only
slight. In contrast, isoleucine and lysine
are beneficial for soluble expression, so the right substitution may
improve soluble expression levels of target recombinant proteins.
Asparagine, threonine and glutamine have no significant
effect on protein solubility and are suitable for substitution because
they are exposed to solvent. Arg to Lys and
Leu to Ile or Val substitutions are reasonable candidates for mutagenesis.
Removal of the sequence encoding the signal peptide, which is required for the export
of proteins from the site of synthesis to the target site, increases the
stability and expression of recombinant proteins (Chang et al., 2016).
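
The sequence-level indicators discussed above (length, net charge, GRAVY and the content of selected residues) can be computed directly from a protein sequence. The following is a minimal sketch, not a validated predictor: the function names, the 400-residue cutoff parameter and the crude net-charge estimate (Lys and Arg counted as +1, Asp and Glu as -1, histidine and termini ignored) are assumptions made for illustration. GRAVY is taken as the mean Kyte-Doolittle hydropathy value over all residues.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale, on which GRAVY is defined.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq, length_cutoff=400):
    """Return simple solubility-related features of a protein sequence.

    length_cutoff is an illustrative parameter reflecting the
    400-residue observation in the text, not a hard rule.
    """
    seq = seq.upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"empty sequence or unknown residues: {sorted(unknown)}")
    counts = Counter(seq)
    n = len(seq)
    return {
        "length": n,
        "longer_than_cutoff": n > length_cutoff,
        # GRAVY: mean hydropathy over all residues.
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
        # Crude net charge (assumption): K/R = +1, D/E = -1, His ignored.
        "net_charge": counts["K"] + counts["R"] - counts["D"] - counts["E"],
        # Fractions of the residues singled out in the text.
        "fractions": {aa: counts[aa] / n for aa in "RLCIK"},
    }


def conservative_substitutions(seq):
    """List (position, from, to) candidates for Arg->Lys and Leu->Ile."""
    swaps = {"R": "K", "L": "I"}
    return [(i + 1, aa, swaps[aa])
            for i, aa in enumerate(seq.upper()) if aa in swaps]
```

Calling `sequence_features` on a candidate sequence and comparing the values before and after a proposed substitution gives a quick first check; whether a listed position is actually solvent-exposed and outside a functional domain still has to be judged from structural information.
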
The secondary structure of a protein, including the number of turns,
disulfide bonds, α-helices and β-sheets, is an important determinant of
protein solubility. Sequences with a high content of Asp, Asn, Pro,
Gly, and Ser tend to form more turns, which is associated with
difficulties in folding and decreased folding rates. Moreover, a higher
number of disulfide bonds significantly decreases the correct folding
rate of proteins because of the reducing environment of the E. coli cytoplasm. It was also reported that proteins with a higher
proportion of β-sheets are more prone to aggregation than those with
α-helical structure (Gopal & Kumar, 2013).
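
Two of the structural indicators above can be approximated from sequence alone. The sketch below computes the fraction of turn-forming residues (Asp, Asn, Pro, Gly, Ser) and an upper bound on the number of disulfide bonds, given that each bond needs two cysteines. The function names are hypothetical, and the cysteine-based bound says nothing about which bonds actually form.

```python
# Residues the text associates with turn formation.
TURN_FORMING = frozenset("DNPGS")


def turn_forming_fraction(seq):
    """Fraction of residues that are Asp, Asn, Pro, Gly or Ser."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(aa in TURN_FORMING for aa in seq) / len(seq)


def max_disulfide_bonds(seq):
    """Upper bound on disulfide bonds: one bond per pair of cysteines."""
    return seq.upper().count("C") // 2
```
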
The average codon adaptation index (CAI) is used to assess how closely the
codon usage of a gene matches the codon usage bias of the host cell. To avoid the codon bias obstacles of the
heterologous host, the gene sequence should be optimized based on host
codon usage bias. To avoid the formation of secondary structure in
mRNA and to ensure efficient translation of a gene, site-directed mutagenesis can
be used to modify the gene without altering the amino acid sequence
(Correa & Oppezzo, 2015). The GC content of the codons has been shown
to be positively correlated with the concentration of mRNA and
transcription initiation efficiency, but has little effect on the
expression levels of the target protein (Ragionieri et al., 2015). It is
noteworthy that the coding sequence of a target protein should be
engineered without changing the functional domain of the protein.
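
As an illustration of the codon-level measures above, the following sketch computes the CAI as the geometric mean of relative adaptiveness values, where each codon's weight is its frequency in a reference gene set divided by the frequency of the most used synonymous codon. The reference set, the pseudocount of 0.5 for unobserved codons and all function names are assumptions for illustration; a real analysis would use highly expressed host genes as the reference. Stop codons and the single-codon amino acids Met and Trp are excluded, as is conventional.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons (RNA input accepted)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Weight of each codon relative to its most frequent synonym."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    synonyms = {}
    for codon, aa in CODON_TABLE.items():
        synonyms.setdefault(aa, []).append(codon)
    weights = {}
    for aa, group in synonyms.items():
        if aa == "*" or len(group) == 1:
            continue  # skip stop codons, Met and Trp
        # Unobserved codons get a pseudocount so log() stays defined.
        freqs = {c: counts[c] or pseudocount for c in group}
        top = max(freqs.values())
        for c in group:
            weights[c] = freqs[c] / top
    return weights


def cai(gene, weights):
    """Geometric mean of codon weights over the scored codons of a gene."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```

A gene built only from the most frequent codon of each amino acid scores a CAI of 1, and replacing a rare codon with a preferred synonym raises the CAI without changing the encoded protein.
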
Bioinformatics is widely used to select domains and regions
of a protein that are promising targets for the manipulation of solubility,
immunogenicity and other desirable characteristics (Hesaraki et al.,
2013; Khalili et al., 2018; Malaei et al., 2019; Malaei, Rasaee,
Paknejad, Latifi, & Rahbarizadeh, 2018). Bioinformatics prediction
tools can be effectively used to investigate and improve the solubility
of a protein through genetic engineering of its sequence prior to the
time-consuming and laborious experimental steps (Chang, Song, Tey, &
Ramanan, 2014; Hebditch, Carballo-Amador, Charonis, Curtis, &
Warwicker, 2017; Rawi et al., 2018). Previous studies established
statistical correlations between protein solubility and primary structure
characteristics or sequence-based features (variables), which include
the total number of residues (length), molecular weight (MW), counts of
buried amino acids, counts of hydrogen bonds, counts of nitrogen atoms,
secondary structures, isoelectric point (pI), hydrophobicity, each amino
acid (AA) content, net charge, negative charge, turn-forming residues
fraction, proline fraction and cysteine fraction (Bertone et al., 2001;
Habibi et al., 2014; Idicula‐Thomas & Balaji, 2005; Trainor, Broom, &
Meiering, 2017).
The majority of bioinformatics sequence-based prediction tools with a
machine learning backbone, including PROSO (Smialowski et al., 2007),
SOLpro (Magnan, Randall, & Baldi, 2009), PROSO II (Smialowski, Doose,
Torkler, Kaufmann, & Frishman, 2012), CCSOL (Agostini, Vendruscolo, &
Tartaglia, 2012), the scoring card method (SCM) (Huang et al., 2012) and RPSP
(Wilkinson & Harrison, 1991), use a support vector machine (SVM)-based
model (Suykens & Vandewalle, 1999), a multiple linear regression
model, the Wilkinson-Harrison prediction model, or a solubility
index-based model to distinguish between soluble and insoluble proteins.
Some of these tools, such as PROSO (trained on previously published
experimental data from the TargetDB
database), RPSP (Recombinant Protein Solubility Prediction) and SOLpro,
offer acceptable prediction performance with user-friendly interfaces
(Habibi et al., 2014; Magnan et al., 2009; A. Roy, Nair, Sen, Soni, &
Madhusudhan, 2017; Smialowski et al., 2012). Periscope (Periplasmic
Expression for Soluble Protein Expression), a computational approach
with a two-stage architecture, was used for quantitative prediction of
the expression of soluble heterologous proteins in the periplasm of
E. coli (Chang et al., 2016).
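
To make the Wilkinson-Harrison-type model mentioned above more concrete, the sketch below computes a canonical variable from two sequence features: the fraction of turn-forming residues (Asn, Gly, Pro, Ser) and the absolute average charge offset by 0.03. The coefficients (15.43 and -29.56) and the discriminant threshold (1.71) are the values commonly quoted for the revised form of the model; they are assumptions here and should be checked against the original publications before any real use. The function name and return format are hypothetical.

```python
def wilkinson_harrison(seq, lambda1=15.43, lambda2=-29.56, threshold=1.71):
    """Return (canonical_variable, predicted_soluble) for a protein sequence.

    Default coefficients are the commonly quoted values for the revised
    Wilkinson-Harrison model (assumption; verify before relying on them).
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    # Fraction of turn-forming residues used by this model: N, G, P, S.
    turn_fraction = sum(seq.count(aa) for aa in "NGPS") / n
    # Absolute average charge per residue, offset by 0.03.
    charge = seq.count("R") + seq.count("K") - seq.count("D") - seq.count("E")
    charge_term = abs(charge / n - 0.03)
    cv = lambda1 * turn_fraction + lambda2 * charge_term
    # A canonical variable above the threshold predicts insolubility.
    return cv, cv <= threshold
```

Under these assumptions, a highly charged sequence with few turn-forming residues falls below the threshold and is predicted soluble, which matches the qualitative trends described earlier in this section for net charge and turn-forming residues.
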