Acknowledgment
Abstract
The protein structure prediction problem is considered one of the most difficult problems in bioinformatics and one with the most profound potential medical applications. Its difficulty is largely due to its complexity, hence the need for a simple model, such as the HP model, created by Ken Dill in 1985. Although its computational simplicity makes it useful, the HP model is in some ways an oversimplification, which creates the need for an extended model that is more biologically plausible while still retaining computational simplicity. This project investigated whether the Side Chain Model would fulfill this need. Initial results suggest that where the chain is composed of clumps of H-nodes or multiple H-nodes separated by groups of P-nodes, the Side Chain Model admits a better folding than the HP model. However, in cases where the chain includes a long string of consecutive H-nodes, the Side Chain Model may not necessarily be better than the HP model. Despite these cases, preliminary results suggest that the Side Chain Model is an improvement upon the HP model both energy-wise and from a biological point of view.
Introduction to proteins and background on the HP model
Proteins are large biological macromolecules that perform various metabolic functions in the human body. They are chains of amino acids, each of which consists of a carboxyl functional group, an amino functional group, an alpha carbon, a hydrogen atom, and a side chain that varies depending on the amino acid. There are twenty different amino acids, and a protein can consist of as many as 34,350 amino acids.
Each protein has a unique native structure that is matched to its function. Understanding how proteins fold can yield a better understanding of diseases like Huntington’s and Parkinson’s that result from protein misfolding, and can aid in drug inhibitor design. The protein structure prediction (PSP) problem is very difficult because of the number of possible foldings, given the massive length of these proteins and the twenty different monomers, which all interact differently. The PSP problem is so difficult that it is considered NP-hard, which means that no efficient algorithm is known for finding the optimal folding from scratch, even though the energy of any single candidate folding is easy to evaluate.
To deal with the difficulty of the protein folding problem, Ken Dill developed the HP model in 1985. He separated the 20 amino acids into 2 groups, hydrophobic and hydrophilic. Hydrophobic amino acids are represented with a red H-node and hydrophilic amino acids are represented with a blue P-node. The chain of hydrophobic and hydrophilic amino acids is placed on a two-dimensional square lattice and folded so as to maximize the number of contacts in the fold. A contact occurs when two H-nodes are one unit apart on the lattice but are not next to each other on the amino acid chain. The more contacts a certain folding of an HP chain has, the lower its energy level is. For example, if a given chain of H's and P's has five contacts when folded, it has an energy value of -5. A lower energy level is ideal because proteins tend to fold into a state of minimal energy. Due to these properties, amino acid chains in the HP model tend to fold in a pattern that creates a dense hydrophobic core. If a specific chain of H's and P's can fold in only one way that achieves its maximum number of contacts, it has a "unique optimal folding."
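The contact and energy rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in this project: the names hp_contacts and hp_energy, the string encoding of the chain ("H" and "P" characters), and the representation of a folding as a list of integer lattice coordinates are all assumptions made for the example.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def hp_contacts(sequence: str, coords: List[Coord]) -> int:
    """Count H-H contacts of a folded chain on the 2D square lattice.

    sequence: hypothetical encoding of the chain, e.g. "HPPH".
    coords:   coords[i] is the lattice site of residue i (assumed layout).
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("a lattice site is occupied twice")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be one unit apart")

    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Check only the right and upper neighbors so each pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            # A contact needs two H-nodes that are not chain neighbors.
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


def hp_energy(sequence: str, coords: List[Coord]) -> int:
    """Energy in the HP model: minus the number of contacts."""
    return -hp_contacts(sequence, coords)


# A four-residue chain folded into a unit square: the two end H-nodes touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square))  # -1
```

Left unfolded in a straight line, the same chain has no contacts and therefore an energy of 0, which is why the square fold is preferred.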
Because the HP model is an extreme simplification of a protein's folding pattern, it has several shortcomings in how accurately it reflects real proteins. First, the HP model only takes into account HH interactions, when realistically there is more than one type of interaction to be accounted for in an amino acid chain. Second, the HP model does not accurately represent the backbone of the protein. Third, the HP model heavily restricts the movement of H and P nodes because they are all directly connected.
Side Chain Model
To address the shortcomings of the HP model, we developed the Side Chain Model. The Side Chain Model is an extension of the HP model that separates H and P nodes from their respective backbones, thus resulting in three monomers: H-nodes, P-nodes, and neutral backbone nodes. This separation increases the flexibility of the H-nodes and thus the possibility for more contacts. The rules of an allowed folding in the HP model and the definition of a contact still hold for the Side Chain Model. To fold a chain in the Side Chain Model, the backbone is first folded, and then the H-nodes are placed to maximize the contacts while the P-nodes are placed wherever they fit, typically on the outside so that a central dense hydrophobic core can be formed. Each backbone node must have a P or H node extending perpendicularly off of it. The Side Chain Model allows for intermediary flips of H and P nodes through 3-space, so that H-nodes can be moved to the center, allowing for the creation of a dense hydrophobic core.
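As a rough illustration of these rules, the sketch below counts contacts for a Side Chain Model folding. It is not the project's implementation, and it rests on several assumptions: the function name side_chain_contacts and the data layout (one list of backbone sites and one list of side-node sites) are hypothetical; "extending perpendicularly" is simplified to "occupying a lattice site one unit from its backbone node"; and two H-nodes are treated as "next to each other on the chain" when they hang off consecutive backbone nodes.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def _adjacent(a: Coord, b: Coord) -> bool:
    """True when two lattice sites are exactly one unit apart."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


def side_chain_contacts(
    sequence: str, backbone: List[Coord], side: List[Coord]
) -> int:
    """Count H-H contacts between side nodes (assumed Side Chain Model rules).

    sequence[i] is "H" or "P" for the side node of residue i,
    backbone[i] is the site of its neutral backbone node,
    side[i] is the site of the H- or P-node attached to backbone[i].
    """
    n = len(sequence)
    if not (len(backbone) == len(side) == n):
        raise ValueError("sequence, backbone and side must have equal length")
    if len(set(backbone) | set(side)) != 2 * n:
        raise ValueError("a lattice site is occupied twice")
    for a, b in zip(backbone, backbone[1:]):
        if not _adjacent(a, b):
            raise ValueError("consecutive backbone nodes must be one unit apart")
    for b, s in zip(backbone, side):
        if not _adjacent(b, s):
            raise ValueError("each side node must sit next to its backbone node")

    # Only H side nodes can form contacts; backbone nodes are neutral.
    h_index_at = {site: i for i, site in enumerate(side) if sequence[i] == "H"}
    contacts = 0
    for (x, y), i in h_index_at.items():
        # Check only the right and upper neighbors so each pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = h_index_at.get(neighbor)
            # Assumption: side nodes of consecutive residues do not count.
            if j is not None and abs(i - j) > 1:
                contacts += 1
    return contacts


# Backbone folded into a unit square, with the two end H-nodes placed side by side.
backbone = [(0, 0), (1, 0), (1, 1), (0, 1)]
side = [(-1, 0), (1, -1), (1, 2), (-1, 1)]
print(side_chain_contacts("HPPH", backbone, side))  # 1
```

Keeping the backbone and the side nodes in separate lists mirrors the two-step folding procedure: the backbone can be fixed first, and different placements of the side nodes can then be scored without re-validating the backbone path.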
Comparison/Analysis
Resultant HP and Side Chain Model foldings were ranked primarily based on their number of contacts. If two unique foldings of the same chain had the same number of contacts, centricity was used to determine the optimal fold. With the aforementioned metrics, the Side Chain Model outperformed the HP model in the majority of cases. When the Side Chain Model did not outperform the HP model, it matched the HP model in the number of contacts. For chains with large consecutive stretches of H-nodes, optimal conformations in the Side Chain Model typically had more contacts, but had poor centricity values.
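The two-level ranking described above (contacts first, centricity as the tie-breaker) can be expressed as a single sort key. The sketch below is hypothetical: the function name rank_foldings is invented for illustration, and both metrics are passed in as caller-supplied functions because the exact centricity formula is not given here. Whether a lower or a higher centricity value is better is also an assumption, exposed as a flag.

```python
from typing import Callable, List, Sequence, TypeVar

F = TypeVar("F")  # any representation of a folding


def rank_foldings(
    foldings: Sequence[F],
    contacts: Callable[[F], int],
    centricity: Callable[[F], float],
    lower_centricity_is_better: bool = True,  # assumption, not from the text
) -> List[F]:
    """Return foldings ordered best first.

    Primary criterion: more contacts is better.
    Tie-breaker: centricity, in the direction chosen by the flag.
    """
    sign = 1.0 if lower_centricity_is_better else -1.0
    return sorted(
        foldings,
        key=lambda f: (-contacts(f), sign * centricity(f)),
    )
```

Because the key is a tuple, centricity is consulted only when two foldings have the same number of contacts, which matches the ranking rule stated above.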
Because the Side Chain Model considers H-nodes, P-nodes, and backbone nodes, a new description of uniqueness is needed: which monomers are trivial, and which are significant? The backbone nodes and the H-nodes forming contacts were deemed significant, as per the HP model. However, the P-nodes and peripheral H-nodes are not covered in the HP model, thus requiring a new designation under the Side Chain Model. Peripheral H-nodes were deemed significant because they contribute to the dense hydrophobic core. P-nodes were deemed trivial because they are neutral, inactive players in protein folding.