Predicting Peptide-MHC Binding Affinities With Imputed Training Data


Predicting the binding affinity between MHC proteins and their peptide ligands is a key problem in computational immunology. State-of-the-art performance is currently achieved by the allele-specific predictor NetMHC and the pan-allele predictor NetMHCpan, both of which are ensembles of shallow neural networks. We explore an intermediate between allele-specific and pan-allele prediction: training allele-specific predictors with synthetic samples generated by imputation of the peptide-MHC affinity matrix. We find that the imputation strategy is useful for alleles with very little training data. We have implemented our predictor as an open-source software package called MHCflurry and show that MHCflurry achieves performance competitive with NetMHC and NetMHCpan.


In most vertebrates, cytotoxic T-cells enforce multicellular order by killing infected or cancerous cells. Each organism possesses a polyclonal army of T-cells which collectively are able to distinguish unhealthy cells from healthy ones. This amazing feat is achieved through the winnowing and expansion of clonal T-cell populations possessing highly specific T-cell receptors (TCRs) (Blackman 1990). Each distinct TCR recognizes a small number of similar peptides bound to an MHC molecule on the surface of a cell (Huseby 2005). Though there are many steps in “antigen processing” (Cresswell 2005), it has become apparent that MHC binding is the most restrictive step. Peptide-MHC affinity prediction is the well-studied problem of predicting the binding strength of a given peptide and MHC pair (Lundegaard 2007). Early approaches focused on “sequence motifs” (Sette 1989), followed by regularized linear models, linear models with interaction terms such as SMM with pairwise features (Peters 2003), and more recently the NetMHC family of predictors, a collection of related models based on ensembles of neural networks. Two of these predictors, NetMHC (Lundegaard 2008a) and NetMHCpan (Nielsen 2007), have emerged as the methods of choice across multiple fields of study within immunology, including virology (Lund 2011), tumor immunology (Gubin 2015), and autoimmunity (Abreu 2012).

NetMHC is an allele-specific method which trains a separate predictor for each allele’s binding dataset, whereas NetMHCpan is a pan-allele method whose inputs are vector encodings of both a peptide and a subsequence of a particular MHC molecule. The conventional wisdom is that NetMHC performs better on alleles with many assayed ligands whereas NetMHCpan is superior for less well-characterized alleles (Gfeller 2016).

In this paper we explore the space between allele-specific and pan-allele prediction by imputing peptide-MHC affinities for which we have no measurements and using these imputed values to pre-train allele-specific binding predictors.

Data and evaluation metrics

We used two datasets from a recent paper studying the relationship between training data and pMHC predictor accuracy (Kim 2014). The training dataset (BD2009) contained entries from IEDB (Salimi 2012) up to 2009, and the test dataset (BLIND) contained IEDB entries from between 2010 and 2013 which did not overlap with BD2009 (Table \ref{tab:datasets}).

Train (BD2009) and test (BLIND) dataset sizes.

Dataset   Alleles   IC50 Measurements   Expanded 9mers
BD2009    106       137,654             470,170
BLIND     53        27,680              83,752


Throughout this paper we will evaluate a pMHC binding predictor using three different metrics:

  • AUC: Area under the ROC curve. Estimates the probability that a “strong binder” peptide (affinity \(\leq 500\)nM) will be given a stronger predicted affinity than one whose ground truth affinity is \(>500\)nM.

  • F\(_1\) score: Harmonic mean of precision and recall (sensitivity) for predicting “strong binders” with affinities \(\leq 500\)nM.

  • Kendall’s \(\tau\): Rank correlation across the full spectrum of binding affinities.
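The three metrics above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the evaluation code used for the paper. The function names are hypothetical. Applying the 500 nM cutoff to the predicted values when computing F\(_1\), and using the tie-corrected tau-b variant of Kendall’s \(\tau\), are assumptions not stated in the text. Lower IC50 means stronger binding, so a binder is ranked correctly when its predicted IC50 is lower than that of a non-binder.

```python
from itertools import combinations
from math import sqrt

BINDER_THRESHOLD_NM = 500.0  # "strong binder" cutoff stated in the text


def auc(true_ic50, pred_ic50, threshold=BINDER_THRESHOLD_NM):
    """Fraction of (binder, non-binder) pairs in which the true binder gets
    the lower (stronger) predicted IC50; ties count as half a win."""
    pos = [p for t, p in zip(true_ic50, pred_ic50) if t <= threshold]
    neg = [p for t, p in zip(true_ic50, pred_ic50) if t > threshold]
    if not pos or not neg:
        raise ValueError("AUC needs at least one binder and one non-binder")
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(true_ic50, pred_ic50, threshold=BINDER_THRESHOLD_NM):
    """Harmonic mean of precision and recall for the binder class.
    Assumption: predictions are thresholded at the same cutoff."""
    tp = fp = fn = 0
    for t, p in zip(true_ic50, pred_ic50):
        actual, predicted = t <= threshold, p <= threshold
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def kendall_tau(x, y):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant = discordant = ties_x = ties_y = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((total - ties_x) * (total - ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when one input is constant")
    return (concordant - discordant) / denom


if __name__ == "__main__":
    # Made-up IC50 values (nM), for illustration only.
    true_ic50 = [20, 150, 400, 900, 5000, 30000]
    pred_ic50 = [35, 600, 120, 450, 8000, 20000]
    print(round(auc(true_ic50, pred_ic50), 3))          # 0.889
    print(round(f1_score(true_ic50, pred_ic50), 3))     # 0.667
    print(round(kendall_tau(true_ic50, pred_ic50), 3))  # 0.733
```

Because Kendall’s \(\tau\) depends only on ranks, it gives the same value whether affinities are supplied as raw IC50 values or after a monotonic transform such as a logarithm.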

Comparison of imputation algorithms as predictors

A dataset of peptide-MHC affinities for