HLA binding is currently the most well-established criterion for ranking neoantigen candidates. Recent advances in mass spectrometry have produced larger training datasets of peptide binders and non-binders for individual HLA alleles. These data also capture two additional components of antigen processing, cleavage and transport, which are critical for assessing presentation.

We leveraged this advancement by developing the Systematic HLA Epitope Ranking Pan Algorithm (SHERPA), our pan-predictive machine learning model for predicting MHC class I presentation.

SHERPA relies on a proprietary, high-quality, unambiguous training dataset generated by performing immunopeptidomics on a robust set of MHC class I alleles using monoallelic cell lines (Figure 5).

Figure 5: Overview of SHERPA machine learning algorithm

Multiple modeling strategies were combined to accurately predict neoantigens for all known alleles. The SHERPA-Binding algorithm uses both the peptide and binding pocket information to predict a binding rank. The SHERPA-Presentation algorithm incorporates additional features such as antigen processing machinery and gene expression information to predict a more comprehensive presentation rank (Figure 6).

Figure 6: SHERPA Models

The performance of SHERPA was evaluated on 10% of the monoallelic immunopeptidomics data (held out from training), mixed with synthetic negative examples at a 1:999 ratio (a commonly assumed prevalence). SHERPA models achieve higher precision at all recall values than NetMHCpan-4.0, the state-of-the-art publicly available tool (Figure 7A), and significantly higher positive predictive values among the top 0.1% of peptides in the test data (Figure 7B).

Figure 7A and B: SHERPA Enables Superior Neoantigen Presentation Prediction
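
The evaluation metric described above, positive predictive value (PPV) among the top 0.1% of peptides at a 1:999 positive-to-negative ratio, can be illustrated with a minimal Python sketch. This is not the SHERPA evaluation code: the function names `ppv_at_top_fraction` and `build_toy_test_set` are hypothetical, and the scores are drawn from toy Gaussian distributions purely to make the example runnable. Note that at a 1:999 ratio the top 0.1% of the test set contains exactly as many peptides as there are true positives, so a perfect ranker would reach a PPV of 1.0 and a random ranker roughly 0.001.

```python
import random


def ppv_at_top_fraction(scores, labels, fraction=0.001):
    """Fraction of true positives among the top-scoring `fraction` of peptides.

    Assumes a higher score means the peptide is more likely to be presented.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("empty test set")
    # Number of peptides in the top slice (at least one).
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n_top]) / n_top


def build_toy_test_set(n_positives, negatives_per_positive=999, seed=0):
    """Toy stand-in for held-out positives mixed with synthetic negatives.

    Scores are simulated (assumption): positives ~ N(3, 1), negatives ~ N(0, 1).
    """
    rng = random.Random(seed)
    labels = [1] * n_positives + [0] * (n_positives * negatives_per_positive)
    scores = [rng.gauss(3.0, 1.0) if y == 1 else rng.gauss(0.0, 1.0) for y in labels]
    return scores, labels


if __name__ == "__main__":
    scores, labels = build_toy_test_set(n_positives=50)
    print(f"peptides: {len(labels)}, positives: {sum(labels)}")
    print(f"PPV in top 0.1%: {ppv_at_top_fraction(scores, labels):.3f}")
```

Comparing two predictors then amounts to calling `ppv_at_top_fraction` on each model's scores over the same labeled test set.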