2020 AACR: Improved Tumor-Only Somatic Variant Calling Using a Gradient Boosted Machine Learning Algorithm

Accurate identification of somatic variants in a tumor sample is often accomplished by using a paired normal tissue sample from the same patient, which allows private germline variants to be separated from somatic variants. However, a paired normal sample is not always available, which makes accurate somatic variant calling more challenging. Composite proxy normals and other filtering approaches can be used in lieu of a paired normal sample, but the resulting somatic call set may suffer from incomplete germline filtering and reduced sensitivity compared to paired tumor-normal analysis. To address these limitations, we developed a novel machine-learning-based, tumor-only somatic small variant classifier, which leverages gradient boosted decision trees to significantly increase somatic variant specificity in tumor-only analysis without reducing overall sensitivity.

We produced a ground truth set of somatic SNVs and indels from 350 whole-exome-sequenced tumor-normal pairs using a validated cancer bioinformatics pipeline. We then generated a feature set from each tumor sample by aggregating attributes including allele frequency and read depth, tumor cellularity estimates, germline variant calls from HaplotypeCaller, tumor-only somatic variant calls from Mutect and Mutect2 using a proxy normal, copy-number alterations, annotations from databases such as gnomAD and COSMIC, and problematic-region annotations such as homopolymers. Somatic variant truth labels were assigned using filtered Mutect2 output from the tumor-normal analysis. The samples were randomly split into training and testing sets in a 90-10 ratio.
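
The labeling and splitting steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function names, the sample identifiers, the variant key of (chromosome, position, ref, alt), and the fixed random seed are all assumptions made for illustration. The sketch assumes the split is made per sample, so that all variants from one tumor land on the same side, and that a tumor-only candidate is labeled somatic when it appears in the filtered tumor-normal call set.

```python
import random


def split_samples(sample_ids, test_fraction=0.10, seed=0):
    """Randomly split sample IDs into (train, test) lists.

    Splitting by sample rather than by variant keeps every variant
    from one tumor on the same side of the split.
    """
    ids = sorted(sample_ids)  # fixed starting order, so the seed alone decides the split
    rng = random.Random(seed)  # hypothetical seed, chosen only for reproducibility
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def label_candidates(candidate_keys, truth_keys):
    """Return 1 (somatic) for candidates found in the truth set, else 0."""
    truth = set(truth_keys)
    return [1 if key in truth else 0 for key in candidate_keys]


# Hypothetical identifiers standing in for the 350 tumor samples.
samples = [f"tumor_{i:03d}" for i in range(350)]
train_ids, test_ids = split_samples(samples)

# Toy variant keys: (chromosome, position, ref, alt). Values are made up.
candidates = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "CA")]
truth = [("chr1", 100, "A", "T"), ("chr3", 300, "C", "CA")]
labels = label_candidates(candidates, truth)
```

With 350 samples and a 10% test fraction, this yields 315 training samples and 35 test samples. The resulting labels and per-variant features would then be passed to a gradient boosted decision tree learner of choice.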