Machine Learning Models for Accurate Prioritization of Variants of Uncertain Significance
  • Daniel Mahecha,
  • Haydemar Nuñez,
  • Maria Lattig,
  • Jorge Duitama
Daniel Mahecha
SIGEN, Alianza Universidad de los Andes - Fundación Santa Fe de Bogota
Haydemar Nuñez
Universidad de los Andes
Maria Lattig
SIGEN, Alianza Universidad de los Andes - Fundación Santa Fe de Bogota
Jorge Duitama
Universidad de los Andes

Abstract

The growing use of next-generation sequencing technologies in genetic diagnosis has produced an exponential increase in the number of Variants of Uncertain Significance (VUS). In this manuscript, we compare three machine learning methods for classifying VUS as Pathogenic or Non-pathogenic: a Random Forest (RF), a Support Vector Machine (SVM), and a Multilayer Perceptron (MLP). To train the models, we extracted 82,463 high-quality variants from ClinVar, using 9 conservation scores, a loss-of-function tool, and allele frequencies as features. For the RF and SVM models, hyperparameters were tuned using a grid search with cross-validation. The three models were tested on a set of 5,537 variants that had been classified as VUS at some point during the previous three years but had been reclassified in August 2020. All three models achieved higher accuracy on this set than the benchmarked tools. The RF-based model performed best across different variant types and was used to create VusPrize, an open-source software tool for prioritizing variants of uncertain significance. We believe that our model can improve the process of genetic diagnosis in research and clinical settings.
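
The tuning and prioritization workflow described in the abstract could look roughly like the following minimal sketch. It is not the VusPrize implementation. It assumes scikit-learn, which the abstract does not name. It uses synthetic data in place of the ClinVar variants. The feature count of 11, the parameter grid, and the 5-fold split are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the annotated variants (assumption: the real
# features would be conservation scores, a loss-of-function annotation,
# and allele frequencies; the count of 11 columns is hypothetical).
N_FEATURES = 11
X, y = make_classification(
    n_samples=400,
    n_features=N_FEATURES,
    n_informative=6,
    n_redundant=2,
    random_state=0,
)

# Hold out a test set, standing in for the reclassified VUS described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical hyperparameter grid; the values tuned by the authors are
# not given in the abstract.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

# Grid search with 5-fold cross-validation on the training set only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out variants.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

# Prioritize variants: rank by predicted probability of the positive
# (here, "Pathogenic") class, highest first.
pathogenic_prob = best_model.predict_proba(X_test)[:, 1]
ranking = np.argsort(-pathogenic_prob)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", round(test_accuracy, 3))
```

Ranking by predicted probability rather than by hard class label is what turns a classifier into a prioritization tool: the variants at the top of `ranking` are the ones the model considers most likely to be pathogenic and would be reviewed first.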

Peer review status: POSTED

24 Nov 2020: Submitted to Human Mutation
25 Nov 2020: Assigned to Editor
25 Nov 2020: Submission Checks Completed