Evaluating the Predictability of Cancer Types from 536 Somatic Mutations: A New Dataset

Taher Dehkharghanian, Shahryar Rahnamayan, Hamid R. Tizhoosh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we introduce a new dataset for cancer research containing the somatic mutation states of 536 genes from the Cancer Gene Census (CGC). We created the dataset from somatic mutation information in the Cancer Genome Atlas (TCGA) projects. As a preliminary investigation, we employed machine learning techniques, including k-Nearest Neighbors, Decision Tree, Random Forest, and Artificial Neural Networks (ANNs), to evaluate the potential of these somatic mutations for classifying cancer types. We compared the models on accuracy, precision, recall, and F1-score. ANNs outperformed the other models, with an F1-score of 0.36, an overall classification accuracy of 40%, and precision ranging from 12% to 92% across cancer types. The 40% accuracy is significantly higher than random guessing, which would yield roughly 3% overall accuracy across the 33 cancer types. Although the model has relatively low overall accuracy, its average classification specificity is 98%. The ANN achieved high precision (> 0.7) for 5 of the 33 cancer types. The dataset can be used for research on TCGA data, such as survival analysis, histopathology image analysis, and content-based image retrieval. It is available online for download: https://kimialab.uwaterloo.ca/kimia/.
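
The relationship between the reported figures (about 3% chance accuracy, 40% accuracy, 98% specificity) can be illustrated with a small one-vs-rest calculation. The following is a minimal sketch and not the authors' evaluation code: it assumes a perfectly balanced 33-class problem in which every misclassification is spread evenly over the other 32 classes, and the counts (160 samples per class, 64 correct, 3 per wrong class) as well as the helper names are hypothetical.

```python
def uniform_error_confusion(n_classes, correct, wrong_each):
    """Build a synthetic confusion matrix (rows = true class, columns = predicted).

    Hypothetical helper: every class has `correct` hits on the diagonal and
    `wrong_each` samples misassigned to each of the other classes.
    """
    return [
        [correct if i == j else wrong_each for j in range(n_classes)]
        for i in range(n_classes)
    ]


def accuracy(confusion):
    """Overall accuracy: diagonal sum divided by the total sample count."""
    total = sum(sum(row) for row in confusion)
    hits = sum(confusion[k][k] for k in range(len(confusion)))
    return hits / total


def per_class_metrics(confusion):
    """One-vs-rest precision, recall, specificity, and F1 for each class."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                          # true k, predicted other
        fp = sum(confusion[i][k] for i in range(n)) - tp     # other, predicted k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return results


N_CLASSES = 33                        # number of cancer types in the abstract
chance_accuracy = 1 / N_CLASSES       # uniform random guessing baseline

# Assumed toy counts: 160 samples per class, 64 correct, 96 errors (3 per wrong class).
confusion = uniform_error_confusion(N_CLASSES, correct=64, wrong_each=3)
metrics = per_class_metrics(confusion)

print(round(chance_accuracy, 2))              # 0.03
print(round(accuracy(confusion), 2))          # 0.4
print(round(metrics[0]["specificity"], 3))    # 0.981
```

Under these assumptions, specificity stays near 98% even though accuracy is only 40%, because each class's true negatives (all samples from the other 32 classes that were not assigned to it) far outnumber its false positives. Real TCGA data is class-imbalanced, so per-class values would vary.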

Original language: English (US)
Title of host publication: 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society
Subtitle of host publication: Enabling Innovative Technologies for Global Healthcare, EMBC 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5308-5311
Number of pages: 4
ISBN (Electronic): 9781728119908
State: Published - Jul 2020
Event: 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society, EMBC 2020 - Montreal, Canada
Duration: Jul 20 2020 → Jul 24 2020

Publication series

Name: Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS
Volume: 2020-July
ISSN (Print): 1557-170X

Conference

Conference: 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society, EMBC 2020
Country/Territory: Canada
City: Montreal
Period: 7/20/20 → 7/24/20

ASJC Scopus subject areas

  • Signal Processing
  • Biomedical Engineering
  • Computer Vision and Pattern Recognition
  • Health Informatics
