TY - GEN
T1 - Evaluating the Predictability of Cancer Types from 536 Somatic Mutations
T2 - 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society, EMBC 2020
AU - Dehkharghanian, Taher
AU - Rahnamayan, Shahryar
AU - Tizhoosh, Hamid R.
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
N2 - In this paper, we introduce a new dataset for cancer research containing somatic mutation states of 536 genes of the Cancer Gene Census (CGC). We used somatic mutation information from the Cancer Genome Atlas (TCGA) projects to create this dataset. As preliminary investigations, we employed machine learning techniques, including k-Nearest Neighbors, Decision Tree, Random Forest, and Artificial Neural Networks (ANNs) to evaluate the potential of these somatic mutations for classification of cancer types. We compared our models on accuracy, precision, recall, and F1-score. We observed that ANNs outperformed the other models with F1-score of 0.36 and overall classification accuracy of 40%, and precision ranging from 12% to 92% for different cancer types. The 40% accuracy is significantly higher than random guessing which would have resulted in 3% overall classification accuracy. Although the model has relatively low overall accuracy, it has an average classification specificity of 98%. The ANN achieved high precision scores (> 0.7) for 5 of the 33 cancer types. The introduced dataset can be used for research on TCGA data, such as survival analysis, histopathology image analysis and content-based image retrieval. The dataset is available online for download: https://kimialab.uwaterloo.ca/kimia/.
AB - In this paper, we introduce a new dataset for cancer research containing somatic mutation states of 536 genes of the Cancer Gene Census (CGC). We used somatic mutation information from the Cancer Genome Atlas (TCGA) projects to create this dataset. As preliminary investigations, we employed machine learning techniques, including k-Nearest Neighbors, Decision Tree, Random Forest, and Artificial Neural Networks (ANNs) to evaluate the potential of these somatic mutations for classification of cancer types. We compared our models on accuracy, precision, recall, and F1-score. We observed that ANNs outperformed the other models with F1-score of 0.36 and overall classification accuracy of 40%, and precision ranging from 12% to 92% for different cancer types. The 40% accuracy is significantly higher than random guessing which would have resulted in 3% overall classification accuracy. Although the model has relatively low overall accuracy, it has an average classification specificity of 98%. The ANN achieved high precision scores (> 0.7) for 5 of the 33 cancer types. The introduced dataset can be used for research on TCGA data, such as survival analysis, histopathology image analysis and content-based image retrieval. The dataset is available online for download: https://kimialab.uwaterloo.ca/kimia/.
UR - http://www.scopus.com/inward/record.url?scp=85090998386&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090998386&partnerID=8YFLogxK
U2 - 10.1109/EMBC44109.2020.9176699
DO - 10.1109/EMBC44109.2020.9176699
M3 - Conference contribution
AN - SCOPUS:85090998386
T3 - Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS
SP - 5308
EP - 5311
BT - 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 20 July 2020 through 24 July 2020
ER -