A data science approach for the classification of low-grade and high-grade ovarian serous carcinomas

Sangdi Lin; Chen Wang; Shabnam Zarei; Debra A. Bell; Sarah E. Kerr; George C. Runger; Jean Pierre A. Kocher

doi:10.1186/s12864-018-5177-9

Research output: Contribution to journal › Article › peer-review

1 Scopus citations

Abstract

Background: Copy number alterations (CNAs) are defined as somatic gains or losses of DNA regions. CNA profiles may provide a fingerprint specific to a tumor type or tumor grade. Low-coverage sequencing for reporting CNAs has recently gained interest since it was successfully translated into clinical applications. Ovarian serous carcinomas can be classified into two largely mutually exclusive grades, low grade and high grade, based on their histologic features. A grade classification based on genomics may provide valuable clues on how best to manage these patients in the clinic. Using ovarian serous carcinomas as a case study, we explore a methodology that combines CNA reporting from low-coverage sequencing with machine learning techniques to stratify tumor biospecimens of different grades.

Results: We have developed a data-driven methodology for tumor classification using the CNA profiles reported by low-coverage sequencing. The proposed method, called Bag-of-Segments, summarizes CNA profiles as fixed-length features predictive of tumor grade. These features are then processed by machine learning techniques to obtain classification models. High accuracy is obtained in classifying ovarian serous carcinomas into high and low grades in leave-one-out cross-validation experiments. Models that are only weakly influenced by sequence coverage and sample purity can also be built, which would be of higher relevance for clinical applications. The patterns captured by Bag-of-Segments features correlate with current clinical knowledge: low-grade ovarian tumors are related to aneuploidy events associated with mitotic errors, while high-grade ovarian tumors are induced by DNA repair gene malfunction.

Conclusions: The proposed data-driven method obtains high accuracy under various parametrizations for the ovarian serous carcinoma study, indicating good generalization potential for other CNA classification problems. The method could be applied to the more difficult task of classifying ovarian serous carcinomas with ambiguous histology, or those in which low-grade tumor co-exists with high-grade tumor. The closer genomic relationship of such tumor samples to low or high grade may provide important clinical value.
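
The abstract does not spell out how Bag-of-Segments features are computed, so the following is only a minimal sketch of the general idea (variable-length CNA profile in, fixed-length feature vector out, leave-one-out evaluation), not the authors' implementation. Everything specific here is an assumption made for illustration: the log-ratio thresholds, the four-entry codebook of (state, length class) pairs, the nearest-centroid classifier, the synthetic profiles, and the arbitrary class labels "A" and "B", which do not stand for any tumor grade.

```python
import random
from collections import Counter

# Hypothetical codebook: each called segment maps to one (state, length class) pair.
CODEBOOK = [(s, l) for s in ("loss", "gain") for l in ("short", "long")]

def call_state(log_ratio, loss=-0.3, gain=0.3):
    """Label one genomic bin; the thresholds are illustrative assumptions."""
    if log_ratio <= loss:
        return "loss"
    if log_ratio >= gain:
        return "gain"
    return "neutral"

def segments(profile):
    """Collapse consecutive bins with the same state into [state, length] runs."""
    runs = []
    for value in profile:
        state = call_state(value)
        if runs and runs[-1][0] == state:
            runs[-1][1] += 1
        else:
            runs.append([state, 1])
    return runs

def bag_of_segments(profile, long_cutoff=20):
    """Fixed-length vector: fraction of bins covered by each codebook entry."""
    covered = Counter()
    for state, length in segments(profile):
        if state != "neutral":
            size = "long" if length >= long_cutoff else "short"
            covered[(state, size)] += length
    return [covered[code] / len(profile) for code in CODEBOOK]

def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def predict(train_x, train_y, x):
    """Nearest-centroid classifier (a stand-in for any machine learning model)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        c = centroid([r for r, y in zip(train_x, train_y) if y == label])
        dist = sum((a - b) ** 2 for a, b in zip(c, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(xs, ys):
    """Leave-one-out cross-validation: hold out each sample once."""
    hits = 0
    for i in range(len(xs)):
        train_x, train_y = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

def toy_profile(rng, n_bins, n_events, event_len):
    """Synthetic log-ratio profile; made-up data, not real sequencing output."""
    profile = [rng.gauss(0.0, 0.05) for _ in range(n_bins)]
    for _ in range(n_events):
        start = rng.randrange(0, n_bins - event_len)
        shift = rng.choice([-0.8, 0.8])
        for i in range(start, start + event_len):
            profile[i] = shift + rng.gauss(0.0, 0.05)
    return profile

rng = random.Random(0)
profiles, labels = [], []
for _ in range(10):
    # Class "A": a few long events. Class "B": many short events.
    profiles.append(toy_profile(rng, rng.randint(300, 400), 2, 60))
    labels.append("A")
    profiles.append(toy_profile(rng, rng.randint(300, 400), 8, 8))
    labels.append("B")

features = [bag_of_segments(p) for p in profiles]
accuracy = loocv_accuracy(features, labels)
print("feature length:", len(features[0]), "LOOCV accuracy:", accuracy)
```

Profiles of different lengths all map to a vector of length four, which is what lets a standard classifier consume them. With real data, the codebook and the classifier would presumably be learned or tuned rather than fixed by hand as they are here.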

Original language	English (US)
Article number	841
Journal	BMC genomics
Volume	19
Issue number	1
DOIs	https://doi.org/10.1186/s12864-018-5177-9
State	Published - Nov 27 2018

Keywords

Classification
Copy number alterations
Data science
Low-coverage whole genome sequencing
Machine learning
Ovarian serous carcinoma
Tumor grade

ASJC Scopus subject areas

Biotechnology
Genetics

Cite this

@article{b5cb30c81249440a85dbb356a9ef9c3d,

title = "A data science approach for the classification of low-grade and high-grade ovarian serous carcinomas",

keywords = "Classification, Copy number alterations, Data science, Low-coverage whole genome sequencing, Machine learning, Ovarian serous carcinoma, Tumor grade",

author = "Sangdi Lin and Chen Wang and Shabnam Zarei and Bell, {Debra A.} and Kerr, {Sarah E.} and Runger, {George C.} and Kocher, {Jean Pierre A.}",

note = "Publisher Copyright: {\textcopyright} 2018 The Author(s).",

year = "2018",

month = nov,

day = "27",

doi = "10.1186/s12864-018-5177-9",

language = "English (US)",

volume = "19",

journal = "BMC genomics",

issn = "1471-2164",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - A data science approach for the classification of low-grade and high-grade ovarian serous carcinomas

AU - Lin, Sangdi

AU - Wang, Chen

AU - Zarei, Shabnam

AU - Bell, Debra A.

AU - Kerr, Sarah E.

AU - Runger, George C.

AU - Kocher, Jean Pierre A.

PY - 2018/11/27

Y1 - 2018/11/27

KW - Classification

KW - Copy number alterations

KW - Data science

KW - Low-coverage whole genome sequencing

KW - Machine learning

KW - Ovarian serous carcinoma

KW - Tumor grade

UR - http://www.scopus.com/inward/record.url?scp=85057219755&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85057219755&partnerID=8YFLogxK

U2 - 10.1186/s12864-018-5177-9

DO - 10.1186/s12864-018-5177-9

M3 - Article

C2 - 30482155

AN - SCOPUS:85057219755

SN - 1471-2164

VL - 19

JO - BMC genomics

JF - BMC genomics

IS - 1

M1 - 841

ER -