Robust and efficient identification of biomarkers by classifying features on graphs

Taehyun Hwang, Hugues Sicotte, Ze Tian, Baolin Wu, Jean-Pierre Kocher, Dennis A Wigle, Vipin Kumar, Rui Kuang

Research output: Contribution to journal (Article)

24 Citations (Scopus)

Abstract

Motivation: A central problem in biomarker discovery from large-scale gene expression or single nucleotide polymorphism (SNP) data is the computational challenge of taking into account the dependence among all the features. Methods that ignore the dependence usually identify non-reproducible biomarkers across independent datasets. We introduce a new graph-based semi-supervised feature classification algorithm to identify discriminative disease markers by learning on bipartite graphs. Our algorithm directly classifies the feature nodes in a bipartite graph as positive, negative or neutral with network propagation to capture the dependence among both samples and features (clinical and genetic variables) by exploring bi-cluster structures in a graph. Two features of our algorithm are: (1) our algorithm can find a globally optimal labeling to capture the dependence among all the features and thus generates highly reproducible results across independent microarray or other high-throughput datasets; (2) our algorithm is capable of handling hundreds of thousands of features and thus is particularly useful for biomarker identification from high-throughput gene expression and SNP data. In addition, although designed for classifying features, our algorithm can also simultaneously classify test samples for disease prognosis/diagnosis.

Results: We applied the network propagation algorithm to study three large-scale breast cancer datasets. Our algorithm achieved competitive classification performance compared with SVMs and other baseline methods, and identified several markers with clinical or biological relevance to the disease. More importantly, our algorithm also identified highly reproducible marker genes and enriched functions from the independent datasets.
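The abstract does not give the update rule, so the following is only a minimal sketch of the general idea of propagating sample labels across a sample-feature bipartite graph. It uses a generic normalized label-propagation iteration, not the authors' actual formulation. The function names `propagate_bipartite` and `label_features`, the parameters `alpha` and `threshold`, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def propagate_bipartite(W, y_samples, alpha=0.5, n_iter=200, tol=1e-9):
    """Generic label propagation on a bipartite sample-feature graph.

    W         : (n_samples, n_features) non-negative edge weights (assumed input).
    y_samples : +1 / -1 for labeled samples, 0 for unlabeled (test) samples.
    alpha     : hypothetical trade-off between propagated and initial labels.
    Returns (sample_scores, feature_scores).
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y_samples, dtype=float)

    # Symmetric degree normalization: S = D_s^{-1/2} W D_f^{-1/2}.
    d_s = W.sum(axis=1)
    d_f = W.sum(axis=0)
    d_s[d_s == 0] = 1.0  # avoid division by zero for isolated nodes
    d_f[d_f == 0] = 1.0
    S = W / np.sqrt(np.outer(d_s, d_f))

    f_samp = y.copy()
    f_feat = np.zeros(W.shape[1])
    for _ in range(n_iter):
        # Features carry no initial labels here, so they only receive
        # scores from the samples they are connected to.
        new_feat = alpha * (S.T @ f_samp)
        # Samples mix their initial labels with scores from their features.
        new_samp = (1.0 - alpha) * y + alpha * (S @ new_feat)
        delta = max(np.abs(new_feat - f_feat).max(),
                    np.abs(new_samp - f_samp).max())
        f_feat, f_samp = new_feat, new_samp
        if delta < tol:
            break
    return f_samp, f_feat


def label_features(f_feat, threshold=0.05):
    """Map feature scores to +1 (positive), -1 (negative) or 0 (neutral)."""
    f_feat = np.asarray(f_feat)
    return np.where(f_feat > threshold, 1, np.where(f_feat < -threshold, -1, 0))


# Toy data (made up): 4 samples x 5 features.
# Features 0-1 occur only in samples 0-1, features 3-4 only in samples 2-3,
# and feature 2 occurs in every sample.
W = np.array([[1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 1]])
y = np.array([1, 0, -1, 0])  # samples 1 and 3 are unlabeled test samples

sample_scores, feature_scores = propagate_bipartite(W, y)
print(label_features(feature_scores))
print(np.sign(sample_scores))
```

In this toy setting the features attached only to the positive samples come out positive, those attached only to the negative samples come out negative, and the feature shared by all samples stays neutral. The unlabeled samples pick up the sign of their group, which mirrors the abstract's remark that test samples can be classified at the same time as features.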

Original language: English (US)
Pages (from-to): 2023-2029
Number of pages: 7
Journal: Bioinformatics
Volume: 24
Issue number: 18
DOIs: 10.1093/bioinformatics/btn383
State: Published - Sep 2008

Fingerprint

Biomarkers
Graph in graph theory
Single Nucleotide Polymorphism
Bipartite Graph
Nucleotides
Gene Expression
Polymorphism
Classify
Propagation
Prognosis
Classification Algorithm
Breast Cancer
Microarrays
High Throughput
Labeling
Baseline
Genes

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Robust and efficient identification of biomarkers by classifying features on graphs. / Hwang, Taehyun; Sicotte, Hugues; Tian, Ze; Wu, Baolin; Kocher, Jean-Pierre; Wigle, Dennis A; Kumar, Vipin; Kuang, Rui.

In: Bioinformatics, Vol. 24, No. 18, 09.2008, p. 2023-2029.

Hwang, Taehyun; Sicotte, Hugues; Tian, Ze; Wu, Baolin; Kocher, Jean-Pierre; Wigle, Dennis A; Kumar, Vipin; Kuang, Rui. / Robust and efficient identification of biomarkers by classifying features on graphs. In: Bioinformatics. 2008; Vol. 24, No. 18. pp. 2023-2029.
@article{971913324c9349eda09c97bc57de8f8e,
title = "Robust and efficient identification of biomarkers by classifying features on graphs",
abstract = "Motivation: A central problem in biomarker discovery from large-scale gene expression or single nucleotide polymorphism (SNP) data is the computational challenge of taking into account the dependence among all the features. Methods that ignore the dependence usually identify non-reproducible biomarkers across independent datasets. We introduce a new graph-based semi-supervised feature classification algorithm to identify discriminative disease markers by learning on bipartite graphs. Our algorithm directly classifies the feature nodes in a bipartite graph as positive, negative or neutral with network propagation to capture the dependence among both samples and features (clinical and genetic variables) by exploring bi-cluster structures in a graph. Two features of our algorithm are: (1) our algorithm can find a globally optimal labeling to capture the dependence among all the features and thus generates highly reproducible results across independent microarray or other high-throughput datasets; (2) our algorithm is capable of handling hundreds of thousands of features and thus is particularly useful for biomarker identification from high-throughput gene expression and SNP data. In addition, although designed for classifying features, our algorithm can also simultaneously classify test samples for disease prognosis/diagnosis. Results: We applied the network propagation algorithm to study three large-scale breast cancer datasets. Our algorithm achieved competitive classification performance compared with SVMs and other baseline methods, and identified several markers with clinical or biological relevance to the disease. More importantly, our algorithm also identified highly reproducible marker genes and enriched functions from the independent datasets.",
author = "Taehyun Hwang and Hugues Sicotte and Ze Tian and Baolin Wu and Jean-Pierre Kocher and Dennis A Wigle and Vipin Kumar and Rui Kuang",
year = "2008",
month = "9",
doi = "10.1093/bioinformatics/btn383",
language = "English (US)",
volume = "24",
pages = "2023--2029",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "18",

}

TY - JOUR

T1 - Robust and efficient identification of biomarkers by classifying features on graphs

AU - Hwang, Taehyun

AU - Sicotte, Hugues

AU - Tian, Ze

AU - Wu, Baolin

AU - Kocher, Jean-Pierre

AU - Wigle, Dennis A

AU - Kumar, Vipin

AU - Kuang, Rui

PY - 2008/9

Y1 - 2008/9

AB - Motivation: A central problem in biomarker discovery from large-scale gene expression or single nucleotide polymorphism (SNP) data is the computational challenge of taking into account the dependence among all the features. Methods that ignore the dependence usually identify non-reproducible biomarkers across independent datasets. We introduce a new graph-based semi-supervised feature classification algorithm to identify discriminative disease markers by learning on bipartite graphs. Our algorithm directly classifies the feature nodes in a bipartite graph as positive, negative or neutral with network propagation to capture the dependence among both samples and features (clinical and genetic variables) by exploring bi-cluster structures in a graph. Two features of our algorithm are: (1) our algorithm can find a globally optimal labeling to capture the dependence among all the features and thus generates highly reproducible results across independent microarray or other high-throughput datasets; (2) our algorithm is capable of handling hundreds of thousands of features and thus is particularly useful for biomarker identification from high-throughput gene expression and SNP data. In addition, although designed for classifying features, our algorithm can also simultaneously classify test samples for disease prognosis/diagnosis. Results: We applied the network propagation algorithm to study three large-scale breast cancer datasets. Our algorithm achieved competitive classification performance compared with SVMs and other baseline methods, and identified several markers with clinical or biological relevance to the disease. More importantly, our algorithm also identified highly reproducible marker genes and enriched functions from the independent datasets.

UR - http://www.scopus.com/inward/record.url?scp=51749084898&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=51749084898&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btn383

DO - 10.1093/bioinformatics/btn383

M3 - Article

VL - 24

SP - 2023

EP - 2029

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 18

ER -