Comparison of classification methods on protein-protein interaction document classification

Guixian Xu, Zhendong Niu, Peter Uetz, Xu Gao, Hongfang D Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Protein-protein interaction (PPI) networks are essential for understanding the fundamental processes governing cell biology. The mining and curation of experimental PPI knowledge is critical for the analysis of high-throughput genomics and proteomics data. Several PPI knowledge bases have been built through expensive manual curation, but they are far from comprehensive. Document classification systems have been shown to have the potential to accelerate curation by retrieving PPI-related documents. Typically, however, only a small number of positive documents can be obtained, either manually or from PPI knowledge bases with literature-based evidence, while a large number of unlabeled documents are available, most of which are negative. Such data sets are called imbalanced. Learning from imbalanced data sets, where one (majority) class has many more examples than the others, presents an important challenge to the machine learning community. It is not clear which kind of classification algorithm is suitable for PPI document classification. In this paper, we compare the performance of several document classifiers on two PPI document sets, varying the number of positives and the ratio of positives to negatives (or unlabeled documents) in the experiments.
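The experimental design described in the abstract (fixing a number of positives and varying the ratio of negatives to positives) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the helper names `subsample` and `accuracy_and_f1`, the toy document ids, the chosen ratios, and the use of accuracy and F1 as metrics. It scores a trivial baseline that labels every document negative, which shows why imbalance is a problem: accuracy rises with the ratio while F1 on the positive class stays at zero.

```python
import random


def subsample(positives, negatives, n_pos, neg_per_pos, seed=0):
    """Draw n_pos positives and n_pos * neg_per_pos negatives (hypothetical helper)."""
    rng = random.Random(seed)
    pos = rng.sample(positives, n_pos)
    neg = rng.sample(negatives, n_pos * neg_per_pos)
    return pos, neg


def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 of the positive class), with label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Toy document ids standing in for a real corpus (assumed, not from the paper).
positives = [f"pos{i}" for i in range(100)]
negatives = [f"neg{i}" for i in range(5000)]

for neg_per_pos in (1, 10, 50):
    pos, neg = subsample(positives, negatives, n_pos=50, neg_per_pos=neg_per_pos)
    y_true = [1] * len(pos) + [0] * len(neg)
    y_pred = [0] * len(y_true)  # baseline: label every document negative
    acc, f1 = accuracy_and_f1(y_true, y_pred)
    print(f"1:{neg_per_pos}  accuracy={acc:.3f}  F1={f1:.3f}")
```

In a real comparison the all-negative baseline would be replaced by trained document classifiers, and the same loop would be repeated over several positive-set sizes.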

Original language: English (US)
Title of host publication: Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW
Pages: 83-90
Number of pages: 8
DOIs: https://doi.org/10.1109/BIBMW.2008.4686213
State: Published - 2008
Externally published: Yes
Event: 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW - Philadelphia, PA, United States
Duration: Nov 3 2008 to Nov 5 2008

Other

Other: 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW
Country: United States
City: Philadelphia, PA
Period: 11/3/08 to 11/5/08

Fingerprint

Proteins
Knowledge Bases
Cytology
Protein Interaction Maps
Genomics
Proteomics
Cell Biology
Learning systems
Classifiers
Learning
Throughput

ASJC Scopus subject areas

  • Molecular Biology
  • Information Systems
  • Biomedical Engineering

Cite this

Xu, G., Niu, Z., Uetz, P., Gao, X., & Liu, H. D. (2008). Comparison of classification methods on protein-protein interaction document classification. In Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW (pp. 83-90). [4686213] https://doi.org/10.1109/BIBMW.2008.4686213

Comparison of classification methods on protein-protein interaction document classification. / Xu, Guixian; Niu, Zhendong; Uetz, Peter; Gao, Xu; Liu, Hongfang D.

Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW. 2008. p. 83-90 4686213.

Xu, G, Niu, Z, Uetz, P, Gao, X & Liu, HD 2008, Comparison of classification methods on protein-protein interaction document classification. in Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW., 4686213, pp. 83-90, 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW, Philadelphia, PA, United States, 11/3/08. https://doi.org/10.1109/BIBMW.2008.4686213
Xu G, Niu Z, Uetz P, Gao X, Liu HD. Comparison of classification methods on protein-protein interaction document classification. In Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW. 2008. p. 83-90. 4686213 https://doi.org/10.1109/BIBMW.2008.4686213
Xu, Guixian ; Niu, Zhendong ; Uetz, Peter ; Gao, Xu ; Liu, Hongfang D. / Comparison of classification methods on protein-protein interaction document classification. Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW. 2008. pp. 83-90
@inproceedings{c39464f35ae244cd82289232498fd448,
title = "Comparison of classification methods on protein-protein interaction document classification",
author = "Guixian Xu and Zhendong Niu and Peter Uetz and Xu Gao and Liu, {Hongfang D}",
year = "2008",
doi = "10.1109/BIBMW.2008.4686213",
language = "English (US)",
isbn = "9781424428908",
pages = "83--90",
booktitle = "Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW",

}

TY - GEN

T1 - Comparison of classification methods on protein-protein interaction document classification

AU - Xu, Guixian

AU - Niu, Zhendong

AU - Uetz, Peter

AU - Gao, Xu

AU - Liu, Hongfang D

PY - 2008

Y1 - 2008

UR - http://www.scopus.com/inward/record.url?scp=58049170453&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=58049170453&partnerID=8YFLogxK

U2 - 10.1109/BIBMW.2008.4686213

DO - 10.1109/BIBMW.2008.4686213

M3 - Conference contribution

AN - SCOPUS:58049170453

SN - 9781424428908

SP - 83

EP - 90

BT - Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW

ER -