TY - GEN
T1 - Comparison of classification methods on protein-protein interaction document classification
AU - Xu, Guixian
AU - Niu, Zhendong
AU - Uetz, Peter
AU - Gao, Xu
AU - Liu, Hongfang
PY - 2008
Y1 - 2008
N2 - Protein-protein interaction (PPI) networks are essential to understanding the fundamental processes governing cell biology. The mining and curation of experimental PPI knowledge is critical for the analysis of high-throughput genomics and proteomics data. Several PPI knowledge bases have been built through expensive manual curation, but they are far from comprehensive. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. However, it is usually the case that only a small number of positive documents can be obtained, either manually or from PPI knowledge bases with literature-based evidence, while there is a large number of unlabeled documents, most of which are negative. Such data sets are called imbalanced. Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. It is not clear what kind of classification algorithm is suitable for PPI document classification. In this paper, we compared the performance of several document classifiers on two PPI document sets, varying the number of positives and the ratio of positives to negatives (or unlabeled documents) in the experiments.
AB - Protein-protein interaction (PPI) networks are essential to understanding the fundamental processes governing cell biology. The mining and curation of experimental PPI knowledge is critical for the analysis of high-throughput genomics and proteomics data. Several PPI knowledge bases have been built through expensive manual curation, but they are far from comprehensive. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. However, it is usually the case that only a small number of positive documents can be obtained, either manually or from PPI knowledge bases with literature-based evidence, while there is a large number of unlabeled documents, most of which are negative. Such data sets are called imbalanced. Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. It is not clear what kind of classification algorithm is suitable for PPI document classification. In this paper, we compared the performance of several document classifiers on two PPI document sets, varying the number of positives and the ratio of positives to negatives (or unlabeled documents) in the experiments.
UR - http://www.scopus.com/inward/record.url?scp=58049170453&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=58049170453&partnerID=8YFLogxK
U2 - 10.1109/BIBMW.2008.4686213
DO - 10.1109/BIBMW.2008.4686213
M3 - Conference contribution
AN - SCOPUS:58049170453
SN - 9781424428908
T3 - Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW
SP - 83
EP - 90
BT - Proceedings - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW
T2 - 2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW
Y2 - 3 November 2008 through 5 November 2008
ER -