Semi-supervised learning of text classification on bacterial protein-protein interaction documents

Guixian Xu; Zhendong Niu; Peter Uetz; Xu Gao; Xuping Qin; Hongfang Liu

doi:10.1109/IJCBS.2009.68

Digital Health Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

Protein-protein interaction (PPI) networks are essential to understanding the fundamental processes governing cell biology. The mining and curation of PPI knowledge is critical for analyzing high-throughput genomics and proteomics data. Several PPI knowledge bases have been built through expensive manual curation, but they are far from comprehensive. A document classification system that labels documents as PPI-related or not PPI-related would therefore assist the mining and curation of PPI knowledge. Building such a system requires an annotated corpus in which each document is tagged with a label (either positive or negative). In practice, however, only a small number of positive documents can be obtained manually or from existing PPI knowledge bases with literature evidence, while a large number of unlabeled documents are available, most of which are not PPI-related. Machine learning from a small number of positive documents and a large number of unlabeled documents is called learning from positive and unlabeled documents (LPU), and it has been studied in the general domain. A popular approach to LPU is a two-step strategy: the first step obtains reliable negative documents (RN), and the second step refines RN using methods such as clustering or boosting. In this paper, we tackle the problem of LPU for PPI document classification and compare three two-step procedures on a public data set, Reuters-21578. The first procedure obtains a negative data set by building a machine learning classifier that treats every unlabeled document as negative and then classifies the unlabeled documents. The second procedure refines the negative data set iteratively and takes as reliable negatives those unlabeled documents that are classified as negative in every iteration. The third procedure augments the negative data set iteratively by including unlabeled documents classified as negative in any iteration. Three machine learning algorithms were deployed for each two-step procedure.
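The three procedures described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the three learning algorithms, so a small multinomial Naive Bayes classifier stands in for them, and the function names (`train_nb`, `one_pass_negatives`, `iterative_negatives`), the `mode` parameter, the stopping rule, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs):
    """Fit multinomial Naive Bayes with add-one smoothing on token lists.

    Assumed stand-in classifier; both document lists must be non-empty.
    """
    vocab = {w for doc in pos_docs + neg_docs for w in doc}
    total = len(pos_docs) + len(neg_docs)
    model = {}
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(w for doc in docs for w in doc)
        denom = sum(counts.values()) + len(vocab)
        log_prior = math.log(len(docs) / total)
        log_lik = {w: math.log((counts[w] + 1) / denom) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def is_negative(model, doc):
    """Return True if the negative class scores higher for this document."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return scores["neg"] > scores["pos"]


def one_pass_negatives(P, U):
    """Procedure 1: treat all of U as negative, train, then classify U."""
    model = train_nb(P, U)
    return [i for i, doc in enumerate(U) if is_negative(model, doc)]


def iterative_negatives(P, U, mode, max_iter=10):
    """Procedures 2 and 3, starting from the one-pass negative set.

    mode="always": keep only documents classified negative in every round.
    mode="ever":   add documents classified negative in any round.
    """
    selected = set(one_pass_negatives(P, U))
    for _ in range(max_iter):
        if not selected:  # nothing left to train the negative class on
            break
        model = train_nb(P, [U[i] for i in sorted(selected)])
        predicted = {i for i, doc in enumerate(U) if is_negative(model, doc)}
        updated = selected & predicted if mode == "always" else selected | predicted
        if updated == selected:  # assumed stopping rule: set no longer changes
            break
        selected = updated
    return sorted(selected)


# Toy data (hypothetical): P holds known positives, U is unlabeled.
P = [["protein", "binds", "protein"],
     ["protein", "interaction", "complex"],
     ["kinase", "binds", "protein"]]
U = [["protein", "binds", "kinase"],
     ["stock", "market", "price"],
     ["oil", "price", "market"],
     ["grain", "export", "price"],
     ["stock", "export", "oil"]]

print("one pass:", one_pass_negatives(P, U))
print("always negative:", iterative_negatives(P, U, "always"))
print("ever negative:", iterative_negatives(P, U, "ever"))
```

The functions return indices into `U` rather than the documents themselves, which makes the three negative sets easy to compare. By construction, the "always" set is a subset of the one-pass set, which in turn is a subset of the "ever" set.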

Original language	English (US)
Title of host publication	Proceedings - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009
Pages	263-270
Number of pages	8
DOIs	https://doi.org/10.1109/IJCBS.2009.68
State	Published - 2009
Event	2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009 - Shanghai, China
Duration	Aug 3 2009 → Aug 5 2009

Keywords

Protein-protein interaction
Semi-supervised learning
Text classification

ASJC Scopus subject areas

Software
Biomedical Engineering

Cite this

Xu, G., Niu, Z., Uetz, P., Gao, X., Qin, X., & Liu, H. (2009). Semi-supervised learning of text classification on bacterial protein-protein interaction documents. In Proceedings - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009 (pp. 263-270). Article 5260672 (Proceedings - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009). https://doi.org/10.1109/IJCBS.2009.68

@inproceedings{8e30842860eb46a8a3db58072d9ec5a0,

title = "Semi-supervised learning of text classification on bacterial protein-protein interaction documents",

keywords = "Protein-protein interaction, Semi-supervised learning, Text classification",

author = "Guixian Xu and Zhendong Niu and Peter Uetz and Xu Gao and Xuping Qin and Hongfang Liu",

year = "2009",

doi = "10.1109/IJCBS.2009.68",

language = "English (US)",

isbn = "9780769537399",

series = "Proceedings - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009",

pages = "263--270",

booktitle = "Proceedings - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009",

note = "2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009 ; Conference date: 03-08-2009 Through 05-08-2009",

}

TY - GEN

T1 - Semi-supervised learning of text classification on bacterial protein-protein interaction documents

AU - Xu, Guixian

AU - Niu, Zhendong

AU - Uetz, Peter

AU - Gao, Xu

AU - Qin, Xuping

AU - Liu, Hongfang

PY - 2009

Y1 - 2009

KW - Protein-protein interaction

KW - Semi-supervised learning

KW - Text classification

UR - http://www.scopus.com/inward/record.url?scp=70450167556&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70450167556&partnerID=8YFLogxK

U2 - 10.1109/IJCBS.2009.68

DO - 10.1109/IJCBS.2009.68

M3 - Conference contribution

AN - SCOPUS:70450167556

SN - 9780769537399

T3 - Proceedings - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009

SP - 263

EP - 270

BT - Proceedings - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009

T2 - 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, IJCBS 2009

Y2 - 3 August 2009 through 5 August 2009

ER -
