Imbalanced text classification on host pathogen protein-protein interaction documents

Guixian Xu, Zhendong Niu, Xu Gao, Hongfang D Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Protein-protein interactions (PPIs) are important in understanding the fundamental processes governing cell biology. However, a large number of scientific findings about PPIs are buried in the growing volume of biomedical literature. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. However, it is usually the case that only a small number of positive documents can be obtained, manually or from PPI knowledge bases with literature-based evidence, while there are a large number of negative documents. In this paper, we investigate the effects of feature selection, feature weighting, and the kernel function of Support Vector Machines (SVMs) on imbalanced two-class classification, based on 1360 host-pathogen protein-protein interaction documents. The results show that a suitable feature weighting approach is an important factor in improving classification performance. Adjusting the cost-sensitive parameter of an SVM with a radial basis function (RBF) kernel can decrease the minority-class misclassification ratio and increase classification accuracy on imbalanced document classification. An automated classification system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions can be developed based on the experiment.
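As a rough illustration of why the abstract reports the minority-class misclassification ratio alongside accuracy, the sketch below computes inverse-frequency class weights (one common heuristic for setting per-class misclassification costs) and evaluates a trivial majority-class predictor. The 136/1224 split, the function names, and the weighting heuristic are assumptions made for illustration; the abstract does not state the class counts or how the authors set the cost parameter.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def minority_error_and_accuracy(y_true, y_pred, minority=1):
    """Return (minority-class misclassification ratio, overall accuracy)."""
    minority_total = sum(1 for t in y_true if t == minority)
    minority_wrong = sum(
        1 for t, p in zip(y_true, y_pred) if t == minority and p != minority
    )
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return minority_wrong / minority_total, correct / len(y_true)


# Hypothetical split of the 1360 documents (the abstract gives no class counts).
y_true = [1] * 136 + [0] * 1224

# Per-class costs that penalise errors on the rare positive class more heavily.
weights = balanced_class_weights(y_true)

# A classifier that always predicts the majority (negative) class.
always_negative = [0] * len(y_true)
miss_ratio, accuracy = minority_error_and_accuracy(y_true, always_negative)

print(weights)
print(miss_ratio, accuracy)
```

Under these assumed counts, the majority-class predictor looks accurate overall while missing every positive document, which is the failure mode that a cost-sensitive parameter is meant to counteract.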

Original language: English (US)
Title of host publication: 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010
Pages: 418-422
Number of pages: 5
Volume: 1
DOIs: https://doi.org/10.1109/ICCAE.2010.5451921
State: Published - 2010
Externally published: Yes
Event: 2nd International Conference on Computer and Automation Engineering, ICCAE 2010 - Singapore, Singapore
Duration: Feb 26 2010 → Feb 28 2010

Other

Other: 2nd International Conference on Computer and Automation Engineering, ICCAE 2010
Country: Singapore
City: Singapore
Period: 2/26/10 → 2/28/10

Fingerprint

Pathogens
Proteins
Support vector machines
Cytology
Feature extraction
Costs
Experiments

Keywords

  • Imbalanced text classification
  • Machine learning
  • Protein-protein interaction

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Control and Systems Engineering

Cite this

Xu, G., Niu, Z., Gao, X., & Liu, H. D. (2010). Imbalanced text classification on host pathogen protein-protein interaction documents. In 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010 (Vol. 1, pp. 418-422). [5451921] https://doi.org/10.1109/ICCAE.2010.5451921

Imbalanced text classification on host pathogen protein-protein interaction documents. / Xu, Guixian; Niu, Zhendong; Gao, Xu; Liu, Hongfang D.

2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010. Vol. 1 2010. p. 418-422 5451921.

Xu, G, Niu, Z, Gao, X & Liu, HD 2010, Imbalanced text classification on host pathogen protein-protein interaction documents. in 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010. vol. 1, 5451921, pp. 418-422, 2nd International Conference on Computer and Automation Engineering, ICCAE 2010, Singapore, Singapore, 2/26/10. https://doi.org/10.1109/ICCAE.2010.5451921
Xu G, Niu Z, Gao X, Liu HD. Imbalanced text classification on host pathogen protein-protein interaction documents. In 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010. Vol. 1. 2010. p. 418-422. 5451921 https://doi.org/10.1109/ICCAE.2010.5451921
Xu, Guixian ; Niu, Zhendong ; Gao, Xu ; Liu, Hongfang D. / Imbalanced text classification on host pathogen protein-protein interaction documents. 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010. Vol. 1 2010. pp. 418-422
@inproceedings{f59548edd521481d9ca94f845ac87397,
title = "Imbalanced text classification on host pathogen protein-protein interaction documents",
abstract = "Protein-protein interactions (PPIs) are important in understanding the fundamental processes governing cell biology. However, a large number of scientific findings about PPIs are buried in the growing volume of biomedical literature. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. However, it is usually the case that only a small number of positive documents can be obtained, manually or from PPI knowledge bases with literature-based evidence, while there are a large number of negative documents. In this paper, we investigate the effects of feature selection, feature weighting, and the kernel function of Support Vector Machines (SVMs) on imbalanced two-class classification, based on 1360 host-pathogen protein-protein interaction documents. The results show that a suitable feature weighting approach is an important factor in improving classification performance. Adjusting the cost-sensitive parameter of an SVM with a radial basis function (RBF) kernel can decrease the minority-class misclassification ratio and increase classification accuracy on imbalanced document classification. An automated classification system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions can be developed based on the experiment.",
keywords = "Imbalanced text classification, Machine learning, Protein-protein interaction",
author = "Xu, Guixian and Niu, Zhendong and Gao, Xu and Liu, {Hongfang D.}",
year = "2010",
doi = "10.1109/ICCAE.2010.5451921",
language = "English (US)",
isbn = "9781424455850",
volume = "1",
pages = "418--422",
booktitle = "2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010",
}

TY - GEN

T1 - Imbalanced text classification on host pathogen protein-protein interaction documents

AU - Xu, Guixian

AU - Niu, Zhendong

AU - Gao, Xu

AU - Liu, Hongfang D

PY - 2010

Y1 - 2010

AB - Protein-protein interactions (PPIs) are important in understanding the fundamental processes governing cell biology. However, a large number of scientific findings about PPIs are buried in the growing volume of biomedical literature. Document classification systems have been shown to have the potential to accelerate the curation process by retrieving PPI-related documents. However, it is usually the case that only a small number of positive documents can be obtained, manually or from PPI knowledge bases with literature-based evidence, while there are a large number of negative documents. In this paper, we investigate the effects of feature selection, feature weighting, and the kernel function of Support Vector Machines (SVMs) on imbalanced two-class classification, based on 1360 host-pathogen protein-protein interaction documents. The results show that a suitable feature weighting approach is an important factor in improving classification performance. Adjusting the cost-sensitive parameter of an SVM with a radial basis function (RBF) kernel can decrease the minority-class misclassification ratio and increase classification accuracy on imbalanced document classification. An automated classification system to identify MEDLINE abstracts referring to host-pathogen protein-protein interactions can be developed based on the experiment.

KW - Imbalanced text classification

KW - Machine learning

KW - Protein-protein interaction

UR - http://www.scopus.com/inward/record.url?scp=77952594815&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77952594815&partnerID=8YFLogxK

U2 - 10.1109/ICCAE.2010.5451921

DO - 10.1109/ICCAE.2010.5451921

M3 - Conference contribution

AN - SCOPUS:77952594815

SN - 9781424455850

VL - 1

SP - 418

EP - 422

BT - 2010 The 2nd International Conference on Computer and Automation Engineering, ICCAE 2010

ER -