Document classification for mining host pathogen protein-protein interactions

Guixian Xu, Lanlan Yin, Manabu Torii, Zhendong Niu, Cathy Wu, Zhangzhi Hu, Hongfang D Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

Heightened concern about bioterrorism and emerging and reemerging infectious diseases has produced a flood of molecular data about human pathogens, which is maintained in disparate databases. However, scientific findings regarding these pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine information pertaining to pathogenesis-related proteins, especially host-pathogen protein-protein interactions, from the literature. In this paper, we report our exploration of an automated system to identify MEDLINE abstracts that refer to host-pathogen protein-protein interactions. An annotated corpus consisting of 1,360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of feature selection using the information gain (IG) measure. Document classification systems were designed at two levels: abstract level and sentence level. We observed that feature selection was effective not only in reducing the dimensionality of the feature space to build a compact system, but also in improving document classification performance. We also observed that abstract-level and sentence-level systems yielded different classifications of MEDLINE abstracts, and that combining these systems could improve overall document classification.
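The information gain (IG) measure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each document is represented as a set of tokens, treats "term occurs in the document" as a binary feature, and computes IG as the class entropy minus the entropy conditioned on that feature. The function names (`entropy`, `information_gain`, `top_terms`) and the four toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'.

    docs is a list of token sets; labels is the parallel list of classes.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def top_terms(docs, labels, k):
    """Return the k vocabulary terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab,
                  key=lambda t: information_gain(docs, labels, t),
                  reverse=True)[:k]


# Hypothetical toy corpus: 1 = mentions a host-pathogen interaction, 0 = does not.
toy_docs = [
    {"virus", "binds", "receptor"},
    {"pathogen", "binds", "host"},
    {"gene", "expression", "host"},
    {"protein", "structure"},
]
toy_labels = [1, 1, 0, 0]

best = top_terms(toy_docs, toy_labels, 1)
```

In this toy corpus the term "binds" separates the two classes perfectly, so it receives the highest score, while "host" appears once in each class and receives none. In a pipeline like the one the abstract describes, only the top-ranked terms would be kept as features before training a classifier such as an SVM; the cutoff would be a tuning choice that the abstract does not specify.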

Original language: English (US)
Title of host publication: Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
Pages: 461-466
Number of pages: 6
DOI: https://doi.org/10.1109/BIBM.2008.66
State: Published - 2008
Externally published: Yes
Event: 2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008 - Philadelphia, PA, United States
Duration: Nov 3 2008 to Nov 5 2008

Other

Other: 2008 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
Country: United States
City: Philadelphia, PA
Period: 11/3/08 to 11/5/08

Fingerprint

Pathogens
Proteins
MEDLINE
Emerging Communicable Diseases
Feature extraction
Bioterrorism
Support vector machines
Databases

ASJC Scopus subject areas

  • Molecular Biology
  • Information Systems
  • Biomedical Engineering

Cite this

Xu, G., Yin, L., Torii, M., Niu, Z., Wu, C., Hu, Z., & Liu, H. D. (2008). Document classification for mining host pathogen protein-protein interactions. In Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008 (pp. 461-466). [4684940] https://doi.org/10.1109/BIBM.2008.66

@inproceedings{8ce67600a71a430d9c3ad81c02fbff7a,
title = "Document classification for mining host pathogen protein-protein interactions",
author = "Guixian Xu and Lanlan Yin and Manabu Torii and Zhendong Niu and Cathy Wu and Zhangzhi Hu and Liu, {Hongfang D}",
year = "2008",
doi = "10.1109/BIBM.2008.66",
language = "English (US)",
isbn = "9780769534527",
pages = "461--466",
booktitle = "Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008",

}

TY - GEN
T1 - Document classification for mining host pathogen protein-protein interactions
AU - Xu, Guixian
AU - Yin, Lanlan
AU - Torii, Manabu
AU - Niu, Zhendong
AU - Wu, Cathy
AU - Hu, Zhangzhi
AU - Liu, Hongfang D
PY - 2008
Y1 - 2008
UR - http://www.scopus.com/inward/record.url?scp=58049158462&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=58049158462&partnerID=8YFLogxK
U2 - 10.1109/BIBM.2008.66
DO - 10.1109/BIBM.2008.66
M3 - Conference contribution
AN - SCOPUS:58049158462
SN - 9780769534527
SP - 461
EP - 466
BT - Proceedings - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2008
ER -