TY - GEN
T1 - Learning from positive and unlabeled documents for retrieval of bacterial protein-protein interaction literature
AU - Liu, Hongfang
AU - Torii, Manabu
AU - Xu, Guixian
AU - Hu, Zhangzhi
AU - Goll, Johannes
PY - 2010
Y1 - 2010
N2 - With the advance of high-throughput genomics and proteomics technologies, it has become critical to mine and curate protein-protein interaction (PPI) networks from the biological research literature. Several PPI knowledge bases have been curated by domain experts, but they are far from comprehensive. Observing that PPI-relevant documents can be obtained from PPI knowledge bases that record literature evidence, and that a large number of unlabeled documents (mostly negative) are freely available, we investigated learning from positive and unlabeled data (LPU) and developed an automated system for retrieving PPI-relevant articles to assist the curation of a bacterial PPI knowledge base, MPIDB. Two approaches to obtaining unlabeled documents were used: one based on a PubMed MeSH term search and the other based on an existing knowledge base, UniProtKB. We found that unlabeled documents obtained from UniProtKB tend to yield better document classifiers for PPI curation purposes. Our study shows that LPU is a feasible approach to developing an automated system for retrieving PPI-relevant articles that requires no extra annotation effort. The selection of machine learning algorithms and of unlabeled documents would be critical in constructing an effective LPU-based system.
AB - With the advance of high-throughput genomics and proteomics technologies, it has become critical to mine and curate protein-protein interaction (PPI) networks from the biological research literature. Several PPI knowledge bases have been curated by domain experts, but they are far from comprehensive. Observing that PPI-relevant documents can be obtained from PPI knowledge bases that record literature evidence, and that a large number of unlabeled documents (mostly negative) are freely available, we investigated learning from positive and unlabeled data (LPU) and developed an automated system for retrieving PPI-relevant articles to assist the curation of a bacterial PPI knowledge base, MPIDB. Two approaches to obtaining unlabeled documents were used: one based on a PubMed MeSH term search and the other based on an existing knowledge base, UniProtKB. We found that unlabeled documents obtained from UniProtKB tend to yield better document classifiers for PPI curation purposes. Our study shows that LPU is a feasible approach to developing an automated system for retrieving PPI-relevant articles that requires no extra annotation effort. The selection of machine learning algorithms and of unlabeled documents would be critical in constructing an effective LPU-based system.
KW - Document retrieval
KW - Learning from positive and unlabeled
KW - Protein-protein interaction
UR - http://www.scopus.com/inward/record.url?scp=77953743348&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77953743348&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-13131-8_8
DO - 10.1007/978-3-642-13131-8_8
M3 - Conference contribution
AN - SCOPUS:77953743348
SN - 3642131301
SN - 9783642131301
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 62
EP - 70
BT - Linking Literature, Information, and Knowledge for Biology - Workshop of the BioLink Special Interest Group, ISMB/ECCB 2009, Revised Selected Papers
T2 - Workshop of the BioLINK Special Interest Group on Linking Literature, Information and Knowledge for Biology, ISMB/ECCB 2009
Y2 - 28 June 2009 through 29 June 2009
ER -