Learning from positive and unlabeled documents for automated detection of alternative splicing sentences in MEDLINE abstracts

Yang Chen, Manabu Torii, Chang Tien Lu, Hongfang Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

Alternative splicing is considered to be a key factor underlying increased cellular and functional complexity in higher eukaryotes. With the advance of high-throughput genomics technologies, it has become critical to mine alternative splicing knowledge from the biological research literature. However, many papers have been published on DNA splicing and translation, and finding those specifically relevant to alternative splicing is time-consuming. Observing that documents reporting alternative splicing can be obtained from existing knowledge bases that record literature evidence, and that a large number of unlabeled documents are freely available, we investigated learning from positive and unlabeled data (LPU) for retrieving papers relevant to alternative splicing. The positive documents come from Literature Support for Alternative Transcripts (LSAT), and the unlabeled documents are obtained from Gene Reference Into Function (GeneRIF). We generated nine unlabeled datasets that differ in size or in the way documents were sampled, and compared the performance of document classifiers built using different unlabeled datasets and machine learning algorithms. The study shows that LPU is a viable strategy for building a document filtering system, although the performance of the trained classifiers is affected by the choice of unlabeled dataset. The selection of both the machine learning algorithm and the unlabeled documents is therefore critical in constructing an effective LPU-based system.
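As a rough illustration of the LPU setting described in the abstract, the sketch below trains a small naive Bayes classifier that simply treats every unlabeled document as a negative example, a common baseline for positive-unlabeled learning. This is not the authors' method: the abstract does not name the algorithms that were compared, and the function names (`train_pu_naive_bayes`, `score`) and the toy sentences are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption made for this sketch)."""
    return text.lower().split()


def train_pu_naive_bayes(positive_docs, unlabeled_docs):
    """Fit a two-class multinomial naive Bayes model.

    Baseline PU assumption: every unlabeled document is treated as a
    negative, even though some unlabeled documents are really positives.
    """
    pos_counts = Counter(t for d in positive_docs for t in tokenize(d))
    unl_counts = Counter(t for d in unlabeled_docs for t in tokenize(d))
    return {
        "vocab": set(pos_counts) | set(unl_counts),
        "pos_counts": pos_counts,
        "unl_counts": unl_counts,
        "pos_total": sum(pos_counts.values()),
        "unl_total": sum(unl_counts.values()),
        "log_prior": math.log(len(positive_docs) / len(unlabeled_docs)),
    }


def score(model, text):
    """Log-odds that `text` looks like the positive set rather than the unlabeled set."""
    v = len(model["vocab"])
    total = model["log_prior"]
    for token in tokenize(text):
        if token not in model["vocab"]:
            continue  # ignore words never seen during training
        # Laplace (add-one) smoothing for both classes
        p_pos = (model["pos_counts"][token] + 1) / (model["pos_total"] + v)
        p_unl = (model["unl_counts"][token] + 1) / (model["unl_total"] + v)
        total += math.log(p_pos / p_unl)
    return total


# Hypothetical toy data, not drawn from LSAT or GeneRIF.
positive = [
    "alternative splicing generates two isoforms",
    "exon skipping produces a splice variant",
]
unlabeled = [
    "the gene is expressed in liver tissue",
    "the protein binds dna in the nucleus",
    "alternative splicing of this gene yields isoforms",  # a hidden positive
    "mutation is associated with disease risk",
]

model = train_pu_naive_bayes(positive, unlabeled)
relevant = score(model, "a novel splice variant from exon skipping")
irrelevant = score(model, "protein expressed in liver")
```

With this toy data, the sentence about exon skipping receives a positive log-odds score and the sentence about liver expression a negative one. Because the unlabeled pool contains a hidden positive, words such as "splicing" are also counted as evidence for the negative class, which hints at why the composition of the unlabeled set can affect classifier performance.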

Original language: English (US)
Title of host publication: 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011
Pages: 530-537
Number of pages: 8
State: Published - 2011
Event: 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011 - Atlanta, GA, United States
Duration: Nov 12 2011 to Nov 15 2011

Keywords

  • Alternative Splicing
  • Document Retrieval
  • LPU

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics
  • Health Information Management
