TY - GEN
T1 - Learning from positive and unlabeled documents for automated detection of alternative splicing sentences in MEDLINE abstracts
AU - Chen, Yang
AU - Torii, Manabu
AU - Lu, Chang Tien
AU - Liu, Hongfang
PY - 2011/12/1
Y1 - 2011/12/1
AB - Alternative splicing is considered to be a key factor underlying increased cellular and functional complexity in higher eukaryotes. With the advance of high-throughput genomics technologies, it has become critical to mine alternative splicing knowledge from the biological research literature. Meanwhile, many papers have been published on DNA splicing and translation, and it is time-consuming to find those specifically relevant to alternative splicing. Observing that documents reporting alternative splicing can be obtained from existing knowledge bases that record literature evidence, and that a large number of unlabeled documents are freely available, we investigated learning from positive and unlabeled data (LPU) for retrieving papers relevant to alternative splicing. The positive documents are from Literature Support for Alternative Transcripts (LSAT) and the unlabeled documents are obtained from Gene Reference Into Function (GeneRIF). We generated nine unlabeled datasets that differ in size or in the way documents were sampled, and compared the performance of document classifiers built using different unlabeled datasets and machine learning algorithms. The study shows that LPU is a viable strategy for building a document filtering system, while the performance of trained classifiers is affected by the choice of the unlabeled dataset. The selection of machine learning algorithms and of unlabeled documents would be critical in constructing an effective LPU-based system.
KW - Alternative Splicing
KW - Document Retrieval
KW - LPU
UR - http://www.scopus.com/inward/record.url?scp=84862957370&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84862957370&partnerID=8YFLogxK
U2 - 10.1109/BIBMW.2011.6112425
DO - 10.1109/BIBMW.2011.6112425
M3 - Conference contribution
AN - SCOPUS:84862957370
SN - 9781457716133
T3 - 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011
SP - 530
EP - 537
BT - 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011
T2 - 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011
Y2 - 12 November 2011 through 15 November 2011
ER -