Learning from positive and unlabeled documents for automated detection of alternative splicing sentences in MEDLINE abstracts

Yang Chen; Manabu Torii; Chang Tien Lu; Hongfang Liu

doi:10.1109/BIBMW.2011.6112425

Digital Health Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

Alternative splicing is considered a key factor underlying increased cellular and functional complexity in higher eukaryotes. With the advance of high-throughput genomics technologies, it has become critical to mine alternative splicing knowledge from the biological research literature. However, many papers have been published on DNA splicing and translation, and finding those specifically relevant to alternative splicing is time-consuming. Observing that documents reporting alternative splicing can be obtained from existing knowledge bases that record literature evidence, and that a large number of unlabeled documents are freely available, we investigated learning from positive and unlabeled data (LPU) for retrieving papers relevant to alternative splicing. The positive documents come from Literature Support for Alternative Transcripts (LSAT), and the unlabeled documents are obtained from Gene Reference Into Function (GeneRIF). We generated nine unlabeled datasets that differ in size or in the way documents were sampled, and compared the performance of document classifiers built using different unlabeled datasets and machine learning algorithms. The study shows that LPU is a viable strategy for building a document filtering system, although the performance of the trained classifiers is affected by the choice of unlabeled dataset. Selecting both the machine learning algorithm and the unlabeled documents would be critical in constructing an effective LPU-based system.
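
To make the LPU setting concrete, the following is a minimal sketch of one common baseline: treat every unlabeled document as a provisional negative and train an ordinary classifier. It is not the authors' method; the abstract does not say which algorithms or features were used. The multinomial naive Bayes model, the whitespace tokenizer, the function names, and the toy sentences are all assumptions made for illustration, and no LSAT or GeneRIF data is involved.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def train_pu_naive_bayes(positive_docs, unlabeled_docs):
    """Fit multinomial naive Bayes, treating unlabeled docs as negatives.

    This is the simplest PU baseline: hidden positives inside the unlabeled
    set act as label noise, which is one reason the choice of unlabeled
    data can change classifier performance.
    """
    counts = {1: Counter(), 0: Counter()}
    for doc in positive_docs:
        counts[1].update(tokenize(doc))
    for doc in unlabeled_docs:
        counts[0].update(tokenize(doc))
    vocab = set(counts[1]) | set(counts[0])
    n_docs = len(positive_docs) + len(unlabeled_docs)
    priors = {1: len(positive_docs) / n_docs, 0: len(unlabeled_docs) / n_docs}
    totals = {label: sum(counts[label].values()) for label in (0, 1)}
    return {"counts": counts, "vocab": vocab, "priors": priors, "totals": totals}


def score(model, text):
    """Return the log-odds of the positive class; above 0 favors positive."""
    vocab_size = len(model["vocab"])
    log_odds = math.log(model["priors"][1]) - math.log(model["priors"][0])
    for token in tokenize(text):
        if token not in model["vocab"]:
            continue  # ignore words never seen in training
        # Laplace (add-one) smoothing keeps unseen class counts non-zero.
        p_pos = (model["counts"][1][token] + 1) / (model["totals"][1] + vocab_size)
        p_neg = (model["counts"][0][token] + 1) / (model["totals"][0] + vocab_size)
        log_odds += math.log(p_pos / p_neg)
    return log_odds


# Toy data (hypothetical sentences, not drawn from LSAT or GeneRIF).
positive_docs = [
    "alternative splicing generates two isoforms of the gene",
    "exon skipping produces an alternatively spliced transcript variant",
]
unlabeled_docs = [
    "the protein is phosphorylated by the kinase",
    "gene expression is upregulated in tumor tissue",
    "alternative splicing of exon 5 yields a short isoform",  # hidden positive
    "the receptor binds its ligand at the membrane",
]

model = train_pu_naive_bayes(positive_docs, unlabeled_docs)
relevant_score = score(model, "exon skipping produces a spliced transcript variant")
irrelevant_score = score(model, "the kinase binds the receptor")
print(relevant_score > 0, irrelevant_score < 0)
```

The third unlabeled sentence is deliberately a hidden positive: it pulls words such as "splicing" and "exon" toward the negative class, which illustrates in miniature why the composition of the unlabeled set matters.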

Original language	English (US)
Title of host publication	2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011
Pages	530-537
Number of pages	8
DOIs	https://doi.org/10.1109/BIBMW.2011.6112425
State	Published - 2011
Event	2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011 - Atlanta, GA, United States. Duration: Nov 12 2011 → Nov 15 2011

Publication series

Name	2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011

Keywords

Alternative Splicing
Document Retrieval
LPU

ASJC Scopus subject areas

Biomedical Engineering
Health Informatics
Health Information Management

Cite this

Chen, Y., Torii, M., Lu, C. T., & Liu, H. (2011). Learning from positive and unlabeled documents for automated detection of alternative splicing sentences in MEDLINE abstracts. In 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011 (pp. 530-537). Article 6112425 (2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011). https://doi.org/10.1109/BIBMW.2011.6112425

@inproceedings{dbde6359f86444ee975d48aefcc6f082,

title = "Learning from positive and unlabeled documents for automated detection of alternative splicing sentences in MEDLINE abstracts",

keywords = "Alternative Splicing, Document Retrieval, LPU",

author = "Yang Chen and Manabu Torii and Lu, {Chang Tien} and Hongfang Liu",

year = "2011",

doi = "10.1109/BIBMW.2011.6112425",

language = "English (US)",

isbn = "9781457716133",

series = "2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011",

pages = "530--537",

booktitle = "2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011",

note = "2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011 ; Conference date: 12-11-2011 Through 15-11-2011",

}

TY - GEN

T1 - Learning from positive and unlabeled documents for automated detection of alternative splicing sentences in MEDLINE abstracts

AU - Chen, Yang

AU - Torii, Manabu

AU - Lu, Chang Tien

AU - Liu, Hongfang

PY - 2011

Y1 - 2011

KW - Alternative Splicing

KW - Document Retrieval

KW - LPU

UR - http://www.scopus.com/inward/record.url?scp=84862957370&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862957370&partnerID=8YFLogxK

U2 - 10.1109/BIBMW.2011.6112425

DO - 10.1109/BIBMW.2011.6112425

M3 - Conference contribution

AN - SCOPUS:84862957370

SN - 9781457716133

T3 - 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011

SP - 530

EP - 537

BT - 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011

T2 - 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2011

Y2 - 12 November 2011 through 15 November 2011

ER -
