Integrating information retrieval with distant supervision for Gene Ontology annotation

Dongqing Zhu; Dingcheng Li; Ben Carterette; Hongfang Liu

doi:10.1093/database/bau087

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system.
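As a rough illustration of the exact-match F1 score reported above, the sketch below computes F1 over sets of (gene, GO term) pairs. It assumes predictions and gold annotations are compared as exact pairs; the function name `f1_exact` and the sample identifiers are hypothetical, and the official BioCreative IV evaluation (including the relaxed and hierarchical variants) may differ in detail.

```python
def f1_exact(predicted, gold):
    """Exact-match F1 between two collections of (gene, GO term) pairs.

    Illustrative only: assumes a prediction counts as correct only if
    the identical pair appears in the gold annotations.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty gold
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical identifiers, used only to exercise the function.
predicted_pairs = {("gene1", "GO:0000001"), ("gene2", "GO:0000002")}
gold_pairs = {("gene1", "GO:0000001"), ("gene3", "GO:0000003")}
print(f1_exact(predicted_pairs, gold_pairs))
```

With one of two predictions correct and one of two gold pairs recovered, precision and recall are both 0.5, so the harmonic mean is also 0.5.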

Original language	English (US)
Journal	Database
Volume	2014
DOIs	https://doi.org/10.1093/database/bau087
State	Published - 2014

ASJC Scopus subject areas

Information Systems
General Biochemistry, Genetics and Molecular Biology
General Agricultural and Biological Sciences

Cite this

@article{4d4eddb3c3c9401f8831506b3e9ebecf,

title = "Integrating information retrieval with distant supervision for Gene Ontology annotation",

abstract = "This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system.",

author = "Dongqing Zhu and Dingcheng Li and Ben Carterette and Hongfang Liu",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2014. Published by Oxford University Press.",

year = "2014",

doi = "10.1093/database/bau087",

language = "English (US)",

volume = "2014",

journal = "Database",

issn = "1758-0463",

publisher = "Oxford University Press",

}

TY - JOUR

T1 - Integrating information retrieval with distant supervision for Gene Ontology annotation

AU - Zhu, Dongqing

AU - Li, Dingcheng

AU - Carterette, Ben

AU - Liu, Hongfang

PY - 2014

Y1 - 2014

N2 - This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system.

AB - This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system.

UR - http://www.scopus.com/inward/record.url?scp=84996542167&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84996542167&partnerID=8YFLogxK

U2 - 10.1093/database/bau087

DO - 10.1093/database/bau087

M3 - Article

C2 - 25183856

AN - SCOPUS:84996542167

SN - 1758-0463

VL - 2014

JO - Database

JF - Database

ER -
