Gene symbol disambiguation using knowledge-based profiles

Hua Xu; Jung Wei Fan; George Hripcsak; Eneida A. Mendonça; Marianthi Markatou; Carol Friedman

doi:10.1093/bioinformatics/btm056

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

37 Scopus citations

Abstract

Motivation: The ambiguity of biomedical entities, particularly of gene symbols, is a big challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information concerning the characteristics of a particular gene that could be used to disambiguate gene symbols.

Results: For each gene, we create a profile with different types of information automatically extracted from related MEDLINE abstracts and readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task via an information retrieval method, which ranks the similarity scores between the context where the ambiguous gene is mentioned and the candidate gene profiles. The gene profile with the highest similarity score is then chosen as the correct sense. We evaluated the method on three automatically generated testing sets of mouse, fly and yeast organisms, respectively. The method achieved the highest precision of 93.9% for the mouse, 77.8% for the fly and 89.5% for the yeast.
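
The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure or term weighting the authors used, so this assumes cosine similarity over raw term counts. The function names, gene identifiers, and profile texts are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context, profiles):
    """Rank candidate gene profiles against the context of an ambiguous symbol.

    `profiles` maps a candidate gene identifier to its profile text.
    Returns the best-scoring identifier and the full ranking.
    Assumption: cosine similarity on raw counts stands in for the
    unspecified information retrieval method.
    """
    if not profiles:
        raise ValueError("at least one candidate profile is required")
    ctx = Counter(tokenize(context))
    ranked = sorted(
        ((gene_id, cosine(ctx, Counter(tokenize(text))))
         for gene_id, text in profiles.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[0][0], ranked


# Hypothetical toy data: two candidate senses for one ambiguous symbol.
toy_profiles = {
    "gene_A": "cell cycle kinase phosphorylation mitosis",
    "gene_B": "membrane transport ion channel neuron",
}
toy_context = "the kinase phosphorylates its substrate during cell cycle progression"
best, ranking = disambiguate(toy_context, toy_profiles)
```

Here `best` is the identifier of the profile that shares the most weighted vocabulary with the context. A realistic system would build each profile from the knowledge sources named in the abstract rather than from a single hand-written string.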

Original language	English (US)
Pages (from-to)	1015-1022
Number of pages	8
Journal	Bioinformatics
Volume	23
Issue number	8
DOIs	https://doi.org/10.1093/bioinformatics/btm056
State	Published - Apr 15 2007

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{e4244640037b47b18d344d3a87f134db,

title = "Gene symbol disambiguation using knowledge-based profiles",

abstract = "Motivation: The ambiguity of biomedical entities, particularly of gene symbols, is a big challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information concerning the characteristics of a particular gene that could be used to disambiguate gene symbols. Results: For each gene, we create a profile with different types of information automatically extracted from related MEDLINE abstracts and readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task via an information retrieval method, which ranks the similarity scores between the context where the ambiguous gene is mentioned, and candidate gene profiles. The gene profile with the highest similarity score is then chosen as the correct sense. We evaluated the method on three automatically generated testing sets of mouse, fly and yeast organisms, respectively. The method achieved the highest precision of 93.9\% for the mouse, 77.8\% for the fly and 89.5\% for the yeast.",

author = "Hua Xu and Fan, {Jung Wei} and George Hripcsak and Mendon{\c c}a, {Eneida A.} and Marianthi Markatou and Carol Friedman",

note = "Funding Information: This work was supported in part by Grants R01 LM7659, R01 LM8635 from the National Library of Medicine, and Grants NSF-IIS-0430743, NSF-DMS-0504957 from the National Science Foundation. We would like to thank Lyudmila Shagina for providing technical support. Funding to pay the Open Access publication charges was provided by the National Library of Medicine (grant LM8635).",

year = "2007",

month = apr,

day = "15",

doi = "10.1093/bioinformatics/btm056",

language = "English (US)",

volume = "23",

pages = "1015--1022",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "8",

}

TY - JOUR

T1 - Gene symbol disambiguation using knowledge-based profiles

AU - Xu, Hua

AU - Fan, Jung Wei

AU - Hripcsak, George

AU - Mendonça, Eneida A.

AU - Markatou, Marianthi

AU - Friedman, Carol

N1 - Funding Information: This work was supported in part by Grants R01 LM7659, R01 LM8635 from the National Library of Medicine, and Grants NSF-IIS-0430743, NSF-DMS-0504957 from the National Science Foundation. We would like to thank Lyudmila Shagina for providing technical support. Funding to pay the Open Access publication charges was provided by the National Library of Medicine (grant LM8635).

PY - 2007/4/15

Y1 - 2007/4/15

N2 - Motivation: The ambiguity of biomedical entities, particularly of gene symbols, is a big challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information concerning the characteristics of a particular gene that could be used to disambiguate gene symbols. Results: For each gene, we create a profile with different types of information automatically extracted from related MEDLINE abstracts and readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task via an information retrieval method, which ranks the similarity scores between the context where the ambiguous gene is mentioned, and candidate gene profiles. The gene profile with the highest similarity score is then chosen as the correct sense. We evaluated the method on three automatically generated testing sets of mouse, fly and yeast organisms, respectively. The method achieved the highest precision of 93.9% for the mouse, 77.8% for the fly and 89.5% for the yeast.

AB - Motivation: The ambiguity of biomedical entities, particularly of gene symbols, is a big challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information concerning the characteristics of a particular gene that could be used to disambiguate gene symbols. Results: For each gene, we create a profile with different types of information automatically extracted from related MEDLINE abstracts and readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task via an information retrieval method, which ranks the similarity scores between the context where the ambiguous gene is mentioned, and candidate gene profiles. The gene profile with the highest similarity score is then chosen as the correct sense. We evaluated the method on three automatically generated testing sets of mouse, fly and yeast organisms, respectively. The method achieved the highest precision of 93.9% for the mouse, 77.8% for the fly and 89.5% for the yeast.

UR - http://www.scopus.com/inward/record.url?scp=34249717333&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34249717333&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btm056

DO - 10.1093/bioinformatics/btm056

M3 - Article

C2 - 17314123

AN - SCOPUS:34249717333

SN - 1367-4803

VL - 23

SP - 1015

EP - 1022

JO - Bioinformatics

JF - Bioinformatics

IS - 8

ER -
