Gene name ambiguity of eukaryotic nomenclatures

Lifeng Chen; Hongfang Liu; Carol Friedman

doi:10.1093/bioinformatics/bth496

Gene name ambiguity of eukaryotic nomenclatures

Lifeng Chen, Hongfang Liu, Carol Friedman

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

106 Scopus citations

Abstract

Motivation: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. 'Recognition' can be accomplished using pattern matching and machine learning. But for 'identification' these techniques are not adequate. In order to identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and also unique identifiers so that the extracted entities are well defined. Online organism databases are an excellent resource to create such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper, we explore the extent of the problem and suggest ways to address it. Results: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intraspecies ambiguity substantially and full names contributed greatly to gene-medical-term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45 000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional 'gene' instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.

Original language	English (US)
Pages (from-to)	248-256
Number of pages	9
Journal	Bioinformatics
Volume	21
Issue number	2
DOIs	https://doi.org/10.1093/bioinformatics/bth496
State	Published - Jan 15 2005

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Access to Document

10.1093/bioinformatics/bth496

Cite this

@article{8cde9a3c8be3440fb839ce362f9b154b,

title = "Gene name ambiguity of eukaryotic nomenclatures",

abstract = "Motivation: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. 'Recognition' can be accomplished using pattern matching and machine learning. But for 'identification' these techniques are not adequate. In order to identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and also unique identifiers so that the extracted entities are well defined. Online organism databases are an excellent resource to create such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper, we explore the extent of the problem and suggest ways to address it. Results: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intraspecies ambiguity substantially and full names contributed greatly to gene-medical-term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45 000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional 'gene' instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.",

author = "Lifeng Chen and Hongfang Liu and Carol Friedman",

note = "Funding Information: We would like to thank Dr Judy Blake for her insight concerning the nomenclature for the mouse model and gene naming conventions. The work was supported by grants LM7659 from the NLM and EIA-031 from the NSF.",

year = "2005",

month = jan,

day = "15",

doi = "10.1093/bioinformatics/bth496",

language = "English (US)",

volume = "21",

pages = "248--256",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "2",

}

TY - JOUR

T1 - Gene name ambiguity of eukaryotic nomenclatures

AU - Chen, Lifeng

AU - Liu, Hongfang

AU - Friedman, Carol

N1 - Funding Information: We would like to thank Dr Judy Blake for her insight concerning the nomenclature for the mouse model and gene naming conventions. The work was supported by grants LM7659 from the NLM and EIA-031 from the NSF.

PY - 2005/1/15

Y1 - 2005/1/15

N2 - Motivation: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. 'Recognition' can be accomplished using pattern matching and machine learning. But for 'identification' these techniques are not adequate. In order to identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and also unique identifiers so that the extracted entities are well defined. Online organism databases are an excellent resource to create such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper, we explore the extent of the problem and suggest ways to address it. Results: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intraspecies ambiguity substantially and full names contributed greatly to gene-medical-term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45 000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional 'gene' instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.

AB - Motivation: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. 'Recognition' can be accomplished using pattern matching and machine learning. But for 'identification' these techniques are not adequate. In order to identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and also unique identifiers so that the extracted entities are well defined. Online organism databases are an excellent resource to create such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper, we explore the extent of the problem and suggest ways to address it. Results: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intraspecies ambiguity substantially and full names contributed greatly to gene-medical-term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45 000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional 'gene' instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.

UR - http://www.scopus.com/inward/record.url?scp=13444256580&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=13444256580&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bth496

DO - 10.1093/bioinformatics/bth496

M3 - Article

C2 - 15333458

AN - SCOPUS:13444256580

SN - 1367-4803

VL - 21

SP - 248

EP - 256

JO - Bioinformatics

JF - Bioinformatics

IS - 2

ER -

Gene name ambiguity of eukaryotic nomenclatures

Abstract

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this