Biological nomenclatures: a source of lexical knowledge and ambiguity.

O. Tuason; L. Chen; H. Liu; J. A. Blake; C. Friedman

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

56 Scopus citations

Abstract

There has been increased work in developing automated systems that use natural language processing (NLP) to recognize and extract genomic information from the literature. Recognition and identification of biological entities is a critical step in this process. NLP systems generally rely on nomenclatures and ontological specifications as resources for determining the names of the entities, assigning semantic categories that are consistent with the corresponding ontology, and assigning identifiers that map to well-defined entities within a particular nomenclature. Although nomenclatures and ontologies are valuable for text processing systems, they were developed to aid researchers and are heterogeneous in structure and semantics. A uniform resource that is automatically generated from diverse resources and designed for NLP purposes would be a useful tool for the field and would further database interoperability. This paper presents work towards this goal. We have automatically created lexical resources from four model organism nomenclature systems (mouse, fly, worm, and yeast) and have studied the performance of the resources within an existing NLP system, GENIES. Using nomenclatures is not straightforward because issues concerning ambiguity, synonymy, and name variations are quite challenging. In this paper we focus mainly on ambiguity. We determined that the proportion of ambiguous gene names within the individual nomenclatures, across the four nomenclatures, and with general English ranged from 0% to 10.18%, 1.187% to 20.30%, and 0% to 2.49%, respectively. When actually processing text, we found the rate of ambiguous occurrences (not counting ambiguities stemming from English words) to range from 2.4% to 32.9%, depending on the organisms considered.
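The three ambiguity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not give exact definitions, so the sketch assumes that a name is ambiguous within a nomenclature when it maps to more than one identifier, ambiguous across nomenclatures when it also occurs in another organism's lexicon, and ambiguous with English when it appears in a word list. The function names, the case-insensitive matching, and all names and identifiers in the toy data are hypothetical.

```python
from collections import defaultdict


def within_ambiguity(lexicon):
    """Fraction of distinct names that map to more than one identifier.

    `lexicon` is a list of (name, identifier) pairs for one organism.
    Case-insensitive matching is an assumption of this sketch.
    """
    ids_by_name = defaultdict(set)
    for name, gene_id in lexicon:
        ids_by_name[name.lower()].add(gene_id)
    if not ids_by_name:
        return 0.0
    ambiguous = sum(1 for ids in ids_by_name.values() if len(ids) > 1)
    return ambiguous / len(ids_by_name)


def cross_ambiguity(lexicons, organism):
    """Fraction of one organism's names that also occur for another organism."""
    own = {name.lower() for name, _ in lexicons[organism]}
    others = {
        name.lower()
        for other, lexicon in lexicons.items()
        if other != organism
        for name, _ in lexicon
    }
    return len(own & others) / len(own) if own else 0.0


def english_ambiguity(lexicon, english_words):
    """Fraction of distinct names that are also in an English word list."""
    names = {name.lower() for name, _ in lexicon}
    return len(names & english_words) / len(names) if names else 0.0


# Invented toy data: these are placeholders, not real gene names or identifiers.
lexicons = {
    "mouse": [("abc1", "M:1"), ("abc1", "M:2"), ("xyz", "M:3"), ("wee", "M:4")],
    "fly": [("xyz", "F:1"), ("foo", "F:2")],
}
english = {"wee", "and"}

for organism, lexicon in lexicons.items():
    print(
        organism,
        f"within={within_ambiguity(lexicon):.2%}",
        f"across={cross_ambiguity(lexicons, organism):.2%}",
        f"english={english_ambiguity(lexicon, english):.2%}",
    )
```

These ratios are computed over distinct names in a lexicon. The occurrence-level rate reported for running text would instead be counted over name mentions in a corpus, which this sketch does not attempt.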

Original language	English (US)
Pages (from-to)	238-249
Number of pages	12
Journal	Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
State	Published - 2004

ASJC Scopus subject areas

General Medicine

Cite this

@article{e51e4303636d486d9fd5ec58040dcb50,

title = "Biological nomenclatures: a source of lexical knowledge and ambiguity.",

abstract = "There has been increased work in developing automated systems that involve natural language processing (NLP) to recognize and extract genomic information from the literature. Recognition and identification of biological entities is a critical step in this process. NLP systems generally rely on nomenclatures and ontological specifications as resources for determining the names of the entities, assigning semantic categories that are consistent with the corresponding ontology, and assignment of identifiers that map to well-defined entities within a particular nomenclature. Although nomenclatures and ontologies are valuable for text processing systems, they were developed to aid researchers and are heterogeneous in structure and semantics. A uniform resource that is automatically generated from diverse resources, and that is designed for NLP purposes would be a useful tool for the field, and would further database interoperability. This paper presents work towards this goal. We have automatically created lexical resources from four model organism nomenclature systems (mouse, fly, worm, and yeast), and have studied performance of the resources within an existing NLP system, GENIES. Using nomenclatures is not straightforward because issues concerning ambiguity, synonymy, and name variations are quite challenging. In this paper we focus mainly on ambiguity. We determined that the number of ambiguous gene names within the individual nomenclatures, across the four nomenclatures, and with general English ranged from 0%-10.18%, 1.187%-20.30%, and 0%-2.49% respectively. When actually processing text, we found the rate of ambiguous occurrences (not counting ambiguities stemming from English words) to range from 2.4%-32.9% depending on the organisms considered.",

author = "Tuason, O. and Chen, L. and Liu, H. and Blake, {J. A.} and Friedman, C.",

year = "2004",

language = "English (US)",

pages = "238--249",

journal = "Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing",

issn = "2335-6936",

publisher = "World Scientific Publishing Co., Inc.",

}

TY - JOUR

T1 - Biological nomenclatures

T2 - a source of lexical knowledge and ambiguity.

AU - Tuason, O.

AU - Chen, L.

AU - Liu, H.

AU - Blake, J. A.

AU - Friedman, C.

PY - 2004

Y1 - 2004

N2 - There has been increased work in developing automated systems that involve natural language processing (NLP) to recognize and extract genomic information from the literature. Recognition and identification of biological entities is a critical step in this process. NLP systems generally rely on nomenclatures and ontological specifications as resources for determining the names of the entities, assigning semantic categories that are consistent with the corresponding ontology, and assignment of identifiers that map to well-defined entities within a particular nomenclature. Although nomenclatures and ontologies are valuable for text processing systems, they were developed to aid researchers and are heterogeneous in structure and semantics. A uniform resource that is automatically generated from diverse resources, and that is designed for NLP purposes would be a useful tool for the field, and would further database interoperability. This paper presents work towards this goal. We have automatically created lexical resources from four model organism nomenclature systems (mouse, fly, worm, and yeast), and have studied performance of the resources within an existing NLP system, GENIES. Using nomenclatures is not straightforward because issues concerning ambiguity, synonymy, and name variations are quite challenging. In this paper we focus mainly on ambiguity. We determined that the number of ambiguous gene names within the individual nomenclatures, across the four nomenclatures, and with general English ranged from 0%-10.18%, 1.187%-20.30%, and 0%-2.49% respectively. When actually processing text, we found the rate of ambiguous occurrences (not counting ambiguities stemming from English words) to range from 2.4%-32.9% depending on the organisms considered.

UR - http://www.scopus.com/inward/record.url?scp=2442654362&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=2442654362&partnerID=8YFLogxK

M3 - Article

C2 - 14992507

AN - SCOPUS:2442654362

SN - 2335-6936

SP - 238

EP - 249

JO - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

JF - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

ER -
