Quantitative Assessment of Dictionary-based Protein Named Entity Tagging

Hongfang D Liu, Zhang Z. Hu, Manabu Torii, Cathy Wu, Carol Friedman

Research output: Contribution to journal › Article

25 Citations (Scopus)

Abstract

Objective: Natural language processing (NLP) approaches have been explored to manage and mine information recorded in the biological literature. A critical step in biological literature mining is biological named entity tagging (BNET), which identifies names mentioned in text and normalizes them to entries in biological databases. The aim of this study was to provide a quantitative assessment of the complexity of BNET for protein entities using BioThesaurus, a thesaurus of gene/protein names for UniProt Knowledgebase (UniProtKB) entries that was acquired from online resources.

Methods: We evaluated the complexity from several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text that are included in the thesaurus). We also normalized the names in BioThesaurus and obtained each measure twice, once before normalization and once after.

Results: The current version of BioThesaurus has over 2.6 million names (2.1 million after normalization) covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), the average ambiguity is 2.31 before normalization and 2.32 after, and the coverage is 94.0% on the BioCreAtive data set, which comprises MEDLINE abstracts containing genes/proteins.

Conclusion: The study indicated that gene/protein names are highly ambiguous and that a single gene or protein usually has multiple names. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.
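The three measures defined in the abstract can be illustrated on a toy thesaurus. The sketch below is not the authors' implementation: the name/entry pairs, the entry identifiers (E1 to E4), the `normalize` rule (lowercasing and dropping non-alphanumeric characters), and the choice to average ambiguity per name, synonymy per entry, and coverage per mention are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import defaultdict

# Toy thesaurus as (name, entry_id) pairs. Names and IDs are invented
# for illustration; they are not taken from BioThesaurus.
PAIRS = [
    ("p53", "E1"),
    ("P53", "E1"),
    ("TP53", "E1"),
    ("tumor protein p53", "E1"),
    ("p53", "E2"),
    ("CAT", "E3"),
    ("catalase", "E3"),
    ("cat", "E4"),
]


def normalize(name):
    """Hypothetical normalization: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def identity(name):
    """No-op used for the 'before normalization' measurements."""
    return name


def build_indexes(pairs, norm=identity):
    """Return (name -> set of entries, entry -> set of names)."""
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for name, entry in pairs:
        key = norm(name)
        name_to_entries[key].add(entry)
        entry_to_names[entry].add(key)
    return name_to_entries, entry_to_names


def average_ambiguity(name_to_entries):
    """Mean number of entries represented by one name."""
    return sum(len(e) for e in name_to_entries.values()) / len(name_to_entries)


def average_synonymy(entry_to_names):
    """Mean number of distinct names associated with one entry."""
    return sum(len(n) for n in entry_to_names.values()) / len(entry_to_names)


def coverage(mentions, name_to_entries, norm=identity):
    """Fraction of text mentions whose (normalized) form is in the thesaurus."""
    found = sum(1 for m in mentions if norm(m) in name_to_entries)
    return found / len(mentions)


# Hypothetical mentions extracted from text.
MENTIONS = ["p53", "Catalase", "BRCA1"]

raw_n2e, raw_e2n = build_indexes(PAIRS)
norm_n2e, norm_e2n = build_indexes(PAIRS, normalize)

print("ambiguity:", average_ambiguity(raw_n2e), "->", average_ambiguity(norm_n2e))
print("synonymy: ", average_synonymy(raw_e2n), "->", average_synonymy(norm_e2n))
print("coverage: ", coverage(MENTIONS, raw_n2e), "->",
      coverage(MENTIONS, norm_n2e, normalize))
```

On this toy data, normalization merges spelling variants such as "p53" and "P53", so synonymy falls, while merging "CAT" and "cat" makes one name point to two entries, so ambiguity rises. That is the same direction of change the abstract reports, although the magnitudes here carry no meaning.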

Original language: English (US)
Pages (from-to): 497-507
Number of pages: 11
Journal: Journal of the American Medical Informatics Association
Volume: 13
Issue number: 5
DOI: 10.1197/jamia.M2085
State: Published - Sep 2006
Externally published: Yes

Fingerprint

Names
Proteins
Controlled Vocabulary
Knowledge Bases
Natural Language Processing
MEDLINE
Databases

ASJC Scopus subject areas

  • Medicine (all)

Cite this

Quantitative Assessment of Dictionary-based Protein Named Entity Tagging. / Liu, Hongfang D; Hu, Zhang Z.; Torii, Manabu; Wu, Cathy; Friedman, Carol.

In: Journal of the American Medical Informatics Association, Vol. 13, No. 5, 09.2006, p. 497-507.

@article{4f62206f696941f8981072a19c321455,
title = "Quantitative Assessment of Dictionary-based Protein Named Entity Tagging",
author = "Liu, {Hongfang D} and Hu, {Zhang Z.} and Manabu Torii and Cathy Wu and Carol Friedman",
year = "2006",
month = "9",
doi = "10.1197/jamia.M2085",
language = "English (US)",
volume = "13",
pages = "497--507",
journal = "Journal of the American Medical Informatics Association : JAMIA",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "5",

}

TY - JOUR

T1 - Quantitative Assessment of Dictionary-based Protein Named Entity Tagging

AU - Liu, Hongfang D

AU - Hu, Zhang Z.

AU - Torii, Manabu

AU - Wu, Cathy

AU - Friedman, Carol

PY - 2006/9

Y1 - 2006/9

UR - http://www.scopus.com/inward/record.url?scp=33747894083&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33747894083&partnerID=8YFLogxK

U2 - 10.1197/jamia.M2085

DO - 10.1197/jamia.M2085

M3 - Article

C2 - 16799122

AN - SCOPUS:33747894083

VL - 13

SP - 497

EP - 507

JO - Journal of the American Medical Informatics Association : JAMIA

JF - Journal of the American Medical Informatics Association : JAMIA

SN - 1067-5027

IS - 5

ER -