Evaluating gene/protein name tagging and mapping for article retrieval

Chong Min Lee, Manabu Torii, Jinesh Shah, Yi Ting Tsai, Zhang Zhi Hu, Hongfang D Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Background: Tagging gene/protein names in text and mapping them to database entries are critical tasks in biological literature mining. Most existing tagging and normalization approaches, however, have not been evaluated for practical use in article retrieval to support efficient biocuration.

Results: Using the literature cross-reference information provided by the NCBI Entrez Gene database, we found that the coverage of gene/protein databases with respect to gene/protein names found in text is around 94%. The upper bound of recall in retrieving MEDLINE citations by gene/protein names is around 70-80% when citations cross-referred by many genes are disregarded and flexible name matching is used. Of the genes/proteins that could not be retrieved by name, over 30% are attributable to citations that do not discuss the cross-referred genes/proteins in their abstracts, and around 60% are attributable to the gene/protein name tagging system trained on the BioCreAtIvE II gene mention corpus.

Conclusions: The study demonstrates that existing gene/protein databases have decent coverage of the gene/protein names used in MEDLINE abstracts. Approaches and data resources for gene/protein tagging and mapping need to be selected appropriately for each practical task.
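The recall figure reported in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the normalization rule (lowercasing and dropping non-alphanumeric characters as one possible form of "flexible matching"), and the toy data are all assumptions made for illustration. The gold standard is a set of (citation, gene) cross-references, and a pair counts as retrieved when any database name of the gene matches a name tagged in that citation's abstract.

```python
import re


def normalize(name: str) -> str:
    """Hypothetical flexible-matching rule: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def name_recall(cross_refs, gene_names, tagged_mentions):
    """Fraction of (citation, gene) cross-references recovered by name matching.

    cross_refs      -- iterable of (pmid, gene_id) pairs used as the gold standard
    gene_names      -- dict: gene_id -> list of names/synonyms from a gene database
    tagged_mentions -- dict: pmid -> list of gene/protein names tagged in the abstract
    """
    hits = 0
    total = 0
    for pmid, gene_id in cross_refs:
        total += 1
        mentions = {normalize(m) for m in tagged_mentions.get(pmid, [])}
        names = {normalize(n) for n in gene_names.get(gene_id, [])}
        if mentions & names:  # at least one database name was tagged in the abstract
            hits += 1
    return hits / total if total else 0.0


# Toy data (invented for illustration only).
cross_refs = [(1, "G1"), (1, "G2"), (2, "G1"), (3, "G3")]
gene_names = {
    "G1": ["IL-2", "interleukin 2"],
    "G2": ["TP53", "p53"],
    "G3": ["BRCA1"],
}
tagged_mentions = {
    1: ["IL2", "P53"],
    2: ["Interleukin-2"],
    3: [],  # the abstract never mentions the cross-referred gene
}

recall = name_recall(cross_refs, gene_names, tagged_mentions)
print(f"recall = {recall:.2f}")
```

In this toy setup, citation 3 plays the role of an abstract that does not discuss its cross-referred gene, which is one of the two failure sources the abstract identifies; a tagger that misses a mention would lower recall in the same way.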

Original language: English (US)
Title of host publication: CEUR Workshop Proceedings
Pages: 104-109
Number of pages: 6
Volume: 714
State: Published - 2010
Externally published: Yes
Event: 4th International Symposium on Semantic Mining in Biomedicine, SMBM 2010 - Cambridge, United Kingdom
Duration: Oct 25 2010 to Oct 26 2010

Other

Other: 4th International Symposium on Semantic Mining in Biomedicine, SMBM 2010
Country: United Kingdom
City: Cambridge
Period: 10/25/10 to 10/26/10

Fingerprint

Genes
Proteins

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Lee, C. M., Torii, M., Shah, J., Tsai, Y. T., Hu, Z. Z., & Liu, H. D. (2010). Evaluating gene/protein name tagging and mapping for article retrieval. In CEUR Workshop Proceedings (Vol. 714, pp. 104-109)

@inproceedings{79537d8d19fc4e0d96f6f3b8a1ef4653,
title = "Evaluating gene/protein name tagging and mapping for article retrieval",
author = "Lee, {Chong Min} and Manabu Torii and Jinesh Shah and Tsai, {Yi Ting} and Hu, {Zhang Zhi} and Liu, {Hongfang D}",
year = "2010",
language = "English (US)",
volume = "714",
pages = "104--109",
booktitle = "CEUR Workshop Proceedings",

}

TY - GEN

T1 - Evaluating gene/protein name tagging and mapping for article retrieval

AU - Lee, Chong Min

AU - Torii, Manabu

AU - Shah, Jinesh

AU - Tsai, Yi Ting

AU - Hu, Zhang Zhi

AU - Liu, Hongfang D

PY - 2010

Y1 - 2010

UR - http://www.scopus.com/inward/record.url?scp=84874272372&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84874272372&partnerID=8YFLogxK

M3 - Conference contribution

VL - 714

SP - 104

EP - 109

BT - CEUR Workshop Proceedings

ER -