Evaluating gene/protein name tagging and mapping for article retrieval

Chong Min Lee; Manabu Torii; Jinesh Shah; Yi Ting Tsai; Zhang Zhi Hu; Hongfang Liu

Digital Health Sciences

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

Background: Tagging gene/protein names in text and mapping them to database entries are critical tasks in biological literature mining. Most existing tagging and normalization approaches, however, have not been evaluated for practical use in article retrieval towards efficient biocuration.

Results: Using the literature cross-reference information provided by the NCBI Entrez Gene database, we found that the coverage of gene/protein databases with respect to gene/protein names found in text is around 94%. The upper bound of the recall in retrieving MEDLINE citations by gene/protein names is around 70-80% when citations cross-referenced by many genes are ignored and flexible matching of names is used. Of the genes/proteins that failed to be retrieved by name, over 30% of the failures are attributable to citations that do not discuss the cross-referenced genes/proteins in their abstracts, and around 60% are attributable to the gene/protein name tagging system trained on the BioCreAtIvE II gene mention corpus.

Conclusions: The study demonstrates that existing gene/protein databases have decent coverage of the gene/protein names used in MEDLINE abstracts. Approaches and data resources for gene/protein tagging and mapping need to be selected appropriately for individual practical tasks.
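The recall figure in the abstract depends on what counts as a name match. The sketch below is not the authors' implementation; the abstract does not specify the matching rules. It assumes one plausible reading of "flexible matching" (case-insensitive, with optional spaces, hyphens, underscores, or slashes between the letter and digit runs of a name) and computes recall over (citation, gene) cross-reference pairs. All function names, identifiers, and example data are hypothetical.

```python
import re

# Assumption: separators that may optionally appear between parts of a name.
SEP = r"[\s\-_/]*"


def name_pattern(name):
    """Build a flexible regex for a gene/protein name, or None if it has no ASCII letters or digits."""
    # Split into runs of letters and runs of digits, so "IL2" and "IL-2" yield the same parts.
    parts = re.findall(r"[a-z]+|[0-9]+", name.lower())
    if not parts:
        return None
    body = SEP.join(re.escape(p) for p in parts)
    # Require a non-alphanumeric boundary on both sides, so "p53" does not match inside "tp53".
    return re.compile(r"(?<![a-z0-9])" + body + r"(?![a-z0-9])")


def mentions(text, names):
    """Return True if any of the names matches somewhere in the text."""
    lowered = text.lower()
    for name in names:
        pattern = name_pattern(name)
        if pattern is not None and pattern.search(lowered):
            return True
    return False


def recall(abstracts, cross_refs, gene_names):
    """Fraction of (citation, gene) cross-reference pairs recoverable by name matching."""
    total = 0
    found = 0
    for pmid, gene_ids in cross_refs.items():
        for gene_id in gene_ids:
            total += 1
            if mentions(abstracts.get(pmid, ""), gene_names.get(gene_id, [])):
                found += 1
    return found / total if total else 0.0


# Made-up toy data, for illustration only.
EXAMPLE_ABSTRACTS = {
    1: "IL-2 signalling is required for T cell growth.",
    2: "We study tumour suppression in mice.",
}
EXAMPLE_CROSS_REFS = {1: {"G1"}, 2: {"G2"}}
EXAMPLE_GENE_NAMES = {"G1": ["IL2", "interleukin 2"], "G2": ["p53", "TP53"]}

print(recall(EXAMPLE_ABSTRACTS, EXAMPLE_CROSS_REFS, EXAMPLE_GENE_NAMES))  # 0.5
```

In the toy data, the second citation never names its cross-referenced gene in the abstract, which mirrors one of the failure causes the abstract reports. This sketch ignores non-ASCII characters such as Greek letters, which a real matcher would need to handle.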

Original language	English (US)
Pages (from-to)	104-109
Number of pages	6
Journal	CEUR Workshop Proceedings
Volume	714
State	Published - 2010
Event	4th International Symposium on Semantic Mining in Biomedicine, SMBM 2010, Cambridge, United Kingdom, Oct 25-26, 2010

ASJC Scopus subject areas

General Computer Science

Cite this

@article{79537d8d19fc4e0d96f6f3b8a1ef4653,
  title = "Evaluating gene/protein name tagging and mapping for article retrieval",
  abstract = "Background: Tagging gene/protein names in text and mapping them to database entries are critical tasks in biological literature mining. Most existing tagging and normalization approaches, however, have not been evaluated for practical use in article retrieval towards efficient biocuration. Results: Using the literature cross-reference information provided by the NCBI Entrez Gene database, we found that the coverage of gene/protein databases with respect to gene/protein names found in text is around 94%. The upper bound of the recall in retrieving MEDLINE citations by gene/protein names is around 70-80% when citations cross-referenced by many genes are ignored and flexible matching of names is used. Of the genes/proteins that failed to be retrieved by name, over 30% of the failures are attributable to citations that do not discuss the cross-referenced genes/proteins in their abstracts, and around 60% are attributable to the gene/protein name tagging system trained on the BioCreAtIvE II gene mention corpus. Conclusions: The study demonstrates that existing gene/protein databases have decent coverage of the gene/protein names used in MEDLINE abstracts. Approaches and data resources for gene/protein tagging and mapping need to be selected appropriately for individual practical tasks.",
  author = "Lee, {Chong Min} and Manabu Torii and Jinesh Shah and Tsai, {Yi Ting} and Hu, {Zhang Zhi} and Hongfang Liu",
  year = "2010",
  language = "English (US)",
  volume = "714",
  pages = "104--109",
  journal = "CEUR Workshop Proceedings",
  issn = "1613-0073",
  publisher = "CEUR-WS",
  note = "4th International Symposium on Semantic Mining in Biomedicine, SMBM 2010 ; Conference date: 25-10-2010 Through 26-10-2010",
}

TY  - JOUR
T1  - Evaluating gene/protein name tagging and mapping for article retrieval
AU  - Lee, Chong Min
AU  - Torii, Manabu
AU  - Shah, Jinesh
AU  - Tsai, Yi Ting
AU  - Hu, Zhang Zhi
AU  - Liu, Hongfang
PY  - 2010
Y1  - 2010
N2  - Background: Tagging gene/protein names in text and mapping them to database entries are critical tasks in biological literature mining. Most existing tagging and normalization approaches, however, have not been evaluated for practical use in article retrieval towards efficient biocuration. Results: Using the literature cross-reference information provided by the NCBI Entrez Gene database, we found that the coverage of gene/protein databases with respect to gene/protein names found in text is around 94%. The upper bound of the recall in retrieving MEDLINE citations by gene/protein names is around 70-80% when citations cross-referenced by many genes are ignored and flexible matching of names is used. Of the genes/proteins that failed to be retrieved by name, over 30% of the failures are attributable to citations that do not discuss the cross-referenced genes/proteins in their abstracts, and around 60% are attributable to the gene/protein name tagging system trained on the BioCreAtIvE II gene mention corpus. Conclusions: The study demonstrates that existing gene/protein databases have decent coverage of the gene/protein names used in MEDLINE abstracts. Approaches and data resources for gene/protein tagging and mapping need to be selected appropriately for individual practical tasks.
AB  - Background: Tagging gene/protein names in text and mapping them to database entries are critical tasks in biological literature mining. Most existing tagging and normalization approaches, however, have not been evaluated for practical use in article retrieval towards efficient biocuration. Results: Using the literature cross-reference information provided by the NCBI Entrez Gene database, we found that the coverage of gene/protein databases with respect to gene/protein names found in text is around 94%. The upper bound of the recall in retrieving MEDLINE citations by gene/protein names is around 70-80% when citations cross-referenced by many genes are ignored and flexible matching of names is used. Of the genes/proteins that failed to be retrieved by name, over 30% of the failures are attributable to citations that do not discuss the cross-referenced genes/proteins in their abstracts, and around 60% are attributable to the gene/protein name tagging system trained on the BioCreAtIvE II gene mention corpus. Conclusions: The study demonstrates that existing gene/protein databases have decent coverage of the gene/protein names used in MEDLINE abstracts. Approaches and data resources for gene/protein tagging and mapping need to be selected appropriately for individual practical tasks.
UR  - http://www.scopus.com/inward/record.url?scp=84874272372&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84874272372&partnerID=8YFLogxK
M3  - Conference article
AN  - SCOPUS:84874272372
SN  - 1613-0073
VL  - 714
SP  - 104
EP  - 109
JO  - CEUR Workshop Proceedings
JF  - CEUR Workshop Proceedings
T2  - 4th International Symposium on Semantic Mining in Biomedicine, SMBM 2010
Y2  - 25 October 2010 through 26 October 2010
ER  -
