Abstract
Background: Tagging gene/protein names in text and mapping them to database entries are critical tasks in biological literature mining. Most existing tagging and normalization approaches, however, have not been evaluated for practical use in article retrieval to support efficient biocuration. Results: Using the literature cross-reference information provided by the NCBI Entrez Gene database, we found that the coverage of gene/protein databases with respect to gene/protein names found in text is around 94%. The upper bound of recall in retrieving MEDLINE citations by gene/protein names is around 70-80% when citations cross-referred by many genes are overlooked and flexible matching of names is used. Of the genes/proteins that could not be retrieved by name, over 30% were missed because the citations do not discuss the cross-referred genes/proteins in their abstracts, and around 60% were missed by the gene/protein name tagging system trained on the BioCreAtIvE II gene mention corpus. Conclusions: The study demonstrates that existing gene/protein databases have decent coverage of the gene/protein names used in MEDLINE abstracts. Approaches and data resources for gene/protein tagging and mapping need to be selected appropriately for each practical task.
Original language | English (US) |
---|---|
Pages (from-to) | 104-109 |
Number of pages | 6 |
Journal | CEUR Workshop Proceedings |
Volume | 714 |
State | Published - 2010 |
Event | 4th International Symposium on Semantic Mining in Biomedicine, SMBM 2010 - Cambridge, United Kingdom. Duration: Oct 25 2010 → Oct 26 2010 |
ASJC Scopus subject areas
- General Computer Science