BioTagger-GM: A Gene/Protein Name Recognition System

Manabu Torii; Zhangzhi Hu; Cathy H. Wu; Hongfang Liu

doi:10.1197/jamia.M2844

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

44 Scopus citations

Abstract

Objectives: Biomedical named entity recognition (BNER) is a critical component of automated systems that mine biomedical knowledge from free text. Among the entity types in the domain, gene/protein names are probably the most studied in BNER. Our goal is to develop a gene/protein name recognition system, BioTagger-GM, that exploits rich information in terminology sources using powerful machine learning frameworks and system combination. Design: BioTagger-GM consists of four main components: (1) dictionary lookup: gene/protein names in BioThesaurus and biomedical terms in the UMLS Metathesaurus are tagged in text; (2) machine learning: machine learning systems are trained using dictionary lookup results as one type of feature; (3) post-processing: heuristic rules are used to correct recognition errors; and (4) system combination: a voting scheme is used to combine recognition results from multiple systems. Measurements: The BioCreAtIvE II Gene Mention (GM) corpus was used to evaluate the proposed method. To test its general applicability, the method was also evaluated on the JNLPBA corpus modified for gene/protein name recognition. The performance of the systems was evaluated through cross-validation tests and measured using precision, recall, and F-Measure. Results: BioTagger-GM achieved an F-Measure of 0.8887 on the BioCreAtIvE II GM corpus, which is higher than that of the first-place system in the BioCreAtIvE II challenge. The applicability of the method was also confirmed on the modified JNLPBA corpus. Conclusion: The results suggest that terminology sources, powerful machine learning frameworks, and system combination can be integrated to build an effective BNER system.
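
As a rough illustration of the system-combination and measurement steps named in the abstract, the following Python sketch combines the outputs of several taggers by voting and scores the result with precision, recall, and F-Measure. The abstract does not specify the voting rule or the matching criterion, so the threshold vote, the exact-match span comparison, the `(start, end)` span representation, and the function names `vote` and `prf` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def vote(system_outputs, min_votes):
    """Keep a span if at least `min_votes` systems proposed it.

    Assumption: a simple threshold vote; the paper's actual scheme may differ.
    """
    counts = Counter(span for spans in system_outputs for span in set(spans))
    return {span for span, n in counts.items() if n >= min_votes}


def prf(gold, predicted):
    """Exact-match precision, recall, and F-Measure over sets of spans.

    Assumption: a predicted span counts only if it matches a gold span exactly.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: spans found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f_measure


# Hypothetical character-offset spans for one sentence.
gold = {(0, 4), (10, 15), (20, 28)}
outputs = [
    {(0, 4), (10, 15)},
    {(0, 4), (20, 28), (30, 33)},
    {(0, 4), (10, 15), (30, 33)},
]
combined = vote(outputs, min_votes=2)
precision, recall, f_measure = prf(gold, combined)
```

Raising `min_votes` generally trades recall for precision, which is one reason a combined system can outperform each of its members on F-Measure.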

Original language	English (US)
Pages (from-to)	247-255
Number of pages	9
Journal	Journal of the American Medical Informatics Association
Volume	16
Issue number	2
DOIs	https://doi.org/10.1197/jamia.M2844
State	Published - Mar 2009

ASJC Scopus subject areas

Health Informatics

Cite this

@article{3658596484864d82888b213318eeb2b7,

title = "BioTagger-GM: A Gene/Protein Name Recognition System",

author = "Torii, Manabu and Hu, Zhangzhi and Wu, {Cathy H.} and Liu, Hongfang",

year = "2009",

month = mar,

doi = "10.1197/jamia.M2844",

language = "English (US)",

volume = "16",

pages = "247--255",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "2",

}

TY - JOUR

T1 - BioTagger-GM

T2 - A Gene/Protein Name Recognition System

AU - Torii, Manabu

AU - Hu, Zhangzhi

AU - Wu, Cathy H.

AU - Liu, Hongfang

PY - 2009/3

Y1 - 2009/3

AB - Objectives: Biomedical named entity recognition (BNER) is a critical component of automated systems that mine biomedical knowledge from free text. Among the entity types in the domain, gene/protein names are probably the most studied in BNER. Our goal is to develop a gene/protein name recognition system, BioTagger-GM, that exploits rich information in terminology sources using powerful machine learning frameworks and system combination. Design: BioTagger-GM consists of four main components: (1) dictionary lookup: gene/protein names in BioThesaurus and biomedical terms in the UMLS Metathesaurus are tagged in text; (2) machine learning: machine learning systems are trained using dictionary lookup results as one type of feature; (3) post-processing: heuristic rules are used to correct recognition errors; and (4) system combination: a voting scheme is used to combine recognition results from multiple systems. Measurements: The BioCreAtIvE II Gene Mention (GM) corpus was used to evaluate the proposed method. To test its general applicability, the method was also evaluated on the JNLPBA corpus modified for gene/protein name recognition. The performance of the systems was evaluated through cross-validation tests and measured using precision, recall, and F-Measure. Results: BioTagger-GM achieved an F-Measure of 0.8887 on the BioCreAtIvE II GM corpus, which is higher than that of the first-place system in the BioCreAtIvE II challenge. The applicability of the method was also confirmed on the modified JNLPBA corpus. Conclusion: The results suggest that terminology sources, powerful machine learning frameworks, and system combination can be integrated to build an effective BNER system.

UR - http://www.scopus.com/inward/record.url?scp=60549093731&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=60549093731&partnerID=8YFLogxK

U2 - 10.1197/jamia.M2844

DO - 10.1197/jamia.M2844

M3 - Article

C2 - 19074302

AN - SCOPUS:60549093731

SN - 1067-5027

VL - 16

SP - 247

EP - 255

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 2

ER -
