Detecting concept mentions in biomedical text using hidden Markov model: Multiple concept types at once or one at a time?

Manabu Torii; Kavishwar Wagholikar; Hongfang Liu

doi:10.1186/2041-1480-5-3

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance.

Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy.

Conclusions: The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. This also suggests that we should annotate multiple concept types in developing a new corpus for machine learning models. Further investigation is expected to gain insights in the underlying mechanism to achieve good performance when multiple concept types are considered.
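The difference between the two training strategies can be illustrated with a minimal sketch of how the training labels might be prepared. This is not the authors' implementation: it assumes a BIO tagging scheme (which the abstract does not specify), and the function name `project_labels`, the example sentence, and the type names `problem` and `treatment` are hypothetical choices made for illustration. Under all-types-at-once, the model sees the full label sequence; under one-type-at-a-time, labels for every other concept type are collapsed to `O` before training.

```python
from typing import List, Set


def project_labels(tags: List[str], keep: Set[str]) -> List[str]:
    """Keep BIO tags whose concept type is in `keep`; map all others to 'O'.

    Assumes tags look like 'B-<type>', 'I-<type>', or 'O' (an assumption,
    not something stated in the abstract).
    """
    projected = []
    for tag in tags:
        if tag == "O":
            projected.append("O")
            continue
        # Split only on the first hyphen so type names may contain hyphens.
        _prefix, concept_type = tag.split("-", 1)
        projected.append(tag if concept_type in keep else "O")
    return projected


# Hypothetical annotated sentence with two concept types.
tokens = ["chest", "pain", "treated", "with", "aspirin"]
all_types = ["B-problem", "I-problem", "O", "O", "B-treatment"]

# One-type-at-a-time: train a separate model on each projected sequence.
problem_only = project_labels(all_types, {"problem"})
treatment_only = project_labels(all_types, {"treatment"})

# All-types-at-once: keeping every type leaves the labels unchanged.
everything = project_labels(all_types, {"problem", "treatment"})
```

In the one-type setting, each model never observes the labels of the other types, which is one plausible place where the boundary and type-confusion differences reported in the abstract could arise.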

Original language	English (US)
Article number	3
Journal	Journal of Biomedical Semantics
Volume	5
Issue number	1
DOIs	https://doi.org/10.1186/2041-1480-5-3
State	Published - Jan 17 2014

Keywords

Data mining
Electronic health records
Information storage and retrieval
Natural language processing

ASJC Scopus subject areas

Information Systems
Computer Science Applications
Health Informatics
Computer Networks and Communications

Cite this

@article{13dff57451964ffa94562214213174a4,

title = "Detecting concept mentions in biomedical text using hidden Markov model: Multiple concept types at once or one at a time?",

abstract = "Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance. Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy. Conclusions: The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. This also suggests that we should annotate multiple concept types in developing a new corpus for machine learning models. Further investigation is expected to gain insights in the underlying mechanism to achieve good performance when multiple concept types are considered.",

keywords = "Data mining, Electronic health records, Information storage and retrieval, Natural language processing",

author = "Manabu Torii and Kavishwar Wagholikar and Hongfang Liu",

note = "Publisher Copyright: {\textcopyright} 2014 Torii et al.; licensee BioMed Central Ltd.",

year = "2014",

month = jan,

day = "17",

doi = "10.1186/2041-1480-5-3",

language = "English (US)",

volume = "5",

journal = "Journal of Biomedical Semantics",

issn = "2041-1480",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Detecting concept mentions in biomedical text using hidden Markov model

T2 - Multiple concept types at once or one at a time?

AU - Torii, Manabu

AU - Wagholikar, Kavishwar

AU - Liu, Hongfang

PY - 2014/1/17

Y1 - 2014/1/17

AB - Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance. Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy. Conclusions: The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. This also suggests that we should annotate multiple concept types in developing a new corpus for machine learning models. Further investigation is expected to gain insights in the underlying mechanism to achieve good performance when multiple concept types are considered.

KW - Data mining

KW - Electronic health records

KW - Information storage and retrieval

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=84920719766&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84920719766&partnerID=8YFLogxK

U2 - 10.1186/2041-1480-5-3

DO - 10.1186/2041-1480-5-3

M3 - Article

AN - SCOPUS:84920719766

SN - 2041-1480

VL - 5

JO - Journal of Biomedical Semantics

JF - Journal of Biomedical Semantics

IS - 1

M1 - 3

ER -