TY - JOUR
T1 - Detecting concept mentions in biomedical text using hidden Markov model
T2 - Multiple concept types at once or one at a time?
AU - Torii, Manabu
AU - Wagholikar, Kavishwar
AU - Liu, Hongfang
N1 - Publisher Copyright: © 2014 Torii et al.; licensee BioMed Central Ltd.
PY - 2014/1/17
Y1 - 2014/1/17
N2 - Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or it may be built for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance. Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those observed for the latter by 0.9 to 2.6% on the i2b2/VA corpus and 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy. Conclusions: The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. This also suggests that we should annotate multiple concept types in developing a new corpus for machine learning models. Further investigation is expected to gain insights in the underlying mechanism to achieve good performance when multiple concept types are considered.
KW - Data mining
KW - Electronic health records
KW - Information storage and retrieval
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=84920719766&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84920719766&partnerID=8YFLogxK
U2 - 10.1186/2041-1480-5-3
DO - 10.1186/2041-1480-5-3
M3 - Article
AN - SCOPUS:84920719766
VL - 5
JO - Journal of Biomedical Semantics
JF - Journal of Biomedical Semantics
SN - 2041-1480
IS - 1
M1 - 3
ER -