TY - JOUR
T1 - Pooling annotated corpora for clinical concept extraction
AU - Wagholikar, Kavishwar B.
AU - Torii, Manabu
AU - Jonnalagadda, Siddhartha R.
AU - Liu, Hongfang
N1 - Funding Information:
We are thankful to the people who developed and made available the machine learning tools: GENIA tagger and Mallet, terminology services: UMLS and the labeled corpora used in this study. The i2b2 corpus of deidentified clinical reports used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2 and SUNY. We are also thankful for the pioneering effort of the Mayo NLP team, for developing the resources which made the current study possible. The authors are thankful for the constructive comments of the reviewers. This study was supported by National Science Foundation ABI:0845523, Strategic Health IT Advanced Research Projects (SHARP) award (90TR0002) to Mayo Clinic from Health and Human Services (HHS) and National Library of Medicine (NLM:1K99LM011389, 5R01LM009959-02) grants.
Publisher Copyright:
© 2013 Wagholikar et al.
PY - 2013/1/8
Y1 - 2013/1/8
N2 - Background: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling with local corpora, for training machine taggers. In this paper we have investigated the latter approach by pooling corpora from 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester, to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system MedTagger that consisted of dictionary lookup, part of speech (POS) tagging and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions. Results: We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling. Conclusions: The effectiveness of pooling corpora, is dependent on several factors, which include compatibility of annotation guidelines, distribution of report types and size of local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora. To facilitate the pooling and reuse of annotated corpora, we suggest that - i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline differences that are partly identified in this paper; ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and iii) metadata such as type of the report should be created during the annotation process.
AB - Background: The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling with local corpora, for training machine taggers. In this paper we have investigated the latter approach by pooling corpora from 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester, to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system MedTagger that consisted of dictionary lookup, part of speech (POS) tagging and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions. Results: We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling. Conclusions: The effectiveness of pooling corpora, is dependent on several factors, which include compatibility of annotation guidelines, distribution of report types and size of local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora. To facilitate the pooling and reuse of annotated corpora, we suggest that - i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline differences that are partly identified in this paper; ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and iii) metadata such as type of the report should be created during the annotation process.
UR - http://www.scopus.com/inward/record.url?scp=84922014807&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84922014807&partnerID=8YFLogxK
U2 - 10.1186/2041-1480-4-3
DO - 10.1186/2041-1480-4-3
M3 - Article
AN - SCOPUS:84922014807
SN - 2041-1480
VL - 4
JO - Journal of Biomedical Semantics
JF - Journal of Biomedical Semantics
IS - 1
M1 - 3
ER -