TY - GEN
T1 - Using synthetic clinical data to train an HMM-based POS tagger
AU - Knoll, Benjamin C.
AU - Melton, Genevieve B.
AU - Liu, Hongfang
AU - Xu, Hua
AU - Pakhomov, Serguei V.S.
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/4/18
Y1 - 2016/4/18
N2 - The accuracy of part of speech (POS) tagging reported in medical natural language processing (NLP) literature is typically very high when training and testing data sets are from the same domain and have similar characteristics, but is lower when these differ. This presents a problem for clinical NLP, where it is difficult to obtain large corpora of training data suitable for localized tasks. We experimented with implementing the TnT POS tagger and training it on a manually tagged small corpus of publicly available synthetic clinical reports supplemented with widely used public corpora (GENIA and Penn Treebank). We describe this implementation and report the evaluation results on MiPACQ, a large corpus of manually tagged clinical text. Our tagger achieves accuracy comparable to POS taggers trained on large amounts of real clinical data (91-93%). This demonstrates that medical NLP developers do not need to rely on large restricted resources for POS tagging.
UR - http://www.scopus.com/inward/record.url?scp=84968548082&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84968548082&partnerID=8YFLogxK
U2 - 10.1109/BHI.2016.7455882
DO - 10.1109/BHI.2016.7455882
M3 - Conference contribution
AN - SCOPUS:84968548082
T3 - 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
SP - 252
EP - 255
BT - 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
Y2 - 24 February 2016 through 27 February 2016
ER -