Using synthetic clinical data to train an HMM-based POS tagger

Benjamin C. Knoll, Genevieve B. Melton, Hongfang D. Liu, Hua Xu, Serguei V. S. Pakhomov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

The accuracy of part-of-speech (POS) tagging reported in the medical natural language processing (NLP) literature is typically very high when the training and test data sets come from the same domain and have similar characteristics, but lower when they differ. This presents a problem for clinical NLP, where it is difficult to obtain large corpora of training data suitable for localized tasks. We implemented the TnT POS tagger and trained it on a small, manually tagged corpus of publicly available synthetic clinical reports, supplemented with widely used public corpora (GENIA and Penn Treebank). We describe this implementation and report evaluation results on MiPACQ, a large corpus of manually tagged clinical text. Our tagger achieves accuracy comparable to POS taggers trained on large amounts of real clinical data (91-93%). This demonstrates that medical NLP developers do not need to rely on large restricted resources for POS tagging.
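The abstract gives no implementation details, so the following is only a minimal sketch of the general idea behind an HMM-based POS tagger: count tag transitions and word emissions from tagged text, then decode with the Viterbi algorithm. It is not the authors' implementation. As simplifying assumptions, it uses bigram tag transitions and add-one smoothing, whereas TnT as published uses trigram transitions with interpolation smoothing and suffix-based handling of unknown words. The function names and the three toy sentences are hypothetical and purely illustrative.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from sentences of (word, tag) pairs."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of lowercased word
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word.lower()] += 1
            vocab.add(word.lower())
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (a simplification, not TnT's smoothing)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for words under the bigram HMM."""
    if not words:
        return []
    transitions, emissions, vocab = model
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserves mass for unknown words
    # best[i][tag] = (log score of best path ending in tag at position i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_tags)
                 + log_prob(emissions[tag], words[0].lower(), n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i].lower(), n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )
    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

def accuracy(model, gold_sentences):
    """Token-level accuracy of the tagger against gold (word, tag) sentences."""
    correct = total = 0
    for sentence in gold_sentences:
        predicted = viterbi([w for w, _ in sentence], model)
        correct += sum(p == t for p, (_, t) in zip(predicted, sentence))
        total += len(sentence)
    return correct / total

# Toy, made-up training sentences with Penn Treebank-style tags (illustration only).
toy_corpus = [
    [("The", "DT"), ("patient", "NN"), ("denies", "VBZ"), ("pain", "NN")],
    [("The", "DT"), ("patient", "NN"), ("reports", "VBZ"), ("nausea", "NN")],
    [("Patient", "NN"), ("denies", "VBZ"), ("fever", "NN")],
]
model = train_hmm(toy_corpus)
print(viterbi(["The", "patient", "reports", "pain"], model))
```

In a setup like the one the abstract describes, `train_hmm` would be given the combined training corpora and `accuracy` would be computed on a separate held-out clinical corpus, so that the reported figure reflects cross-domain performance rather than fit to the training text.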

Original language: English (US)
Title of host publication: 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 252-255
Number of pages: 4
ISBN (Print): 9781509024551
DOI: https://doi.org/10.1109/BHI.2016.7455882
State: Published - Apr 18 2016
Event: 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016 - Las Vegas, United States
Duration: Feb 24 2016 to Feb 27 2016

ASJC Scopus subject areas

  • Health Informatics
  • Health Information Management

Cite this

Knoll, B. C., Melton, G. B., Liu, H. D., Xu, H., & Pakhomov, S. V. S. (2016). Using synthetic clinical data to train an HMM-based POS tagger. In 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016 (pp. 252-255). [7455882] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BHI.2016.7455882