Using synthetic clinical data to train an HMM-based POS tagger

Benjamin C. Knoll; Genevieve B. Melton; Hongfang Liu; Hua Xu; Serguei V.S. Pakhomov

doi:10.1109/BHI.2016.7455882

Digital Health Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

The accuracy of part of speech (POS) tagging reported in medical natural language processing (NLP) literature is typically very high when training and testing data sets are from the same domain and have similar characteristics, but is lower when these differ. This presents a problem for clinical NLP, where it is difficult to obtain large corpora of training data suitable for localized tasks. We experimented with implementing the TnT POS tagger and training it on a manually tagged small corpus of publicly available synthetic clinical reports supplemented with widely used public corpora (GENIA and Penn Treebank). We describe this implementation and report the evaluation results on MiPACQ, a large corpus of manually tagged clinical text. Our tagger achieves accuracy comparable to POS taggers trained on large amounts of real clinical data (91-93%). This demonstrates that medical NLP developers do not need to rely on large restricted resources for POS tagging.

Original language	English (US)
Title of host publication	3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	252-255
Number of pages	4
ISBN (Electronic)	9781509024551
DOIs	https://doi.org/10.1109/BHI.2016.7455882
State	Published - Apr 18 2016
Event	3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016 - Las Vegas, United States; Duration: Feb 24 2016 → Feb 27 2016

Publication series

Name	3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016

Other

Other	3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
Country/Territory	United States
City	Las Vegas
Period	2/24/16 → 2/27/16

ASJC Scopus subject areas

Health Informatics
Health Information Management

Cite this

Knoll, B. C., Melton, G. B., Liu, H., Xu, H., & Pakhomov, S. V. S. (2016). Using synthetic clinical data to train an HMM-based POS tagger. In 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016 (pp. 252-255). Article 7455882 (3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BHI.2016.7455882

Using synthetic clinical data to train an HMM-based POS tagger. / Knoll, Benjamin C.; Melton, Genevieve B.; Liu, Hongfang et al.
3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016. Institute of Electrical and Electronics Engineers Inc., 2016. p. 252-255 7455882 (3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016).

Knoll, BC, Melton, GB, Liu, H, Xu, H & Pakhomov, SVS 2016, Using synthetic clinical data to train an HMM-based POS tagger. in 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016, 7455882, 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016, Institute of Electrical and Electronics Engineers Inc., pp. 252-255, 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016, Las Vegas, United States, 2/24/16. https://doi.org/10.1109/BHI.2016.7455882

Knoll BC, Melton GB, Liu H, Xu H, Pakhomov SVS. Using synthetic clinical data to train an HMM-based POS tagger. In 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016. Institute of Electrical and Electronics Engineers Inc. 2016. p. 252-255. 7455882. (3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016). doi: 10.1109/BHI.2016.7455882

Knoll, Benjamin C.; Melton, Genevieve B.; Liu, Hongfang et al. / Using synthetic clinical data to train an HMM-based POS tagger. 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016. Institute of Electrical and Electronics Engineers Inc., 2016. pp. 252-255 (3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016).

@inproceedings{d416f7b83f004f1bb23fc51ad7231f1a,
  title = "Using synthetic clinical data to train an {HMM}-based {POS} tagger",
  abstract = "The accuracy of part of speech (POS) tagging reported in medical natural language processing (NLP) literature is typically very high when training and testing data sets are from the same domain and have similar characteristics, but is lower when these differ. This presents a problem for clinical NLP, where it is difficult to obtain large corpora of training data suitable for localized tasks. We experimented with implementing the TnT POS tagger and training it on a manually tagged small corpus of publicly available synthetic clinical reports supplemented with widely used public corpora (GENIA and Penn Treebank). We describe this implementation and report the evaluation results on MiPACQ, a large corpus of manually tagged clinical text. Our tagger achieves accuracy comparable to POS taggers trained on large amounts of real clinical data (91-93\%). This demonstrates that medical NLP developers do not need to rely on large restricted resources for POS tagging.",
  author = "Knoll, {Benjamin C.} and Melton, {Genevieve B.} and Liu, Hongfang and Xu, Hua and Pakhomov, {Serguei V.S.}",
  note = "Publisher Copyright: {\textcopyright} 2016 IEEE.; 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016 ; Conference date: 24-02-2016 Through 27-02-2016",
  year = "2016",
  month = apr,
  day = "18",
  doi = "10.1109/BHI.2016.7455882",
  language = "English (US)",
  series = "3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
  pages = "252--255",
  booktitle = "3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016",
}

TY  - GEN
T1  - Using synthetic clinical data to train an HMM-based POS tagger
AU  - Knoll, Benjamin C.
AU  - Melton, Genevieve B.
AU  - Liu, Hongfang
AU  - Xu, Hua
AU  - Pakhomov, Serguei V.S.
PY  - 2016/4/18
Y1  - 2016/4/18
N2  - The accuracy of part of speech (POS) tagging reported in medical natural language processing (NLP) literature is typically very high when training and testing data sets are from the same domain and have similar characteristics, but is lower when these differ. This presents a problem for clinical NLP, where it is difficult to obtain large corpora of training data suitable for localized tasks. We experimented with implementing the TnT POS tagger and training it on a manually tagged small corpus of publicly available synthetic clinical reports supplemented with widely used public corpora (GENIA and Penn Treebank). We describe this implementation and report the evaluation results on MiPACQ, a large corpus of manually tagged clinical text. Our tagger achieves accuracy comparable to POS taggers trained on large amounts of real clinical data (91-93%). This demonstrates that medical NLP developers do not need to rely on large restricted resources for POS tagging.
AB  - The accuracy of part of speech (POS) tagging reported in medical natural language processing (NLP) literature is typically very high when training and testing data sets are from the same domain and have similar characteristics, but is lower when these differ. This presents a problem for clinical NLP, where it is difficult to obtain large corpora of training data suitable for localized tasks. We experimented with implementing the TnT POS tagger and training it on a manually tagged small corpus of publicly available synthetic clinical reports supplemented with widely used public corpora (GENIA and Penn Treebank). We describe this implementation and report the evaluation results on MiPACQ, a large corpus of manually tagged clinical text. Our tagger achieves accuracy comparable to POS taggers trained on large amounts of real clinical data (91-93%). This demonstrates that medical NLP developers do not need to rely on large restricted resources for POS tagging.
UR  - http://www.scopus.com/inward/record.url?scp=84968548082&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84968548082&partnerID=8YFLogxK
U2  - 10.1109/BHI.2016.7455882
DO  - 10.1109/BHI.2016.7455882
M3  - Conference contribution
AN  - SCOPUS:84968548082
T3  - 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
SP  - 252
EP  - 255
BT  - 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
PB  - Institute of Electrical and Electronics Engineers Inc.
T2  - 3rd IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2016
Y2  - 24 February 2016 through 27 February 2016
ER  - 
