TY - JOUR
T1 - Are synthetic clinical notes useful for real natural language processing tasks
T2 - A case study on clinical entity recognition
AU - Li, Jianfu
AU - Zhou, Yujia
AU - Jiang, Xiaoqian
AU - Natarajan, Karthik
AU - Pakhomov, Serguei V. S.
AU - Liu, Hongfang
AU - Xu, Hua
N1 - Publisher Copyright: © 2021 The Author(s).
PY - 2021/10/1
Y1 - 2021/10/1
N2 - Objective: Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose to develop methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. Materials and Methods: We implemented 4 state-of-the-art text generation models, namely CharRNN, SeqGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities for 500 randomly selected History and Present Illness notes generated from the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora. Results: Our evaluation shows GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (with a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 showed slightly better performance on 2 independent corpora (strict F1 scores of 0.709 and 0.748, respectively) than NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we demonstrated that an augmented method that combines both natural and synthetic corpora achieved better performance than one that uses the natural corpus only. Conclusions: Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering privacy concerns and increasing data availability. Further investigation is needed to apply this technology to practice.
AB - Objective: Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose to develop methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. Materials and Methods: We implemented 4 state-of-the-art text generation models, namely CharRNN, SeqGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities for 500 randomly selected History and Present Illness notes generated from the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora. Results: Our evaluation shows GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (with a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 showed slightly better performance on 2 independent corpora (strict F1 scores of 0.709 and 0.748, respectively) than NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we demonstrated that an augmented method that combines both natural and synthetic corpora achieved better performance than one that uses the natural corpus only. Conclusions: Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering privacy concerns and increasing data availability. Further investigation is needed to apply this technology to practice.
KW - clinical notes
KW - named entity recognition
KW - natural language processing
KW - neural language model
KW - text generation
UR - http://www.scopus.com/inward/record.url?scp=85116958829&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85116958829&partnerID=8YFLogxK
U2 - 10.1093/jamia/ocab112
DO - 10.1093/jamia/ocab112
M3 - Article
C2 - 34272955
AN - SCOPUS:85116958829
SN - 1067-5027
VL - 28
SP - 2193
EP - 2201
JO - Journal of the American Medical Informatics Association
JF - Journal of the American Medical Informatics Association
IS - 10
ER -