Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition

Jianfu Li; Yujia Zhou; Xiaoqian Jiang; Karthik Natarajan; Serguei V. S. Pakhomov; Hongfang Liu; Hua Xu

doi:10.1093/jamia/ocab112

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks.

Materials and Methods: We implemented 4 state-of-the-art text generation models, namely CharRNN, SeqGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities in 500 randomly selected History and Present Illness notes generated by the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora.

Results: Our evaluation shows that GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 showed slightly better performance on the 2 independent corpora (strict F1 scores of 0.709 and 0.748, respectively) than NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we demonstrated that an augmented method that combines natural and synthetic corpora achieved better performance than one that uses only the natural corpus.

Conclusions: Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering privacy concerns and increasing data availability. Further investigation is needed to apply this technology in practice.
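The abstract reports strict F1 scores for the NER models. As a rough illustration only, the sketch below shows one common way to compute a strict (exact-match) entity-level F1 from BIO tag sequences. It is not the authors' evaluation code: the function names, the BIO encoding, and the example labels `problem` and `drug` are assumptions made for this example.

```python
from typing import List, Set, Tuple

# An entity is (start_token, end_token_exclusive, label).
# This tuple layout is an assumption made for the sketch.
Entity = Tuple[int, int, str]


def bio_to_entities(tags: List[str]) -> Set[Entity]:
    """Collect (start, end, label) spans from one BIO tag sequence."""
    entities: Set[Entity] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            entities.add((start, i, label))
            start, label = None, None
        if tag != "O":
            # A "B-" tag (or a stray "I-" tag) opens a new span.
            start, label = i, tag[2:]
    return entities


def strict_f1(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Strict F1: a prediction counts only if boundaries and label both match."""
    gold = bio_to_entities(gold_tags)
    pred = bio_to_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels; the predicted "problem" span has the wrong boundary,
# so only the "drug" entity is counted as correct.
gold = ["B-problem", "I-problem", "O", "B-drug"]
pred = ["B-problem", "O", "O", "B-drug"]
print(strict_f1(gold, pred))
```

Under strict matching, a prediction that overlaps a gold entity but misses a boundary earns no credit, which is why such scores tend to be lower than relaxed (overlap-based) scores on the same output.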

Original language	English (US)
Pages (from-to)	2193-2201
Number of pages	9
Journal	Journal of the American Medical Informatics Association
Volume	28
Issue number	10
DOIs	https://doi.org/10.1093/jamia/ocab112
State	Published - Oct 1 2021

Keywords

clinical notes
named entity recognition
natural language processing
neural language model
text generation

ASJC Scopus subject areas

Health Informatics

Access to Document

10.1093/jamia/ocab112

Cite this

@article{685b539dbea6461da48cd91e456dd752,

title = "Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition",

abstract = "Objective: Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. Materials and Methods: We implemented 4 state-of-the-art text generation models, namely CharRNN, SeqGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities in 500 randomly selected History and Present Illness notes generated by the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora. Results: Our evaluation shows that GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 showed slightly better performance on the 2 independent corpora (strict F1 scores of 0.709 and 0.748, respectively) than NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we demonstrated that an augmented method that combines natural and synthetic corpora achieved better performance than one that uses only the natural corpus. Conclusions: Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering privacy concerns and increasing data availability. Further investigation is needed to apply this technology in practice.",

keywords = "clinical notes, named entity recognition, natural language processing, neural language model, text generation",

author = "Jianfu Li and Yujia Zhou and Xiaoqian Jiang and Karthik Natarajan and Pakhomov, {Serguei V. S.} and Hongfang Liu and Hua Xu",

note = "Publisher Copyright: {\textcopyright} 2021 The Author(s).",

year = "2021",

month = oct,

day = "1",

doi = "10.1093/jamia/ocab112",

language = "English (US)",

volume = "28",

pages = "2193--2201",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "10",

}

TY - JOUR

T1 - Are synthetic clinical notes useful for real natural language processing tasks

T2 - A case study on clinical entity recognition

AU - Li, Jianfu

AU - Zhou, Yujia

AU - Jiang, Xiaoqian

AU - Natarajan, Karthik

AU - Pakhomov, Serguei V. S.

AU - Liu, Hongfang

AU - Xu, Hua

PY - 2021/10/1

Y1 - 2021/10/1

AB - Objective: Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. Materials and Methods: We implemented 4 state-of-the-art text generation models, namely CharRNN, SeqGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities in 500 randomly selected History and Present Illness notes generated by the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora. Results: Our evaluation shows that GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 showed slightly better performance on the 2 independent corpora (strict F1 scores of 0.709 and 0.748, respectively) than NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we demonstrated that an augmented method that combines natural and synthetic corpora achieved better performance than one that uses only the natural corpus. Conclusions: Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering privacy concerns and increasing data availability. Further investigation is needed to apply this technology in practice.

KW - clinical notes

KW - named entity recognition

KW - natural language processing

KW - neural language model

KW - text generation

UR - http://www.scopus.com/inward/record.url?scp=85116958829&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85116958829&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocab112

DO - 10.1093/jamia/ocab112

M3 - Article

C2 - 34272955

AN - SCOPUS:85116958829

SN - 1067-5027

VL - 28

SP - 2193

EP - 2201

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 10

ER -
