TY - GEN
T1 - Sublanguage Characteristics of Clinical Documents
AU - Moon, Sungrim
AU - He, Huan
AU - Liu, Hongfang
N1 - Funding Information:
The research was supported by National Institutes of Health grant R01LM011934.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Understanding the shared and distinct characteristics of sublanguages in clinical documents through corpus analysis is essential for downstream applications of clinical natural language processing (NLP). Here, we conducted a sublanguage analysis, organized by clinical section, of a corpus consisting of 500,000 clinical documents. We analyzed sublanguage characteristics per practice setting and per document type for the ten most frequent clinical sections. Named entities (NEs) for problem, test, and treatment concepts were extracted using a fine-tuned bio-clinical Bidirectional Encoder Representations from Transformers (BERT) model. Fast clustering using sentence-BERT was applied, and the clustering results for a case study of terms containing 'pain' were visualized using SandDance. Our results confirmed that document types with a narrow scope (e.g., limited evaluation) presented higher term frequencies in diverse disjoint clusters than document types with a broad scope (e.g., Discharge Summary). Family Medicine and Primary Care practice settings presented similar cluster distributions (i.e., the frequent use of similar words co-occurring with 'pain'), implying a similar sublanguage. In contrast, Emergency Medicine showed a distinct sublanguage, with higher term frequencies in disjoint clusters than other practice settings. These findings suggest that analyzing term distribution with respect to different combinations of section, practice setting, and document type provides important information when developing or implementing NLP systems.
KW - clinical documents
KW - clinical section
KW - clustering
KW - document type
KW - named entity recognition
KW - natural language processing
KW - practice setting
KW - sublanguage analysis
UR - http://www.scopus.com/inward/record.url?scp=85146711579&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85146711579&partnerID=8YFLogxK
U2 - 10.1109/BIBM55620.2022.9995620
DO - 10.1109/BIBM55620.2022.9995620
M3 - Conference contribution
AN - SCOPUS:85146711579
T3 - Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022
SP - 3280
EP - 3286
BT - Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022
A2 - Adjeroh, Donald
A2 - Long, Qi
A2 - Shi, Xinghua
A2 - Guo, Fei
A2 - Hu, Xiaohua
A2 - Aluru, Srinivas
A2 - Narasimhan, Giri
A2 - Wang, Jianxin
A2 - Kang, Mingon
A2 - Mondal, Ananda M.
A2 - Liu, Jin
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022
Y2 - 6 December 2022 through 8 December 2022
ER -