TY - JOUR
T1 - A deep representation empowered distant supervision paradigm for clinical information extraction
AU - Wang, Yanshan
AU - Sohn, Sunghwan
AU - Liu, Sijia
AU - Shen, Feichen
AU - Wang, Liwei
AU - Atkinson, Elizabeth J.
AU - Amin, Shreyasee
AU - Liu, Hongfang D
N1 - Publisher Copyright:
Copyright © 2018, The Authors. All rights reserved.
PY - 2018/4/20
Y1 - 2018/4/20
N2 - Objective: To automatically create large labeled training datasets and reduce the effort of feature engineering for training accurate machine learning models for clinical information extraction. Materials and Methods: We propose a distant supervision paradigm empowered by deep representation for extracting information from clinical text. In this paradigm, rule-based NLP algorithms are used to generate weak labels and create large training datasets automatically. Additionally, we use pre-trained word embeddings as the deep representation to eliminate the need for task-specific feature engineering for machine learning. We evaluated the effectiveness of the proposed paradigm on two clinical information extraction tasks: smoking status extraction and proximal femur (hip) fracture extraction. We tested three prevalent machine learning models, namely, Convolutional Neural Networks (CNN), Support Vector Machine (SVM), and Random Forest (RF). Results: The results indicate that CNN is the best fit for the proposed distant supervision paradigm. Given large datasets, it outperforms the rule-based NLP algorithms by capturing additional extraction patterns. We also verified the advantage of the word embedding feature representation in the paradigm over term frequency-inverse document frequency (tf-idf) and topic modeling representations. Discussion: In the clinical domain, the limited amount of labeled data is always a bottleneck for applying machine learning. Additionally, the performance of machine learning approaches depends heavily on task-specific feature engineering. The proposed paradigm could alleviate those problems by leveraging rule-based NLP algorithms to automatically assign weak labels and by eliminating the need for task-specific feature engineering through the word embedding feature representation.
KW - Clinical information extraction
KW - Distant supervision
KW - Electronic health records
KW - Machine learning
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85094257777&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85094257777&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85094257777
JO - American Journal of Physiology - Renal Fluid and Electrolyte Physiology
JF - American Journal of Physiology - Renal Fluid and Electrolyte Physiology
SN - 1931-857X
ER -