A deep representation empowered distant supervision paradigm for clinical information extraction

Wang Yanshan, Sunghwan Sohn, Sijia Liu, Feichen Shen, Liwei Wang, Elizabeth J. Atkinson, Shreyasee Amin, Hongfang D Liu

Research output: Contribution to journalArticlepeer-review


Objective: To automatically create large labeled training datasets and reduce the efforts of feature engineering for training accurate machine learning models for clinical information extraction. Materials and Methods: We propose a distant supervision paradigm empowered by deep representation for extracting information from clinical text. In this paradigm, the rule-based NLP algorithms are utilized to generate weak labels and create large training datasets automatically. Additionally, we use pre-trained word embeddings as deep representation to eliminate the need of task-specific feature engineering for machine learning. We evaluated the effectiveness of the proposed paradigm on two clinical information extraction tasks: smoking status extraction and proximal femur (hip) fracture extraction. We tested three prevalent machine learning models, namely, Convolutional Neural Networks (CNN), Support Vector Machine (SVM), and Random Forrest (RF). Results: The results indicate that CNN is the best fit to the proposed distant supervision paradigm. It outperforms the rule-based NLP algorithms given large datasets by capturing additional extraction patterns. We also verified the advantage of word embedding feature representation in the paradigm over term frequency-inverse document frequency (tf-idf) and topic modeling representations. Discussion: In the clinical domain, the limited amount of labeled data is always a bottleneck for applying machine learning. Additionally, the performance of machine learning approaches highly depends on task-specific feature engineering. The proposed paradigm could alleviate those problems by leveraging rule-based NLP algorithms to automatically assign weak labels and eliminating the need of task-specific feature engineering using word embedding feature representation.

Original languageEnglish (US)
JournalUnknown Journal
StatePublished - Apr 20 2018


  • Clinical information extraction
  • Distant supervision
  • Electronic health records
  • Machine learning
  • Natural language processing

ASJC Scopus subject areas

  • General

Fingerprint Dive into the research topics of 'A deep representation empowered distant supervision paradigm for clinical information extraction'. Together they form a unique fingerprint.

Cite this