Classifying the lifestyle status for Alzheimer’s disease from clinical notes using deep learning with weak supervision

Zitao Shen; Dalton Schutte; Yoonkwon Yi; Anusha Bompelli; Fang Yu; Yanshan Wang; Rui Zhang

doi:10.1186/s12911-022-01819-4

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Since no effective therapies exist for Alzheimer’s disease (AD), prevention through lifestyle changes and interventions has become more critical. Analyzing electronic health records (EHRs) of patients with AD can help us better understand lifestyle’s effect on AD. However, lifestyle information is typically stored in clinical narratives. Thus, the objective of the study was to compare different natural language processing (NLP) models on classifying lifestyle statuses (e.g., physical activity and excessive diet) from clinical texts in English. Methods: Based on the collected concept unique identifiers (CUIs) associated with lifestyle status, we extracted all related EHRs for patients with AD from the Clinical Data Repository (CDR) of the University of Minnesota (UMN). We automatically generated labels for the training data using a rule-based NLP algorithm. We conducted weak supervision for pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, with three traditional machine learning models as baselines, on the weakly labeled training corpus. These models include the BERT base model, PubMedBERT (abstracts + full text), PubMedBERT (only abstracts), Unified Medical Language System (UMLS) BERT, Bio BERT, Bio-clinical BERT, logistic regression, support vector machine, and random forest. We performed two case studies, physical activity and excessive diet, to validate the effectiveness of BERT models in classifying lifestyle status. All models were evaluated and compared on the developed Gold Standard Corpus (GSC) for both case studies. The rule-based model used for weak supervision was also tested on the GSC for comparison. Results: The UMLS BERT model achieved the best performance for classifying the status of physical activity, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.92, respectively. For classifying excessive diet, the Bio-clinical BERT model showed the best performance, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.93, respectively. Conclusion: The proposed approach leveraging weak supervision could significantly increase the sample size, which is required for training the deep learning models. In comparison with the traditional machine learning models, the study also demonstrates the high performance of BERT models for classifying lifestyle status for Alzheimer’s disease in clinical notes.
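The weak-supervision idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the keyword rules, the function names `weak_label` and `precision_recall_f1`, and the five toy sentences with their gold labels are all hypothetical and exist only to show how rule-generated labels might be scored against a gold standard using precision, recall, and F-1.

```python
import re

# Hypothetical rules: the study's actual rule-based algorithm is not
# described in the abstract, so these patterns are illustrative only.
ACTIVITY = re.compile(r"\b(walk|exercis|jog|swim)\w*", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|not|denies|unable to)\b", re.IGNORECASE)


def weak_label(sentence):
    """Return 1 (active), 0 (not active), or None (abstain) from simple rules."""
    if not ACTIVITY.search(sentence):
        return None  # no activity term, so the rule abstains
    return 0 if NEGATION.search(sentence) else 1


def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F-1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy sentences and gold labels, invented for illustration (not study data).
notes = [
    "Patient walks daily for 30 minutes.",
    "Denies regular exercise.",
    "Unable to walk without assistance.",
    "Swims twice a week.",
    "No longer jogs but walks every morning.",  # the negation rule misfires here
]
gold = [1, 0, 0, 1, 1]

weak = [weak_label(n) for n in notes]
print(weak)
print(precision_recall_f1(gold, weak))
```

In a setup like the one the abstract describes, labels produced this way would serve as noisy training targets for a classifier, while the gold labels would be reserved for evaluation; the last toy sentence shows the kind of rule error that makes such labels "weak".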

Original language	English (US)
Article number	88
Journal	BMC Medical Informatics and Decision Making
Volume	22
DOIs	https://doi.org/10.1186/s12911-022-01819-4
State	Published - Jul 2022

Keywords

Alzheimer’s disease
Clinical text classification
Deep learning
Electronic health records
Machine learning
Natural language processing

ASJC Scopus subject areas

Health Policy
Health Informatics
Computer Science Applications

Cite this

@article{e1213c3a298b41d091c0355fd44263d6,

title = "Classifying the lifestyle status for Alzheimer{\textquoteright}s disease from clinical notes using deep learning with weak supervision",

abstract = "Background: Since no effective therapies exist for Alzheimer{\textquoteright}s disease (AD), prevention through lifestyle changes and interventions has become more critical. Analyzing electronic health records (EHRs) of patients with AD can help us better understand lifestyle{\textquoteright}s effect on AD. However, lifestyle information is typically stored in clinical narratives. Thus, the objective of the study was to compare different natural language processing (NLP) models on classifying lifestyle statuses (e.g., physical activity and excessive diet) from clinical texts in English. Methods: Based on the collected concept unique identifiers (CUIs) associated with lifestyle status, we extracted all related EHRs for patients with AD from the Clinical Data Repository (CDR) of the University of Minnesota (UMN). We automatically generated labels for the training data using a rule-based NLP algorithm. We conducted weak supervision for pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, with three traditional machine learning models as baselines, on the weakly labeled training corpus. These models include the BERT base model, PubMedBERT (abstracts + full text), PubMedBERT (only abstracts), Unified Medical Language System (UMLS) BERT, Bio BERT, Bio-clinical BERT, logistic regression, support vector machine, and random forest. We performed two case studies, physical activity and excessive diet, to validate the effectiveness of BERT models in classifying lifestyle status. All models were evaluated and compared on the developed Gold Standard Corpus (GSC) for both case studies. The rule-based model used for weak supervision was also tested on the GSC for comparison. Results: The UMLS BERT model achieved the best performance for classifying the status of physical activity, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.92, respectively. For classifying excessive diet, the Bio-clinical BERT model showed the best performance, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.93, respectively. Conclusion: The proposed approach leveraging weak supervision could significantly increase the sample size, which is required for training the deep learning models. In comparison with the traditional machine learning models, the study also demonstrates the high performance of BERT models for classifying lifestyle status for Alzheimer{\textquoteright}s disease in clinical notes.",

keywords = "Alzheimer{\textquoteright}s disease, Clinical text classification, Deep learning, Electronic health records, Machine learning, Natural language processing",

author = "Zitao Shen and Dalton Schutte and Yoonkwon Yi and Anusha Bompelli and Fang Yu and Yanshan Wang and Rui Zhang",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s).",

year = "2022",

month = jul,

doi = "10.1186/s12911-022-01819-4",

language = "English (US)",

volume = "22",

journal = "BMC Medical Informatics and Decision Making",

issn = "1472-6947",

publisher = "BioMed Central",

}

TY - JOUR

T1 - Classifying the lifestyle status for Alzheimer’s disease from clinical notes using deep learning with weak supervision

AU - Shen, Zitao

AU - Schutte, Dalton

AU - Yi, Yoonkwon

AU - Bompelli, Anusha

AU - Yu, Fang

AU - Wang, Yanshan

AU - Zhang, Rui

PY - 2022/7

Y1 - 2022/7

AB - Background: Since no effective therapies exist for Alzheimer’s disease (AD), prevention through lifestyle changes and interventions has become more critical. Analyzing electronic health records (EHRs) of patients with AD can help us better understand lifestyle’s effect on AD. However, lifestyle information is typically stored in clinical narratives. Thus, the objective of the study was to compare different natural language processing (NLP) models on classifying lifestyle statuses (e.g., physical activity and excessive diet) from clinical texts in English. Methods: Based on the collected concept unique identifiers (CUIs) associated with lifestyle status, we extracted all related EHRs for patients with AD from the Clinical Data Repository (CDR) of the University of Minnesota (UMN). We automatically generated labels for the training data using a rule-based NLP algorithm. We conducted weak supervision for pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, with three traditional machine learning models as baselines, on the weakly labeled training corpus. These models include the BERT base model, PubMedBERT (abstracts + full text), PubMedBERT (only abstracts), Unified Medical Language System (UMLS) BERT, Bio BERT, Bio-clinical BERT, logistic regression, support vector machine, and random forest. We performed two case studies, physical activity and excessive diet, to validate the effectiveness of BERT models in classifying lifestyle status. All models were evaluated and compared on the developed Gold Standard Corpus (GSC) for both case studies. The rule-based model used for weak supervision was also tested on the GSC for comparison. Results: The UMLS BERT model achieved the best performance for classifying the status of physical activity, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.92, respectively. For classifying excessive diet, the Bio-clinical BERT model showed the best performance, with precision, recall, and F-1 scores of 0.93, 0.93, and 0.93, respectively. Conclusion: The proposed approach leveraging weak supervision could significantly increase the sample size, which is required for training the deep learning models. In comparison with the traditional machine learning models, the study also demonstrates the high performance of BERT models for classifying lifestyle status for Alzheimer’s disease in clinical notes.

KW - Alzheimer’s disease

KW - Clinical text classification

KW - Deep learning

KW - Electronic health records

KW - Machine learning

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85133566428&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85133566428&partnerID=8YFLogxK

U2 - 10.1186/s12911-022-01819-4

DO - 10.1186/s12911-022-01819-4

M3 - Article

C2 - 35799294

AN - SCOPUS:85133566428

SN - 1472-6947

VL - 22

JO - BMC Medical Informatics and Decision Making

JF - BMC Medical Informatics and Decision Making

M1 - 88

ER -
