Natural language processing for the identification of silent brain infarcts from neuroimaging reports

Sunyang Fu, Lester Y. Leung, Yanshan Wang, Anne Olivia Raulli, David F Kallmes, Kristin A. Kinsman, Kristoff B. Nelson, Michael S. Clark, Patrick H Luetmer, Paul R. Kingsbury, David M. Kent, Hongfang D Liu

Research output: Contribution to journal, Article

1 Citation (Scopus)

Abstract

Background: Silent brain infarction (SBI) is defined as the presence of 1 or more brain lesions, presumed to be caused by vascular occlusion, found by neuroimaging (magnetic resonance imaging or computed tomography) in patients without clinical manifestations of stroke. It is more common than stroke and can be detected in 20% of healthy elderly people. Early detection of SBI may mitigate the risk of stroke by offering preventative treatment plans. Natural language processing (NLP) techniques offer an opportunity to systematically identify SBI cases from electronic health records (EHRs) by extracting, normalizing, and classifying SBI-related incidental findings interpreted by radiologists from neuroimaging reports.
Objective: This study aimed to develop NLP systems to identify individuals with incidentally discovered SBIs from neuroimaging reports at 2 sites: Mayo Clinic and Tufts Medical Center.
Methods: Both rule-based and machine learning approaches were adopted in developing the NLP system. The rule-based system was implemented using the open source NLP pipeline MedTagger, developed by Mayo Clinic. Features for rule-based systems, including significant words and patterns related to SBI, were generated using pointwise mutual information. The machine learning models comprised a convolutional neural network (CNN), random forest, support vector machine, and logistic regression. The performance of the NLP algorithm was compared with a manually created gold standard. The gold standard dataset included 1000 radiology reports randomly retrieved from the 2 study sites (Mayo and Tufts), corresponding to patients with no prior or current diagnosis of stroke or dementia. Of the 1000 reports, 400 were randomly sampled and double read to determine interannotator agreement. The gold standard dataset was split equally into 3 subsets for training, development, and testing.
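The pointwise mutual information step named in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pmi_by_word`, the whitespace tokenization, the document-level counting, and the four toy reports are all assumptions made for illustration. It scores each word by PMI(word, SBI-positive) = log2( P(word, positive) / (P(word) * P(positive)) ).

```python
import math
from collections import Counter


def pmi_by_word(reports, labels):
    """Return {word: PMI(word, positive label)} from document-level counts.

    Hypothetical helper: a word counts once per report, and only words
    that occur in at least one positive report receive a score.
    """
    n = len(reports)
    n_pos = sum(labels)
    word_docs = Counter()      # reports containing the word
    word_pos_docs = Counter()  # positive reports containing the word
    for text, label in zip(reports, labels):
        for word in set(text.lower().split()):
            word_docs[word] += 1
            if label:
                word_pos_docs[word] += 1

    scores = {}
    for word, joint in word_pos_docs.items():
        p_joint = joint / n
        p_word = word_docs[word] / n
        p_pos = n_pos / n
        scores[word] = math.log2(p_joint / (p_word * p_pos))
    return scores


# Toy, made-up report snippets (1 = SBI-positive, 0 = negative).
reports = [
    "chronic lacunar infarct in the left thalamus",
    "no acute intracranial abnormality",
    "old infarct in the right cerebellum",
    "normal study",
]
labels = [1, 0, 1, 0]
scores = pmi_by_word(reports, labels)
```

Words with high scores would be candidates for rule patterns. With so little data, common words that happen to appear only in positive reports score just as highly as clinical terms, so in practice the ranked list would need expert review.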
Results: Among the 400 reports selected to determine interannotator agreement, 5 reports were removed due to invalid scan types. The interannotator agreements for the Mayo and Tufts neuroimaging reports were 0.87 and 0.91, respectively. The rule-based system yielded the best performance in predicting SBI, with an accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of 0.991, 0.925, 1.000, 1.000, and 0.990, respectively. The CNN achieved the best score in predicting white matter disease (WMD), with an accuracy, sensitivity, specificity, PPV, and NPV of 0.994, 0.994, 0.994, 0.994, and 0.994, respectively.
Conclusions: We adopted a standardized data abstraction and modeling process to develop NLP techniques (rule-based and machine learning) to detect incidental SBIs and WMDs from annotated neuroimaging reports. Validation statistics suggested a high feasibility of detecting SBIs and WMDs from EHRs using NLP.
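The five validation statistics reported in the Results all derive from a 2x2 confusion matrix of system output against the gold standard. A minimal sketch follows. The function name `diagnostic_metrics` and the counts are hypothetical and are not the study's data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute report-level validation statistics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives, and
    false negatives of the NLP output compared with the gold standard.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # recall on positive reports
        "specificity": tn / (tn + fp),   # recall on negative reports
        "ppv": tp / (tp + fp),           # precision of positive calls
        "npv": tn / (tn + fn),           # precision of negative calls
    }


# Hypothetical counts chosen for illustration only.
m = diagnostic_metrics(tp=9, fp=0, tn=40, fn=1)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With zero false positives, specificity and PPV are both exactly 1, while any missed positive lowers sensitivity and NPV. That is the same pattern as in the rule-based SBI results above.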

Original language: English (US)
Article number: e12109
Journal: Journal of Medical Internet Research
Volume: 21
Issue number: 5
DOI: https://doi.org/10.2196/12109
State: Published - May 1 2019

Fingerprint

Natural Language Processing
Brain Infarction
Neuroimaging
Brain
Stroke
Electronic Health Records
Leukoencephalopathies
Sensitivity and Specificity
Incidental Findings
Radiology
Blood Vessels
Dementia
Logistic Models
Tomography
Magnetic Resonance Imaging

Keywords

  • Electronic health records
  • Natural language processing
  • Neuroimaging

ASJC Scopus subject areas

  • Health Informatics

Cite this

Natural language processing for the identification of silent brain infarcts from neuroimaging reports. / Fu, Sunyang; Leung, Lester Y.; Wang, Yanshan; Raulli, Anne Olivia; Kallmes, David F; Kinsman, Kristin A.; Nelson, Kristoff B.; Clark, Michael S.; Luetmer, Patrick H; Kingsbury, Paul R.; Kent, David M.; Liu, Hongfang D.

In: Journal of medical Internet research, Vol. 21, No. 5, e12109, 01.05.2019.

Fu, S, Leung, LY, Wang, Y, Raulli, AO, Kallmes, DF, Kinsman, KA, Nelson, KB, Clark, MS, Luetmer, PH, Kingsbury, PR, Kent, DM & Liu, HD 2019, 'Natural language processing for the identification of silent brain infarcts from neuroimaging reports', Journal of medical Internet research, vol. 21, no. 5, e12109. https://doi.org/10.2196/12109
@article{a8713b4c676a44229333d1f48ab1d9df,
title = "Natural language processing for the identification of silent brain infarcts from neuroimaging reports",
keywords = "Electronic health records, Natural language processing, Neuroimaging",
author = "Sunyang Fu and Leung, {Lester Y.} and Yanshan Wang and Raulli, {Anne Olivia} and Kallmes, {David F} and Kinsman, {Kristin A.} and Nelson, {Kristoff B.} and Clark, {Michael S.} and Luetmer, {Patrick H} and Kingsbury, {Paul R.} and Kent, {David M.} and Liu, {Hongfang D}",
year = "2019",
month = "5",
day = "1",
doi = "10.2196/12109",
language = "English (US)",
volume = "21",
journal = "Journal of Medical Internet Research",
issn = "1439-4456",
publisher = "Journal of medical Internet Research",
number = "5",

}

TY - JOUR

T1 - Natural language processing for the identification of silent brain infarcts from neuroimaging reports

AU - Fu, Sunyang

AU - Leung, Lester Y.

AU - Wang, Yanshan

AU - Raulli, Anne Olivia

AU - Kallmes, David F

AU - Kinsman, Kristin A.

AU - Nelson, Kristoff B.

AU - Clark, Michael S.

AU - Luetmer, Patrick H

AU - Kingsbury, Paul R.

AU - Kent, David M.

AU - Liu, Hongfang D

PY - 2019/5/1

Y1 - 2019/5/1

KW - Electronic health records

KW - Natural language processing

KW - Neuroimaging

UR - http://www.scopus.com/inward/record.url?scp=85067395195&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85067395195&partnerID=8YFLogxK

U2 - 10.2196/12109

DO - 10.2196/12109

M3 - Article

AN - SCOPUS:85067395195

VL - 21

JO - Journal of Medical Internet Research

JF - Journal of Medical Internet Research

SN - 1439-4456

IS - 5

M1 - e12109

ER -