Comparison of Natural Language Processing Rules-based and Machine-learning Systems to Identify Lumbar Spine Imaging Findings Related to Low Back Pain

W. Katherine Tan; Saeed Hassanpour; Patrick J. Heagerty; Sean D. Rundell; Pradeep Suri; Hannu T. Huhdanpaa; Kathryn James; David S. Carrell; Curtis P. Langlotz; Nancy L. Organ; Eric N. Meier; Karen J. Sherman; David F. Kallmes; Patrick H. Luetmer; Brent Griffith; David R. Nerenz; Jeffrey G. Jarvik

doi:10.1016/j.acra.2018.03.008

Radiology

Research output: Contribution to journal › Article › peer-review

24 Scopus citations

Abstract

Rationale and Objectives: To evaluate a natural language processing (NLP) system built with open-source tools for identification of lumbar spine imaging findings related to low back pain on magnetic resonance and x-ray radiology reports from four health systems.

Materials and Methods: We used a limited data set (de-identified except for dates) sampled from lumbar spine imaging reports of a prospectively assembled cohort of adults. From N = 178,333 reports, we randomly selected N = 871 to form a reference-standard dataset, consisting of N = 413 x-ray reports and N = 458 MR reports. Using standardized criteria, four spine experts annotated the presence of 26 findings, where 71 reports were annotated by all four experts and 800 were each annotated by two experts. We calculated inter-rater agreement and finding prevalence from annotated data. We randomly split the annotated data into development (80%) and testing (20%) sets. We developed an NLP system from both rule-based and machine-learned models. We validated the system using accuracy metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).

Results: The multirater annotated dataset achieved inter-rater agreement of Cohen's kappa > 0.60 (substantial agreement) for 25 of 26 findings, with finding prevalence ranging from 3% to 89%. In the testing sample, rule-based and machine-learned predictions both had comparable average specificity (0.97 and 0.95, respectively). The machine-learned approach had a higher average sensitivity (0.94, compared to 0.83 for rules-based), and a higher overall AUC (0.98, compared to 0.90 for rules-based).

Conclusions: Our NLP system performed well in identifying the 26 lumbar spine findings, as benchmarked by reference-standard annotation by medical experts. Machine-learned models provided substantial gains in model sensitivity with slight loss of specificity, and overall higher AUC.
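
The abstract reports sensitivity, specificity, and Cohen's kappa for each finding. As a minimal sketch of how these quantities are defined for a single binary finding (present = 1, absent = 0), the following Python uses only the standard library. The function names and the ten-report toy data are hypothetical and are not taken from the study or its code; the sketch also assumes each class appears at least once, so no denominator is zero.

```python
from typing import Sequence, Tuple


def confusion_counts(truth: Sequence[int], pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of truly positive reports that the system flags as positive."""
    tp, _, _, fn = confusion_counts(truth, pred)
    return tp / (tp + fn)


def specificity(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of truly negative reports that the system flags as negative."""
    _, fp, tn, _ = confusion_counts(truth, pred)
    return tn / (tn + fp)


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two raters on binary labels."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same reports")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    p_a = sum(rater_a) / n  # share of reports rater A marks positive
    p_b = sum(rater_b) / n  # share of reports rater B marks positive
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    return (observed - expected) / (1 - expected)


# Hypothetical toy data for one finding across ten reports.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(sensitivity(reference, predicted))
print(specificity(reference, predicted))
print(cohens_kappa(reference, predicted))
```

Averaging such per-finding values across the 26 findings would give summary figures of the kind quoted in the Results; the AUC additionally requires a continuous prediction score and is not covered by this sketch.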

Original language	English (US)
Pages (from-to)	1422-1432
Number of pages	11
Journal	Academic Radiology
Volume	25
Issue number	11
DOIs	https://doi.org/10.1016/j.acra.2018.03.008
State	Published - Nov 2018

Keywords

Natural language processing
low back pain
lumbar spine diagnostic imaging

ASJC Scopus subject areas

Radiology, Nuclear Medicine and Imaging

Cite this

Tan, W. K., Hassanpour, S., Heagerty, P. J., Rundell, S. D., Suri, P., Huhdanpaa, H. T., James, K., Carrell, D. S., Langlotz, C. P., Organ, N. L., Meier, E. N., Sherman, K. J., Kallmes, D. F., Luetmer, P. H., Griffith, B., Nerenz, D. R., & Jarvik, J. G. (2018). Comparison of Natural Language Processing Rules-based and Machine-learning Systems to Identify Lumbar Spine Imaging Findings Related to Low Back Pain. Academic Radiology, 25(11), 1422-1432. https://doi.org/10.1016/j.acra.2018.03.008

Tan, WK, Hassanpour, S, Heagerty, PJ, Rundell, SD, Suri, P, Huhdanpaa, HT, James, K, Carrell, DS, Langlotz, CP, Organ, NL, Meier, EN, Sherman, KJ, Kallmes, DF, Luetmer, PH, Griffith, B, Nerenz, DR & Jarvik, JG 2018, 'Comparison of Natural Language Processing Rules-based and Machine-learning Systems to Identify Lumbar Spine Imaging Findings Related to Low Back Pain', Academic Radiology, vol. 25, no. 11, pp. 1422-1432. https://doi.org/10.1016/j.acra.2018.03.008

@article{462cf881202b4caebc67141ba2a1f33f,

title = "Comparison of Natural Language Processing Rules-based and Machine-learning Systems to Identify Lumbar Spine Imaging Findings Related to Low Back Pain",

abstract = "Rationale and Objectives: To evaluate a natural language processing (NLP) system built with open-source tools for identification of lumbar spine imaging findings related to low back pain on magnetic resonance and x-ray radiology reports from four health systems. Materials and Methods: We used a limited data set (de-identified except for dates) sampled from lumbar spine imaging reports of a prospectively assembled cohort of adults. From N = 178,333 reports, we randomly selected N = 871 to form a reference-standard dataset, consisting of N = 413 x-ray reports and N = 458 MR reports. Using standardized criteria, four spine experts annotated the presence of 26 findings, where 71 reports were annotated by all four experts and 800 were each annotated by two experts. We calculated inter-rater agreement and finding prevalence from annotated data. We randomly split the annotated data into development (80%) and testing (20%) sets. We developed an NLP system from both rule-based and machine-learned models. We validated the system using accuracy metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Results: The multirater annotated dataset achieved inter-rater agreement of Cohen's kappa > 0.60 (substantial agreement) for 25 of 26 findings, with finding prevalence ranging from 3% to 89%. In the testing sample, rule-based and machine-learned predictions both had comparable average specificity (0.97 and 0.95, respectively). The machine-learned approach had a higher average sensitivity (0.94, compared to 0.83 for rules-based), and a higher overall AUC (0.98, compared to 0.90 for rules-based). Conclusions: Our NLP system performed well in identifying the 26 lumbar spine findings, as benchmarked by reference-standard annotation by medical experts. Machine-learned models provided substantial gains in model sensitivity with slight loss of specificity, and overall higher AUC.",

keywords = "Natural language processing, low back pain, lumbar spine diagnostic imaging",

author = "Tan, {W. Katherine} and Saeed Hassanpour and Heagerty, {Patrick J.} and Rundell, {Sean D.} and Pradeep Suri and Huhdanpaa, {Hannu T.} and Kathryn James and Carrell, {David S.} and Langlotz, {Curtis P.} and Organ, {Nancy L.} and Meier, {Eric N.} and Sherman, {Karen J.} and Kallmes, {David F.} and Luetmer, {Patrick H.} and Brent Griffith and Nerenz, {David R.} and Jarvik, {Jeffrey G.}",

note = "Publisher Copyright: {\textcopyright} 2018 The Association of University Radiologists",

year = "2018",

month = nov,

doi = "10.1016/j.acra.2018.03.008",

language = "English (US)",

volume = "25",

pages = "1422--1432",

journal = "Academic Radiology",

issn = "1076-6332",

publisher = "Elsevier USA",

number = "11",

}

TY - JOUR

T1 - Comparison of Natural Language Processing Rules-based and Machine-learning Systems to Identify Lumbar Spine Imaging Findings Related to Low Back Pain

AU - Tan, W. Katherine

AU - Hassanpour, Saeed

AU - Heagerty, Patrick J.

AU - Rundell, Sean D.

AU - Suri, Pradeep

AU - Huhdanpaa, Hannu T.

AU - James, Kathryn

AU - Carrell, David S.

AU - Langlotz, Curtis P.

AU - Organ, Nancy L.

AU - Meier, Eric N.

AU - Sherman, Karen J.

AU - Kallmes, David F.

AU - Luetmer, Patrick H.

AU - Griffith, Brent

AU - Nerenz, David R.

AU - Jarvik, Jeffrey G.

PY - 2018/11

Y1 - 2018/11

N2 - Rationale and Objectives: To evaluate a natural language processing (NLP) system built with open-source tools for identification of lumbar spine imaging findings related to low back pain on magnetic resonance and x-ray radiology reports from four health systems. Materials and Methods: We used a limited data set (de-identified except for dates) sampled from lumbar spine imaging reports of a prospectively assembled cohort of adults. From N = 178,333 reports, we randomly selected N = 871 to form a reference-standard dataset, consisting of N = 413 x-ray reports and N = 458 MR reports. Using standardized criteria, four spine experts annotated the presence of 26 findings, where 71 reports were annotated by all four experts and 800 were each annotated by two experts. We calculated inter-rater agreement and finding prevalence from annotated data. We randomly split the annotated data into development (80%) and testing (20%) sets. We developed an NLP system from both rule-based and machine-learned models. We validated the system using accuracy metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Results: The multirater annotated dataset achieved inter-rater agreement of Cohen's kappa > 0.60 (substantial agreement) for 25 of 26 findings, with finding prevalence ranging from 3% to 89%. In the testing sample, rule-based and machine-learned predictions both had comparable average specificity (0.97 and 0.95, respectively). The machine-learned approach had a higher average sensitivity (0.94, compared to 0.83 for rules-based), and a higher overall AUC (0.98, compared to 0.90 for rules-based). Conclusions: Our NLP system performed well in identifying the 26 lumbar spine findings, as benchmarked by reference-standard annotation by medical experts. Machine-learned models provided substantial gains in model sensitivity with slight loss of specificity, and overall higher AUC.

AB - Rationale and Objectives: To evaluate a natural language processing (NLP) system built with open-source tools for identification of lumbar spine imaging findings related to low back pain on magnetic resonance and x-ray radiology reports from four health systems. Materials and Methods: We used a limited data set (de-identified except for dates) sampled from lumbar spine imaging reports of a prospectively assembled cohort of adults. From N = 178,333 reports, we randomly selected N = 871 to form a reference-standard dataset, consisting of N = 413 x-ray reports and N = 458 MR reports. Using standardized criteria, four spine experts annotated the presence of 26 findings, where 71 reports were annotated by all four experts and 800 were each annotated by two experts. We calculated inter-rater agreement and finding prevalence from annotated data. We randomly split the annotated data into development (80%) and testing (20%) sets. We developed an NLP system from both rule-based and machine-learned models. We validated the system using accuracy metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Results: The multirater annotated dataset achieved inter-rater agreement of Cohen's kappa > 0.60 (substantial agreement) for 25 of 26 findings, with finding prevalence ranging from 3% to 89%. In the testing sample, rule-based and machine-learned predictions both had comparable average specificity (0.97 and 0.95, respectively). The machine-learned approach had a higher average sensitivity (0.94, compared to 0.83 for rules-based), and a higher overall AUC (0.98, compared to 0.90 for rules-based). Conclusions: Our NLP system performed well in identifying the 26 lumbar spine findings, as benchmarked by reference-standard annotation by medical experts. Machine-learned models provided substantial gains in model sensitivity with slight loss of specificity, and overall higher AUC.

KW - Natural language processing

KW - low back pain

KW - lumbar spine diagnostic imaging

UR - http://www.scopus.com/inward/record.url?scp=85044660518&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85044660518&partnerID=8YFLogxK

U2 - 10.1016/j.acra.2018.03.008

DO - 10.1016/j.acra.2018.03.008

M3 - Article

C2 - 29605561

AN - SCOPUS:85044660518

SN - 1076-6332

VL - 25

SP - 1422

EP - 1432

JO - Academic Radiology

JF - Academic Radiology

IS - 11

ER -