Natural Language Processing Approaches to Detect the Timeline of Metastatic Recurrence of Breast Cancer

Imon Banerjee; Selen Bozkurt; Jennifer Lee Caswell-Jin; Allison W. Kurian; Daniel L. Rubin

doi:10.1200/CCI.19.00034

Diagnostic Radiology

Research output: Contribution to journal › Article › peer-review

Abstract

PURPOSE Electronic medical records (EMRs) and population-based cancer registries contain information on cancer outcomes and treatment, yet rarely capture information on the timing of metastatic cancer recurrence, which is essential to understand cancer survival outcomes. We developed a natural language processing (NLP) system to identify patient-specific timelines of metastatic breast cancer recurrence.

PATIENTS AND METHODS We used the OncoSHARE database, which includes merged data from the California Cancer Registry and EMRs of 8,956 women diagnosed with breast cancer in 2000 to 2018. We curated a comprehensive vocabulary by interviewing expert clinicians and processing radiology and pathology reports and progress notes. We developed and evaluated the following two distinct NLP approaches to analyze free-text notes: a traditional rule-based model, using rules for metastatic detection from the literature and curated by domain experts; and a contemporary neural network model. For each 3-month period (quarter) from 2000 to 2018, we applied both models to infer recurrence status for that quarter. We trained the NLP models using 894 randomly selected patient records that were manually reviewed by clinical experts and evaluated model performance using 179 hold-out patients (20%) as a test set.

RESULTS The median follow-up time was 19 quarters (5 years) for the training set and 15 quarters (4 years) for the test set. The neural network model predicted the timing of distant metastatic recurrence with a sensitivity of 0.83 and specificity of 0.73, outperforming the rule-based model, which had a specificity of 0.35 and sensitivity of 0.88 (P < .001).

CONCLUSION We developed an NLP method that enables identification of the occurrence and timing of metastatic breast cancer recurrence from EMRs. This approach may be adaptable to other cancer sites and could help to unlock the potential of EMRs for research on real-world cancer outcomes.
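As an illustration only, and not the authors' code, the quarter-level evaluation described in the abstract can be sketched as follows. The sketch assumes each quarter carries one binary recurrence label from manual review and one from a model; the function names `quarter_index` and `sensitivity_specificity` and the toy labels are hypothetical and are not taken from the study.

```python
from datetime import date


def quarter_index(d: date, start_year: int = 2000) -> int:
    """Map a date to a 0-based quarter index counted from January of start_year."""
    return (d.year - start_year) * 4 + (d.month - 1) // 3


def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for paired binary labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # recurrence, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # recurrence, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # no recurrence, not flagged
    fp = sum(1 for t, p in pairs if not t and p)      # no recurrence, flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for eight consecutive quarters of one hypothetical patient.
# These values are invented for illustration and are not study data.
truth = [0, 0, 0, 1, 1, 1, 0, 0]
predicted = [0, 1, 0, 1, 1, 0, 0, 0]

sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Under this reading, a model that flags recurrence liberally scores high sensitivity at the cost of specificity, which is the trade-off the reported rule-based figures (0.88 and 0.35) suggest.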

Original language	English (US)
Pages (from-to)	1-12
Number of pages	12
Journal	JCO Clinical Cancer Informatics
Volume	3
DOIs	https://doi.org/10.1200/CCI.19.00034
State	Published - 2019

ASJC Scopus subject areas

Oncology
Health Informatics
Cancer Research

Access to Document

10.1200/CCI.19.00034

Cite this

@article{ee4fba2ab2c04d5b9c3b9eb7dfbf2220,

title = "Natural Language Processing Approaches to Detect the Timeline of Metastatic Recurrence of Breast Cancer",

abstract = "PURPOSE Electronic medical records (EMRs) and population-based cancer registries contain information on cancer outcomes and treatment, yet rarely capture information on the timing of metastatic cancer recurrence, which is essential to understand cancer survival outcomes. We developed a natural language processing (NLP) system to identify patient-specific timelines of metastatic breast cancer recurrence. PATIENTS AND METHODS We used the OncoSHARE database, which includes merged data from the California Cancer Registry and EMRs of 8,956 women diagnosed with breast cancer in 2000 to 2018. We curated a comprehensive vocabulary by interviewing expert clinicians and processing radiology and pathology reports and progress notes. We developed and evaluated the following two distinct NLP approaches to analyze free-text notes: a traditional rule-based model, using rules for metastatic detection from the literature and curated by domain experts; and a contemporary neural network model. For each 3-month period (quarter) from 2000 to 2018, we applied both models to infer recurrence status for that quarter. We trained the NLP models using 894 randomly selected patient records that were manually reviewed by clinical experts and evaluated model performance using 179 hold-out patients (20\%) as a test set. RESULTS The median follow-up time was 19 quarters (5 years) for the training set and 15 quarters (4 years) for the test set. The neural network model predicted the timing of distant metastatic recurrence with a sensitivity of 0.83 and specificity of 0.73, outperforming the rule-based model, which had a specificity of 0.35 and sensitivity of 0.88 (P $<$ .001). CONCLUSION We developed an NLP method that enables identification of the occurrence and timing of metastatic breast cancer recurrence from EMRs. This approach may be adaptable to other cancer sites and could help to unlock the potential of EMRs for research on real-world cancer outcomes.",

author = "Imon Banerjee and Selen Bozkurt and {Lee Caswell-Jin}, Jennifer and Kurian, {Allison W.} and Rubin, {Daniel L.}",

note = "Publisher Copyright: {\textcopyright} 2019 by American Society of Clinical Oncology",

year = "2019",

doi = "10.1200/CCI.19.00034",

language = "English (US)",

volume = "3",

pages = "1--12",

journal = "JCO Clinical Cancer Informatics",

issn = "2473-4276",

publisher = "American Society of Clinical Oncology",

}

TY - JOUR

T1 - Natural Language Processing Approaches to Detect the Timeline of Metastatic Recurrence of Breast Cancer

AU - Banerjee, Imon

AU - Bozkurt, Selen

AU - Lee Caswell-Jin, Jennifer

AU - Kurian, Allison W.

AU - Rubin, Daniel L.

PY - 2019

Y1 - 2019

AB - PURPOSE Electronic medical records (EMRs) and population-based cancer registries contain information on cancer outcomes and treatment, yet rarely capture information on the timing of metastatic cancer recurrence, which is essential to understand cancer survival outcomes. We developed a natural language processing (NLP) system to identify patient-specific timelines of metastatic breast cancer recurrence. PATIENTS AND METHODS We used the OncoSHARE database, which includes merged data from the California Cancer Registry and EMRs of 8,956 women diagnosed with breast cancer in 2000 to 2018. We curated a comprehensive vocabulary by interviewing expert clinicians and processing radiology and pathology reports and progress notes. We developed and evaluated the following two distinct NLP approaches to analyze free-text notes: a traditional rule-based model, using rules for metastatic detection from the literature and curated by domain experts; and a contemporary neural network model. For each 3-month period (quarter) from 2000 to 2018, we applied both models to infer recurrence status for that quarter. We trained the NLP models using 894 randomly selected patient records that were manually reviewed by clinical experts and evaluated model performance using 179 hold-out patients (20%) as a test set. RESULTS The median follow-up time was 19 quarters (5 years) for the training set and 15 quarters (4 years) for the test set. The neural network model predicted the timing of distant metastatic recurrence with a sensitivity of 0.83 and specificity of 0.73, outperforming the rule-based model, which had a specificity of 0.35 and sensitivity of 0.88 (P < .001). CONCLUSION We developed an NLP method that enables identification of the occurrence and timing of metastatic breast cancer recurrence from EMRs. This approach may be adaptable to other cancer sites and could help to unlock the potential of EMRs for research on real-world cancer outcomes.

UR - http://www.scopus.com/inward/record.url?scp=85078789466&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85078789466&partnerID=8YFLogxK

U2 - 10.1200/CCI.19.00034

DO - 10.1200/CCI.19.00034

M3 - Article

C2 - 31584836

AN - SCOPUS:85078789466

SN - 2473-4276

VL - 3

SP - 1

EP - 12

JO - JCO Clinical Cancer Informatics

JF - JCO Clinical Cancer Informatics

ER -
