CancerBERT: A cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records

Sicheng Zhou; Nan Wang; Liwei Wang; Hongfang Liu; Rui Zhang

doi:10.1093/jamia/ocac040

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models.

Materials and Methods: A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities in 200 pathology reports and 50 clinical notes for 8 cancer phenotypes for fine-tuning and evaluation. We continued pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (using both term frequency-based and manually reviewed methods) to obtain CancerBERT models. The CancerBERT models were evaluated and compared with other baseline models on the cancer phenotype extraction task.

Results: All CancerBERT models outperformed all other models on the cancer phenotyping NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT model with the original BERT vocabulary. The CancerBERT model with manually reviewed customized vocabulary achieved the best performance with macro F1 scores equal to 0.876 (95% CI, 0.873-0.879) and 0.904 (95% CI, 0.902-0.906) for exact match and lenient match, respectively.

Conclusions: The CancerBERT models were developed to extract the cancer phenotypes in clinical notes and pathology reports. The results validated that using customized vocabulary may further improve the performance of domain-specific BERT models in clinical NLP tasks. The CancerBERT models developed in the study would further help clinical decision support.
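The abstract reports macro F1 under exact match and lenient match but does not define either criterion. The sketch below shows one common convention, which is an assumption here and not necessarily the authors' scoring procedure: an exact match requires identical span boundaries and entity type, while a lenient match requires the same entity type and any overlap between spans. The spans, the labels (`SIZE`, `TYPE`, `GRADE`), and the function names are invented for illustration.

```python
# Minimal sketch of exact vs. lenient entity matching with macro-averaged F1.
# Assumption: entities are (start, end, label) tuples with an exclusive end offset.
# This is an illustrative convention, not the scoring code used in the paper.

def overlaps(a, b):
    """Return True if the two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def is_match(gold, pred, lenient):
    """Compare one gold entity with one predicted entity."""
    if gold[2] != pred[2]:
        return False  # the entity type must agree under both criteria
    if lenient:
        return overlaps(gold, pred)
    return gold[0] == pred[0] and gold[1] == pred[1]


def macro_f1(gold, pred, lenient=False):
    """Average the per-label F1 scores over every label seen in gold or pred."""
    labels = sorted({e[2] for e in gold} | {e[2] for e in pred})
    scores = []
    for label in labels:
        g = [e for e in gold if e[2] == label]
        p = [e for e in pred if e[2] == label]
        # Predictions matched by some gold entity count toward precision.
        matched_pred = sum(any(is_match(x, y, lenient) for x in g) for y in p)
        # Gold entities matched by some prediction count toward recall.
        matched_gold = sum(any(is_match(x, y, lenient) for y in p) for x in g)
        precision = matched_pred / len(p) if p else 0.0
        recall = matched_gold / len(g) if g else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical annotations: one exact hit, one boundary error, one miss.
gold = [(0, 2, "SIZE"), (5, 8, "TYPE"), (10, 12, "GRADE")]
pred = [(0, 2, "SIZE"), (6, 8, "TYPE"), (14, 15, "GRADE")]

exact_score = macro_f1(gold, pred)
lenient_score = macro_f1(gold, pred, lenient=True)
print(f"exact={exact_score:.3f} lenient={lenient_score:.3f}")
```

In this toy case the `TYPE` prediction has a shifted start offset, so it counts only under the lenient criterion. That is the kind of boundary error that can make a lenient score higher than an exact score.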

Original language	English (US)
Pages (from-to)	1208-1216
Number of pages	9
Journal	Journal of the American Medical Informatics Association
Volume	29
Issue number	7
DOIs	https://doi.org/10.1093/jamia/ocac040
State	Published - Jul 1 2022

Keywords

CancerBERT
cancer phenotyping
electronic health record
named entity recognition
natural language processing

ASJC Scopus subject areas

Health Informatics

Cite this

@article{73b7f75f79a84141b9466c5123a6c964,

title = "CancerBERT: A cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records",

abstract = "Objective: Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models. Materials and Methods: A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities in 200 pathology reports and 50 clinical notes for 8 cancer phenotypes for fine-tuning and evaluation. We continued pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (using both term frequency-based and manually reviewed methods) to obtain CancerBERT models. The CancerBERT models were evaluated and compared with other baseline models on the cancer phenotype extraction task. Results: All CancerBERT models outperformed all other models on the cancer phenotyping NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT model with the original BERT vocabulary. The CancerBERT model with manually reviewed customized vocabulary achieved the best performance with macro F1 scores equal to 0.876 (95% CI, 0.873-0.879) and 0.904 (95% CI, 0.902-0.906) for exact match and lenient match, respectively. Conclusions: The CancerBERT models were developed to extract the cancer phenotypes in clinical notes and pathology reports. The results validated that using customized vocabulary may further improve the performance of domain-specific BERT models in clinical NLP tasks. The CancerBERT models developed in the study would further help clinical decision support.",

keywords = "CancerBERT, cancer phenotyping, electronic health record, named entity recognition, natural language processing",

author = "Sicheng Zhou and Nan Wang and Liwei Wang and Hongfang Liu and Rui Zhang",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2022. Published by Oxford University Press on behalf of the American Medical Informatics Association.",

year = "2022",

month = jul,

day = "1",

doi = "10.1093/jamia/ocac040",

language = "English (US)",

volume = "29",

pages = "1208--1216",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "7",

}

TY - JOUR

T1 - CancerBERT: A cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records

AU - Zhou, Sicheng

AU - Wang, Nan

AU - Wang, Liwei

AU - Liu, Hongfang

AU - Zhang, Rui

PY - 2022/7/1

Y1 - 2022/7/1

AB - Objective: Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models. Materials and Methods: A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities in 200 pathology reports and 50 clinical notes for 8 cancer phenotypes for fine-tuning and evaluation. We continued pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (using both term frequency-based and manually reviewed methods) to obtain CancerBERT models. The CancerBERT models were evaluated and compared with other baseline models on the cancer phenotype extraction task. Results: All CancerBERT models outperformed all other models on the cancer phenotyping NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT model with the original BERT vocabulary. The CancerBERT model with manually reviewed customized vocabulary achieved the best performance with macro F1 scores equal to 0.876 (95% CI, 0.873-0.879) and 0.904 (95% CI, 0.902-0.906) for exact match and lenient match, respectively. Conclusions: The CancerBERT models were developed to extract the cancer phenotypes in clinical notes and pathology reports. The results validated that using customized vocabulary may further improve the performance of domain-specific BERT models in clinical NLP tasks. The CancerBERT models developed in the study would further help clinical decision support.

KW - CancerBERT

KW - cancer phenotyping

KW - electronic health record

KW - named entity recognition

KW - natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85132050037&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85132050037&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocac040

DO - 10.1093/jamia/ocac040

M3 - Article

C2 - 35333345

AN - SCOPUS:85132050037

SN - 1067-5027

VL - 29

SP - 1208

EP - 1216

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 7

ER -
