TY - GEN
T1 - Automatic Breast Cancer Cohort Detection from Social Media for Studying Factors Affecting Patient-Centered Outcomes
AU - Al-Garadi, Mohammed Ali
AU - Yang, Yuan Chi
AU - Lakamana, Sahithi
AU - Lin, Jie
AU - Li, Sabrina
AU - Xie, Angel
AU - Hogg-Bremer, Whitney
AU - Torres, Mylin
AU - Banerjee, Imon
AU - Sarker, Abeed
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Breast cancer patients often discontinue their long-term treatments, such as hormone therapy, increasing the risk of cancer recurrence. These discontinuations may be caused by adverse patient-centered outcomes (PCOs) due to hormonal drug side effects or other factors. PCOs are not detectable through laboratory tests and are sparsely documented in electronic health records. Thus, there is a need to explore complementary sources of information for PCOs associated with breast cancer treatments. Social media is a promising resource, but extracting true PCOs from it first requires the accurate detection of real breast cancer patients. We describe a natural language processing (NLP) pipeline for automatically detecting breast cancer patients from Twitter based on their self-reports. The pipeline uses breast cancer-related keywords to collect streaming data from Twitter, applies NLP patterns to filter out noisy posts, and then employs a machine learning classifier trained on manually annotated data (n = 5,019) to distinguish firsthand self-reports of breast cancer from other tweets. A classifier based on bidirectional encoder representations from transformers (BERT) showed human-like performance, achieving an F-score of 0.857 for the positive class (inter-annotator agreement: 0.845, Cohen’s kappa) and considerably outperforming the next best classifier, a recurrent neural network with bidirectional long short-term memory (F-score: 0.670). Qualitative analyses of posts from automatically detected users revealed discussions about side effects, non-adherence, and mental health conditions, illustrating the feasibility of our social media-based approach for studying breast cancer-related PCOs from a large population.
KW - Breast cancer
KW - Natural language processing
KW - Social media
UR - http://www.scopus.com/inward/record.url?scp=85092228494&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092228494&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-59137-3_10
DO - 10.1007/978-3-030-59137-3_10
M3 - Conference contribution
AN - SCOPUS:85092228494
SN - 9783030591366
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 100
EP - 110
BT - Artificial Intelligence in Medicine - 18th International Conference on Artificial Intelligence in Medicine, AIME 2020, Proceedings
A2 - Michalowski, Martin
A2 - Moskovitch, Robert
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th International Conference on Artificial Intelligence in Medicine, AIME 2020
Y2 - 25 August 2020 through 28 August 2020
ER -