TY - JOUR
T1 - Developing EHR-driven heart failure risk prediction models using CPXR(Log) with the probabilistic loss function
AU - Taslimitehrani, Vahid
AU - Dong, Guozhu
AU - Pereira, Naveen L.
AU - Panahiazar, Maryam
AU - Pathak, Jyotishman
N1 - Funding Information:
This material is based upon work supported by the National Institutes of Health under grant numbers R01 GM105688 and R01 MH105384, the Agency for Healthcare Research and Quality (AHRQ) under grant number R01 HS023077, and the Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery.
Publisher Copyright:
© 2016 Elsevier Inc.
PY - 2016/4/1
Y1 - 2016/4/1
N2 - Computerized survival prediction in healthcare, which identifies the risk of disease mortality, helps healthcare providers manage their patients effectively by informing appropriate treatment options. In this study, we propose to apply a classification algorithm, Contrast Pattern Aided Logistic Regression (CPXR(Log)) with a probabilistic loss function, to develop and validate prognostic risk models that predict 1-, 2-, and 5-year survival in heart failure (HF) using data from electronic health records (EHRs) at Mayo Clinic. CPXR(Log) constructs a pattern aided logistic regression model defined by several patterns and corresponding local logistic regression models. One of the models generated by CPXR(Log) achieved an AUC of 0.94 and an accuracy of 0.91, and significantly outperformed prognostic models reported in prior studies. Data extracted from EHRs allowed incorporation of patient co-morbidities into our models, which helped improve the performance of the CPXR(Log) models (15.9% AUC improvement), although it did not improve the accuracy of the models built by other classifiers. We also propose a probabilistic loss function to distinguish large-error from small-error instances. The new loss function outperforms the functions used in previous studies, improving the AUC by 1%. This study revealed that building prediction models from EHR data can be very challenging with existing classification methods because of the high dimensionality and complexity of EHR data. The risk models developed by CPXR(Log) also reveal that HF is a highly heterogeneous disease, i.e., different subgroups of HF patients require different considerations in their diagnosis and treatment.
Our risk models provided two valuable insights for the application of predictive modeling techniques in biomedicine: logistic risk models often make systematic prediction errors, and it is prudent to use subgroup-based prediction models, such as those given by CPXR(Log), when investigating heterogeneous diseases.
AB - Computerized survival prediction in healthcare, which identifies the risk of disease mortality, helps healthcare providers manage their patients effectively by informing appropriate treatment options. In this study, we propose to apply a classification algorithm, Contrast Pattern Aided Logistic Regression (CPXR(Log)) with a probabilistic loss function, to develop and validate prognostic risk models that predict 1-, 2-, and 5-year survival in heart failure (HF) using data from electronic health records (EHRs) at Mayo Clinic. CPXR(Log) constructs a pattern aided logistic regression model defined by several patterns and corresponding local logistic regression models. One of the models generated by CPXR(Log) achieved an AUC of 0.94 and an accuracy of 0.91, and significantly outperformed prognostic models reported in prior studies. Data extracted from EHRs allowed incorporation of patient co-morbidities into our models, which helped improve the performance of the CPXR(Log) models (15.9% AUC improvement), although it did not improve the accuracy of the models built by other classifiers. We also propose a probabilistic loss function to distinguish large-error from small-error instances. The new loss function outperforms the functions used in previous studies, improving the AUC by 1%. This study revealed that building prediction models from EHR data can be very challenging with existing classification methods because of the high dimensionality and complexity of EHR data. The risk models developed by CPXR(Log) also reveal that HF is a highly heterogeneous disease, i.e., different subgroups of HF patients require different considerations in their diagnosis and treatment.
Our risk models provided two valuable insights for the application of predictive modeling techniques in biomedicine: logistic risk models often make systematic prediction errors, and it is prudent to use subgroup-based prediction models, such as those given by CPXR(Log), when investigating heterogeneous diseases.
KW - Contrast pattern aided logistic regression
KW - Heart failure
KW - Predictive modeling
KW - Survival analysis
UR - http://www.scopus.com/inward/record.url?scp=84962921269&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84962921269&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2016.01.009
DO - 10.1016/j.jbi.2016.01.009
M3 - Article
C2 - 26844760
AN - SCOPUS:84962921269
VL - 60
SP - 260
EP - 269
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
SN - 1532-0464
ER -