Impact of Diverse Data Sources on Computational Phenotyping

Liwei Wang; Janet E. Olson; Suzette J. Bielinski; Jennifer L. St. Sauver; Sunyang Fu; Huan He; Mine S. Cicek; Matthew A. Hathcock; James R. Cerhan; Hongfang Liu

doi:10.3389/fgene.2020.00556

Research output: Contribution to journal › Article › peer-review

Abstract

Electronic health records (EHRs) are widely adopted and have great potential to serve as a rich, integrated source of phenotype information. Computational phenotyping, which extracts phenotypes from EHR data automatically, can accelerate the adoption and utilization of phenotype-driven efforts to advance scientific discovery and improve healthcare delivery. A number of computational phenotyping algorithms have been published, but data fragmentation, i.e., incomplete data within a single data source, has been raised as an inherent limitation of computational phenotyping. In this study, we investigated the impact of diverse data sources on two published computational phenotyping algorithms, for rheumatoid arthritis (RA) and type 2 diabetes mellitus (T2DM), using Mayo EHRs and the Rochester Epidemiology Project (REP), which links medical records from multiple health care systems. Results showed that both RA (less prevalent) and T2DM (more prevalent) case selections were markedly impacted by data fragmentation: in Mayo data, the positive predictive value (PPV) was 91.4% and 92.4% and the false-negative rate (FNR) was 26.6% and 14%, respectively, whereas in REP the PPV was 97.2% and 98.3% and the FNR was 5.2% and 3.3%. T2DM controls also contained biases, with a PPV of 91.2% and an FNR of 1.2% for Mayo. We further elaborate on the underlying reasons affecting performance.
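For readers unfamiliar with the two metrics reported in the abstract, the following minimal sketch shows their standard definitions: PPV = TP / (TP + FP) and FNR = FN / (TP + FN). The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the study, whose underlying counts are not given here.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: fraction of algorithm-selected cases that are true cases."""
    if tp + fp == 0:
        raise ValueError("PPV is undefined when there are no predicted positives")
    return tp / (tp + fp)


def fnr(tp: int, fn: int) -> float:
    """False-negative rate: fraction of true cases that the algorithm missed."""
    if tp + fn == 0:
        raise ValueError("FNR is undefined when there are no true positives in the reference")
    return fn / (tp + fn)


# Hypothetical counts (NOT from the paper): true positives, false positives, false negatives.
tp, fp, fn = 90, 10, 30

print(f"PPV = {ppv(tp, fp):.1%}")  # PPV = 90.0%
print(f"FNR = {fnr(tp, fn):.1%}")  # FNR = 25.0%
```

Under these definitions, a fragmented data source can leave PPV fairly high while raising FNR, because true cases documented only in another health care system are never seen by the algorithm.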

Original language	English (US)
Article number	556
Journal	Frontiers in Genetics
Volume	11
DOIs	https://doi.org/10.3389/fgene.2020.00556
State	Published - Jun 3 2020

Keywords

computational phenotyping
diverse data sources
phenotyping algorithms
rheumatoid arthritis
type 2 diabetes mellitus

ASJC Scopus subject areas

Molecular Medicine
Genetics
Genetics (clinical)

Access to Document

10.3389/fgene.2020.00556

Cite this

@article{1fefb063073d4b3c8f387d031f1f54f9,

title = "Impact of Diverse Data Sources on Computational Phenotyping",

abstract = "Electronic health records (EHRs) are widely adopted with a great potential to serve as a rich, integrated source of phenotype information. Computational phenotyping, which extracts phenotypes from EHR data automatically, can accelerate the adoption and utilization of phenotype-driven efforts to advance scientific discovery and improve healthcare delivery. A list of computational phenotyping algorithms has been published but data fragmentation, i.e., incomplete data within one single data source, has been raised as an inherent limitation of computational phenotyping. In this study, we investigated the impact of diverse data sources on two published computational phenotyping algorithms, rheumatoid arthritis (RA) and type 2 diabetes mellitus (T2DM), using Mayo EHRs and Rochester Epidemiology Project (REP) which links medical records from multiple health care systems. Results showed that both RA (less prevalent) and T2DM (more prevalent) case selections were markedly impacted by data fragmentation, with positive predictive value (PPV) of 91.4 and 92.4%, false-negative rate (FNR) of 26.6 and 14% in Mayo data, respectively, PPV of 97.2 and 98.3%, FNR of 5.2 and 3.3% in REP. T2DM controls also contain biases, with PPV of 91.2% and FNR of 1.2% for Mayo. We further elaborated underlying reasons impacting the performance.",

keywords = "computational phenotyping, diverse data sources, phenotyping algorithms, rheumatoid arthritis, type 2 diabetes mellitus",

author = "Liwei Wang and Olson, {Janet E.} and Bielinski, {Suzette J.} and {St. Sauver}, {Jennifer L.} and Sunyang Fu and Huan He and Cicek, {Mine S.} and Hathcock, {Matthew A.} and Cerhan, {James R.} and Hongfang Liu",

note = "Publisher Copyright: {\textcopyright} 2020 Wang, Olson, Bielinski, St. Sauver, Fu, He, Cicek, Hathcock, Cerhan and Liu.",

year = "2020",

month = jun,

day = "3",

doi = "10.3389/fgene.2020.00556",

language = "English (US)",

volume = "11",

journal = "Frontiers in Genetics",

issn = "1664-8021",

publisher = "Frontiers Media S. A.",

}

TY - JOUR

T1 - Impact of Diverse Data Sources on Computational Phenotyping

AU - Wang, Liwei

AU - Olson, Janet E.

AU - Bielinski, Suzette J.

AU - St. Sauver, Jennifer L.

AU - Fu, Sunyang

AU - He, Huan

AU - Cicek, Mine S.

AU - Hathcock, Matthew A.

AU - Cerhan, James R.

AU - Liu, Hongfang

PY - 2020/6/3

Y1 - 2020/6/3

N2 - Electronic health records (EHRs) are widely adopted with a great potential to serve as a rich, integrated source of phenotype information. Computational phenotyping, which extracts phenotypes from EHR data automatically, can accelerate the adoption and utilization of phenotype-driven efforts to advance scientific discovery and improve healthcare delivery. A list of computational phenotyping algorithms has been published but data fragmentation, i.e., incomplete data within one single data source, has been raised as an inherent limitation of computational phenotyping. In this study, we investigated the impact of diverse data sources on two published computational phenotyping algorithms, rheumatoid arthritis (RA) and type 2 diabetes mellitus (T2DM), using Mayo EHRs and Rochester Epidemiology Project (REP) which links medical records from multiple health care systems. Results showed that both RA (less prevalent) and T2DM (more prevalent) case selections were markedly impacted by data fragmentation, with positive predictive value (PPV) of 91.4 and 92.4%, false-negative rate (FNR) of 26.6 and 14% in Mayo data, respectively, PPV of 97.2 and 98.3%, FNR of 5.2 and 3.3% in REP. T2DM controls also contain biases, with PPV of 91.2% and FNR of 1.2% for Mayo. We further elaborated underlying reasons impacting the performance.

AB - Electronic health records (EHRs) are widely adopted with a great potential to serve as a rich, integrated source of phenotype information. Computational phenotyping, which extracts phenotypes from EHR data automatically, can accelerate the adoption and utilization of phenotype-driven efforts to advance scientific discovery and improve healthcare delivery. A list of computational phenotyping algorithms has been published but data fragmentation, i.e., incomplete data within one single data source, has been raised as an inherent limitation of computational phenotyping. In this study, we investigated the impact of diverse data sources on two published computational phenotyping algorithms, rheumatoid arthritis (RA) and type 2 diabetes mellitus (T2DM), using Mayo EHRs and Rochester Epidemiology Project (REP) which links medical records from multiple health care systems. Results showed that both RA (less prevalent) and T2DM (more prevalent) case selections were markedly impacted by data fragmentation, with positive predictive value (PPV) of 91.4 and 92.4%, false-negative rate (FNR) of 26.6 and 14% in Mayo data, respectively, PPV of 97.2 and 98.3%, FNR of 5.2 and 3.3% in REP. T2DM controls also contain biases, with PPV of 91.2% and FNR of 1.2% for Mayo. We further elaborated underlying reasons impacting the performance.

KW - computational phenotyping

KW - diverse data sources

KW - phenotyping algorithms

KW - rheumatoid arthritis

KW - type 2 diabetes mellitus

UR - http://www.scopus.com/inward/record.url?scp=85086785834&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85086785834&partnerID=8YFLogxK

U2 - 10.3389/fgene.2020.00556

DO - 10.3389/fgene.2020.00556

M3 - Article

AN - SCOPUS:85086785834

SN - 1664-8021

VL - 11

JO - Frontiers in Genetics

JF - Frontiers in Genetics

M1 - 556

ER -
