Measuring the effect of inter-study variability on estimating prediction error

Shuyi Ma; Jaeyun Sung; Andrew T. Magis; Yuliang Wang; Donald Geman; Nathan D. Price

doi:10.1371/journal.pone.0110840

Surgery

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Background: The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in "batch-effects") and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies.

Methods: Here we quantify the impact of these combined "study-effects" on a disease signature's predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of number of studies quantifies influence of study-effects on performance.

Results: As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification.

Conclusions: We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when "sufficient" diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.
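
The two validation schemes compared in the Methods can be illustrated as index splits. The following is a minimal sketch, not the authors' code: the function names `rcv_splits` and `isv_splits`, the fold count, the random seed, and the toy study labels are all assumptions made for illustration. It shows only how the held-out sets differ; no classifier is trained.

```python
import random
from collections import defaultdict


def rcv_splits(n_samples, n_folds=5, seed=0):
    """Randomized cross-validation: shuffle sample indices and cut them into folds.

    Each fold is drawn from the pooled samples without regard to study,
    so training and test data come from the same pooled distribution.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield sorted(train), sorted(test)


def isv_splits(study_ids):
    """Inter-study validation: hold out one entire study per split."""
    by_study = defaultdict(list)
    for index, study in enumerate(study_ids):
        by_study[study].append(index)
    for study in sorted(by_study):
        test = by_study[study]
        train = [i for i, s in enumerate(study_ids) if s != study]
        yield study, train, test


# Toy example (hypothetical labels): 9 samples drawn from 3 studies.
studies = ["A", "A", "A", "B", "B", "B", "B", "C", "C"]

for study, train, test in isv_splits(studies):
    # No sample from the held-out study appears in the training set.
    assert all(studies[i] != study for i in train)
    print(f"ISV hold out {study}: train={train} test={test}")

for train, test in rcv_splits(len(studies), n_folds=3):
    print(f"RCV fold: train={train} test={test}")
```

Under these assumptions, the RCV-ISV difference described in the abstract would be obtained by training the same classifier on the `train` indices of each scheme, scoring it on the matching `test` indices, and comparing the averaged scores.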

Original language	English (US)
Article number	e110840
Journal	PloS one
Volume	9
Issue number	10
DOIs	https://doi.org/10.1371/journal.pone.0110840
State	Published - Oct 17 2014

ASJC Scopus subject areas

General

Cite this

@article{2cd56932c9b14810aa937ed822c387be,
title = "Measuring the effect of inter-study variability on estimating prediction error",
abstract = "Background: The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in {"}batch-effects{"}) and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies. Methods: Here we quantify the impact of these combined {"}study-effects{"} on a disease signature's predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of number of studies quantifies influence of study-effects on performance. Results: As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification. 
Conclusions: We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when {"}sufficient{"} diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.",

author = "Shuyi Ma and Jaeyun Sung and Magis, {Andrew T.} and Yuliang Wang and Donald Geman and Price, {Nathan D.}",
note = "Publisher Copyright: {\textcopyright} 2014 Ma et al.",
year = "2014",
month = oct,
day = "17",
doi = "10.1371/journal.pone.0110840",
language = "English (US)",
volume = "9",
journal = "PloS one",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "10",
}

TY - JOUR
T1 - Measuring the effect of inter-study variability on estimating prediction error
AU - Ma, Shuyi
AU - Sung, Jaeyun
AU - Magis, Andrew T.
AU - Wang, Yuliang
AU - Geman, Donald
AU - Price, Nathan D.
PY - 2014/10/17
Y1 - 2014/10/17
AB - Background: The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in "batch-effects") and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies. Methods: Here we quantify the impact of these combined "study-effects" on a disease signature's predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of number of studies quantifies influence of study-effects on performance. Results: As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification. 
Conclusions: We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when "sufficient" diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.

UR - http://www.scopus.com/inward/record.url?scp=84908140306&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84908140306&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0110840
DO - 10.1371/journal.pone.0110840
M3 - Article
C2 - 25330348
AN - SCOPUS:84908140306
SN - 1932-6203
VL - 9
JO - PloS one
JF - PloS one
IS - 10
M1 - e110840
ER -
