A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis

Sarah E. Reese; Kellie J. Archer; Terry M. Therneau; Elizabeth J. Atkinson; Celine M. Vachon; Mariza De Andrade; Jean Pierre A. Kocher; Jeanette E. Eckel-Passow

doi:10.1093/bioinformatics/btt480

Research output: Contribution to journal › Article › peer-review

65 Scopus citations

Abstract

Motivation: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data.

Results: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies.

Conclusion: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.
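The abstract does not spell out the test statistic, so the following is only a minimal sketch of one way a gPCA-style batch test could look, not the authors' implementation. It assumes that the guided direction is the first right singular vector of Y'X (Y being a sample-by-batch indicator matrix and X the centered data), that the statistic is the variance along that direction divided by the variance along the ordinary first principal component, and that significance comes from permuting batch labels. The function names, the simulated data, the number of permutations, and the add-one p-value correction are all hypothetical choices made for illustration.

```python
import numpy as np


def first_loading(m):
    """Return the first right singular vector of matrix m."""
    _, _, vt = np.linalg.svd(m, full_matrices=False)
    return vt[0]


def gpca_delta(x, batch):
    """Variance along the batch-guided PC1 divided by variance along the ordinary PC1.

    Assumed form of the statistic; values near 1 suggest that batch
    explains about as much variance as the leading principal component.
    """
    x = x - x.mean(axis=0)                                  # center each feature
    labels = np.unique(batch)
    y = (batch[:, None] == labels[None, :]).astype(float)   # n x b batch indicator
    v_guided = first_loading(y.T @ x)                       # guided: SVD of Y'X
    v_unguided = first_loading(x)                           # unguided: SVD of X
    return np.var(x @ v_guided, ddof=1) / np.var(x @ v_unguided, ddof=1)


def gpca_permutation_test(x, batch, n_perm=200, seed=0):
    """Permutation p-value for the assumed statistic (batch labels shuffled)."""
    rng = np.random.default_rng(seed)
    observed = gpca_delta(x, batch)
    permuted = np.array(
        [gpca_delta(x, rng.permutation(batch)) for _ in range(n_perm)]
    )
    # Add-one correction keeps the p-value strictly positive (illustrative choice).
    p_value = (1 + np.sum(permuted >= observed)) / (1 + n_perm)
    return observed, p_value


# Simulated example (hypothetical data): 60 samples, 50 features, two batches,
# with a mean shift applied to half of the features in the second batch.
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 30)
x = rng.normal(size=(60, 50))
x[batch == 1, :25] += 2.0

delta, p_value = gpca_permutation_test(x, batch)
print(f"delta = {delta:.3f}, permutation p-value = {p_value:.4f}")
```

Because the ordinary first principal component maximizes variance by construction, the ratio cannot exceed 1 (up to floating-point error), which makes it easy to read as a proportion. This is also why a batch effect that is not the largest source of variability can be missed by visual inspection of PCA yet still be picked up by a guided direction.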

Original language	English (US)
Pages (from-to)	2877-2883
Number of pages	7
Journal	Bioinformatics
Volume	29
Issue number	22
DOIs	https://doi.org/10.1093/bioinformatics/btt480
State	Published - Nov 15 2013

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{251a2f59714840baa1e62d0204ca614b,

title = "A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis",

abstract = "Motivation: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. Results: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies. Conclusion: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.",

author = "Reese, {Sarah E.} and Archer, {Kellie J.} and Therneau, {Terry M.} and Atkinson, {Elizabeth J.} and Vachon, {Celine M.} and {De Andrade}, Mariza and Kocher, {Jean Pierre A.} and Eckel-Passow, {Jeanette E.}",

note = "Funding Information: Funding: National Institutes of Health research grants (R01 HL87660 to M.d.A.); (R01 CA128931 and CA140286 to C.M.V.); (T32 ES007334 to S.E.R. and K.J.A.); Mayo Clinic Center for Individualized Medicine.",

year = "2013",

month = nov,

day = "15",

doi = "10.1093/bioinformatics/btt480",

language = "English (US)",

volume = "29",

pages = "2877--2883",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "22",

}

TY - JOUR

T1 - A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis

AU - Reese, Sarah E.

AU - Archer, Kellie J.

AU - Therneau, Terry M.

AU - Atkinson, Elizabeth J.

AU - Vachon, Celine M.

AU - De Andrade, Mariza

AU - Kocher, Jean Pierre A.

AU - Eckel-Passow, Jeanette E.

N1 - Funding Information: Funding: National Institutes of Health research grants (R01 HL87660 to M.d.A.); (R01 CA128931 and CA140286 to C.M.V.); (T32 ES007334 to S.E.R. and K.J.A.); Mayo Clinic Center for Individualized Medicine.

PY - 2013/11/15

Y1 - 2013/11/15

AB - Motivation: Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. Results: We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies. Conclusion: We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.

UR - http://www.scopus.com/inward/record.url?scp=84890087087&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84890087087&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btt480

DO - 10.1093/bioinformatics/btt480

M3 - Article

C2 - 23958724

AN - SCOPUS:84890087087

SN - 1367-4803

VL - 29

SP - 2877

EP - 2883

JO - Bioinformatics

JF - Bioinformatics

IS - 22

ER -
