Calculating sample size estimates for RNA sequencing data

Steven N. Hart; Terry M. Therneau; Yuji Zhang; Gregory A. Poland; Jean Pierre Kocher

doi:10.1089/cmb.2012.0283

Research output: Contribution to journal › Article › peer-review

121 Scopus citations

Abstract

Background: Given its high technical reproducibility and orders-of-magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring gene expression levels in biological experiments. Two important questions must be considered when designing an experiment: 1) how deep does one need to sequence, and 2) how many biological replicates are necessary to observe a significant change in expression?

Results: Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Combining this observation with other parameters, such as biological and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments.

Conclusions: Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these for their own experiments.
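The abstract names the inputs to the power model (sequencing depth, biological variation, the size of the expression change) but does not state the model itself. As a rough illustration only, the sketch below assumes a generic normal-approximation sample-size formula in which the variance of log expression is Poisson counting noise (1/depth) plus squared biological coefficient of variation. The function name, parameter names, and example values are hypothetical; this is not the authors' R code or Excel worksheet.

```python
from math import log
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-group comparison.

    Assumed model (an illustration, not taken from the abstract):
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * (1/depth + cv^2) / ln(effect)^2

    depth  : expected read count for the gene of interest
    cv     : biological coefficient of variation within a group
    effect : fold change to detect, e.g. 2.0 for a doubling
    """
    if depth <= 0 or cv < 0 or effect <= 0 or effect == 1:
        raise ValueError("need depth > 0, cv >= 0, and effect > 0 with effect != 1")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")

    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = std_normal.inv_cdf(power)          # desired statistical power

    # Counting noise shrinks with depth; biological variation does not.
    variance = 1.0 / depth + cv ** 2
    return 2 * (z_alpha + z_power) ** 2 * variance / log(effect) ** 2


if __name__ == "__main__":
    # Hypothetical inputs: 20 reads for the gene, CV of 0.4, 2-fold change.
    n = samples_per_group(depth=20, cv=0.4, effect=2.0)
    print(f"replicates per group (before rounding up): {n:.2f}")
```

Under this assumed model, raising depth only reduces the 1/depth term, so once depth is large the biological variation dominates and additional replicates matter more than additional reads.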

Original language	English (US)
Pages (from-to)	970-978
Number of pages	9
Journal	Journal of Computational Biology
Volume	20
Issue number	12
DOIs	https://doi.org/10.1089/cmb.2012.0283
State	Published - Dec 1 2013

ASJC Scopus subject areas

Modeling and Simulation
Molecular Biology
Genetics
Computational Mathematics
Computational Theory and Mathematics

Cite this

@article{50cdef16ae764ecf9b9fd0194b3bf825,
title = "Calculating sample size estimates for RNA sequencing data",
abstract = "Background: Given its high technical reproducibility and orders-of-magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring gene expression levels in biological experiments. Two important questions must be considered when designing an experiment: 1) how deep does one need to sequence, and 2) how many biological replicates are necessary to observe a significant change in expression? Results: Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Combining this observation with other parameters, such as biological and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. Conclusions: Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these for their own experiments.",
author = "Hart, {Steven N.} and Therneau, {Terry M.} and Zhang, Yuji and Poland, {Gregory A.} and Kocher, {Jean Pierre}",
year = "2013",
month = dec,
day = "1",
doi = "10.1089/cmb.2012.0283",
language = "English (US)",
volume = "20",
pages = "970--978",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "12",
}

TY  - JOUR
T1  - Calculating sample size estimates for RNA sequencing data
AU  - Hart, Steven N.
AU  - Therneau, Terry M.
AU  - Zhang, Yuji
AU  - Poland, Gregory A.
AU  - Kocher, Jean Pierre
PY  - 2013/12/1
Y1  - 2013/12/1
N2  - Background: Given its high technical reproducibility and orders-of-magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring gene expression levels in biological experiments. Two important questions must be considered when designing an experiment: 1) how deep does one need to sequence, and 2) how many biological replicates are necessary to observe a significant change in expression? Results: Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Combining this observation with other parameters, such as biological and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. Conclusions: Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these for their own experiments.
UR  - http://www.scopus.com/inward/record.url?scp=84888343631&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84888343631&partnerID=8YFLogxK
U2  - 10.1089/cmb.2012.0283
DO  - 10.1089/cmb.2012.0283
M3  - Article
C2  - 23961961
AN  - SCOPUS:84888343631
SN  - 1066-5277
VL  - 20
SP  - 970
EP  - 978
JO  - Journal of Computational Biology
JF  - Journal of Computational Biology
IS  - 12
ER  -
