TY - JOUR
T1 - Statistical method evaluation for differentially methylated CpGs in base resolution next-generation DNA sequencing data
AU - Zhang, Yun
AU - Baheti, Saurabh
AU - Sun, Zhifu
N1 - Publisher Copyright: © The Author 2017. Published by Oxford University Press. All rights reserved.
PY - 2018/5/1
Y1 - 2018/5/1
AB - High-throughput bisulfite methylation sequencing, such as reduced representation bisulfite sequencing (RRBS), Agilent SureSelect Human Methyl-Seq (Methyl-seq) or whole-genome bisulfite sequencing, is commonly used for base-resolution methylome research. These data are represented either by the ratio of methylated cytosines to total coverage at a CpG site or by the numbers of methylated and unmethylated cytosines. Multiple statistical methods can be used to detect differentially methylated CpGs (DMCs) between conditions, and these methods are often the basis for the next step of identifying differentially methylated regions. Ratio data can be flexibly fitted with many linear models, whereas raw count data take coverage information into account. There is an array of options for DMC detection with each data type; however, it is not clear which statistical method is optimal. In this study, we systematically evaluated four statistical methods on methylation ratio data and four methods on count-based data and compared their performance with regard to type I error control, sensitivity and specificity of DMC detection and computational resource demands, using real RRBS data along with simulation. Our results show that the ratio-based tests are generally more conservative (less sensitive) than the count-based tests. However, some count-based methods have high false-positive rates and should be avoided. The beta-binomial model gives a good balance between sensitivity and specificity and is the preferred method. Selection of methods in different settings, signal versus noise and sample size estimation are also discussed.
KW - Bisulfite next-generation sequencing
KW - Differential methylation
KW - Statistical method comparison
UR - http://www.scopus.com/inward/record.url?scp=85032445855&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032445855&partnerID=8YFLogxK
U2 - 10.1093/bib/bbw133
DO - 10.1093/bib/bbw133
M3 - Article
C2 - 28040747
AN - SCOPUS:85032445855
SN - 1467-5463
VL - 19
SP - 374
EP - 386
JO - Briefings in Bioinformatics
JF - Briefings in Bioinformatics
IS - 3
ER -