Statistical method evaluation for differentially methylated CpGs in base resolution next-generation DNA sequencing data

Yun Zhang, Saurabh Baheti, Zhifu D Sun

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

High-throughput bisulfite methylation sequencing, such as reduced representation bisulfite sequencing (RRBS), Agilent SureSelect Human Methyl-Seq (Methyl-seq) or whole-genome bisulfite sequencing, is commonly used for base-resolution methylome research. These data are represented either by the ratio of methylated cytosines to total coverage at a CpG site or by the numbers of methylated and unmethylated cytosines. Multiple statistical methods can be used to detect differentially methylated CpGs (DMCs) between conditions, and these methods are often the basis for the next step of differentially methylated region identification. The ratio data offer the flexibility of fitting many linear models, whereas the raw count data take coverage information into account. There is an array of options for DMC detection in each data type; however, it is not clear which statistical method is optimal. In this study, we systematically evaluated four statistical methods on methylation ratio data and four methods on count-based data and compared their performance with regard to type I error control, sensitivity and specificity of DMC detection and computational resource demands, using real RRBS data along with simulation. Our results show that the ratio-based tests are generally more conservative (less sensitive) than the count-based tests. However, some count-based methods have high false-positive rates and should be avoided. The beta-binomial model gives a good balance between sensitivity and specificity and is the preferred method. Selection of methods in different settings, signal versus noise and sample size estimation are also discussed.
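The two data representations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example read counts and the beta-binomial shape parameters (`alpha`, `beta`) are all hypothetical, chosen only to show that two CpG sites with the same methylation ratio can carry very different coverage, which a count-based model such as the beta-binomial sees and a ratio does not.

```python
from math import exp, lgamma


def methylation_ratio(methylated, unmethylated):
    """Ratio representation: methylated reads divided by total coverage."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("site has zero coverage")
    return methylated / total


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """log P(K = k) for K ~ BetaBinomial(n, alpha, beta).

    k is the methylated read count and n the total coverage at a site.
    """
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


# Hypothetical sites: (methylated reads, unmethylated reads).
low_coverage = (3, 1)
high_coverage = (30, 10)

# Both sites collapse to the same ratio, so a ratio-based test cannot
# tell them apart without extra weighting.
ratio_low = methylation_ratio(*low_coverage)
ratio_high = methylation_ratio(*high_coverage)

# The count-based likelihood depends on coverage. The shape parameters
# here are illustrative assumptions, not values from the study.
alpha, beta = 3.0, 1.0
loglik_low = beta_binomial_logpmf(3, 4, alpha, beta)
loglik_high = beta_binomial_logpmf(30, 40, alpha, beta)

print(ratio_low, ratio_high)
print(loglik_low, loglik_high)
```

In practice the shape parameters would be estimated per group from replicate samples, and a test statistic would compare the fitted groups; that estimation step is omitted here.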

Original language: English (US)
Pages (from-to): 374-386
Number of pages: 13
Journal: Briefings in Bioinformatics
Volume: 19
Issue number: 3
DOIs: 10.1093/bib/bbw133
State: Published - May 1 2018

Fingerprint

Methylation
DNA Sequence Analysis
Statistical methods
DNA
Genes
Throughput
Statistics
Cytosine
Sensitivity and Specificity
Statistical Models
Sample Size
Noise
Linear Models
Genome
hydrogen sulfite
Research

Keywords

  • Bisulfite next-generation sequencing
  • Differential methylation
  • Statistical method comparison

ASJC Scopus subject areas

  • Information Systems
  • Molecular Biology

Cite this

Statistical method evaluation for differentially methylated CpGs in base resolution next-generation DNA sequencing data. / Zhang, Yun; Baheti, Saurabh; Sun, Zhifu D.

In: Briefings in Bioinformatics, Vol. 19, No. 3, 01.05.2018, p. 374-386.

@article{7c23209ad9a94cca83e96defe8a755ce,
title = "Statistical method evaluation for differentially methylated CpGs in base resolution next-generation DNA sequencing data",
abstract = "High-throughput bisulfite methylation sequencing, such as reduced representation bisulfite sequencing (RRBS), Agilent SureSelect Human Methyl-Seq (Methyl-seq) or whole-genome bisulfite sequencing, is commonly used for base-resolution methylome research. These data are represented either by the ratio of methylated cytosines to total coverage at a CpG site or by the numbers of methylated and unmethylated cytosines. Multiple statistical methods can be used to detect differentially methylated CpGs (DMCs) between conditions, and these methods are often the basis for the next step of differentially methylated region identification. The ratio data offer the flexibility of fitting many linear models, whereas the raw count data take coverage information into account. There is an array of options for DMC detection in each data type; however, it is not clear which statistical method is optimal. In this study, we systematically evaluated four statistical methods on methylation ratio data and four methods on count-based data and compared their performance with regard to type I error control, sensitivity and specificity of DMC detection and computational resource demands, using real RRBS data along with simulation. Our results show that the ratio-based tests are generally more conservative (less sensitive) than the count-based tests. However, some count-based methods have high false-positive rates and should be avoided. The beta-binomial model gives a good balance between sensitivity and specificity and is the preferred method. Selection of methods in different settings, signal versus noise and sample size estimation are also discussed.",
keywords = "Bisulfite next-generation sequencing, Differential methylation, Statistical method comparison",
author = "Zhang, Yun and Baheti, Saurabh and Sun, {Zhifu D}",
year = "2018",
month = "5",
day = "1",
doi = "10.1093/bib/bbw133",
language = "English (US)",
volume = "19",
pages = "374--386",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford University Press",
number = "3",

}

TY - JOUR

T1 - Statistical method evaluation for differentially methylated CpGs in base resolution next-generation DNA sequencing data

AU - Zhang, Yun

AU - Baheti, Saurabh

AU - Sun, Zhifu D

PY - 2018/5/1

Y1 - 2018/5/1

AB - High-throughput bisulfite methylation sequencing, such as reduced representation bisulfite sequencing (RRBS), Agilent SureSelect Human Methyl-Seq (Methyl-seq) or whole-genome bisulfite sequencing, is commonly used for base-resolution methylome research. These data are represented either by the ratio of methylated cytosines to total coverage at a CpG site or by the numbers of methylated and unmethylated cytosines. Multiple statistical methods can be used to detect differentially methylated CpGs (DMCs) between conditions, and these methods are often the basis for the next step of differentially methylated region identification. The ratio data offer the flexibility of fitting many linear models, whereas the raw count data take coverage information into account. There is an array of options for DMC detection in each data type; however, it is not clear which statistical method is optimal. In this study, we systematically evaluated four statistical methods on methylation ratio data and four methods on count-based data and compared their performance with regard to type I error control, sensitivity and specificity of DMC detection and computational resource demands, using real RRBS data along with simulation. Our results show that the ratio-based tests are generally more conservative (less sensitive) than the count-based tests. However, some count-based methods have high false-positive rates and should be avoided. The beta-binomial model gives a good balance between sensitivity and specificity and is the preferred method. Selection of methods in different settings, signal versus noise and sample size estimation are also discussed.

KW - Bisulfite next-generation sequencing

KW - Differential methylation

KW - Statistical method comparison

UR - http://www.scopus.com/inward/record.url?scp=85032445855&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85032445855&partnerID=8YFLogxK

U2 - 10.1093/bib/bbw133

DO - 10.1093/bib/bbw133

M3 - Article

VL - 19

SP - 374

EP - 386

JO - Briefings in Bioinformatics

JF - Briefings in Bioinformatics

SN - 1467-5463

IS - 3

ER -