Identification of factors associated with duplicate rate in ChIP-seq data

Shulan Tian, Shuxia Peng, Michael Kalmbach, Krutika S. Gaonkar, Aditya Bhagwate, Wei D Ding, Jeanette E Eckel-Passow, Huihuang D Yan, Susan L Slager

Research output: Contribution to journal › Article

Abstract

Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contain redundant reads termed duplicates, that is, reads mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates, which represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed before peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Removing all duplicates will therefore underestimate the signal level in peaks and affect the identification of signal changes across samples, so an in-depth evaluation of the impact of duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we examined the distribution of duplicates in the genome, the extent to which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq datasets had about 40% duplicates, and 97% of them were within peaks. For the other datasets, generated with PCR amplification of ChIP DNA, the narrow-peak marks have, as expected, a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, a pattern more conspicuous in high-confidence peaks. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. Our analysis supports the feasibility of retaining the signal portion of duplicates in downstream analysis, thus alleviating the limitation of complete deduplication.
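
The definition of duplicates used in the abstract (reads mapping to the same genomic location and strand) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `duplicate_stats`, the tuple layouts for reads and peaks, and the toy data are assumptions made purely for illustration. It counts every read beyond the first at a given (chromosome, position, strand) as a duplicate, then reports the overall duplicate rate and the fraction of duplicates that fall inside peaks.

```python
from collections import Counter


def duplicate_stats(reads, peaks):
    """Summarize duplicates in a list of mapped reads.

    reads: list of (chrom, position, strand) tuples (hypothetical layout).
    peaks: list of (chrom, start, end) half-open intervals (hypothetical layout).
    """
    # Reads sharing chromosome, position and strand collapse into one key.
    counts = Counter(reads)
    total = sum(counts.values())
    # Every copy beyond the first at a key is counted as a duplicate.
    dup_total = sum(n - 1 for n in counts.values())

    def in_peak(chrom, pos):
        # Linear scan; adequate for a toy example, not for genome-scale data.
        return any(c == chrom and s <= pos < e for c, s, e in peaks)

    dup_in_peaks = sum(
        n - 1 for (chrom, pos, _strand), n in counts.items() if in_peak(chrom, pos)
    )
    return {
        "duplicate_rate": dup_total / total if total else 0.0,
        "fraction_of_duplicates_in_peaks": dup_in_peaks / dup_total if dup_total else 0.0,
    }


# Toy data: 7 reads, 3 of which are duplicates; 2 of those lie in the peak.
reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 100, "+"),
    ("chr1", 100, "-"),  # same position, opposite strand: not a duplicate
    ("chr1", 150, "+"),
    ("chr2", 500, "-"), ("chr2", 500, "-"),
]
peaks = [("chr1", 90, 200)]
stats = duplicate_stats(reads, peaks)
print(stats)
```

A position-and-strand key like this cannot tell PCR duplicates from natural duplicates, which is the ambiguity the abstract addresses by relating duplicate level in peaks to enrichment estimated from nonredundant reads.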

Original language: English (US)
Article number: e0214723
Journal: PLoS One
Volume: 14
Issue number: 4
DOI: https://doi.org/10.1371/journal.pone.0214723
State: Published - Apr 1 2019

Fingerprint

Chromatin Immunoprecipitation
Polymerase chain reaction
Chromatin
histones
DNA
DNA-binding proteins
DNA-directed DNA polymerase
Artifacts
Lysine
Amplification
Noise
Genes
Genome

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

Identification of factors associated with duplicate rate in ChIP-seq data. / Tian, Shulan; Peng, Shuxia; Kalmbach, Michael; Gaonkar, Krutika S.; Bhagwate, Aditya; Ding, Wei D; Eckel-Passow, Jeanette E; Yan, Huihuang D; Slager, Susan L.

In: PloS one, Vol. 14, No. 4, e0214723, 01.04.2019.

Tian S, Peng S, Kalmbach M, Gaonkar KS, Bhagwate A, Ding WD et al. Identification of factors associated with duplicate rate in ChIP-seq data. PloS one. 2019 Apr 1;14(4). e0214723. https://doi.org/10.1371/journal.pone.0214723
Tian, Shulan ; Peng, Shuxia ; Kalmbach, Michael ; Gaonkar, Krutika S. ; Bhagwate, Aditya ; Ding, Wei D ; Eckel-Passow, Jeanette E ; Yan, Huihuang D ; Slager, Susan L. / Identification of factors associated with duplicate rate in ChIP-seq data. In: PloS one. 2019 ; Vol. 14, No. 4.
@article{0c5ad7953e51459790b98ae8df2a9e84,
title = "Identification of factors associated with duplicate rate in ChIP-seq data",
author = "Shulan Tian and Shuxia Peng and Michael Kalmbach and Gaonkar, {Krutika S.} and Aditya Bhagwate and Ding, {Wei D} and Eckel-Passow, {Jeanette E} and Yan, {Huihuang D} and Slager, {Susan L}",
year = "2019",
month = "4",
day = "1",
doi = "10.1371/journal.pone.0214723",
language = "English (US)",
volume = "14",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "4",

}

TY - JOUR

T1 - Identification of factors associated with duplicate rate in ChIP-seq data

AU - Tian, Shulan

AU - Peng, Shuxia

AU - Kalmbach, Michael

AU - Gaonkar, Krutika S.

AU - Bhagwate, Aditya

AU - Ding, Wei D

AU - Eckel-Passow, Jeanette E

AU - Yan, Huihuang D

AU - Slager, Susan L

PY - 2019/4/1

Y1 - 2019/4/1

UR - http://www.scopus.com/inward/record.url?scp=85063776254&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063776254&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0214723

DO - 10.1371/journal.pone.0214723

M3 - Article

VL - 14

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 4

M1 - e0214723

ER -