Identification of genomic indels and structural variations using split reads

Zhengdong D. Zhang, Jiang Du, Hugo Lam, Alexej Abyzov, Alexander E. Urban, Michael Snyder, Mark Gerstein

Research output: Contribution to journal (Article)

46 Citations (Scopus)

Abstract

Background: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of next-generation sequencing technologies, high-throughput surveys of SVs at the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection.

Results: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then, to identify SVs, we score each of the many initial mappings with an assessment strategy designed to account for both sequencing and alignment errors (e.g. giving higher scores to events whose gap falls in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for events of different lengths) allows us to construct a relatively unbiased estimate of the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs.

Conclusions: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.
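The calibration idea in the Results can be illustrated with a minimal sketch. It is not the SRiC implementation: the class names, event classes, and all counts below are hypothetical, and the correction shown (observed calls multiplied by positive predictive value and divided by sensitivity, per event class) is one plausible reading of how simulation-derived calibrations could be coupled with observed counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    """Detection performance for one event class, measured on simulated reads.

    All counts are hypothetical placeholders, not values from the paper.
    """
    true_positives: int   # simulated events that were called
    false_positives: int  # calls that match no simulated event
    false_negatives: int  # simulated events that were missed

    @property
    def sensitivity(self) -> float:
        # Fraction of real events that the caller recovers.
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def ppv(self) -> float:
        # Fraction of calls that correspond to real events.
        return self.true_positives / (self.true_positives + self.false_positives)


def corrected_count(observed_calls: int, cal: Calibration) -> float:
    """Estimate the true number of events in one class.

    Multiplying by PPV discounts false calls; dividing by sensitivity
    scales the remainder up to account for missed events.
    """
    return observed_calls * cal.ppv / cal.sensitivity


# Hypothetical calibrations for two event classes.
calibrations = {
    "short_insertion": Calibration(true_positives=90, false_positives=10, false_negatives=30),
    "long_deletion": Calibration(true_positives=50, false_positives=0, false_negatives=50),
}

# Hypothetical numbers of calls observed in real data for each class.
observed = {"short_insertion": 1000, "long_deletion": 200}

estimates = {
    event_class: corrected_count(observed[event_class], calibrations[event_class])
    for event_class in observed
}
total_estimate = sum(estimates.values())

for event_class, value in estimates.items():
    print(f"{event_class}: {value:.0f} estimated events")
print(f"total: {total_estimate:.0f} estimated events")
```

Calibrating each class separately matters because a caller that, for example, misses half of all long deletions but rarely miscalls them needs a different correction from one that finds most short insertions but with more false positives.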

Original language: English (US)
Article number: 375
Journal: BMC Genomics
Volume: 12
DOI: https://doi.org/10.1186/1471-2164-12-375
State: Published - Jul 25 2011
Externally published: Yes

Fingerprint

Genomic Structural Variation
Genome
Technology
Insertional Mutagenesis
Chromosomes, Human, Pair 1
Human Genome
Calibration
Population

Keywords

  • Deletion
  • High-throughput sequencing
  • Insertion
  • Split read
  • Structure variation

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

Zhang, Z. D., Du, J., Lam, H., Abyzov, A., Urban, A. E., Snyder, M., & Gerstein, M. (2011). Identification of genomic indels and structural variations using split reads. BMC Genomics, 12, [375]. https://doi.org/10.1186/1471-2164-12-375

@article{83b1cf36e909433688cc20238c7cf191,
title = "Identification of genomic indels and structural variations using split reads",
keywords = "Deletion, High-throughput sequencing, Insertion, Split read, Structure variation",
author = "Zhang, {Zhengdong D.} and Jiang Du and Hugo Lam and Alexej Abyzov and Urban, {Alexander E.} and Michael Snyder and Mark Gerstein",
year = "2011",
month = "7",
day = "25",
doi = "10.1186/1471-2164-12-375",
language = "English (US)",
volume = "12",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Identification of genomic indels and structural variations using split reads

AU - Zhang, Zhengdong D.

AU - Du, Jiang

AU - Lam, Hugo

AU - Abyzov, Alexej

AU - Urban, Alexander E.

AU - Snyder, Michael

AU - Gerstein, Mark

PY - 2011/7/25

Y1 - 2011/7/25

KW - Deletion

KW - High-throughput sequencing

KW - Insertion

KW - Split read

KW - Structure variation

UR - http://www.scopus.com/inward/record.url?scp=79960572201&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79960572201&partnerID=8YFLogxK

U2 - 10.1186/1471-2164-12-375

DO - 10.1186/1471-2164-12-375

M3 - Article

C2 - 21787423

AN - SCOPUS:79960572201

VL - 12

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

M1 - 375

ER -