Identification of missing variants by combining multiple analytic pipelines

Yingxue Ren, Joseph S. Reddy, Cyril Pottier, Vivekananda Sarangi, Shulan Tian, Jason P. Sinnwell, Shannon K. McDonnell, Joanna M Biernacka, Minerva M Carrasquillo, Owen A Ross, Nilufer Taner, Rosa V Rademakers, Matthew Hudson, Liudmila Sergeevna Mainzer, Yan Asmann

Research output: Contribution to journal › Article

Abstract

Background: After decades of identifying risk factors using array-based genome-wide association studies (GWAS), genetic research of complex diseases has shifted to sequencing-based rare variant discovery. This requires large sample sizes for statistical power and has raised questions about whether current variant calling practices are adequate for large cohorts. It is well known that there are discrepancies between variants called by different pipelines, and that using a single pipeline always misses true variants exclusively identifiable by other pipelines. Nonetheless, it is common practice today to call variants with one pipeline because of computational cost, and to assume that false negative calls are a small percentage of the total.

Results: We analyzed 10,000 exomes from the Alzheimer's Disease Sequencing Project (ADSP) using multiple analytic pipelines consisting of different read aligners and variant calling strategies. We compared variants identified by using two aligners in 50, 100, 200, 500, 1000, and 1952 samples, and compared variants identified by adding single-sample genotyping to the default multi-sample joint genotyping in 50, 100, 500, 2000, 5000, and 10,000 samples. We found that using a single pipeline missed increasing numbers of high-quality variants as sample size grew. By combining two read aligners and two variant calling strategies, we rescued 30% of pass-QC variants at a sample size of 2000, and 56% at 10,000 samples. The rescued variants had higher proportions of low-frequency (minor allele frequency [MAF] 1-5%) and rare (MAF<1%) variants, which are the very type of variants of interest. In 660 Alzheimer's disease cases with earlier onset ages (≤65 years), 4 out of 13 (31%) previously published rare pathogenic and protective mutations in the APP, PSEN1, and PSEN2 genes were undetected by the default one-pipeline approach but recovered by the multi-pipeline approach.

Conclusions: Identification of the complete variant set from sequencing data is the prerequisite of genetic association analyses. The current analytic practice of calling genetic variants from sequencing data using a single bioinformatics pipeline is no longer adequate for increasingly large projects. The number and percentage of variants that passed quality filters but were missed by the one-pipeline approach increased rapidly with sample size.
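
The idea of "rescuing" variants by combining call sets can be illustrated with a minimal sketch. This is not the authors' implementation: the `Variant` key, the `rescued_variants` helper, and the toy call sets are hypothetical, and the abstract does not say which denominator its percentages use, so the fraction below (rescued variants as a share of the combined set) is an assumption.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical variant key: chromosome, position, reference and alternate allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def rescued_variants(
    default_calls: Iterable[Variant],
    other_calls: Iterable[Iterable[Variant]],
) -> Tuple[Set[Variant], float]:
    """Return variants found only by the additional pipelines, and their
    share of the combined (union) call set. The denominator is an assumption."""
    default_set = set(default_calls)
    combined = set(default_set)
    for calls in other_calls:
        combined.update(calls)
    rescued = combined - default_set
    fraction = len(rescued) / len(combined) if combined else 0.0
    return rescued, fraction


# Toy call sets, invented purely for illustration (not data from the study).
default = [Variant("21", 100, "A", "G"), Variant("14", 200, "C", "T")]
second_aligner = [Variant("21", 100, "A", "G"), Variant("1", 300, "G", "A")]
single_sample = [Variant("14", 200, "C", "T"), Variant("1", 400, "T", "C")]

rescued, fraction = rescued_variants(default, [second_aligner, single_sample])
print(sorted(rescued))
print(f"{fraction:.0%} of the combined set was missed by the default pipeline")

# The abstract's "4 out of 13 (31%)" is a simple rounded ratio.
print(f"{4 / 13:.0%}")
```

In practice the call sets would be read from pass-QC VCF files, and multi-allelic sites would need normalizing before keys from different pipelines can be compared; both steps are omitted here.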

Original language: English (US)
Article number: 139
Journal: BMC Bioinformatics
Volume: 19
Issue number: 1
DOI: https://doi.org/10.1186/s12859-018-2151-0
State: Published - Apr 16 2018

Fingerprint

Sample Size
Pipelines
Gene Frequency
Alzheimer Disease
Sequencing
Exome
Genetic Research
Genome-Wide Association Study
Computational Biology
Age of Onset
Alzheimer's Disease
Joints
Minor
Costs and Cost Analysis
Mutation
Genes
Genetic Association
Statistical Power
Risk Factors
Bioinformatics

Keywords

  • Combining multiple bioinformatics pipelines
  • Missing variants
  • Rare variants

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Ren, Y., Reddy, J. S., Pottier, C., Sarangi, V., Tian, S., Sinnwell, J. P., ... Asmann, Y. (2018). Identification of missing variants by combining multiple analytic pipelines. BMC Bioinformatics, 19(1), [139]. https://doi.org/10.1186/s12859-018-2151-0

@article{366176dd0d9a492ebbf13469ba712275,
title = "Identification of missing variants by combining multiple analytic pipelines",
keywords = "Combining multiple bioinformatics pipelines, Missing variants, Rare variants",
author = "Yingxue Ren and Reddy, {Joseph S.} and Cyril Pottier and Vivekananda Sarangi and Shulan Tian and Sinnwell, {Jason P.} and McDonnell, {Shannon K.} and Biernacka, {Joanna M} and Carrasquillo, {Minerva M} and Ross, {Owen A} and Nilufer Taner and Rademakers, {Rosa V} and Matthew Hudson and Mainzer, {Liudmila Sergeevna} and Yan Asmann",
year = "2018",
month = "4",
day = "16",
doi = "10.1186/s12859-018-2151-0",
language = "English (US)",
volume = "19",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Identification of missing variants by combining multiple analytic pipelines

AU - Ren, Yingxue

AU - Reddy, Joseph S.

AU - Pottier, Cyril

AU - Sarangi, Vivekananda

AU - Tian, Shulan

AU - Sinnwell, Jason P.

AU - McDonnell, Shannon K.

AU - Biernacka, Joanna M

AU - Carrasquillo, Minerva M

AU - Ross, Owen A

AU - Taner, Nilufer

AU - Rademakers, Rosa V

AU - Hudson, Matthew

AU - Mainzer, Liudmila Sergeevna

AU - Asmann, Yan

PY - 2018/4/16

Y1 - 2018/4/16

KW - Combining multiple bioinformatics pipelines

KW - Missing variants

KW - Rare variants

UR - http://www.scopus.com/inward/record.url?scp=85045460520&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045460520&partnerID=8YFLogxK

U2 - 10.1186/s12859-018-2151-0

DO - 10.1186/s12859-018-2151-0

M3 - Article

C2 - 29661148

AN - SCOPUS:85045460520

VL - 19

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 139

ER -