Recommendations for performance optimizations when using GATK3.8 and GATK4

Jacob R. Heldenbrand, Saurabh Baheti, Matthew A. Bockol, Travis M. Drucker, Steven N. Hart, Matthew E. Hudson, Ravishankar K. Iyer, Michael T. Kalmbach, Katherine I. Kendig, Eric W. Klee, Nathan R. Mattson, Eric D. Wieben, Mathieu Wiepert, Derek E. Wildman, Liudmila S. Mainzer

Research output: Contribution to journal (Article)

Abstract

Background: Use of the Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. Recently the toolkit has been evolving rapidly. Significant computational performance improvements were introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as the stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements, to help the community stay abreast of changes in performance.

Results: We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best practices variant calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in a run time of only a few hours on a whole human genome sequenced to a depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU.

Conclusions: In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8 by splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be nnn.4 hours at the cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud. For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be 34.1 hours on 40 samples, with 1.18 samples processed per hour at the cost of $2.60 per sample on a c5.18xlarge instance of Amazon Cloud.
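The throughput and cost figures quoted for the GATK4 multi-sample scenario can be cross-checked with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it uses nothing beyond the three numbers stated in the abstract (40 samples, 34.1 hours, $2.60 per sample). The hourly instance rate is not given in the abstract, so the value computed here is merely the rate those figures imply, not a quoted price.

```python
# Figures quoted in the abstract for GATK4, multiple samples on one node.
SAMPLES = 40
WALLTIME_HOURS = 34.1
COST_PER_SAMPLE_USD = 2.60


def samples_per_hour(samples: int, walltime_hours: float) -> float:
    """Throughput when the whole batch finishes within one walltime window."""
    return samples / walltime_hours


def batch_cost_usd(samples: int, cost_per_sample: float) -> float:
    """Total cost of the batch, given a flat per-sample cost."""
    return samples * cost_per_sample


def implied_hourly_rate_usd(total_cost: float, walltime_hours: float) -> float:
    """Hourly node cost implied by the totals (an inference, not a quoted price)."""
    return total_cost / walltime_hours


throughput = samples_per_hour(SAMPLES, WALLTIME_HOURS)
batch_cost = batch_cost_usd(SAMPLES, COST_PER_SAMPLE_USD)
hourly = implied_hourly_rate_usd(batch_cost, WALLTIME_HOURS)

print(f"samples per hour:    {throughput:.2f}")
print(f"batch cost (USD):    {batch_cost:.2f}")
print(f"implied USD per hour: {hourly:.2f}")
```

The plain ratio of 40 samples to 34.1 hours comes out near 1.17 samples per hour, slightly below the quoted 1.18; the small gap is presumably due to rounding in the reported figures.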

Original language: English (US)
Article number: 557
Journal: BMC bioinformatics
Volume: 20
Issue number: 1
DOI: https://doi.org/10.1186/s12859-019-3169-7
State: Published - Nov 8 2019

Keywords

  • Best practices
  • Cluster computing
  • Computational performance
  • GATK
  • Genomic variant calling
  • Parallelization

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Heldenbrand, J. R., Baheti, S., Bockol, M. A., Drucker, T. M., Hart, S. N., Hudson, M. E., ... Mainzer, L. S. (2019). Recommendations for performance optimizations when using GATK3.8 and GATK4. BMC bioinformatics, 20(1), [557]. https://doi.org/10.1186/s12859-019-3169-7

@article{c28d426496ed4ea38424887ba3a15819,
title = "Recommendations for performance optimizations when using GATK3.8 and GATK4",
keywords = "Best practices, Cluster computing, Computational performance, GATK, Genomic variant calling, Parallelization",
author = "Heldenbrand, {Jacob R.} and Saurabh Baheti and Bockol, {Matthew A.} and Drucker, {Travis M.} and Hart, {Steven N.} and Hudson, {Matthew E.} and Iyer, {Ravishankar K.} and Kalmbach, {Michael T.} and Kendig, {Katherine I.} and Klee, {Eric W.} and Mattson, {Nathan R.} and Wieben, {Eric D.} and Mathieu Wiepert and Wildman, {Derek E.} and Mainzer, {Liudmila S.}",
year = "2019",
month = "11",
day = "8",
doi = "10.1186/s12859-019-3169-7",
language = "English (US)",
volume = "20",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Recommendations for performance optimizations when using GATK3.8 and GATK4

AU - Heldenbrand, Jacob R.

AU - Baheti, Saurabh

AU - Bockol, Matthew A.

AU - Drucker, Travis M.

AU - Hart, Steven N.

AU - Hudson, Matthew E.

AU - Iyer, Ravishankar K.

AU - Kalmbach, Michael T.

AU - Kendig, Katherine I.

AU - Klee, Eric W.

AU - Mattson, Nathan R.

AU - Wieben, Eric D.

AU - Wiepert, Mathieu

AU - Wildman, Derek E.

AU - Mainzer, Liudmila S.

PY - 2019/11/8

Y1 - 2019/11/8

KW - Best practices

KW - Cluster computing

KW - Computational performance

KW - GATK

KW - Genomic variant calling

KW - Parallelization

UR - http://www.scopus.com/inward/record.url?scp=85074690361&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85074690361&partnerID=8YFLogxK

U2 - 10.1186/s12859-019-3169-7

DO - 10.1186/s12859-019-3169-7

M3 - Article

C2 - 31703611

AN - SCOPUS:85074690361

VL - 20

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 557

ER -