Managing genomic variant calling workflows with Swift/T

Azza E. Ahmed, Jacob Heldenbrand, Yan Asmann, Faisal M. Fadlelmola, Daniel S. Katz, Katherine Kendig, Matthew C. Kendzior, Tiffany Li, Yingxue Ren, Elliott Rodriguez, Matthew R. Weber, Justin M. Wozniak, Jennie Zermeno, Liudmila S. Mainzer

Research output: Contribution to journal (Article)

Abstract

Bioinformatics research is frequently performed using complex workflows with multiple steps, fans, merges, and conditionals. This complexity makes management of the workflow difficult on a computer cluster, especially when running in parallel on large batches of data: hundreds or thousands of samples at a time. Scientific workflow management systems could help with that. Many are now being proposed, but is there yet a “best” workflow management system for bioinformatics? Such a system would need to satisfy numerous, sometimes conflicting requirements: from ease of use, to seamless deployment at peta- and exa-scale, and portability to the cloud. We evaluated Swift/T as a candidate for such a role by implementing a primary genomic variant calling workflow in the Swift/T language, focusing on workflow management, performance, and scalability issues that arise from production-grade big data genomic analyses. In the process we introduced novel features into the language, which are now part of its open repository. Additionally, we formalized a set of design criteria for quality, robust, maintainable workflows that must function at scale in a production setting, such as a large genomic sequencing facility or a major hospital system. The use of Swift/T conveys two key advantages. (1) It operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.); thus a single workflow is trivially portable across numerous clusters. (2) The leaf functions of Swift/T permit developers to easily swap executables in and out of the workflow, which makes the workflow easy to maintain and makes it easy to request resources optimal for each stage of the pipeline. While Swift/T’s data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code. Nonetheless, the language gives users a powerful and portable way to scale up analyses in many computing architectures.
The code for our implementation of a variant calling workflow using Swift/T can be found on GitHub at https://github.com/ncsa/Swift-T-Variant-Calling, with full documentation provided at http://swift-t-variant-calling.readthedocs.io/en/latest/.
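
The two ideas in the abstract, swappable leaf functions and per-sample parallelism, can be illustrated with a small Python analogy. This is not Swift/T code and is not taken from the authors' repository: the tool names (aligner_a, aligner_b, caller), their arguments, and the helper functions are all hypothetical. An explicit thread pool stands in for the parallelism that Swift/T derives implicitly from data flow.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# A "leaf" in this sketch is a function that maps a sample name to a command line.
# All tool names and arguments below are hypothetical placeholders.
Leaf = Callable[[str], List[str]]


def aligner_a(sample: str) -> List[str]:
    return ["aligner_a", "--in", f"{sample}.fastq", "--out", f"{sample}.bam"]


def aligner_b(sample: str) -> List[str]:
    # A drop-in alternative to aligner_a with a different command-line shape.
    return ["aligner_b", f"{sample}.fastq", f"{sample}.bam"]


def caller(sample: str) -> List[str]:
    return ["caller", f"{sample}.bam", f"{sample}.vcf"]


def dry_run(cmd: List[str]) -> str:
    # Stand-in executor: returns the command as text instead of running it.
    return " ".join(cmd)


def run_sample(sample: str, stages: List[Leaf], execute: Callable[[List[str]], str]) -> List[str]:
    # Stages run in order within one sample.
    return [execute(stage(sample)) for stage in stages]


def run_batch(samples: List[str], stages: List[Leaf],
              execute: Callable[[List[str]], str] = dry_run,
              workers: int = 4) -> List[List[str]]:
    # Samples are independent, so they are processed concurrently.
    # pool.map returns results in the same order as the input samples.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: run_sample(s, stages, execute), samples))


if __name__ == "__main__":
    for commands in run_batch(["s1", "s2"], [aligner_a, caller]):
        print(commands)
```

Replacing aligner_a with aligner_b in the stage list changes the tool without touching run_sample or run_batch, which is the kind of maintainability the abstract attributes to leaf functions.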

Original language: English (US)
Article number: e0211608
Journal: PLoS One
Volume: 14
Issue number: 7
DOI: https://doi.org/10.1371/journal.pone.0211608
State: Published - Jan 1 2019

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology (all)
  • Agricultural and Biological Sciences (all)

Cite this

Ahmed, A. E., Heldenbrand, J., Asmann, Y., Fadlelmola, F. M., Katz, D. S., Kendig, K., ... Mainzer, L. S. (2019). Managing genomic variant calling workflows with Swift/T. PloS one, 14(7), [e0211608]. https://doi.org/10.1371/journal.pone.0211608

@article{01f38662ee864709b58f7ae92d97a7d5,
title = "Managing genomic variant calling workflows with Swift/T",
author = "Ahmed, {Azza E.} and Jacob Heldenbrand and Yan Asmann and Fadlelmola, {Faisal M.} and Katz, {Daniel S.} and Katherine Kendig and Kendzior, {Matthew C.} and Tiffany Li and Yingxue Ren and Elliott Rodriguez and Weber, {Matthew R.} and Wozniak, {Justin M.} and Jennie Zermeno and Mainzer, {Liudmila S.}",
year = "2019",
month = "1",
day = "1",
doi = "10.1371/journal.pone.0211608",
language = "English (US)",
volume = "14",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "7",

}

TY - JOUR

T1 - Managing genomic variant calling workflows with Swift/T

AU - Ahmed, Azza E.

AU - Heldenbrand, Jacob

AU - Asmann, Yan

AU - Fadlelmola, Faisal M.

AU - Katz, Daniel S.

AU - Kendig, Katherine

AU - Kendzior, Matthew C.

AU - Li, Tiffany

AU - Ren, Yingxue

AU - Rodriguez, Elliott

AU - Weber, Matthew R.

AU - Wozniak, Justin M.

AU - Zermeno, Jennie

AU - Mainzer, Liudmila S.

PY - 2019/1/1

Y1 - 2019/1/1

UR - http://www.scopus.com/inward/record.url?scp=85069312540&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069312540&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0211608

DO - 10.1371/journal.pone.0211608

M3 - Article

VL - 14

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 7

M1 - e0211608

ER -