Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework

Miaoxin Li, Jiang Li, Mulin Jun Li, Zhicheng Pan, Jacob Shujui Hsu, Dajiang J. Liu, Xiaowei Zhan, Junwen Wang, Youqiang Song, Pak Chung Sham

Research output: Contribution to journal › Article

10 Citations (Scopus)

Abstract

Whole genome sequencing (WGS) is a promising strategy for uncovering the variants and genes responsible for human diseases and traits. However, robust platforms for comprehensive downstream analysis are lacking. In the present study, we first proposed three novel algorithms (sequence gap-filled gene feature annotation, bit-block encoded genotypes, and sectional fast access to text lines) to address three fundamental problems. These three algorithms then formed the infrastructure of KGGSeq, a robust parallel computing framework that integrates downstream analysis functions for WGS data. KGGSeq is equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenicity prediction and statistical tests. In tests with WGS data from the 1000 Genomes Project, KGGSeq annotated several thousand more reliable nonsynonymous variants than other widely used tools (e.g. ANNOVAR and SnpEff). It took only around half an hour on a small server with 10 CPUs to access the genotypes of 60 million variants in 2504 subjects, whereas a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less of the space to flexibly represent phased or unphased genotypes with multiple alleles, and it calculated genotypic correlation over 1000 times faster.
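
The abstract does not describe the layout of the bit-block genotype format, so the following is only a generic sketch of why bit-packed genotypes can speed up genotypic correlation, not KGGSeq's actual encoding. It assumes unphased, biallelic genotypes coded as alternative-allele counts (0, 1 or 2) with no missing values. The function names `pack`, `popcount` and `correlation` are hypothetical. Each variant is stored as two bit planes, and every sum needed for Pearson's r reduces to bitwise ANDs and bit counts.

```python
from math import sqrt

def pack(genotypes):
    """Pack 0/1/2 alt-allele counts into two bit planes held in Python ints.

    Assumption for illustration only: biallelic, unphased, no missing data.
    """
    carrier = 0  # bit i is set if subject i carries at least one alt allele
    hom_alt = 0  # bit i is set if subject i carries two alt alleles
    for i, g in enumerate(genotypes):
        if g >= 1:
            carrier |= 1 << i
        if g == 2:
            hom_alt |= 1 << i
    return carrier, hom_alt

def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def correlation(x, y, n):
    """Pearson correlation of two packed variants over n subjects.

    Each genotype equals carrier_bit + hom_alt_bit, so sums of genotypes,
    squares and cross-products can all be written as bit counts.
    """
    (a1, b1), (a2, b2) = x, y
    sx = popcount(a1) + popcount(b1)        # sum of genotypes at variant 1
    sy = popcount(a2) + popcount(b2)        # sum of genotypes at variant 2
    sxx = popcount(a1) + 3 * popcount(b1)   # sum of squares: 1 per het, 4 per hom
    syy = popcount(a2) + 3 * popcount(b2)
    sxy = (popcount(a1 & a2) + popcount(a1 & b2)
           + popcount(b1 & a2) + popcount(b1 & b2))  # sum of cross-products
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a monomorphic variant
    return (n * sxy - sx * sy) / sqrt(var_x * var_y)
```

For example, `correlation(pack([0, 1, 2, 1, 0, 2]), pack([2, 1, 0, 1, 2, 0]), 6)` compares a variant with its exact mirror image. Phased, multi-allelic and missing genotypes, which the abstract says the real format supports, would need additional bit planes and are left out of this sketch.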

Original language: English (US)
Article number: e75
Journal: Nucleic Acids Research
Volume: 45
Issue number: 9
DOI: 10.1093/nar/gkx019
State: Published - May 19 2017

Fingerprint

Genotype
Genome
Molecular Sequence Annotation
Quality Control
Alleles
Genes

ASJC Scopus subject areas

  • Genetics

Cite this

Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework. / Li, Miaoxin; Li, Jiang; Li, Mulin Jun; Pan, Zhicheng; Hsu, Jacob Shujui; Liu, Dajiang J.; Zhan, Xiaowei; Wang, Junwen; Song, Youqiang; Sham, Pak Chung.

In: Nucleic Acids Research, Vol. 45, No. 9, e75, 19.05.2017.

Li, Miaoxin ; Li, Jiang ; Li, Mulin Jun ; Pan, Zhicheng ; Hsu, Jacob Shujui ; Liu, Dajiang J. ; Zhan, Xiaowei ; Wang, Junwen ; Song, Youqiang ; Sham, Pak Chung. / Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework. In: Nucleic Acids Research. 2017 ; Vol. 45, No. 9.
@article{48b681b41ac14c71ac07263c68b9689d,
title = "Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework",
abstract = "Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene feature annotation, bit-block encoded genotypes and sectional fast access to text lines to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In the tests with whole genome sequencing data from 1000 Genomes Project, KGGSeq annotated several thousand more reliable nonsynonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of 60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5{\%} or less space to flexibly represent phased or unphased genotypes with multiple alleles and achieved a speed of over 1000 times faster to calculate genotypic correlation.",
author = "Miaoxin Li and Jiang Li and Li, {Mulin Jun} and Zhicheng Pan and Hsu, {Jacob Shujui} and Liu, {Dajiang J.} and Xiaowei Zhan and Junwen Wang and Youqiang Song and Sham, {Pak Chung}",
year = "2017",
month = "5",
day = "19",
doi = "10.1093/nar/gkx019",
language = "English (US)",
volume = "45",
journal = "Nucleic Acids Research",
issn = "0305-1048",
publisher = "Oxford University Press",
number = "9",
pages = "e75",
}

TY - JOUR

T1 - Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework

AU - Li, Miaoxin

AU - Li, Jiang

AU - Li, Mulin Jun

AU - Pan, Zhicheng

AU - Hsu, Jacob Shujui

AU - Liu, Dajiang J.

AU - Zhan, Xiaowei

AU - Wang, Junwen

AU - Song, Youqiang

AU - Sham, Pak Chung

PY - 2017/5/19

Y1 - 2017/5/19

AB - Whole genome sequencing (WGS) is a promising strategy to unravel variants or genes responsible for human diseases and traits. However, there is a lack of robust platforms for a comprehensive downstream analysis. In the present study, we first proposed three novel algorithms, sequence gap-filled gene feature annotation, bit-block encoded genotypes and sectional fast access to text lines to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, for integrating downstream analysis functions for whole genome sequencing data. KGGSeq has been equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In the tests with whole genome sequencing data from 1000 Genomes Project, KGGSeq annotated several thousand more reliable nonsynonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access genotypes of 60 million variants of 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles and achieved a speed of over 1000 times faster to calculate genotypic correlation.

UR - http://www.scopus.com/inward/record.url?scp=85015660495&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85015660495&partnerID=8YFLogxK

U2 - 10.1093/nar/gkx019

DO - 10.1093/nar/gkx019

M3 - Article

C2 - 28115622

AN - SCOPUS:85015660495

VL - 45

JO - Nucleic Acids Research

JF - Nucleic Acids Research

SN - 0305-1048

IS - 9

M1 - e75

ER -