Next Generation Sequencing (NGS) technologies, which use short reads to cost-effectively characterize genomic regions and identify sequence variations, are the current standard for genome sequencing. However, calling small indels in low-complexity regions of the genome using NGS is challenging. Recent advances in Third Generation Sequencing (TGS) provide long reads, which allow large structural variants to be called accurately. However, these reads have context-dependent indel errors in low-complexity regions, resulting in lower accuracy of small indel calls compared to NGS reads. When both small and large structural variants need to be called, both NGS and TGS reads may be available. Integrating the two data types, each with its own error profile, could improve the robustness of small variant calling in challenging cases. However, no existing method integrates both types of data. We present a novel method that integrates NGS and TGS reads to call small variants. We leverage the Mixture of Experts paradigm, which uses an ensemble of Deep Neural Networks (DNNs), each processing a different data type, to make predictions. We also present two improvements in our DNN design over previous work, both of which reduce computation: sequence processing using one-dimensional convolutions instead of image processing using two-dimensional convolutions, and an algorithm to efficiently process sites with many variant candidates. Using our method to integrate Illumina and PacBio reads, we find a reduction in the number of erroneous small variant calls of up to ~30% compared to the state-of-the-art using only Illumina data. We also find improvements in calling small indels in low-complexity regions.
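As a rough illustration of the two ideas named above, the following minimal sketch shows a generic Mixture of Experts combination and a one-dimensional convolution over a per-position feature track. It is not the authors' architecture: the function names, the three-class genotype layout, the gate scores, and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def softmax(scores):
    """Convert arbitrary gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation along a sequence of per-position values."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def mixture_of_experts(expert_probs, gate_scores):
    """Combine per-expert class probabilities using softmax gate weights."""
    weights = softmax(gate_scores)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]


# Hypothetical outputs of two experts for one candidate site, over three
# assumed classes (e.g. homozygous reference, heterozygous, homozygous variant).
ngs_expert = [0.7, 0.2, 0.1]  # expert that sees only short reads
tgs_expert = [0.1, 0.8, 0.1]  # expert that sees only long reads

# Equal gate scores give each expert a weight of 0.5.
combined = mixture_of_experts([ngs_expert, tgs_expert], gate_scores=[0.0, 0.0])

# A 1D filter slides along the sequence axis only, rather than over a 2D image.
depth_track = [1.0, 2.0, 3.0, 4.0]  # hypothetical per-position feature
edges = conv1d(depth_track, kernel=[1.0, 0.0, -1.0])
```

In a trained model the gate scores would be learned per site rather than fixed, so that the combination can lean on whichever data type is more reliable in a given context; the equal weights here are only a placeholder.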
ASJC Scopus subject areas
- Biochemistry, Genetics and Molecular Biology(all)
- Agricultural and Biological Sciences(all)
- Immunology and Microbiology(all)
- Pharmacology, Toxicology and Pharmaceutics(all)