RVboost: RNA-seq variants prioritization using a boosting method

Chen Wang, Jaime I. Davila, Saurabh Baheti, Aditya V. Bhagwate, Xue Wang, Jean-Pierre Kocher, Susan L Slager, Andrew L Feldman, Anne J Novak, James R Cerhan, E Aubrey Thompson, Yan Asmann

Research output: Contribution to journal › Article

13 Citations (Scopus)

Abstract

Motivation: RNA-seq has become the method of choice to quantify genes and exons, discover novel transcripts and detect fusion genes. However, reliable variant identification from RNA-seq data remains challenging because of the complexities of the transcriptome, the challenges of accurately mapping exon boundary spanning reads and the bias introduced during the sequencing library preparation.

Method: We developed RVboost, a novel method specific for RNA variant prioritization. RVboost uses several attributes unique to the process of RNA library preparation, sequencing and RNA-seq data analyses. It uses a boosting method to train a model of 'good quality' variants using common variants from HapMap, and prioritizes and calls the RNA variants based on the trained model. We packaged RVboost in a comprehensive workflow, which integrates tools for variant calling, annotation and filtering.

Results: RVboost consistently outperforms the variant quality score recalibration from the Genome Analysis Toolkit and the RNA-seq variant-calling pipeline SNPiR in 12 RNA-seq samples using ground-truth variants from paired exome sequencing data. Several RNA-seq-specific attributes were identified as critical to differentiate true and false variants, including the distance of the variant positions to exon boundaries, and the percent of the reads supporting the variant in the first six base pairs. The latter identifies false variants introduced by the random hexamer priming during the library construction.
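The two RNA-seq-specific attributes named in the Results can be illustrated with a minimal Python sketch. This is not RVboost's code: the function names, the 1-based coordinate convention, and the simplification to ungapped forward-strand reads are all assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def fraction_in_first_bases(variant_pos: int,
                            read_starts: Iterable[int],
                            window: int = 6) -> float:
    """Fraction of variant-supporting reads in which the variant falls
    within the first `window` bases of the read.

    Assumption: each read is ungapped and on the forward strand, so the
    offset of the variant inside the read is variant_pos - read_start.
    """
    starts = list(read_starts)
    if not starts:
        return 0.0
    near_start = sum(1 for s in starts if 0 <= variant_pos - s < window)
    return near_start / len(starts)


def distance_to_exon_boundary(variant_pos: int,
                              exons: List[Tuple[int, int]]) -> int:
    """Smallest distance from the variant to any exon start or end.

    Assumption: `exons` is a non-empty list of (start, end) pairs in the
    same coordinate system as `variant_pos`.
    """
    return min(
        min(abs(variant_pos - start), abs(variant_pos - end))
        for start, end in exons
    )


if __name__ == "__main__":
    # Hypothetical variant at position 105 supported by four reads.
    print(fraction_in_first_bases(105, [100, 101, 90, 50]))
    print(distance_to_exon_boundary(105, [(100, 200), (300, 400)]))
```

A high value of the first attribute would suggest that the supporting evidence is concentrated at read starts, where random hexamer priming can introduce mismatches; in a boosting setup, both values would be columns of the feature matrix alongside the usual variant-caller annotations.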

Original language: English (US)
Pages (from-to): 3414-3416
Number of pages: 3
Journal: Bioinformatics
Volume: 30
Issue number: 23
DOI: 10.1093/bioinformatics/btu577
State: Published - 2014

Fingerprint

Prioritization
Boosting
RNA
Sequencing
Preparation
Attribute
Exons
Gene
Genes
Differentiate
Percent
Work Flow
Annotation
Fusion
Genome
Quantify
HapMap Project
Filtering
Exome
RNA Sequence Analysis

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability

Cite this

RVboost: RNA-seq variants prioritization using a boosting method. / Wang, Chen; Davila, Jaime I.; Baheti, Saurabh; Bhagwate, Aditya V.; Wang, Xue; Kocher, Jean-Pierre; Slager, Susan L; Feldman, Andrew L; Novak, Anne J; Cerhan, James R; Thompson, E Aubrey; Asmann, Yan.

In: Bioinformatics, Vol. 30, No. 23, 2014, p. 3414-3416.

@article{920a17243fa8422296316b08e14fe929,
title = "RVboost: RNA-seq variants prioritization using a boosting method",
abstract = "Motivation: RNA-seq has become the method of choice to quantify genes and exons, discover novel transcripts and detect fusion genes. However, reliable variant identification from RNA-seq data remains challenging because of the complexities of the transcriptome, the challenges of accurately mapping exon boundary spanning reads and the bias introduced during the sequencing library preparation. Method: We developed RVboost, a novel method specific for RNA variant prioritization. RVboost uses several attributes unique in the process of RNA library preparation, sequencing and RNA-seq data analyses. It uses a boosting method to train a model of 'good quality' variants using common variants from HapMap, and prioritizes and calls the RNA variants based on the trained model. We packaged RVboost in a comprehensive workflow, which integrates tools of variant calling, annotation and filtering. Results: RVboost consistently outperforms the variant quality score recalibration from the Genome Analysis Toolkit and the RNA-seq variant-calling pipeline SNPiR in 12 RNA-seq samples using ground-truth variants from paired exome sequencing data. Several RNA-seq-specific attributes were identified as critical to differentiate true and false variants, including the distance of the variant positions to exon boundaries, and the percent of the reads supporting the variant in the first six base pairs. The latter identifies false variants introduced by the random hexamer priming during the library construction.",
author = "Chen Wang and Davila, {Jaime I.} and Saurabh Baheti and Bhagwate, {Aditya V.} and Xue Wang and Jean-Pierre Kocher and Slager, {Susan L} and Feldman, {Andrew L} and Novak, {Anne J} and Cerhan, {James R} and Thompson, {E Aubrey} and Yan Asmann",
year = "2014",
doi = "10.1093/bioinformatics/btu577",
language = "English (US)",
volume = "30",
pages = "3414--3416",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "23",

}

TY - JOUR

T1 - RVboost

T2 - RNA-seq variants prioritization using a boosting method

AU - Wang, Chen

AU - Davila, Jaime I.

AU - Baheti, Saurabh

AU - Bhagwate, Aditya V.

AU - Wang, Xue

AU - Kocher, Jean-Pierre

AU - Slager, Susan L

AU - Feldman, Andrew L

AU - Novak, Anne J

AU - Cerhan, James R

AU - Thompson, E Aubrey

AU - Asmann, Yan

PY - 2014

Y1 - 2014

AB - Motivation: RNA-seq has become the method of choice to quantify genes and exons, discover novel transcripts and detect fusion genes. However, reliable variant identification from RNA-seq data remains challenging because of the complexities of the transcriptome, the challenges of accurately mapping exon boundary spanning reads and the bias introduced during the sequencing library preparation. Method: We developed RVboost, a novel method specific for RNA variant prioritization. RVboost uses several attributes unique in the process of RNA library preparation, sequencing and RNA-seq data analyses. It uses a boosting method to train a model of 'good quality' variants using common variants from HapMap, and prioritizes and calls the RNA variants based on the trained model. We packaged RVboost in a comprehensive workflow, which integrates tools of variant calling, annotation and filtering. Results: RVboost consistently outperforms the variant quality score recalibration from the Genome Analysis Toolkit and the RNA-seq variant-calling pipeline SNPiR in 12 RNA-seq samples using ground-truth variants from paired exome sequencing data. Several RNA-seq-specific attributes were identified as critical to differentiate true and false variants, including the distance of the variant positions to exon boundaries, and the percent of the reads supporting the variant in the first six base pairs. The latter identifies false variants introduced by the random hexamer priming during the library construction.

UR - http://www.scopus.com/inward/record.url?scp=84930693393&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84930693393&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btu577

DO - 10.1093/bioinformatics/btu577

M3 - Article

C2 - 25170027

AN - SCOPUS:84930693393

VL - 30

SP - 3414

EP - 3416

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 23

ER -