CPAT: Coding-potential assessment tool using an alignment-free logistic regression model

Liguo Wang; Hyun Jung Park; Surendra Dasari; Shengqin Wang; Jean Pierre Kocher; Wei Li

doi:10.1093/nar/gkt006

Research output: Contribution to journal › Article › peer-review

761 Scopus citations

Abstract

Thousands of novel transcripts have been identified using deep transcriptome sequencing. This discovery of large and 'hidden' transcriptome rejuvenates the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT software outperformed (sensitivity: 0.96, specificity: 0.97) other state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylo Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylo Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.
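As a rough illustration of how two of the four features named in the abstract (open reading frame size and open reading frame coverage) could feed a logistic model, here is a minimal Python sketch. It is not CPAT's implementation. The ORF definition (first ATG to the next in-frame stop, forward strand only), the function names, and any weights passed to `coding_probability` are assumptions made for illustration. The Fickett TESTCODE statistic and hexamer usage bias are omitted, and the fitted coefficients are not given in this record.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> int:
    """Length in nucleotides (stop codon included) of the longest
    ATG-to-stop ORF over the three forward reading frames.
    Assumption: this ORF definition is illustrative, not CPAT's."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_features(seq: str) -> tuple[int, float]:
    """Return (ORF size, ORF coverage), where coverage is the
    ORF size divided by the transcript length."""
    size = longest_orf(seq)
    coverage = size / len(seq) if seq else 0.0
    return size, coverage


def coding_probability(features, weights, intercept: float) -> float:
    """Logistic function applied to a linear combination of features.
    The weights and intercept are hypothetical placeholders."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

For example, `orf_features("CCATGAAATAGCC")` finds the ORF `ATGAAATAG` in the third frame. A real classifier would fit the weights on labeled coding and noncoding transcripts rather than choosing them by hand.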

Original language	English (US)
Pages (from-to)	e74
Journal	Nucleic acids research
Volume	41
Issue number	6
DOIs	https://doi.org/10.1093/nar/gkt006
State	Published - Apr 2013

ASJC Scopus subject areas

Genetics

Cite this

@article{4b6530052cd2406797d22cb9a96a34a7,

title = "CPAT: Coding-potential assessment tool using an alignment-free logistic regression model",

abstract = "Thousands of novel transcripts have been identified using deep transcriptome sequencing. This discovery of large and 'hidden' transcriptome rejuvenates the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT software outperformed (sensitivity: 0.96, specificity: 0.97) other state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylo Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylo Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.",

author = "Liguo Wang and Park, {Hyun Jung} and Surendra Dasari and Shengqin Wang and Kocher, {Jean Pierre} and Wei Li",

year = "2013",

month = apr,

doi = "10.1093/nar/gkt006",

language = "English (US)",

volume = "41",

pages = "e74",

journal = "Nucleic acids research",

issn = "0305-1048",

publisher = "Oxford University Press",

number = "6",

}

TY - JOUR

T1 - CPAT

T2 - Coding-potential assessment tool using an alignment-free logistic regression model

AU - Wang, Liguo

AU - Park, Hyun Jung

AU - Dasari, Surendra

AU - Wang, Shengqin

AU - Kocher, Jean Pierre

AU - Li, Wei

PY - 2013/4

Y1 - 2013/4

N2 - Thousands of novel transcripts have been identified using deep transcriptome sequencing. This discovery of large and 'hidden' transcriptome rejuvenates the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT software outperformed (sensitivity: 0.96, specificity: 0.97) other state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylo Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylo Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.

AB - Thousands of novel transcripts have been identified using deep transcriptome sequencing. This discovery of large and 'hidden' transcriptome rejuvenates the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT software outperformed (sensitivity: 0.96, specificity: 0.97) other state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylo Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylo Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.

UR - http://www.scopus.com/inward/record.url?scp=84876020023&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84876020023&partnerID=8YFLogxK

U2 - 10.1093/nar/gkt006

DO - 10.1093/nar/gkt006

M3 - Article

C2 - 23335781

AN - SCOPUS:84876020023

SN - 0305-1048

VL - 41

SP - e74

JO - Nucleic acids research

JF - Nucleic acids research

IS - 6

ER -
