Generalizations of Markov model to characterize biological sequences

Junwen Wang, Sridhar Hannenhalli

Research output: Contribution to journal (Article)

11 Citations (Scopus)

Abstract

Background. Currently used kth-order Markov models estimate the probability of generating a single nucleotide conditional upon the immediately preceding (gap = 0) k units. However, this approach neither takes into account the joint dependency of multiple neighboring nucleotides nor considers long-range dependency with gap > 0.

Results. We describe a configurable tool to explore generalizations of the standard Markov model. We evaluated whether sequence classification accuracy can be improved by using an alternative set of model parameters. The evaluation was done on four classes of biological sequences: CpG-poor promoters, all promoters, exons, and nucleosome positioning sequences. Using di- and tri-nucleotides as the model unit significantly improved sequence classification accuracy relative to the standard single-nucleotide model. For nucleosome positioning sequences, optimal accuracy was achieved at a gap length of 4. Furthermore, the plot of classification accuracy versus gap showed a periodicity of 10-11 bp, which might indicate structural preferences in the nucleosome positioning sequence. The tool is implemented in Java and is available for download at ftp://ftp.pcbi.upenn.edu/GMM/.

Conclusion. Markov modeling is an important component of many sequence analysis tools. We have extended the standard Markov model to incorporate joint and long-range dependencies between the sequence elements. The proposed generalizations of the Markov model are likely to improve the overall accuracy of sequence analysis tools.
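The generalization described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' Java tool. The function names, the add-one pseudocount, the use of overlapping units, and the exact way the gap is measured are all assumptions made for illustration, since the abstract does not specify them. The sketch conditions one model unit (1, 2 or 3 nucleotides) on the k units that end `gap` positions before it, and classifies a sequence by comparing its log-likelihood under two trained models.

```python
from collections import defaultdict
import math


def contexts(seq, k, unit, gap):
    """Yield (context, target) pairs for a generalized Markov model.

    target  : `unit` consecutive nucleotides starting at position i
    context : the k*unit nucleotides that end `gap` positions before the target
    (overlapping units and this gap convention are illustrative assumptions)
    """
    ctx_len = k * unit
    for i in range(ctx_len + gap, len(seq) - unit + 1):
        yield seq[i - gap - ctx_len:i - gap], seq[i:i + unit]


def train(seqs, k=1, unit=1, gap=0):
    """Count how often each target unit follows each context."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for ctx, tgt in contexts(s, k, unit, gap):
            counts[ctx][tgt] += 1
    return counts


def log_likelihood(seq, counts, k=1, unit=1, gap=0, pseudo=1.0):
    """Log-probability of seq under the model, with pseudocount smoothing."""
    n_units = 4 ** unit  # number of possible units over the alphabet A, C, G, T
    total = 0.0
    for ctx, tgt in contexts(seq, k, unit, gap):
        row = counts.get(ctx, {})
        num = row.get(tgt, 0) + pseudo
        den = sum(row.values()) + pseudo * n_units
        total += math.log(num / den)
    return total


def classify(seq, pos_counts, neg_counts, **params):
    """Return True if seq is more likely under the positive-class model."""
    pos = log_likelihood(seq, pos_counts, **params)
    neg = log_likelihood(seq, neg_counts, **params)
    return pos >= neg
```

With `k=1, unit=1, gap=0` this reduces to an ordinary first-order Markov model. Setting `unit=2` or `unit=3` gives the di- and tri-nucleotide variants, and `gap=4` would correspond to the gap length the abstract reports as optimal for nucleosome positioning sequences, under the gap convention assumed here. The same `k`, `unit` and `gap` values must be passed to `train` and to `classify`.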

Original language: English (US)
Article number: 219
Journal: BMC Bioinformatics
Volume: 6
DOI: 10.1186/1471-2105-6-219
State: Published - Sep 6 2005
Externally published: Yes

Fingerprint

Markov Model
Nucleotides
Nucleosomes
Positioning
Sequence Analysis
Joints
Promoter
Standard Model
Periodicity
Unit
Exons
Generalization
Range of data
Java
Immediately
Likely
Model
Alternatives
Evaluation
Modeling

ASJC Scopus subject areas

  • Medicine(all)
  • Structural Biology
  • Applied Mathematics

Cite this

Generalizations of Markov model to characterize biological sequences. / Wang, Junwen; Hannenhalli, Sridhar.

In: BMC Bioinformatics, Vol. 6, 219, 06.09.2005.

@article{99ff6f421c18414e92ee772829223e7f,
title = "Generalizations of Markov model to characterize biological sequences",
author = "Junwen Wang and Sridhar Hannenhalli",
year = "2005",
month = "9",
day = "6",
doi = "10.1186/1471-2105-6-219",
language = "English (US)",
volume = "6",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Generalizations of Markov model to characterize biological sequences

AU - Wang, Junwen

AU - Hannenhalli, Sridhar

PY - 2005/9/6

Y1 - 2005/9/6

AB - Background. The currently used kth Markov models estimate the probability of generating a single nucleotide conditional upon the immediately preceding (gap=0) k units. However, this neither takes into account the joint dependency of multiple neighboring nucleotides, nor does it consider the long range dependency with gap>0. Result. We describe a configurable tool to explore generalizations of the standard Markov model. We evaluated whether the sequence classification accuracy can be improved by using an alternative set of model parameters. The evaluation was done on four classes of biological sequences - CpG-poor promoters, all promoters, exons and nucleosome positioning sequences. Using di- and tri-nucleotide as the model unit significantly improved the sequence classification accuracy relative to the standard single nucleotide model. In the case of nucleosome positioning sequences, optimal accuracy was achieved at a gap length of 4. Furthermore in the plot of classification accuracy versus the gap, a periodicity of 10-11 bps was observed which might indicate structural preferences in the nucleosome positioning sequence. The tool is implemented in Java and is available for download at ftp://ftp.pcbi.upenn.edu/GMM/. Conclusion. Markov modeling is an important component of many sequence analysis tools. We have extended the standard Markov model to incorporate joint and long range dependencies between the sequence elements. The proposed generalizations of the Markov model are likely to improve the overall accuracy of sequence analysis tools.

UR - http://www.scopus.com/inward/record.url?scp=25444493365&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=25444493365&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-6-219

DO - 10.1186/1471-2105-6-219

M3 - Article

C2 - 16144548

AN - SCOPUS:25444493365

VL - 6

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 219

ER -