Support vector machine-based mucin-type O-linked glycosylation site prediction using enhanced sequence feature encoding.

Manabu Torii, Hongfang D Liu, Zhang Zhi Hu

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

Glycosylation is a common and complex protein post-translational modification (PTM). In particular, mucin-type O-linked glycosylation is abundant and serves important biological functions. The number of determined glycosylation sites is still small, and there remains a need for accurate computational prediction for the annotation and functional understanding of proteins. PTM site prediction can be formulated as a machine learning task. An important step in applying machine learning to this task is encoding protein fragments as feature vectors. Here we assess existing encoding methods as well as an enhanced encoding method named composition of monomer spectrum (CMS) using support vector machines (SVMs). SVMs employing the existing encoding methods achieved an AUC (area under the ROC curve) of 90.3-91.3%, and those employing CMS achieved an AUC of 92.4%. Analysis of the different encoding methods suggests potential for further improving the prediction.
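To make the encoding step concrete, the following is a minimal Python sketch of two generic fragment encodings (a per-position binary encoding and an amino acid composition) plus a pairwise AUC calculation. It is not the CMS method, which the abstract does not define, and it does not reproduce the authors' pipeline. The window size, the `-` padding symbol, and all function names are assumptions made for illustration.

```python
# Illustrative sketch only: generic encodings, NOT the paper's CMS method.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical padding symbol for windows that run past the sequence ends


def extract_windows(sequence, flank=3):
    """Yield (position, fragment) for every S/T residue, with `flank` residues per side."""
    padded = PAD * flank + sequence + PAD * flank
    for i, residue in enumerate(sequence):
        if residue in "ST":  # candidate O-linked sites are serine and threonine
            yield i, padded[i : i + 2 * flank + 1]


def one_hot_encode(fragment):
    """Binary encoding: 20 indicator features per fragment position (padding is all zeros)."""
    vector = []
    for residue in fragment:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def composition_encode(fragment):
    """Amino acid composition: fraction of each residue type, ignoring padding."""
    counts = [fragment.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid division by zero for an all-padding fragment
    return [c / total for c in counts]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    for position, fragment in extract_windows("APTSA", flank=1):
        print(position, fragment, len(one_hot_encode(fragment)))
```

Feature vectors built this way could then be passed to any classifier, such as an SVM, and the classifier's decision scores fed to `auc` alongside the known site labels.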

Original language: English (US)
Pages (from-to): 640-644
Number of pages: 5
Journal: AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
Volume: 2009
State: Published - 2009
Externally published: Yes

Fingerprint

Mucins
Glycosylation
Post Translational Protein Processing
ROC Curve
Area Under Curve
Proteins
Support Vector Machine
Machine Learning

ASJC Scopus subject areas

  • Medicine(all)

Cite this

@article{2a357c5f83d24e1bb66d585100577cfc,
title = "Support vector machine-based mucin-type {O}-linked glycosylation site prediction using enhanced sequence feature encoding.",
abstract = "Glycosylation is a common and complex protein post-translational modification (PTM). In particular, mucin-type O-linked glycosylation is abundant and serves important biological functions. The number of determined glycosylation sites is still small, and there remains a need for accurate computational prediction for the annotation and functional understanding of proteins. PTM site prediction can be formulated as a machine learning task. An important step in applying machine learning to this task is encoding protein fragments as feature vectors. Here we assess existing encoding methods as well as an enhanced encoding method named composition of monomer spectrum (CMS) using support vector machines (SVMs). SVMs employing the existing encoding methods achieved an AUC (area under the ROC curve) of 90.3-91.3{\%}, and those employing CMS achieved an AUC of 92.4{\%}. Analysis of the different encoding methods suggests potential for further improving the prediction.",
author = "Torii, Manabu and Liu, {Hongfang D} and Hu, {Zhang Zhi}",
year = "2009",
language = "English (US)",
volume = "2009",
pages = "640--644",
journal = "AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium",
issn = "1559-4076",
publisher = "American Medical Informatics Association",
}

TY - JOUR

T1 - Support vector machine-based mucin-type O-linked glycosylation site prediction using enhanced sequence feature encoding.

AU - Torii, Manabu

AU - Liu, Hongfang D

AU - Hu, Zhang Zhi

PY - 2009

Y1 - 2009

AB - Glycosylation is a common and complex protein post-translational modification (PTM). In particular, mucin-type O-linked glycosylation is abundant and serves important biological functions. The number of determined glycosylation sites is still small, and there remains a need for accurate computational prediction for the annotation and functional understanding of proteins. PTM site prediction can be formulated as a machine learning task. An important step in applying machine learning to this task is encoding protein fragments as feature vectors. Here we assess existing encoding methods as well as an enhanced encoding method named composition of monomer spectrum (CMS) using support vector machines (SVMs). SVMs employing the existing encoding methods achieved an AUC (area under the ROC curve) of 90.3-91.3%, and those employing CMS achieved an AUC of 92.4%. Analysis of the different encoding methods suggests potential for further improving the prediction.

UR - http://www.scopus.com/inward/record.url?scp=79953768199&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79953768199&partnerID=8YFLogxK

M3 - Article

VL - 2009

SP - 640

EP - 644

JO - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

JF - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

SN - 1559-4076

ER -