Development of dbOGAP: A bioinformatics resource of O-GlcNAcylated proteins and site prediction

Zhang Zhi Hu; Manabu Torii; Jinlian Wang; Hongfang Liu; Gerald W. Hart

doi:10.1109/BIBMW.2009.5332094

Digital Health Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Protein glycosylation is one of the most common posttranslational modifications (PTMs) and occurs in several types. O-GlcNAcylation is an O-linked glycosylation in which β-N-acetylglucosamine (GlcNAc) is attached to Ser/Thr residues by O-GlcNAc transferase (OGT) and removed by O-GlcNAcase (OGA). Unlike mucin-type O-glycosylation, O-GlcNAcylation occurs primarily on nucleocytoplasmic proteins, and the monosaccharide is not further extended. Moreover, O-GlcNAcylation is dynamic and often reciprocal to phosphorylation at the same or adjacent Ser/Thr residues. Growing evidence suggests that O-GlcNAcylation is very common and has broad roles in physiology and disease, especially through its interplay with phosphorylation, e.g., in the regulation of insulin signaling and transcription and in diabetes and neurodegenerative diseases. In contrast to the enormous body of research on the cellular roles of phosphorylation, research on O-GlcNAcylation has been disproportionately small, and annotation of O-GlcNAcylated sites in protein databases is currently scarce. An O-GlcNAcylation site prediction program was developed in 2002, but it was based on a small data set of the 40 O-GlcNAcylation sites known at that time (http://www.cbs.dtu.dk/services/YinOYang/). Here we seek to develop a database of O-GlcNAcylated proteins and sites, named dbOGAP, together with an O-GlcNAcylation site prediction system based on the known sites in dbOGAP, to facilitate annotation and proteomic identification of O-GlcNAcylation sites. We developed dbOGAP primarily from O-GlcNAcylated proteins and sites published in peer-reviewed articles dating back to 1984, when the modification was first described. Most of these proteins were mapped to UniProtKB protein IDs; some could not be mapped unambiguously. The database currently contains 540 protein entries with experimental O-GlcNAcylation information and 338 O-GlcNAc sites on 164 proteins.
About 59% of these proteins are human; the other organisms include rat, mouse, fly, and African frog. Among the 164 proteins with known O-GlcNAcylation sites, 122 have both phosphorylation sites (1634 in total) and O-GlcNAc sites (263 in total). Gene Ontology (GO) profiling showed that the known O-GlcNAcylated proteins have a broad range of functions, including developmental processes, transcriptional regulation, cell signaling, metabolic regulation, and cellular transport and trafficking. The GO profile also showed that O-GlcNAcylated proteins are primarily nuclear and cytoplasmic, including membrane-associated intracellular proteins. The database is also populated with sequences of proteins orthologous to known O-GlcNAcylated proteins. Additional functional data, including other PTM features, biological pathways, and disease information, have been integrated into the database. We developed an O-GlcNAcylation site prediction program using a support vector machine (SVM). As positive instances, sequence fragments surrounding 322 O-GlcNAcylated Ser/Thr sites were extracted from 157 proteins in dbOGAP; over 28 thousand sequence fragments surrounding the remaining Ser/Thr sites in those proteins were assumed to be negative instances. Two thirds of this data set was randomly selected as development data and used for tuning the parameters of the SVM classifiers, while the rest was set apart as a held-out test data set. To reduce the impact of imbalanced data on the performance of the trained classifiers, we explored different ratios of positive to negative instances in the training data, controlled by under-sampling the negative instances. The optimal parameters of the prediction system were sought in five-fold cross-validation tests conducted on the development data set, and the final classifier, trained on the entire development data set, was evaluated on the held-out test data set.
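The data handling described above (a random two-thirds/one-third split, under-sampling of negatives, and five-fold cross-validation) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed random seeds, and the `neg_per_pos` parameter are hypothetical, and the abstract does not say which ratios were explored or whether the split was stratified.

```python
import random


def split_dev_test(instances, dev_fraction=2 / 3, seed=0):
    """Randomly split instances into development and held-out test sets."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]


def undersample_negatives(positives, negatives, neg_per_pos, seed=0):
    """Keep every positive and a random subset of negatives.

    neg_per_pos is the (assumed) number of negatives kept per positive.
    """
    rng = random.Random(seed)
    n_keep = min(len(negatives), int(len(positives) * neg_per_pos))
    return list(positives), rng.sample(list(negatives), n_keep)


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

In a pipeline of this shape, under-sampling would be applied only to the training portion of each fold, so that validation and held-out data keep the original class imbalance.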
We used four encoding methods for feature vector extraction: binary encoding, composition of k-spaced amino acid pairs (CKSAAP), monomer spectrum (MS), and composition of monomer spectrum (CMS). These encoding methods yielded different prediction performance for O-GlcNAcylation sites. The results showed that the method obtained an AUC (area under the curve) of ∼80% on the test sequence set. The dbOGAP database and the O-GlcNAcylation site prediction tool are being made web accessible, and the web resource will be an important bioinformatics tool to facilitate exploration of the broad roles of O-GlcNAcylation in physiology and disease.
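Two of the encodings named above, binary encoding and CKSAAP, can be illustrated with a minimal sketch. The window size (`flank`), the padding character, and the maximum spacing `k_max` are assumptions chosen for illustration; the abstract does not state the values used, and the MS and CMS encodings are not sketched because their definitions are not given here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def extract_window(sequence, site, flank=7, pad="X"):
    """Return the fragment of 2*flank+1 residues centred on a 0-based site."""
    padded = pad * flank + sequence + pad * flank
    return padded[site: site + 2 * flank + 1]


def binary_encode(fragment):
    """One-hot encode each residue; padding or unknown residues stay all-zero."""
    vector = []
    for residue in fragment:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def cksaap_encode(fragment, k_max=2):
    """Frequencies of residue pairs separated by k = 0..k_max other residues."""
    vector = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(fragment) - k - 1
        for i in range(n_windows):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:  # skip pairs containing padding
                counts[pair] += 1
        total = max(n_windows, 1)
        vector.extend(counts[p] / total for p in PAIRS)
    return vector
```

With these assumptions, a fragment of length n gives a binary vector of 20n entries, while CKSAAP gives 400 entries per spacing value regardless of fragment length. Either vector could then be passed to an SVM implementation of choice.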

Original language	English (US)
Title of host publication	Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009
Pages	346
Number of pages	1
DOIs	https://doi.org/10.1109/BIBMW.2009.5332094
State	Published - 2009
Event	2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009 - Washington, DC, United States Duration: Nov 1 2009 → Nov 4 2009

Keywords

Database
O-GlcNAcylation
Protein glycosylation
Site prediction
Support vector machine

ASJC Scopus subject areas

Biomedical Engineering
Health Informatics
Health Information Management

Cite this

Hu, Z. Z., Torii, M., Wang, J., Liu, H., & Hart, G. W. (2009). Development of dbOGAP: A bioinformatics resource of O-GlcNAcylated proteins and site prediction. In Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009 (pp. 346). Article 5332094 (Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009). https://doi.org/10.1109/BIBMW.2009.5332094

@inproceedings{7d2f233c13e54040a1b2f6d6ff96f943,

title = "Development of dbOGAP: A bioinformatics resource of O-GlcNAcylated proteins and site prediction",

keywords = "Database, O-GlcNAcylation, Protein glycosylation, Site prediction, Support vector machine",

author = "Hu, {Zhang Zhi} and Manabu Torii and Jinlian Wang and Hongfang Liu and Hart, {Gerald W.}",

year = "2009",

doi = "10.1109/BIBMW.2009.5332094",

language = "English (US)",

isbn = "9781424451210",

series = "Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009",

pages = "346",

booktitle = "Proceedings - 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009",

note = "2009 IEEE International Conference on Bioinformatics and Biomedicine Workshops, BIBMW 2009 ; Conference date: 01-11-2009 Through 04-11-2009",

}
