Protein docking using surface matching and supervised machine learning

Andrew J. Bordner, Andrey A. Gorin

Research output: Contribution to journal › Article

20 Citations (Scopus)

Abstract

Computational prediction of protein complex structures through docking offers a means to gain a mechanistic understanding of protein interactions that mediate biological processes. This is particularly important as the number of experimentally determined structures of isolated proteins exceeds the number of structures of complexes. A comprehensive docking procedure is described in which efficient sampling of conformations is achieved by matching surface normal vectors, fast filtering for shape complementarity, clustering by RMSD, and scoring the docked conformations using a supervised machine learning approach. Contacting residue pair frequencies, residue propensities, evolutionary conservation, and shape complementarity score for each docking conformation are used as input data to a Random Forest classifier. The performance of the Random Forest approach for selecting correctly docked conformations was assessed by cross-validation using a nonredundant benchmark set of X-ray structures for 93 heterodimer and 733 homodimer complexes. The single highest-ranked docking solution was the correct (near-native) structure for slightly more than one third of the complexes. Furthermore, the fraction of highly ranked correct structures was significantly higher than the overall fraction of correct structures for almost all complexes. A detailed analysis of the difficult-to-predict complexes revealed that the majority of the homodimer cases were explained by incorrect oligomeric state annotation. Evolutionary conservation and shape complementarity score as well as both underrepresented and overrepresented residue types and residue pairs were found to make the largest contributions to the overall prediction accuracy. Finally, the method was also applied to docking unbound subunit structures from a previously published benchmark set.
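
The "clustering by RMSD" step named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which clustering algorithm, cutoff, or score convention the authors used, so everything here is an assumption for illustration: the function names `rmsd` and `greedy_cluster`, the greedy center-picking strategy, the 1.0 cutoff, the convention that a higher score is better, and the toy two-atom poses. No superposition is performed, on the assumption that all docked poses are already expressed in a common receptor frame.

```python
import math


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) points.

    No superposition is applied; poses are assumed to share one frame.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and the same length")
    total = sum((p[i] - q[i]) ** 2 for p, q in zip(a, b) for i in range(3))
    return math.sqrt(total / len(a))


def greedy_cluster(poses, scores, cutoff):
    """Greedy clustering (an assumed strategy, not taken from the paper).

    The best-scoring unassigned pose becomes a cluster center and absorbs
    every unassigned pose within `cutoff` RMSD; repeat until none remain.
    Returns a list of (center_index, member_indices) tuples.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [
            j for j in order
            if j not in assigned and rmsd(poses[i], poses[j]) <= cutoff
        ]
        assigned.update(members)
        clusters.append((i, members))
    return clusters


# Hypothetical toy data: four poses of a two-atom "ligand".
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
    [(5.0, 5.2, 5.0), (6.0, 5.2, 5.0)],
]
scores = [0.9, 0.4, 0.7, 0.8]  # assumed convention: higher is better

clusters = greedy_cluster(poses, scores, cutoff=1.0)
# clusters == [(0, [0, 1]), (3, [3, 2])]
```

Keeping only the center of each cluster reduces the number of near-duplicate conformations passed to the scoring step; in a real pipeline the RMSD would be computed over interface or ligand backbone atoms rather than toy points.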

Original language: English (US)
Pages (from-to): 488-502
Number of pages: 15
Journal: Proteins: Structure, Function and Genetics
Volume: 68
Issue number: 2
DOI: 10.1002/prot.21406
State: Published - Aug 1 2007
Externally published: Yes

Fingerprint

Learning systems
Benchmarking
Conformations
Biological Phenomena
Proteins
Conservation
Cluster Analysis
X-Rays
Classifiers
Supervised Machine Learning
Sampling

Keywords

  • Contacting residues
  • Evolutionary conservation
  • Protein complexes
  • Random forest
  • Residue propensities
  • Shape complementarity

ASJC Scopus subject areas

  • Genetics
  • Structural Biology
  • Biochemistry

Cite this

Protein docking using surface matching and supervised machine learning. / Bordner, Andrew J.; Gorin, Andrey A.

In: Proteins: Structure, Function and Genetics, Vol. 68, No. 2, 01.08.2007, p. 488-502.

Bordner, Andrew J. ; Gorin, Andrey A. / Protein docking using surface matching and supervised machine learning. In: Proteins: Structure, Function and Genetics. 2007 ; Vol. 68, No. 2. pp. 488-502.
@article{5c4d4f38644d4b4f9e5627b194a4f90d,
title = "Protein docking using surface matching and supervised machine learning",
abstract = "Computational prediction of protein complex structures through docking offers a means to gain a mechanistic understanding of protein interactions that mediate biological processes. This is particularly important as the number of experimentally determined structures of isolated proteins exceeds the number of structures of complexes. A comprehensive docking procedure is described in which efficient sampling of conformations is achieved by matching surface normal vectors, fast filtering for shape complementarity, clustering by RMSD, and scoring the docked conformations using a supervised machine learning approach. Contacting residue pair frequencies, residue propensities, evolutionary conservation, and shape complementarity score for each docking conformation are used as input data to a Random Forest classifier. The performance of the Random Forest approach for selecting correctly docked conformations was assessed by cross-validation using a nonredundant benchmark set of X-ray structures for 93 heterodimer and 733 homodimer complexes. The single highest-ranked docking solution was the correct (near-native) structure for slightly more than one third of the complexes. Furthermore, the fraction of highly ranked correct structures was significantly higher than the overall fraction of correct structures for almost all complexes. A detailed analysis of the difficult-to-predict complexes revealed that the majority of the homodimer cases were explained by incorrect oligomeric state annotation. Evolutionary conservation and shape complementarity score as well as both underrepresented and overrepresented residue types and residue pairs were found to make the largest contributions to the overall prediction accuracy. Finally, the method was also applied to docking unbound subunit structures from a previously published benchmark set.",
keywords = "Contacting residues, Evolutionary conservation, Protein complexes, Random forest, Residue propensities, Shape complementarity",
author = "Bordner, {Andrew J.} and Gorin, {Andrey A.}",
year = "2007",
month = "8",
day = "1",
doi = "10.1002/prot.21406",
language = "English (US)",
volume = "68",
pages = "488--502",
journal = "Proteins: Structure, Function and Bioinformatics",
issn = "0887-3585",
publisher = "Wiley-Liss Inc.",
number = "2",
}

TY - JOUR

T1 - Protein docking using surface matching and supervised machine learning

AU - Bordner, Andrew J.

AU - Gorin, Andrey A.

PY - 2007/8/1

Y1 - 2007/8/1

N2 - Computational prediction of protein complex structures through docking offers a means to gain a mechanistic understanding of protein interactions that mediate biological processes. This is particularly important as the number of experimentally determined structures of isolated proteins exceeds the number of structures of complexes. A comprehensive docking procedure is described in which efficient sampling of conformations is achieved by matching surface normal vectors, fast filtering for shape complementarity, clustering by RMSD, and scoring the docked conformations using a supervised machine learning approach. Contacting residue pair frequencies, residue propensities, evolutionary conservation, and shape complementarity score for each docking conformation are used as input data to a Random Forest classifier. The performance of the Random Forest approach for selecting correctly docked conformations was assessed by cross-validation using a nonredundant benchmark set of X-ray structures for 93 heterodimer and 733 homodimer complexes. The single highest-ranked docking solution was the correct (near-native) structure for slightly more than one third of the complexes. Furthermore, the fraction of highly ranked correct structures was significantly higher than the overall fraction of correct structures for almost all complexes. A detailed analysis of the difficult-to-predict complexes revealed that the majority of the homodimer cases were explained by incorrect oligomeric state annotation. Evolutionary conservation and shape complementarity score as well as both underrepresented and overrepresented residue types and residue pairs were found to make the largest contributions to the overall prediction accuracy. Finally, the method was also applied to docking unbound subunit structures from a previously published benchmark set.

KW - Contacting residues

KW - Evolutionary conservation

KW - Protein complexes

KW - Random forest

KW - Residue propensities

KW - Shape complementarity

UR - http://www.scopus.com/inward/record.url?scp=34250877416&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34250877416&partnerID=8YFLogxK

U2 - 10.1002/prot.21406

DO - 10.1002/prot.21406

M3 - Article

C2 - 17444516

AN - SCOPUS:34250877416

VL - 68

SP - 488

EP - 502

JO - Proteins: Structure, Function and Bioinformatics

JF - Proteins: Structure, Function and Bioinformatics

SN - 0887-3585

IS - 2

ER -