Predictive models for protein crystallization

Bernhard Rupp; Junwen Wang

doi:10.1016/j.ymeth.2004.03.031

Research output: Contribution to journal › Article › peer-review

63 Scopus citations

Abstract

Crystallization of proteins is a nontrivial task, and despite the substantial efforts in robotic automation, crystallization screening is still largely based on trial-and-error sampling of a limited subset of suitable reagents and experimental parameters. Funding of high throughput crystallography pilot projects through the NIH Protein Structure Initiative provides the opportunity to collect crystallization data in a comprehensive and statistically valid form. Data mining and machine learning algorithms thus have the potential to deliver predictive models for protein crystallization. However, the underlying complex physical reality of crystallization, combined with a generally ill-defined and sparsely populated sampling space, and inconsistent scoring and annotation make the development of predictive models non-trivial. We discuss the conceptual problems, and review strengths and limitations of current approaches towards crystallization prediction, emphasizing the importance of comprehensive and valid sampling protocols. In view of limited overlap in techniques and sampling parameters between the publicly funded high throughput crystallography initiatives, exchange of information and standardization should be encouraged, aiming to effectively integrate data mining and machine learning efforts into a comprehensive predictive framework for protein crystallization. Similar experimental design and knowledge discovery strategies should be applied to valid analysis and prediction of protein expression, solubilization, and purification, as well as crystal handling and cryo-protection.

Original language	English (US)
Pages (from-to)	390-407
Number of pages	18
Journal	Methods
Volume	34
Issue number	3
DOIs	https://doi.org/10.1016/j.ymeth.2004.03.031
State	Published - Nov 2004

Keywords

High throughput crystallization
Machine learning
Predictive models
Statistical analysis
Structural genomics

ASJC Scopus subject areas

Molecular Biology
General Biochemistry, Genetics and Molecular Biology

Cite this

@article{c9c57f54423849aea5c7184aa8c535b4,

title = "Predictive models for protein crystallization",

abstract = "Crystallization of proteins is a nontrivial task, and despite the substantial efforts in robotic automation, crystallization screening is still largely based on trial-and-error sampling of a limited subset of suitable reagents and experimental parameters. Funding of high throughput crystallography pilot projects through the NIH Protein Structure Initiative provides the opportunity to collect crystallization data in a comprehensive and statistically valid form. Data mining and machine learning algorithms thus have the potential to deliver predictive models for protein crystallization. However, the underlying complex physical reality of crystallization, combined with a generally ill-defined and sparsely populated sampling space, and inconsistent scoring and annotation make the development of predictive models non-trivial. We discuss the conceptual problems, and review strengths and limitations of current approaches towards crystallization prediction, emphasizing the importance of comprehensive and valid sampling protocols. In view of limited overlap in techniques and sampling parameters between the publicly funded high throughput crystallography initiatives, exchange of information and standardization should be encouraged, aiming to effectively integrate data mining and machine learning efforts into a comprehensive predictive framework for protein crystallization. Similar experimental design and knowledge discovery strategies should be applied to valid analysis and prediction of protein expression, solubilization, and purification, as well as crystal handling and cryo-protection.",

keywords = "High throughput crystallization, Machine learning, Predictive models, Statistical analysis, Structural genomics",

author = "Bernhard Rupp and Junwen Wang",

note = "Funding Information: We thank the current and past members of the TB Structural Genomics Consortium crystallization facility team (B.W. Segelke, H.I. Krupka, B.S. Schick, T. Lekin, J. Schafer, and D. Toppani) for populating the crystallization database. K.A. Kantardjieff, CSUF, has provided assistance with statistical data analysis and manuscript revisions. The cloning and protein production facilities under J. Perry, C. Goulding, and D. Eisenberg (UCLA); J.C. Sacchettini (Texas A&M University); T. Terwilliger, M. Park, C.-Y. Chang, and G. Waldo (LANL) have supplied a steady flow of proteins used in the crystallization experiments. Li Chen (RCSB Rutgers) has helped in extracting information from the PSI target database. LLNL is operated by University of California for the US DOE under contract W-7405-ENG-48. This work was funded by NIH P50 GM62410 (TB Structural Genomics) centre grant and produced with support of the Reiss Bar, Vienna, Austria. ",

year = "2004",

month = nov,

doi = "10.1016/j.ymeth.2004.03.031",

language = "English (US)",

volume = "34",

pages = "390--407",

journal = "Methods",

issn = "1046-2023",

publisher = "Academic Press Inc.",

number = "3",

}

TY - JOUR

T1 - Predictive models for protein crystallization

AU - Rupp, Bernhard

AU - Wang, Junwen

N1 - Funding Information: We thank the current and past members of the TB Structural Genomics Consortium crystallization facility team (B.W. Segelke, H.I. Krupka, B.S. Schick, T. Lekin, J. Schafer, and D. Toppani) for populating the crystallization database. K.A. Kantardjieff, CSUF, has provided assistance with statistical data analysis and manuscript revisions. The cloning and protein production facilities under J. Perry, C. Goulding, and D. Eisenberg (UCLA); J.C. Sacchettini (Texas A&M University); T. Terwilliger, M. Park, C.-Y. Chang, and G. Waldo (LANL) have supplied a steady flow of proteins used in the crystallization experiments. Li Chen (RCSB Rutgers) has helped in extracting information from the PSI target database. LLNL is operated by University of California for the US DOE under contract W-7405-ENG-48. This work was funded by NIH P50 GM62410 (TB Structural Genomics) centre grant and produced with support of the Reiss Bar, Vienna, Austria.

PY - 2004/11

Y1 - 2004/11

AB - Crystallization of proteins is a nontrivial task, and despite the substantial efforts in robotic automation, crystallization screening is still largely based on trial-and-error sampling of a limited subset of suitable reagents and experimental parameters. Funding of high throughput crystallography pilot projects through the NIH Protein Structure Initiative provides the opportunity to collect crystallization data in a comprehensive and statistically valid form. Data mining and machine learning algorithms thus have the potential to deliver predictive models for protein crystallization. However, the underlying complex physical reality of crystallization, combined with a generally ill-defined and sparsely populated sampling space, and inconsistent scoring and annotation make the development of predictive models non-trivial. We discuss the conceptual problems, and review strengths and limitations of current approaches towards crystallization prediction, emphasizing the importance of comprehensive and valid sampling protocols. In view of limited overlap in techniques and sampling parameters between the publicly funded high throughput crystallography initiatives, exchange of information and standardization should be encouraged, aiming to effectively integrate data mining and machine learning efforts into a comprehensive predictive framework for protein crystallization. Similar experimental design and knowledge discovery strategies should be applied to valid analysis and prediction of protein expression, solubilization, and purification, as well as crystal handling and cryo-protection.

KW - High throughput crystallization

KW - Machine learning

KW - Predictive models

KW - Statistical analysis

KW - Structural genomics

UR - http://www.scopus.com/inward/record.url?scp=4344704198&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4344704198&partnerID=8YFLogxK

U2 - 10.1016/j.ymeth.2004.03.031

DO - 10.1016/j.ymeth.2004.03.031

M3 - Article

C2 - 15325656

AN - SCOPUS:4344704198

SN - 1046-2023

VL - 34

SP - 390

EP - 407

JO - Methods

JF - Methods

IS - 3

ER -
