Machine learning in genome-wide association studies

Silke Szymczak, Joanna M Biernacka, Heather J. Cordell, Oscar González-Recio, Inke R. König, Heping Zhang, Yan V. Sun

Research output: Contribution to journal › Article

67 Citations (Scopus)

Abstract

Recently, genome-wide association studies have substantially expanded our knowledge about genetic variants that influence the susceptibility to complex diseases. Although standard statistical tests for each single-nucleotide polymorphism (SNP) separately are able to capture main genetic effects, different approaches are necessary to identify SNPs that influence disease risk jointly or in complex interactions. Experimental and simulated genome-wide SNP data provided by the Genetic Analysis Workshop 16 afforded an opportunity to analyze the applicability and benefit of several machine learning methods. Penalized regression, ensemble methods, and network analyses resulted in several new findings while known and simulated genetic risk variants were also identified. In conclusion, machine learning approaches are promising complements to standard single- and multi-SNP analysis methods for understanding the overall genetic architecture of complex human diseases. However, because they are not optimized for genome-wide SNP data, improved implementations and new variable selection procedures are required.

Original language: English (US)
Journal: Genetic Epidemiology
Volume: 33
Issue number: SUPPL. 1
DOIs: https://doi.org/10.1002/gepi.20473
State: Published - 2009

Fingerprint

Genome-Wide Association Study
Single Nucleotide Polymorphism
Genome
Machine Learning
Education

Keywords

  • Data mining
  • Genetic Analysis Workshop
  • Network analysis
  • Penalized regression
  • Random forests

ASJC Scopus subject areas

  • Genetics (clinical)
  • Epidemiology

Cite this

Szymczak, S., Biernacka, J. M., Cordell, H. J., González-Recio, O., König, I. R., Zhang, H., & Sun, Y. V. (2009). Machine learning in genome-wide association studies. Genetic Epidemiology, 33(SUPPL. 1). https://doi.org/10.1002/gepi.20473

@article{d4d8df6271f44842a9caf356937c7782,
title = "Machine learning in genome-wide association studies",
abstract = "Recently, genome-wide association studies have substantially expanded our knowledge about genetic variants that influence the susceptibility to complex diseases. Although standard statistical tests for each single-nucleotide polymorphism (SNP) separately are able to capture main genetic effects, different approaches are necessary to identify SNPs that influence disease risk jointly or in complex interactions. Experimental and simulated genome-wide SNP data provided by the Genetic Analysis Workshop 16 afforded an opportunity to analyze the applicability and benefit of several machine learning methods. Penalized regression, ensemble methods, and network analyses resulted in several new findings while known and simulated genetic risk variants were also identified. In conclusion, machine learning approaches are promising complements to standard single- and multi-SNP analysis methods for understanding the overall genetic architecture of complex human diseases. However, because they are not optimized for genome-wide SNP data, improved implementations and new variable selection procedures are required.",
keywords = "Data mining, Genetic Analysis Workshop, Network analysis, Penalized regression, Random forests",
author = "Silke Szymczak and Biernacka, {Joanna M} and Cordell, {Heather J.} and Oscar Gonz{\'a}lez-Recio and K{\"o}nig, {Inke R.} and Heping Zhang and Sun, {Yan V.}",
year = "2009",
doi = "10.1002/gepi.20473",
language = "English (US)",
volume = "33",
journal = "Genetic Epidemiology",
issn = "0741-0395",
publisher = "Wiley-Liss Inc.",
number = "SUPPL. 1",

}

TY - JOUR

T1 - Machine learning in genome-wide association studies

AU - Szymczak, Silke

AU - Biernacka, Joanna M

AU - Cordell, Heather J.

AU - González-Recio, Oscar

AU - König, Inke R.

AU - Zhang, Heping

AU - Sun, Yan V.

PY - 2009

Y1 - 2009

N2 - Recently, genome-wide association studies have substantially expanded our knowledge about genetic variants that influence the susceptibility to complex diseases. Although standard statistical tests for each single-nucleotide polymorphism (SNP) separately are able to capture main genetic effects, different approaches are necessary to identify SNPs that influence disease risk jointly or in complex interactions. Experimental and simulated genome-wide SNP data provided by the Genetic Analysis Workshop 16 afforded an opportunity to analyze the applicability and benefit of several machine learning methods. Penalized regression, ensemble methods, and network analyses resulted in several new findings while known and simulated genetic risk variants were also identified. In conclusion, machine learning approaches are promising complements to standard single- and multi-SNP analysis methods for understanding the overall genetic architecture of complex human diseases. However, because they are not optimized for genome-wide SNP data, improved implementations and new variable selection procedures are required.

KW - Data mining

KW - Genetic Analysis Workshop

KW - Network analysis

KW - Penalized regression

KW - Random forests

UR - http://www.scopus.com/inward/record.url?scp=71249151977&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=71249151977&partnerID=8YFLogxK

U2 - 10.1002/gepi.20473

DO - 10.1002/gepi.20473

M3 - Article

VL - 33

JO - Genetic Epidemiology

JF - Genetic Epidemiology

SN - 0741-0395

IS - SUPPL. 1

ER -