Random forests for genetic association studies

Benjamin A. Goldstein; Eric C. Polley; Farren B.S. Briggs

doi:10.2202/1544-6115.1691

Quantitative Health Sciences

Research output: Contribution to journal › Review article › peer-review

100 Scopus citations

Abstract

The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, there has been inconsistent and less than optimal use of RF in the literature. The purpose of this review is to break down the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.

Original language	English (US)
Article number	32
Journal	Statistical Applications in Genetics and Molecular Biology
Volume	10
Issue number	1
DOIs	https://doi.org/10.2202/1544-6115.1691
State	Published - 2011

Keywords

SNP
genome wide association studies
machine learning

ASJC Scopus subject areas

Statistics and Probability
Molecular Biology
Genetics
Computational Mathematics

Cite this

@article{040665d55d9043babbad9ad29fbda25a,
title = "Random forests for genetic association studies",
abstract = "The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, there has been inconsistent and less than optimal use of RF in the literature. The purpose of this review is to breakdown the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.",
keywords = "SNP, genome wide association studies, machine learning",
author = "Goldstein, {Benjamin A.} and Polley, {Eric C.} and Briggs, {Farren B.S.}",
note = "Funding Information: Author Notes: Benjamin A. Goldstein, Quantitative Sciences Unit, Department of Medicine, Stanford University. Eric C. Polley, Biometric Research Branch, National Cancer Institute, National Institutes of Health. Farren B. S. Briggs, Genetic Epidemiology and Genomics Laboratory, University of California, Berkeley. The authors acknowledge Alan Hubbard, Lisa Barcellos and Adele Cutler for discussing and reviewing aspects of this work. BAG was funded in part by a National Institutes of Health NRSA Trainee appointment on grant T32 HG 00047 and the Russell M. Grossman Endowment. FBSB is a National Multiple Sclerosis Society PostDoctoral Fellow (FG 1847A1/1)",
year = "2011",
doi = "10.2202/1544-6115.1691",
language = "English (US)",
volume = "10",
journal = "Statistical Applications in Genetics and Molecular Biology",
issn = "1544-6115",
publisher = "Berkeley Electronic Press",
number = "1",
}

TY - JOUR

T1 - Random forests for genetic association studies

AU - Goldstein, Benjamin A.

AU - Polley, Eric C.

AU - Briggs, Farren B.S.

N1 - Funding Information: Author Notes: Benjamin A. Goldstein, Quantitative Sciences Unit, Department of Medicine, Stanford University. Eric C. Polley, Biometric Research Branch, National Cancer Institute, National Institutes of Health. Farren B. S. Briggs, Genetic Epidemiology and Genomics Laboratory, University of California, Berkeley. The authors acknowledge Alan Hubbard, Lisa Barcellos and Adele Cutler for discussing and reviewing aspects of this work. BAG was funded in part by a National Institutes of Health NRSA Trainee appointment on grant T32 HG 00047 and the Russell M. Grossman Endowment. FBSB is a National Multiple Sclerosis Society PostDoctoral Fellow (FG 1847A1/1)

PY - 2011

Y1 - 2011

N2 - The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, there has been inconsistent and less than optimal use of RF in the literature. The purpose of this review is to breakdown the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.

AB - The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, there has been inconsistent and less than optimal use of RF in the literature. The purpose of this review is to breakdown the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.

KW - SNP

KW - genome wide association studies

KW - machine learning

UR - http://www.scopus.com/inward/record.url?scp=79961222591&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79961222591&partnerID=8YFLogxK

U2 - 10.2202/1544-6115.1691

DO - 10.2202/1544-6115.1691

M3 - Review article

C2 - 22889876

AN - SCOPUS:79961222591

SN - 1544-6115

VL - 10

JO - Statistical Applications in Genetics and Molecular Biology

JF - Statistical Applications in Genetics and Molecular Biology

IS - 1

M1 - 32

ER -
