A weighted random forests approach to improve predictive performance

Stacey J. Winham; Robert R. Freimuth; Joanna M. Biernacka

doi:10.1002/sam.11196

Quantitative Health Sciences

Research output: Contribution to journal › Article › peer-review

40 Scopus citations

Abstract

Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method random forests (RF) can handle high dimensions; however, in high-dimensional data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. In this article we propose an extension called weighted random forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.
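The tree-level weighting described in the abstract can be illustrated with a small sketch. The abstract does not say how the weights are computed, so this example assumes each tree's weight is proportional to its out-of-bag accuracy; the function names (`tree_weights`, `weighted_vote`, `majority_vote`) and the numbers are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def tree_weights(oob_accuracies):
    """Normalize per-tree accuracies into weights that sum to 1.

    Assumption: weight is proportional to out-of-bag accuracy.
    Falls back to equal weights if every accuracy is zero.
    """
    total = sum(oob_accuracies)
    if total == 0:
        n = len(oob_accuracies)
        return [1.0 / n] * n
    return [a / total for a in oob_accuracies]


def weighted_vote(tree_predictions, weights):
    """Return the class label with the largest summed tree weight."""
    scores = defaultdict(float)
    for label, w in zip(tree_predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)


def majority_vote(tree_predictions):
    """Standard RF aggregation: every tree counts equally."""
    return weighted_vote(tree_predictions, [1.0] * len(tree_predictions))


# Three hypothetical trees classify one sample; the first is the most accurate.
predictions = ["case", "control", "control"]
weights = tree_weights([0.9, 0.4, 0.4])

print(majority_vote(predictions))           # unweighted vote
print(weighted_vote(predictions, weights))  # accuracy-weighted vote
```

In this toy case the two less accurate trees outvote the accurate one under equal weighting, while the weighted vote follows the accurate tree. The same weights could in principle be applied when averaging per-tree variable importance, which the abstract names as the second use of the weights.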

Original language	English (US)
Pages (from-to)	496-505
Number of pages	10
Journal	Statistical Analysis and Data Mining
Volume	6
Issue number	6
DOIs	https://doi.org/10.1002/sam.11196
State	Published - Dec 2013

Keywords

Gene-gene interactions
Genetic data
Genome-wide association
High-dimensional data
Random forests
Weighting

ASJC Scopus subject areas

Analysis
Information Systems
Computer Science Applications

Cite this

@article{f3b3054071ec44a69d38156c2074a9df,
  title = "A weighted random forests approach to improve predictive performance",
  abstract = "Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method random forests (RF) can handle high dimensions; however, in high-dimensional data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. In this article we propose an extension called weighted random forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.",
  keywords = "Gene-gene interactions, Genetic data, Genome-wide association, High-dimensional data, Random forests, Weighting",
  author = "Winham, {Stacey J.} and Freimuth, {Robert R.} and Biernacka, {Joanna M.}",
  year = "2013",
  month = dec,
  doi = "10.1002/sam.11196",
  language = "English (US)",
  volume = "6",
  pages = "496--505",
  journal = "Statistical Analysis and Data Mining",
  issn = "1932-1872",
  publisher = "John Wiley and Sons Inc.",
  number = "6",
}

TY - JOUR
T1 - A weighted random forests approach to improve predictive performance
AU - Winham, Stacey J.
AU - Freimuth, Robert R.
AU - Biernacka, Joanna M.
PY - 2013/12
Y1 - 2013/12
N2 - Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method random forests (RF) can handle high dimensions; however, in high-dimensional data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. In this article we propose an extension called weighted random forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.
AB - Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method random forests (RF) can handle high dimensions; however, in high-dimensional data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. In this article we propose an extension called weighted random forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.
KW - Gene-gene interactions
KW - Genetic data
KW - Genome-wide association
KW - High-dimensional data
KW - Random forests
KW - Weighting
UR - http://www.scopus.com/inward/record.url?scp=84890157266&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84890157266&partnerID=8YFLogxK
U2 - 10.1002/sam.11196
DO - 10.1002/sam.11196
M3 - Article
AN - SCOPUS:84890157266
SN - 1932-1872
VL - 6
SP - 496
EP - 505
JO - Statistical Analysis and Data Mining
JF - Statistical Analysis and Data Mining
IS - 6
ER -
