Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias

Stacey J. Winham; Gregory D. Jenkins; Joanna M. Biernacka

doi:10.1002/gepi.21946

Quantitative Health Sciences

Research output: Contribution to journal › Article › peer-review

Abstract

Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
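The abstract names two ideas: stratifying by sex, and coding X SNPs according to X chromosome inactivation. The sketch below illustrates one common coding convention motivated by inactivation, in which a hemizygous male carrier is treated like a homozygous female (dosage 2), plus a simple split of samples by sex. It is a minimal illustration under those assumptions, not the snpRF implementation; the function names and the "M"/"F" labels are hypothetical.

```python
# Illustrative sketch only: the names below are hypothetical and are not
# part of the snpRF package. The coding shown is an assumed convention.
MALE, FEMALE = "M", "F"


def recode_x_genotype(allele_count, sex):
    """Return a 0-2 dosage for an X SNP under an inactivation-style coding.

    Females carry two X copies, so their allele count (0, 1, 2) is kept.
    Males carry one X copy (0 or 1 alleles); a male carrier is mapped to 2
    so that it is treated like a homozygous female.
    """
    if sex == FEMALE:
        if allele_count not in (0, 1, 2):
            raise ValueError("female X allele count must be 0, 1, or 2")
        return allele_count
    if sex == MALE:
        if allele_count not in (0, 1):
            raise ValueError("male X allele count must be 0 or 1")
        return 2 * allele_count
    raise ValueError("sex must be 'M' or 'F'")


def stratify_by_sex(samples):
    """Group raw allele counts by sex so each stratum can be analyzed alone."""
    strata = {MALE: [], FEMALE: []}
    for allele_count, sex in samples:
        strata[sex].append(allele_count)
    return strata


# Each sample is (allele count at one X SNP, sex).
samples = [(0, "M"), (1, "M"), (2, "F"), (1, "F")]
recoded = [recode_x_genotype(count, sex) for count, sex in samples]
strata = stratify_by_sex(samples)
```

With the raw 0/1 male coding, male and female genotype distributions differ by construction, which is one way a sex-associated trait can make an X SNP look important; the recoding and the stratification above are two simple ways to picture how that dependence might be reduced.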

Original language	English (US)
Pages (from-to)	123-132
Number of pages	10
Journal	Genetic Epidemiology
Volume	40
Issue number	2
DOIs	https://doi.org/10.1002/gepi.21946
State	Published - Feb 1 2016

Keywords

Bias
Random Forest
Sex differences
Variable importance
X chromosome

ASJC Scopus subject areas

Epidemiology
Genetics (clinical)

Cite this

@article{e8a4cb6b4373476ca36f20d93cc5d4ad,

title = "Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias",

abstract = "Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package {"}snpRF{"} (http://www.cran.r-project.org/web/packages/snpRF/).",

keywords = "Bias, Random Forest, Sex differences, Variable importance, X chromosome",

author = "Winham, {Stacey J.} and Jenkins, {Gregory D.} and Biernacka, {Joanna M.}",

note = "Publisher Copyright: {\textcopyright} 2016 Wiley Periodicals, Inc.",

year = "2016",

month = feb,

day = "1",

doi = "10.1002/gepi.21946",

language = "English (US)",

volume = "40",

pages = "123--132",

journal = "Genetic Epidemiology",

issn = "0741-0395",

publisher = "Wiley-Liss Inc.",

number = "2",

}

TY - JOUR

T1 - Modeling X Chromosome Data Using Random Forests

T2 - Conquering Sex Bias

AU - Winham, Stacey J.

AU - Jenkins, Gregory D.

AU - Biernacka, Joanna M.

PY - 2016/2/1

Y1 - 2016/2/1

N2 - Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).

AB - Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).

KW - Bias

KW - Random Forest

KW - Sex differences

KW - Variable importance

KW - X chromosome

UR - http://www.scopus.com/inward/record.url?scp=84955697167&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84955697167&partnerID=8YFLogxK

U2 - 10.1002/gepi.21946

DO - 10.1002/gepi.21946

M3 - Article

C2 - 26639183

AN - SCOPUS:84955697167

SN - 0741-0395

VL - 40

SP - 123

EP - 132

JO - Genetic Epidemiology

JF - Genetic Epidemiology

IS - 2

ER -
