Modeling X Chromosome Data Using Random Forests

Conquering Sex Bias

Stacey J. Winham, Gregory D. Jenkins, Joanna M. Biernacka

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
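
The abstract does not give the genotype coding behind the X chromosome inactivation approach, so the following is only a minimal sketch under stated assumptions, not the authors' method or the snpRF API. It assumes a commonly used convention in which a hemizygous male carrying the minor allele is coded like a homozygous female (0/2 instead of 0/1). It then shows that under raw allele counts the mean coded genotype differs by sex, which is one way an X SNP can act as a proxy for sex when sex is associated with the trait. The function names `code_x_genotype` and `expected_code`, the scheme labels, and the Hardy-Weinberg assumption for females are hypothetical choices made for illustration.

```python
from fractions import Fraction


def code_x_genotype(minor_allele_count, sex, scheme="inactivation"):
    """Return a numeric code for one X chromosome SNP genotype.

    minor_allele_count: 0, 1, or 2 for females; 0 or 1 for males (hemizygous).
    sex: "F" or "M".
    scheme: "raw" keeps the allele count as is; "inactivation" (an assumed
        convention, not taken from the paper) doubles the male count so that
        a hemizygous male matches a homozygous female.
    """
    if scheme not in ("raw", "inactivation"):
        raise ValueError(f"unknown scheme: {scheme!r}")
    if sex == "F":
        if minor_allele_count not in (0, 1, 2):
            raise ValueError("female allele count must be 0, 1, or 2")
        return minor_allele_count
    if sex == "M":
        if minor_allele_count not in (0, 1):
            raise ValueError("male allele count must be 0 or 1")
        if scheme == "inactivation":
            return 2 * minor_allele_count
        return minor_allele_count
    raise ValueError(f"unknown sex: {sex!r}")


def expected_code(sex, maf, scheme):
    """Exact mean coded genotype for one sex at minor allele frequency maf.

    Assumes Hardy-Weinberg proportions in females (an illustrative assumption).
    """
    if sex == "F":
        probs = {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}
    else:
        probs = {0: 1 - maf, 1: maf}
    return sum(p * code_x_genotype(g, sex, scheme) for g, p in probs.items())


if __name__ == "__main__":
    maf = Fraction(1, 4)
    for scheme in ("raw", "inactivation"):
        female = expected_code("F", maf, scheme)
        male = expected_code("M", maf, scheme)
        # Under "raw" the two means differ (1/2 vs 1/4); under
        # "inactivation" both equal 1/2.
        print(scheme, "female mean:", female, "male mean:", male)
```

With the same allele frequency in both sexes, the raw coding gives males half the mean genotype of females, so the SNP carries information about sex even with no genetic effect; the doubled male coding removes that difference in means.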

Original language: English (US)
Pages (from-to): 123-132
Number of pages: 10
Journal: Genetic Epidemiology
Volume: 40
Issue number: 2
DOI: 10.1002/gepi.21946
State: Published - Feb 1 2016

Keywords

  • Bias
  • Random Forest
  • Sex differences
  • Variable importance
  • X chromosome

ASJC Scopus subject areas

  • Genetics (clinical)
  • Epidemiology

Cite this

Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias. / Winham, Stacey J; Jenkins, Gregory D.; Biernacka, Joanna M.

In: Genetic Epidemiology, Vol. 40, No. 2, 01.02.2016, p. 123-132.

@article{e8a4cb6b4373476ca36f20d93cc5d4ad,
title = "Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias",
abstract = "Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package {"}snpRF{"} (http://www.cran.r-project.org/web/packages/snpRF/).",
keywords = "Bias, Random Forest, Sex differences, Variable importance, X chromosome",
author = "Winham, {Stacey J} and Jenkins, {Gregory D.} and Biernacka, {Joanna M}",
year = "2016",
month = "2",
day = "1",
doi = "10.1002/gepi.21946",
language = "English (US)",
volume = "40",
pages = "123--132",
journal = "Genetic Epidemiology",
issn = "0741-0395",
publisher = "Wiley-Liss Inc.",
number = "2",
}

TY - JOUR
T1 - Modeling X Chromosome Data Using Random Forests
T2 - Conquering Sex Bias
AU - Winham, Stacey J
AU - Jenkins, Gregory D.
AU - Biernacka, Joanna M
PY - 2016/2/1
Y1 - 2016/2/1

AB - Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).

KW - Bias
KW - Random Forest
KW - Sex differences
KW - Variable importance
KW - X chromosome
UR - http://www.scopus.com/inward/record.url?scp=84955697167&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84955697167&partnerID=8YFLogxK
U2 - 10.1002/gepi.21946
DO - 10.1002/gepi.21946
M3 - Article
VL - 40
SP - 123
EP - 132
JO - Genetic Epidemiology
JF - Genetic Epidemiology
SN - 0741-0395
IS - 2
ER -