Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis

Jun Chen, Hongzhe Li

Research output: Contribution to journal › Article

60 Citations (Scopus)

Abstract

With the development of next-generation sequencing technology, researchers can now study microbiome composition using direct sequencing, whose output is bacterial taxa counts for each microbiome sample. One goal of microbiome studies is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group ℓ1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results clearly show that nutrient intake is strongly associated with the human gut microbiome.
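As a rough illustration of the kind of objective the abstract describes, the sketch below evaluates a Dirichlet-multinomial negative log-likelihood plus a sparse group penalty. It is not the authors' implementation. The log link for the DM parameters, the choice to treat each covariate's coefficients across all taxa as one group, and every name in the code (`dm_log_likelihood`, `sparse_group_penalty`, `penalized_objective`, `lam_group`, `lam_l1`) are assumptions made for this example. It only evaluates the objective and does not include the block-coordinate descent solver.

```python
import math


def dm_log_likelihood(counts, gammas):
    """Log-probability of one sample's taxa counts under a Dirichlet-multinomial
    distribution with positive parameters `gammas` (one per taxon)."""
    n = sum(counts)
    g_plus = sum(gammas)
    ll = math.lgamma(n + 1) + math.lgamma(g_plus) - math.lgamma(n + g_plus)
    for y, g in zip(counts, gammas):
        ll += math.lgamma(y + g) - math.lgamma(g) - math.lgamma(y + 1)
    return ll


def sparse_group_penalty(beta, lam_group, lam_l1):
    """Sparse group penalty on beta[j][k], the effect of covariate j on taxon k.

    Assumption: each row (one covariate across all taxa) forms one group.
    The group term is the sum of row-wise Euclidean norms; the l1 term is the
    sum of absolute values of all coefficients.
    """
    group_term = sum(math.sqrt(sum(b * b for b in row)) for row in beta)
    l1_term = sum(abs(b) for row in beta for b in row)
    return lam_group * group_term + lam_l1 * l1_term


def penalized_objective(Y, X, alpha, beta, lam_group, lam_l1):
    """Negative DM log-likelihood over all samples plus the penalty.

    Assumption: a log link, gamma_ik = exp(alpha[k] + sum_j X[i][j] * beta[j][k]).
    """
    total = 0.0
    for counts, x in zip(Y, X):
        gammas = [
            math.exp(alpha[k] + sum(x[j] * beta[j][k] for j in range(len(x))))
            for k in range(len(alpha))
        ]
        total -= dm_log_likelihood(counts, gammas)
    return total + sparse_group_penalty(beta, lam_group, lam_l1)
```

Under these assumptions, the group term can set an entire covariate's row to zero, while the ℓ1 term can zero out individual taxa within a selected covariate, which matches the group-level and within-group sparsity the abstract mentions.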

Original language: English (US)
Pages (from-to): 418-442
Number of pages: 25
Journal: Annals of Applied Statistics
Volume: 7
Issue number: 1
DOIs: 10.1214/12-AOAS592
State: Published - Mar 2013
Externally published: Yes

Fingerprint

Variable Selection
Dirichlet
Covariates
Data analysis
Regression
Multinomial Model
Chemical analysis
Overdispersion
Count
Nutrients
Sequencing
Regression Model
Coordinate Descent
Testing
Penalized Likelihood
Multiple Testing
Descent Algorithm
Selection Procedures
Likelihood Ratio Test
Identification (control systems)

Keywords

  • Coordinate descent
  • Counts data
  • Overdispersion
  • Regularized likelihood
  • Sparse group penalty

ASJC Scopus subject areas

  • Statistics, Probability and Uncertainty
  • Modeling and Simulation
  • Statistics and Probability

Cite this

Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. / Chen, Jun; Li, Hongzhe.

In: Annals of Applied Statistics, Vol. 7, No. 1, 03.2013, p. 418-442.


@article{82d6633461334b63aa0f1818a397bf00,
title = "Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis",
abstract = "With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output are bacterial taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group l1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.",
keywords = "Coordinate descent, Counts data, Overdispersion, Regularized likelihood, Sparse group penalty",
author = "Jun Chen and Hongzhe Li",
year = "2013",
month = "3",
doi = "10.1214/12-AOAS592",
language = "English (US)",
volume = "7",
pages = "418--442",
journal = "Annals of Applied Statistics",
issn = "1932-6157",
publisher = "Institute of Mathematical Statistics",
number = "1",

}

TY - JOUR

T1 - Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis

AU - Chen, Jun

AU - Li, Hongzhe

PY - 2013/3

Y1 - 2013/3

N2 - With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output are bacterial taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group l1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.

AB - With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output are bacterial taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group l1 penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.

KW - Coordinate descent

KW - Counts data

KW - Overdispersion

KW - Regularized likelihood

KW - Sparse group penalty

UR - http://www.scopus.com/inward/record.url?scp=84876058250&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84876058250&partnerID=8YFLogxK

U2 - 10.1214/12-AOAS592

DO - 10.1214/12-AOAS592

M3 - Article

VL - 7

SP - 418

EP - 442

JO - Annals of Applied Statistics

JF - Annals of Applied Statistics

SN - 1932-6157

IS - 1

ER -