Collaborative Research: New Statistical Methods for Microbiome Data Analysis

Project: Research project

Project Details


The human microbiome, the collection of micro-organisms associated with the human body, is increasingly recognized as an important player in human health and disease. Human microbiome research focuses on deciphering the intricate relationship between the microbiome and the host and on identifying microbial biomarkers for disease prevention, diagnosis, and treatment. Current technologies for studying the human microbiome sequence the microbial DNA in a sample, from which the identity and abundance of the micro-organisms can be determined. Analysis of such microbiome sequencing data raises many statistical challenges. First, the data are zero-inflated: a typical microbiome dataset contains more than 75% zeros. Second, the data are compositional: a change in the abundance of one microbe automatically changes the relative abundances of the others, making it difficult to identify the 'driver' microbe. Third, the microbes are phylogenetically related, and closely related microbes usually share similar biological traits. Finally, the human microbiome is subject to many environmental confounders, and controlling for these confounders is essential for valid statistical inference. The project will develop novel statistical methods for analyzing microbiome data that address these challenges. The research results will be disseminated through scientific publications as well as seminar and conference presentations. The PIs will develop, distribute, document, and maintain R software packages for the developed methods via GitHub and CRAN, and provide tutorials with example datasets. The PIs will test the software thoroughly in real-world settings. Given the popularity of the multi-omics approach to studying the human microbiome, the delivered software packages will be of particular interest to microbiome investigators. The PIs will train students at the intersection of high-dimensional statistics, optimization, and genomics.
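The compositional effect described above can be made concrete with a small numerical sketch. The counts below are hypothetical, chosen only for illustration, and the helper `relative_abundance` is not part of the project's software. Only microbe A changes in absolute terms, yet the relative abundances of B and C shift as well.

```python
def relative_abundance(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical absolute abundances for three microbes (illustrative numbers only).
before = [100, 100, 100]
# Only microbe A changes in absolute terms: it triples. B and C are untouched.
after = [300, 100, 100]

rel_before = relative_abundance(before)
rel_after = relative_abundance(after)

for name, b, a in zip("ABC", rel_before, rel_after):
    print(f"microbe {name}: {b:.3f} -> {a:.3f}")
```

Sequencing data carry only relative information, so from the proportions alone an increase in A cannot be distinguished from a joint decrease in B and C. This is why identifying the 'driver' microbe is difficult.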

The project has two research thrusts. In the first thrust, the PIs will develop a new statistical learning framework for microbiome data that simultaneously addresses high dimensionality, compositional effects, zero-inflation, and phylogenetic information. In particular, the new framework includes a novel zero-imputation method based on a new Dirichlet mixture model, a general approach for handling compositional effects in supervised and unsupervised statistical learning, and a robust structure-adaptive method for incorporating external information encoded in the phylogenetic tree. In the second thrust, the PIs will develop a two-dimensional false discovery rate (FDR) control procedure for powerful confounder adjustment in microbiome association analysis. The procedure uses the test statistics from the unadjusted analysis as auxiliary statistics to filter out a large number of irrelevant features; FDR control is then performed on the reduced set using the test statistics from the adjusted analysis. The PIs will investigate both model-based and model-free approaches and will prove asymptotic FDR control.
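The filter-then-test idea in the second thrust can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PIs' procedure: it works with p-values rather than test statistics, uses a fixed filter cutoff, and substitutes the standard Benjamini-Hochberg step-up rule for the second stage. The function names, the cutoff, and the p-values are all hypothetical. A naive filter of this kind may not by itself guarantee FDR control, which is presumably why the project sets out to prove asymptotic control for its own procedure.

```python
def benjamini_hochberg(pvals, alpha):
    """Return indices (into pvals) rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


def filter_then_bh(p_unadjusted, p_adjusted, filter_cutoff, alpha):
    """Stage 1: keep features whose unadjusted p-value passes the filter.
    Stage 2: apply BH to the adjusted p-values of the kept features only."""
    keep = [i for i, p in enumerate(p_unadjusted) if p <= filter_cutoff]
    rejected_local = benjamini_hochberg([p_adjusted[i] for i in keep], alpha)
    return [keep[j] for j in rejected_local]


# Hypothetical p-values for six microbial features (illustrative only).
p_unadjusted = [0.001, 0.002, 0.60, 0.75, 0.03, 0.90]
p_adjusted = [0.004, 0.30, 0.02, 0.80, 0.012, 0.50]

discoveries = filter_then_bh(p_unadjusted, p_adjusted, filter_cutoff=0.05, alpha=0.05)
print(discoveries)  # [0, 4]
```

In this toy input, feature 1 passes the unadjusted filter but is no longer significant after adjustment, the pattern expected of an association driven by a confounder. Features 0 and 4 are significant in both analyses and are reported.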

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Effective start/end date: 10/1/18 to 8/31/24


  • National Science Foundation: $91,959.00

