Glmgraph: An R package for variable selection and predictive modeling of structured genomic data

Li Chen, Han Liu, Jean-Pierre Kocher, Hongzhe Li, Jun Chen

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

One central theme of modern high-throughput genomic data analysis is to identify relevant genomic features and to build a predictive model from the selected features for tasks such as personalized medicine. Correlating the large number of 'omics' features with a given phenotype is particularly challenging because of the small sample size (n) and high dimensionality (p). To address this small-n, large-p problem, various forms of sparse regression models have been proposed that exploit the sparsity assumption. Among these, the network-constrained sparse regression model is of particular interest because it can use the prior graph/network structure in the omics data. Despite its potential usefulness for omics data analysis, no efficient R implementation is publicly available. Here we present an R software package, 'glmgraph', that implements graph-constrained regularization for both sparse linear regression and sparse logistic regression. We implement both the L1 penalty and the minimax concave penalty (MCP) for variable selection, and the Laplacian penalty for coefficient smoothing. An efficient coordinate descent algorithm is used to solve the optimization problem. We demonstrate the use of the package by applying it to a human microbiome dataset, where the phylogenetic structure among bacterial taxa is available.
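
As a rough illustration of the kind of method the abstract describes, the sketch below runs coordinate descent on a graph-constrained linear regression with an L1 penalty and a Laplacian smoothing penalty. It is written in plain Python and is not the glmgraph package or its R interface. The objective used here, (1/(2n))·||y − Xβ||² + λ1·Σ|β_j| + (λ2/2)·βᵀLβ, is an assumed, commonly used form of the network-constrained problem; the function names `soft_threshold` and `graph_lasso_cd` and the parameters `lam1` and `lam2` are hypothetical. The minimax concave penalty and the logistic model mentioned in the abstract are not covered.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: the closed-form solution of the L1 subproblem."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def graph_lasso_cd(X, y, L, lam1, lam2, n_iter=200, tol=1e-8):
    """Coordinate descent for an assumed objective (not taken from the paper):
        (1/(2n)) * ||y - X b||^2 + lam1 * sum_j |b_j| + (lam2/2) * b^T L b
    X: list of n rows with p entries each, y: list of n responses,
    L: symmetric p x p graph Laplacian given as a list of lists.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # running residual y - X beta (beta starts at zero)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]

    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # (1/n) * x_j^T (partial residual that excludes feature j)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * old
            # pull from graph neighbours through the off-diagonal Laplacian entries
            neigh = sum(L[j][k] * beta[k] for k in range(p) if k != j)
            z = rho - lam2 * neigh
            denom = col_sq[j] + lam2 * L[j][j]
            new = soft_threshold(z, lam1) / denom if denom > 0 else 0.0
            delta = new - old
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta  # keep the residual in sync
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Toy usage: two features joined by one edge in the prior graph.
X_toy = [[1.0, 0.0], [0.0, 1.0]]
y_toy = [2.0, 4.0]
L_toy = [[1.0, -1.0], [-1.0, 1.0]]
beta_smooth = graph_lasso_cd(X_toy, y_toy, L_toy, lam1=0.0, lam2=1.0)
```

In this toy case the Laplacian term pulls the two connected coefficients toward each other, while a sufficiently large `lam1` sets both to zero. This mirrors the division of labour in the abstract: the sparsity penalty handles variable selection and the Laplacian penalty handles smoothing over the graph.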

Original language: English (US)
Pages (from-to): 3991-3993
Number of pages: 3
Journal: Bioinformatics
Volume: 31
Issue number: 24
DOI: 10.1093/bioinformatics/btv497
State: Published - Jul 3 2015

Fingerprint

Predictive Modeling
Variable Selection
Genomics
Bacterial Structures
Penalty
Precision Medicine
Microbiota
Phylogeny
Sample Size
Linear Models
Regression Model
Data analysis
Software
Logistic Models
Coordinate Descent
Phenotype
Linear regression
Software packages
Medicine
Descent Algorithm

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability

Cite this

Glmgraph: An R package for variable selection and predictive modeling of structured genomic data. / Chen, Li; Liu, Han; Kocher, Jean-Pierre; Li, Hongzhe; Chen, Jun.

In: Bioinformatics, Vol. 31, No. 24, 03.07.2015, p. 3991-3993.

@article{a53f89b220864873aca56cbc0bc0d507,
title = "Glmgraph: An R package for variable selection and predictive modeling of structured genomic data",
author = "Li Chen and Han Liu and Jean-Pierre Kocher and Hongzhe Li and Jun Chen",
year = "2015",
month = "7",
day = "3",
doi = "10.1093/bioinformatics/btv497",
language = "English (US)",
volume = "31",
pages = "3991--3993",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "24",
}

TY - JOUR

T1 - Glmgraph

T2 - An R package for variable selection and predictive modeling of structured genomic data

AU - Chen, Li

AU - Liu, Han

AU - Kocher, Jean-Pierre

AU - Li, Hongzhe

AU - Chen, Jun

PY - 2015/7/3

Y1 - 2015/7/3

UR - http://www.scopus.com/inward/record.url?scp=84950245285&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84950245285&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btv497

DO - 10.1093/bioinformatics/btv497

M3 - Article

C2 - 26315909

AN - SCOPUS:84950245285

VL - 31

SP - 3991

EP - 3993

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 24

ER -