Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO

Susmita Datta, Jennifer Le-Rademacher, Somnath Datta

Research output: Contribution to journal (Review article)

46 Citations (Scopus)

Abstract

We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose, where we first modify the data to account for censoring. Three approaches to handling right-censored data - reweighting, mean imputation, and multiple imputation - are considered. Their performances are examined in a detailed simulation study and compared with those of full-data PLS and LASSO, that is, the fits that would be obtained had there been no censoring. A major objective of this article is to investigate the performances of PLS and LASSO in the context of microarray data, where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100 with 10,000 covariates), LASSO performed better than a no-covariate model (or noise-based prediction). The mean imputation method appears to best track the performance of the full-data PLS or LASSO. The mean imputation scheme is applied to an existing data set on lung cancer. This reanalysis using the mean-imputed PLS and LASSO identifies a number of genes that were known from previous studies to be related to cancer or tumor activities.
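As a rough illustration of the mean imputation idea mentioned in the abstract, the sketch below replaces each censored log failure time with an estimate of its conditional mean beyond the censoring point, computed from a Kaplan-Meier estimate of the log-time distribution. This is one common form of mean imputation and is not claimed to reproduce the exact scheme used in the article. The function names `kaplan_meier` and `mean_impute`, the toy data, and the choice to renormalize by the Kaplan-Meier mass beyond the censoring time are all assumptions made for this example.

```python
import numpy as np


def kaplan_meier(y, delta):
    """Kaplan-Meier estimate: distinct event times and S(t) at each of them."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=bool)
    event_times = np.unique(y[delta])          # sorted distinct uncensored times
    surv = np.empty(len(event_times))
    s = 1.0
    for k, t in enumerate(event_times):
        at_risk = np.sum(y >= t)               # subjects still under observation
        events = np.sum((y == t) & delta)      # uncensored failures at time t
        s *= 1.0 - events / at_risk
        surv[k] = s
    return event_times, surv


def mean_impute(y, delta):
    """Replace censored responses by an estimated E[Y | Y > censoring value].

    y     : log-transformed observed times (failure or censoring)
    delta : 1 if the failure was observed, 0 if right censored
    """
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=bool)
    times, surv = kaplan_meier(y, delta)
    # Probability mass that the Kaplan-Meier estimate puts on each event time.
    mass = -np.diff(np.concatenate(([1.0], surv)))
    out = y.copy()
    for i in np.where(~delta)[0]:
        later = times > y[i]
        tail = mass[later].sum()
        if tail > 0:
            # Assumption: renormalize by the mass beyond the censoring value.
            out[i] = np.sum(times[later] * mass[later]) / tail
        # Otherwise no event is observed later, so the censored value is kept.
    return out


# Toy data (hypothetical): five patients, the second one is right censored.
log_t = np.log(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
delta = np.array([1, 0, 1, 1, 1])
imputed = mean_impute(log_t, delta)
```

The imputed responses could then be passed, together with the gene expression matrix, to any standard PLS or LASSO fitting routine as if they were uncensored log failure times.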

Original language: English (US)
Journal: Biometrics
Volume: 63
Issue number: 1
DOI: 10.1111/j.1541-0420.2006.00660.x
State: Published - Mar 1 2007
Externally published: Yes

Keywords

  • Cancer
  • Gene expression
  • Partial least squares
  • Right censoring
  • Survival

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry, Genetics and Molecular Biology (all)
  • Immunology and Microbiology (all)
  • Agricultural and Biological Sciences (all)
  • Applied Mathematics

Cite this

Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO. / Datta, Susmita; Le-Rademacher, Jennifer; Datta, Somnath.

In: Biometrics, Vol. 63, No. 1, 01.03.2007.

@article{a0236862575c45aebf3b836c9a6022ec,
title = "Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO",
abstract = "We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose where we first modify the data to account for censoring. Three approaches of handling right censored data - reweighting, mean imputation, and multiple imputation - are considered. Their performances are examined in a detailed simulation study and compared with that of full data PLS and LASSO had there been no censoring. A major objective of this article is to investigate the performances of PLS and LASSO in the context of microarray data where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100 with 10,000 covariates), LASSO performed better than a no covariate model (or noise-based prediction). The mean imputation method appears to best track the performance of the full data PLS or LASSO. The mean imputation scheme is used on an existing data set on lung cancer. This reanalysis using the mean imputed PLS and LASSO identifies a number of genes that were known to be related to cancer or tumor activities from previous studies.",
keywords = "Cancer, Gene expression, Partial least squares, Right censoring, Survival",
author = "Susmita Datta and Jennifer Le-Rademacher and Somnath Datta",
year = "2007",
month = "3",
day = "1",
doi = "10.1111/j.1541-0420.2006.00660.x",
language = "English (US)",
volume = "63",
journal = "Biometrics",
issn = "0006-341X",
publisher = "Wiley-Blackwell",
number = "1",

}

TY - JOUR

T1 - Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO

AU - Datta, Susmita

AU - Le-Rademacher, Jennifer

AU - Datta, Somnath

PY - 2007/3/1

Y1 - 2007/3/1

N2 - We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose where we first modify the data to account for censoring. Three approaches of handling right censored data - reweighting, mean imputation, and multiple imputation - are considered. Their performances are examined in a detailed simulation study and compared with that of full data PLS and LASSO had there been no censoring. A major objective of this article is to investigate the performances of PLS and LASSO in the context of microarray data where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100 with 10,000 covariates), LASSO performed better than a no covariate model (or noise-based prediction). The mean imputation method appears to best track the performance of the full data PLS or LASSO. The mean imputation scheme is used on an existing data set on lung cancer. This reanalysis using the mean imputed PLS and LASSO identifies a number of genes that were known to be related to cancer or tumor activities from previous studies.

KW - Cancer

KW - Gene expression

KW - Partial least squares

KW - Right censoring

KW - Survival

UR - http://www.scopus.com/inward/record.url?scp=34247259498&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34247259498&partnerID=8YFLogxK

U2 - 10.1111/j.1541-0420.2006.00660.x

DO - 10.1111/j.1541-0420.2006.00660.x

M3 - Review article

C2 - 17447952

AN - SCOPUS:34247259498

VL - 63

JO - Biometrics

JF - Biometrics

SN - 0006-341X

IS - 1

ER -