Leveraging global gene expression patterns to predict expression of unmeasured genes

James Rudd; René A. Zelaya; Eugene Demidenko; Ellen L. Goode; Casey S. Greene; Jennifer A. Doherty

doi:10.1186/s12864-015-2250-5

Leveraging global gene expression patterns to predict expression of unmeasured genes

James Rudd, René A. Zelaya, Eugene Demidenko, Ellen L. Goode, Casey S. Greene, Jennifer A. Doherty

Quantitative Health Sciences

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Background: Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type (in this case, high grade serous ovarian cancer; HGSC), we sought to identify reduced sets of directly measured (DM) genes which could accurately predict the expression of a maximized number of unmeasured genes. Results: We developed a greedy gene set selection (GGS) algorithm which returns a DM set of user specified size based on a specific correlation threshold (|r_P|) and minimum number of DM genes that must be correlated to an unmeasured gene in order to infer the value of the unmeasured gene (redundancy). We evaluated GGS in the Cancer Genome Atlas (TCGA) HGSC data across 144 combinations of DM size, redundancy (1-3), and |r_P| (0.60, 0.65, 0.70). Across the parameter sweep, GGS allows on average 9 times more gene expression information to be captured compared to the DM set alone. GGS successfully augments prognostic HGSC gene sets; the addition of 20 GGS selected genes more than doubles the number of genes whose expression is predictable. Moreover, the expression prediction is highly accurate. After training regression models for the predictable gene set using 2/3 of the TCGA data, the average accuracy (ranked correlation of true and predicted values) in the 1/3 testing partition and four independent populations is above 0.65 and approaches 0.8 for conservative parameter sets. We observe similar accuracies in the TCGA HGSC RNA-sequencing data. Specifically, the prediction accuracy increases with increasing redundancy and increasing |r_P|. Conclusions: GGS-selected genes, which maximize expression information about unmeasured genes, can be combined with candidate gene sets as a cost effective way to increase the amount of gene expression information obtained in large studies. This method can be applied to any organism, model system, disease, or tissue type for which whole genome gene expression data exists.

Original language	English (US)
Article number	1065
Journal	BMC genomics
Volume	16
Issue number	1
DOIs	https://doi.org/10.1186/s12864-015-2250-5
State	Published - Dec 15 2015

Keywords

GGS
Gene expression
Greedy gene set selection
Imputation

ASJC Scopus subject areas

Biotechnology
Genetics

Access to Document

10.1186/s12864-015-2250-5

Cite this

@article{d628033bb59e4eff9cb464f2bd1be06a,

title = "Leveraging global gene expression patterns to predict expression of unmeasured genes",

abstract = "Background: Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type (in this case, high grade serous ovarian cancer; HGSC), we sought to identify reduced sets of directly measured (DM) genes which could accurately predict the expression of a maximized number of unmeasured genes. Results: We developed a greedy gene set selection (GGS) algorithm which returns a DM set of user specified size based on a specific correlation threshold (|rP|) and minimum number of DM genes that must be correlated to an unmeasured gene in order to infer the value of the unmeasured gene (redundancy). We evaluated GGS in the Cancer Genome Atlas (TCGA) HGSC data across 144 combinations of DM size, redundancy (1-3), and |rP| (0.60, 0.65, 0.70). Across the parameter sweep, GGS allows on average 9 times more gene expression information to be captured compared to the DM set alone. GGS successfully augments prognostic HGSC gene sets; the addition of 20 GGS selected genes more than doubles the number of genes whose expression is predictable. Moreover, the expression prediction is highly accurate. After training regression models for the predictable gene set using 2/3 of the TCGA data, the average accuracy (ranked correlation of true and predicted values) in the 1/3 testing partition and four independent populations is above 0.65 and approaches 0.8 for conservative parameter sets. We observe similar accuracies in the TCGA HGSC RNA-sequencing data. Specifically, the prediction accuracy increases with increasing redundancy and increasing |rP|. Conclusions: GGS-selected genes, which maximize expression information about unmeasured genes, can be combined with candidate gene sets as a cost effective way to increase the amount of gene expression information obtained in large studies. This method can be applied to any organism, model system, disease, or tissue type for which whole genome gene expression data exists.",

keywords = "GGS, Gene expression, Greedy gene set selection, Imputation",

author = "James Rudd and Zelaya, {Ren{\'e} A.} and Eugene Demidenko and Goode, {Ellen L.} and Greene, {Casey S.} and Doherty, {Jennifer A.}",

note = "Publisher Copyright: {\textcopyright} 2015 Rudd et al.",

year = "2015",

month = dec,

day = "15",

doi = "10.1186/s12864-015-2250-5",

language = "English (US)",

volume = "16",

journal = "BMC genomics",

issn = "1471-2164",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Leveraging global gene expression patterns to predict expression of unmeasured genes

AU - Rudd, James

AU - Zelaya, René A.

AU - Demidenko, Eugene

AU - Goode, Ellen L.

AU - Greene, Casey S.

AU - Doherty, Jennifer A.

PY - 2015/12/15

Y1 - 2015/12/15

N2 - Background: Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type (in this case, high grade serous ovarian cancer; HGSC), we sought to identify reduced sets of directly measured (DM) genes which could accurately predict the expression of a maximized number of unmeasured genes. Results: We developed a greedy gene set selection (GGS) algorithm which returns a DM set of user specified size based on a specific correlation threshold (|rP|) and minimum number of DM genes that must be correlated to an unmeasured gene in order to infer the value of the unmeasured gene (redundancy). We evaluated GGS in the Cancer Genome Atlas (TCGA) HGSC data across 144 combinations of DM size, redundancy (1-3), and |rP| (0.60, 0.65, 0.70). Across the parameter sweep, GGS allows on average 9 times more gene expression information to be captured compared to the DM set alone. GGS successfully augments prognostic HGSC gene sets; the addition of 20 GGS selected genes more than doubles the number of genes whose expression is predictable. Moreover, the expression prediction is highly accurate. After training regression models for the predictable gene set using 2/3 of the TCGA data, the average accuracy (ranked correlation of true and predicted values) in the 1/3 testing partition and four independent populations is above 0.65 and approaches 0.8 for conservative parameter sets. We observe similar accuracies in the TCGA HGSC RNA-sequencing data. Specifically, the prediction accuracy increases with increasing redundancy and increasing |rP|. Conclusions: GGS-selected genes, which maximize expression information about unmeasured genes, can be combined with candidate gene sets as a cost effective way to increase the amount of gene expression information obtained in large studies. This method can be applied to any organism, model system, disease, or tissue type for which whole genome gene expression data exists.

AB - Background: Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type (in this case, high grade serous ovarian cancer; HGSC), we sought to identify reduced sets of directly measured (DM) genes which could accurately predict the expression of a maximized number of unmeasured genes. Results: We developed a greedy gene set selection (GGS) algorithm which returns a DM set of user specified size based on a specific correlation threshold (|rP|) and minimum number of DM genes that must be correlated to an unmeasured gene in order to infer the value of the unmeasured gene (redundancy). We evaluated GGS in the Cancer Genome Atlas (TCGA) HGSC data across 144 combinations of DM size, redundancy (1-3), and |rP| (0.60, 0.65, 0.70). Across the parameter sweep, GGS allows on average 9 times more gene expression information to be captured compared to the DM set alone. GGS successfully augments prognostic HGSC gene sets; the addition of 20 GGS selected genes more than doubles the number of genes whose expression is predictable. Moreover, the expression prediction is highly accurate. After training regression models for the predictable gene set using 2/3 of the TCGA data, the average accuracy (ranked correlation of true and predicted values) in the 1/3 testing partition and four independent populations is above 0.65 and approaches 0.8 for conservative parameter sets. We observe similar accuracies in the TCGA HGSC RNA-sequencing data. Specifically, the prediction accuracy increases with increasing redundancy and increasing |rP|. Conclusions: GGS-selected genes, which maximize expression information about unmeasured genes, can be combined with candidate gene sets as a cost effective way to increase the amount of gene expression information obtained in large studies. This method can be applied to any organism, model system, disease, or tissue type for which whole genome gene expression data exists.

KW - GGS

KW - Gene expression

KW - Greedy gene set selection

KW - Imputation

UR - http://www.scopus.com/inward/record.url?scp=84949642833&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84949642833&partnerID=8YFLogxK

U2 - 10.1186/s12864-015-2250-5

DO - 10.1186/s12864-015-2250-5

M3 - Article

C2 - 26666289

AN - SCOPUS:84949642833

SN - 1471-2164

VL - 16

JO - BMC genomics

JF - BMC genomics

IS - 1

M1 - 1065

ER -

Leveraging global gene expression patterns to predict expression of unmeasured genes

Abstract

Keywords

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this