Assessment of data transformations for model-based clustering of RNA-Seq data

Janelle R. Noel-MacDonnell; Joseph Usset; Ellen L. Goode; Brooke L. Fridley

doi:10.1371/journal.pone.0191758

Quantitative Health Sciences

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different from those for microarray-based studies. The assumption of normality is reasonable for microarray-based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or negative binomial distribution. Little research has been done to assess how data transformations impact Gaussian model-based clustering with respect to clustering performance and accuracy in estimating the correct number of clusters in RNA-Seq data. In this article, we investigate Gaussian model-based clustering performance and accuracy in estimating the correct number of clusters by applying four data transformations (i.e., naïve, logarithmic, Blom, and variance stabilizing transformation) to simulated RNA-Seq data. To do so, an extensive simulation study was carried out in which the scenarios varied in terms of how genes were selected for inclusion in the clustering analyses, the size of the clusters, and the number of clusters. Following the application of the different transformations to the simulated data, Gaussian model-based clustering was carried out. To assess clustering performance for each of the data transformations, the adjusted Rand index, clustering error rate, and concordance index were used. As expected, our results showed that clustering performance improved in scenarios where data transformations were applied to make the data appear "more" Gaussian in distribution.
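The abstract names the logarithmic and Blom transformations and the adjusted Rand index without giving formulas. The following is a minimal sketch under stated assumptions, not the authors' implementation: the Blom transformation is written in its usual rank-based form, the standard normal quantile of (rank - 3/8) / (n + 1/4); the logarithm uses base 2 with a pseudo-count of 1, which is an assumption and not taken from the article; all function names are illustrative. The variance stabilizing transformation and the Gaussian mixture fit itself are omitted.

```python
import math
from collections import Counter
from statistics import NormalDist


def log_transform(counts):
    """Log2 transformation with a pseudo-count of 1 (assumed form)."""
    return [math.log2(c + 1) for c in counts]


def blom_transform(values):
    """Rank-based inverse normal (Blom) transformation.

    Each value is replaced by the standard normal quantile of
    (rank - 3/8) / (n + 1/4); tied values share their average rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no sample pairs to compare

    def pair_count(group_sizes):
        return sum(math.comb(size, 2) for size in group_sizes)

    sum_cells = pair_count(Counter(zip(labels_a, labels_b)).values())
    sum_a = pair_count(Counter(labels_a).values())
    sum_b = pair_count(Counter(labels_b).values())
    expected = sum_a * sum_b / math.comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    counts = [0, 3, 3, 120, 15]  # made-up read counts for one gene
    print(log_transform(counts))
    print(blom_transform(counts))
    # Same partition with swapped labels: the index ignores label names.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the Blom transformation depends only on ranks, a single very large count (such as 120 above) is pulled in to an ordinary normal quantile, which is one way a transformation can make count data look "more" Gaussian before clustering.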

Original language	English (US)
Article number	e0191758
Journal	PLOS ONE
Volume	13
Issue number	2
DOIs	https://doi.org/10.1371/journal.pone.0191758
State	Published - Feb 2018

ASJC Scopus subject areas

General

Cite this

@article{dd4b12b320234c9c99c0737b894ddb97,
title = "Assessment of data transformations for model-based clustering of RNA-Seq data",
abstract = "Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different from those for microarray-based studies. The assumption of normality is reasonable for microarray-based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or negative binomial distribution. Little research has been done to assess how data transformations impact Gaussian model-based clustering with respect to clustering performance and accuracy in estimating the correct number of clusters in RNA-Seq data. In this article, we investigate Gaussian model-based clustering performance and accuracy in estimating the correct number of clusters by applying four data transformations (i.e., na{\"i}ve, logarithmic, Blom, and variance stabilizing transformation) to simulated RNA-Seq data. To do so, an extensive simulation study was carried out in which the scenarios varied in terms of how genes were selected for inclusion in the clustering analyses, the size of the clusters, and the number of clusters. Following the application of the different transformations to the simulated data, Gaussian model-based clustering was carried out. To assess clustering performance for each of the data transformations, the adjusted Rand index, clustering error rate, and concordance index were used. As expected, our results showed that clustering performance improved in scenarios where data transformations were applied to make the data appear {"}more{"} Gaussian in distribution.",
author = "Noel-MacDonnell, {Janelle R.} and Usset, Joseph and Goode, {Ellen L.} and Fridley, {Brooke L.}",
note = "Publisher Copyright: Copyright {\textcopyright} 2018 Noel-MacDonnell et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.",
year = "2018",
month = feb,
doi = "10.1371/journal.pone.0191758",
language = "English (US)",
volume = "13",
journal = "PLOS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "2",
}

TY  - JOUR
T1  - Assessment of data transformations for model-based clustering of RNA-Seq data
AU  - Noel-MacDonnell, Janelle R.
AU  - Usset, Joseph
AU  - Goode, Ellen L.
AU  - Fridley, Brooke L.
N1  - Publisher Copyright: Copyright © 2018 Noel-MacDonnell et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY  - 2018/2
Y1  - 2018/2
AB  - Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different from those for microarray-based studies. The assumption of normality is reasonable for microarray-based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or negative binomial distribution. Little research has been done to assess how data transformations impact Gaussian model-based clustering with respect to clustering performance and accuracy in estimating the correct number of clusters in RNA-Seq data. In this article, we investigate Gaussian model-based clustering performance and accuracy in estimating the correct number of clusters by applying four data transformations (i.e., naïve, logarithmic, Blom, and variance stabilizing transformation) to simulated RNA-Seq data. To do so, an extensive simulation study was carried out in which the scenarios varied in terms of how genes were selected for inclusion in the clustering analyses, the size of the clusters, and the number of clusters. Following the application of the different transformations to the simulated data, Gaussian model-based clustering was carried out. To assess clustering performance for each of the data transformations, the adjusted Rand index, clustering error rate, and concordance index were used. As expected, our results showed that clustering performance improved in scenarios where data transformations were applied to make the data appear "more" Gaussian in distribution.
UR  - http://www.scopus.com/inward/record.url?scp=85042775636&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85042775636&partnerID=8YFLogxK
U2  - 10.1371/journal.pone.0191758
DO  - 10.1371/journal.pone.0191758
M3  - Article
C2  - 29485993
AN  - SCOPUS:85042775636
SN  - 1932-6203
VL  - 13
JO  - PLOS ONE
JF  - PLOS ONE
IS  - 2
M1  - e0191758
ER  -
