Symbolic covariance matrix for interval-valued variables and its application to principal component analysis: A case study

Katarina Košmelj, Jennifer Le-Rademacher, Lynne Billard

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

In the last two decades, principal component analysis (PCA) has been extended to interval-valued data, and several adaptations of the classical approach are known from the literature. Our approach is based on the symbolic covariance matrix Cov for interval-valued variables proposed by Billard (2008). Its crucial advantage over other approaches is that it fully utilizes all the information in the data. The symbolic covariance matrix can be decomposed into a within part CovW and a between part CovB. We propose a further insight into the PCA results: the proportion of variance explained by the within information and the proportion explained by the between information can be calculated. Additionally, we suggest performing PCA on CovB and CovW to obtain deeper insights into the data under study. In the case study presented, the information gain from performing PCA on the intervals instead of the interval midpoints (conditionally the means) is about 45%. It turns out that, for these data, the uniformity assumption over intervals does not hold, so an analysis of the data represented by histogram-valued variables is suggested.
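The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes values are uniformly distributed within each interval, takes CovB as the (divisor n) covariance of the interval midpoints, and takes CovW as the average product of interval widths divided by 12; this closed form is one reading of Billard (2008) and should be checked against the paper. The function names `symbolic_cov` and `pca_variance_split` and the toy data are hypothetical.

```python
import numpy as np


def symbolic_cov(lower, upper):
    """Return (cov, cov_b, cov_w) for interval data [lower, upper].

    Assumption: uniform distribution within each interval, so the
    within part is built from the interval widths (factor 1/12).
    Rows are observations, columns are variables.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n = lower.shape[0]

    # Between part: covariance of the interval midpoints (divisor n).
    mid = (lower + upper) / 2.0
    centered = mid - mid.mean(axis=0)
    cov_b = centered.T @ centered / n

    # Within part: average product of interval widths, scaled by 1/12.
    width = upper - lower
    cov_w = width.T @ width / (12.0 * n)

    return cov_b + cov_w, cov_b, cov_w


def pca_variance_split(cov_b, cov_w):
    """Split each component's share of variance into between and within.

    For eigenvector v of cov = cov_b + cov_w, the eigenvalue equals
    v' cov_b v + v' cov_w v, which gives the two contributions.
    """
    cov = cov_b + cov_w
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    total = eigvals.sum()
    between = np.einsum("ik,ij,jk->k", eigvecs, cov_b, eigvecs) / total
    within = np.einsum("ik,ij,jk->k", eigvecs, cov_w, eigvecs) / total
    return eigvals / total, between, within


if __name__ == "__main__":
    # Hypothetical toy intervals: 4 observations, 2 variables.
    lower = np.array([[1.0, 2.0], [3.0, 1.0], [5.0, 4.0], [2.0, 6.0]])
    upper = np.array([[2.0, 2.5], [5.0, 2.5], [5.5, 6.0], [3.0, 7.0]])

    cov, cov_b, cov_w = symbolic_cov(lower, upper)
    share, between, within = pca_variance_split(cov_b, cov_w)
    print("share of variance per component:", share)
    print("  due to between information:  ", between)
    print("  due to within information:   ", within)
```

When every interval collapses to a point (`lower == upper`), the within part vanishes and the result reduces to classical PCA on the midpoints, which is a convenient sanity check for the sketch.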

Original language: English (US)
Pages (from-to): 1-20
Number of pages: 20
Journal: Metodoloski Zvezki
Volume: 11
Issue number: 1
State: Published - Jan 1 2014
Externally published: Yes

Fingerprint

Principal Component Analysis
Covariance matrix
Interval
Proportion
Information Gain
Midpoint
Uniformity
Histogram

ASJC Scopus subject areas

  • Statistics and Probability
  • Social Sciences (miscellaneous)

Cite this

Symbolic covariance matrix for interval-valued variables and its application to principal component analysis: A case study. / Košmelj, Katarina; Le-Rademacher, Jennifer; Billard, Lynne.

In: Metodoloski Zvezki, Vol. 11, No. 1, 01.01.2014, p. 1-20.

@article{f86e529c0ce548cb82a39774dd78729a,
title = "Symbolic covariance matrix for interval-valued variables and its application to principal component analysis: A case study",
abstract = "In the last two decades, principal component analysis (PCA) was extended to interval-valued data; several adaptations of the classical approach are known from the literature. Our approach is based on the symbolic covariance matrix Cov for the interval-valued variables proposed by Billard (2008). Its crucial advantage, when compared to other approaches, is that it fully utilizes all the information in the data. The symbolic covariance matrix can be decomposed into a within part CovW and a between part CovB. We propose a further insight into the PCA results: the proportion of variance explained due to the within information and the proportion of variance explained due to the between information can be calculated. Additionally, we suggest PCA on CovB and CovW to be done to obtain deeper insights into the data under study. In the case study presented, the information gain when performing PCA on the intervals instead of the interval midpoints (conditionally the means) is about 45{\%}. It turns out that, for these data, the uniformity assumption over intervals does not hold and so analysis of the data represented by histogram-valued variables is suggested.",
author = "Katarina Košmelj and Jennifer Le-Rademacher and Lynne Billard",
year = "2014",
month = "1",
day = "1",
language = "English (US)",
volume = "11",
pages = "1--20",
journal = "Metodoloski Zvezki",
issn = "1854-0023",
publisher = "Faculty of Social Sciences, University of Ljubljana",
number = "1",

}

TY - JOUR

T1 - Symbolic covariance matrix for interval-valued variables and its application to principal component analysis

T2 - A case study

AU - Košmelj, Katarina

AU - Le-Rademacher, Jennifer

AU - Billard, Lynne

PY - 2014/1/1

Y1 - 2014/1/1

N2 - In the last two decades, principal component analysis (PCA) was extended to interval-valued data; several adaptations of the classical approach are known from the literature. Our approach is based on the symbolic covariance matrix Cov for the interval-valued variables proposed by Billard (2008). Its crucial advantage, when compared to other approaches, is that it fully utilizes all the information in the data. The symbolic covariance matrix can be decomposed into a within part CovW and a between part CovB. We propose a further insight into the PCA results: the proportion of variance explained due to the within information and the proportion of variance explained due to the between information can be calculated. Additionally, we suggest PCA on CovB and CovW to be done to obtain deeper insights into the data under study. In the case study presented, the information gain when performing PCA on the intervals instead of the interval midpoints (conditionally the means) is about 45%. It turns out that, for these data, the uniformity assumption over intervals does not hold and so analysis of the data represented by histogram-valued variables is suggested.

UR - http://www.scopus.com/inward/record.url?scp=84921000990&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921000990&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84921000990

VL - 11

SP - 1

EP - 20

JO - Metodoloski Zvezki

JF - Metodoloski Zvezki

SN - 1854-0023

IS - 1

ER -