Clustering and variable selection in the presence of mixed variable types and missing data

C. B. Storlie; S. M. Myers; S. K. Katusic; A. L. Weaver; R. G. Voigt; P. E. Croarkin; R. E. Stoeckel; J. D. Port

doi:10.1002/sim.7697

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

We consider the problem of model-based clustering in the presence of many correlated variables of mixed continuous and discrete type, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods in simulations of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. Focusing on this sparse subset of tests will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
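
As a rough illustration of two ideas named in the abstract, the sketch below draws mixture weights from a truncated stick-breaking representation of a Dirichlet process and maps a latent continuous value to an ordinal category with cutpoints. This is not the authors' implementation: the function names, the truncation level, the concentration parameter `alpha`, and the cutpoint values are all assumptions chosen for the example.

```python
import random


def stick_breaking_weights(alpha, num_components, seed=0):
    """Draw mixture weights from a truncated stick-breaking construction.

    Hypothetical helper: `alpha` is the Dirichlet process concentration
    parameter and `num_components` is an assumed truncation level.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(num_components - 1):
        frac = rng.betavariate(1.0, alpha)  # v_k ~ Beta(1, alpha)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # last component takes the leftover mass
    return weights


def threshold_latent(z, cutpoints):
    """Map a latent continuous value to an ordinal category.

    Hypothetical helper: the category is the number of (sorted, assumed)
    cutpoints that lie strictly below z.
    """
    return sum(z > c for c in cutpoints)


if __name__ == "__main__":
    w = stick_breaking_weights(alpha=1.0, num_components=10)
    print("weights sum to", round(sum(w), 6))
    print("category for z = 0.3:", threshold_latent(0.3, [-1.0, 0.0, 1.0]))
```

Smaller values of `alpha` concentrate the mass on the first few components, which is one way a model of this kind can leave the effective number of clusters to be inferred from the data.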

Original language	English (US)
Pages (from-to)	2884-2899
Number of pages	16
Journal	Statistics in Medicine
Volume	37
Issue number	19
DOIs	https://doi.org/10.1002/sim.7697
State	Published - Aug 30 2018

Keywords

Dirichlet process
hierarchical Bayesian modeling
missing data
mixed variable types
model-based clustering
variable selection

ASJC Scopus subject areas

Epidemiology
Statistics and Probability

Cite this

@article{d609cbdf206d4189a373796305136ac7,

title = "Clustering and variable selection in the presence of mixed variable types and missing data",

abstract = "We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.",

keywords = "Dirichlet process, hierarchical Bayesian modeling, missing data, mixed variable types, model-based clustering, variable selection",

author = "Storlie, {C. B.} and Myers, {S. M.} and Katusic, {S. K.} and Weaver, {A. L.} and Voigt, {R. G.} and Croarkin, {P. E.} and Stoeckel, {R. E.} and Port, {J. D.}",

note = "Publisher Copyright: Copyright {\textcopyright} 2018 John Wiley \& Sons, Ltd.",

year = "2018",

month = aug,

day = "30",

doi = "10.1002/sim.7697",

language = "English (US)",

volume = "37",

pages = "2884--2899",

journal = "Statistics in Medicine",

issn = "0277-6715",

publisher = "John Wiley and Sons Ltd",

number = "19",

}

TY - JOUR

T1 - Clustering and variable selection in the presence of mixed variable types and missing data

AU - Storlie, C. B.

AU - Myers, S. M.

AU - Katusic, S. K.

AU - Weaver, A. L.

AU - Voigt, R. G.

AU - Croarkin, P. E.

AU - Stoeckel, R. E.

AU - Port, J. D.

PY - 2018/8/30

Y1 - 2018/8/30

N2 - We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.

AB - We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.

KW - Dirichlet process

KW - hierarchical Bayesian modeling

KW - missing data

KW - mixed variable types

KW - model-based clustering

KW - variable selection

UR - http://www.scopus.com/inward/record.url?scp=85050138161&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050138161&partnerID=8YFLogxK

U2 - 10.1002/sim.7697

DO - 10.1002/sim.7697

M3 - Article

AN - SCOPUS:85050138161

SN - 0277-6715

VL - 37

SP - 2884

EP - 2899

JO - Statistics in Medicine

JF - Statistics in Medicine

IS - 19

ER -
