

Applying Latent Profile Analysis to Classify Chicago Neighborhoods

Oksana Pugach, PhD

Institute for Health Research and Policy

University of Illinois at Chicago

December, 2017

Cluster Analysis

• Identifying groups of individuals or objects that are similar to each other but different from individuals in other groups

• Cluster analysis and discriminant analysis both classify objects into categories

• In a nutshell:

– select cases

– select variables (standardize?)

– select clustering procedure

• hierarchical clustering

• k-means clustering

• two-step clustering

• The term "cluster analysis" does not refer to one particular statistical method
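The workflow in the list above can be sketched in base R. This is only an illustration on simulated data: the variable names, the two-group structure, the Ward linkage, and the choice of two clusters are all assumptions made for the example, not part of the original analysis.

```r
# Simulated stand-in data: 60 cases, 2 variables on very different scales.
set.seed(1)
x <- rbind(
  cbind(rnorm(30, mean = 0, sd = 1), rnorm(30, mean = 100, sd = 20)),
  cbind(rnorm(30, mean = 6, sd = 1), rnorm(30, mean = 300, sd = 20))
)
colnames(x) <- c("var1", "var2")  # hypothetical variable names

# Select variables and standardize them so that no single scale dominates.
z <- scale(x)

# Hierarchical clustering: Ward linkage on Euclidean distances, cut at 2 groups.
hc <- hclust(dist(z), method = "ward.D2")
hc_groups <- cutree(hc, k = 2)

# k-means clustering with 2 centers and several random starts.
km <- kmeans(z, centers = 2, nstart = 20)

# Cross-tabulate the two solutions; cluster labels are arbitrary, so only
# the pattern of agreement matters.
agreement <- table(hierarchical = hc_groups, kmeans = km$cluster)
print(agreement)
```

Both procedures need the number of clusters to be supplied by the analyst, which is the informal step that model-based clustering tries to replace with a formal criterion.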

Cluster Analysis

• Different clustering methods can produce different, even conflicting, solutions. Choosing the final cluster solution and the number of clusters is informal and subjective

• An alternative approach to clustering postulates a formal statistical model for the population: the model assumes that the population consists of subpopulations ("clusters"), in each of which the variables have a different multivariate probability density function, resulting in a finite mixture density for the population as a whole.

• Problem: estimate parameters of the density functions and mixing probabilities

• Calculate: posterior probability of cluster membership

• How to determine number of clusters: model selection by objective procedures

Latent Profile Analysis

• Latent profile models are commonly attributed to Lazarsfeld and Henry (1968).

• Cluster analysis methods based on finite mixture models (FMM) are also known as model-based clustering methods (Banfield & Raftery, 1993).

• FMM can be seen as a form of latent variable analysis (Skrondal & Rabe-Hesketh, 2004) in which the subpopulation is a latent categorical variable; this is also known as latent class cluster analysis.

Source: Oberski, D. (2016). Mixture Models: Latent Profile and Latent Class Analysis. In Modern Statistical Methods for HCI (pp. 275–287). Springer, Cham. https://doi.org/10.1007/978-3-319-26633-6_12

Names of different kinds of latent variable models

                     Models for means                                Regression models
Observed \ Latent    Continuous             Discrete                 Continuous           Discrete
Continuous           Factor analysis        Latent profile analysis  Random effects       Regression mixture
Discrete             Item response theory   Latent class analysis    Logistic ran. eff.   Logistic reg. mix.

Finite Mixture Densities

• Model

$$f(\mathbf{x}; \mathbf{p}, \boldsymbol{\theta}) = \sum_{j=1}^{c} p_j \, g_j(\mathbf{x}; \boldsymbol{\theta}_j), \qquad \sum_{j=1}^{c} p_j = 1$$

• x – p-dimensional random vector

• p_j – mixing probabilities

• g_j() – component densities

• c – number of clusters

Assumption for a finite mixture as a model for cluster analysis: each group of observations in a dataset comes from a population with a different probability distribution
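As a rough illustration of the model, the sketch below evaluates a finite mixture density, that is, the sum over components of mixing probability times component density. It assumes univariate normal components, and the mixing probabilities, means, and standard deviations are hypothetical values chosen only for the example.

```r
# Mixture density f(x; p, theta) = sum_j p_j * g_j(x; theta_j), with
# univariate normal components (an illustrative choice of g_j).
mixture_density <- function(x, p, mu, sigma) {
  stopifnot(abs(sum(p) - 1) < 1e-12,
            length(p) == length(mu), length(mu) == length(sigma))
  # One column per component: p_j * g_j(x; mu_j, sigma_j)
  weighted <- sapply(seq_along(p), function(j) p[j] * dnorm(x, mu[j], sigma[j]))
  rowSums(matrix(weighted, nrow = length(x)))
}

p     <- c(0.6, 0.4)   # hypothetical mixing probabilities (sum to 1)
mu    <- c(0, 4)       # hypothetical component means
sigma <- c(1, 1.5)     # hypothetical component standard deviations

f_at_0 <- mixture_density(0, p, mu, sigma)

# Because the mixing probabilities sum to 1, the mixture integrates to 1.
total <- integrate(mixture_density, -Inf, Inf,
                   p = p, mu = mu, sigma = sigma)$value
```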

Cluster allocation

Having estimated the parameters of the assumed mixture density, observations can be assigned to particular clusters on the basis of the maximum value of the posterior probability

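$$\Pr(\text{cluster } j \mid \mathbf{x}_i) = \frac{\hat{p}_j \, g_j(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_j)}{f(\mathbf{x}_i; \hat{\mathbf{p}}, \hat{\boldsymbol{\theta}})}$$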

Maximum Likelihood Estimation

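Log-likelihood:

$$l(\mathbf{p}, \boldsymbol{\theta}) = \sum_{i=1}^{n} \ln f(\mathbf{x}_i; \mathbf{p}, \boldsymbol{\theta})$$

Estimation by:

• Expectation Maximization algorithm (usually used)

• Bayesian estimation methods using the Gibbs sampler or other MCMC methods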
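To make the EM idea concrete, here is a small sketch for a two-component univariate normal mixture fitted to simulated data. It is not the estimation routine used in the analysis; the simulated sample, starting values, and stopping rule are assumptions made for the illustration.

```r
set.seed(42)
# Simulated data: 150 draws from N(0, 1) and 100 draws from N(5, 1.5^2)
x <- c(rnorm(150, 0, 1), rnorm(100, 5, 1.5))
n <- length(x)

# Crude starting values (an arbitrary illustrative choice)
p     <- c(0.5, 0.5)
mu    <- as.numeric(quantile(x, c(0.25, 0.75)))
sigma <- rep(sd(x), 2)

# Log-likelihood: sum over observations of the log mixture density
loglik <- function(p, mu, sigma) {
  sum(log(p[1] * dnorm(x, mu[1], sigma[1]) + p[2] * dnorm(x, mu[2], sigma[2])))
}

ll_trace <- loglik(p, mu, sigma)
for (iter in 1:500) {
  # E-step: posterior membership probabilities given current parameters
  num <- cbind(p[1] * dnorm(x, mu[1], sigma[1]),
               p[2] * dnorm(x, mu[2], sigma[2]))
  z <- num / rowSums(num)

  # M-step: weighted updates of mixing probabilities, means, and sds
  nj    <- colSums(z)
  p     <- nj / n
  mu    <- colSums(z * x) / nj
  sigma <- sqrt(colSums(z * outer(x, mu, "-")^2) / nj)

  # Stop when the log-likelihood no longer improves noticeably
  ll_trace <- c(ll_trace, loglik(p, mu, sigma))
  if (diff(tail(ll_trace, 2)) < 1e-8) break
}

round(rbind(p = p, mu = mu, sigma = sigma), 3)
```

The log-likelihood trace never decreases from one iteration to the next, which is the defining property of EM and a useful check on any hand-written implementation.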

Maximum Likelihood Estimation for mixtures of multivariate normal

• As the number of clusters increases, the number of model parameters increases rapidly. Restrictions on the covariance matrices Σ_j can be imposed to obtain more parsimony and stability.

• Banfield and Raftery (1993) proposed reparameterizing the class-specific covariance matrix through its principal component (eigenvalue) decomposition:

$$\Sigma_j = \lambda_j D_j A_j D_j^{T}$$

Geometrical interpretation of the decomposition: λ_j, D_j, and A_j determine the volume, orientation, and shape of cluster j, respectively

Restrictions applied can be directly interpreted in terms of the geometrical form of a cluster
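The decomposition of a cluster covariance matrix into volume, orientation, and shape can be checked numerically. The sketch below uses a hypothetical 2 x 2 covariance matrix and assumes the common normalisation in which the shape matrix A has determinant 1, so that the volume term lambda equals det(Sigma)^(1/p); other normalisations exist.

```r
# A hypothetical 2 x 2 within-cluster covariance matrix
Sigma <- matrix(c(4.0, 1.5,
                  1.5, 1.0), nrow = 2, byrow = TRUE)
p_dim <- nrow(Sigma)

eig <- eigen(Sigma, symmetric = TRUE)

D      <- eig$vectors                    # orientation: orthogonal eigenvectors
lambda <- prod(eig$values)^(1 / p_dim)   # volume: det(Sigma)^(1/p)
A      <- diag(eig$values / lambda)      # shape: diagonal, scaled so det(A) = 1

# Reassemble Sigma = lambda * D A D^T to confirm the decomposition
Sigma_rebuilt <- lambda * D %*% A %*% t(D)
```

Constraining lambda, D, or A to be equal across clusters, or A to be the identity, yields the restricted covariance structures that mclust labels with three-letter codes such as VVV or VVE.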

Table: Parameterisations of the within-group covariance matrix for multidimensional data available in the mclust package, and the corresponding geometric characteristics (Scrucca, Fop, Murphy, & Raftery, 2016)

Example of mixture of two normals

Other finite mixture models

• Mixture of multivariate t-distributions – robust to outliers and skewed distributions

• Mixtures for categorical data – latent class analysis.

– Multivariate Bernoulli densities, with the assumption that, given the class, the categorical variables are independent of each other.
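For the categorical case, the conditional-independence assumption means that the probability of a response pattern within a class is just a product of Bernoulli terms. The sketch below works this out for a hypothetical two-class model with three binary items; all probabilities are invented for the illustration.

```r
# Hypothetical two-class model for three binary items
p     <- c(0.7, 0.3)                  # class (mixing) probabilities
theta <- rbind(c(0.9, 0.8, 0.7),      # P(item = 1 | class 1)
               c(0.2, 0.3, 0.1))      # P(item = 1 | class 2)

# Probability of a 0/1 response pattern x within one class, assuming the
# items are independent given the class
pattern_given_class <- function(x, theta_j) {
  prod(theta_j^x * (1 - theta_j)^(1 - x))
}

# Mixture probability of a pattern: sum over classes of p_j * P(x | class j)
pattern_prob <- function(x) {
  sum(p * apply(theta, 1, function(theta_j) pattern_given_class(x, theta_j)))
}

# All 2^3 possible response patterns and their model-implied probabilities
patterns <- as.matrix(expand.grid(0:1, 0:1, 0:1))
probs    <- apply(patterns, 1, pattern_prob)
```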

Model selection and Inference

• Log-likelihood ratio test

– Unfortunately, this does not lead to a suitable statistical test, because the regularity conditions do not hold: the null hypothesis lies on the edge of the parameter space, and when components coincide their mixing probabilities become unidentifiable. The test tends to overestimate the number of clusters. The parametric bootstrap is the preferred alternative. Both are available only for nested models.

• Information-theoretic approaches

– Use a measure of the information lost when a particular model is used to approximate the true model. AIC and BIC are both penalized log-likelihoods, and the smaller value is preferred. (mclust reports BIC with the opposite sign, 2 ln L − df ln n, so in its output the largest value is best.) Both depend heavily on regularity conditions that do not necessarily hold in FMM, and their robustness has not been studied. It is recommended to use multiple criteria along with theoretical and practical considerations.

• Bayes factors

– The posterior odds of one model against another (under equal prior model probabilities). Estimation requires integrating the marginal likelihood, which is a limitation.

• MCMC methods using reversible jump MCMC

Statistical Software

• R: mclust by Fraley and Raftery

• R: flexmix by Grün and Leisch

• R: caman by Schlattmann

• Latent GOLD (Statistical Innovations): a powerful latent class and finite mixture program with a very user-friendly point-and-click interface (GUI).

• Mplus by Muthen and Muthen

• gllamm in Stata

• FMM in SAS (experimental)

Application

• Project: Measuring Disparities in the Chain of Survival in Latino Communities

• PI: Marina Del Rios Rivera, MD, MSc

• Funding Agency: American Heart Association (Award No. 16MCPRP30960065)

• Purpose: Explore the relationship between neighborhood-level variables (i.e., language, educational attainment, and residential instability) and out-of-hospital cardiac arrest (OHCA) outcomes in Hispanics.

• Data: Surveillance data prospectively submitted to the Cardiac Arrest Registry to Enhance Survival (CARES) will be geocoded to Census Tracts.

Concentrated disadvantage: composite measure of census-tract-level socioeconomic composition in Chicago

Sources compared: Sampson et al., 1997; Cagney and Browning, 2004; Current Analysis (N=797)

Variable                    Current Analysis, mean (sd)
Age Dependency Ratio        52.90 (20.01)
% Unemployed                14.52 (10.15)
% Female-headed HH          20.12 (14.34)
Median Income HH, 1K        49.66 (26.68)
% Vacant Housing            13.90 (8.81)
% Below Poverty             24.07 (14.65)
% on Public Assistance      24.44 (17.66)
% Less Than High School     18.39 (12.95)
% Less Than Age 18
% Black

Census tract characteristics from 2010-2014 5-year ACS estimates

> library(mclust)
> mod <- Mclust(mydata2[,-1])
> summary(mod$BIC)
Best BIC values:
             VVV,4        VVE,6        VVV,3
BIC      -45562.89 -45592.95763 -45606.92200
BIC diff      0.00    -30.06785    -44.03223
> summary(mod)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 4 components:

 log.likelihood   n  df       BIC       ICL
      -22183.51 797 179 -45562.89 -45638.06

Clustering table:
  1   2   3   4
260 331 185  21

The smallest component contains only 21 cases (2.6% of the 797 tracts)

BIC plot

Fitting mixture model with 3 classes

> mod.3 <- Mclust(mydata2[,-1], G=3)
> summary(mod.3)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:

 log.likelihood   n  df       BIC       ICL
      -22355.84 797 134 -45606.92 -45689.57

Clustering table:
  1   2   3
328 273 196
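The df and BIC values in the two summaries can be reproduced by hand, which is a useful check on how the output should be read. The sketch below assumes eight clustering variables (a value inferred from the reported degrees of freedom, not stated in the output) and uses the sign convention that mclust prints, 2 * logLik - df * log(n), under which larger is better. The helper names are made up for this example.

```r
# Number of free parameters in an unrestricted (VVV) Gaussian mixture:
# (G - 1) mixing probabilities + G * d means + G * d(d + 1)/2 covariances
n_par_vvv <- function(G, d) (G - 1) + G * d + G * d * (d + 1) / 2

# BIC on the scale printed by mclust: 2 * logLik - df * log(n)
bic_mclust <- function(loglik, df, n) 2 * loglik - df * log(n)

n <- 797   # census tracts, from the output
d <- 8     # assumed number of clustering variables (inferred from df)

df4 <- n_par_vvv(4, d)
df3 <- n_par_vvv(3, d)

bic4 <- bic_mclust(-22183.51, df4, n)   # log-likelihood of the 4-component fit
bic3 <- bic_mclust(-22355.84, df3, n)   # log-likelihood of the 3-component fit

round(c(df4 = df4, df3 = df3, bic4 = bic4, bic3 = bic3), 2)
```

The 4-component model has the higher BIC by about 44 points, so moving to 3 components is a choice made on interpretability grounds (the very small fourth class) rather than on BIC alone.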

Mixture probabilities and mean (sd) for Census tract characteristics

Component                   Class 1       Class 2       Class 3
Mixing Probabilities        41.3%         34.3%         24.4%
Age Dependency Ratio        49.6 (14.7)   67.6 (15.4)   37.9 (19.9)
% Less Than High School     24.2 (14.3)   20.4 (8.21)   5.6 (4.48)
% Unemployed                11.3 (4.27)   24.8 (9.8)    5.45 (2.67)
% Female-headed HH          15.4 (6.71)   35.8 (11)     6.03 (4.02)
Median Income HH, 1K        44.7 (12.5)   29.4 (10.3)   86.5 (22.9)
% Vacant Housing            10.8 (4.65)   21 (9.63)     9.11 (6.44)
% Below Poverty             22.6 (9.58)   36.4 (13.6)   9.22 (4.94)
% on Public Assistance      21.1 (9.89)   42.2 (13.6)   5.2 (4.23)
Labels                      poor          distressed    affluent

Uncertainty plot

> drmod <- MclustDR(mod.3, lambda=1)
> summary(drmod)
> plot(drmod, what='contour')
> # flag tracts whose classification uncertainty exceeds 0.3 and circle them on the plot
> miscl <- mod.3$uncertainty > 0.3
> points(drmod$dir[miscl,], pch=1, cex=2)
> table(miscl)
miscl
FALSE  TRUE
  761    36

Contour plot of estimated mixture densities on a projection subspace
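The uncertainty used to flag tracts is one minus the largest posterior membership probability. The sketch below shows that calculation on a small hypothetical posterior matrix; the five rows are invented and only the 0.3 threshold is taken from the slide.

```r
# Hypothetical posterior membership probabilities: 5 tracts x 3 classes
z <- rbind(c(0.98, 0.01, 0.01),
           c(0.60, 0.35, 0.05),
           c(0.10, 0.85, 0.05),
           c(0.40, 0.15, 0.45),
           c(0.02, 0.03, 0.95))

# Class with the largest posterior probability for each tract
classification <- max.col(z, ties.method = "first")

# Uncertainty: 1 minus the largest posterior probability
uncertainty <- 1 - apply(z, 1, max)

# Flag tracts above the 0.3 threshold used in the slide
flagged <- uncertainty > 0.3
table(flagged)
```

With three classes the uncertainty can be at most 2/3, so a threshold of 0.3 picks out tracts whose best class has a posterior probability below 0.7.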

Chicago Map

Classification by %Race

Concentrated disadvantage as continuous variable

• Calculated as the sum of the components with factor loadings above 0.3, weighted by those loadings

• Mean (range) = 210.60 (37.81 – 406.15)

• Density plot

Class   n     mean     sd      min      max
1       328   203.92   39.08   114.39   315.19
2       273   290.47   50.17   155.18   406.15
3       196   110.53   29.66   37.81    194.17
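As a quick consistency check on the class summaries, the overall mean of the concentrated disadvantage score should equal the class means weighted by class size. The short sketch below only re-uses the numbers reported on the slide.

```r
# Class sizes and class means of the concentrated disadvantage score
class_n    <- c(328, 273, 196)
class_mean <- c(203.92, 290.47, 110.53)

# Size-weighted mean across the three classes
overall_mean <- weighted.mean(class_mean, w = class_n)
round(overall_mean, 2)
```

The result agrees with the overall mean of 210.60 reported for all 797 tracts.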

Thank you!

• This work was supported by Award No. 16MCPRP30960065 from the NIH – American Heart Association and by the Methodology Research Core at IHRP, UIC.

References

• Banfield, J. D, & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.

• Browning, C. R., & Cagney, K. A. (2002). Neighborhood structural disadvantage, collective efficacy, and self-rated physical health in an urban setting. Journal of Health and Social Behavior, 43(4), 383–399.

• Cagney, K. A., & Browning, C. R. (2004). Exploring Neighborhood-level Variation in Asthma and other Respiratory Diseases. Journal of General Internal Medicine, 19(3), 229–236. https://doi.org/10.1111/j.1525-1497.2004.30359.x

• Hagenaars, J. A., & McCutcheon, A. L. (2002). Applied Latent Class Analysis. New York: Cambridge University Press. Retrieved from http://ebookcentral.proquest.com/lib/uic/detail.action?docID=217833

• Oberski, D. (2016). Mixture Models: Latent Profile and Latent Class Analysis. In Modern Statistical Methods for HCI (pp. 275–287). Springer, Cham. https://doi.org/10.1007/978-3-319-26633-6_12

• Sampson, R. J., Raudenbush, S. W., & Earls, F. (1997). Neighborhoods and Violent Crime: A Multilevel Study of Collective Efficacy. Science, 277(5328), 918–924. https://doi.org/10.1126/science.277.5328.918

• Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289.

• Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models (1st ed.). Boca Raton: Chapman and Hall/CRC.

• Everitt, B. S., Landau, S., Leese, M., et al. (n.d.). Cluster Analysis (5th ed.). Wiley. Retrieved November 30, 2017, from http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP002266.html
