
Page 1: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


ICS 278: Data Mining

Lecture 5: Low-Dimensional Representations of High-Dimensional Data

Page 2: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Today’s lecture

• Extend project proposal deadline to Monday, 8am: questions?

• Outline of today's lecture:
  – "orphan" slides from earlier lectures
  – Dimension reduction methods
    • Motivation
    • Variable selection methods
    • Linear projection techniques
    • Non-linear embedding methods

Page 3: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Notation Reminder

• n objects, each with p measurements

• Data vector for the ith object:  x(i) = ( x_1(i), x_2(i), …, x_p(i) )

• Data matrix X:
  – x_j(i) is the ith row, jth column
  – columns -> variables
  – rows -> data points

• Can define distances/similarities
  – between rows (data vectors i)
  – between columns (variables j)

Page 4: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Distance

• n objects with p measurements:

  x = ( x_1, x_2, …, x_p ),   y = ( y_1, y_2, …, y_p )

• Most common distance metric is Euclidean distance (a small sketch follows this slide):

  d_E(x, y) = \left[ \sum_{k=1}^{p} ( x_k - y_k )^2 \right]^{1/2}

• Makes sense in the case where the different measurements are commensurate; each variable measured in the same units.

• If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.
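A minimal NumPy sketch of the Euclidean distance defined above; the example vectors are illustrative.

```python
import numpy as np

def euclidean_distance(x, y):
    """d_E(x, y) = sqrt( sum_k (x_k - y_k)^2 ) for two p-dimensional vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(euclidean_distance(x, y))  # sqrt(1 + 4 + 0) ≈ 2.236
```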

Page 5: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Dependence among Variables

• Covariance and correlation measure linear dependence (distance between variables, not objects)

• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

  Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} ( x(i) - \bar{x} )( y(i) - \bar{y} )

• The covariance is a measure of how X and Y vary together.
  – it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y

Page 6: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Correlation coefficient

• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Linear correlation coefficient is defined as:

  \rho(X, Y) = \frac{ \sum_{i=1}^{n} ( x(i) - \bar{x} )( y(i) - \bar{y} ) }
                    { \left[ \sum_{i=1}^{n} ( x(i) - \bar{x} )^2 \right]^{1/2} \left[ \sum_{i=1}^{n} ( y(i) - \bar{y} )^2 \right]^{1/2} }
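A minimal NumPy sketch of the sample covariance and correlation formulas above; the two example variables are illustrative.

```python
import numpy as np

def sample_covariance(x, y):
    """Cov(X, Y) = (1/n) * sum_i (x(i) - xbar)(y(i) - ybar)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.mean((x - x.mean()) * (y - y.mean())))

def correlation(x, y):
    """Covariance standardized by the two standard deviations (lies in [-1, +1])."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return float(num / den)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(sample_covariance(x, y), correlation(x, y))  # correlation is close to +1
```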

Page 7: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Sample Correlation Matrix

[Figure: sample correlation matrix for data on characteristics of Boston suburbs, shown on a color scale from -1 to +1; variables include business acreage, nitrous oxide, percentage of large residential lots, average # rooms, and median house value.]

Page 8: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Mahalanobis distance (between objects)

  d_{MH}(x, y) = \left[ ( x - y )^{T} \Sigma^{-1} ( x - y ) \right]^{1/2}

where (x - y) is the vector difference in p-dimensional space and \Sigma^{-1} is the inverse covariance matrix; the expression evaluates to a scalar distance.

Benefits:
1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features

Cost:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically, O(p^2), rather than linearly with the number of features.
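A minimal NumPy sketch of the Mahalanobis distance above. The small data matrix is illustrative; in practice the covariance estimate can be ill-conditioned (the "hard to determine accurately" caveat), in which case the inverse may need regularizing.

```python
import numpy as np

def mahalanobis_distance(x, y, cov):
    """d_MH(x, y) = sqrt( (x - y)^T Sigma^{-1} (x - y) )."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    cov_inv = np.linalg.inv(cov)                  # Sigma^{-1}, the inverse covariance matrix
    return float(np.sqrt(diff @ cov_inv @ diff))  # scalar distance

# Estimate the covariance from a small 2-d data set (rows = data points)
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 6.2], [4.0, 7.9]])
Sigma = np.cov(X, rowvar=False)
print(mahalanobis_distance(X[0], X[3], Sigma))
```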

Page 9: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Example 1 of Mahalanobis distance

Covariance matrix is diagonal and isotropic

-> all dimensions have equal variance

-> MH distance reduces to Euclidean distance

Page 10: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Example 2 of Mahalanobis distance

Covariance matrix is diagonal but non-isotropic

-> dimensions do not have equal variance

-> MH distance reduces to weighted Euclidean distance with weights = inverse variance

Page 11: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Example 2 of Mahalanobis distance

Two outer blue points will have the same MH distance to the center blue point

Page 12: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Distances between Binary Vectors

• Matching coefficient

           j=1    j=0
  i=1      n11    n10
  i=0      n01    n00

  \frac{ n_{11} + n_{00} }{ n_{11} + n_{10} + n_{01} + n_{00} }

• Jaccard coefficient (e.g., for sparse vectors, non-symmetric)

  \frac{ n_{11} }{ n_{11} + n_{10} + n_{01} }

  (n01 = number of variables where item j = 1 and item i = 0, and similarly for the other counts)
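A minimal NumPy sketch of the matching and Jaccard coefficients for a pair of binary vectors; the example vectors are illustrative.

```python
import numpy as np

def matching_and_jaccard(x, y):
    """Matching and Jaccard coefficients for two binary vectors x (item i) and y (item j)."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    n11 = int(np.sum(x & y))
    n10 = int(np.sum(x & ~y))
    n01 = int(np.sum(~x & y))
    n00 = int(np.sum(~x & ~y))
    matching = (n11 + n00) / (n11 + n10 + n01 + n00)
    jaccard = n11 / (n11 + n10 + n01)   # undefined if both vectors are all zeros
    return matching, jaccard

x = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y = np.array([1, 0, 0, 0, 1, 1, 0, 0])
print(matching_and_jaccard(x, y))  # (0.75, 0.5)
```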

Page 13: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Other distance metrics

• Categorical variables
  – Number of matches divided by number of dimensions

• Distances between strings of different lengths
  – e.g., "Patrick J. Smyth" and "Padhraic Smyth"
  – Edit distance (see the sketch after this list)

• Distances between images and waveforms
  – Shift-invariant, scale-invariant
  – i.e., d(x, y) = min_{a,b} || (ax + b) - y ||

• More generally, kernel methods
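A minimal sketch of one standard edit distance (Levenshtein distance, computed by dynamic programming), applied to the example strings above; the slide does not pin down which edit-distance variant or cost structure is intended.

```python
def edit_distance(s, t):
    """Levenshtein edit distance: minimum number of single-character
    insertions, deletions, and substitutions needed to turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```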

Page 14: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Transforming Data

• Duality between form of the data and the model
  – Useful to bring data onto a "natural scale"
  – Some variables are very skewed, e.g., income

• Common transforms: square root, reciprocal, logarithm, raising to a power
  – Often very useful when dealing with skewed real-world data

• Logit: transforms p from (0, 1) to the real line:

  logit(p) = \log \frac{p}{1 - p}
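A small illustrative sketch of these transforms; the income figures are invented for illustration only.

```python
import numpy as np

# Skewed positive variable (illustrative income values)
income = np.array([12_000.0, 25_000.0, 40_000.0, 65_000.0, 250_000.0])
print(np.sqrt(income))       # square root
print(np.log(income))        # logarithm compresses the long right tail

# Logit: maps a proportion p in (0, 1) onto the whole real line
p = np.array([0.05, 0.5, 0.95])
print(np.log(p / (1 - p)))   # logit(p) ≈ [-2.94, 0.00, +2.94]
```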

Page 15: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Data Quality

• Individual measurements
  – Random noise in individual measurements
    • Variance (precision)
    • Bias
    • Random data entry errors
    • Noise in label assignment (e.g., class labels in medical data sets)
  – Systematic errors
    • E.g., all ages > 99 recorded as 99
    • More individuals aged 20, 30, 40, etc. than expected
  – Missing information
    • Missing at random
      – Questions on a questionnaire that people randomly forget to fill in
    • Missing systematically
      – Questions that people don't want to answer
      – Patients who are too ill for a certain test

Page 16: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Data Quality

• Collections of measurements
  – Ideal case = random sample from population of interest
  – Real case = often a biased sample of some sort
  – Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution

• Examples of non-randomly sampled data
  – Medical study where subjects are all students
  – Geographic dependencies
  – Temporal dependencies
  – Stratified samples
    • E.g., 50% healthy, 50% ill
  – Hidden systematic effects
    • E.g., market basket data the weekend of a large sale in the store
    • E.g., Web log data during finals week

Page 17: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Classifier technology and the illusion of progress

(abstract for workshop on State-of-the-Art in Supervised Classification, May 2006)

Professor David J. Hand, Imperial College, London

Supervised classification methods are widely used in data mining. Highly sophisticated methods have been developed, using the full power of recent advances in computation. However, these advances have largely taken place within the context of a classical paradigm, in which construction of the classification rule is based on a ‘design sample’ of data randomly sampled from unknown but well defined distributions of the classes. In this paper, I argue that this paradigm fails to take account of other sources of uncertainty in the classification problem, and that these other sources lead to uncertainties which often swamp those arising from the classical ones of estimation and prediction. Several examples of such sources are given, including imprecision in the definitions of the classes, sample selectivity bias, population drift, and use of inappropriate optimisation criteria when fitting the model. Furthermore, it is argued, there are both theoretical arguments and practical evidence supporting the assertion that the marginal gains of increasing classifier complexity can often be minimal. In brief, the advances in classification technology are typically much less than is often claimed.

Page 18: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Dimension Reduction Methods

(reading: 3.6 to 3.8 in the text)

Page 19: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Dimension Reduction methods

• Dimension reduction

– From p-dimensional x to d-dimensional x’ , d < p

• Techniques

– Variable selection:
  • use an algorithm to find individual variables in x that are relevant to the problem and discard the rest
  • stepwise logistic regression

– Linear projections
  • Project data to a lower-dimensional space
  • E.g., principal components

– Non-linear embedding
  • Use a non-linear mapping to "embed" data in a lower-dimensional space
  • E.g., multidimensional scaling

Page 20: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Dimension Reduction: why is it useful?

• In general, incurs loss of information about x
  – So why do this?

• If dimensionality p is very large (e.g., 1000's), representing the data in a lower-dimensional space may make learning more reliable
  – e.g., clustering example
    • 100-dimensional data
    • but cluster structure is only present in 2 of the dimensions, the others are just noise
    • if the other 98 dimensions are just noise (relative to cluster structure), then clusters will be much easier to discover if we just focus on the 2d space

• Dimension reduction can also provide interpretation/insight
  – e.g., for 2d visualization purposes

• Caveat: the 2-step approach of dimension reduction followed by a learning algorithm is in general suboptimal

Page 21: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Variable Selection Methods

• p variables, would like to use a smaller subset in our model
  – e.g., in classification, do kNN in d-space rather than p-space
  – e.g., for logistic regression, use d inputs rather than p

• Problem:
  – Number of subsets of p variables is O(2^p)
  – Exhaustive search is impossible except for very small p
  – Typically the search problem is NP-hard

• Common solution (a greedy sketch follows below):
  – Local systematic search (e.g., add/delete variables 1 at a time) to locally maximize a score function (i.e., hill-climbing)
  – e.g., add a variable, build new model, generate new score, etc.
  – Can often work well, but can get trapped in local maxima/minima
  – Can also be computationally intensive (depends on model)

• Note: some techniques such as decision tree predictors automatically perform dimension reduction as part of the learning algorithm.
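A minimal sketch of the add-one-variable-at-a-time local search described above (greedy forward selection). The score function here, cross-validated logistic-regression accuracy from scikit-learn, is just one illustrative choice, and the stopping rule is a simple "no improvement" check.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_vars=5):
    """Greedy forward selection: repeatedly add the single variable that
    most improves the cross-validated score; stop at a local maximum."""
    n, p = X.shape
    selected, best_score = [], -np.inf
    while len(selected) < max_vars:
        candidates = [j for j in range(p) if j not in selected]
        if not candidates:
            break
        scores = []
        for j in candidates:
            cols = selected + [j]
            model = LogisticRegression(max_iter=1000)
            scores.append(cross_val_score(model, X[:, cols], y, cv=5).mean())
        if max(scores) <= best_score:    # no improvement: trapped at a local maximum
            break
        best_score = max(scores)
        selected.append(candidates[int(np.argmax(scores))])
    return selected, best_score

# Illustrative use on synthetic data: only the first 3 of 10 columns carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
print(forward_selection(X, y, max_vars=3))
```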

Page 22: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Linear Projections

x = p-dimensional vector of data measurements

Let a = weight vector, also dimension p

Assume aT a = 1 (i.e., unit norm)

a^T x = \sum_{j} a_j x_j

= projection of x onto vector a, gives distance of projected x along a

e.g., a^T = [1 0]        -> projection along 1st dimension
      a^T = [0 1]        -> projection along 2nd dimension
      a^T = [0.71, 0.71] -> projection along diagonal
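A minimal NumPy sketch of the three projections listed above; the data point is illustrative, and each weight vector is renormalized so that a^T a = 1.

```python
import numpy as np

x = np.array([3.0, 4.0])                   # a 2-d data point

for a in (np.array([1.0, 0.0]),            # 1st dimension
          np.array([0.0, 1.0]),            # 2nd dimension
          np.array([0.71, 0.71])):         # diagonal (approximately unit norm)
    a = a / np.linalg.norm(a)              # enforce a^T a = 1 exactly
    print(a, "->", float(a @ x))           # a^T x: distance of the projected x along a
```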

Page 23: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Example of projection from 2d to 1d

[Figure: 2d scatterplot with axes x1 and x2; an arrow shows the direction of the weight vector a onto which the data are projected.]

Page 24: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Projections to more than 1 dimension

• Multidimensional projections:
  – e.g., x is 4-dimensional

– a1T = [ 0.71 0.71 0 0 ]

– a2T = [ 0 0 0.71 0.71 ]

AT x -> coordinates of x in 2d space spanned by columns of A

-> linear transformation from 4d to 2d space

where A = [ a1 a2 ]

Page 25: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Principal Components Analysis (PCA)

X = p x n data matrix: columns = p-dim data vectors

Let a = weight vector, also dimension p

Assume aT a = 1 (i.e., unit norm)

aT X = projection of each column x onto vector a, = vector of distances of projected x vectors along a

PCA: find vector a such that var(a^T X) is maximized, i.e., find the linear projection with maximal variance

More generally:

A^T X = d x n data matrix with x vectors projected to d-dimensional space, where size(A) = p x d

PCA: find d orthogonal columns of A such that the variance in the d-dimensional projected space is maximized, d < p
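A minimal NumPy sketch of one standard way to compute principal components (eigendecomposition of the sample covariance matrix, equivalently the SVD of the centered data). This is not necessarily the exact derivation in the class notes or the text, and the data matrix here follows the n x p (rows = data points) convention, i.e., the transpose of the p x n matrix above.

```python
import numpy as np

def pca(X, d):
    """Top-d principal component directions (p x d matrix A with orthonormal
    columns) and the projected coordinates (n x d), for an n x p data matrix."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = (Xc.T @ Xc) / (len(Xc) - 1)        # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]    # keep the d largest-variance directions
    A = eigvecs[:, order]
    return A, Xc @ A

# Illustrative 2-d data with most of its variance along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [2.0, 0.5]])
A, Z = pca(X, d=1)
print(A.T, float(Z.var()))   # 1st PC direction and the (maximal) projected variance
```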

Page 26: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


PCA Example

[Figure: 2d scatterplot with axes x1 and x2; an arrow shows the direction of the 1st principal component vector (highest-variance projection).]

Page 27: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


PCA Example

[Figure: the same 2d scatterplot with axes x1 and x2; arrows show the direction of the 1st principal component vector (highest-variance projection) and the direction of the 2nd principal component vector.]

Page 28: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


How do we compute the principal components?

• See class notes

• See also page 78 in the text

Page 29: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Notes on PCA

• Basis representation and approximation error

• Scree diagrams

• Computational complexity of computing PCA
  – Equivalent to solving a set of linear equations, matrix inversion, singular value decomposition, etc.
  • Scales in general as O(np^2 + p^3)
  • Many numerical tricks possible, e.g., for sparse X matrices, for finding only the first k eigenvectors, etc.
  • In MATLAB can use eig.m or svd.m (also note sparse versions)

Page 30: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


More notes on PCA

• Links between PCA and multivariate Gaussian density

• Caveat: PCA can destroy information about differences between groups for clustering or classification

• PCA for other data types– Images, e.g., eigenfaces– Text, e.g., “latent semantic indexing” (LSI)

Page 31: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Basis images (eigenimages) of faces

Page 32: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


20 face images

Page 33: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


First 16 Eigenimages

Page 34: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


First 4 eigenimages

Page 35: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Reconstruction of First Image with 8 eigenimages

Original Image

Reconstructed Image

Page 36: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Reconstruction of first image with 8 eigenimages

Weights = -14.0 9.4 -1.1 -3.5 -9.8 -3.5 -0.6 0.6

Reconstructed image = weighted sum of 8 images on left

Page 37: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Reconstruction of 7th image with eigenimages

Original Image

Reconstructed Image

Page 38: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Reconstruction of 7th image with 8 eigenimages

Weights = -13.7 12.9 1.6 4.4 3.0 0.9 1.6 -6.3

Weights for Image 1 = -14.0 9.4 -1.1 -3.5 -9.8 -3.5 -0.6 0.6

Reconstructed image = weighted sum of 8 images on left

Page 39: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Reconstructing Image 6 with 16 eigenimages

Page 40: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Multidimensional Scaling (MDS)

• Say we have data in the form of an N x N matrix of dissimilarities
  – 0's on the diagonal
  – Symmetric
  – Could either be given data in this form, or create such a dissimilarity matrix from our data vectors

• Examples
  – Perceptual dissimilarity of N objects in cognitive science experiments
  – String-edit distance between N protein sequences

• MDS:
  – Find k-dimensional coordinates for each of the N objects such that Euclidean distances in "embedded" space match the set of dissimilarities as closely as possible

Page 41: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Multidimensional Scaling (MDS)

• MDS score function (“stress”)

• N points embedded in k dimensions -> Nk locations or parameters
  – To find the Nk locations?
    • Solve an optimization problem -> minimize the score function S

• Often used for visualization, e.g., k = 2, 3

  S = \frac{ \sum_{i,j} \left( d(i,j) - \delta(i,j) \right)^2 }{ \sum_{i,j} d(i,j)^2 }

  where \delta(i,j) are the original dissimilarities and d(i,j) are the Euclidean distances in the "embedded" k-dimensional space.
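A minimal NumPy sketch that evaluates the stress score above for a candidate embedding Y (one k-dimensional location per object); a full MDS routine would wrap this in the iterative optimization described on the next slide. The denominator follows the reconstruction of the formula given here, and the dissimilarity matrix is randomly generated purely for illustration.

```python
import numpy as np

def mds_stress(delta, Y):
    """Stress S for an N x N dissimilarity matrix delta and an N x k embedding Y."""
    diff = Y[:, None, :] - Y[None, :, :]          # pairwise differences in embedded space
    d = np.sqrt((diff ** 2).sum(axis=-1))         # Euclidean distances d(i, j)
    return float(((d - delta) ** 2).sum() / (d ** 2).sum())

# Illustrative check: a random 2-d embedding of 5 objects
rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))
delta = np.abs(rng.normal(size=(5, 5)))
delta = (delta + delta.T) / 2                     # symmetric
np.fill_diagonal(delta, 0.0)                      # zeros on the diagonal
print(mds_stress(delta, Y))
```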

Page 42: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


MDS Optimization

• Optimization problem:
  – S is a function of Nk parameters
    • find the set of N k-dimensional positions that minimize S
    • Note: 3 parameters are redundant: location (2) and rotation (1)

• If original dissimilarities are Euclidean
  -> linear algebra solution (equivalent to principal components)

• Non-Euclidean (more typical)
  – Local iterative hill-climbing, e.g., move each point to decrease S, repeat
  – Non-trivial optimization, can have local minima, etc.
  – Initialization: either random or heuristic (e.g., by PCA)
  – Complexity is O(N^2 k) per iteration (iteration = move all points locally)
  – See Faloutsos and Lin (1995) for FastMap: O(Nk) approximation for large N

  S = \frac{ \sum_{i,j} \left( d(i,j) - \delta(i,j) \right)^2 }{ \sum_{i,j} d(i,j)^2 }

Page 43: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


MDS example: input distance data

           Chicago  Raleigh  Boston  Seattle   S.F.  Austin  Orlando
Chicago        0
Raleigh      641        0
Boston       851      608       0
Seattle     1733     2363    2488        0
S.F.        1855     2406    2696      684      0
Austin       972     1167    1691     1764   1495       0
Orlando      994      520    1105     2565   2458    1015        0
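A sketch of recovering a 2d map from the inter-city distance matrix above, using scikit-learn's metric MDS on the precomputed dissimilarities (a different optimizer than the simple hill-climbing described earlier, but the same objective of matching the given distances).

```python
import numpy as np
from sklearn.manifold import MDS

cities = ["Chicago", "Raleigh", "Boston", "Seattle", "S.F.", "Austin", "Orlando"]
D = np.array([
    [   0,  641,  851, 1733, 1855,  972,  994],
    [ 641,    0,  608, 2363, 2406, 1167,  520],
    [ 851,  608,    0, 2488, 2696, 1691, 1105],
    [1733, 2363, 2488,    0,  684, 1764, 2565],
    [1855, 2406, 2696,  684,    0, 1495, 2458],
    [ 972, 1167, 1691, 1764, 1495,    0, 1015],
    [ 994,  520, 1105, 2565, 2458, 1015,    0],
], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                 # one 2-d location per city
for name, (x, y) in zip(cities, coords):
    print(f"{name:8s} {x:8.1f} {y:8.1f}")
```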

Page 44: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Page 45: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Result of MDS

Page 46: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


MDS for protein sequences

226 protein sequences of the Globin family (from Klock & Buhmann 1997).

[Figure: left panel, the similarity matrix (note cluster structure); right panel, the 2d MDS embedding.]

Page 47: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


MDS from human judgements of similarity

Page 48: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


MDS: Example data

Page 49: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


MDS: 2d embedding of face images

Page 50: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Page 51: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Other embedding techniques

• Many other algorithms for non-linear embedding

• Some of the better-known examples…
  – Self-organizing maps (Kohonen)
    • Neural-network-inspired algorithm for 2d embedding
  – ISOMAP, Local linear embedding
    • Find low-d coordinates that preserve local distances
    • Ignore global distances

Page 52: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Example of Local Linear Embedding (LLE) (Roweis and Saul, Science, 2000)

Note how points that are far away on the 3d manifold (e.g., red and blue) in "manifold distance" would be mapped as being close together by MDS or PCA, but are kept "far apart" by LLE. LLE emphasizes local relationships.

Page 53: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


LLE Algorithm

• N points in dimension p: wish to reduce to dimension d, d < p

• Step 1:
  – Select K nearest neighbors for each point in the training data
  – Represent each point X as a weighted linear combination of its K neighbor points
  – Find the best K weights for each of the X vectors (least-squares fitting)

• Step 2:
  – Fix the weights from Step 1
  – For each p-dim vector X, find a d-dimensional Y vector that is closest to its reconstructed approximation based on its d-dim neighbors and weights
  – Reduces to another linear algebra/eigenvalue problem, O(N^3) complexity
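Rather than re-implementing the two steps, a short sketch using scikit-learn's LocallyLinearEmbedding on a synthetic swiss-roll manifold, similar in spirit to the Roweis and Saul example; the parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A 3-d "swiss roll" manifold to be unrolled into 2 dimensions
X, color = make_swiss_roll(n_samples=1000, random_state=0)

# K nearest neighbors per point (Step 1); d = 2 output dimensions (Step 2)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)

print(Y.shape)                      # (1000, 2)
print(lle.reconstruction_error_)    # residual of the Step-2 eigenproblem
```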

Page 54: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Page 55: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


LLE applied to text data

Page 56: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


LLE applied to a set of face images

Page 57: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Local Linear Embedding example

Page 58: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


ISOMAP (Tenenbaum, de Silva, Langford, Science, 2000)

Similar to LLE in concept: preserves local distances

Computational strategy is different: measures distance in original space via geodesic paths (distance “on manifold”)

Algorithm involves finding shortest paths between points and then embedding
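A companion sketch using scikit-learn's Isomap on the same synthetic swiss roll; geodesic ("on manifold") distances are approximated by shortest paths on a K-nearest-neighbor graph before the embedding step, as described above. Parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1000, random_state=0)

# Shortest paths on the 12-nearest-neighbor graph approximate geodesic distances,
# which are then embedded in 2 dimensions.
iso = Isomap(n_neighbors=12, n_components=2)
Y = iso.fit_transform(X)
print(Y.shape)   # (1000, 2)
```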

Page 59: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Examples of ISOMAP embeddings in 2d

Page 60: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


ISOMAP: morphing examples

Page 61: ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data


Summary on Dimension Reduction

• Can be used for defining a new set of (lower-dimensional) variables for modeling or for visualization/insight

• 3 general approaches
  – Variable selection (only select "relevant" variables)
  – Linear projections (e.g., PCA)
  – Non-linear projections (e.g., MDS, LLE, ISOMAP)
  – Can be used with text, images, etc. by representing such data as very high-dimensional vectors
  – MATLAB implementations of all of these techniques are available on the Web

• These techniques can be useful, but like any high-powered tool they are not a solution to everything
  – real-world data sets often do not produce the types of elegant 2d embeddings that one often sees in research papers on these topics.