Visualizing Data Using t-SNE

David Khosid, Dec. 21, 2015


TRANSCRIPT

  • Visualizing Data Using t-SNE

    David Khosid, Dec. 21, 2015

    Outline: Good visualization; Mathematical framework; Implementation

  • Agenda

    Good visualization
    Mechanics of t-SNE
    Examples: image, text, voice
    Scalability: visualizing large datasets, up to tens of millions of points
    Implementations: scikit-learn, Matlab, Torch

  • MNIST visualization with PCA

    This PCA visualization is terrible: the digit classes overlap heavily.

  • MNIST visualization with t-SNE in 2D

    t-SNE visualization can help you identify the various clusters.

    (a) MNIST in t-SNE  (b) Learning animation (view with Adobe Reader)

    YouTube link to 3D t-SNE: http://youtu.be/tMQAwqsMb6k

  • Good visualization (requirements)

    Each high-dimensional object is represented by a low-dimensional object.
    Preserve the neighborhood.
    Distant points correspond to dissimilar objects.
    Scalability: large, high-dimensional data sets.

  • Manifold Learning

    Manifolds: MNIST has about 10 intrinsic dimensions in its 28x28 images; images are roughly 100 dims, text roughly 1000 dims.
    PCA is mainly concerned with preserving large pairwise distances in the map.
    Example: the Swiss Roll.

  • Idea of t-SNE

    A data point is a point x_i in the original data space R^D.
    A map point is a point y_i in the map space R^2/R^3; every map point represents one of the original data points.
    t-SNE is a visualization algorithm that chooses the positions of the map points in R^2/R^3.

    t-SNE procedure:
    1. Compute an N x N similarity matrix in the original R^D space.
    2. Define an N x N similarity matrix in the low-dimensional embedding space (the learned objective).
    3. Define the cost function: the Kullback-Leibler divergence between the two probability distributions.
    4. Learn the low-dimensional embedding.

    Result: t-SNE focuses on accurately modelling small pairwise distances, i.e., on preserving local data structure in R^2/R^3.

  • Conditional similarity between two data points

    Similarity of data points x_i in data space R^D:

        p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)

    p_{j|i} measures how close x_j is to x_i, under a Gaussian distribution centered at x_i with a given variance σ_i².
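The conditional similarity above can be sketched in a few lines of NumPy. This is a toy illustration, not the reference implementation; the helper name `conditional_p` and the fixed bandwidth are illustrative assumptions (real t-SNE tunes σ_i per point, as the next slides explain).

```python
import numpy as np

def conditional_p(X, i, sigma_i):
    """p_{j|i}: Gaussian similarity of every x_j to x_i,
    normalized over all k != i (sigma_i is the bandwidth at x_i)."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)       # squared distances ||x_i - x_j||^2
    w = np.exp(-d2 / (2.0 * sigma_i ** 2))     # Gaussian kernel
    w[i] = 0.0                                 # exclude the point itself
    return w / w.sum()                         # normalize into probabilities

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 toy points in R^3
p = conditional_p(X, 0, sigma_i=1.0)
```

Each row of conditionals sums to 1, with the self-similarity zeroed out.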

  • Symmetric similarity

    Similarity of data points x_i in data space R^D:

        p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)    (1)

    Make the similarity metric p_ij symmetric. The main advantage of symmetry is that it simplifies the gradient (learning stage):

        p_ij = (p_{i|j} + p_{j|i}) / 2N    (2)

    We set p_ii = 0, as we are interested only in pairwise similarities.
    σ_i is chosen such that each data point has a fixed perplexity (effective number of neighbors).
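Equation (2) can be sketched directly in NumPy. The helper name `symmetric_p` and the random row-stochastic demo matrix are illustrative assumptions; in real t-SNE the input would be the matrix of conditionals p_{j|i} from the previous slide.

```python
import numpy as np

def symmetric_p(P_cond):
    """Eq. (2): p_ij = (p_{i|j} + p_{j|i}) / (2N), given
    P_cond[i, j] = p_{j|i} with zero diagonal."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)

# Demo on a random row-stochastic matrix of conditionals
rng = np.random.default_rng(0)
C = rng.random((5, 5))
np.fill_diagonal(C, 0.0)              # p_{i|i} = 0
C /= C.sum(axis=1, keepdims=True)     # each row of conditionals sums to 1
P = symmetric_p(C)                    # symmetric joint distribution over all pairs
```

Since each of the N rows of conditionals sums to 1, dividing by 2N makes the symmetric matrix sum to exactly 1 over all pairs.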

  • Similarity of map points in low dimension

    Student t-distribution with one degree of freedom (same as the Cauchy distribution):

        q_ij = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠m} (1 + ‖y_k − y_m‖²)⁻¹    (3)

    We set q_ii = 0, as we are interested only in pairwise similarities.
    Heavy tails (discussed later).
    Still closely related to the Gaussian.
    Computationally convenient (no exponent).
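Equation (3) translates to a dense NumPy sketch as follows. The helper name `student_t_q` is an illustrative assumption, and this O(N²) formulation is only for small demos.

```python
import numpy as np

def student_t_q(Y):
    """Eq. (3): q_ij from map points Y using the heavy-tailed
    Student-t (1 dof) kernel, normalized over all pairs k != m."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # ||y_i - y_j||^2
    num = 1.0 / (1.0 + d2)         # (1 + ||y_i - y_j||^2)^-1, no exponentials
    np.fill_diagonal(num, 0.0)     # q_ii = 0
    return num / num.sum()

Y = np.random.default_rng(0).normal(size=(6, 2))   # 6 toy map points in R^2
Q = student_t_q(Y)
```

As on the slide, no exponential is evaluated, which is part of what makes the Student-t kernel computationally convenient.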

  • Kullback-Leibler divergence (cost function)

    (p_ij) is fixed, (q_ij) is flexible. We want (p_ij) and (q_ij) to be as close as possible.

        C = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (4)

    KL divergence:
    It is not a distance, since it is asymmetric.
    Large p_ij modelled by small q_ij: large penalty.
    Small p_ij modelled by large q_ij: small penalty.
    KL divergence meaning: cross-entropy.
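Equation (4) is a one-liner in NumPy once zero entries are handled. The helper name `kl_cost` and the tiny 2x2 demo distributions are illustrative assumptions.

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """Eq. (4): C = sum_ij p_ij * log(p_ij / q_ij).
    Terms with p_ij = 0 contribute nothing; eps guards log(0)."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

# Tiny demo: identical distributions give ~0 cost, different ones a positive cost
P2 = np.array([[0.0, 0.5], [0.5, 0.0]])
Q2 = np.array([[0.0, 0.3], [0.7, 0.0]])
same = kl_cost(P2, P2)      # ~0
diff = kl_cost(P2, Q2)      # > 0
```

The asymmetric penalty described on the slide is visible in the formula: a large p_ij paired with a small q_ij inflates log(p_ij / q_ij).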

  • Learning: Gradient of t-SNE

    The t-SNE algorithm minimizes the KL divergence between the P and Q distributions.

        ∂C/∂y_i = 4 Σ_{j≠i} (p_ij − q_ij)(y_i − y_j) / (1 + ‖y_i − y_j‖²)    (5)

    Positive terms attract, negative terms repel.
    (Dissimilar data points, similar map points) ⇒ repulsion.
    Repulsions do not go to infinity.
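Equation (5) can be vectorized over all map points at once. This is a dense O(N²) sketch with an assumed helper name `tsne_gradient`; a real optimizer would add momentum and early exaggeration on top of it.

```python
import numpy as np

def tsne_gradient(P, Q, Y):
    """Eq. (5): dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)
    / (1 + ||y_i - y_j||^2), computed for all i at once."""
    diff = Y[:, None, :] - Y[None, :, :]        # (N, N, d): y_i - y_j for every pair
    d2 = np.sum(diff ** 2, axis=-1)
    W = (P - Q) / (1.0 + d2)                    # positive -> attraction, negative -> repulsion
    return 4.0 * np.sum(W[:, :, None] * diff, axis=1)

# Sanity check: if P already equals Q, all forces balance and the gradient is zero
Y = np.random.default_rng(1).normal(size=(4, 2))
num = 1.0 / (1.0 + np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1))
np.fill_diagonal(num, 0.0)
Q = num / num.sum()
grad = tsne_gradient(Q, Q, Y)
```

The (1 + ‖y_i − y_j‖²) denominator is why, as the slide notes, repulsions do not go to infinity.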

  • Learning: Physical Analogy

        ∂C/∂y_i = 4 Σ_{j≠i} (p_ij − q_ij)(y_i − y_j) / (1 + ‖y_i − y_j‖²)

    Physical analogy: springs, F = k·x, attraction/repulsion.

  • Why Student-t for q_ij, instead of a Gaussian?

    Q: How many equidistant data points can there be in 10 dimensions?
    Crowding problem: the area of the 2D map that is available to accommodate moderately distant data points is not large enough compared with the area available to accommodate nearby data points.

  • t-SNE in sklearn

    Follow the example: http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html
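A minimal usage sketch along the lines of the linked example, assuming scikit-learn is installed. It uses the small 8x8 digits dataset bundled with scikit-learn (an MNIST-like stand-in), subsampled for speed; the parameter choices are illustrative.

```python
# Requires scikit-learn; uses its bundled 8x8 digits dataset
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features each
X, y = X[:200], y[:200]               # subsample so the demo runs quickly
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# Y now holds one 2-D map point per image, ready for a scatter plot colored by y
```

Plotting Y with, e.g., matplotlib's scatter colored by the digit labels reproduces the cluster structure shown on the MNIST slides.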

  • Scalability: Barnes-Hut-SNE

    The original t-SNE has O(N²) data and computational complexity, which limits it to about 10K points.
    The Barnes-Hut-SNE (tree-based) algorithm reduces the complexity to O(N log N), scaling up to tens of millions of data points.
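In scikit-learn the Barnes-Hut approximation is selected via the `method` parameter of `TSNE`. A small sketch on synthetic data, with illustrative parameter values:

```python
# Requires scikit-learn; Barnes-Hut approximation vs. the exact O(N^2) method
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 10))   # synthetic high-dim data
# method='barnes_hut' runs in O(N log N); method='exact' would be O(N^2).
# angle trades speed for accuracy (0.0 = exact, larger = faster).
Y = TSNE(n_components=2, method='barnes_hut', angle=0.5,
         random_state=0).fit_transform(X)
```

Note that Barnes-Hut mode in scikit-learn only supports 2-D or 3-D embeddings, which matches the visualization use case of these slides.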

  • Review of t-SNE for images, speech, text

    (Flash Player should be installed on Windows to see the embedded video.)

  • Additional points

    Q: Why do I get a (slightly) different result every time I run t-SNE?
    Discussion: KL divergence in information theory.
    Q: We want p_ij = p_ji and defined p_ij = (p_{i|j} + p_{j|i}) / 2N. Why did we choose a symmetric similarity metric?
    Discussion: What is the best visualization method for high-dimensional data so far?
    Q: Is it feasible to use t-SNE to reduce a dataset to one dimension?
    A: Yes.

  • Summary, Q&A

    t-SNE is an effective method to visualize complex datasets.
    t-SNE exposes natural clusters.
    Implemented in many languages.
    Scalable with the O(N log N) version.

  • References

    Laurens van der Maaten's t-SNE page: https://lvdmaaten.github.io/tsne/
    Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
    https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
