Visualizing Data Using t-SNE

David Khosid, Dec. 21, 2015


TRANSCRIPT

  • Visualizing Data Using t-SNE

    David Khosid, Dec. 21, 2015

    Outline: Good visualization; Mathematical framework; Implementation

  • Agenda

    Good visualization
    Mechanics of t-SNE
    Examples: image, text, voice
    Scalability: visualizing large datasets, up to tens of millions of points
    Implementations: scikit-learn, Matlab, Torch

  • MNIST visualization with PCA

    This PCA visualization is terrible: the digit classes overlap heavily.

  • MNIST visualization with t-SNE in 2D

    t-SNE visualization can help you identify the various clusters.

    (a) MNIST in t-SNE  (b) Learning animation (view with Adobe Reader)

    YouTube link to 3D t-SNE: http://youtu.be/tMQAwqsMb6k

  • Good visualization (requirements)

    Each high-dimensional object is represented by a low-dimensional object.
    Preserve the neighborhood.
    Distant points correspond to dissimilar objects.
    Scalability: large, high-dimensional data sets.

  • Manifold Learning

    Manifolds: MNIST has about 10 intrinsic dimensions in its 28x28 images; images are roughly 100 dims, text roughly 1000 dims.
    PCA is mainly concerned with preserving large pairwise distances in the map.
    Example: the Swiss Roll.

  • Idea of t-SNE

    A data point is a point x_i in the original data space R^D.
    A map point is a point y_i in the map space R^2/R^3; every map point represents one of the original data points.
    t-SNE is a visualization algorithm that chooses the positions of the map points in R^2/R^3.

    t-SNE procedure:
    1. Compute an N x N similarity matrix in the original R^D space.
    2. Define an N x N similarity matrix in the low-dimensional embedding space (the learned objective).
    3. Define the cost function: the Kullback-Leibler divergence between the two probability distributions.
    4. Learn the low-dimensional embedding.

    Result: t-SNE focuses on accurately modelling small pairwise distances, i.e., on preserving local data structure in R^2/R^3.

  • Conditional similarity between two data points

    Similarity of data points x_i in data space R^D:

        p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)

    p_{j|i} measures how close x_j is to x_i, under a Gaussian distribution centered at x_i with a given variance σ_i².
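The conditional similarity above can be sketched in a few lines of NumPy. This is a toy illustration, not the reference implementation; the helper name `conditional_p` and the fixed bandwidth are illustrative assumptions (real t-SNE tunes σ_i per point, as the next slides explain).

```python
import numpy as np

def conditional_p(X, i, sigma_i):
    """p_{j|i}: Gaussian similarity of every x_j to x_i,
    normalized over all k != i (sigma_i is the bandwidth at x_i)."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)       # squared distances ||x_i - x_j||^2
    w = np.exp(-d2 / (2.0 * sigma_i ** 2))     # Gaussian kernel
    w[i] = 0.0                                 # exclude the point itself
    return w / w.sum()                         # normalize into probabilities

X = np.random.default_rng(0).normal(size=(5, 3))   # 5 toy points in R^3
p = conditional_p(X, 0, sigma_i=1.0)
```

Each row of conditionals sums to 1, with the self-similarity zeroed out.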

  • Symmetric similarity

    Similarity of data points x_i in data space R^D:

        p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)    (1)

    Make the similarity metric p_ij symmetric. The main advantage of symmetry is that it simplifies the gradient (learning stage):

        p_ij = (p_{i|j} + p_{j|i}) / 2N    (2)

    We set p_ii = 0, as we are interested only in pairwise similarities.
    σ_i is chosen such that each data point has a fixed perplexity (effective number of neighbors).
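Equation (2) can be sketched directly in NumPy. The helper name `symmetric_p` and the random row-stochastic demo matrix are illustrative assumptions; in real t-SNE the input would be the matrix of conditionals p_{j|i} from the previous slide.

```python
import numpy as np

def symmetric_p(P_cond):
    """Eq. (2): p_ij = (p_{i|j} + p_{j|i}) / (2N), given
    P_cond[i, j] = p_{j|i} with zero diagonal."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)

# Demo on a random row-stochastic matrix of conditionals
rng = np.random.default_rng(0)
C = rng.random((5, 5))
np.fill_diagonal(C, 0.0)              # p_{i|i} = 0
C /= C.sum(axis=1, keepdims=True)     # each row of conditionals sums to 1
P = symmetric_p(C)                    # symmetric joint distribution over all pairs
```

Since each of the N rows of conditionals sums to 1, dividing by 2N makes the symmetric matrix sum to exactly 1 over all pairs.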

  • Similarity of map points in low dimension

    Student t-distribution with one degree of freedom (same as the Cauchy distribution):

        q_ij = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠m} (1 + ‖y_k − y_m‖²)⁻¹    (3)

    We set q_ii = 0, as we are interested only in pairwise similarities.
    Heavy tails (discussed later).
    Still closely related to the Gaussian.
    Computationally convenient (no exponent).
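Equation (3) translates to a dense NumPy sketch as follows. The helper name `student_t_q` is an illustrative assumption, and this O(N²) formulation is only for small demos.

```python
import numpy as np

def student_t_q(Y):
    """Eq. (3): q_ij from map points Y using the heavy-tailed
    Student-t (1 dof) kernel, normalized over all pairs k != m."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # ||y_i - y_j||^2
    num = 1.0 / (1.0 + d2)         # (1 + ||y_i - y_j||^2)^-1, no exponentials
    np.fill_diagonal(num, 0.0)     # q_ii = 0
    return num / num.sum()

Y = np.random.default_rng(0).normal(size=(6, 2))   # 6 toy map points in R^2
Q = student_t_q(Y)
```

As on the slide, no exponential is evaluated, which is part of what makes the Student-t kernel computationally convenient.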

  • Kullback-Leibler divergence (cost function)

    (p_ij) is fixed, (q_ij) is flexible. We want (p_ij) and (q_ij) to be as close as possible.

        C = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (4)

    KL divergence:
    It is not a distance, since it is asymmetric.
    Large p_ij modelled by small q_ij: large penalty.
    Small p_ij modelled by large q_ij: small penalty.
    KL divergence meaning: cross-entropy.
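Equation (4) is a one-liner in NumPy once zero entries are handled. The helper name `kl_cost` and the tiny 2x2 demo distributions are illustrative assumptions.

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """Eq. (4): C = sum_ij p_ij * log(p_ij / q_ij).
    Terms with p_ij = 0 contribute nothing; eps guards log(0)."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

# Tiny demo: identical distributions give ~0 cost, different ones a positive cost
P2 = np.array([[0.0, 0.5], [0.5, 0.0]])
Q2 = np.array([[0.0, 0.3], [0.7, 0.0]])
same = kl_cost(P2, P2)      # ~0
diff = kl_cost(P2, Q2)      # > 0
```

The asymmetric penalty described on the slide is visible in the formula: a large p_ij paired with a small q_ij inflates log(p_ij / q_ij).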

  • Learning: Gradient of t-SNE

    The t-SNE algorithm minimizes the KL divergence between the P and Q distributions.

        ∂C/∂y_i = 4 Σ_{j≠i} (p_ij − q_ij)(y_i − y_j) / (1 + ‖y_i − y_j‖²)    (5)

    Positive terms attract, negative terms repel.
    (Dissimilar data points, similar map points) ⇒ repulsion.
    Repulsions do not go to infinity.
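Equation (5) can be vectorized over all map points at once. This is a dense O(N²) sketch with an assumed helper name `tsne_gradient`; a real optimizer would add momentum and early exaggeration on top of it.

```python
import numpy as np

def tsne_gradient(P, Q, Y):
    """Eq. (5): dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)
    / (1 + ||y_i - y_j||^2), computed for all i at once."""
    diff = Y[:, None, :] - Y[None, :, :]        # (N, N, d): y_i - y_j for every pair
    d2 = np.sum(diff ** 2, axis=-1)
    W = (P - Q) / (1.0 + d2)                    # positive -> attraction, negative -> repulsion
    return 4.0 * np.sum(W[:, :, None] * diff, axis=1)

# Sanity check: if P already equals Q, all forces balance and the gradient is zero
Y = np.random.default_rng(1).normal(size=(4, 2))
num = 1.0 / (1.0 + np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1))
np.fill_diagonal(num, 0.0)
Q = num / num.sum()
grad = tsne_gradient(Q, Q, Y)
```

The (1 + ‖y_i − y_j‖²) denominator is why, as the slide notes, repulsions do not go to infinity.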

  • Learning: Physical Analogy

        ∂C/∂y_i = 4 Σ_{j≠i} (p_ij − q_ij)(y_i − y_j) / (1 + ‖y_i − y_j‖²)

    Physical analogy: springs, F = k·x, attraction/repulsion.

  • Why Student-t for q_ij, instead of a Gaussian?

    Q: How many equidistant data points can there be in 10 dimensions?
    Crowding problem: the area of the 2D map that is available to accommodate moderately distant data points is not large enough compared with the area available to accommodate nearby data points.

  • t-SNE in sklearn

    Follow the example: http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html
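A minimal usage sketch along the lines of the linked example, assuming scikit-learn is installed. It uses the small 8x8 digits dataset bundled with scikit-learn (an MNIST-like stand-in), subsampled for speed; the parameter choices are illustrative.

```python
# Requires scikit-learn; uses its bundled 8x8 digits dataset
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features each
X, y = X[:200], y[:200]               # subsample so the demo runs quickly
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# Y now holds one 2-D map point per image, ready for a scatter plot colored by y
```

Plotting Y with, e.g., matplotlib's scatter colored by the digit labels reproduces the cluster structure shown on the MNIST slides.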

  • Scalability: Barnes-Hut-SNE

    The original t-SNE has O(N²) data and computational complexity, which limits it to about 10K points.
    The Barnes-Hut-SNE (tree-based) algorithm reduces the complexity to O(N log N), scaling up to tens of millions of data points.
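In scikit-learn the Barnes-Hut approximation is selected via the `method` parameter of `TSNE`. A small sketch on synthetic data, with illustrative parameter values:

```python
# Requires scikit-learn; Barnes-Hut approximation vs. the exact O(N^2) method
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 10))   # synthetic high-dim data
# method='barnes_hut' runs in O(N log N); method='exact' would be O(N^2).
# angle trades speed for accuracy (0.0 = exact, larger = faster).
Y = TSNE(n_components=2, method='barnes_hut', angle=0.5,
         random_state=0).fit_transform(X)
```

Note that Barnes-Hut mode in scikit-learn only supports 2-D or 3-D embeddings, which matches the visualization use case of these slides.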

  • Review of t-SNE for images, speech, text

    (Flash Player should be installed on Windows to see the embedded video.)

  • Additional points

    Q: Why do I get a (slightly) different result every time I run t-SNE?
    Discussion: KL divergence in information theory.
    Q: We want p_ij = p_ji and defined p_ij = (p_{i|j} + p_{j|i}) / 2N. Why did we choose a symmetric similarity metric?
    Discussion: What is the best visualization method for high-dimensional data so far?
    Q: Is it feasible to use t-SNE to reduce a dataset to one dimension?
    A: Yes.

  • Summary, Q&A

    t-SNE is an effective method to visualize complex datasets.
    t-SNE exposes natural clusters.
    Implemented in many languages.
    Scalable with the O(N log N) version.

  • References

    Laurens van der Maaten's t-SNE page: https://lvdmaaten.github.io/tsne/
    Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
    https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
