data visualization at codetalks 2016

31
Visualizing High-dimensional Data Dr. Stefan Kühn Lead Data Scientist code.talks - 30.09.2016

Upload: stefan-kuehn

Post on 21-Feb-2017

118 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: Data Visualization at codetalks 2016

VisualizingHigh-dimensionalData

Dr. Stefan KühnLead Data Scientist

code.talks - 30.09.2016

Page 2: Data Visualization at codetalks 2016

Overview

• Data Visualization as you might know it• Main Properties of Graphics (and Humans)• A short story about Charts• Pair Plots and Correlations

• Data Visualization as you might not know it• Fundamental Problems• SVD, t-SNE and other approximations• Principal Components and Principal Curves • Parallel Coordinates• The Grand Tour

2

Page 3: Data Visualization at codetalks 2016

Takeaways - hopefully ;-)

• Data Visualization is complicated• There is always an approximation• There is always a bias• There is always a misinterpretation

• Data Visualization is simple• Lots of packages available• Lots of studies + literature• Lots of examples to learn from

3

Page 4: Data Visualization at codetalks 2016

Data Visualization Basics

4

Page 5: Data Visualization at codetalks 2016

The Modes of Perception

5

X

X

X

X

X

XX

O

O

O O

O

O

O

OX

X

XX

O

O

O

OX

X

X

X

X

X

XX

O

O

O O

O

O

O

O

X

X X

X

XO

OO

Fast Slow

Find the outlier

Page 6: Data Visualization at codetalks 2016

The Modes of Perception

6

• Pre-attentive• fast• parallel processing• effortless

• Pattern recognition• semi-fast• governed by laws of per by

• Attentive• slow• sequential• high effort (attention a very limited resource)

Page 7: Data Visualization at codetalks 2016

Main Properties of Graphics

7

Category ExamplePosition

Shape

Size

Color

Orientation (Line)

Length (Line)

Type and Size (Line)

Brightness

Page 8: Data Visualization at codetalks 2016

Main Properties of Graphics Humans

8

Category Amount of pre-attentive informationPosition very high

Shape ———

Size approx. 4

Color approx. 8

Orientation (Line) approx. 4

Length (Line) ———

Type and Size (Line) ———

Brightness approx. 8

Page 9: Data Visualization at codetalks 2016

Pre-attentive perception

9

• Position• fast• effective• high number of different positions

• Color• use with care

• Shape• Orientation

Pre-attentive perception is effortless.Exploit this as much as you can.

Page 10: Data Visualization at codetalks 2016

Pattern detection

10

„It is interesting to note that our brain […] subconsciously always prefers meaningful situations and objects.“

• Emergence• Reiification• Multi-stability• Invariance

Pattern detection can be trained.Exploit this for frequent visualizations.

Page 11: Data Visualization at codetalks 2016

What is this?

11

Page 12: Data Visualization at codetalks 2016

What is this?

12

Page 13: Data Visualization at codetalks 2016

What is this?

13

Page 14: Data Visualization at codetalks 2016

Laws of Gestalt

14

„It is interesting to note that our brain, inaccordance with the laws of Gestalt, subconsciously always prefers meaningful situations and objects.“

Page 15: Data Visualization at codetalks 2016

Accuracy of Graphics

15

Square Pie vs Stacked Bar vs Pie vs Donut

What do you think?

https://eagereyes.org/blog/2016/a-reanalysis-of-a-study-about-square-pie-charts-from-2009

Page 16: Data Visualization at codetalks 2016

Demo

16

Page 17: Data Visualization at codetalks 2016

Beyond the Basics

17

Page 18: Data Visualization at codetalks 2016

Fundamental Problems

• No accurate method in higher dimensions• Approximations methods• „Simulated“ dimensions (color, size, shape)• Animations?

• No notion of quality or accuracy for Visualizations• Information Theory?• „Stability“?

All Visualizations are wrong, but some are useful.

18

Page 19: Data Visualization at codetalks 2016

Approximation methods

• Pair Plots• Axis-aligned projections • Interpretable in terms of original variables

• Singular Value Decomposition• Optimal with respect to 2-norm (Euclidean norm)

and supremum norm• Comes with an error estimate

• Other methods• Stochastic Neighborhood Embedding (t-SNE)• „Manifold Learning“

19

Page 20: Data Visualization at codetalks 2016

Demo

20

Page 21: Data Visualization at codetalks 2016

Manifold Learning Methods

• Locally Linear Embedding• Neighborhood-preserving embedding

• Isomap• quasi-isometric

• Multi-dimensional scaling• quasi-isometric

• Spectral Embedding• Spectral clustering based on similarity

• Stochastic Neighborhood Embedding (SNE, t-SNE)• preserves probabilities

• Local Tangent Space Alignement (LTSA)

21

Page 22: Data Visualization at codetalks 2016

Local Tangent Space Alignement

22

Page 23: Data Visualization at codetalks 2016

Principal Components and Curves

• Principal Component Analysis• orthogonal decomposition based on SVD• linear in all variables• tries to preserve variance

• Principal Curves• minimize the Sum of Squared Errors with respect

to all variables (as PCA, preserve variance)• nonlinear• smooth

23

Page 24: Data Visualization at codetalks 2016

Principal Components and Curves

24

Page 25: Data Visualization at codetalks 2016

Parallel Coordinates

• Parallel Coordinates• especially useful for high-dimensional data• depends on ordering and scaling

25

Page 26: Data Visualization at codetalks 2016

Demo

26

Page 27: Data Visualization at codetalks 2016

The Grand Tour

• Animated sequence of 2-D projections• https://en.wikipedia.org/wiki/

Grand_Tour_(data_visualisation)• Asimov (1985): The grand tour: a tool for viewing

multidimensional data.• Underlying idea• Randomly generate 2-D projections (random

walk)• Over time generate a dense subset of all

possible 2-D projections• Optional: Follow a given path / guided tour

27

Page 28: Data Visualization at codetalks 2016

The Grand Tour

28

Page 29: Data Visualization at codetalks 2016

The Grand Tour

29

Page 30: Data Visualization at codetalks 2016

Demo

30