meaningful visual exploration of massive datammds-data.org/presentations/2016/s-wang.pdf · bokeh...

41
© 2016 Continuum Analytics Meaningful Visual Exploration of Massive Data Peter Wang Continuum Analytics CTO & Co-Founder @pwang

Upload: others

Post on 20-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

© 2016 Continuum Analytics

Meaningful Visual Exploration of Massive Data

Peter Wang Continuum Analytics CTO & Co-Founder

@pwang

Page 2: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Big data magnifies small problems

2

• Big data presents storage and computation problems • More importantly, standard plotting tools have problems that are

magnified by big data: • Overdrawing/Overplotting • Saturation • Undersaturation • Binning issues

Page 3: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Overdrawing

3

• For a scatterplot, the order in which points are drawn is very important

• The same distribution can look entirely different depending on plotting order

• Last data plotted overplots

Page 4: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Overdrawing

4

• Underlying issue is just occlusion

• Same problem happens with one category, but less obvious

• Can prevent occlusion using transparency

Page 5: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Saturation

5

• E.g. for alpha = 0.1, up to 10 points can overlap before saturating the available brightness

• Now the order of plotting matters less

• After 10 points, first-plotted data still lost

• For one category, 10, 20, or 2000 points overlapping will look identical

Page 6: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Saturation

6

• Same alpha value, more points:

• Now is highly misleading

• alpha value depends on size, overlap of dataset

• Difficult-to-set parameter, hard to know when data is misrepresented

Page 7: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Saturation

7

• Can try to reduce point size to reduce overplotting and saturation

• Now points are hard to see, with no guarantee of avoiding problems

• Another difficult-to-set parameter

• For really big data, scatterplots start to become very inefficient, because there are many datapoints per pixel — may as well be binning by pixel

Page 8: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Binning issues

8

• Can use heatmap instead of scatter

• Avoids saturation by auto-ranging on bins

• Result independent of data size

• Here two merged normal distributions look very different at different binning

• Another difficult-to-set parameter

Page 9: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Notebook

9

Page 10: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Real world example

10

Page 11: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

So what?

11

Page 12: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Plotting big data

12

• When exploring really big data, the visualization is all you have — there’s no way to look at each of the individual data points

• Common plotting problems can lead to completely incorrect conclusions based on misleading visualizations

• Slow processing makes trial and error approach ineffective

When data is large, you don’t know when the viz is lying.

Page 13: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Datashading

13

• Flexible, configurable pipeline for automatic plotting • Provides flexible plugins for viz stages, like in graphics shaders • Completely prevents overplotting, saturation, and undersaturation • Mitigates binning issues by providing fully interactive exploration in web

browsers, even of very large datasets on ordinary machines • Statistical transformations of data are a first-class aspect of the

visualization • Allows rapid iteration of visual styles & configs, interactive selections and

filtering, to support data exploration

Page 14: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Datashading Pipeline: Projection

Data

Project / Synthesize

Scene

• Stage 1: select variables (columns) to project onto the screen

• Data often filtered at this stage

14

Page 15: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Datashading Pipeline: Aggregation

Data

Project / Synthesize

Scene Aggregates

Sample / Raster

• Stage 2: Aggregate data into a fixed set of bins

• Each bin yields one or more scalars (total count, mean, stddev, etc.)

15

Page 16: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Datashading Pipeline: Transfer

Data

Project / Synthesize

Scene Aggregates

Sample / Raster Transfer

Image

• Stage 3: Transform data using one or more transfer functions, culminating in a function that yields a visible image

• Each stage can be replaced and configured separately

16

Page 17: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Infovis Pipeline

17

Visual Abstraction

DataTransforms

VisualMappings

ViewTransforms

Data Tables

Source Data Views

Page 18: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Infovis Pipeline… modified

18

Visual Abstraction

DataTransforms

VisualMappings

ViewTransforms

Data Tables

Source Data Views

Selection Aggregation Transfer

SignificantSet Aggregates

Page 19: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

NYC Taxi DemoThis demo shows how traditional plotting tools break down for large datasets, and how to use datashading to make even large datasets practical interactively.

19

• Data for 10 million New York City taxi trips

• Even 100,000 points gets slow for scatterplot

• Parameters usually need adjusting for every zoom

• True relationships within data not visible in std plot

Datashading automatically reveals the entire dataset, including outliers, hot spots, and missing data

Page 20: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Categorical data: 2010 US Census

20

• One point per person

• 300 million total • Categorized by

race • Datashading

shows faithful distribution per pixel

Page 21: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

OSM Dataset: 3 Billion PointsBecause Datashader decouples the data-processing from the visualization, it can handle arbitrarily large data

21

• About 3 billion GPS coordinates

• https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/.

• This image was rendered in one minute on a standard MacBook with 16 GB RAM

• Renders in 7 seconds on a 128GB Amazon EC2 instance

Page 22: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Additional

• Timeseries • Trajectory • Race & Elevation • API / Pipeline walkthrough

22

Page 23: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Bokeh

• Interactive visualization • Novel graphics • Streaming, dynamic, large data • For the browser, with or without a server • No need to write Javascript

23

http://bokeh.pydata.org

Page 24: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Versatile Plotting Capabilities

24

Page 25: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Linked plots, tools

25

• Easy to show multiple plots and link them • Easy to link data selections between plots • Can easily customize the kind of linkage straight from

Python, without needing to fiddle around with JS

Page 26: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Large data

• With easy WebGL support, can scale to 500k points or so

• Bottlenecks are browser performance, JSON encoding, network transport

26

Page 27: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

rBokehPlays well with R ecosystem: HTMLwidget, RMarkdown…

27

http://hafen.github.io/rbokeh

Page 28: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

rBokeh with RStudio & ShinyPlays well with R ecosystem: HTMLwidget, RMarkdown…

28

Page 29: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Bokeh Apps: Shiny for Python

• Fully interactive data web apps • Streaming data, dynamic data • Easy-to-write pure Python charts, widgets, event

handlers • Open source (BSD licensed), including server • Enterprise on-prem version in Anaconda Enterprise,

with Active Directory/LDAP auth

29

Page 30: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Example Apps

30

Page 31: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Easy Streaming Apps

In this demo, we will demonstrate how the Bokeh server makes it easy to visualize streaming and dynamic data.

31

• A minimal example with < 50 LOC • Demonstrates ease of pushing data

from Python code into the browser

Page 33: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Embeds Well

33

http://cecp.mit.edu

Page 34: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

For more information on Bokeh Apps

• Webinar: http://www.slideshare.net/continuumio/hassle-free-data-science-apps-with-bokeh-webinar

• PyData Videos, Tutorials

34

Page 35: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Community & AdoptionGithub • 4100+ stars • 860+ forks

Mailing list • 400+ members • 150+ posts in November

Downloads • 45,000 / month (conda) • 4,000 / month (pip)

Page 36: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Architecture

Page 37: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

37Server-side Data Processing: Python, Java, etc.

HTML

Javascript

D3 Highcharts Flot nvd3 dcjs

JavaScript Plotting library

CSV, SQL

Data

Traditional Web Visualization

CSSTech: • Python/R/Java • HTML & browser compat • CSS/LESS/Sass • JS plotting library API • Javascript

• jQuery, underscore

• svg, canvas2D • webGL, three.js • React • Angular • node.js,

browserify, gulp, grunt, npm, …

Page 38: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Browser

HTML

38

HTML

CSSJavascript

User

Data

Python, Ruby, Java, .NET

Server

Traditional Web Viz - Interaction

Javascript

Javascript

Data’

Simple dashboard: Server language generating HTML, JS, CSS styling, subset of data

Handling user interaction: Custom Javascript, calling Server endpoint, which generates updated JSON or JS that gets pushed back to client via websocket

Page 39: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Server

Bokeh BokehJS

JSON

(HTML, CSS)

Client

Bokeh Conceptual Architecture

UserPython, R,

Scala

Data

Simple dashboard: Single language, no need to write HTML, JS, CSS

Handling user interaction: Single language that you already know; interactive data updates feel seamless to the user

Page 40: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

• Skills required: 5-10 skills • Time to market: weeks to months • Server code: 100s to 1000s lines

• Skills required: ~1 skill • Time to market: minutes • Server code: 0

Client

Data

BokehJSPython, RBokeh

Server

Python, Ruby Java, .NET

Data

Client

CSSData

Comparison Chart

Page 41: Meaningful Visual Exploration of Massive Datammds-data.org/presentations/2016/s-wang.pdf · Bokeh Apps: Shiny for Python • Fully interactive data web apps • Streaming data, dynamic

Thank you

Email: [email protected]

Twitter: @ContinuumIO

Peter WangTwitter: @pwang

Bokeh

Twitter: @bokehplots