how to interactively visualise and explore a billion objects (wit vaex)

74
How to interactively visualise and explore a billion objects (with vaex) Maarten Breddels & Amina Helmi PyData Paris 2016

Upload: ali-ziane-myriam

Post on 21-Jan-2018

150 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: How to interactively visualise and explore a billion objects (wit vaex)

How to interactively visualise and explore a billion objects (with vaex)

Maarten Breddels&

Amina Helmi

PyData Paris 2016

Page 2: How to interactively visualise and explore a billion objects (wit vaex)

Outline

• Motivation

• Technical

• Vaex

• Conclusions

Page 3: How to interactively visualise and explore a billion objects (wit vaex)

Copyright ESA/ATG medialab; background: ESO/S. Brunier

• Gaia satellite

• launched by ESA in december 2013

• determines positions, velocities and astrophysical parameters of >109 stars of the Milky Way

• First catalogue in September 2016

• Final catalogue ~2020

Page 4: How to interactively visualise and explore a billion objects (wit vaex)

Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech

Page 5: How to interactively visualise and explore a billion objects (wit vaex)

image:Devin Powell

Page 6: How to interactively visualise and explore a billion objects (wit vaex)

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network

Page 7: How to interactively visualise and explore a billion objects (wit vaex)

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network

• Problem

• Scatter plots do not work well for 109

rows/objects (like Gaia)

• Work with densities/averages in 1,2 and 3d

• Interactive?

• Zoom, pan etc

• Explore

• selections/queries

• Subspace ranking

Page 8: How to interactively visualise and explore a billion objects (wit vaex)

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network

• Problem

• Scatter plots do not work well for 109

rows/objects (like Gaia)

• Work with densities/averages in 1,2 and 3d

• Interactive?

• Zoom, pan etc

• Explore

• selections/queries

• Subspace ranking

Page 9: How to interactively visualise and explore a billion objects (wit vaex)

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network

• Problem

• Scatter plots do not work well for 109

rows/objects (like Gaia)

• Work with densities/averages in 1,2 and 3d

• Interactive?

• Zoom, pan etc

• Explore

• selections/queries

• Subspace ranking

Page 10: How to interactively visualise and explore a billion objects (wit vaex)

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network

• Problem

• Scatter plots do not work well for 109

rows/objects (like Gaia)

• Work with densities/averages in 1,2 and 3d

• Interactive?

• Zoom, pan etc

• Explore

• selections/queries

• Subspace ranking

Page 11: How to interactively visualise and explore a billion objects (wit vaex)

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network

• Problem

• Scatter plots do not work well for 109

rows/objects (like Gaia)

• Work with densities/averages in 1,2 and 3d

• Interactive?

• Zoom, pan etc

• Explore

• selections/queries

• Subspace ranking

Page 12: How to interactively visualise and explore a billion objects (wit vaex)

Motivation• We will soon have the Gaia catalogue

• > 109 objects/stars

• Can we visualise and explore this?

• We want to ‘see’ the data

• Data checks (reduction issues)

• Science: trends, relations, clustering

• You are the (biological) neutral network

• Problem

• Scatter plots do not work well for 109

rows/objects (like Gaia)

• Work with densities/averages in 1,2 and 3d

• Interactive?

• Zoom, pan etc

• Explore

• selections/queries

• Subspace ranking• Python

• rich set of libraries, becoming the default in science

Page 13: How to interactively visualise and explore a billion objects (wit vaex)

Situation

• TOPCAT comes close, not fast enough, works with individual rows/particles, no exploratory tools, written in Java.

• Your own IDL/Python code: a lot to consider to do it optimal (multicore, efficient storage, efficient algorithms, interactive becomes complex)

• DataShader?

• We want something to visualize 109 rows/objects in ~1 second

• Do we need to resort to Big Data solutions?

Page 14: How to interactively visualise and explore a billion objects (wit vaex)

Big data?• Big data

• definitions vary

• Practical question to ask

• Do you need Big Data solutions?

• Many computers / grid

• Invest in software

• Or

• Can we do it on a single computer ‘old style’

Page 15: How to interactively visualise and explore a billion objects (wit vaex)

Outline

• Motivation

• Technical

• Vaex

• Conclusions

Page 16: How to interactively visualise and explore a billion objects (wit vaex)

Interactive? How?

• Interactive?

• 109

* 2 * 8 bytes = 15 GiB (double is 8 bytes)

• Memory bandwidth: 10-20 GiB/s: ~1 second

• CPU: 3 Ghz (but multicore, say 4): 12 cycles/second

• Few cycles per row/object, simple algorithm

• Histograms/Density grids

• Yes, but

• If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds)

• proper storage and reading of data

• simple and fast algorithm for binning

•Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo•6. *108 particles• < 1 second

Page 17: How to interactively visualise and explore a billion objects (wit vaex)

Interactive? How?

• Interactive?

• 109

* 2 * 8 bytes = 15 GiB (double is 8 bytes)

• Memory bandwidth: 10-20 GiB/s: ~1 second

• CPU: 3 Ghz (but multicore, say 4): 12 cycles/second

• Few cycles per row/object, simple algorithm

• Histograms/Density grids

• Yes, but

• If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds)

• proper storage and reading of data

• simple and fast algorithm for binning

•Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo•6. *108 particles• < 1 second

Page 18: How to interactively visualise and explore a billion objects (wit vaex)

• Storage: native, column based (hdf5, fits)

• Normal (POSIX read) method:

• Allocate memory

• read from disk to memory

• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)

• Wastes memory (actually disk cache)

• 15 GB data, requires 30 GB is you want to use the file system cache

cache

How to store and read the data

Page 19: How to interactively visualise and explore a billion objects (wit vaex)

• Storage: native, column based (hdf5, fits)

• Normal (POSIX read) method:

• Allocate memory

• read from disk to memory

• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)

• Wastes memory (actually disk cache)

• 15 GB data, requires 30 GB is you want to use the file system cache

cache

How to store and read the data

• Memory mapping:

• get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages)

• avoid memory copies, more cache available

• In previous example:

• copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s

• Can be 2-3x slower (cpu cache helps a bit)

Page 20: How to interactively visualise and explore a billion objects (wit vaex)

• Storage: native, column based (hdf5, fits)

• Normal (POSIX read) method:

• Allocate memory

• read from disk to memory

• Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy)

• Wastes memory (actually disk cache)

• 15 GB data, requires 30 GB is you want to use the file system cache

cache

How to store and read the data

• Memory mapping:

• get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages)

• avoid memory copies, more cache available

• In previous example:

• copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s

• Can be 2-3x slower (cpu cache helps a bit)

Page 21: How to interactively visualise and explore a billion objects (wit vaex)

Translate to something visual

• 1d: histogram

• 2d:

• histogram

• use colormap to map to a color

• 3d: volume rendering

Page 22: How to interactively visualise and explore a billion objects (wit vaex)

Translate to something visual

• 1d: histogram

• 2d:

• histogram

• use colormap to map to a color

• 3d: volume rendering

Page 23: How to interactively visualise and explore a billion objects (wit vaex)

Translate to something visual

• 1d: histogram

• 2d:

• histogram

• use colormap to map to a color

• 3d: volume rendering

Page 24: How to interactively visualise and explore a billion objects (wit vaex)

How to get good performance

• Pure python

• slow, GIL

• Numpy

• numpy.histogram slow

• C extension

• fast, no GIL, slower development

• less boilerplate code: cffi

• Numba @jit(nopython=True, nogil=True)

• Python code, c performance

• (nogil didn’t exist)

Page 25: How to interactively visualise and explore a billion objects (wit vaex)

How to get good performance

• Pure python

• slow, GIL

• Numpy

• numpy.histogram slow

• C extension

• fast, no GIL, slower development

• less boilerplate code: cffi

• Numba @jit(nopython=True, nogil=True)

• Python code, c performance

• (nogil didn’t exist)

Page 26: How to interactively visualise and explore a billion objects (wit vaex)

How to get good performance

• Pure python

• slow, GIL

• Numpy

• numpy.histogram slow

• C extension

• fast, no GIL, slower development

• less boilerplate code: cffi

• Numba @jit(nopython=True, nogil=True)

• Python code, c performance

• (nogil didn’t exist)

Page 27: How to interactively visualise and explore a billion objects (wit vaex)

How to get good performance

• Pure python

• slow, GIL

• Numpy

• numpy.histogram slow

• C extension

• fast, no GIL, slower development

• less boilerplate code: cffi

• Numba @jit(nopython=True, nogil=True)

• Python code, c performance

• (nogil didn’t exist)

600 times faster

Page 28: How to interactively visualise and explore a billion objects (wit vaex)
Page 29: How to interactively visualise and explore a billion objects (wit vaex)
Page 30: How to interactively visualise and explore a billion objects (wit vaex)

Great… 2d histograms…1 32 4 .. 9 11x4 7 41 .. 91 61y

+1

Page 31: How to interactively visualise and explore a billion objects (wit vaex)

Great… 2d histograms…• There is more

• Weighted histogram

• Don’t sums 1’s

• Sum values

1 32 4 .. 9 11x4 7 41 .. 91 61y

+1

Page 32: How to interactively visualise and explore a billion objects (wit vaex)

Great… 2d histograms…• There is more

• Weighted histogram

• Don’t sums 1’s

• Sum values

1 32 4 .. 9 11x4 7 41 .. 91 61y

+12 1 4 .. 8 3v

+v, or v2

Page 33: How to interactively visualise and explore a billion objects (wit vaex)

Great… 2d histograms…• There is more

• Weighted histogram

• Don’t sums 1’s

• Sum values

1 32 4 .. 9 11x4 7 41 .. 91 61y

+12 1 4 .. 8 3v

+v, or v2• Possibilities

• Total: flux, mass

• Mean: velocity, metallicity

• Dispersions: velocity…

• Basically statistics on a grid

Page 34: How to interactively visualise and explore a billion objects (wit vaex)

Examples

Page 35: How to interactively visualise and explore a billion objects (wit vaex)

Examples

Page 36: How to interactively visualise and explore a billion objects (wit vaex)

Examples

Page 37: How to interactively visualise and explore a billion objects (wit vaex)

Exploration

• Selections

• expressions

• visual

• linked views

• Subspace ranking

Page 38: How to interactively visualise and explore a billion objects (wit vaex)

Exploration

• Selections

• expressions

• visual

• linked views

• Subspace ranking

Page 39: How to interactively visualise and explore a billion objects (wit vaex)

Exploration

• Selections

• expressions

• visual

• linked views

• Subspace ranking

Page 40: How to interactively visualise and explore a billion objects (wit vaex)

Exploration

• Selections

• expressions

• visual

• linked views

• Subspace ranking

Page 41: How to interactively visualise and explore a billion objects (wit vaex)

Outline

• Motivation

• Technical

• Vaex

• Conclusions

Page 42: How to interactively visualise and explore a billion objects (wit vaex)

Vaex: Visualization And EXploration

• A library

• python package

• ‘import vaex’

• reading of data

• multithreading

• binning (1,2,3, Nd)

• selections/queries

• server/client

• integrates with IPython notebook

• A GUI program

• Gives interactive navigation, zoom, pan

• interactive selection (lasso, rectangle)

• client

• undo/redo

Page 43: How to interactively visualise and explore a billion objects (wit vaex)

Example data / vaex.example()• Helmi and de Zeeuw 2000

• build up of a MW (stellar) halo from 33 satellites

• 3.3 * 106 particles

• Almost smooth in configuration space (x,y,z)

• Structure visible in E-Lz-L space

Page 44: How to interactively visualise and explore a billion objects (wit vaex)
Page 45: How to interactively visualise and explore a billion objects (wit vaex)
Page 46: How to interactively visualise and explore a billion objects (wit vaex)
Page 47: How to interactively visualise and explore a billion objects (wit vaex)
Page 48: How to interactively visualise and explore a billion objects (wit vaex)

Exploration: Automated

• High dimensionality → many subspaces

• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-Lz, z-Lz

• Can we automate this / at least help?

• Ranking subspaces using Mutual Information

Page 49: How to interactively visualise and explore a billion objects (wit vaex)

Mutual information• Measure the information loss between

p(x,y) and p(x)p(y)

• information loss measured using the KL divergence:

• In short:

• How much does breaking up correlation (not just linear) change density distribution?

Page 50: How to interactively visualise and explore a billion objects (wit vaex)

Ranking by Mutual informationp(x,y) p(x)p(y) I(X;Y)

0.003

0.837

x,y

Lz,L

Page 51: How to interactively visualise and explore a billion objects (wit vaex)

Ranking by Mutual informationp(x,y) p(x)p(y) I(X;Y)

0.003

0.837

x,y

Lz,L

Page 52: How to interactively visualise and explore a billion objects (wit vaex)

Expressions/Virtual columns

• Not just visualize existing columns

• mathematical operations

• sqrt(x**2+y**2)

• log(r)

• Virtual columns

• r=sqrt(x**2+y**2)

• Suggest common ones

• equatorial to galactic coordinates

Page 53: How to interactively visualise and explore a billion objects (wit vaex)
Page 54: How to interactively visualise and explore a billion objects (wit vaex)
Page 55: How to interactively visualise and explore a billion objects (wit vaex)
Page 56: How to interactively visualise and explore a billion objects (wit vaex)
Page 57: How to interactively visualise and explore a billion objects (wit vaex)

How to get the data

• Gaia: ~1-2 TB

• download

• torrent

• Do you need to?

• server/client?

Page 58: How to interactively visualise and explore a billion objects (wit vaex)

Client/Server

• 2d histogram example

• raw data:15 GB

• binned 256x256 grid

• 500KB, or 10-100kb compressed

• Proof of concept

• over http / websocket

• vaex webserver [file …]

Page 59: How to interactively visualise and explore a billion objects (wit vaex)
Page 60: How to interactively visualise and explore a billion objects (wit vaex)
Page 61: How to interactively visualise and explore a billion objects (wit vaex)

Random subset / shuffled storage

Page 62: How to interactively visualise and explore a billion objects (wit vaex)

Random subset / shuffled storage

Page 63: How to interactively visualise and explore a billion objects (wit vaex)

Random subset / shuffled storage

Page 64: How to interactively visualise and explore a billion objects (wit vaex)

Random subset / shuffled storage

Page 65: How to interactively visualise and explore a billion objects (wit vaex)

Random subset / shuffled storage

Page 66: How to interactively visualise and explore a billion objects (wit vaex)

dask?wendelin?

Page 67: How to interactively visualise and explore a billion objects (wit vaex)

Get vaex• standalone binary (OS X , Linux) (just download and start)

• www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’)

• www.github.com/maartenbreddels/vaex

• In Python

• Vanilla Python

• pip install —user —pre vaex

• Anaconda

• conda install -c maartenbreddels vaex

Page 68: How to interactively visualise and explore a billion objects (wit vaex)

Summary• vaex: visualisation and exploration

• of large datasets 106-9

objects

• fast: > 109 objects/second

• 1-2 and 3d visualisation using densities

• ~6d with vector field overlaid

• Explore using selections+linked views or automated ranking of subspaces

• Can run on a single computer, zero setup

• Or as a server

• Main goal is Gaia catalogue, but tested/suitable for

• other catalogues

• simulations (N body)

Page 69: How to interactively visualise and explore a billion objects (wit vaex)

Summary• vaex: visualisation and exploration

• of large datasets 106-9

objects

• fast: > 109 objects/second

• 1-2 and 3d visualisation using densities

• ~6d with vector field overlaid

• Explore using selections+linked views or automated ranking of subspaces

• Can run on a single computer, zero setup

• Or as a server

• Main goal is Gaia catalogue, but tested/suitable for

• other catalogues

• simulations (N body)

Page 70: How to interactively visualise and explore a billion objects (wit vaex)

Summary• vaex: visualisation and exploration

• of large datasets 106-9

objects

• fast: > 109 objects/second

• 1-2 and 3d visualisation using densities

• ~6d with vector field overlaid

• Explore using selections+linked views or automated ranking of subspaces

• Can run on a single computer, zero setup

• Or as a server

• Main goal is Gaia catalogue, but tested/suitable for

• other catalogues

• simulations (N body)

Page 71: How to interactively visualise and explore a billion objects (wit vaex)

Summary• vaex: visualisation and exploration

• of large datasets 106-9

objects

• fast: > 109 objects/second

• 1-2 and 3d visualisation using densities

• ~6d with vector field overlaid

• Explore using selections+linked views or automated ranking of subspaces

• Can run on a single computer, zero setup

• Or as a server

• Main goal is Gaia catalogue, but tested/suitable for

• other catalogues

• simulations (N body)

Page 72: How to interactively visualise and explore a billion objects (wit vaex)

Summary• vaex: visualisation and exploration

• of large datasets 106-9

objects

• fast: > 109 objects/second

• 1-2 and 3d visualisation using densities

• ~6d with vector field overlaid

• Explore using selections+linked views or automated ranking of subspaces

• Can run on a single computer, zero setup

• Or as a server

• Main goal is Gaia catalogue, but tested/suitable for

• other catalogues

• simulations (N body)

Page 73: How to interactively visualise and explore a billion objects (wit vaex)

Summary• vaex: visualisation and exploration

• of large datasets 106-9

objects

• fast: > 109 objects/second

• 1-2 and 3d visualisation using densities

• ~6d with vector field overlaid

• Explore using selections+linked views or automated ranking of subspaces

• Can run on a single computer, zero setup

• Or as a server

• Main goal is Gaia catalogue, but tested/suitable for

• other catalogues

• simulations (N body)

Page 74: How to interactively visualise and explore a billion objects (wit vaex)