Interactive Data Visualization
11/19/19
Mark Grivainis
Overview
What is Interactive Data Visualization
Common Interactive Visualization Techniques
What Tools Exist for Interactive Visualization
Working with Bokeh
What is Interactive Data Visualization
Interactive data visualization lets you query a plot in real time
The underlying visualizations tend to be standard figures: bar plots, scatter plots, heatmaps, etc.
Adding interactions allows the data to be explored more thoroughly
Start with a solid static figure before adding interactions
Different Types of Interaction
Identification (Hovering)
Scaling (Zooming)
Selection (Brushing)
Linking
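As a rough illustration of how these interaction types map onto Bokeh, the sketch below enables one or two built-in tools per type. The tool names are standard Bokeh tool strings; the data values are made up for illustration. Linking is not a tool: it comes from sharing a data source or axis range between plots.

```python
from bokeh.plotting import figure

# One or two built-in tools per interaction type.
TOOLS = ",".join([
    "hover",         # identification (hovering)
    "wheel_zoom",    # scaling (zooming)
    "box_zoom",
    "box_select",    # selection (brushing)
    "lasso_select",
    "reset",         # restore the original view
])

p = figure(tools=TOOLS)
p.scatter(x=[1, 2, 3, 4, 5], y=[6, 7, 2, 3, 6], size=10)  # illustrative data

# Inspect which tool objects the figure ended up with.
tool_names = sorted(type(tool).__name__ for tool in p.tools)
print(tool_names)
```

Calling show(p) after output_notebook() or output_file() would display the plot with these tools in its toolbar.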
Available Tools for Interactive Visualization
Python: Bokeh, Plotly, Matplotlib
R: Shiny
JavaScript: D3
Most of these tools rely on HTML and JavaScript to render plots
If you want to create a non-standard plot:
Learn JavaScript
Use D3
What is Bokeh
Interactive visualization library for Python
Works with large datasets
Simplifies the process of creating:
Interactive plots
Dashboards
Data applications
Installing Bokeh
Bokeh is not part of the Python Standard Library
It can be installed using pip or conda (conda is preferred)
conda install bokeh
You can either install into your base environment or create a new environment
conda create -n vis python=3.6 bokeh jupyter pandas numpy
Using Bokeh Output
Bokeh has three output modes:
Server Mode: see https://docs.bokeh.org/en/1.4.0/docs/reference/server.html?highlight=server#module-bokeh.server
Static HTML: output_file()
Notebook: output_notebook()
Defining a figure
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()
p = figure()
show(p)
Bokeh Input Data
Providing Data Directly
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()

x_values = [1, 2, 3, 4, 5]
y_values = [6, 7, 2, 3, 6]

p = figure()

p.scatter(x=x_values, y=y_values)
show(p)
Using ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource

output_notebook()

data = {'x_values': [1, 2, 3, 4, 5],
        'y_values': [6, 7, 2, 3, 6]}

source = ColumnDataSource(data=data)

p = figure()
p.scatter(x='x_values', y='y_values', source=source)
show(p)
Using ColumnDataSource and Pandas
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource
import pandas as pd

output_notebook()

data = {'x_values': [1, 2, 3, 4, 5],
        'y_values': [6, 7, 2, 3, 6]}

df = pd.DataFrame.from_dict(data)

source = ColumnDataSource(df)

p = figure()
p.scatter(x='x_values', y='y_values', source=source)
show(p)
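To see what a ColumnDataSource built from a DataFrame actually holds, a small sketch like the following can help. It reuses the same illustrative values and inspects the column names. The DataFrame index is carried along as an extra column.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame({'x_values': [1, 2, 3, 4, 5],
                   'y_values': [6, 7, 2, 3, 6]})

source = ColumnDataSource(df)

# Each DataFrame column becomes a named column in the source;
# the unnamed DataFrame index is added under the name "index".
print(sorted(source.column_names))

# Glyph methods refer to these columns by name, e.g. x='x_values'.
print(list(source.data['y_values']))
```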
Built in Plot Types
line, multi_line, vbar, hbar, scatter, image, hex_tile
A full list is available in the documentation here
There are no prebuilt statistical plots
E.g. boxplots, heatmaps
Many of these plots are not complicated to generate
Build your own package defining them that can be used across projects
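As one possible shape for such a reusable helper, here is a minimal heatmap sketch built from the rect glyph and a linear color mapper. The function name heatmap_figure, its parameters, and the sample matrix are hypothetical. It is a starting point under those assumptions, not a finished package.

```python
import numpy as np
from bokeh.models import ColumnDataSource
from bokeh.palettes import Viridis256
from bokeh.plotting import figure
from bokeh.transform import linear_cmap


def heatmap_figure(matrix, row_labels, col_labels, title=None):
    """Hypothetical helper: draw a labelled matrix as a grid of colored cells."""
    matrix = np.asarray(matrix, dtype=float)

    # Flatten the matrix into one record per cell (long format).
    rows, cols, values = [], [], []
    for i, row in enumerate(row_labels):
        for j, col in enumerate(col_labels):
            rows.append(row)
            cols.append(col)
            values.append(float(matrix[i, j]))
    source = ColumnDataSource(data=dict(row=rows, col=cols, value=values))

    # Map each cell value onto a color from the palette.
    mapper = linear_cmap('value', Viridis256, low=min(values), high=max(values))

    p = figure(title=title,
               x_range=list(col_labels),
               y_range=list(reversed(row_labels)),  # first row at the top
               tooltips=[('row', '@row'), ('col', '@col'), ('value', '@value')])
    p.rect(x='col', y='row', width=1, height=1, source=source,
           fill_color=mapper, line_color=None)
    return p


# Illustrative 2x2 matrix.
heatmap = heatmap_figure([[1, 2], [3, 4]], ['r1', 'r2'], ['c1', 'c2'])
```

Keeping the helper as a plain function that returns a figure means it can be combined with show(), gridplot(), or further styling in any project.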
Adding Hover Functionality
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource

output_notebook()

source = ColumnDataSource(data=dict(
    x=[1, 2, 3, 4, 5],
    y=[2, 5, 8, 2, 7],
    desc=['A', 'b', 'C', 'd', 'E'],
))

# Enable the hover and pan tools; with no tooltips given, hover shows Bokeh's default fields
TOOLS = 'hover,pan'

p = figure(tools=TOOLS)
p.scatter('x', 'y', source=source)
show(p)
Adding Hover Functionality
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource

output_notebook()

source = ColumnDataSource(data=dict(
    x=[1, 2, 3, 4, 5],
    y=[2, 5, 8, 2, 7],
    desc=['A', 'b', 'C', 'd', 'E'],
))

TOOLS = 'hover,pan'
# "$" fields are special variables (point index, cursor position);
# "@" fields look up a column of the data source
TOOLTIPS = [
    ("index", "$index"),
    ("(x,y)", "($x, $y)"),
    ("desc", "@desc"),
]

p = figure(tools=TOOLS, tooltips=TOOLTIPS)
p.scatter('x', 'y', source=source)
show(p)
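The same tooltips can also be attached through an explicit HoverTool object, which is handy when a plot needs more than one hover configuration. The sketch below is a variation on the slide's example, not a replacement for it. The "{0.0}" suffix is a number format applied to the column value.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

source = ColumnDataSource(data=dict(
    x=[1, 2, 3, 4, 5],
    y=[2, 5, 8, 2, 7],
    desc=['A', 'b', 'C', 'd', 'E'],
))

# "@name" reads the hovered point's value from that column of the source;
# "$index" is the row index of the hovered point.
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("x", "@x"),
    ("y", "@y{0.0}"),   # format the y value with one decimal place
    ("desc", "@desc"),
])

p = figure(tools="pan")
p.add_tools(hover)
p.scatter('x', 'y', size=10, source=source)
```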
The autompg Dataframe
 mpg  cyl  displ   hp  weight  accel  yr  origin  name
18.0    8  307.0  130    3504   12.0  70       1  chevrolet chevelle malibu
15.0    8  350.0  165    3693   11.5  70       1  buick skylark 320
18.0    8  318.0  150    3436   11.0  70       1  plymouth satellite
16.0    8  304.0  150    3433   12.0  70       1  amc rebel sst
17.0    8  302.0  140    3449   10.5  70       1  ford torino
 ...  ...    ...  ...     ...    ...  ..     ...  ...
27.0    4  140.0   86    2790   15.6  82       1  ford mustang gl
44.0    4   97.0   52    2130   24.6  82       2  vw pickup
32.0    4  135.0   84    2295   11.6  82       1  dodge rampage
28.0    4  120.0   79    2625   18.6  82       1  ford ranger
31.0    4  119.0   82    2720   19.4  82       1  chevy s-10
Summarizing a Dataframe
from bokeh.sampledata.autompg import autompg as df
mpg = df.groupby('cyl').describe()['mpg']
acc = df.groupby('cyl').describe()['accel']
print(mpg.to_string(max_rows=10))
print(acc.to_string(max_rows=10))
mpg
cyl  count       mean       std   min    25%    50%    75%   max
3      4.0  20.550000  2.564501  18.0  18.75  20.25  22.05  23.7
4    199.0  29.283920  5.670546  18.0  25.00  28.40  32.95  46.6
5      3.0  27.366667  8.228204  20.3  22.85  25.40  30.90  36.4
6     83.0  19.973494  3.828809  15.0  18.00  19.00  21.00  38.0
8    103.0  14.963107  2.836284   9.0  13.00  14.00  16.00  26.6

accel
cyl  count       mean       std   min    25%   50%   75%   max
3      4.0  13.250000  0.500000  12.5  13.25  13.5  13.5  13.5
4    199.0  16.581910  2.383185  11.6  14.80  16.2  18.0  24.8
5      3.0  18.633333  2.369247  15.9  17.90  19.9  20.0  20.1
6     83.0  16.254217  2.031778  11.3  15.05  16.0  17.6  21.0
8    103.0  12.955340  2.224759   8.0  11.50  13.0  14.0  22.2
ColumnDataSource on a Group
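from bokeh.sampledata.autompg import autompg as df
from bokeh.models import ColumnDataSource

# Treat the model year as a category rather than a number
df['yr'] = df['yr'].astype(str)
group = df.groupby('yr')
# Passing a GroupBy object stores summary statistics for each group
source = ColumnDataSource(group)
print(source.to_df().to_string(max_cols=10, index=False, max_rows=6))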
yr  mpg_count   mpg_mean   mpg_std  mpg_min  accel_min  ...  accel_25%  accel_50%  accel_75%  accel_max
70       29.0  17.689655  5.339231      9.0        8.0  ...     10.000       12.5     15.000       20.5
71       27.0  21.111111  6.675635     12.0       11.5  ...     13.250       14.5     15.500       20.5
72       28.0  18.714286  5.435529     11.0       11.0  ...     13.375       14.5     16.625       23.5
..        ...        ...       ...      ...        ...  ...        ...        ...        ...        ...
80       27.0  33.803704  6.885854     19.1       11.4  ...     15.150       16.5     18.750       23.7
81       28.0  30.185714  5.635319     17.6       12.6  ...     14.700       16.3     17.425       20.7
82       30.0  32.000000  5.232524     22.0       11.6  ...     14.775       16.3     17.900       24.6
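The column names in this output follow a "column_statistic" pattern. A rough pandas-only sketch of where such names come from is shown below. It assumes the source is built from the group's describe() table with the two header levels joined by an underscore, and it uses a tiny made-up frame rather than the full autompg data.

```python
import pandas as pd

# Tiny illustrative frame; the values are not the real autompg data.
df = pd.DataFrame({'yr': ['70', '70', '71', '71'],
                   'mpg': [18.0, 15.0, 27.0, 25.0]})

# describe() on a GroupBy gives two-level columns such as ('mpg', 'mean').
stats = df.groupby('yr').describe()

# Join the two levels with "_" to get flat names like "mpg_mean".
stats.columns = ['_'.join(col) for col in stats.columns]

print(list(stats.columns))
print(stats.loc['70', 'mpg_mean'])
```

These flat names are what glyph methods refer to, for example top='mpg_mean' in a bar chart.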
Categorical Data
from bokeh.sampledata.autompg import autompg as df
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook

output_notebook()

df['yr'] = df['yr'].astype(str)
group = df.groupby('yr')
source = ColumnDataSource(group)
p = figure(x_range=group)
p.vbar(x='yr', top='mpg_mean', width=0.8, source=source)

show(p)
Coloring Plots
from bokeh.sampledata.autompg import autompg as df
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import d3
from bokeh.transform import factor_cmap

output_notebook()

df['yr'] = df['yr'].astype(str)
group = df.groupby('yr')
source = ColumnDataSource(group)

fm = factor_cmap('yr', palette=d3['Category20'][13], factors=df['yr'].unique())

p = figure(x_range=group)
p.vbar(x='yr', top='mpg_mean', width=0.8, color=fm, source=source)
show(p)
Grids
from bokeh.sampledata.autompg import autompg as df
from bokeh.plotting import figure, show
from bokeh.layouts import column, gridplot
from bokeh.models import ColumnDataSource, Grid
from bokeh.io import output_notebook
from itertools import product

def build_figure(title, x_lab, y_lab, source):
    # One small scatter plot comparing two columns of the source
    p = figure(plot_width=300, plot_height=300)
    p.scatter(x=x_lab, y=y_lab, source=source)
    p.xaxis.axis_label = x_lab
    p.yaxis.axis_label = y_lab
    return p

output_notebook()

COMPARE = ['mpg', 'hp', 'weight']
source = ColumnDataSource(df[COMPARE])
GRID_W = len(COMPARE)

# One plot for every (y, x) pair of columns, laid out as a square grid
plots = [build_figure('', x, y, source) for y, x in product(COMPARE, repeat=2)]
grid = gridplot(plots, ncols=GRID_W)

show(grid)
Grids
from bokeh.sampledata.autompg import autompg as df
from bokeh.plotting import figure, show
from bokeh.layouts import column, gridplot
from bokeh.models import ColumnDataSource, Grid
from bokeh.io import output_notebook
from itertools import product

TOOLS = "box_select,lasso_select,help"

def build_figure(title, x_lab, y_lab, source):
    p = figure(plot_width=300, plot_height=300, tools=TOOLS)
    p.scatter(x=x_lab, y=y_lab, source=source)
    p.xaxis.axis_label = x_lab
    p.yaxis.axis_label = y_lab
    return p

output_notebook()

COMPARE = ['mpg', 'hp', 'weight']
# All plots share this one source, so a selection made in one plot shows up in the others
source = ColumnDataSource(df[COMPARE])
GRID_W = len(COMPARE)

plots = [build_figure('', x, y, source) for y, x in product(COMPARE, repeat=2)]
grid = gridplot(plots, ncols=GRID_W)

show(grid)
Linked Plots
The code for this example was too long to fit on a slide
https://demo.bokeh.org/selection_histogram
Source Code
https://demo.bokeh.org/selection_histogram
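The full demo is long, but the core idea of linking fits in a few lines. The sketch below is not the demo's code. It is a minimal illustration in which two plots share one ColumnDataSource (linked selection) and one y range (linked panning and zooming). The five data rows are copied from the autompg table.

```python
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# A few rows from the autompg data.
source = ColumnDataSource(data=dict(
    mpg=[18.0, 15.0, 27.0, 44.0, 32.0],
    hp=[130, 165, 86, 52, 84],
    weight=[3504, 3693, 2790, 2130, 2295],
))

TOOLS = "box_select,lasso_select,pan,wheel_zoom,reset"

left = figure(tools=TOOLS, title="mpg vs hp")
left.scatter('hp', 'mpg', size=10, source=source)

# Reusing left.y_range links panning and zooming along the mpg axis.
right = figure(tools=TOOLS, title="mpg vs weight", y_range=left.y_range)
# Reusing the same source links selections between the two plots.
right.scatter('weight', 'mpg', size=10, source=source)

grid = gridplot([[left, right]])
```

Passing grid to show() displays both plots side by side. Points brushed in one plot are highlighted in the other.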
Getting your Figure Online
It is easy to host static HTML content on GitHub
Use output_file('index.html') to save your Bokeh plot as an HTML file
- 'index.html' is loaded by default, so it must be the entry point
- In this case that is easy, as it is the only HTML file
Upload this file to the master branch of a GitHub repository
Navigate: Settings -> GitHub Pages -> set Source to 'master branch'
Note: This does not work well with large datasets, because the data must be downloaded before it can be plotted
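A minimal sketch of the file-writing step might look like this. The site directory name and the plot contents are assumptions for illustration. The only requirement from the steps above is that the file is called index.html.

```python
from pathlib import Path

from bokeh.plotting import figure, output_file, save

# Hypothetical output folder; any folder inside the repository would do.
site_dir = Path("site")
site_dir.mkdir(exist_ok=True)
index_path = site_dir / "index.html"

# Point Bokeh's output at index.html, the page GitHub Pages serves by default.
output_file(str(index_path), title="My Bokeh plot")

p = figure(title="Example scatter")
p.scatter(x=[1, 2, 3, 4, 5], y=[6, 7, 2, 3, 6], size=10)  # illustrative data

# save() writes the file without opening a browser; show() would also open it.
save(p)
```

The resulting index.html is self-describing: the plot data is embedded in the page, which is why large datasets make the page slow to load.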
Example: Bokeh Conda Environment
Open a Terminal window (Mac) or Anaconda Prompt (Windows)
conda create -n ivis python=3.6 bokeh jupyter numpy pandas
conda activate ivis
jupyter notebook
Example: Building a Boxplot
https://en.wikipedia.org/wiki/Box_plot#/media/File:Boxplot_vs_PDF.svg
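One way to approach this example is to compute the box plot statistics first and then draw them with basic glyphs. The sketch below assumes the common 1.5 x IQR rule for whiskers, as in the linked figure. The function names and the sample values are hypothetical.

```python
import numpy as np
from bokeh.plotting import figure


def boxplot_stats(values):
    """Quartiles, whisker ends and outliers using the 1.5 x IQR rule."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    return {
        "q1": float(q1), "median": float(median), "q3": float(q3),
        "lower": float(inside.min()),   # end of the lower whisker
        "upper": float(inside.max()),   # end of the upper whisker
        "outliers": outliers.tolist(),
    }


def boxplot_figure(values, label="values"):
    """Hypothetical helper: draw one box plot from segment, vbar and scatter."""
    s = boxplot_stats(values)
    p = figure(x_range=[label], title="Boxplot sketch")
    # Whiskers: lower end up to Q1, and Q3 up to the upper end.
    p.segment(x0=[label, label], y0=[s["lower"], s["q3"]],
              x1=[label, label], y1=[s["q1"], s["upper"]], line_color="black")
    # Box from Q1 to Q3.
    p.vbar(x=[label], width=0.4, bottom=[s["q1"]], top=[s["q3"]],
           fill_color="lightsteelblue", line_color="black")
    # Median drawn as a zero-height bar, which renders as a horizontal line.
    p.vbar(x=[label], width=0.4, bottom=[s["median"]], top=[s["median"]],
           line_color="black")
    # Outliers as individual points.
    p.scatter(x=[label] * len(s["outliers"]), y=s["outliers"],
              size=8, color="firebrick")
    return p


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # illustrative values
stats = boxplot_stats(sample)
box = boxplot_figure(sample)
```

Separating the statistics from the drawing keeps the numbers easy to check and lets the same figure code be reused for grouped data.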
References
https://www.knowablemagazine.org/article/mind/2019/science-data-visualization
http://docs.bokeh.org/en/1.3.2/index.html
http://docs.bokeh.org/en/1.3.2/docs/user_guide/data.html