Visualization Lifecycle Raffael Marty datainsight San Francisco 2011

Post on 18-Oct-2014

TRANSCRIPT

Page 1: Visualization Lifecycle

Visualization Lifecycle

Raffael Marty

datainsight San Francisco 2011

Page 2: Visualization Lifecycle

Copyright (c) 2011 pixlcloud | collect. visualize. understand.

Art

“Transform a dataset into a captive story.”

You're on your own
‣ Assess
‣ Parse
‣ Clean
‣ Visualize

Visualization Tools and Libraries

Page 3: Visualization Lifecycle

Audience

Expert

Beginner

Technical Overview

Fun

Boring

Page 4: Visualization Lifecycle

Visualization Process

[Diagram. Labels: Data Sources (files, database); Data Store; Structured Data; Contextual Data; Visual Representation. Step labels: parsing, feature selection, filtering, aggregation, cleansing, visualization, iterations.]

Page 5: Visualization Lifecycle

Data Sources
‣ File
‣ Database
‣ API
  ‣ Factual
  ‣ Freebase
  ‣ Infochimps
  ‣ OpenStreetMap

curl 'http://freebase.com/api/service/search?query=al+gore&indent=1'

Formats: XML, JSON, CSV, TSV

mysql -u root -p mydatabase < dump.sql

Page 6: Visualization Lifecycle

Explore Data
‣ What is the data about?
‣ What are the data features/columns?
‣ Is there a common structure in the data?
‣ What are the data types?

Nov 7 09:14:46 fwbox kernel: DROPPED IN=eth0 OUT= MAC=00:0c:29:e3:45:bd:00:0c:29:b5:5c:ee:08:00 SRC=10.1.222.31 DST=10.1.222.202 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=63849 DF PROTO=TCP SPT=58485 DPT=9111 WINDOW=5840 RES=0x00 SYN URGP=0

May 25 20:24:20 ram-laptop kernel: BLOCK any in: IN=eth1 OUT= MAC=00:13:02:ac:d8:ea:00:09:5b:3d:df:00:08:00 SRC=213.175.90.24 DST=192.168.0.15 LEN=576 TOS=0x00 PREC=0x00 TTL=115 ID=23513 PROTO=TCP SPT=9030 DPT=56772 WINDOW=65535 RES=0x00 ACK URGP=0
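One way to get a first look at the features in log lines like these is to collect their KEY=VALUE tokens into a dictionary. The following is a minimal sketch, not part of the original slides; the function name `parse_kv` is an assumption made for illustration.

```python
def parse_kv(line):
    """Collect KEY=VALUE tokens from an iptables-style log line.

    Hypothetical helper for illustration: tokens without "=" (timestamps,
    host name, bare flags such as DF or SYN) are skipped.
    """
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


sample = (
    "Nov 7 09:14:46 fwbox kernel: DROPPED IN=eth0 OUT= "
    "MAC=00:0c:29:e3:45:bd:00:0c:29:b5:5c:ee:08:00 SRC=10.1.222.31 "
    "DST=10.1.222.202 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=63849 DF "
    "PROTO=TCP SPT=58485 DPT=9111 WINDOW=5840 RES=0x00 SYN URGP=0"
)
record = parse_kv(sample)
print(sorted(record))               # the column names found in the line
print(record["SRC"], record["DPT"])
```

Listing the keys answers "what are the columns?"; looking at a few values per key hints at the data types (addresses, integers, hex flags).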

Page 7: Visualization Lifecycle

Parsing and Normalization
‣ Parsing
  ‣ extraction of entities / features
  ‣ imposing structure
  ‣ often use regexes
‣ Normalize
  ‣ field normalization
  ‣ term normalization: block, deny, dropped
‣ Generate a common output format for vis-tools (e.g., CSV)

Oct 13 20:00:43.874401 rule 193/0(match): block in on xl0: 212.251.89.126.3859 >: S 1818630320:1818630320(0) win 65535 <mss 1460,nop,nop,sackOK> (DF)

Oct 13 20:00:43 fwbox local4:warn|warning fw07 %PIX-4-106023: Deny tcp src internet: 212.251.89.126/3859 dst 212.254.110.98/135 by access-group "internet_access_in"

Oct 13 20:00:43 fwbox kernel: DROPPED IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:0f:cc:81:40:94:08:00 SRC=212.251.89.126 DST=212.254.110.98 LEN=576 TOS=0x00 PREC=0x00 TTL=255 ID=8624 PROTO=TCP SPT=3859 DPT=135 LEN=556
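The three lines above describe the same blocked connection in three vocabularies: block, Deny, DROPPED. Term normalization can be sketched as a lookup onto one canonical word. The choice of "block" as the canonical term and the names `CANONICAL_ACTION` and `normalize_action` are assumptions for illustration only.

```python
import re

# Hypothetical mapping: "block" as the canonical term is an illustrative choice.
CANONICAL_ACTION = {
    "block": "block",
    "deny": "block",
    "dropped": "block",
    "pass": "pass",
}
ACTION_RE = re.compile(r"\b(block|deny|dropped|pass)\b", re.IGNORECASE)


def normalize_action(line):
    """Return the canonical action for a firewall log line, or None."""
    match = ACTION_RE.search(line)
    if match is None:
        return None
    return CANONICAL_ACTION[match.group(1).lower()]


pf = "Oct 13 20:00:43.874401 rule 193/0(match): block in on xl0: 212.251.89.126.3859"
pix = "Oct 13 20:00:43 fwbox local4:warn|warning fw07 %PIX-4-106023: Deny tcp src internet: 212.251.89.126/3859"
ipt = "Oct 13 20:00:43 fwbox kernel: DROPPED IN=eth0 OUT= SRC=212.251.89.126 DST=212.254.110.98"
print([normalize_action(line) for line in (pf, pix, ipt)])
```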

Page 8: Visualization Lifecycle

Parser

Raw

Oct 13 20:00:38.018152 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 62.2.32.250.53: 34388 [1au][|domain] (DF)

Oct 13 20:00:38.115862 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 192.134.0.49.53: 49962 [1au][|domain] (DF)

Oct 13 20:00:38.157238 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 194.25.2.133.53: 14434 [1au][|domain] (DF)

Regex / Parser

(.*) rule ([-\d]+\/\d+)\((.*?)\): (pass|block) (in|out) on (\w+): (\d+\.\d+\.\d+\.\d+)\.?(\d*) [<>] (\d+\.\d+\.\d+\.\d+)\.?(\d*): (.*)

Normalized (CSV)

Oct 13 20:00:38.018152,57/0,match,pass,in,xl1,195.141.69.45,1030,62.2.32.250,53,34388 [1au][|domain] (DF)

Oct 13 20:00:38.115862,57/0,match,pass,in,xl1,195.141.69.45,1030,192.134.0.49,53,49962 [1au][|domain] (DF)

Oct 13 20:00:38.157238,57/0,match,pass,in,xl1,195.141.69.45,1030,194.25.2.133,53,14434 [1au][|domain] (DF)
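The raw-to-CSV step on this slide can be sketched in a few lines of Python. The pattern is the one from the slide, with the parenthesised "match" token captured as its own group so that the result has the same eleven columns as the normalized sample; the names `PF_LINE` and `to_csv_line` are assumptions for illustration.

```python
import re

# Slide pattern; "(match)" is captured as a group so the output has 11 columns.
PF_LINE = re.compile(
    r"(.*) rule ([-\d]+\/\d+)\((.*?)\): (pass|block) (in|out) on (\w+): "
    r"(\d+\.\d+\.\d+\.\d+)\.?(\d*) [<>] (\d+\.\d+\.\d+\.\d+)\.?(\d*): (.*)"
)


def to_csv_line(raw):
    """Return the comma-joined fields of one pf log line, or None if it does not parse."""
    match = PF_LINE.match(raw)
    if match is None:
        return None
    return ",".join(match.groups())


raw = (
    "Oct 13 20:00:38.018152 rule 57/0(match): pass in on xl1: "
    "195.141.69.45.1030 > 62.2.32.250.53: 34388 [1au][|domain] (DF)"
)
print(to_csv_line(raw))
```

A plain join does no quoting, which is fine for these samples; fields that may contain commas would need the `csv` module instead.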

Page 9: Visualization Lifecycle

UNIX Tools
‣ grep
  ‣ cat file | grep -v "foo"
‣ awk
  ‣ awk -F, '{printf("%s,%s\n",$2,$1);}'
  ‣ awk -F, -v OFS=, '{print $2,$1}'
‣ sed
  ‣ sed -e 's/fubar/foobar/g' filename
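For readers who would rather stay in one scripting language, the same three operations can be approximated in Python. This is a rough sketch of equivalents, not a replacement for the tools; all function names are hypothetical.

```python
import re


def grep_v(lines, pattern):
    """Like grep -v: keep the lines that do not match the pattern."""
    rx = re.compile(pattern)
    return [line for line in lines if not rx.search(line)]


def swap_first_two(lines, sep=","):
    """Like the awk one-liners: emit column 2, then column 1."""
    swapped = []
    for line in lines:
        cols = line.split(sep) + [""]  # pad so single-column lines still work
        swapped.append(sep.join([cols[1], cols[0]]))
    return swapped


def substitute(lines, old, new):
    """Like sed 's/old/new/g' for a literal (non-regex) search string."""
    return [line.replace(old, new) for line in lines]


lines = ["a,b,c", "foo,bar", "fubar,x"]
print(grep_v(lines, "foo"))
print(swap_first_two(lines))
print(substitute(lines, "fubar", "foobar"))
```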

Page 10: Visualization Lifecycle

Regular Expression Resources
‣ http://regexlib.com
‣ http://www.regular-expressions.info
‣ http://gskinner.com/RegExr

Page 11: Visualization Lifecycle

Data Cleansing
‣ Filter
‣ Normalize (see earlier)
‣ Aggregation
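Filtering and aggregation can be sketched on a handful of parsed records. The rows below are illustrative values modelled on the earlier log samples, and the helper names are assumptions.

```python
from collections import Counter

# Illustrative (source, destination, destination port) records.
rows = [
    ("195.141.69.45", "62.2.32.250", 53),
    ("195.141.69.45", "192.134.0.49", 53),
    ("195.141.69.45", "194.25.2.133", 53),
    ("212.251.89.126", "212.254.110.98", 135),
]


def filter_rows(rows, port):
    """Filter: keep only the records that target the given destination port."""
    return [row for row in rows if row[2] == port]


def count_by_source(rows):
    """Aggregate: number of records per source address."""
    return Counter(src for src, _dst, _port in rows)


dns_rows = filter_rows(rows, 53)
print(count_by_source(dns_rows).most_common())
```

Aggregating before plotting keeps the visual readable: one node or bar per source instead of one per log line.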

Page 12: Visualization Lifecycle

Load CSV into Database

# mysql -u <user> -p

mysql> create database data;

mysql> use data;

mysql> create table set1 (id int, address varchar(20), ...);

mysql> LOAD DATA LOCAL INFILE 'input_file' INTO TABLE set1 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

Sometimes you can load the data straight into a tool and omit this step.
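The same load step can be tried without a MySQL server. The sketch below swaps in SQLite through Python's standard `sqlite3` module, which is not what the slide uses; the two-column layout of `set1` and the sample rows are assumptions for illustration.

```python
import csv
import io
import sqlite3

# Illustrative CSV content: id,address
CSV_TEXT = "1,195.141.69.45\n2,212.251.89.126\n3,10.1.222.31\n"


def load_csv(conn, csv_text):
    """Create table set1 and insert one row per CSV line; return the row count."""
    conn.execute("CREATE TABLE set1 (id INTEGER, address VARCHAR(20))")
    rows = [(int(r[0]), r[1]) for r in csv.reader(io.StringIO(csv_text)) if r]
    conn.executemany("INSERT INTO set1 (id, address) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
loaded = load_csv(conn, CSV_TEXT)
print(loaded, conn.execute("SELECT address FROM set1 WHERE id = 2").fetchone()[0])
```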

Page 13: Visualization Lifecycle

Contextual Data
‣ Either load it into the DB or query it via API calls to augment the data
‣ IP -> Geo mapping
‣ Information about countries
‣ Port number -> service name
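As a sketch of the port-number case, a record can be augmented with a service name from a lookup table. The table here is a small hand-written excerpt of well-known port names; a real lookup would use a fuller source such as /etc/services. The field names `dport` and `service` are assumptions.

```python
# Hand-written excerpt for illustration only; not a complete registry.
SERVICE_NAMES = {22: "ssh", 53: "domain", 80: "http", 135: "epmap", 443: "https"}


def add_service(record):
    """Return a copy of the record with a 'service' field derived from 'dport'."""
    enriched = dict(record)
    enriched["service"] = SERVICE_NAMES.get(int(record["dport"]), "unknown")
    return enriched


row = {"src": "195.141.69.45", "dst": "62.2.32.250", "dport": "53"}
print(add_service(row))
```

IP -> Geo mapping follows the same pattern, with the dictionary replaced by a geolocation database or API call.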

Page 14: Visualization Lifecycle

Feature Selection
‣ What are the fields you are interested in?
‣ Compute new fields
  ‣ start time, end time -> duration
  ‣ IP subnets [ 10.2.4.2 -> 10.0.0.0/8 or 192.168.1.2 -> 192.168.1.0/24 ]
  ‣ Entropy: H(X) = E(I(X))
‣ Dimensionality reduction
  ‣ See Bryan's talk!
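The three computed fields can be sketched with the Python standard library. For entropy, the sketch uses the usual expansion of H(X) = E(I(X)) for a discrete variable, H(X) = sum over x of p(x) * log2(1 / p(x)), estimated from observed frequencies. The function names and the timestamp format are assumptions for illustration.

```python
import ipaddress
import math
from collections import Counter
from datetime import datetime


def duration_seconds(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """start time, end time -> duration in seconds (timestamp format is assumed)."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def subnet(ip, prefix_len):
    """Map an address to its enclosing network, e.g. 10.2.4.2 with /8."""
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


def entropy(values):
    """Shannon entropy in bits of the empirical distribution of the values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(duration_seconds("2011-10-13 20:00:38", "2011-10-13 20:00:43"))
print(subnet("10.2.4.2", 8), subnet("192.168.1.2", 24))
print(entropy([53, 53, 80, 80]))
```

A column whose entropy is near zero (for example, a destination port that is almost always 53) carries little information and is a candidate to drop or to use only as a filter.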

Page 15: Visualization Lifecycle

Choose Your Poison

Page 16: Visualization Lifecycle

Ode to the Pie

Page 17: Visualization Lifecycle

A Good Visual
‣ Choose the right graph
‣ Reduce non-data ink
‣ Simultaneous views
‣ Interactivity

Page 18: Visualization Lifecycle

Visual Transformations
‣ keep iterating on visual transformations, change
  ‣ color
  ‣ shape
  ‣ features display
‣ add new fields?
‣ add more context?
‣ is the output expressive?
‣ capture output and prettify it for presentation

Page 19: Visualization Lifecycle

Data Visualization Tools and Libraries

Page 20: Visualization Lifecycle

Tools and Libraries
‣ http://datainsightsf.com/resources/
‣ Choose what's appropriate!
‣ Data Analysis and Visualization LInuX
  ‣ davix.secviz.org
‣ GraphViz
  ‣ graphviz.org
‣ AfterGlow (CSV -> DOT)
  ‣ afterglow.sf.net
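To show what a CSV -> DOT conversion involves, here is a minimal Python sketch that turns two-column CSV into a GraphViz digraph. It is not AfterGlow's implementation and has none of its configuration options; the function name `csv_to_dot` is hypothetical.

```python
import csv
import io


def csv_to_dot(csv_text):
    """Turn 'source,target' CSV rows into DOT text, one edge per distinct pair."""
    edges = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:
            edges.add((row[0].strip(), row[1].strip()))
    lines = ["digraph G {"]
    for src, dst in sorted(edges):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


sample_csv = (
    "195.141.69.45,62.2.32.250\n"
    "195.141.69.45,192.134.0.49\n"
    "195.141.69.45,62.2.32.250\n"
)
dot_text = csv_to_dot(sample_csv)
print(dot_text)
```

The resulting text can be saved to a file and rendered with one of the GraphViz layout programs.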

Page 21: Visualization Lifecycle

Libraries
‣ Reporting Libraries
  ‣ HighCharts
  ‣ Flot
  ‣ Google Chart API
  ‣ Open Flash Chart
  ‣ jQuery Sparklines
  ‣ Polymaps
‣ Visualization Libraries
  ‣ TheJIT
  ‣ gRaphaël
  ‣ Protovis
  ‣ ProcessingJS
  ‣ Flare
  ‣ D3

Page 22: Visualization Lifecycle

HighCharts

‣ Click-Through
‣ On load
‣ near real-time updates
‣ Zoom

www.highcharts.com

Page 23: Visualization Lifecycle

Google Visualization API

‣ JavaScript
‣ Based on DataTables()
‣ Many graphs
‣ Playground
  ‣ http://code.google.com/apis/ajax/playground

http://code.google.com/apis/visualization/interactive_charts.html

Page 25: Visualization Lifecycle

TheJIT
‣ JavaScript InfoVis Toolkit
‣ Interactive
‣ Link Graphs

http://thejit.org/

Page 26: Visualization Lifecycle

Processing
‣ Visualization library
‣ Java based
‣ Interactive (event handling)
‣ Number of libraries to
  ‣ draw in OpenGL
  ‣ read XML files

http://processing.org/

‣ Processing JS
  ‣ JavaScript
  ‣ HTML 5 Canvas
  ‣ WebGL
  ‣ Web IDE

http://processingjs.org/

Page 27: Visualization Lifecycle

Visualization Tools
‣ Gephi
‣ R
‣ Matlab
‣ Mondrian
‣ PicViz
‣ Treemap 4.1
‣ Google Earth

Page 28: Visualization Lifecycle

Gephi
‣ reads: CSV, DOT, etc.
‣ graph analysis algorithms
‣ highly interactive

http://gephi.org

Page 29: Visualization Lifecycle

PicViz

http://www.wallinfire.net/picviz/

Page 30: Visualization Lifecycle

Treemap 4.1

http://www.cs.umd.edu/hcil/treemap/

Page 31: Visualization Lifecycle

Google Earth
‣ KML data format for encoding data
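Encoding located data points as KML can be sketched with Python's standard XML library. The snippet writes one Placemark per point under the KML 2.2 namespace; the function name and the sample point are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemarks_to_kml(points):
    """Build a KML document from (name, latitude, longitude) tuples."""
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in points:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude,altitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


# Illustrative point: approximate coordinates of San Francisco.
kml_text = placemarks_to_kml([("San Francisco", 37.7749, -122.4194)])
print(kml_text)
```

Saved with a .kml extension, a file like this can be opened in Google Earth; the IP -> Geo mapping mentioned earlier is one way to obtain coordinates for log data.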