lecture 25 eda - github pages · lecture 25: eda theodoros rekatsinas 1. data visualizations today...

36
CS639: Data Management for Data Science Lecture 25: EDA Theodoros Rekatsinas 1

Upload: others

Post on 09-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

CS639:DataManagementfor

DataScienceLecture25:EDA

TheodorosRekatsinas

1

Page 2: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

DataVisualizationsToday

2

Page 3: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

DataVisualizationsToday

3

Page 4: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

4

StandardDataVisualizationRecipe:

1. Load datasetintodataviz tool2. Start withadesiredhypothesis/pattern(explore

combinationofattributes)3. Select viz tobegenerated4. See ifitmatchesdesiredpattern5. Repeat 3-4untilyoufindamatch

Page 5: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

5

TediousandTime-consuming!

KeyIssue:

Visualizationcanbegeneratedby:varyingsubsetsofdatavaryingattributesbeingvisualized

Toomanyvisualizationtolookattofinddesiredvisualpatterns!

Page 6: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

1.Visualizationrecommendations

6

Page 7: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Whatyouwilllearnaboutinthissection

1. SpaceofVisualizations

2. RecommendationMetrics

7

Page 8: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Goal

Givenadatasetandatask,automaticallyproduceasetofvisualizationsthatarethemost“interesting”giventhetask

8

Particularlyvague

Page 9: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Goal

Givenadatasetandatask,automaticallyproduceasetofvisualizationsthatarethemost“interesting”giventhetask

9

Page 10: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Example

10

• Dataanalyststudyingcensusdata• age,education,marital-status,sex,race,income,hours-workedetc.

• A =#attributesintable

• Task:Compareonvarioussocioeconomicindicators,unmarriedadultsvs.alladults

Page 11: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Spaceofvisualizations

11

Forsimplicity,assumeasingletable(starschema)

Visualizations=agg.+grp.byqueries

Vi=SELECTd,f(m)FROMtableWHERE___GROUPBYd

(d,m,f):dimension,measure,aggregate

1.5

2

2.5

3

3.5

4

4.5

5010 10

30

MA CA IL NY

Page 12: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Spaceofvisualizations

12

Vi=SELECTd,f(m)FROMtableWHERE___GROUPBYd

(d,m,f):dimension,measure,aggregate{d} :race,work-type,sexetc.{m} :capital-gain,capital-loss,hours-per-week{f} :COUNT,SUM,AVG

Page 13: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Goal

Givenadatasetandatask,automaticallyproduceasetofvisualizationsthatarethemost“interesting”giventhetask

13

Page 14: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Interestingvisualizations

14

Deviation-basedUtility

Avisualizationisinterestingifitdisplaysalargedeviationfromsomereference

Task:compareunmarriedadultswithalladults

50

10 10

30

MA CA IL NY

3020

10

40

MA CA IL NY

V1

V2

V1

V2

Compareinduced

probabilitydistributions!

Target Reference

V1=SELECTd,f(m)FROMtableWHEREtargetGROUPBYdV2=SELECTd,f(m)FROMtableWHEREreferenceGROUPBYd

Page 15: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Deviation-basedUtilityMetric

15

Avisualizationisinterestingifitdisplaysalargedeviationfromsomereference

Manymetricsforcomputingdistancebetweendistributions

V1

V2

D[P(V1),P(V2)]

Earthmover’sdistanceL1,L2distanceK-Ldivergence

Anydistancemetricb/ndistributionsisOK!

Page 16: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

ComputingExpectedTrend

16

Racevs.AVG(capital-gain)ReferenceTrendSELECTrace,AVG(capital-gain)FROMcensusGROUPBYrace

P(V1)

Expected

Distribution

Page 17: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

ComputingActualTrend

17

Racevs.AVG(capital-gain)TargetTrendSELECTrace,AVG(capital-gain)FROMcensusGROUPBYraceWHERE marital-status=‘unmarried’

P(V2)

Actual

Distribution

Page 18: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

ComputingUtility

18

U=D[P(V1) ,P(V2)]D =EMD,L2etc.

Page 19: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

LowUtilityVisualization

19

Actual

Expected

Page 20: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

HighUtilityVisualization

20

Actual

Expected

Page 21: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Othermetrics

21

• Datacharacteristics• TaskorInsight• SemanticsandDomainKnowledge• VisualEaseofUnderstanding• UserPreference

Page 22: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

2.DB-inspiredOptimizations

22

Page 23: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Whatyouwilllearnaboutinthissection

1. RankingVisualizations

2. Optimizations

23

Page 24: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Ranking

24

Acrossall(d,m,f),whereV1=SELECTd,f(m)FROMtableWHEREtargetGROUPBYd

V2=SELECTd,f(m)FROMtableWHEREreferenceGROUPBYd

Vi=(d:dimension,m:measure,f:aggregate)

10sofdimensions,10sofmeasures,handfulofaggregates

2*d*m*f

è100sofqueriesforasingleusertask!

èCanbeevenlarger.How?

Goal:returnkbestutilityvisualizations(d,m,f),(thosewithlargestD[V1,V2])

Page 25: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Evenlargerspaceofqueries

25

• Binning• 3dimensionalor4dimensionalvisualizations• Scatterplotormapvisualizations• …

Page 26: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Backtoranking

26

Acrossall(d,m,f),whereV1=SELECTd,f(m)FROMtableWHEREtargetGROUPBYd

V2=SELECTd,f(m)FROMtableWHEREreferenceGROUPBYd

Goal:returnkbestutilityvisualizations(d,m,f),(thosewithlargestD[V1,V2])

NaïveApproachForeach(d,m,f)insequence

evaluatequeriesforV1(target),V2(reference)computeD[V1,V2]

Returnthek(d,m,f)withlargestDvalues

Page 27: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

IssueswithNaïveApproach

27

•Repeatedprocessingofsamedatainsequenceacrossqueries•Computationwastedonlow-utilityvisualizations

Sharing

Pruning

Page 28: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Optimizations

28

• Eachvisualization=2SQLqueries

• Latency>100s• Minimizenumberofqueriesandscans

0

100

200

300

400

500

50 100 250views

late

ncy

(s)

dbmsCOLROW

Page 29: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Optimizations

29

• Combineaggregatequeriesontargetandref

• Combinemultipleaggregates(d1,m1,f1),(d1,m2,f1)à (d1,[m1,m2],f1)

• Combinemultiplegroup-bys*(d1,m1,f1),(d2,m1,f1)à ([d1,d2],m1,f1)Couldbeproblematic…

• ParallelQueryExecution

Page 30: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

CombiningMultipleGroup-by’s

30

• Toofewgroup-bys leadstomanytablescans

• Toomanygroup-bys hurtperformance• #groups=Π (#distinctvaluesperattributes)

• Optimalgroup-bycombination≈bin-packing• Binvolume=logS(maxnumberofgroups)• Volumeofitems(attributes)=log(|ai|)• Minimize#binss.t.

Σi log(|ai|)<=logS

Page 31: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Pruningoptimizations

31

• Keeprunningestimatesofutility• Prunevisualizationsbasedonestimates

• Twoflavors• VanillaConfidenceIntervalbasedPruning• Multi-armedBanditPruning

Discardlow-utilityviewsearlytoavoidwastedcomputation

Page 32: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

32

Visualizations

Queries(100s)

Sharing

Pruning

Optimizer

DBMS

MiddlewareLayer

Page 33: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Moreonautomatedvisualizations

33

Page 34: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

ZQL:aviz explorationlanguage

34

Page 35: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Intelligentqueryoptimizer

35

Page 36: Lecture 25 EDA - GitHub Pages · Lecture 25: EDA Theodoros Rekatsinas 1. Data Visualizations Today 2. Data Visualizations Today 3. 4 Standard Data Visualization Recipe: 1.Loaddataset

Summary

36

Humanintheloopanalytics

areheretostay!