lecture 25 eda - github pages · lecture 25: eda theodoros rekatsinas 1. data visualizations today...
TRANSCRIPT
CS639:DataManagementfor
DataScienceLecture25:EDA
TheodorosRekatsinas
1
DataVisualizationsToday
2
DataVisualizationsToday
3
4
StandardDataVisualizationRecipe:
1. Load datasetintodataviz tool2. Start withadesiredhypothesis/pattern(explore
combinationofattributes)3. Select viz tobegenerated4. See ifitmatchesdesiredpattern5. Repeat 3-4untilyoufindamatch
5
TediousandTime-consuming!
KeyIssue:
Visualizationcanbegeneratedby:varyingsubsetsofdatavaryingattributesbeingvisualized
Toomanyvisualizationtolookattofinddesiredvisualpatterns!
1.Visualizationrecommendations
6
Whatyouwilllearnaboutinthissection
1. SpaceofVisualizations
2. RecommendationMetrics
7
Goal
Givenadatasetandatask,automaticallyproduceasetofvisualizationsthatarethemost“interesting”giventhetask
8
Particularlyvague
Goal
Givenadatasetandatask,automaticallyproduceasetofvisualizationsthatarethemost“interesting”giventhetask
9
Example
10
• Dataanalyststudyingcensusdata• age,education,marital-status,sex,race,income,hours-workedetc.
• A =#attributesintable
• Task:Compareonvarioussocioeconomicindicators,unmarriedadultsvs.alladults
Spaceofvisualizations
11
Forsimplicity,assumeasingletable(starschema)
Visualizations=agg.+grp.byqueries
Vi=SELECTd,f(m)FROMtableWHERE___GROUPBYd
(d,m,f):dimension,measure,aggregate
1.5
2
2.5
3
3.5
4
4.5
5010 10
30
MA CA IL NY
Spaceofvisualizations
12
Vi=SELECTd,f(m)FROMtableWHERE___GROUPBYd
(d,m,f):dimension,measure,aggregate{d} :race,work-type,sexetc.{m} :capital-gain,capital-loss,hours-per-week{f} :COUNT,SUM,AVG
Goal
Givenadatasetandatask,automaticallyproduceasetofvisualizationsthatarethemost“interesting”giventhetask
13
Interestingvisualizations
14
Deviation-basedUtility
Avisualizationisinterestingifitdisplaysalargedeviationfromsomereference
Task:compareunmarriedadultswithalladults
50
10 10
30
MA CA IL NY
3020
10
40
MA CA IL NY
V1
V2
V1
V2
Compareinduced
probabilitydistributions!
Target Reference
V1=SELECTd,f(m)FROMtableWHEREtargetGROUPBYdV2=SELECTd,f(m)FROMtableWHEREreferenceGROUPBYd
Deviation-basedUtilityMetric
15
Avisualizationisinterestingifitdisplaysalargedeviationfromsomereference
Manymetricsforcomputingdistancebetweendistributions
V1
V2
D[P(V1),P(V2)]
Earthmover’sdistanceL1,L2distanceK-Ldivergence
Anydistancemetricb/ndistributionsisOK!
ComputingExpectedTrend
16
Racevs.AVG(capital-gain)ReferenceTrendSELECTrace,AVG(capital-gain)FROMcensusGROUPBYrace
P(V1)
Expected
Distribution
ComputingActualTrend
17
Racevs.AVG(capital-gain)TargetTrendSELECTrace,AVG(capital-gain)FROMcensusGROUPBYraceWHERE marital-status=‘unmarried’
P(V2)
Actual
Distribution
ComputingUtility
18
U=D[P(V1) ,P(V2)]D =EMD,L2etc.
LowUtilityVisualization
19
Actual
Expected
HighUtilityVisualization
20
Actual
Expected
Othermetrics
21
• Datacharacteristics• TaskorInsight• SemanticsandDomainKnowledge• VisualEaseofUnderstanding• UserPreference
2.DB-inspiredOptimizations
22
Whatyouwilllearnaboutinthissection
1. RankingVisualizations
2. Optimizations
23
Ranking
24
Acrossall(d,m,f),whereV1=SELECTd,f(m)FROMtableWHEREtargetGROUPBYd
V2=SELECTd,f(m)FROMtableWHEREreferenceGROUPBYd
Vi=(d:dimension,m:measure,f:aggregate)
10sofdimensions,10sofmeasures,handfulofaggregates
2*d*m*f
è100sofqueriesforasingleusertask!
èCanbeevenlarger.How?
Goal:returnkbestutilityvisualizations(d,m,f),(thosewithlargestD[V1,V2])
Evenlargerspaceofqueries
25
• Binning• 3dimensionalor4dimensionalvisualizations• Scatterplotormapvisualizations• …
Backtoranking
26
Acrossall(d,m,f),whereV1=SELECTd,f(m)FROMtableWHEREtargetGROUPBYd
V2=SELECTd,f(m)FROMtableWHEREreferenceGROUPBYd
Goal:returnkbestutilityvisualizations(d,m,f),(thosewithlargestD[V1,V2])
NaïveApproachForeach(d,m,f)insequence
evaluatequeriesforV1(target),V2(reference)computeD[V1,V2]
Returnthek(d,m,f)withlargestDvalues
IssueswithNaïveApproach
27
•Repeatedprocessingofsamedatainsequenceacrossqueries•Computationwastedonlow-utilityvisualizations
Sharing
Pruning
Optimizations
28
• Eachvisualization=2SQLqueries
• Latency>100s• Minimizenumberofqueriesandscans
0
100
200
300
400
500
50 100 250views
late
ncy
(s)
dbmsCOLROW
Optimizations
29
• Combineaggregatequeriesontargetandref
• Combinemultipleaggregates(d1,m1,f1),(d1,m2,f1)à (d1,[m1,m2],f1)
• Combinemultiplegroup-bys*(d1,m1,f1),(d2,m1,f1)à ([d1,d2],m1,f1)Couldbeproblematic…
• ParallelQueryExecution
CombiningMultipleGroup-by’s
30
• Toofewgroup-bys leadstomanytablescans
• Toomanygroup-bys hurtperformance• #groups=Π (#distinctvaluesperattributes)
• Optimalgroup-bycombination≈bin-packing• Binvolume=logS(maxnumberofgroups)• Volumeofitems(attributes)=log(|ai|)• Minimize#binss.t.
Σi log(|ai|)<=logS
Pruningoptimizations
31
• Keeprunningestimatesofutility• Prunevisualizationsbasedonestimates
• Twoflavors• VanillaConfidenceIntervalbasedPruning• Multi-armedBanditPruning
Discardlow-utilityviewsearlytoavoidwastedcomputation
32
Visualizations
Queries(100s)
Sharing
Pruning
Optimizer
DBMS
MiddlewareLayer
Moreonautomatedvisualizations
33
ZQL:aviz explorationlanguage
34
Intelligentqueryoptimizer
35
Summary
36
Humanintheloopanalytics
areheretostay!