interactive query processing vijayshankar raman computer science division university of california...
Post on 22-Dec-2015
217 views
TRANSCRIPT
Interactive Query Processing
Vijayshankar RamanComputer Science Division
University of California at Berkeley
April 19, 2023 Interactive Query Processing 2
Motivation: Nature of querying
querying = extracting information from data sets different techniques in different settings
intrinsically slow significant user-system interaction
info. seekers work iteratively, gradually refining requests based on feedback
April 19, 2023 Interactive Query Processing 3
Problems with traditional solutions
mismatch between system functionality and mode of HCI
black boxes• do batch processing• frustrating delays in iterative process
process
query
exactanswer
April 19, 2023 Interactive Query Processing 4
Interactive processing
HCI requirements• users must get continual feedback on results of processing• allow users to control processing based on prior feedback
performance goals• not to minimize time to give complete results• give continually improving partial results• adapt to dynamically specified performance goals
process
query
exactanswer
April 19, 2023 Interactive Query Processing 5
Context and Talk Outline query processing over structured data processing = dataflow through
pipelining operators
outline of rest of talksupport for dynamic user control in traditional query proc.
architectures (static plans)• architecture for giving more aggressive partial results• policy for generating partial results
client user interface impact of user actions on query execution
other related issues• interactive data cleaning (A-B-C)• adaptive query processing
R S P Q
April 19, 2023 Interactive Query Processing 6
Design goals in supporting user control
make minimal change to system architecture
must be independent of particular query processing algorithms
no delay in processing
April 19, 2023 Interactive Query Processing 7
Online Reordering (Raman et al. ’99,’00)
users perceive data being processed over time• prioritize processing for “interesting” tuples• interest based on user-specified preferences
reorder dataflow so that interesting tuples go first
encapsulate reordering as pipelined dataflow operator
scan
reorder in
dex sca
n
April 19, 2023 Interactive Query Processing 8
Framework for Online Reordering
want no delay in processing in general, reordering can only be best-effort
typically process/consume slower than produce• exploit throughput difference to reorder
two aspects• mechanism for best-effort reordering• reordering policy
acddbadb...
abcdabc..
reorder consumeproduceprocess
f(t)
user interest
April 19, 2023 Interactive Query Processing 9
Juggle mechanism for reordering
continually prefetch from input, spooling onto auxiliary side disk if needed
juggle data between buffer and side disk, to keep buffer full of “interesting” items• getNext chooses best item currently on buffer
reordering policy determines what getNext returns, and enrich/spool decisions
buffer
spoolfetch enrich
getNextjuggle
side diskproduce
process/consume
April 19, 2023 Interactive Query Processing 10
Reordering policies
quality of feedback for a prefix t1t2
…tk
QOF(UP(t1), UP(t2
), … UP(tk )), UP = user preference
• determined by application
goodness of reordering: dQOF/dt implication for juggle mechanism
• process gets item from buffer that increases QOF the most
• juggle tries to maintain buffer with such items
time
QOFGOAL: “good” permutation of
items t1…tn to t1…tn
April 19, 2023 Interactive Query Processing 11
One application of reordering online aggregation [Hellerstein et. al ‘97,
Hellerstein and Haas’99 ]• for SQL aggregate queries, give gradually improving
estimates • with confidence intervals
• allow users to speed up estimate refinement for groups of interest
• prioritize for processing at a per-group granularity
SELECT AVG(gpa) FROM studentsGROUP BY college
April 19, 2023 Interactive Query Processing 12
Online Aggregation Screenshot
SELECT AVG(gpa) FROM studentsGROUP BY college
April 19, 2023 Interactive Query Processing 13
QOF in Online Aggregation
avg weighted confidence interval preference acts as weight on confidence interval
• QOF = UPi / ni , ni = number of tuples processed from group i
process pulls items from group with max UPi / nini
desired ratio of group i tuples on buffer = UPi2/3/ UPj
2/3
• juggle tries to maintain this by enrich/spool
April 19, 2023 Interactive Query Processing 14
Other Quality of Feedback functions
rate of processing (for a group) preference • QOF = (ni - nUPi)
2 (variance from ideal proportions)
process pulls items from group with max (nUPi - ni )
desired ratio of group i tuples in buffer = UPi
will see more complex QOF later
April 19, 2023 Interactive Query Processing 15
Results: Reordering in Online Aggregation
implemented in Informix UDO experiments with modified TPC-D queries questions:
• how much throughput difference is needed for reordering• can we reorder handle skewed data
one stress test: skew, minimal proc. cost
• index-only join• 5 orderpriorities, zipf distribution
A B C D E 1 1/2 1/3 1/4 1/5
consume
process
scan
juggle
index
SELECT AVG(o_totalprice), o_orderpriorityFROM order WHERE exists ( SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)GROUP BY o_orderpriority
April 19, 2023 Interactive Query Processing 16
Performance results A B C D E
initial preferences 1 1 1 5 3
after T1 1 1 3.5 0.5 1
time
# t
up
les
pro
cess
ed
3 times faster for interesting groups 2% completion time overhead
E
C A
con
fid
en
ce
inte
rval
time
April 19, 2023 Interactive Query Processing 17
Findings
higher processing costs (index/hash join, subquery, …) make reordering easy
at very low processing costs, juggle constrained bydensity of interesting tuples in source• outlier groups hard to speed up
• better to use index on reordering column [HHW ‘97] reordering becomes easier over time
questions to answer:• where to place juggle• are groups fixed statically
April 19, 2023 Interactive Query Processing 18
Outline
motivation and context support for dynamic user control in traditional
query proc. architectures architecture for giving more aggressive partial
results policy for generating partial results
• user interface for displaying partial results • impact on routing
related work, work-in-progress, future work
April 19, 2023 Interactive Query Processing 19
Generating partial results traditional arch. also generate continual result tuples
• arises from continual dataflow thru pipelining operators• much work on pipelining joins [WA91, HH99, IFF+99, UF00]
this is too rigid • especially in distributed envirorments
Resu
l t S
pace
April 19, 2023 Interactive Query Processing 20
Context: Query processing in Telegraph
Telegraph: adaptive dataflow system to query diverse, distributed sources
much data available as services over Internet• currently only accessible by browse/search/forms
want to semi-automatically combine this data
examples:• campaign finance information (fff.cs.berkeley.edu)
multiple donor lists, home prices, census info., crime ratings, maps, celebrity lists, legislative records …
• restaurant information lists, reviews, addresses, maps, health inspection reports,
nutrition information …
April 19, 2023 Interactive Query Processing 21
More aggressive partial results complete results too rigid
• source latencies high, diverse• dynamic source variations, delays• query does not capture user desires
want to process queries flexibly, giving results asap• adapt dynamically to user preferences and source variations• allow user feedback to refine partial results
April 19, 2023 Interactive Query Processing 22
Correctness of partial results must contain some essential columns
• group-by columns / sort-by columns
full disjunction/outer-join semantics• appropriate for many Internet data sources
join semantics• cannot give partial results without ensuring that match
exists
aggregates: over-estimate, even with outer-join semantics• update early, and compensate later
April 19, 2023 Interactive Query Processing 23
Dynamic query plans
Eddy [Avnur and Hellerstein 2000]• router for directing data thru modules• dynamically chooses join and selection order• all partial tuples generatable
original routing policy: optimize completion time our focus: continual partial results, adaptation to
user preferences
Eddy
R S
T
April 19, 2023 Interactive Query Processing 24
Eddy Routing
Eddy must decidea) what tuple to route nextb) where to route it
based on user preferences, and source properties
need routing policy and a reordering mechanism• eddy memory and module queues bounded
. . . modules
R PS . . . inputs
Eddy
April 19, 2023 Interactive Query Processing 25
Prioritizing tuples
perform online reordering within Eddy• all incoming tuples placed in reorderer• when space available on queues to modules, Eddy takes tuple from
reorderer and routes it to modules
best-effort automatically reorder before slow modules
. . . modules
R PS . . . inputs
copy
April 19, 2023 Interactive Query Processing 26
Routing and reordering policy
GOAL: at any time, route to max. dQOF/dt = benefit of sending tuple to module / cost
cost: estimate data rates to/from module benefit: dependent on application and UI
• how partial results impact the UI• user preferences
QOFGOAL:
tim
e
April 19, 2023 Interactive Query Processing 27
Outline
motivation and context online reordering for user prioritization of partial
results architecture for generating more aggressive
partial resultspolicy for generating partial results
• Telegraph UI: displaying results and inferring preferences• impact on routing • experimental results
related work/work-in-progress/future work
April 19, 2023 Interactive Query Processing 28
Telegraph UI
screenshot
April 19, 2023 Interactive Query Processing 29
Telegraph UI
partial results displayed on screen as they arrive values clustered into groups
• can roll-up or drill-down to see different granularities
client has hash table mapping groups to values navigation
• vertical/horizontal scrolling• column (un)hiding• rollup/drilldown
different columns visible at different drilldown levels
April 19, 2023 Interactive Query Processing 30
Getting user preferences (1)
infer from navigation• row/column scrolling, group drill down and rollup
prioritize visible rows/columns
• “query evolution” subset “one-size-fits-all” queries future work: query expansion
explicit• at the cell level -- need for some expensive sources• can also have up/down buttons on columns
April 19, 2023 Interactive Query Processing 31
Getting user preferences (2)
preferences:• col priorities• {<predicate, row priority, row-col priorities>… }
e.g. {<1, 1.0, none>, <2, 1.7, none>,
<3, 1.3, map=high>, <8/33014, 1.0, none>, <8/60610, 1.0, none>}
April 19, 2023 Interactive Query Processing 32
Benefit of a partial result
depends on user preferences benefit of updating a cell in output
• row priority x column weight x cell worthiness
• column weight = max(col priority, row-col priority)• incremental cell worthiness: how much does one extra
update add to the cell’s value quality of feedback enumerations -- 1 aggregations -- change in confidence interval
informing user about execution progress• map cell worthiness to foreground color value
April 19, 2023 Interactive Query Processing 33
Benefit of Routing a Tuple route tuple according to expected benefit and cost
benefit of sending a tuple t to a module M and forming set T = cells c T benefit of updating c
estimate fanout by sampling previous values -- selectivity estimation
t M T
April 19, 2023 Interactive Query Processing 34
Granularity of Reordering/Routing
reorder and route tuples at granularity of a group group = <base relations, predicate>
groups created and deleted dynamically, as user navigates in UI
group predicate may be a subset of application-specified row predicate• final row depends on values this tuple joins with
April 19, 2023 Interactive Query Processing 35
Partial results, no reordering
Bush Contributors Income Crime Ratings
800200 400 6000
10000
20000
40000
30000
50000
time (s)
tota
l w
ort
hin
ess
with p
artia
l res
ults
without partial results
April 19, 2023 Interactive Query Processing 36
Partial results, no reordering
0 100 200 300
10000
20000
Bush Contributors Income Crime Ratings
time(s)
tota
l w
ort
hin
ess
100 200 3000
1000delay
with partial results
without
partial results
April 19, 2023 Interactive Query Processing 37
Results with row-wise reordering
scrolling: {AZ, AR, CA CO, CT}, LA,KY,MA,MD,MI}
0
100
40 80
Ag
gr.
Up
date
s
time
0
0.24
groups
rel. e
rror
April 19, 2023 Interactive Query Processing 38
Prioritizing particular columns
previous graph: Income updates much faster than Crime Ratings• can we prioritize Crime Ratings?
increase number of threads to probe it
if eddy sends na tuples to A, nb to B, per secondna <= threadsa/latency(A), nb <= threadsb/latency(B)
threadsa + threadsb <= Number of Threads Per Query
max. na worthiness(A) + nb worthiness(B)
more generally, optimize for multiple resources• good citizenship • costs
April 19, 2023 Interactive Query Processing 39
Outline
motivation and context online reordering for user prioritization of partial
results architecture for generating more aggressive
partial results policy for generating partial results conclusions and future work
April 19, 2023 Interactive Query Processing 40
Related Work
IR work• ranked retrieval, relevance feedback• search strategies, Berry Picking
making operators pipelining (Ripple Join, XJoin, Tuqwila) precomputed summaries (OLAP, materialized views, AQUA) top N / fast-first query processing
parachute queries, union queries
adaptivity• dynamic query plans• mid-query reoptimization (KD’98, IFF+99)• competition (DEC Rdb)
April 19, 2023 Interactive Query Processing 41
Summary query processor should care about how
user/application uses results online reordering effective way of supporting
dynamic user control give partial results as user wishes by embedding
reordering within dynamically controlled dataflow
system flexibility helpful hard to map these user-interaction needs into
concrete algorithm performance goals• wanted: benchmarks based on user/application traces
For more information:http://telegraph.cs.berkeley.edu/, http://control.cs.berkeley.edu/
April 19, 2023 Interactive Query Processing 42
Future Work
session-granularity query processing• query evolution• lazy evaluation: user/client navigates through partial results• closer dbms-application interaction
reduced operator set querying• execute query by appropriate routing through data sources
and state modules
April 19, 2023 Interactive Query Processing 43
Lessons Learned query processor should care about how
user/application uses results
system flexibility helpful
hard to map these user-interaction needs intoconcrete algorithm performance goals• wanted: benchmarks based on user/application traces
For more information:• http://telegraph.cs.berkeley.edu/
• http://control.cs.berkeley.edu/
April 19, 2023 Interactive Query Processing 44
relational operators: logical astractions• encapsulate multiple physical effects
inflexible in handling unexpected changes• information hiding
operator speeds resource consumption
• work sharing competitive access paths
• inflexible tuple routing suppose user asks for SQ tuples suppose there is a delay in P
want to encapsulate at level of physical operators: data sources, data structures
Granularity of Query Operators
R SP
Q
R
Eddy
S
Q
RS RS-PQ PQ PQij
April 19, 2023 Interactive Query Processing 45
State Modules
store state in separate State Modules (SteMs)• SteM like an index: supports builds and probes• unifies caches, join state, rendezvous buffers
Eddy routes tuples through SteMs and data sources• gets results from sources, or probe/build into SteMs
joins and data access performed in the process of routing
R S P Q
Q R
Eddy
SteMP SteMQSteMSSteMR
S P
QR
Eddy
S
Q
RS RS-PQ PQ PQij
April 19, 2023 Interactive Query Processing 46
HCI Motivation: Why Interactivity?
Berry picking (Bates ‘90, ‘93)
user studies• decision support (O’day and Jeffries ‘93)• relevance feedback (Koenemann and Belkin ‘96)
April 19, 2023 Interactive Query Processing 47
BACKUP: Disk data layout during juggling
performance goal:• favor early results: optimize Phase 1, at expense of Phase 2• spool : sequential I/O, enrich: random I/O + reordering in Phase 2 much easier than in Phase 1
enrich needs approx. index on side-disk• have hash-index on tuples, according to user interest• done at “group” granularity
April 19, 2023 Interactive Query Processing 48
BACKUP: Other applications of reordering
can add reorder to any dataflow
will later discuss application for query systems that give aggressive partial results
also useful in batch query processing • sorting often used in query plans for performance gains• can replace by best-effort reordering• little performance hit, but plan is now pipelining• will not discuss further in this talk
spreadsheets
April 19, 2023 Interactive Query Processing 49
BACKUP: Estimators for rowPriority, incremental
worthiness easy, unless output row depends on new values
• currently use average of all possible rows• could estimate a distribution instead
April 19, 2023 Interactive Query Processing 50
BACKUP: Interactive query processing with non-pipelining operators
basic techniques independent of pipelined operators. • reordering --- can be used in general query plans,
although effectiveness may be hindered by blocking operators
• State Modules applicable in general -- helps with adaptivity
much work on making plan operators pipelining -- ripple join, xjoin, tuqwila• reordering itself can be used to avoid blocking sorts in
some cases
April 19, 2023 Interactive Query Processing 51
Query processing using SteMs (1) SteM operations -- works like index on a table
• build(tuple): add tuple to SteM• probe(tuple): find all matches among build values
bounce back tuple if all matches not found and probe tuple not cached elsewhere
performing joins by routing through SteMs• index joins
regular synchronous index joins easy -- have no state over distributed sources, lookups cached by building into SteM
• separating cache allows Eddy to distinguish access cost
• asynchronous index joins [GW’00] helps for high-throughput Internet sources SteM acts as rendezvous buffer for
pending probe resultsR
Eddy
SteMSSteMR
S
R bl
d.
S bl
d.R pr
obe
S pr
obe
R
April 19, 2023 Interactive Query Processing 52
Query processing using SteMs (2) performing joins by routing through SteMs (contd)
• doubly pipelined hash join• use two SteMs, one on each source
can also do extensions (e.g. Tuqwila [IFF+99]), and even non-pipelined joins, using appropriate SteMs
thus can simulate any static query plan by appropriate routing
R
Eddy
SteMSSteMR
R bl
d.
S bl
d.R pr
obe
S pr
obe
S
April 19, 2023 Interactive Query Processing 53
Query processing using SteMs (3) how to do this in general?
• don’t want Eddy to be a super-join• want it to route as per user desires and source properties,
independent of any join algorithm
routing constraint:• each tuple must have a non-zero probability of being sent to
every appropriate module• can avoid repeated looping by marking tuples
Theorem: any routing that satisfies above constraint will produce all query results and terminate
could get duplicates• prune at application, or• enforce atomicity in tuple routing
April 19, 2023 Interactive Query Processing 54
Benefits of SteMs adaptivity
• join algorithm and access path selection are determined by eddy routing
• hence Eddy can dynamically choose these based on source properties
• in fact, Eddy can do hybrid join algorithms adapt to dynamic delays, etc.
work sharing -- for query evolution, competition better information to the Eddy
• speeds, memory consumption
April 19, 2023 Interactive Query Processing 55
BACKUP: Lazy Evaluation
User navigates thru partial results• can see only a few at a time
exploit to delay processing• do on demand on the region user currently navigating
over
things to delay• expensive source lookups (e.g. map)• joins (e.g. …)• any other formatting etc.
April 19, 2023 Interactive Query Processing 56
Backup: giving probabilistic results
view partial result as somethng that throws light on outer-join space• positive result: probabilistic • negative result: deterministic
benefit of positive result:• weighted sum of cells it is likely to show; penalty for false
result• repudiation of earlier false results
benefit of negative result:• direct benefit unlikely• repudiation of earlier false results
complications• many tuples may contribute to single output tuple --
aggregation
April 19, 2023 Interactive Query Processing 57
Probabilistic partial results
partial result => may not apply all filters
still want to show result probabilistically• filters ill-specified• data on Internet often inconsistent • show all partial results: “full disjunction” (Galindo-Legaria
‘94)
positive and negative results
Outp
ut
Space
Oute
r J o
i n S
pace
April 19, 2023 Interactive Query Processing 58
What does user see?
user navigates thru partial results
with complete row partial results:• can either explore these results in detail (sort/scroll along
columns as in spreadsheet)• compute approx. aggregates on them in online fashion
errors shrinking as more results come in
with probabilistic, incomplete results?• easy if primary key in partial result• what to show otherwise?
April 19, 2023 Interactive Query Processing 59
Quantifying benefit of partial results
partial result sheds light on outer-join space• positive result: probabilistic • negative result: deterministic
benefit of positive result:• weighted sum of cells it is likely to show• penalty for false result ?• repudiation of earlier false results ?
benefit of negative result:• direct benefit unlikely• repudiation of earlier false results ?
April 19, 2023 Interactive Query Processing 60
Why probabilistic results are good
data on Internet is often inconsistent• hence even exact processing cannot give perfect answers
people are used to sloppy answers can allow negations in expert interfaces only
April 19, 2023 Interactive Query Processing 61
Data growth vs. Computer Speedup Moore’s Law -- # of transistors/chip doubles every 18 months
(1965) data growth
0
500
1000
1500
2000
2500
3000
3500
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
Year
Petabytes
Sales
Moore'sLaw
Source: J. Porter, Disk/Trend, Inc. (http://www.disktrend.com/pdf/portrpkg.pdf)
April 19, 2023 Interactive Query Processing 62
Disk Appetite, contd.
Greg Papadopoulos, CTO Sun:• Disk sales doubling every 9 months
similar results from Winter VLDB survey
time to process all your data doubles every 18 months!