interactive query processing vijayshankar raman computer science division university of california...

Interactive Query Processing

Vijayshankar RamanComputer Science Division

University of California at Berkeley

April 19, 2023 Interactive Query Processing 2

Motivation: Nature of querying

querying = extracting information from data sets different techniques in different settings

intrinsically slow significant user-system interaction

info. seekers work iteratively, gradually refining requests based on feedback


Problems with traditional solutions

mismatch between system functionality and mode of HCI

black boxes• do batch processing• frustrating delays in iterative process

process

query

exactanswer


Interactive processing

HCI requirements• users must get continual feedback on results of processing• allow users to control processing based on prior feedback

performance goals• not to minimize time to give complete results• give continually improving partial results• adapt to dynamically specified performance goals

process

query

exactanswer


Context and Talk Outline query processing over structured data processing = dataflow through

pipelining operators

outline of rest of talksupport for dynamic user control in traditional query proc.

architectures (static plans)• architecture for giving more aggressive partial results• policy for generating partial results

client user interface impact of user actions on query execution

other related issues• interactive data cleaning (A-B-C)• adaptive query processing

R S P Q


Design goals in supporting user control

make minimal change to system architecture

must be independent of particular query processing algorithms

no delay in processing


Online Reordering (Raman et al. ’99,’00)

users perceive data being processed over time• prioritize processing for “interesting” tuples• interest based on user-specified preferences

reorder dataflow so that interesting tuples go first

encapsulate reordering as pipelined dataflow operator

scan

reorder in

dex sca

n


Framework for Online Reordering

want no delay in processing in general, reordering can only be best-effort

typically process/consume slower than produce• exploit throughput difference to reorder

two aspects• mechanism for best-effort reordering• reordering policy

acddbadb...

abcdabc..

reorder consumeproduceprocess

f(t)

user interest


Juggle mechanism for reordering

continually prefetch from input, spooling onto auxiliary side disk if needed

juggle data between buffer and side disk, to keep buffer full of “interesting” items• getNext chooses best item currently on buffer

reordering policy determines what getNext returns, and enrich/spool decisions

buffer

spoolfetch enrich

getNextjuggle

side diskproduce

process/consume


Reordering policies

quality of feedback for a prefix t1t2

…tk

QOF(UP(t1), UP(t2

), … UP(tk )), UP = user preference

• determined by application

goodness of reordering: dQOF/dt implication for juggle mechanism

• process gets item from buffer that increases QOF the most

• juggle tries to maintain buffer with such items

time

QOFGOAL: “good” permutation of

items t1…tn to t1…tn


One application of reordering online aggregation [Hellerstein et. al ‘97,

Hellerstein and Haas’99 ]• for SQL aggregate queries, give gradually improving

estimates • with confidence intervals

• allow users to speed up estimate refinement for groups of interest

• prioritize for processing at a per-group granularity

SELECT AVG(gpa) FROM studentsGROUP BY college


Online Aggregation Screenshot

SELECT AVG(gpa) FROM studentsGROUP BY college


QOF in Online Aggregation

avg weighted confidence interval preference acts as weight on confidence interval

• QOF = UPi / ni , ni = number of tuples processed from group i

process pulls items from group with max UPi / nini

desired ratio of group i tuples on buffer = UPi2/3/ UPj

2/3

• juggle tries to maintain this by enrich/spool


Other Quality of Feedback functions

rate of processing (for a group) preference • QOF = (ni - nUPi)

2 (variance from ideal proportions)

process pulls items from group with max (nUPi - ni )

desired ratio of group i tuples in buffer = UPi

will see more complex QOF later


Results: Reordering in Online Aggregation

implemented in Informix UDO experiments with modified TPC-D queries questions:

• how much throughput difference is needed for reordering• can we reorder handle skewed data

one stress test: skew, minimal proc. cost

• index-only join• 5 orderpriorities, zipf distribution

A B C D E 1 1/2 1/3 1/4 1/5

consume

process

scan

juggle

index

SELECT AVG(o_totalprice), o_orderpriorityFROM order WHERE exists ( SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)GROUP BY o_orderpriority


Performance results A B C D E

initial preferences 1 1 1 5 3

after T1 1 1 3.5 0.5 1

time

# t

up

les

pro

cess

ed

3 times faster for interesting groups 2% completion time overhead

E

C A

con

fid

en

ce

inte

rval

time


Findings

higher processing costs (index/hash join, subquery, …) make reordering easy

at very low processing costs, juggle constrained bydensity of interesting tuples in source• outlier groups hard to speed up

• better to use index on reordering column [HHW ‘97] reordering becomes easier over time

questions to answer:• where to place juggle• are groups fixed statically


Outline

motivation and context support for dynamic user control in traditional

query proc. architectures architecture for giving more aggressive partial

results policy for generating partial results

• user interface for displaying partial results • impact on routing

related work, work-in-progress, future work


Generating partial results traditional arch. also generate continual result tuples

• arises from continual dataflow thru pipelining operators• much work on pipelining joins [WA91, HH99, IFF+99, UF00]

this is too rigid • especially in distributed envirorments

Resu

l t S

pace


Context: Query processing in Telegraph

Telegraph: adaptive dataflow system to query diverse, distributed sources

much data available as services over Internet• currently only accessible by browse/search/forms

want to semi-automatically combine this data

examples:• campaign finance information (fff.cs.berkeley.edu)

multiple donor lists, home prices, census info., crime ratings, maps, celebrity lists, legislative records …

• restaurant information lists, reviews, addresses, maps, health inspection reports,

nutrition information …


More aggressive partial results complete results too rigid

• source latencies high, diverse• dynamic source variations, delays• query does not capture user desires

want to process queries flexibly, giving results asap• adapt dynamically to user preferences and source variations• allow user feedback to refine partial results


Correctness of partial results must contain some essential columns

• group-by columns / sort-by columns

full disjunction/outer-join semantics• appropriate for many Internet data sources

join semantics• cannot give partial results without ensuring that match

exists

aggregates: over-estimate, even with outer-join semantics• update early, and compensate later


Dynamic query plans

Eddy [Avnur and Hellerstein 2000]• router for directing data thru modules• dynamically chooses join and selection order• all partial tuples generatable

original routing policy: optimize completion time our focus: continual partial results, adaptation to

user preferences

Eddy

R S

T


Eddy Routing

Eddy must decidea) what tuple to route nextb) where to route it

based on user preferences, and source properties

need routing policy and a reordering mechanism• eddy memory and module queues bounded

. . . modules

R PS . . . inputs

Eddy


Prioritizing tuples

perform online reordering within Eddy• all incoming tuples placed in reorderer• when space available on queues to modules, Eddy takes tuple from

reorderer and routes it to modules

best-effort automatically reorder before slow modules

. . . modules

R PS . . . inputs

copy


Routing and reordering policy

GOAL: at any time, route to max. dQOF/dt = benefit of sending tuple to module / cost

cost: estimate data rates to/from module benefit: dependent on application and UI

• how partial results impact the UI• user preferences

QOFGOAL:

tim

e


Outline

motivation and context online reordering for user prioritization of partial

results architecture for generating more aggressive

partial resultspolicy for generating partial results

• Telegraph UI: displaying results and inferring preferences• impact on routing • experimental results

related work/work-in-progress/future work


Telegraph UI

screenshot


Telegraph UI

partial results displayed on screen as they arrive values clustered into groups

• can roll-up or drill-down to see different granularities

client has hash table mapping groups to values navigation

• vertical/horizontal scrolling• column (un)hiding• rollup/drilldown

different columns visible at different drilldown levels


Getting user preferences (1)

infer from navigation• row/column scrolling, group drill down and rollup

prioritize visible rows/columns

• “query evolution” subset “one-size-fits-all” queries future work: query expansion

explicit• at the cell level -- need for some expensive sources• can also have up/down buttons on columns


Getting user preferences (2)

preferences:• col priorities• {<predicate, row priority, row-col priorities>… }

e.g. {<1, 1.0, none>, <2, 1.7, none>,

<3, 1.3, map=high>, <8/33014, 1.0, none>, <8/60610, 1.0, none>}


Benefit of a partial result

depends on user preferences benefit of updating a cell in output

• row priority x column weight x cell worthiness

• column weight = max(col priority, row-col priority)• incremental cell worthiness: how much does one extra

update add to the cell’s value quality of feedback enumerations -- 1 aggregations -- change in confidence interval

informing user about execution progress• map cell worthiness to foreground color value


Benefit of Routing a Tuple route tuple according to expected benefit and cost

benefit of sending a tuple t to a module M and forming set T = cells c T benefit of updating c

estimate fanout by sampling previous values -- selectivity estimation

t M T


Granularity of Reordering/Routing

reorder and route tuples at granularity of a group group = <base relations, predicate>

groups created and deleted dynamically, as user navigates in UI

group predicate may be a subset of application-specified row predicate• final row depends on values this tuple joins with


Partial results, no reordering

Bush Contributors Income Crime Ratings

800200 400 6000

10000

20000

40000

30000

50000

time (s)

tota

l w

ort

hin

ess

with p

artia

l res

ults

without partial results


Partial results, no reordering

0 100 200 300

10000

20000

Bush Contributors Income Crime Ratings

time(s)

tota

l w

ort

hin

ess

100 200 3000

1000delay

with partial results

without

partial results


Results with row-wise reordering

scrolling: {AZ, AR, CA CO, CT}, LA,KY,MA,MD,MI}

0

100

40 80

Ag

gr.

Up

date

s

time

0

0.24

groups

rel. e

rror


Prioritizing particular columns

previous graph: Income updates much faster than Crime Ratings• can we prioritize Crime Ratings?

increase number of threads to probe it

if eddy sends na tuples to A, nb to B, per secondna <= threadsa/latency(A), nb <= threadsb/latency(B)

threadsa + threadsb <= Number of Threads Per Query

max. na worthiness(A) + nb worthiness(B)

more generally, optimize for multiple resources• good citizenship • costs


Outline

motivation and context online reordering for user prioritization of partial

results architecture for generating more aggressive

partial results policy for generating partial results conclusions and future work


Related Work

IR work• ranked retrieval, relevance feedback• search strategies, Berry Picking

making operators pipelining (Ripple Join, XJoin, Tuqwila) precomputed summaries (OLAP, materialized views, AQUA) top N / fast-first query processing

parachute queries, union queries

adaptivity• dynamic query plans• mid-query reoptimization (KD’98, IFF+99)• competition (DEC Rdb)


Summary query processor should care about how

user/application uses results online reordering effective way of supporting

dynamic user control give partial results as user wishes by embedding

reordering within dynamically controlled dataflow

system flexibility helpful hard to map these user-interaction needs into

concrete algorithm performance goals• wanted: benchmarks based on user/application traces

For more information:http://telegraph.cs.berkeley.edu/, http://control.cs.berkeley.edu/


Future Work

session-granularity query processing• query evolution• lazy evaluation: user/client navigates through partial results• closer dbms-application interaction

reduced operator set querying• execute query by appropriate routing through data sources

and state modules


Lessons Learned query processor should care about how

user/application uses results

system flexibility helpful

hard to map these user-interaction needs intoconcrete algorithm performance goals• wanted: benchmarks based on user/application traces

For more information:• http://telegraph.cs.berkeley.edu/

• http://control.cs.berkeley.edu/


relational operators: logical astractions• encapsulate multiple physical effects

inflexible in handling unexpected changes• information hiding

operator speeds resource consumption

• work sharing competitive access paths

• inflexible tuple routing suppose user asks for SQ tuples suppose there is a delay in P

want to encapsulate at level of physical operators: data sources, data structures

Granularity of Query Operators

R SP

Q

R

Eddy

S

Q

RS RS-PQ PQ PQij


State Modules

store state in separate State Modules (SteMs)• SteM like an index: supports builds and probes• unifies caches, join state, rendezvous buffers

Eddy routes tuples through SteMs and data sources• gets results from sources, or probe/build into SteMs

joins and data access performed in the process of routing

R S P Q

Q R

Eddy

SteMP SteMQSteMSSteMR

S P

QR

Eddy

S

Q

RS RS-PQ PQ PQij


HCI Motivation: Why Interactivity?

Berry picking (Bates ‘90, ‘93)

user studies• decision support (O’day and Jeffries ‘93)• relevance feedback (Koenemann and Belkin ‘96)


BACKUP: Disk data layout during juggling

performance goal:• favor early results: optimize Phase 1, at expense of Phase 2• spool : sequential I/O, enrich: random I/O + reordering in Phase 2 much easier than in Phase 1

enrich needs approx. index on side-disk• have hash-index on tuples, according to user interest• done at “group” granularity


BACKUP: Other applications of reordering

can add reorder to any dataflow

will later discuss application for query systems that give aggressive partial results

also useful in batch query processing • sorting often used in query plans for performance gains• can replace by best-effort reordering• little performance hit, but plan is now pipelining• will not discuss further in this talk

spreadsheets


BACKUP: Estimators for rowPriority, incremental

worthiness easy, unless output row depends on new values

• currently use average of all possible rows• could estimate a distribution instead


BACKUP: Interactive query processing with non-pipelining operators

basic techniques independent of pipelined operators. • reordering --- can be used in general query plans,

although effectiveness may be hindered by blocking operators

• State Modules applicable in general -- helps with adaptivity

much work on making plan operators pipelining -- ripple join, xjoin, tuqwila• reordering itself can be used to avoid blocking sorts in

some cases


Query processing using SteMs (1) SteM operations -- works like index on a table

• build(tuple): add tuple to SteM• probe(tuple): find all matches among build values

bounce back tuple if all matches not found and probe tuple not cached elsewhere

performing joins by routing through SteMs• index joins

regular synchronous index joins easy -- have no state over distributed sources, lookups cached by building into SteM

• separating cache allows Eddy to distinguish access cost

• asynchronous index joins [GW’00] helps for high-throughput Internet sources SteM acts as rendezvous buffer for

pending probe resultsR

Eddy

SteMSSteMR

S

R bl

d.

S bl

d.R pr

obe

S pr

obe

R


Query processing using SteMs (2) performing joins by routing through SteMs (contd)

• doubly pipelined hash join• use two SteMs, one on each source

can also do extensions (e.g. Tuqwila [IFF+99]), and even non-pipelined joins, using appropriate SteMs

thus can simulate any static query plan by appropriate routing

R

Eddy

SteMSSteMR

R bl

d.

S bl

d.R pr

obe

S pr

obe

S


Query processing using SteMs (3) how to do this in general?

• don’t want Eddy to be a super-join• want it to route as per user desires and source properties,

independent of any join algorithm

routing constraint:• each tuple must have a non-zero probability of being sent to

every appropriate module• can avoid repeated looping by marking tuples

Theorem: any routing that satisfies above constraint will produce all query results and terminate

could get duplicates• prune at application, or• enforce atomicity in tuple routing


Benefits of SteMs adaptivity

• join algorithm and access path selection are determined by eddy routing

• hence Eddy can dynamically choose these based on source properties

• in fact, Eddy can do hybrid join algorithms adapt to dynamic delays, etc.

work sharing -- for query evolution, competition better information to the Eddy

• speeds, memory consumption


BACKUP: Lazy Evaluation

User navigates thru partial results• can see only a few at a time

exploit to delay processing• do on demand on the region user currently navigating

over

things to delay• expensive source lookups (e.g. map)• joins (e.g. …)• any other formatting etc.


Backup: giving probabilistic results

view partial result as somethng that throws light on outer-join space• positive result: probabilistic • negative result: deterministic

benefit of positive result:• weighted sum of cells it is likely to show; penalty for false

result• repudiation of earlier false results

benefit of negative result:• direct benefit unlikely• repudiation of earlier false results

complications• many tuples may contribute to single output tuple --

aggregation


Probabilistic partial results

partial result => may not apply all filters

still want to show result probabilistically• filters ill-specified• data on Internet often inconsistent • show all partial results: “full disjunction” (Galindo-Legaria

‘94)

positive and negative results

Outp

ut

Space

Oute

r J o

i n S

pace


What does user see?

user navigates thru partial results

with complete row partial results:• can either explore these results in detail (sort/scroll along

columns as in spreadsheet)• compute approx. aggregates on them in online fashion

errors shrinking as more results come in

with probabilistic, incomplete results?• easy if primary key in partial result• what to show otherwise?


Quantifying benefit of partial results

partial result sheds light on outer-join space• positive result: probabilistic • negative result: deterministic

benefit of positive result:• weighted sum of cells it is likely to show• penalty for false result ?• repudiation of earlier false results ?

benefit of negative result:• direct benefit unlikely• repudiation of earlier false results ?


Why probabilistic results are good

data on Internet is often inconsistent• hence even exact processing cannot give perfect answers

people are used to sloppy answers can allow negations in expert interfaces only


Data growth vs. Computer Speedup Moore’s Law -- # of transistors/chip doubles every 18 months

(1965) data growth

0

500

1000

1500

2000

2500

3000

3500

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

Year

Petabytes

Sales

Moore'sLaw

Source: J. Porter, Disk/Trend, Inc. (http://www.disktrend.com/pdf/portrpkg.pdf)


Disk Appetite, contd.

Greg Papadopoulos, CTO Sun:• Disk sales doubling every 9 months

similar results from Winter VLDB survey

time to process all your data doubles every 18 months!

interactive query processing vijayshankar raman computer science division university of california...

Documents

processing slide

outline query processing

traditional query

query execution

results of processing

feedback slide

structured data processing

buffer reordering policy