Query Processing and Networking Infrastructures
Day 1 of 2
Joe Hellerstein, UC Berkeley
September 20, 2002
Two Goals
Day 1: Primer on query processing
  Targeted to networking/OS folk
  Bias: systems issues
Day 2: Seed some cross-fertilized research, especially with networking
  Thesis: dataflow convergence of query processing and routing
  Clearly other resonances here: dataflow HW architectures, event-based systems designs, ML and control theory, online algorithms
(Sub)Space of Possible Topics
Traditional Relational QP: Optimization & Execution
Parallel QP
Distributed & Federated QP
Boolean Text Search
Traditional Text Ranking
Hypertext Ranking
Indexing
Data Model & Query Language Design
NFNF Data Models (OO, XML, "Semistructured")
Online and Approximate QP
Visual Querying & Data Visualization
Statistical Data Analysis ("Mining")
Active DBs (Trigger Systems)
Media Queries, Feature Extraction & Similarity Search
Data Reduction
Transactional Storage & Networking
Compression
Data Streams & Continuous Queries
Adaptive QP
Likely Topics Here
Traditional Relational QP: Optimization & Execution
Parallel QP
Distributed & Federated QP
Boolean Text Search
Traditional Text Ranking
Hypertext Ranking
Indexing
Data Model & Query Language Design
NFNF Data Models (OO, XML, "Semistructured")
Online and Approximate QP
Visual Querying & Data Visualization
Statistical Data Analysis ("Mining")
Active DBs (Trigger Systems)
Media Queries, Feature Extraction & Similarity Search
Data Reduction
Compression
Data Streams & Continuous Queries
Adaptive QP
Plus Some Speculative Ones
Traditional Relational QP: Optimization & Execution
Parallel QP
Distributed & Federated QP
Boolean Text Search
Traditional Text Ranking
Indexing
Online and Approximate QP
Data Streams & Continuous Queries
Adaptive QP
Peer-to-Peer QP
Sensornet QP
Content Routing
Network Monitoring
Indirection Architectures
Outline
Day 1: Query Processing Crash Course Intro Queries as indirection How do relational databases run queries? How do search engines run queries? Scaling up: cluster parallelism and distribution
Day 2: Research Synergies w/Networking Queries as indirection, revisited Useful (?) analogies to networking research Some of our recent research at the seams Some of your research? Directions and collective discussion
Getting Off on the Right Foot
Roots: database and IR research
“Top-down” traditions (“applications”) Usually begins with semantics and models.
Common misconceptions:
  "Query processing = Oracle or Google." Need not be so heavyweight or monolithic! Many reusable lessons within.
  "IR search and DB querying are fundamentally different." Very similar from a query processing perspective; many similarities in other data models as well.
  "Querying is a synchronous, interactive process." Triggers, rules and "continuous queries" are not so different from plain old queries.
So… we'll go bottom-up
Focus on reusable building blocks
Attempt to be language- and model-agnostic; illustrate with various querying scenarios
Confession: Two Biases
Relational query engines
  Most mature and general query technology
  Best documented in the literature
  Conceptually general enough to "capture" most other models/schemes
Everybody does web searches
  So it's both an important app and an inescapable usage bias we carry around
  It will inform our discussion; it shouldn't skew it
Lots of other query systems/languages you can keep in mind as we go: LDAP, DNS, XSL/XPath/XQuery, Datalog
What Are Queries For? I
Obvious answer: search and analysis over big data sets
Search: select data of interest
  Boolean expressions over content, sometimes with an implicit ordering on results
Analysis: construct new information from base data
  Compute functions over each datum
  Concatenate related records (join)
  Partition into groups, summarize (aggregates)
Aside: "mining" vs. "querying"? As a rule of thumb, think of mining as WYGIWIGY.
Not the most general, powerful answer…
What Are Queries For? II
Queries bridge a (large!) level of indirection
  Declarative programming: what you want, not how to get it
  Easier to express
  Allows the "how" to change under the covers
A critical issue! Not just for querying: method invocation, data update, etc.
Motivation for this Indirection
Critical when rates of change differ across layers, in particular when
d(app)/dt << d(environment)/dt
  E.g. DB apps are used for years, decades (!!)
  E.g. networked environments: high rates of change (??)
DB lit calls this “data independence”
Data Independence: Background
Bad Old Days: hierarchical and "network" (yep!) data models
  Nesting & pointers mean that apps explicitly traverse data, and become brittle when data layouts change
Apps with persistent data have slow d(app)/dt
  And the database environments change faster!
  Logical changes to representation (schema)
  Physical changes in storage (indexes, layouts, HW)
DBs are often shared by multiple apps!
In the B.O.D., all apps had to be rewritten on change
It’s a SW Engineering Thing
Analogy: imagine if your C structs were to survive for decades
  You'd keep them very simple
  Encapsulation to allow future mods
Similar analogy to NWs
  Protocol simplicity is good
  Soft state is good (discourages hardcoded refs to transient resources)
But the fun systems part follows directly: achieve the goal with respectable performance over a dynamic execution environment
Codd’s Data Independence
Ted Codd, IBM, c. 1969 and forward; Turing Award 1981
Two layers of indirection:

  Applications
    |  logical independence: spanned by views and query rewriting
  Logical Representation (schema)
    |  physical independence: spanned by query optimization and execution
  Physical Representation (storage)
A More Architectural Picture
Declarative query over views
  -> Query Rewriter (bridges logical independence)
Declarative query over base tables
  -> Query Processor: Optimizer + Executor (bridges physical independence)
Query plan (procedural)
  -> Access Methods, via the Iterator API
N.B.: this classical QP architecture raises some problems. To be revisited!
Access Methods & Indexing
Access Methods
Base data access layer
Model: data stored in unordered collections (relations, tables), one type per collection
Interface: iterators
  Open(predicate) -> cursor
    Usually simple predicates: attribute op constant; op is usually arithmetic (<, >, =), though we'll see extensions (e.g. multi-d ops)
  Next(cursor) -> datum (of known type)
  Close(cursor)
  Insert(datum of correct type)
  Delete(cursor)
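The iterator interface above can be sketched in a few lines of Python; `HeapFile` and `Cursor` are hypothetical names, and an in-memory list stands in for a disk-resident collection.

```python
# Sketch of the access-method iterator API over an in-memory "heap file".
# Names (HeapFile, Cursor) are illustrative, not from any real engine.

class HeapFile:
    """Unordered collection of records of one type."""
    def __init__(self, records):
        self.records = list(records)

    def open(self, predicate=lambda r: True):
        """Open(predicate) -> cursor over matching records."""
        return Cursor(self.records, predicate)

class Cursor:
    def __init__(self, records, predicate):
        self.it = (r for r in records if predicate(r))

    def next(self):
        """Next(cursor) -> datum, or None at end-of-scan."""
        return next(self.it, None)

    def close(self):
        self.it = iter(())

# Drive a scan: employees with salary > 50
emps = HeapFile([("alice", 60), ("bob", 40), ("carol", 75)])
cur = emps.open(lambda r: r[1] > 50)
results = []
while (rec := cur.next()) is not None:
    results.append(rec)
cur.close()
```

Note how the predicate is pushed into the access layer, saving cross-layer calls for non-matching records.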
Typical Access Methods
"Heap" files: unordered array of records, usually sequential on disk; predicates just save cross-layer costs
Traditional index AMs: B-trees
  Actually "B+"-trees: all data at leaves; can scan across leaves for range search
  Predicates (<, >, =, between) result in fewer I/Os
  Random I/Os (at least to find the beginning of a range)
Linear Hash index Litwin ‘78. Supports equality predicates only.
This is it for IR and standard relational DBs Though when IR folks say “indexing”, they sometimes mean all
of query processing
Primary & Secondary Indexes
[Diagram: in a primary index, the directory sits directly over the data pages; in a secondary index, the directory levels hold (key, ptr) pairs that point into separately stored data.]
An Exotic Forest of Search Trees
Multi-dimensional indexes For geodata, multimedia search, etc. Dozens! E.g. R-tree family, disk-based Quad-Trees,
kdB-trees And of course “linearizations” with B-trees
Path indexes For XML and OO path queries E.g. Xfilter
Etc. Lots of one-off indexes, often many per workload No clear winners here Extensible indexing scheme would be nice
Generalized Search Trees (GiST)
What is a (tree-based) DB index? Typically:
  A clustering of data into leaf blocks
  Hierarchical summaries (subtree predicates, SPs) for the pointers in directory blocks
[Hellerstein et al., VLDB 95]
Generalized Search Trees (GiST)
Can realize that abstraction with a simple interface: user registers opaque SP objects with a few methods
  Consistent(q, p): should query q traverse the subtree under p?
  Penalty(d, p): how bad is it to insert d below p?
  Union(p1, p2): form an SP that includes p1 and p2
  PickSplit({p1, …, pn}): partition SPs into 2 groups
Tree maintenance, concurrency, and recovery all doable under the covers
Covers many popular multi-dimensional indexes, most of which had no concurrency/recovery story
http://gist.cs.berkeley.edu
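A toy illustration of the four extension methods, using 1-D intervals as subtree predicates (roughly what a B+-tree's key ranges look like under GiST); the class and method names mirror the slide, not the actual libgist API.

```python
# Sketch: the four GiST extension methods for 1-D interval "subtree
# predicates" (SPs). Class and method names follow the slide.

class IntervalSP:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, q):
        """Consistent(q, p): should query interval q traverse this subtree?"""
        return not (q.hi < self.lo or q.lo > self.hi)

    def penalty(self, d):
        """Penalty(d, p): how bad is it to insert value d below this SP?
        (Here: how far the interval must grow.)"""
        return max(0, self.lo - d) + max(0, d - self.hi)

    @staticmethod
    def union(p1, p2):
        """Union(p1, p2): form an SP that covers both."""
        return IntervalSP(min(p1.lo, p2.lo), max(p1.hi, p2.hi))

    @staticmethod
    def pick_split(sps):
        """PickSplit: partition SPs into 2 groups (naive sort-and-halve)."""
        sps = sorted(sps, key=lambda p: p.lo)
        mid = len(sps) // 2
        return sps[:mid], sps[mid:]

q = IntervalSP(5, 7)
hit = IntervalSP(0, 6).consistent(q)     # overlaps the query -> traverse
miss = IntervalSP(10, 20).consistent(q)  # disjoint -> prune the subtree
pen = IntervalSP(0, 6).penalty(9)        # inserting 9 grows the SP by 3
u = IntervalSP.union(IntervalSP(0, 6), IntervalSP(10, 20))
```

The tree machinery (search, insert, split) never looks inside these objects; it only calls the four methods, which is how one codebase covers B+-tree-like and R-tree-like behavior.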
Some Additional Indexing Tricks
Bitmap indexing
  Many matches per value in a (secondary) index? Rather than storing pointers to the heap file in the leaves, store a bitmap of matches in a (sorted) heap file.
  Only works if file reorganization is infrequent
  Can make intersection, COUNT, etc. quicker during query processing
  Can mix/match bitmaps and lists in a single index
  Works with any (secondary) index with duplicate matches
"Vertical partitioning" / "columnar storage"
  Again, for sorted, relatively static files
Bit-slice indexes
[O’Neil/Quass, SIGMOD 97]
Query Processing Dataflow Infrastructures
Dataflow Infrastructure
Dataflow abstraction is very simple “box-and-arrow” diagrams (typed) collections of objects flow along edges
Details can be tricky “Push” or “Pull”?
More to it than that How do control-flow and
dataflow interact? Where does the data live?
Don’t want to copy data If passing pointers, where does
the “real” data live?
Iterators
Most uniprocessor DB engines use iterators
  Open() -> cursor
  Next(cursor) -> typed record
  Close(cursor)
Simple and elegant
  Control-flow and dataflow coupled
  Familiar single-threaded, procedure-call API
  Data refs passed on the stack, no buffering
Blocking-agnostic
  Works with blocking ops, e.g. sort
  Works with pipelined ops
Note: well-behaved iterators "come up for air" in inner loops, e.g. for interrupt handling
Where is the In-Flight Data?
In a standard DBMS, raw data lives in disk format in a shared Buffer Pool
Iterators pass references to the BufPool
  A tuple "slot" per iterator input
  Never copy along edges of the dataflow
  Join results are arrays of refs to base tables
Operators may "pin" pages in the BufPool
  BufPool never replaces pinned pages
  Ops should release pins ASAP (esp. across Next() calls!!)
Some operators copy data into their internal state
  Can "spill" this state to private disk space
Weaknesses of Simple Iterators
Evolution of uniprocessor archs to parallel archs, esp. "shared-nothing" clusters
  Opportunity for pipelined parallelism
  Opportunity for partition parallelism: take a single "box" in the dataflow and split it across multiple machines
Problems with iterators in this environment
  Spoils the pipelined parallelism opportunity
  Polling (Next()) across the network is inefficient: nodes sit idle until polled, and during comm
  A blocking producer blocks its consumer
But we would like to keep the iterator abstraction
  Especially to save legacy query processor code
  And to simplify debugging (single-threaded, synchronous)
Exchange
Encapsulate partition parallelism & asynchrony
Keep the iterator API between ops
The Exchange operator partitions input data by content, e.g. on join or sort keys
Note the basic architectural idea! Encapsulate dataflow tricks in operators, leaving the infrastructure untouched
We'll see this again next week, e.g. in Eddies
[Graefe, SIGMOD 90]
Exchange Internals
Really 2 operators, XIN and XOUT
  XIN is the "top" of the producer plan: it pulls, pushing results to the XOUT queue
  XOUT spins on its local queue
One thread becomes two: producer graph & XIN; consumer graph & XOUT
A routing table/fn in XIN supports partition parallelism, e.g. for parallel sort, join, etc.
Producer and consumer see the iterator API; the queue + thread barrier turns NW-based "push" into iterator-style "pull"
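A minimal sketch of the XIN/XOUT split, assuming Python threads and queues stand in for processes and network links: XIN pulls from the producer and pushes tuples to per-partition queues via a routing function, while each XOUT turns that push back into an iterator-style pull.

```python
# Sketch of Exchange: XIN pulls from a producer iterator and pushes
# tuples into per-partition queues by a routing function; each XOUT
# spins on its local queue and presents a pull-style Next() to its
# consumer. The 2-way partitioning is illustrative.
import queue
import threading

DONE = object()  # end-of-stream marker

def xin(producer, queues, route):
    """Top of the producer plan: pull locally, push to partitions."""
    for tup in producer:
        queues[route(tup)].put(tup)
    for q in queues:
        q.put(DONE)

class XOut:
    """Bottom of a consumer plan: the queue turns push back into pull."""
    def __init__(self, q):
        self.q = q
    def next(self):
        tup = self.q.get()        # blocks until data (or DONE) arrives
        return None if tup is DONE else tup

queues = [queue.Queue(), queue.Queue()]
producer = iter(range(10))
t = threading.Thread(target=xin, args=(producer, queues, lambda x: x % 2))
t.start()

evens = []                         # consumer of partition 0
out0 = XOut(queues[0])
while (tup := out0.next()) is not None:
    evens.append(tup)
t.join()
```

The consumer code is an ordinary iterator loop; all the partitioning and threading is hidden inside the operator, which is exactly the encapsulation point the slide makes.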
Exchange Benefits?
Remember the iterator limitations?
  "Spoils pipelined parallelism opportunity": solved by the Exchange thread boundary
  "Polling (Next()) across the network is inefficient": solved by XIN pushing to the XOUT queue
  "A blocking producer blocks its consumer": still a problem!
Exchange Limitations
Doesn't allow consumer work to overlap with blocking producers
  E.g. streaming data sources, events
  E.g. sort, some join algorithms
The entire consumer graph blocks if the XOUT queue is empty
  Control flow is coupled to dataflow, so XOUT won't return without data
  The queue is encapsulated from the consumer
But … Note that exchange model is fine for most
traditional DB Query Processing May need to be extended for new settings…
Fjords
Thread of control per operator; queues between operators; asynch or synch calls
  Can do asynch poll-and-yield iteration in each operator (for both consumer and producer)
  Or can do synchronous get_next iteration
Can get traditional behavior if you want:
  Synch polls + queue of size 1 = Iterators
  Synch consumer, asynch producer = Exchange
Asynch calls solve the blocking problem of Exchange
[Madden/Franklin, ICDE 01]
Fjords
Disadvantages: Lots of “threads”
Best done in an event-programming style, not OS threads
Operators really have to “come up for air” (“yield”) Need to write your own scheduler Harder to debug
But: Maximizes flexibility for operators at the
endpoints Still provides a fairly simple interface for
operator-writers
Basic Relational Operators and Implementation
Relational Algebra Semantics
Selection: σ_p(R) returns all rows in R that satisfy p
Projection: π_C(R) returns all rows in R projected to the columns in C
  In the strict relational model, remove duplicate rows; in SQL, preserve duplicates (multiset semantics)
Cartesian product: R × S
Union: R ∪ S
Difference: R − S
  Note: R, S must have matching schemata
Join: R ⋈_p S = σ_p(R × S)
Missing: Grouping & Aggregation, Sorting
Operator Overview: Basics
Selection Typically “free”, so “pushed down” Often omitted from diagrams
Projection In SQL, typically “free”, so “pushed down”
No duplicate elimination Always pass the minimal set of columns downstream
Typically omitted from diagrams
Cartesian Product Unavoidable nested loop to generate output
Union: Concat, or concat followed by dup. elim.
Operator Overview, Cont.
Unary operators: Grouping & Sorting Grouping can be done with hash or sort schemes
(as we’ll see)
Binary matching: Joins/Intersections Alternative algorithms:
Nested loops Loop with index lookup (Index N.L.) Sort-merge Hash Join
Don’t forget: have to write as iterators Every time you get called with Next(), you adjust
your state and produce an output record
Unary External Hashing
E.g. GROUP BY, DISTINCT
Two hash functions, hc (coarse) and hf (fine)
Two phases:
  Phase 1: for each tuple of input, hash via hc into a "spill" partition to be put on disk
  B-1 blocks of memory used to hold output buffers, for writing a block at a time per partition
[Diagram: the original relation streams through B main-memory buffers; hash function hc routes tuples among B-1 output partitions spilled to disk.]
[Bratbergsengen, VLDB 84]
Unary External Hashing
[Diagram: each partition Ri (k < B pages) is read back from disk and hashed via hf into an in-memory hash table; results stream through an output buffer.]
Phase 2: for each partition, read off disk and hash into a main-memory hashtable via hf
For DISTINCT, when you find a value already in the hashtable, discard the copy
For GROUP BY, associate some aggregate state (e.g. a running SUM) with each group in the hash table, and maintain it as tuples stream in
External Hashing: Analysis
To utilize memory well in Phase 2, would like each partition to be ~B blocks big
Hence works in two phases when B >= √|R|
  Same requirement as external sorting!
Else can recursively partition the partitions in Phase 2
Can be made to pipeline, to adapt nicely to small data sets, etc.
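A sketch of the two-phase scheme for GROUP BY SUM, with in-memory lists standing in for on-disk spill partitions; `hc` is the coarse partitioning hash, and a Python dict plays the role of the fine hash table built via hf.

```python
# Sketch of two-phase external hashing for GROUP BY SUM. "Disk spill"
# partitions are simulated with in-memory lists.
B = 4  # pretend we have B buffers -> B-1 spill partitions

def hc(key):
    """Coarse hash: choose a spill partition."""
    return hash(key) % (B - 1)

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Phase 1: route each input tuple to a spill partition via hc.
partitions = [[] for _ in range(B - 1)]
for key, val in rows:
    partitions[hc(key)].append((key, val))

# Phase 2: per partition, build an in-memory hash table (the "hf" table)
# and maintain the running aggregate state (here a SUM) per group.
result = {}
for part in partitions:
    groups = {}
    for key, val in part:
        groups[key] = groups.get(key, 0) + val
    result.update(groups)   # a given key lives in exactly one partition
```

Correctness hinges on hc sending all tuples of a group to the same partition, so each group is fully aggregated within a single Phase-2 pass.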
Hash Join (GRACE)
Phase 1: partition each relation on the join key with hc, spilling to disk
Phase 2: build each partition of the smaller relation into a hashtable via hf; scan the matching partition of the bigger relation, and for each tuple probe the hashtable via hf for matches
Would like each partition of the smaller relation to fit in memory
  So works well if B >= √|smaller|
  Size of the bigger is irrelevant!! (Vs. sort-merge join)
Popular optimization: hybrid hash join
  Partition #0 doesn't spill -- it builds and probes immediately
  Partitions 1 through n use the rest of memory for output buffers
[DeWitt/Katz/Olken/Shapiro/Stonebraker/Wood, SIGMOD 84]
[Fushimi, et al., VLDB 84]
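The two phases of GRACE hash join can be sketched as follows, with lists standing in for spilled disk partitions and the smaller relation R as the build side:

```python
# Sketch of GRACE hash join on R.key = S.key. Phase 1 partitions both
# inputs by hc; Phase 2 builds a hash table on each partition of the
# smaller input (R) and probes it with the matching partition of S.
NPART = 3

def hc(key):
    return hash(key) % NPART

R = [(1, "r1"), (2, "r2"), (3, "r3")]
S = [(2, "s2"), (3, "s3"), (3, "s3b"), (4, "s4")]

# Phase 1: spill both relations into partitions on the join key.
r_parts = [[] for _ in range(NPART)]
s_parts = [[] for _ in range(NPART)]
for t in R:
    r_parts[hc(t[0])].append(t)
for t in S:
    s_parts[hc(t[0])].append(t)

# Phase 2: build on R's partition, probe with S's matching partition.
# Only matching partition pairs need to meet -- the source of the
# B >= sqrt(|smaller|) memory requirement.
out = []
for rp, sp in zip(r_parts, s_parts):
    table = {}
    for key, rval in rp:
        table.setdefault(key, []).append(rval)
    for key, sval in sp:
        for rval in table.get(key, []):
            out.append((key, rval, sval))
```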
Hash-Join
[Diagram: Phase 1 streams both original relations through hash function hc into B-1 disk partitions each; Phase 2 builds a hash table on partition Ri (k < B-1 pages) via hf and probes it with an input buffer over the matching partition Si, streaming join results through an output buffer.]
Symmetric Hash Join
Pipelining, in-core variant
Build and probe symmetrically
Correctness: each output tuple is generated when its last-arriving component appears
Can be extended to the out-of-core case
  Tukwila [Ives & Halevy, SIGMOD '99]
  XJoin: spill and read partitions multiple times; correctness guaranteed by timestamping tuples and partitions [Urhan & Franklin, DEBull '00]
[Mikkilineni & Su, TOSE 88] [Wilschut & Apers, PDIS 91]
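A sketch of the in-core symmetric variant: one hash table per input, and every arriving tuple builds into its own table and then immediately probes the other, so neither input ever blocks the output.

```python
# Sketch of the in-core symmetric (pipelining) hash join. Each output
# tuple is emitted the moment its last-arriving component appears.

def symmetric_hash_join(stream):
    """stream yields (side, key, value) in arrival order, side in {'R','S'}."""
    tables = {"R": {}, "S": {}}
    for side, key, val in stream:
        other = "S" if side == "R" else "R"
        tables[side].setdefault(key, []).append(val)   # build into own table
        for match in tables[other].get(key, []):       # probe the other table
            r, s = (val, match) if side == "R" else (match, val)
            yield (key, r, s)   # normalized as (key, r_value, s_value)

arrivals = [("R", 1, "r1"), ("S", 1, "s1"), ("S", 2, "s2"), ("R", 2, "r2")]
out = list(symmetric_hash_join(arrivals))
```

Because no phase waits for an input to finish, this works over unbounded streams; the cost is keeping both hash tables resident (hence the out-of-core extensions above).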
Relational Query Engines
SELECT [DISTINCT] <output expressions>
FROM <tables>
[WHERE <predicates>]
[GROUP BY <gb-expression> [HAVING <h-predicates>]]
[ORDER BY <expression>]
A Basic SQL primer
Join tables in FROM clause applying predicates in WHERE clause
If GROUP BY, partition results by GROUP And maintain aggregate output expressions per group Delete groups that don’t satisfy HAVING clause
If ORDER BY, sort output accordingly
Examples
Single-table S-F-W
DISTINCT, ORDER BY
Multi-table S-F-W, and self-joins
Scalar output expressions
Aggregate output expressions, with and without DISTINCT
GROUP BY, HAVING
Nested queries, uncorrelated and correlated
A Dopey Query Optimizer
For each S-F-W query block, create a plan that:
  Forms the Cartesian product of the FROM clause
  Applies the WHERE clause
Incredibly inefficient: huge intermediate results!
Then, as needed:
  Apply the GROUP BY clause
  Apply the HAVING clause
  Apply any projections and output expressions
  Apply duplicate elimination and/or ORDER BY
An Oracular Query Optimizer
For each possible correct plan: Run the plan (infinitely fast) Measure its performance in reality
Pick the best plan, and run it in reality
A Standard Query Optimizer
Three aspects to the problem Legal plan space (transformation
rules) Cost model Search Strategy
Plan Space
Many legal algebraic transformations, e.g.: Cartesian product followed by selection can be
rewritten as join Join is commutative and associative
Can reorder the join tree arbitrarily NP-hard to find best join tree in general
Selections should (usually) be “pushed down” Projections can be “pushed down”
And “physical” choices Choice of Access Methods Choice of Join algorithms Taking advantage of sorted nature of some streams
Complicates Dynamic Programming, as we’ll see
Cost Model & Selectivity Estimation
Cost of a physical operator can be modeled fairly accurately: E.g. number of random and sequential I/Os Requires metadata about input tables:
Number of rows (cardinality) Bytes per tuple (physical schema)
In a query pipeline, metadata on intermediate tables is trickier Cardinality? Requires “selectivity” (COUNT) estimation
Wet-finger estimates Histograms, joint distributions and other summaries Sampling
Search Strategy
Dynamic Programming Used in most commercial systems IBM’s System R [Selinger, et al. SIGMOD 79]
Top-Down Branch and bound with memoization Exodus, Volcano & Cascades [Graefe, SIGMOD 87,
ICDE 93, DEBull 95] Used in a few commercial systems (Microsoft SQL
Server, especially)
Randomized Simulated Annealing, etc. [Ioannidis & Kang
SIGMOD 90]
Dynamic Programming
Use the principle of optimality: any subtree of the optimal plan is itself optimal for its sub-expression
Plans enumerated in N passes (if N relations joined):
  Pass 1: find the best 1-relation plan for each relation
  Pass 2: find the best way to join the result of each 1-relation plan (as outer) to another relation (all 2-relation plans)
  Pass N: find the best way to join the result of an (N-1)-relation plan (as outer) to the Nth relation (all N-relation plans)
This gives all left-deep plans. Generalization is easy…
A wrinkle: physical properties (e.g. sort orders) violate principle of optimality! Use partial-order dynamic programming
I.e. keep undominated plans at each step -- optimal for each setting of the physical properties (each “interesting order”)
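A toy sketch of the left-deep dynamic program; the cost model (cost of a join = product of input cardinalities, with a fixed 1% output selectivity) is invented for illustration, and physical properties / interesting orders are ignored.

```python
# Sketch of System R-style dynamic programming over left-deep join
# orders. The cost model is a toy; real optimizers use selectivity
# estimates from catalog statistics.
from itertools import combinations

card = {"A": 100, "B": 10, "C": 1000}   # base-table cardinalities
rels = list(card)

# best[set of relations] = (cost, plan string, estimated output card)
best = {frozenset([r]): (0, r, card[r]) for r in rels}

for n in range(2, len(rels) + 1):            # pass n: n-relation plans
    for subset in combinations(rels, n):
        s = frozenset(subset)
        choices = []
        for r in subset:                     # r joins last, as the inner
            outer = s - {r}
            ocost, oplan, ocard = best[outer]
            cost = ocost + ocard * card[r]             # toy join cost
            est = ocard * card[r] // 100               # assume 1% selectivity
            choices.append((cost, f"({oplan} join {r})", est))
        best[s] = min(choices)               # principle of optimality

cost, plan, _ = best[frozenset(rels)]
```

With these numbers the program joins the two small tables first, deferring the 1000-row table: exactly the intuition DP formalizes.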
Relational Architecture Review
Query Parsing and Optimization
Query Executor
Access Methods
Buffer Management
Disk Space Management
DB
Lock Manager
Log Manager
Text Search
Information Retrieval
A research field traditionally separate from Databases Goes back to IBM, Rand and Lockheed in the 50’s G. Salton at Cornell in the 60’s Lots of research since then
Products traditionally separate Originally, document management systems for
libraries, government, law, etc. Gained prominence in recent years due to web search
Today: simple IR techniques Show similarities to DBMS techniques you already
know
IR vs. DBMS
Seem like very different beasts
Under the hood, not as different as they might seem But in practice, you have to choose between the 2
IR                                   DBMS
Imprecise semantics                  Precise semantics
Keyword search                       SQL
Unstructured data format             Structured data
Read-mostly; add docs occasionally   Expect reasonable number of updates
Page through top k results           Generate full answer
IR’s “Bag of Words” Model
Typical IR data model: Each document is just a bag of words (“terms”)
Detail 1: “Stop Words” Certain words are considered irrelevant and not placed in the
bag e.g. “the” e.g. HTML tags like <H1>
Detail 2: “Stemming” Using English-specific rules, convert words to their basic form e.g. “surfing”, “surfed” --> “surf”
Detail 3: we may decorate the words with other attributes E.g. position, font info, etc. Not exactly “bag of words” after all
Boolean Text Search
Find all documents that match a Boolean containment expression: “Windows”
AND (“Glass” OR “Door”) AND NOT “Microsoft”
Note: query terms are also filtered via stemming and stop wordsWhen web search engines say “10,000 documents found”, that’s the Boolean search result size.
Text “Indexes”
When IR folks say “index” or “indexing” … Usually mean more than what DB people
mean
In our terms, both “tables” and indexes Really a logical schema (i.e. tables) With a physical schema (i.e. indexes) Usually not stored in a DBMS
Tables implemented as files in a file system
A Simple Relational Text Index
Create and populate a table InvertedFile(term string, docID int64)
Build a B+-tree or hash index on InvertedFile.term
  May be lots of duplicate docIDs per term
  Secondary index: list compression per term possible
This is often called an "inverted file" or "inverted index": it maps from words -> docs, whereas normal files map docs to the words in the doc (?!)
Can now do single-word text search queries
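A sketch of building the inverted file as a plain term -> docID mapping and doing a single-word lookup; stop words and stemming are omitted, and the three documents are made up.

```python
# Sketch of an inverted file: the relation InvertedFile(term, docID),
# here as a dict from term to a set of docIDs.
docs = {
    1: "the glass door",
    2: "windows and doors",
    3: "microsoft windows",
}

inverted = {}
for doc_id, text in docs.items():
    for term in set(text.split()):      # no stemming / stop words here
        inverted.setdefault(term, set()).add(doc_id)

def lookup(term):
    """Single-word text search: return the postings list for a term."""
    return sorted(inverted.get(term, set()))

hits = lookup("windows")
```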
Handling Boolean Logic
How to do “term1” OR “term2”? Union of two docID sets
How to do “term1” AND “term2”? Intersection (ID join) of two DocID sets!
How to do “term1” AND NOT “term2”? Set subtraction
Also a join algorithm
How to do “term1” OR NOT “term2” Union of “term1” and “NOT term2”.
“Not term2” = all docs not containing term2. Yuck! Usually forbidden at UI/parser
Refinement: what order to handle terms if you have many ANDs/NOTs?
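With postings lists represented as sets, the Boolean operators map directly onto set intersection, union, and difference; the postings below are made up for illustration.

```python
# Sketch of Boolean evaluation over postings sets:
# "windows" AND ("glass" OR "door") AND NOT "microsoft"
postings = {
    "windows":   {1, 2, 3, 5},
    "glass":     {1, 4},
    "door":      {2, 4},
    "microsoft": {3},
}

result = (postings["windows"]
          & (postings["glass"] | postings["door"])   # AND of an OR
          ) - postings["microsoft"]                  # AND NOT
```

The term-ordering refinement above amounts to intersecting the smallest sets first, so intermediate results shrink as early as possible.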
Boolean Search in SQL
(SELECT docID FROM InvertedFile WHERE word = "window"
 INTERSECT
 SELECT docID FROM InvertedFile WHERE word = "glass" OR word = "door")
EXCEPT
SELECT docID FROM InvertedFile WHERE word = "Microsoft"
ORDER BY magic_rank()
Really there’s only one query (template) in IR Single-table selects, UNION, INTERSECT, EXCEPT Note that INTERSECT is a shorthand for equijoin on a key Often there’s only one query plan in the system, too!
magic_rank() is the “secret sauce” in the search engines
“Windows” AND (“Glass” OR “Door”) AND NOT “Microsoft”
Fancier: Phrases and “Near”
Suppose you want a phrase E.g. “Happy Days”
Add a position attribute to the schema: InvertedFile(term string, docID int64, position int), with an index on term
Enhance the join condition in the query. Can't use INTERSECT syntax, but the query is nearly the same:

SELECT I1.docID
FROM InvertedFile I1, InvertedFile I2
WHERE I1.word = "HAPPY"
  AND I2.word = "DAYS"
  AND I1.docID = I2.docID
  AND I2.position - I1.position = 1
ORDER BY magic_rank()
Can relax to "term1" NEAR "term2": position difference < k
Classical Document Ranking: TF × IDF (Term Frequency × Inverse Document Frequency)
For each term t in the query:
  QueryTermRank = (#occurrences of t in q)  [TF]
                  × log((total #docs) / (#docs with this term))  [IDF]
                  × normalization-factor
For each doc d in the Boolean result:
  DocTermRank = (#occurrences of t in d)  [TF]
                × log((total #docs) / (#docs with this term))  [IDF]
                × normalization-factor
  Rank += DocTermRank × QueryTermRank
Requires more in our schema:
  InvertedFile(term string, docID int64, position int, DocTermRank float)
  TermInfo(term string, numDocs int)
  Can compress DocTermRank non-relationally
This basically works fine for raw text; there are other schemes, but this is the standard
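A toy version of the scoring loop above, following the slide's formulas but omitting the normalization factors; the corpus size, term counts, and query are invented.

```python
# Sketch of TF x IDF ranking over a tiny Boolean result set.
import math

N = 4  # total number of docs in the corpus
doc_tf = {      # docID -> {term: occurrences in the doc}
    1: {"happy": 2, "days": 1},
    2: {"happy": 1},
    3: {"days": 2},
}
df = {"happy": 2, "days": 2}         # docs containing each term
query_tf = {"happy": 1, "days": 1}   # occurrences of each term in q

def idf(term):
    return math.log(N / df[term])

scores = {}
for d, tf in doc_tf.items():
    rank = 0.0
    for t, qcount in query_tf.items():
        q_rank = qcount * idf(t)            # QueryTermRank (no norm.)
        d_rank = tf.get(t, 0) * idf(t)      # DocTermRank (no norm.)
        rank += d_rank * q_rank
    scores[d] = rank

top = max(scores, key=scores.get)
```

Doc 1 matches both query terms and wins; docs matching only one term score proportionally lower.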
Some Additional Ranking Tricks
Phrases/Proximity Ranking function can incorporate position
Query expansion, suggestions Can keep a similarity matrix on terms, and
expand/modify people’s queries
Document expansion Can add terms to a doc
E.g. in “anchor text” of refs to the doc
Not all occurrences are created equal Mess with DocTermRank based on:
Fonts, position in doc (title, etc.)
Hypertext Ranking
Also factor in graph structure Social Network Theory (Citation Analysis) “Hubs and Authorities” (Clever),
“PageRank” (Google) Intuition: recursively weighted in-degrees,
out-degrees Math: eigenvector computation
PageRank sure seems to help Though word on the street is that other
factors matter as much Anchor text, title/bold text, etc.
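The eigenvector intuition can be sketched as power iteration over a three-page link graph; the damping factor d = 0.85 is the conventional choice, an assumption since the slide doesn't fix one.

```python
# Sketch of PageRank as power iteration: rank flows along out-links,
# with damping d to model random jumps. The graph is made up.
graph = {            # page -> outgoing links
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
d = 0.85
rank = {p: 1 / len(graph) for p in graph}

for _ in range(50):  # power iteration toward the principal eigenvector
    new = {p: (1 - d) / len(graph) for p in graph}
    for p, links in graph.items():
        share = d * rank[p] / len(links)   # split rank over out-links
        for q in links:
            new[q] += share
    rank = new

top = max(rank, key=rank.get)
```

Page "c" ends up on top: it collects in-links from both other pages, the recursively-weighted in-degree intuition from the slide.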
Updates and Text Search
Text search engines are designed to be query-mostly Deletes and modifications are rare Can postpone updates (nobody notices, no transactions!)
Updates done in batch (rebuild the index) Can’t afford to go offline for an update?
Create a 2nd index on a separate machine Replace the 1st index with the 2nd Can do this incrementally with a level of indirection
So no concurrency control problems Can compress to search-friendly, update-unfriendly format
For these reasons, text search engines and DBMSs are usually separate products Also, text-search engines tune that one SQL query to death! The benefits of a special-case workload.
Architectural Comparison

Search engine stack:
  "The Query"
  Search String Modifier
  Ranking Algorithm
  The Access Method      }
  Buffer Management      }  a simple DBMS
  Disk Space Management  }  (over the DB, or the OS)

DBMS stack (concurrency and recovery needed throughout):
  Query Optimization and Execution
  Relational Operators
  Files and Access Methods
  Buffer Management
  Disk Space Management
  DB
Revisiting Our IR/DBMS Distinctions
Data modeling & query complexity
  DBMS supports any schema & queries
    Requires you to define a schema
    Complex query language (hard for folks to learn)
    Multiple applications at the output
    RowSet API (cursors)
  IR supports only one schema & query
    No schema design required (unstructured text)
    Trivial query language
    Single application behavior: page through output in rank order, ignore most of the output
Storage semantics
  DBMS: online, transactional storage
  IR: batch, unreliable storage
Distribution & Parallelism
Roots
Distributed QP vs. Parallel QP Distributed QP envisioned as a k-node intranet
for k ~= 10 Sound old-fashioned? Think of multiple hosting sites
(e.g. one per continent) Parallel QP grew out of DB Machine research
All in one room, one administrator
Parallel DBMS architecture options Shared-Nothing Shared-Everything Shared-Disk
Shared-nothing is most general, most scalable
Distributed QP: Semi-Joins
Main query processing issue in the distributed DB literature: use semi-joins
  R ⋉ S = π_R(R ⋈ S)
  Observe that R ⋈ S = (R ⋉ π_joincols(S)) ⋈ S
Assume each table lives at one site, and R is bigger. To reduce communication:
  Ship S's join columns to R's site, do the semi-join there, ship the result to S's site for the join
Notes:
  I'm sloppy about duplicates in my definitions above
  Semi-joins aren't always a win: an extra cost-estimation task for a distributed optimizer
[Bernstein/Goodman ‘79]
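The communication pattern can be sketched with two in-memory "sites"; shipping only S's join-key column plays the role of the semi-join reducer.

```python
# Sketch of a semi-join reduction: ship only S's join-key column to
# R's site, filter R there, and ship back just the matching R tuples.
R = [(1, "r1"), (2, "r2"), (5, "r5"), (9, "r9")]   # at site 1 (bigger)
S = [(2, "s2"), (5, "s5"), (7, "s7")]              # at site 2 (smaller)

# Step 1: project S onto its join column and "ship" it to R's site.
s_keys = {k for k, _ in S}

# Step 2: semi-join at R's site -- keep only R tuples that can join.
r_reduced = [t for t in R if t[0] in s_keys]

# Step 3: ship the (much smaller) r_reduced back; finish the join at S.
joined = [(k, rv, sv) for k, rv in r_reduced for k2, sv in S if k == k2]
```

The win depends on |r_reduced| being much smaller than |R|; when most of R joins, the extra round trip just adds cost, hence the optimizer's estimation task.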
Bloom Joins
A constant-factor optimization on semi-joins
Idea: in (R ⋉ π_joincols(S)) ⋈ S, the semi-join work is redundant
  The semi-join can safely return "false hits" from R
  Rather than shipping π_joincols(S), ship a superset: a particular kind of lossy set compression is allowed
Bloom filter (B. Bloom, 1970)
  Hash each value in a set via k independent hash functions onto an array of n bits
  Check membership correspondingly
  By tuning k and n, can control the false-hit rate
  Rediscovered recently in the web literature, with some new wrinkles (Mitzenmacher's compressed B.F.s, Rhea's attenuated B.F.s)
[Babb, TODS 79]
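A minimal Bloom filter sketch, using k = 3 hashes derived from SHA-256 over a 64-bit array (parameters chosen arbitrarily); members always test positive, and the occasional false positive is exactly the "false hit" a Bloom join tolerates.

```python
# Sketch of a Bloom filter: k hash functions over an n-bit array.
# Membership tests may return false positives but never false
# negatives, which is safe for a semi-join (false hits just cause a
# few extra tuples to be shipped and later discarded).
import hashlib

N_BITS, K = 64, 3

def hashes(value):
    """Derive K independent bit positions from SHA-256."""
    for i in range(K):
        h = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % N_BITS

def add(bits, value):
    for b in hashes(value):
        bits |= 1 << b
    return bits

def maybe_contains(bits, value):
    """True if value *may* be in the set; False means definitely not."""
    return all(bits & (1 << b) for b in hashes(value))

bits = 0
for key in ["apple", "banana"]:
    bits = add(bits, key)

members_ok = all(maybe_contains(bits, k) for k in ["apple", "banana"])
```

Note the one-sided error: a cleared bit proves absence, so only "present" answers need later verification at the join site.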
Sideways Information Passing
These ideas generalize more broadly
  E.g. "magic sets" rewriting in Datalog & SQL
  Tricky to do optimally in those settings, but the wins can be very big
[Diagram: a costly set generator feeds a combiner; information passed "sideways" from the other input lets a less costly generator filter its work.]
Parallelism 101
Pipelined vs. partitioned
  Pipelined: typically inter-operator; nominal benefits in a dataflow
  Partitioned: typically intra-operator, e.g. hash join or sort using k nodes
Speedup & scaleup
  Speedup: x = old_time / new_time; ideal: linear
  Scaleup: small_sys_elapsed_small_problem / big_sys_elapsed_big_problem; ideal: 1
  Transaction scaleup: N times as many TPC-Cs for N machines
  Batch scaleup: N times as big a DB for a query on N machines
[See DeWitt & Gray, CACM 92]
Impediments to Good Parallelism
Startup overheads: amortized for big queries
Interference: usually the result of unpredictable communication delays (comm cost, empty pipelines)
Skew
Of these, skew is the real issue in DBs; otherwise query processing is "embarrassingly parallel", i.e. it just works
Data Layout
Horizontal Partitioning For each table, assign rows to
machines by some key Or assign arbitrarily (round-robin)
Vertical Partitioning Sort table, and slice off columns Usually not a parallelism trick
But nice for processing queries on read-mostly data (projection is free!)
Intra-Operator Parallelism
E.g. for Hash Join Every site with a horizontal partition
of either R or S fires off a scan thread Every storage site reroutes its data
among join nodes based on hash of the join column(s)
Upon receipt, each site does local hash join
Recall Exchange!
Skew Handling
Skew happens; even when hashing? Yep.
Can pre-sample and/or pre-summarize data to partition better
Solving skew on the fly is harder: need to migrate accumulated dataflow state
FLuX: Fault-tolerant, Load-balancing eXchange
In Current Architectures
All DBMSs can run on shared memory, many on shared-nothing The high end belongs to clusters
The biggest web-search engines run on clusters (Google, Inktomi) And use pretty textbook DB stuff for
Boolean search Fun tradeoffs between answer quality and
availability/management here (the Inktomi story)
Precomputations
Views and Materialization
A view is a logical table A query with a name In general, not updatable
If to be used often, could be materialized Pre-compute and/or cache result
Could even choose to do this for common query sub-expressions Needn’t require a DBA to say “this is a view”
Challenges in Materialized Views
Three main issues:
  Given a workload, which views should be materialized?
  Given a query, how can mat-views be incorporated into the query optimizer?
  As base tables are updated, how can views be incrementally maintained?
See readings book, Gupta & Mumick
Precomputation in IR
Often want to save results of common queries E.g. no point re-running “Britney Spears” or
“Harry Potter” as Boolean queries
Can also use as subquery results E.g. the query “Harry Potter Loves Britney
Spears” can use the “Harry Potter” and “Britney Spears” results
Constrained version of mat-views No surprise -- constrained relational workload And consistency of matview with raw tables is not
critical, so maintenance not such an issue.
Precomputed Aggregates
Aggregation queries work on numerical sets of data Math tricks apply here
Some trivial, some fancy Theme: replace the raw data with small statistical
summaries, get approximate results Histograms, wavelets, samples, dependency-based
models, random projections, etc. Heavily used in query optimizers for selectivity
estimation (a COUNT aggregate) Spate of recent work on approximate query processing
for AVG, SUM, etc.
[Garofalakis, Gehrke, Rastogi tutorial, SIGMOD ‘02]
A Taste of Next Week
Query Dataflows Meet NWs
Some more presentation Indirection in space and time Thematic similarities, differences in NW/QP
Revisit a NW “classic” through a DB lens Adaptive QP In Telegraph
Eddies, Stems, FluX A taste of QP in Sensor Networks
TinyDB (TAG), Directed Diffusion A taste of QP in p2p
PIER project at Berkeley: parallel QP over DHTs
Presentations from MITers?
Open discussion
Contact
[email protected]
http://www.cs.berkeley.edu/~jmh