Query Processing and Networking Infrastructures
Day 1 of 2
Joe Hellerstein, UC Berkeley
September 20, 2002
Two Goals
Day 1: Primer on query processing
  Targeted to networking/OS folk
  Bias: systems issues
Day 2: Seed some cross-fertilized research, especially with networking
  Thesis: dataflow convergence of query processing and routing
  Clearly other resonances here: dataflow HW architectures, event-based systems designs, ML and control theory, online algorithms
(Sub)Space of Possible Topics
Traditional Relational QP: Optimization & Execution
Parallel QP
Distributed & Federated QP
Boolean Text Search
Traditional Text Ranking
Hypertext Ranking
Indexing
Data Model & Query Language Design
NFNF Data Models (OO, XML, "Semistructured")
Online and Approximate QP
Visual Querying & Data Visualization
Statistical Data Analysis ("Mining")
Active DBs (Trigger Systems)
Media Queries, Feature Extraction & Similarity Search
Data Reduction
Transactional Storage & Networking
Compression
Data Streams & Continuous Queries
Adaptive QP
Likely Topics Here
Traditional Relational QP: Optimization & Execution
Parallel QP
Distributed & Federated QP
Boolean Text Search
Traditional Text Ranking
Hypertext Ranking
Indexing
Data Model & Query Language Design
NFNF Data Models (OO, XML, "Semistructured")
Online and Approximate QP
Visual Querying & Data Visualization
Statistical Data Analysis ("Mining")
Active DBs (Trigger Systems)
Media Queries, Feature Extraction & Similarity Search
Data Reduction
Compression
Data Streams & Continuous Queries
Adaptive QP
Plus Some Speculative Ones
Traditional Relational QP: Optimization & Execution
Parallel QP
Distributed & Federated QP
Boolean Text Search
Traditional Text Ranking
Indexing
Online and Approximate QP
Data Streams & Continuous Queries
Adaptive QP
Peer-to-Peer QP
Sensornet QP
Content Routing
Network Monitoring
Indirection Architectures
Outline
Day 1: Query Processing Crash Course Intro Queries as indirection How do relational databases run queries? How do search engines run queries? Scaling up: cluster parallelism and distribution
Day 2: Research Synergies w/Networking Queries as indirection, revisited Useful (?) analogies to networking research Some of our recent research at the seams Some of your research? Directions and collective discussion
Getting Off on the Right Foot
Roots: database and IR research
“Top-down” traditions (“applications”) Usually begins with semantics and models.
Common misconceptions:
  "Query processing = Oracle or Google." Need not be so heavyweight or monolithic! Many reusable lessons within.
  "IR search and DB querying are fundamentally different." Very similar from a query processing perspective; many similarities in other data models as well.
  "Querying is a synchronous, interactive process." Triggers, rules and "continuous queries" are not so different from plain old queries.
So… we'll go bottom-up
Focus on reusable building blocks
Attempt to be language- and model-agnostic; illustrate with various querying scenarios
Confession: Two Biases
Relational query engines
  Most mature and general query technology
  Best documented in the literature
  Conceptually general enough to "capture" most other models/schemes
Everybody does web searches
  So it's both an important app and an inescapable usage bias we carry around
  It will inform our discussion; it shouldn't skew it
Lots of other query systems/languages you can keep in mind as we go: LDAP, DNS, XSL/XPath/XQuery, Datalog
What Are Queries For? I
Obvious answer: search and analysis over big data sets
Search: select data of interest
  Boolean expressions over content, sometimes with an implicit ordering on results
Analysis: construct new information from base data
  Compute functions over each datum
  Concatenate related records (join)
  Partition into groups, summarize (aggregates)
Aside: "mining" vs. "querying"? As a rule of thumb, think of mining as WYGIWIGY.
Not the most general, powerful answer…
What Are Queries For? II
Queries bridge a (large!) level of indirection
  Declarative programming: what you want, not how to get it
  Easier to express
  Allows the "how" to change under the covers
A critical issue! Not just for querying: method invocation, data update, etc.
Motivation for this Indirection
Critical when rates of change differ across layers, in particular when
d(app)/dt << d(environment)/dt
  E.g. DB apps are used for years, decades (!!)
  E.g. networked environments: high rates of change (??)
DB lit calls this “data independence”
Data Independence: Background
Bad Old Days: hierarchical and "network" (yep!) data models
  Nesting & pointers mean that apps explicitly traverse data, and become brittle when data layouts change
Apps with persistent data have slow d(app)/dt
  And the database environments change faster!
  Logical changes to representation (schema)
  Physical changes in storage (indexes, layouts, HW)
DBs are often shared by multiple apps!
In the B.O.D., all apps had to be rewritten on change
It’s a SW Engineering Thing
Analogy: imagine if your C structs were to survive for decades
  You'd keep them very simple
  Encapsulation to allow future mods
Similar analogy to NWs
  Protocol simplicity is good
  Soft state is good (discourages hardcoded refs to transient resources)
But the fun systems part follows directly: achieve the goal with respectable performance over a dynamic execution environment
Codd’s Data Independence
Ted Codd, IBM, c. 1969 and forward; Turing Award 1981
Two layers of indirection:

  Applications
    |  logical independence: spanned by views and query rewriting
  Logical Representation (schema)
    |  physical independence: spanned by query optimization and execution
  Physical Representation (storage)
A More Architectural Picture
Declarative query over views
  -> Query Rewriter (bridges logical independence)
Declarative query over base tables
  -> Query Processor: Optimizer + Executor (bridges physical independence)
Query plan (procedural)
  -> Access Methods, via the Iterator API
N.B.: this classical QP architecture raises some problems. To be revisited!
Access Methods & Indexing
Access Methods
Base data access layer
Model: data stored in unordered collections (relations, tables), one type per collection
Interface: iterators
  Open(predicate) -> cursor
    Usually simple predicates: attribute op constant; op is usually arithmetic (<, >, =), though we'll see extensions (e.g. multi-d ops)
  Next(cursor) -> datum (of known type)
  Close(cursor)
  Insert(datum of correct type)
  Delete(cursor)
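The iterator interface above can be sketched in a few lines of Python; `HeapFile` and `Cursor` are hypothetical names, and an in-memory list stands in for a disk-resident collection.

```python
# Sketch of the access-method iterator API over an in-memory "heap file".
# Names (HeapFile, Cursor) are illustrative, not from any real engine.

class HeapFile:
    """Unordered collection of records of one type."""
    def __init__(self, records):
        self.records = list(records)

    def open(self, predicate=lambda r: True):
        """Open(predicate) -> cursor over matching records."""
        return Cursor(self.records, predicate)

class Cursor:
    def __init__(self, records, predicate):
        self.it = (r for r in records if predicate(r))

    def next(self):
        """Next(cursor) -> datum, or None at end-of-scan."""
        return next(self.it, None)

    def close(self):
        self.it = iter(())

# Drive a scan: employees with salary > 50
emps = HeapFile([("alice", 60), ("bob", 40), ("carol", 75)])
cur = emps.open(lambda r: r[1] > 50)
results = []
while (rec := cur.next()) is not None:
    results.append(rec)
cur.close()
```

Note how the predicate is pushed into the access layer, saving cross-layer calls for non-matching records.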
Typical Access Methods
"Heap" files: unordered array of records, usually sequential on disk; predicates just save cross-layer costs
Traditional index AMs: B-trees
  Actually "B+"-trees: all data at leaves; can scan across leaves for range search
  Predicates (<, >, =, between) result in fewer I/Os
  Random I/Os (at least to find the beginning of a range)
Linear Hash index Litwin ‘78. Supports equality predicates only.
This is it for IR and standard relational DBs Though when IR folks say “indexing”, they sometimes mean all
of query processing
Primary & Secondary Indexes
[Diagram: in a primary index, the directory sits directly over the data pages; in a secondary index, the directory levels hold (key, ptr) pairs that point into separately stored data.]
An Exotic Forest of Search Trees
Multi-dimensional indexes For geodata, multimedia search, etc. Dozens! E.g. R-tree family, disk-based Quad-Trees,
kdB-trees And of course “linearizations” with B-trees
Path indexes For XML and OO path queries E.g. Xfilter
Etc. Lots of one-off indexes, often many per workload No clear winners here Extensible indexing scheme would be nice
Generalized Search Trees (GiST)
What is a (tree-based) DB index? Typically:
  A clustering of data into leaf blocks
  Hierarchical summaries (subtree predicates, SPs) for the pointers in directory blocks
[Hellerstein et al., VLDB 95]
Generalized Search Trees (GiST)
Can realize that abstraction with a simple interface: user registers opaque SP objects with a few methods
  Consistent(q, p): should query q traverse the subtree under p?
  Penalty(d, p): how bad is it to insert d below p?
  Union(p1, p2): form an SP that includes p1 and p2
  PickSplit({p1, …, pn}): partition SPs into 2 groups
Tree maintenance, concurrency, and recovery all doable under the covers
Covers many popular multi-dimensional indexes, most of which had no concurrency/recovery story
http://gist.cs.berkeley.edu
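A toy illustration of the four extension methods, using 1-D intervals as subtree predicates (roughly what a B+-tree's key ranges look like under GiST); the class and method names mirror the slide, not the actual libgist API.

```python
# Sketch: the four GiST extension methods for 1-D interval "subtree
# predicates" (SPs). Class and method names follow the slide.

class IntervalSP:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, q):
        """Consistent(q, p): should query interval q traverse this subtree?"""
        return not (q.hi < self.lo or q.lo > self.hi)

    def penalty(self, d):
        """Penalty(d, p): how bad is it to insert value d below this SP?
        (Here: how far the interval must grow.)"""
        return max(0, self.lo - d) + max(0, d - self.hi)

    @staticmethod
    def union(p1, p2):
        """Union(p1, p2): form an SP that covers both."""
        return IntervalSP(min(p1.lo, p2.lo), max(p1.hi, p2.hi))

    @staticmethod
    def pick_split(sps):
        """PickSplit: partition SPs into 2 groups (naive sort-and-halve)."""
        sps = sorted(sps, key=lambda p: p.lo)
        mid = len(sps) // 2
        return sps[:mid], sps[mid:]

q = IntervalSP(5, 7)
hit = IntervalSP(0, 6).consistent(q)     # overlaps the query -> traverse
miss = IntervalSP(10, 20).consistent(q)  # disjoint -> prune the subtree
pen = IntervalSP(0, 6).penalty(9)        # inserting 9 grows the SP by 3
u = IntervalSP.union(IntervalSP(0, 6), IntervalSP(10, 20))
```

The tree machinery (search, insert, split) never looks inside these objects; it only calls the four methods, which is how one codebase covers B+-tree-like and R-tree-like behavior.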
Some Additional Indexing Tricks
Bitmap indexing
  Many matches per value in a (secondary) index? Rather than storing pointers to the heap file in the leaves, store a bitmap of matches in a (sorted) heap file.
  Only works if file reorganization is infrequent
  Can make intersection, COUNT, etc. quicker during query processing
  Can mix/match bitmaps and lists in a single index
  Works with any (secondary) index with duplicate matches
"Vertical partitioning" / "columnar storage"
  Again, for sorted, relatively static files
Bit-slice indexes
[O’Neil/Quass, SIGMOD 97]
Query Processing Dataflow Infrastructures
Dataflow Infrastructure
Dataflow abstraction is very simple “box-and-arrow” diagrams (typed) collections of objects flow along edges
Details can be tricky “Push” or “Pull”?
More to it than that How do control-flow and
dataflow interact? Where does the data live?
Don’t want to copy data If passing pointers, where does
the “real” data live?
Iterators
Most uniprocessor DB engines use iterators
  Open() -> cursor
  Next(cursor) -> typed record
  Close(cursor)
Simple and elegant
  Control-flow and dataflow coupled
  Familiar single-threaded, procedure-call API
  Data refs passed on the stack, no buffering
Blocking-agnostic
  Works with blocking ops, e.g. sort
  Works with pipelined ops
Note: well-behaved iterators "come up for air" in inner loops, e.g. for interrupt handling
Where is the In-Flight Data?
In a standard DBMS, raw data lives in disk format in a shared Buffer Pool
Iterators pass references to the BufPool
  A tuple "slot" per iterator input
  Never copy along edges of the dataflow
  Join results are arrays of refs to base tables
Operators may "pin" pages in the BufPool
  BufPool never replaces pinned pages
  Ops should release pins ASAP (esp. across Next() calls!!)
Some operators copy data into their internal state
  Can "spill" this state to private disk space
Weaknesses of Simple Iterators
Evolution of uniprocessor archs to parallel archs, esp. "shared-nothing" clusters
  Opportunity for pipelined parallelism
  Opportunity for partition parallelism: take a single "box" in the dataflow and split it across multiple machines
Problems with iterators in this environment
  Spoils the pipelined parallelism opportunity
  Polling (Next()) across the network is inefficient: nodes sit idle until polled, and during comm
  A blocking producer blocks its consumer
But we would like to keep the iterator abstraction
  Especially to save legacy query processor code
  And to simplify debugging (single-threaded, synchronous)
Exchange
Encapsulate partition parallelism & asynchrony
Keep the iterator API between ops
The Exchange operator partitions input data by content, e.g. on join or sort keys
Note the basic architectural idea! Encapsulate dataflow tricks in operators, leaving the infrastructure untouched
We'll see this again next week, e.g. in Eddies
[Graefe, SIGMOD 90]
Exchange Internals
Really 2 operators, XIN and XOUT
  XIN is the "top" of the producer plan: it pulls, pushing results to the XOUT queue
  XOUT spins on its local queue
One thread becomes two: producer graph & XIN; consumer graph & XOUT
A routing table/fn in XIN supports partition parallelism, e.g. for parallel sort, join, etc.
Producer and consumer see the iterator API; the queue + thread barrier turns NW-based "push" into iterator-style "pull"
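A minimal sketch of the XIN/XOUT split, assuming Python threads and queues stand in for processes and network links: XIN pulls from the producer and pushes tuples to per-partition queues via a routing function, while each XOUT turns that push back into an iterator-style pull.

```python
# Sketch of Exchange: XIN pulls from a producer iterator and pushes
# tuples into per-partition queues by a routing function; each XOUT
# spins on its local queue and presents a pull-style Next() to its
# consumer. The 2-way partitioning is illustrative.
import queue
import threading

DONE = object()  # end-of-stream marker

def xin(producer, queues, route):
    """Top of the producer plan: pull locally, push to partitions."""
    for tup in producer:
        queues[route(tup)].put(tup)
    for q in queues:
        q.put(DONE)

class XOut:
    """Bottom of a consumer plan: the queue turns push back into pull."""
    def __init__(self, q):
        self.q = q
    def next(self):
        tup = self.q.get()        # blocks until data (or DONE) arrives
        return None if tup is DONE else tup

queues = [queue.Queue(), queue.Queue()]
producer = iter(range(10))
t = threading.Thread(target=xin, args=(producer, queues, lambda x: x % 2))
t.start()

evens = []                         # consumer of partition 0
out0 = XOut(queues[0])
while (tup := out0.next()) is not None:
    evens.append(tup)
t.join()
```

The consumer code is an ordinary iterator loop; all the partitioning and threading is hidden inside the operator, which is exactly the encapsulation point the slide makes.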
Exchange Benefits?
Remember the iterator limitations?
  "Spoils pipelined parallelism opportunity": solved by the Exchange thread boundary
  "Polling (Next()) across the network is inefficient": solved by XIN pushing to the XOUT queue
  "A blocking producer blocks its consumer": still a problem!
Exchange Limitations
Doesn't allow consumer work to overlap with blocking producers
  E.g. streaming data sources, events
  E.g. sort, some join algorithms
The entire consumer graph blocks if the XOUT queue is empty
  Control flow is coupled to dataflow, so XOUT won't return without data
  The queue is encapsulated from the consumer
But … Note that exchange model is fine for most
traditional DB Query Processing May need to be extended for new settings…
Fjords
Thread of control per operator; queues between operators; asynch or synch calls
  Can do asynch poll-and-yield iteration in each operator (for both consumer and producer)
  Or can do synchronous get_next iteration
Can get traditional behavior if you want:
  Synch polls + queue of size 1 = Iterators
  Synch consumer, asynch producer = Exchange
Asynch calls solve the blocking problem of Exchange
[Madden/Franklin, ICDE 01]
Fjords
Disadvantages: Lots of “threads”
Best done in an event-programming style, not OS threads
Operators really have to “come up for air” (“yield”) Need to write your own scheduler Harder to debug
But: Maximizes flexibility for operators at the
endpoints Still provides a fairly simple interface for
operator-writers
Basic Relational Operators and Implementation
Relational Algebra Semantics
Selection: σ_p(R) returns all rows in R that satisfy p
Projection: π_C(R) returns all rows in R projected to the columns in C
  In the strict relational model, remove duplicate rows; in SQL, preserve duplicates (multiset semantics)
Cartesian product: R × S
Union: R ∪ S
Difference: R − S
  Note: R, S must have matching schemata
Join: R ⋈_p S = σ_p(R × S)
Missing: Grouping & Aggregation, Sorting
Operator Overview: Basics
Selection Typically “free”, so “pushed down” Often omitted from diagrams
Projection In SQL, typically “free”, so “pushed down”
No duplicate elimination Always pass the minimal set of columns downstream
Typically omitted from diagrams
Cartesian Product Unavoidable nested loop to generate output
Union: Concat, or concat followed by dup. elim.
Operator Overview, Cont.
Unary operators: Grouping & Sorting Grouping can be done with hash or sort schemes
(as we’ll see)
Binary matching: Joins/Intersections Alternative algorithms:
Nested loops Loop with index lookup (Index N.L.) Sort-merge Hash Join
Don’t forget: have to write as iterators Every time you get called with Next(), you adjust
your state and produce an output record
Unary External Hashing
E.g. GROUP BY, DISTINCT
Two hash functions, hc (coarse) and hf (fine)
Two phases:
  Phase 1: for each tuple of input, hash via hc into a "spill" partition to be put on disk
  B-1 blocks of memory used to hold output buffers, for writing a block at a time per partition
[Diagram: the original relation streams through B main-memory buffers; hash function hc routes tuples among B-1 output partitions spilled to disk.]
[Bratbergsengen, VLDB 84]
Unary External Hashing
[Diagram: each partition Ri (k < B pages) is read back from disk and hashed via hf into an in-memory hash table; results stream through an output buffer.]
Phase 2: for each partition, read off disk and hash into a main-memory hashtable via hf
For DISTINCT, when you find a value already in the hashtable, discard the copy
For GROUP BY, associate some aggregate state (e.g. a running SUM) with each group in the hash table, and maintain it as tuples stream in
External Hashing: Analysis
To utilize memory well in Phase 2, would like each partition to be ~B blocks big
Hence works in two phases when B >= √|R|
  Same requirement as external sorting!
Else can recursively partition the partitions in Phase 2
Can be made to pipeline, to adapt nicely to small data sets, etc.
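A sketch of the two-phase scheme for GROUP BY SUM, with in-memory lists standing in for on-disk spill partitions; `hc` is the coarse partitioning hash, and a Python dict plays the role of the fine hash table built via hf.

```python
# Sketch of two-phase external hashing for GROUP BY SUM. "Disk spill"
# partitions are simulated with in-memory lists.
B = 4  # pretend we have B buffers -> B-1 spill partitions

def hc(key):
    """Coarse hash: choose a spill partition."""
    return hash(key) % (B - 1)

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Phase 1: route each input tuple to a spill partition via hc.
partitions = [[] for _ in range(B - 1)]
for key, val in rows:
    partitions[hc(key)].append((key, val))

# Phase 2: per partition, build an in-memory hash table (the "hf" table)
# and maintain the running aggregate state (here a SUM) per group.
result = {}
for part in partitions:
    groups = {}
    for key, val in part:
        groups[key] = groups.get(key, 0) + val
    result.update(groups)   # a given key lives in exactly one partition
```

Correctness hinges on hc sending all tuples of a group to the same partition, so each group is fully aggregated within a single Phase-2 pass.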
Hash Join (GRACE)
Phase 1: partition each relation on the join key with hc, spilling to disk
Phase 2: build each partition of the smaller relation into a hashtable via hf; scan the matching partition of the bigger relation, and for each tuple probe the hashtable via hf for matches
Would like each partition of the smaller relation to fit in memory
  So works well if B >= √|smaller|
  Size of the bigger is irrelevant!! (Vs. sort-merge join)
Popular optimization: hybrid hash join
  Partition #0 doesn't spill -- it builds and probes immediately
  Partitions 1 through n use the rest of memory for output buffers
[DeWitt/Katz/Olken/Shapiro/Stonebraker/Wood, SIGMOD 84]
[Fushimi, et al., VLDB 84]
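The two phases of GRACE hash join can be sketched as follows, with lists standing in for spilled disk partitions and the smaller relation R as the build side:

```python
# Sketch of GRACE hash join on R.key = S.key. Phase 1 partitions both
# inputs by hc; Phase 2 builds a hash table on each partition of the
# smaller input (R) and probes it with the matching partition of S.
NPART = 3

def hc(key):
    return hash(key) % NPART

R = [(1, "r1"), (2, "r2"), (3, "r3")]
S = [(2, "s2"), (3, "s3"), (3, "s3b"), (4, "s4")]

# Phase 1: spill both relations into partitions on the join key.
r_parts = [[] for _ in range(NPART)]
s_parts = [[] for _ in range(NPART)]
for t in R:
    r_parts[hc(t[0])].append(t)
for t in S:
    s_parts[hc(t[0])].append(t)

# Phase 2: build on R's partition, probe with S's matching partition.
# Only matching partition pairs need to meet -- the source of the
# B >= sqrt(|smaller|) memory requirement.
out = []
for rp, sp in zip(r_parts, s_parts):
    table = {}
    for key, rval in rp:
        table.setdefault(key, []).append(rval)
    for key, sval in sp:
        for rval in table.get(key, []):
            out.append((key, rval, sval))
```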
Hash-Join
[Diagram: Phase 1 streams both original relations through hash function hc into B-1 disk partitions each; Phase 2 builds a hash table on partition Ri (k < B-1 pages) via hf and probes it with an input buffer over the matching partition Si, streaming join results through an output buffer.]
Symmetric Hash Join
Pipelining, in-core variant
Build and probe symmetrically
Correctness: each output tuple is generated when its last-arriving component appears
Can be extended to the out-of-core case
  Tukwila [Ives & Halevy, SIGMOD '99]
  XJoin: spill and read partitions multiple times; correctness guaranteed by timestamping tuples and partitions [Urhan & Franklin, DEBull '00]
[Mikkilineni & Su, TOSE 88] [Wilschut & Apers, PDIS 91]
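A sketch of the in-core symmetric variant: one hash table per input, and every arriving tuple builds into its own table and then immediately probes the other, so neither input ever blocks the output.

```python
# Sketch of the in-core symmetric (pipelining) hash join. Each output
# tuple is emitted the moment its last-arriving component appears.

def symmetric_hash_join(stream):
    """stream yields (side, key, value) in arrival order, side in {'R','S'}."""
    tables = {"R": {}, "S": {}}
    for side, key, val in stream:
        other = "S" if side == "R" else "R"
        tables[side].setdefault(key, []).append(val)   # build into own table
        for match in tables[other].get(key, []):       # probe the other table
            r, s = (val, match) if side == "R" else (match, val)
            yield (key, r, s)   # normalized as (key, r_value, s_value)

arrivals = [("R", 1, "r1"), ("S", 1, "s1"), ("S", 2, "s2"), ("R", 2, "r2")]
out = list(symmetric_hash_join(arrivals))
```

Because no phase waits for an input to finish, this works over unbounded streams; the cost is keeping both hash tables resident (hence the out-of-core extensions above).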
Relational Query Engines
SELECT [DISTINCT] <output expressions>
FROM <tables>
[WHERE <predicates>]
[GROUP BY <gb-expression> [HAVING <h-predicates>]]
[ORDER BY <expression>]
A Basic SQL primer
Join tables in FROM clause applying predicates in WHERE clause
If GROUP BY, partition results by GROUP And maintain aggregate output expressions per group Delete groups that don’t satisfy HAVING clause
If ORDER BY, sort output accordingly
Examples
Single-table S-F-W
DISTINCT, ORDER BY
Multi-table S-F-W, and self-joins
Scalar output expressions
Aggregate output expressions, with and without DISTINCT
GROUP BY, HAVING
Nested queries, uncorrelated and correlated
A Dopey Query Optimizer
For each S-F-W query block, create a plan that:
  Forms the Cartesian product of the FROM clause
  Applies the WHERE clause
Incredibly inefficient: huge intermediate results!
Then, as needed:
  Apply the GROUP BY clause
  Apply the HAVING clause
  Apply any projections and output expressions
  Apply duplicate elimination and/or ORDER BY
An Oracular Query Optimizer
For each possible correct plan: Run the plan (infinitely fast) Measure its performance in reality
Pick the best plan, and run it in reality
A Standard Query Optimizer
Three aspects to the problem Legal plan space (transformation
rules) Cost model Search Strategy
Plan Space
Many legal algebraic transformations, e.g.: Cartesian product followed by selection can be
rewritten as join Join is commutative and associative
Can reorder the join tree arbitrarily NP-hard to find best join tree in general
Selections should (usually) be “pushed down” Projections can be “pushed down”
And “physical” choices Choice of Access Methods Choice of Join algorithms Taking advantage of sorted nature of some streams
Complicates Dynamic Programming, as we’ll see
Cost Model & Selectivity Estimation
Cost of a physical operator can be modeled fairly accurately: E.g. number of random and sequential I/Os Requires metadata about input tables:
Number of rows (cardinality) Bytes per tuple (physical schema)
In a query pipeline, metadata on intermediate tables is trickier Cardinality? Requires “selectivity” (COUNT) estimation
Wet-finger estimates Histograms, joint distributions and other summaries Sampling
Search Strategy
Dynamic Programming Used in most commercial systems IBM’s System R [Selinger, et al. SIGMOD 79]
Top-Down Branch and bound with memoization Exodus, Volcano & Cascades [Graefe, SIGMOD 87,
ICDE 93, DEBull 95] Used in a few commercial systems (Microsoft SQL
Server, especially)
Randomized Simulated Annealing, etc. [Ioannidis & Kang
SIGMOD 90]
Dynamic Programming
Use the principle of optimality: any subtree of the optimal plan is itself optimal for its sub-expression
Plans enumerated in N passes (if N relations joined):
  Pass 1: find the best 1-relation plan for each relation
  Pass 2: find the best way to join the result of each 1-relation plan (as outer) to another relation (all 2-relation plans)
  Pass N: find the best way to join the result of an (N-1)-relation plan (as outer) to the Nth relation (all N-relation plans)
This gives all left-deep plans. Generalization is easy…
A wrinkle: physical properties (e.g. sort orders) violate principle of optimality! Use partial-order dynamic programming
I.e. keep undominated plans at each step -- optimal for each setting of the physical properties (each “interesting order”)
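A toy sketch of the left-deep dynamic program; the cost model (cost of a join = product of input cardinalities, with a fixed 1% output selectivity) is invented for illustration, and physical properties / interesting orders are ignored.

```python
# Sketch of System R-style dynamic programming over left-deep join
# orders. The cost model is a toy; real optimizers use selectivity
# estimates from catalog statistics.
from itertools import combinations

card = {"A": 100, "B": 10, "C": 1000}   # base-table cardinalities
rels = list(card)

# best[set of relations] = (cost, plan string, estimated output card)
best = {frozenset([r]): (0, r, card[r]) for r in rels}

for n in range(2, len(rels) + 1):            # pass n: n-relation plans
    for subset in combinations(rels, n):
        s = frozenset(subset)
        choices = []
        for r in subset:                     # r joins last, as the inner
            outer = s - {r}
            ocost, oplan, ocard = best[outer]
            cost = ocost + ocard * card[r]             # toy join cost
            est = ocard * card[r] // 100               # assume 1% selectivity
            choices.append((cost, f"({oplan} join {r})", est))
        best[s] = min(choices)               # principle of optimality

cost, plan, _ = best[frozenset(rels)]
```

With these numbers the program joins the two small tables first, deferring the 1000-row table: exactly the intuition DP formalizes.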
Relational Architecture Review
Query Parsing and Optimization
Query Executor
Access Methods
Buffer Management
Disk Space Management
DB
Lock Manager
Log Manager
Text Search
Information Retrieval
A research field traditionally separate from Databases Goes back to IBM, Rand and Lockheed in the 50’s G. Salton at Cornell in the 60’s Lots of research since then
Products traditionally separate Originally, document management systems for
libraries, government, law, etc. Gained prominence in recent years due to web search
Today: simple IR techniques Show similarities to DBMS techniques you already
know
IR vs. DBMS
Seem like very different beasts
Under the hood, not as different as they might seem But in practice, you have to choose between the 2
IR                                   DBMS
Imprecise semantics                  Precise semantics
Keyword search                       SQL
Unstructured data format             Structured data
Read-mostly; add docs occasionally   Expect reasonable number of updates
Page through top k results           Generate full answer
IR’s “Bag of Words” Model
Typical IR data model: Each document is just a bag of words (“terms”)
Detail 1: “Stop Words” Certain words are considered irrelevant and not placed in the
bag e.g. “the” e.g. HTML tags like <H1>
Detail 2: “Stemming” Using English-specific rules, convert words to their basic form e.g. “surfing”, “surfed” --> “surf”
Detail 3: we may decorate the words with other attributes E.g. position, font info, etc. Not exactly “bag of words” after all
Boolean Text Search
Find all documents that match a Boolean containment expression: “Windows”
AND (“Glass” OR “Door”) AND NOT “Microsoft”
Note: query terms are also filtered via stemming and stop wordsWhen web search engines say “10,000 documents found”, that’s the Boolean search result size.
Text “Indexes”
When IR folks say “index” or “indexing” … Usually mean more than what DB people
mean
In our terms, both “tables” and indexes Really a logical schema (i.e. tables) With a physical schema (i.e. indexes) Usually not stored in a DBMS
Tables implemented as files in a file system
A Simple Relational Text Index
Create and populate a table InvertedFile(term string, docID int64)
Build a B+-tree or hash index on InvertedFile.term
  May be lots of duplicate docIDs per term
  Secondary index: list compression per term possible
This is often called an "inverted file" or "inverted index": it maps from words -> docs, whereas normal files map docs to the words in the doc (?!)
Can now do single-word text search queries
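A sketch of building the inverted file as a plain term -> docID mapping and doing a single-word lookup; stop words and stemming are omitted, and the three documents are made up.

```python
# Sketch of an inverted file: the relation InvertedFile(term, docID),
# here as a dict from term to a set of docIDs.
docs = {
    1: "the glass door",
    2: "windows and doors",
    3: "microsoft windows",
}

inverted = {}
for doc_id, text in docs.items():
    for term in set(text.split()):      # no stemming / stop words here
        inverted.setdefault(term, set()).add(doc_id)

def lookup(term):
    """Single-word text search: return the postings list for a term."""
    return sorted(inverted.get(term, set()))

hits = lookup("windows")
```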
Handling Boolean Logic
How to do “term1” OR “term2”? Union of two docID sets
How to do “term1” AND “term2”? Intersection (ID join) of two DocID sets!
How to do “term1” AND NOT “term2”? Set subtraction
Also a join algorithm
How to do “term1” OR NOT “term2” Union of “term1” and “NOT term2”.
“Not term2” = all docs not containing term2. Yuck! Usually forbidden at UI/parser
Refinement: what order to handle terms if you have many ANDs/NOTs?
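With postings lists represented as sets, the Boolean operators map directly onto set intersection, union, and difference; the postings below are made up for illustration.

```python
# Sketch of Boolean evaluation over postings sets:
# "windows" AND ("glass" OR "door") AND NOT "microsoft"
postings = {
    "windows":   {1, 2, 3, 5},
    "glass":     {1, 4},
    "door":      {2, 4},
    "microsoft": {3},
}

result = (postings["windows"]
          & (postings["glass"] | postings["door"])   # AND of an OR
          ) - postings["microsoft"]                  # AND NOT
```

The term-ordering refinement above amounts to intersecting the smallest sets first, so intermediate results shrink as early as possible.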
Boolean Search in SQL
(SELECT docID FROM InvertedFile WHERE word = "window"
 INTERSECT
 SELECT docID FROM InvertedFile WHERE word = "glass" OR word = "door")
EXCEPT
SELECT docID FROM InvertedFile WHERE word = "Microsoft"
ORDER BY magic_rank()
Really there’s only one query (template) in IR Single-table selects, UNION, INTERSECT, EXCEPT Note that INTERSECT is a shorthand for equijoin on a key Often there’s only one query plan in the system, too!
magic_rank() is the “secret sauce” in the search engines
“Windows” AND (“Glass” OR “Door”) AND NOT “Microsoft”
Fancier: Phrases and “Near”
Suppose you want a phrase E.g. “Happy Days”
Add a position attribute to the schema: InvertedFile(term string, docID int64, position int), with an index on term
Enhance the join condition in the query. Can't use INTERSECT syntax, but the query is nearly the same:

SELECT I1.docID
FROM InvertedFile I1, InvertedFile I2
WHERE I1.word = "HAPPY"
  AND I2.word = "DAYS"
  AND I1.docID = I2.docID
  AND I2.position - I1.position = 1
ORDER BY magic_rank()
Can relax to "term1" NEAR "term2": position difference < k
Classical Document Ranking: TF × IDF (Term Frequency × Inverse Document Frequency)
For each term t in the query:
  QueryTermRank = (#occurrences of t in q)  [TF]
                  × log((total #docs) / (#docs with this term))  [IDF]
                  × normalization-factor
For each doc d in the Boolean result:
  DocTermRank = (#occurrences of t in d)  [TF]
                × log((total #docs) / (#docs with this term))  [IDF]
                × normalization-factor
  Rank += DocTermRank × QueryTermRank
Requires more in our schema:
  InvertedFile(term string, docID int64, position int, DocTermRank float)
  TermInfo(term string, numDocs int)
  Can compress DocTermRank non-relationally
This basically works fine for raw text; there are other schemes, but this is the standard
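A toy version of the scoring loop above, following the slide's formulas but omitting the normalization factors; the corpus size, term counts, and query are invented.

```python
# Sketch of TF x IDF ranking over a tiny Boolean result set.
import math

N = 4  # total number of docs in the corpus
doc_tf = {      # docID -> {term: occurrences in the doc}
    1: {"happy": 2, "days": 1},
    2: {"happy": 1},
    3: {"days": 2},
}
df = {"happy": 2, "days": 2}         # docs containing each term
query_tf = {"happy": 1, "days": 1}   # occurrences of each term in q

def idf(term):
    return math.log(N / df[term])

scores = {}
for d, tf in doc_tf.items():
    rank = 0.0
    for t, qcount in query_tf.items():
        q_rank = qcount * idf(t)            # QueryTermRank (no norm.)
        d_rank = tf.get(t, 0) * idf(t)      # DocTermRank (no norm.)
        rank += d_rank * q_rank
    scores[d] = rank

top = max(scores, key=scores.get)
```

Doc 1 matches both query terms and wins; docs matching only one term score proportionally lower.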
Some Additional Ranking Tricks
Phrases/Proximity Ranking function can incorporate position
Query expansion, suggestions Can keep a similarity matrix on terms, and
expand/modify people’s queries
Document expansion Can add terms to a doc
E.g. in “anchor text” of refs to the doc
Not all occurrences are created equal Mess with DocTermRank based on:
Fonts, position in doc (title, etc.)
Hypertext Ranking
Also factor in graph structure Social Network Theory (Citation Analysis) “Hubs and Authorities” (Clever),
“PageRank” (Google) Intuition: recursively weighted in-degrees,
out-degrees Math: eigenvector computation
PageRank sure seems to help Though word on the street is that other
factors matter as much Anchor text, title/bold text, etc.
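The eigenvector intuition can be sketched as power iteration over a three-page link graph; the damping factor d = 0.85 is the conventional choice, an assumption since the slide doesn't fix one.

```python
# Sketch of PageRank as power iteration: rank flows along out-links,
# with damping d to model random jumps. The graph is made up.
graph = {            # page -> outgoing links
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
d = 0.85
rank = {p: 1 / len(graph) for p in graph}

for _ in range(50):  # power iteration toward the principal eigenvector
    new = {p: (1 - d) / len(graph) for p in graph}
    for p, links in graph.items():
        share = d * rank[p] / len(links)   # split rank over out-links
        for q in links:
            new[q] += share
    rank = new

top = max(rank, key=rank.get)
```

Page "c" ends up on top: it collects in-links from both other pages, the recursively-weighted in-degree intuition from the slide.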
Updates and Text Search
Text search engines are designed to be query-mostly Deletes and modifications are rare Can postpone updates (nobody notices, no transactions!)
Updates done in batch (rebuild the index) Can’t afford to go offline for an update?
Create a 2nd index on a separate machine Replace the 1st index with the 2nd Can do this incrementally with a level of indirection
So no concurrency control problems Can compress to search-friendly, update-unfriendly format
For these reasons, text search engines and DBMSs are usually separate products Also, text-search engines tune that one SQL query to death! The benefits of a special-case workload.
Architectural Comparison

Search engine stack:
  "The Query"
  Search String Modifier
  Ranking Algorithm
  The Access Method      }
  Buffer Management      }  a simple DBMS
  Disk Space Management  }  (over the DB, or the OS)

DBMS stack (concurrency and recovery needed throughout):
  Query Optimization and Execution
  Relational Operators
  Files and Access Methods
  Buffer Management
  Disk Space Management
  DB
Revisiting Our IR/DBMS Distinctions
Data modeling & query complexity
  DBMS supports any schema & queries
    Requires you to define a schema
    Complex query language (hard for folks to learn)
    Multiple applications at the output
    RowSet API (cursors)
  IR supports only one schema & query
    No schema design required (unstructured text)
    Trivial query language
    Single application behavior: page through output in rank order, ignore most of the output
Storage semantics
  DBMS: online, transactional storage
  IR: batch, unreliable storage
Distribution & Parallelism
Roots
Distributed QP vs. Parallel QP Distributed QP envisioned as a k-node intranet
for k ~= 10 Sound old-fashioned? Think of multiple hosting sites
(e.g. one per continent) Parallel QP grew out of DB Machine research
All in one room, one administrator
Parallel DBMS architecture options Shared-Nothing Shared-Everything Shared-Disk
Shared-nothing is most general, most scalable
Distributed QP: Semi-Joins
Main query processing issue in the distributed DB literature: use semi-joins
  R ⋉ S = π_R(R ⋈ S)
  Observe that R ⋈ S = (R ⋉ π_joincols(S)) ⋈ S
Assume each table lives at one site, and R is bigger. To reduce communication:
  Ship S's join columns to R's site, do the semi-join there, ship the result to S's site for the join
Notes:
  I'm sloppy about duplicates in my definitions above
  Semi-joins aren't always a win: an extra cost-estimation task for a distributed optimizer
[Bernstein/Goodman ‘79]
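The communication pattern can be sketched with two in-memory "sites"; shipping only S's join-key column plays the role of the semi-join reducer.

```python
# Sketch of a semi-join reduction: ship only S's join-key column to
# R's site, filter R there, and ship back just the matching R tuples.
R = [(1, "r1"), (2, "r2"), (5, "r5"), (9, "r9")]   # at site 1 (bigger)
S = [(2, "s2"), (5, "s5"), (7, "s7")]              # at site 2 (smaller)

# Step 1: project S onto its join column and "ship" it to R's site.
s_keys = {k for k, _ in S}

# Step 2: semi-join at R's site -- keep only R tuples that can join.
r_reduced = [t for t in R if t[0] in s_keys]

# Step 3: ship the (much smaller) r_reduced back; finish the join at S.
joined = [(k, rv, sv) for k, rv in r_reduced for k2, sv in S if k == k2]
```

The win depends on |r_reduced| being much smaller than |R|; when most of R joins, the extra round trip just adds cost, hence the optimizer's estimation task.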
Bloom Joins
A constant-factor optimization on semi-joins
Idea: in (R ⋉ π_joincols(S)) ⋈ S, the semi-join work is redundant
  The semi-join can safely return "false hits" from R
  Rather than shipping π_joincols(S), ship a superset: a particular kind of lossy set compression is allowed
Bloom filter (B. Bloom, 1970)
  Hash each value in a set via k independent hash functions onto an array of n bits
  Check membership correspondingly
  By tuning k and n, can control the false-hit rate
  Rediscovered recently in the web literature, with some new wrinkles (Mitzenmacher's compressed B.F.s, Rhea's attenuated B.F.s)
[Babb, TODS 79]
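A minimal Bloom filter sketch, using k = 3 hashes derived from SHA-256 over a 64-bit array (parameters chosen arbitrarily); members always test positive, and the occasional false positive is exactly the "false hit" a Bloom join tolerates.

```python
# Sketch of a Bloom filter: k hash functions over an n-bit array.
# Membership tests may return false positives but never false
# negatives, which is safe for a semi-join (false hits just cause a
# few extra tuples to be shipped and later discarded).
import hashlib

N_BITS, K = 64, 3

def hashes(value):
    """Derive K independent bit positions from SHA-256."""
    for i in range(K):
        h = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % N_BITS

def add(bits, value):
    for b in hashes(value):
        bits |= 1 << b
    return bits

def maybe_contains(bits, value):
    """True if value *may* be in the set; False means definitely not."""
    return all(bits & (1 << b) for b in hashes(value))

bits = 0
for key in ["apple", "banana"]:
    bits = add(bits, key)

members_ok = all(maybe_contains(bits, k) for k in ["apple", "banana"])
```

Note the one-sided error: a cleared bit proves absence, so only "present" answers need later verification at the join site.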
Sideways Information Passing
These ideas generalize more broadly
  E.g. "magic sets" rewriting in Datalog & SQL
  Tricky to do optimally in those settings, but the wins can be very big
[Diagram: a costly set generator feeds a combiner; information passed "sideways" from the other input lets a less costly generator filter its work.]
Parallelism 101
Pipelined vs. partitioned
  Pipelined: typically inter-operator; nominal benefits in a dataflow
  Partitioned: typically intra-operator, e.g. hash join or sort using k nodes
Speedup & scaleup
  Speedup: x = old_time / new_time; ideal: linear
  Scaleup: small_sys_elapsed_small_problem / big_sys_elapsed_big_problem; ideal: 1
  Transaction scaleup: N times as many TPC-Cs for N machines
  Batch scaleup: N times as big a DB for a query on N machines
[See DeWitt & Gray, CACM 92]
Impediments to Good Parallelism
Startup overheads: amortized for big queries
Interference: usually the result of unpredictable communication delays (comm cost, empty pipelines)
Skew
Of these, skew is the real issue in DBs; otherwise query processing is "embarrassingly parallel", i.e. it just works
Data Layout
Horizontal Partitioning For each table, assign rows to
machines by some key Or assign arbitrarily (round-robin)
Vertical Partitioning Sort table, and slice off columns Usually not a parallelism trick
But nice for processing queries on read-mostly data (projection is free!)
Intra-Operator Parallelism
E.g. for Hash Join Every site with a horizontal partition
of either R or S fires off a scan thread Every storage site reroutes its data
among join nodes based on hash of the join column(s)
Upon receipt, each site does local hash join
Recall Exchange!
Skew Handling
Skew happens; even when hashing? Yep.
Can pre-sample and/or pre-summarize data to partition better
Solving skew on the fly is harder: need to migrate accumulated dataflow state
FLuX: Fault-tolerant, Load-balancing eXchange
In Current Architectures
All DBMSs can run on shared memory, many on shared-nothing The high end belongs to clusters
The biggest web-search engines run on clusters (Google, Inktomi) And use pretty textbook DB stuff for
Boolean search Fun tradeoffs between answer quality and
availability/management here (the Inktomi story)
Precomputations
Views and Materialization
A view is a logical table A query with a name In general, not updatable
If to be used often, could be materialized Pre-compute and/or cache result
Could even choose to do this for common query sub-expressions Needn’t require a DBA to say “this is a view”
Challenges in Materialized Views
Three main issues:
  Given a workload, which views should be materialized?
  Given a query, how can mat-views be incorporated into the query optimizer?
  As base tables are updated, how can views be incrementally maintained?
See readings book, Gupta & Mumick
Precomputation in IR
Often want to save results of common queries E.g. no point re-running “Britney Spears” or
“Harry Potter” as Boolean queries
Can also use as subquery results E.g. the query “Harry Potter Loves Britney
Spears” can use the “Harry Potter” and “Britney Spears” results
Constrained version of mat-views No surprise -- constrained relational workload And consistency of matview with raw tables is not
critical, so maintenance not such an issue.
Precomputed Aggregates
Aggregation queries work on numerical sets of data Math tricks apply here
Some trivial, some fancy Theme: replace the raw data with small statistical
summaries, get approximate results Histograms, wavelets, samples, dependency-based
models, random projections, etc. Heavily used in query optimizers for selectivity
estimation (a COUNT aggregate) Spate of recent work on approximate query processing
for AVG, SUM, etc.
[Garofalakis, Gehrke, Rastogi tutorial, SIGMOD ‘02]
A Taste of Next Week
Query Dataflows Meet NWs
Some more presentation Indirection in space and time Thematic similarities, differences in NW/QP
Revisit a NW “classic” through a DB lens Adaptive QP In Telegraph
Eddies, Stems, FluX A taste of QP in Sensor Networks
TinyDB (TAG), Directed Diffusion A taste of QP in p2p
PIER project at Berkeley: parallel QP over DHTs
Presentations from MITers?
Open discussion
Contact
[email protected]
http://www.cs.berkeley.edu/~jmh