GALT WorkshopOct 16, 2003
1
Languages for Distributed Information
Val Tannen
Computer and Information Science Department
University of Pennsylvania
Acknowledgements:
• Philippa Gardner and Sergio Maffeis, Imperial College• Arnaud Sahuguet, Bell Labs (formerly at UPenn)• Zack Ives and Benjamin Pierce, UPenn
GALT WorkshopOct 16, 2003
2
• Past: because it did not fit on one system; data
placement was a big topic.
•Present: independent data sources agree to collaborate
• Queries that require integration across multiple sources.
• Heterogeneity. Is it solved by XML? Results are mixed.
• Redundancy is what makes things really interesting!
Distributed information
GALT WorkshopOct 16, 2003
3
A query arrives… and demands satisfaction
query process
Arrives in any node. Refers to names. Names relate to data
in various nodes. For example:
node A: proj (join R (sel S))
Need to discover that R is at node A and S is at node B.
Need a plan to compute the result of
proj (join tableR@A (sel tableS@B)
and return the answer to the client that asked the query.
GALT WorkshopOct 16, 2003
4
One language for queries, views, and query plans
data languagequery language
(SQL, XQuery)
view language
query plan language
processes
GALT WorkshopOct 16, 2003
5
Motives and intentions
Existing view and plan languages are limited.
Capture distributed query processing in a high-level language.
(SQL vs. C++)
Easier to program self-tuning features.
Easier to generate automatically, eg., by optimizers.
Easier to model toward formal verification.
My ambition:) is to convince
(1) database people that process calculi are useful,
(2) process calculi people to look at some new problems.
GALT WorkshopOct 16, 2003
6
Concurrency and process calculi
Why process calculi?
Because queries arrive asynchronously and must be processed
concurrently within a node (and across nodes).
Composition primitives:
e1 | e2 parallel
e1 ; e2 sequential
e1 or e2 nondeterministic choice
e1, e2 expressions denoting processes
GALT WorkshopOct 16, 2003
7
Synchronization and communication
CCS notifL ; e1 | waitL ; e2 e1 | e2
pi-calculus
new c . ( send c v ; e1 | recv c x. e2(x) )
new c . e1 | e2(v)
Anticipation: the blocking operational semantics does not
capture well query execution. The old dataflow paradigm might
be better for that. But channels are essential.
blocking synchro
+ comm on channel c
GALT WorkshopOct 16, 2003
8
Capturing query plans
Original query may be “decomposed” into subordinate
queries running at different nodes. Recall
node A: proj (join R (sel S))
B
sel
A
proj join
SR
Client
plan:
Get from here… … to here
GALT WorkshopOct 16, 2003
9
Query plan “development”
node A: proj (join R (sel S))
> meta-step (discovery)>
node A: proj (join R@A (sel S@B))
> meta-step (discovery)>
node A: proj (join tableR (sel S@B))
> meta-step (optimization)>
node A: new c . proj (join tableR (split c (sel S@B)))
>operational semantics step>
node A: new c . proj (join tableR (recv c))
| send c (sel S@B)
GALT WorkshopOct 16, 2003
10
A new primitive: split
new c . e1(split c (e2))
new c . e1(recv c) | send c e2
(We are using a simplified form of recv since the value passed
through the channel is consumed in just one place.)
split is typically introduced by optimization steps.
Alternative (query shipping vs. data shipping):
node A: new c . proj (join tableR (sel (split c S@B)))
GALT WorkshopOct 16, 2003
11
Distribution and migration
Each node handles a “soup” of parallel processes:
( node A: e1 | e2 ) || ( node B: e3 | e4 | e5 )
Process migration:
( node A: e1 | migrate B e2) || ( node B: e3 )
( node A: e1 ) || ( node B: e3 | e2 )
Will be used to move subordinate queries.
GALT WorkshopOct 16, 2003
12
Query plan, continued
node A: new c . proj (join tableR (recv c))
| send c (sel S@B)
>meta-step (subordination)>
node A: new c . proj (join tableR (recv c))
| migrate B (send c (sel S@B))
>operational semantics step>
new c@A . ( node A: proj (join tableR (recv c))
|| node B: send c@A (sel S@B) )
Global channels vs. located channels.
We can get by with located channels.
GALT WorkshopOct 16, 2003
13
Distributed query evaluation
>meta-step (discovery)>
new c@A . ( node A: proj (join tableR (recv c))
|| node B: send c@A (sel tableS) )
Should not block send/recv on the entire table (all-push). Data
should be streamed.
All-pull (“iterators”) blocks recv/send on each tuple: not good.
Push-pull (queued) is most reasonable.
B
sel
A
proj join
SR
Client
GALT WorkshopOct 16, 2003
14
“Lay down the pipes … Turn on the faucets”
The approach of ubQL [Sahuguet&T., 2001]. Separate
• deployment phase: queries are split
channels are created
subordinate query processes migrate
carrying channel names with them
• execution phase: data is streamed through channels
processed according to subord. queries
• adaptive behavior: execution can be interrupted and reset
channels can be flushed and reset
alternative plans can be started
GALT WorkshopOct 16, 2003
15
Adaptive query processing
Another new primitive (like Cardelli’s “Web combinators”)
adapt stream AltPlan stream if adequate
AltPlan if drying up
redundancy
Example:
new c@A .
node A: proj (join tableR (adapt (recv c) AltPlan))
|| node B: send c@A (sel tableS)
where, eg., AltPlan new c’@A. split c’ (sel S@C)
GALT WorkshopOct 16, 2003
16
Distributed databases systems: 25+ years
A lot of useful techniques, especially optimizations to reduce
bandwidth. Eg., use of semijoins.
Essential limitation: need powerful, all-knowing node, to generate the query plans.
proj join
sel
B
A
proj SR
Client
These are also subordinatequeries. Not quite decomposition…
still easily expressible
GALT WorkshopOct 16, 2003
17
Where is the data? Discovery!
• Distributed Database Systems
centralized, complete, consistent knowledge.
• No clue, go out and search everywhere (Gnutella)!
• Keyword-node indexes (catalogs) in each node.
• Finally something clever: Distributed Hash Tables
• Layered organization using views
(simple version in Mediated Information Integration: Tsimmis,
Kleisli, Information Manifold, K2, Garlic, etc.)
GALT WorkshopOct 16, 2003
18
Views can organize distributed data
C
S1 = expr( localTables )
S2 = expr( localTables )
B
R = expr( localTables, S2@C)
A
query = expr( R@B, S1@C)
V = expr(S1@C, S2@C)
composition-with-views
rewriting-with-views
GALT WorkshopOct 16, 2003
19
New kinds of viewsGeneralize split:
e1(spawn e2 e3) e1(e2) | e3
split c e spawn (recv c) (send c e)
View definition that invokes a remote continuous query:
R = new cr. spawn (recv cr) (send cq cr)
Continuous query “installed” elsewhere:
recv* cq x. send x (sel tableT)
cr is a data stream channelcq is a standard
pi-calculus channel
GALT WorkshopOct 16, 2003
20
Some tentative conclusions
• Very useful process primitives: parallel and sequential
composition, migration, channels.
• Need a new behavior for certain channels: streaming data (at
least at the language level; underneath we have bounded
buffers/queues ).
• Standard channel semantics is still useful for passing small
values or channel names, eg., for continuous queries.
• Some new primitives should be considered split/spawn,
adapt.
GALT WorkshopOct 16, 2003
21
Where to?
Service calls, active (dynamic) data, done by Gardner and
Maffeis.
Clarify semantics of execution phase so we can apply
verification techniques for adaptive query plans.
In the presence of redundancy, building globally optimal plans
seems hopeless. How to define local optimality? Is there even a
concept of “good enough”?
Combine layers of views with distributed hashing at the level of
schema?
GALT WorkshopOct 16, 2003
22
GALT WorkshopOct 16, 2003
23
An interesting complication: services
Service calls in query:
to applications that provide additional processing
(that cannot be programmed in the query language).
Eg., analysis tools, scientific (BLAST) or financial
Service calls in data (active data):
part of the data is to be retrieved only on demand
Eg., Active XML
Service definition, result computation, and result consumption
all in potentially different nodes.
GALT WorkshopOct 16, 2003
24
Mixing query features with process features
In principle we can take any query language (SQL, OQL,
XQuery) and add the process primitives we saw.
Really? The query language better be nicely “compositional”
and “referentially transparent”. SQL: lots of special cases…
Even better is to also have a robust static type system.
Things can get complicated:
foreach r in Dept@A collect foreach s in Emp@B where s.name=r.manager collect {s.age}
GALT WorkshopOct 16, 2003
25
Where is the relevant data? (2)
Basic algorithm can be iterated and made useful to databases
(Ives). But, what to do with irregular pre-existing data with
complex schema?
GALT WorkshopOct 16, 2003
26
Some tentative conclusions (2)
• Services are modeled in the same framework with little
effort. Service call arguments can also be carried as part of
migrating processes.
• Did not show it, but subscriptions, continuous queries,
caching, proxying (query redirecting) can be approached too.
Useful feature from process calculi: replicated processes.
• Well-designed query languages will mix nicely with process
calculi primitives.
• Operational semantics steps and meta-steps must be
interleaved (this makes me uncomfortable!)