Lecture 25
15-829A/18-849B/95-811A/19-729A
Internet-Scale Sensor Systems: Design and Policy
XML Query Processing & Historical Synopses
Phil Gibbons
April 17, 2003
04-17-03 2 Lecture 25
Outline
• Last time: XML Query Processing (Part 1)
  • Shanmugasundaram et al., “Relational Databases for Querying XML Documents: Limitations and Opportunities”, VLDB ’99
• XML Query Processing (Part 2)
  • Tatarinov, Ives, Halevy, Weld, “Updating XML”, SIGMOD ’01
• Synopses for Historical Queries
  • Non-decaying
  • Sliding window
  • More general decaying: Cohen, Strauss, “Maintaining Time-Decaying Stream Aggregates”, PODS ’03

Recap
• XML for sensor systems
  + Good for rich, heterogeneous data
  + Supports on-the-fly schema changes
  + Good for hierarchical data
  + Standard data exchange format
  - Query processing is SLOW!
  - In contrast, relational DBMSs are highly reliable, scalable, optimized for performance, and have advanced functionality
Key research question: Can we store XML in a relational DB, and use a relational database system to process queries?

XML Update Operations
• Update primitives
  • Delete (recursive)
  • Insert
    • Simple (atomic values, literal content)
    • Complex (copying)
  • Replace (delete & insert not efficient)
  • Move (copy & delete not always feasible)
• Multiple operations in a statement
• Nested updates
• Focus on semantics rather than syntax
Adapted from slides ©Igor Tatarinov

XML Data Tree
[Figure: data tree for bookdb.xml]
bookdb
  book [publisher=“spi”] (IDREF to the publisher with ID=“spi”)
    title: IrisNet Stinks
    author
      name: Nath
    author
      name: Ke
  publisher [ID=“spi”]
    name: Student Publishing, Inc

Querying XML Data: XQuery
FOR $b IN document(“bookdb.xml”)/bookdb/book
WHERE $b/author/name = “Nath”
RETURN $b
[Figure: the same bookdb data tree as on the previous slide]

Multiple Update Operations
• Need to DELETE/INSERT/REPLACE in a single update
FOR $book IN document(“bookdb.xml”)/bookdb/book,
    $price IN $book/price
WHERE $price > 100
UPDATE $book {
  REPLACE $price WITH $price*0.50
  INSERT <comment>closeout</comment>
}
Adapted from slides ©Igor Tatarinov

Semantic Issues in XML Updates
• IDs and IDREFs
  • Can’t duplicate XML IDs
  • Can’t leave dangling references
• Non-deterministic (ambiguous) updates
  • There is more than one way (XPath expression) to get to an XML element
  • Would like to detect it at compile time
Adapted from slides ©Igor Tatarinov

Deletion: IDREFs
Solutions:
1. Don’t allow delete at all
2. Remove incoming refs
3. Delete entire referrers
4. Reattach to new parent? (schema violation likely)
Adapted from slides ©Igor Tatarinov

Insertion (Copying): IDs & IDREFs
• Can’t duplicate IDs
  • Elements with IDs can’t be copied
  • XML IDs are created by user/app => can’t generate new IDs
  • OK to move (or copy into a different document)
• Can copy IDREFs
  • Cases 1, 2 – illegal (ID copying)
  • Case 3 – copy the reference
[Figure: three copy cases, labeled 1, 2, and 3]
Adapted from slides ©Igor Tatarinov

Ambiguous Updates
• Multiple paths to the same element => non-deterministic update
• Solution: constrain the language
  • Update operations can only modify the children of the element being updated
  • Limit XPath expressions that can be used
[Figure: the updates Book/Price += 10 and Book/Price *= 0.8 applied to a book that is also reached through an IDREF]
Adapted from slides ©Igor Tatarinov
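
Why the two updates in the figure are non-deterministic can be seen with a small sketch. It assumes a starting price of 100 (a made-up value, not from the slides) and uses exact fractions so the arithmetic is not blurred by floating point.

```python
from fractions import Fraction

def add_ten(price):
    """Book/Price += 10"""
    return price + 10

def scale_by_0_8(price):
    """Book/Price *= 0.8"""
    return price * Fraction(4, 5)

start_price = Fraction(100)  # hypothetical starting price

# The same element is reached by two paths, so either order is possible.
add_then_scale = scale_by_0_8(add_ten(start_price))
scale_then_add = add_ten(scale_by_0_8(start_price))

print(add_then_scale, scale_then_add)
```

The two orders leave different prices on the same element, which is why the language is constrained so that this situation cannot arise.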

Implementing XML updates over a relational DBMS
• Map XML schema (DTD) to relations
  • Shared Inlining method [SGT+99]
  In our example:
    Book(tuple_id, title, price)
    Author(parent_id, name)
  tuple_id and parent_id link child tuples to their parents
  Tuple ids are different from XML IDs!!
• Map XQuery updates to SQL
  • Minimize the number of SQL statements
Adapted from slides ©Igor Tatarinov

Deletion Methods
• Trigger-based
  • Using per-tuple triggers
  • Using per-statement triggers
• Cascading delete
• Using Access Support Relations (ASRs)
Per-tuple trigger on table Book:
  DELETE FROM Author
  WHERE parent_id = deleted.tuple_id
Can be executed through index lookup – fast!
Adapted from slides ©Igor Tatarinov
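
A per-tuple trigger of this kind can be tried out with a minimal sketch using Python's built-in sqlite3 module. This is an illustration under assumptions, not the system used in the paper: SQLite writes the deleted row as OLD rather than deleted, and the index name author_parent and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Book(tuple_id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE Author(parent_id INTEGER, name TEXT);
-- Index on parent_id so the trigger body is an index lookup.
CREATE INDEX author_parent ON Author(parent_id);

-- Per-tuple trigger: fires once for every deleted Book row.
CREATE TRIGGER book_delete AFTER DELETE ON Book
FOR EACH ROW
BEGIN
    DELETE FROM Author WHERE parent_id = OLD.tuple_id;
END;

INSERT INTO Book VALUES (1, 'IrisNet Stinks', 75), (17, 'IrisNet Stinks', 70);
INSERT INTO Author VALUES (1, 'Nath'), (1, 'Ke'), (17, 'Nath'), (17, 'Ke');
""")

# Deleting a book removes its authors through the trigger.
conn.execute("DELETE FROM Book WHERE tuple_id = 1")
remaining = conn.execute(
    "SELECT parent_id, name FROM Author ORDER BY name"
).fetchall()
print(remaining)
```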

Insertion Methods
Book
tid | Title          | Price
1   | IrisNet Stinks | 75
…   | …              | …
17  | IrisNet Stinks | 70

Author
parent_id | Name
1         | Nath
1         | Ke
…         | …
17        | Ke
17        | Nath

Need to remember parent mapping (1, 17) when copying children!
• Copying is the hard (interesting) case
• Challenge: assign new tuple ids so that the new parent/child pairs are linked correctly
Adapted from slides ©Igor Tatarinov

Tuple-based Insert
• Retrieve source data using a pre-order traversal.
• Use a stack to keep track of parent id mapping.
• Execute a separate SQL INSERT to copy each tuple – very slow!
Adapted from slides ©Igor Tatarinov

Table-based Insert
• Copy source tuples into a “wide” auxiliary table – creates a snapshot
• Find min and max tuple id values
• Copy tuples while re-mapping ids
  • Add offset nextID – minID to each ID
  • Advance nextID by maxID – minID + 1
• Only one SQL statement per table
• Disadvantages:
  • Extra data copying
  • Id space exhaustion (?)
Example: source ids 8, 7, 5, 9 with nextID = 20 are copied as 23, 22, 20, 24; afterwards nextID = 25
Adapted from slides ©Igor Tatarinov
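
The id re-mapping step can be sketched in a few lines of Python. The function name remap_ids is an assumption made for illustration; in the actual method the offset is applied inside one SQL statement per table rather than in application code.

```python
def remap_ids(source_ids, next_id):
    """Shift a block of tuple ids so the copy starts at next_id.

    Returns the new ids (same order as the source) and the advanced next_id.
    """
    min_id, max_id = min(source_ids), max(source_ids)
    offset = next_id - min_id                     # add nextID - minID to each id
    new_ids = [i + offset for i in source_ids]
    advanced = next_id + (max_id - min_id + 1)    # advance by maxID - minID + 1
    return new_ids, advanced

new_ids, next_id = remap_ids([8, 7, 5, 9], next_id=20)
print(new_ids, next_id)
```

Because every id in a table is shifted by the same offset, parent/child links stay consistent without a per-tuple mapping. The unused ids inside the reserved range are the source of the id space exhaustion concern.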

INSERT (Copy) Performance
[Figure: time in seconds (axis 0 to 7) vs. XML doc size (100K, 500K, 1M) for tuple-based and table-based insert]
Adapted from slides ©Igor Tatarinov

Authors’ Summary
• Designed a set of XML update operations
  • Developed language extensions for XQuery
  • Resolved subtle issues in the semantics
• Proposed a number of alternative implementations of XML updates
  • Performed a comprehensive experimental evaluation
  • Identified most efficient methods
What do you think of this paper?
Will Suman & Yan’s book be a bestseller?
Adapted from slides ©Igor Tatarinov

Historical Queries
• What is the average occupancy of this parking lot?
  • … from 9 to 5?
  • … at 9 am Monday – Friday?
  • … in the last hour?
• What are the peak load hours on PlanetLab nodes?
• How many distinct IP addresses were requested by nodes?
• What is the most frequently requested IP address?

Storing Historical Data
• View as a Data Stream of values
  • Usually, not practical to store ALL past database values
  • Want to limit the amount of storage used
• Often a detailed, exact answer is not interesting:
  • Prefer summarized data (aggregates, samples)
  • Prefer to focus primarily on recent data
  • Suffices to get the leading digits of aggregates correct
=> Keys to staying within the memory limitations

Sampling: Basics
• Idea: A small random sample S of the data often well-represents all the data
  • For a fast approx answer, apply the query to S & “scale” the result
  • E.g., R.a is {0,1}, S is a 20% sample
    select count(*) from R where R.a = 0
    select 5 * count(*) from S where S.a = 0
[Figure: the 30 values of R.a, with the values in S shown in red]
R.a: 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0
Est. count = 5*2 = 10, Exact count = 10
Easy to collect sample of a data stream
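
The sample-and-scale idea can be sketched as follows. The function names are assumptions for illustration, and slide_sample is a made-up 6-element sample containing two 0s, since the transcript does not record which six values were actually highlighted.

```python
import random

def bernoulli_sample(stream, rate, rng):
    """Keep each stream element independently with probability `rate`."""
    return [x for x in stream if rng.random() < rate]

def estimate_count(sample, predicate, scale):
    """Apply the count query to the sample and scale the result."""
    return scale * sum(1 for x in sample if predicate(x))

# The 30 values of R.a from the slide.
R_a = [1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]
exact = sum(1 for a in R_a if a == 0)

# Hypothetical 20% sample with two 0s, matching the slide's 5*2 = 10.
slide_sample = [1, 0, 1, 1, 0, 1]
estimate = estimate_count(slide_sample, lambda a: a == 0, scale=5)
print(exact, estimate)

# On a live stream the sample would be drawn on the fly, e.g.:
random_sample = bernoulli_sample(R_a, rate=0.2, rng=random.Random(0))
```

A different random sample would generally give a different estimate; the slide's example happens to hit the exact count.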

Sampling: Basics
• Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates!
• Can leverage extensive literature on confidence intervals for sampling
  • Actual answer is within the interval [a,b] with a given probability
    E.g., 54,000 ± 600 with prob 90%
  • Use Chebyshev, Chernoff, and Hoeffding inequalities
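
As one concrete instance, Hoeffding's inequality gives a distribution-free interval for a sample mean: for n independent sampled values in a range of width R, the sample mean is within R * sqrt(ln(2/δ) / (2n)) of the true mean with probability at least 1 - δ. The sketch below just evaluates that formula; the function name and the example numbers are assumptions for illustration.

```python
import math

def hoeffding_half_width(n, delta, value_range=1.0):
    """Half-width t such that |sample mean - true mean| <= t
    with probability at least 1 - delta (Hoeffding's inequality)."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# E.g., 1000 sampled 0/1 values and a 90% confidence level.
half_width = hoeffding_half_width(n=1000, delta=0.10)
print(round(half_width, 4))
```

The bound shrinks as 1/sqrt(n), so quadrupling the sample size halves the interval.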

Biased Sampling
• Often, advantageous to sample different data at different rates (Stratified Sampling)
  • E.g., outliers can be sampled at a higher rate to ensure they are accounted for; better accuracy for small groups in group-by queries
• Each tuple j in the relation is selected for the sample S with some probability Pj (can depend on values in tuple j)
• If selected, added to S along with its scale factor sf = 1/Pj
  select sum(R.a) from R where R.b < 5
  select sum(S.a * S.sf) from S where S.b < 5

R.a  | 10  | 10  | 10  | 50  | 50
Pj   | 1/3 | 1/3 | 1/3 | 1/2 | 1/2
S.sf | --- | 3   | --- | --- | 2

Sum(R.a) = 130
Sum(S.a*S.sf) = 10*3 + 50*2 = 130
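
The scale-factor estimate can be sketched as below. The function names and the dictionary row layout are assumptions for illustration, and the b values are made up (the slide does not list them); they are chosen so that every row satisfies R.b < 5.

```python
import random

def biased_sample(rows, prob, rng):
    """Select each row with its own probability and attach sf = 1/Pj."""
    sample = []
    for row in rows:
        p = prob(row)
        if rng.random() < p:
            sample.append((row, 1.0 / p))
    return sample

def estimate_sum(sample, value, predicate):
    """sum(S.a * S.sf) over the sampled rows that satisfy the predicate."""
    return sum(value(row) * sf for row, sf in sample if predicate(row))

# The two rows that ended up in S on the slide, with their scale factors.
slide_sample = [({"a": 10, "b": 1}, 3), ({"a": 50, "b": 1}, 2)]
estimate = estimate_sum(slide_sample,
                        value=lambda r: r["a"],
                        predicate=lambda r: r["b"] < 5)
print(estimate)

# Drawing a fresh sample: small values at rate 1/3, large values at rate 1/2.
rows = [{"a": a, "b": 1} for a in (10, 10, 10, 50, 50)]
fresh = biased_sample(rows, lambda r: 1 / 3 if r["a"] == 10 else 1 / 2,
                      random.Random(1))
```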

How Many Distinct Values in Stream?
Stream: 3, 1, 4, 1, 5, 9, 2, 6, 5, 4 has 7 distinct values
Number of distinct values may be linear in the length of the stream, so can’t afford space for each distinct value
Sampling-based approaches:
1. Collect and store a uniform random sample of the data
2. At query time, estimate based on a function of the frequency distribution
[Figure: a 10% sample containing 7 3 3 7 9 1 7 6. 5 distinct? 50 distinct?]
Theorem in [Charikar et al., PODS ’00] shows that one must examine (sample) almost the entire table to guarantee the estimate is within a factor of 10 with probability > ½, regardless of the function used!

Flajolet & Martin’s 1983 Algorithm
1. Select a hash function H mapping each domain value to a bit position according to an exponential distribution
2. Scan data set, and for each attr value v, set bit position H(v)
3. Estimate based on the least-significant position j with 0-bit
[Figure: Pr(H(v) = k) for bit positions 0, 1, …, k, …, u-1]
E.g., 00001011111: j = 5
Estimate: d’ = (2^j)/.77351
Alon, Matias, Szegedy ’96/’99
Estimate: 2^r, where r is most-significant position with 1-bit (r=6)
+ Explicit construction of H (with only pairwise independence)
+ Stronger guarantees
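
The three steps of the Flajolet & Martin algorithm can be sketched as follows. This is a single-bitmap illustration under assumptions: SHA-256 stands in for the hash function H (the slides do not name one), and the bit position is taken as the number of trailing zero bits of the hash, which gives position k with probability 2^-(k+1).

```python
import hashlib

def bit_position(value, num_bits=32):
    """Map a value to a bit position with Pr(position = k) = 2^-(k+1)."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    pos = 0
    while pos < num_bits - 1 and (h >> pos) & 1 == 0:
        pos += 1
    return pos

def fm_bitmap(values, num_bits=32):
    """Scan the data and set bit H(v) for every value v."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << bit_position(v, num_bits)
    return bitmap

def lowest_zero(bitmap):
    """Least-significant position j holding a 0 bit."""
    j = 0
    while (bitmap >> j) & 1:
        j += 1
    return j

def fm_estimate(bitmap):
    return (2 ** lowest_zero(bitmap)) / 0.77351

slide_bitmap = 0b00001011111
print(lowest_zero(slide_bitmap), fm_estimate(slide_bitmap))
```

Duplicates hash to the same position, so they never change the bitmap; that is what makes the sketch a distinct-values counter. A single bitmap has high variance, so practical uses average over several bitmaps.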

Distinct-Values Queries
Example:
  select count(distinct o_custkey)
  from orders
  where o_orderdate >= ‘2002-01-01’
• How many distinct customers have placed orders this year?
• Stream “orders” has many rows for each customer, but must only count each customer once & only if has an order this year
Template:
  select count(distinct target-attr)
  from Stream
  where predicate
Solve using a “Distinct Sample” [Gibbons, VLDB ’01]

Counting Samples [Gibbons & Matias, SIGMOD ’98]
• Effective for identifying the most popular items
  • Without keeping track of all items
  • With an adaptive threshold for what’s popular
Sample S is a set of <value, count> pairs
• For each new stream element
  • If element value in S, increment its count
  • Otherwise, add to S with probability 1/T

Counting Samples
• If size of sample S exceeds the target space bound, select new threshold T’ > T
  • For each value (with count C) in S, decrement count in repeated tries until C tries or a try in which count is not decremented
    • First try, decrement count with prob 1 - T/T’
    • Next tries, decrement count with prob 1 - 1/T’
  • Subject each subsequent stream element to higher threshold T’
• Estimate of frequency for value in S: count in S + 0.418*T
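
The maintenance rules on the two Counting Samples slides can be put together in a short sketch. Several details are assumptions made for illustration: the class and method names, starting with T = 1, choosing the new threshold as T’ = 2T, and repeating the threshold raise until the sample fits the space bound.

```python
import random
from collections import Counter

class CountingSample:
    def __init__(self, capacity, rng, growth=2.0):
        self.capacity = capacity    # target space bound (number of pairs)
        self.rng = rng
        self.growth = growth        # assumption: T' = growth * T
        self.threshold = 1.0        # T
        self.counts = {}            # value -> count

    def add(self, value):
        if value in self.counts:
            self.counts[value] += 1
        elif self.rng.random() < 1.0 / self.threshold:
            self.counts[value] = 1
            while len(self.counts) > self.capacity:
                self._raise_threshold()

    def _raise_threshold(self):
        old_t, new_t = self.threshold, self.threshold * self.growth
        for value in list(self.counts):
            p = 1.0 - old_t / new_t            # first try
            while self.counts[value] > 0 and self.rng.random() < p:
                self.counts[value] -= 1
                p = 1.0 - 1.0 / new_t          # subsequent tries
            if self.counts[value] == 0:
                del self.counts[value]
        self.threshold = new_t

    def estimate(self, value):
        if value not in self.counts:
            return 0.0
        return self.counts[value] + 0.418 * self.threshold

rng = random.Random(0)
stream = [rng.randint(1, 50) for _ in range(2000)] + [7] * 500
sample = CountingSample(capacity=5, rng=rng)
for item in stream:
    sample.add(item)
true_counts = Counter(stream)
print(sample.counts, sample.threshold)
```

Counts are only incremented when the value actually occurs, so a stored count never exceeds the true frequency; the 0.418*T term compensates for occurrences missed before the value entered the sample.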

Comparison of Hot List Algorithms
[Figure: y-axis 0 to 250000, x-axis 1 to 24; series: exact, counting, concise, traditional. 500K values in [1,500], Zipf parameter = 1.5, Footprint = 100]

New Stream Algorithms for
• Histograms
  • Equi-Width Histograms (Quantiles)
  • Most popular items, V-Opt Histograms
• Wavelets
• Data Mining
  • Stream Clustering (e.g. k-medians)
  • Decision Trees
• Frequency moments, Lp Norms of two streams
• Relational DB operators
  • Join size estimation
Papers in STOC, FOCS, SODA, SIGMOD, VLDB, PODS, etc.

Sliding Window
• Maintain the aggregate / statistic over a sliding window of the N most recent stream elements
  • Motivation: Only the most recent data is important
Position: 1 2 … 20 21 22 23 24 25 26 27 28 29
Stream:   0 1 … 1  0  1  0  0  1  1  0  1  0
N = 10 (positions 20 to 29): Number of 1’s = 5
After element 30 arrives with value 0, N = 10 (positions 21 to 30): Number of 1’s = 4
[Datar, Gionis, Indyk, Motwani, SODA ’02]

Exponential Histograms [Datar et al., SODA ’02]
• Used to maintain the number of 1’s in sliding window, to within a relative error of ε, using O((1/ε) * log^2 N) bits of storage
• Data structure invariant:
  • Bucket sizes are non-decreasing powers of 2
  • For every bucket other than the last bucket, there are at least k and at most k+1 buckets of that size
  • Example: k=2: (1,1,2,2,2,4,4,4,8,8,..)

Algorithm
Data structures:
• For each bucket: timestamp of most recent 1, size
• LAST: size of the last bucket
• TOTAL: Total size of the buckets
New element arrives at time t
1. If last bucket expired, update LAST and TOTAL
2. If (element == 1) Create new bucket with size 1; update TOTAL
3. Merge buckets if there are more than k+1 buckets of the same size; repeat as needed
4. Update LAST if changed
Anytime estimate: TOTAL – (LAST/2)
Adapted from slides ©Garofalakis, Gehrke, Rastogi

Example Run
1. If last bucket expired, update LAST and TOTAL
2. If (element == 1) Create new bucket with size 1; update TOTAL
3. Merge buckets if there are more than k+1 buckets of the same size; repeat as needed
4. Update LAST if changed
Bucket sizes, oldest first, as three 1’s arrive (here k = 1, so at most 2 buckets per size):
32,16,8,8,4,4,2,1,1 <- 1
32,16,8,8,4,4,2,2,1 <- 1
32,16,8,8,4,4,2,2,1,1 <- 1
32,16,16,8,4,2,1
Adapted from slides ©Garofalakis, Gehrke, Rastogi
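
The four steps of the algorithm can be sketched as a small Python class. This is an illustration under assumptions rather than the authors' implementation: the class and method names are made up, buckets are kept in a plain list with the newest first, and LAST is read from the oldest bucket instead of being stored separately.

```python
class ExpHistogram:
    """Approximate number of 1s among the last `window` stream elements."""

    def __init__(self, window, k):
        self.window = window
        self.k = k
        self.buckets = []   # newest first: (timestamp of most recent 1, size)
        self.total = 0      # TOTAL: sum of all bucket sizes
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Step 1: expire the last (oldest) bucket once its most recent 1
        # has moved out of the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.total -= self.buckets.pop()[1]
        if bit != 1:
            return
        # Step 2: new bucket of size 1.
        self.buckets.insert(0, (self.time, 1))
        self.total += 1
        # Step 3: merge the two oldest buckets of a size whenever there are
        # more than k+1 buckets of that size; cascade to the next size.
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= self.k + 1:
                break
            newer, older = same[-2], same[-1]
            self.buckets[newer] = (self.buckets[newer][0], 2 * size)
            del self.buckets[older]
            size *= 2

    def sizes_oldest_first(self):
        return [s for _, s in reversed(self.buckets)]

    def estimate(self):
        # Step 4 is implicit: LAST is the size of the oldest bucket.
        if not self.buckets:
            return 0
        return self.total - self.buckets[-1][1] / 2

hist = ExpHistogram(window=10, k=2)
for bit in [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]:
    hist.add(bit)
print(hist.sizes_oldest_first(), hist.estimate())
```

Starting from the bucket sizes of the example run with k = 1 and feeding three 1s reproduces the final state 32,16,16,8,4,2,1.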

Approximation Guarantee
• Estimate: TOTAL – (LAST/2)
• For every bucket other than the last bucket, there are at least k and at most k+1 buckets of that size
• Example: k=2: (1,1,2,2,2,4,4,4,8,8,..)
• Implies, for bucket sizes S_1 <= S_2 <= … (S_1 is the most recent bucket):
  • Case 1: S_i > S_{i-1}: S_i = 2^j, S_{i-1} = 2^{j-1}
    S_{i-1} + … + S_2 + S_1 + 1 >= k*(1 + 2 + 4 + .. + 2^{j-1}) + 1 = k*(2^j - 1) + 1 > (k/2)*2^j = (k/2)*S_i
  • Case 2: S_i = S_{i-1}: S_i = 2^j, S_{i-1} = 2^j is similar
Expire a bucket only when its most recent 1 moves out of window
Thus actual number of 1’s is between TOTAL–LAST+1 and TOTAL
Relative error is at most (LAST/2)/(TOTAL–LAST+1)
Thus relative error is < 1/k. Setting k = 1/ε makes error < ε
Adapted from slides ©Garofalakis, Gehrke, Rastogi

Distributed Streams Algorithms for Sliding Windows [Gibbons & Tirthapura, SPAA ’02]
• Number of 1’s, Union of distributed streams
  • Lower Bound: Any deterministic approx scheme for relative error 1/64 requires Ω(N) space, even for 2 streams
  • First randomized (ε,δ)-approx scheme for sliding window using only logarithmic memory words per stream
• Number of distinct values, Distributed streams, Sliding window
  • First randomized (ε,δ)-approx scheme using only logarithmic memory words per stream
  • O(log N log(1/δ) / ε^2) words per stream, O(log(1/δ)) expected update time
[Figure: example bit streams]

Time-Decaying Aggregates
• Random Early Detection (RED)
  • Weighted average of previous queue lengths used to determine what fraction of packets to discard
• Holding-time policies for ATM virtual circuits
  • Time-decaying weighted averages of previous idle times used to determine which circuits to close
• Internet gateway selection products
  • Time-decaying average of previous reliability measurements used in path selection
• Telecom usage patterns

Decay Functions
• Focused on the Decayed Count Problem:
  • C(T) = Σ_{t < T s.t. v(t) = 1} g(T – t)
• Exponential: g(x) = exp(-λx), λ > 0
  • C := v + exp(-λ) * C
• Sliding window of size W: g(x) = 1 for x <= W and otherwise g(x) = 0
• Polynomial: g(x) = (1/x)^α
• Also: Polyexponential, Chordal, Polygonal
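
For exponential decay, the one-line update C := v + exp(-λ) * C is enough to maintain the decayed count with a single stored number. The sketch below checks it against the direct sum; it assumes the current element is counted with weight g(0) = 1, as the update rule implies, and the function names and the example stream are made up for illustration.

```python
import math

def decayed_count_direct(bits, lam):
    """Sum of exp(-lam * age) over the positions holding a 1 (age 0 = newest)."""
    newest = len(bits) - 1
    return sum(math.exp(-lam * (newest - t)) for t, v in enumerate(bits) if v == 1)

def decayed_count_incremental(bits, lam):
    """Maintain the same value with C := v + exp(-lam) * C per element."""
    c = 0.0
    for v in bits:
        c = v + math.exp(-lam) * c
    return c

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # hypothetical 0/1 stream
lam = 0.3
direct = decayed_count_direct(bits, lam)
incremental = decayed_count_incremental(bits, lam)
print(direct, incremental)
```

Other decay functions have no such constant-size recurrence, which is what motivates the more general techniques that follow.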

Decay Functions
• N = min value after which the decay function nullifies
• Exponential: g(x) = exp(-λx), λ > 0
  • Θ(log N) bits
• Sliding window of size W: g(x) = 1 for x <= W, g(x) = 0 o.w.
  • Θ(log^2 N) bits [Datar et al., SODA ’02]
• Any decay function:
  • O(log^2 N) bits
• Polynomial: g(x) = (1/x)^α
  • O(log N loglog N) bits
• Polyexponential, Chordal, Polygonal:
  • Θ(log N) bits

Cascaded Exponential Histograms
• Exponential histogram for a sliding window of size W can produce an estimate for all windows of size at most W
• Linear combination of these estimates results in an estimate for ANY decay function g(x), that is within relative error ε
Example: g(0)=8, g(1)=5, g(2)=3, g(3)=2, & g(x)=0 for x > 3
Decayed count is 8*v(T) + 5*v(T-1) + 3*v(T-2) + 2*v(T-3)
= (8-5)*v(T) + (5-3)*v[T-1..T] + (3-2)*v[T-2..T] + 2*v[T-3..T]
Plug in sliding window estimate for each v[t..T] above
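
The rewriting of the decayed count as a combination of window counts can be checked with a short sketch, where v[t..T] denotes the number of 1s from time t to T. The function names and the example stream are assumptions for illustration, and an exact window count stands in for the exponential-histogram estimate that would be plugged in on a real stream.

```python
def decayed_count(v, g):
    """Direct form: sum over ages x of g(x) * v(T - x); v[-1] is v(T)."""
    return sum(g[x] * v[len(v) - 1 - x] for x in range(min(len(g), len(v))))

def exact_window_count(v, size):
    """Number of 1s among the `size` most recent elements."""
    return sum(v[-size:])

def decayed_count_from_windows(v, g, window_count):
    """Same value as a linear combination of sliding-window counts."""
    total = 0
    for x in range(len(g)):
        next_weight = g[x + 1] if x + 1 < len(g) else 0
        total += (g[x] - next_weight) * window_count(v, x + 1)
    return total

g = [8, 5, 3, 2]      # g(0..3) from the slide; g(x) = 0 for x > 3
v = [1, 0, 1, 1]      # hypothetical v(T-3), v(T-2), v(T-1), v(T)
direct = decayed_count(v, g)
via_windows = decayed_count_from_windows(v, g, exact_window_count)
print(direct, via_windows)
```

When g is non-increasing, every coefficient g(x) - g(x+1) is non-negative, so per-window relative errors do not cancel badly in the combination.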

IrisLog uses Decaying Samples
• We have implemented a Host and Network Monitoring Service on IrisNet running on PlanetLab
  • To support historical queries on the monitored data, we use multi-resolution vectors to store each monitored metric
  • These vectors provide higher resolution samples of recent data than of older data
Currently, an ad hoc decay function is used
We have also explored an exponential-decay type sampling procedure that, at each step,
• adds the current value to the sample, and
• discards a random item from the sample
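
One way to read this two-step procedure is sketched below. It is an interpretation under assumptions, not IrisLog's code: the sample has a fixed capacity, the discarded item is chosen uniformly among the items already stored, and the current value is always kept. Under those assumptions each stored item survives a step with probability 1 - 1/capacity, so the chance that an old value is still present decays exponentially with its age.

```python
import random

def decaying_sample_step(sample, value, capacity, rng):
    """Discard a random stored item (once full), then add the current value."""
    if len(sample) >= capacity:
        del sample[rng.randrange(len(sample))]
    sample.append(value)

rng = random.Random(0)
sample = []
for reading in range(100):          # hypothetical metric readings 0..99
    decaying_sample_step(sample, reading, capacity=8, rng=rng)
print(sample)
```

Because deletions keep the remaining items in arrival order, the list stays sorted by time, with recent readings represented more densely than old ones.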