
15-829A/18-849B/95-811A/19-729A

Internet-Scale Sensor Systems: Design and Policy

Lecture 25

XML Query Processing & Historical Synopses

Phil Gibbons

April 17, 2003

04-17-03 2 Lecture 25

Outline

•Last time: XML Query Processing (Part 1)

• Shanmugasundaram et al, “Relational Databases for Querying XML Documents: Limitations and Opportunities”, VLDB ’99

•XML Query Processing (Part 2)

• Tatarinov, Ives, Halevy, Weld, “Updating XML”, SIGMOD

•Synopses for Historical Queries

• Non-decaying

• Sliding window

• More general decaying: Cohen, Strauss, “Maintaining Time-Decaying Stream Aggregates”, PODS ’03

Recap

•XML for sensor systems

+ Good for rich, heterogeneous data

+ Supports on-the-fly schema changes

+ Good for hierarchical data

+ Standard data exchange format

- Query processing is SLOW!

- In contrast, relational DBMS are highly reliable, scalable, optimized for performance, & have advanced functionality

Key research question: Can we store XML in a relational DB, and use a relational database system to process queries?

Outline

•Synopses for Historical Queries

• Non-decaying

• Sliding window

XML Update Operations

• Update primitives

• Delete (recursive)

• Insert

• Simple (atomic values, literal content)

• Complex (copying)

• Replace (delete & insert not efficient)

• Move (copy & delete not always feasible)

• Multiple operations in a statement

• Nested updates

• Focus on semantics rather than syntax

Adapted from slides ©Igor Tatarinov

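XML Data Tree

[Figure: XML data tree rooted at bookdb. A book element (publisher=“spi”) has the title “IrisNet Stinks” and authors Nath and Ke; a publisher element (ID=“spi”) is named “Student Publishing, Inc”. The book's publisher attribute is an IDREF to the publisher's ID.]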

Querying XML Data: XQuery

FOR $b IN document(“bookdb.xml”)/bookdb/book
WHERE $b/author/name = “Nath”
RETURN $b

[Figure: the bookdb data tree from the previous slide, repeated next to the query]

Multiple Update Operations

• Need to DELETE/INSERT/REPLACE in a single update

FOR $book IN document(“bookdb.xml”)/bookdb/book,
    $price IN $book/price
WHERE $price > 100
UPDATE $book {
    REPLACE $price WITH $price*0.50
    INSERT <comment>closeout</comment>
}

Semantic Issues in XML Updates

• IDs and IDREFs

• Can’t duplicate XML IDs

• Can’t leave dangling references

• Non-deterministic (ambiguous) updates

• There is more than one way (XPath expression) to get to an XML element

• Would like to detect it at compile time

Deletion: IDREFs

Solutions:

1. Don’t allow delete at all

2. Remove incoming refs

3. Delete entire referrers

4. Reattach to new parent?

(schema violation likely)

Insertion (Copying): IDs & IDREFs

• Can’t duplicate IDs

• Elements with IDs can’t be copied

• XML IDs are created by user/app => can’t generate new IDs

• OK to move (or copy into a different document)

• Can copy IDREFs

• Cases 1,2 – illegal (ID copying)

• Case 3 - copy the reference

Adapted from slides ©Igor Tatarinov

Ambiguous Updates

• Multiple paths to the same element => non-deterministic update

• Solution: constrain the language

• Update operations can only modify the children of the element being updated

• Limit XPath expressions that can be used

[Figure: two updates, Book/Price += 10 and Book/Price *= 0.8, that reach the same book element, one of them through an IDREF]

Implementing XML updates over a relational DBMS

• Map XML schema (DTD) to relations

• Shared Inlining method [SGT+99]

In our example:

Book(tuple_id, title, price)

Author(parent_id, name)

tuple_id and parent_id link child tuples to their parents

Tuple ids are different from XML IDs!!

• Map XQuery updates to SQL

• Minimize the number of SQL statements

Deletion Methods

• Trigger-based

• Using per-tuple triggers

• Using per-statement triggers

• Cascading delete

• Using Access Support Relations (ASRs)

Per-tuple trigger on table Book:

DELETE FROM Author

WHERE parent_id = deleted.tuple_id

Can be executed through index lookup – fast!

Adapted from slides ©Igor Tatarinov
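The per-tuple trigger can be tried out with a minimal sketch that uses Python's built-in sqlite3 module as a stand-in relational engine. The Book and Author relations follow the slides; the trigger name, index name, and sample rows are made up for illustration, and SQLite refers to the deleted row as OLD rather than deleted.

```python
import sqlite3

def build_db():
    """Create the Book/Author relations with a per-tuple delete trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Book(tuple_id INTEGER PRIMARY KEY, title TEXT, price REAL);
        CREATE TABLE Author(parent_id INTEGER, name TEXT);
        -- Index on parent_id so the trigger body can run as an index lookup.
        CREATE INDEX author_parent ON Author(parent_id);
        -- Row-level trigger: fires once for every deleted Book tuple.
        CREATE TRIGGER book_delete AFTER DELETE ON Book
        FOR EACH ROW
        BEGIN
            DELETE FROM Author WHERE parent_id = OLD.tuple_id;
        END;
    """)
    # Hypothetical sample data.
    conn.executemany("INSERT INTO Book VALUES (?, ?, ?)",
                     [(1, "IrisNet Stinks", 75.0), (2, "Another Book", 20.0)])
    conn.executemany("INSERT INTO Author VALUES (?, ?)",
                     [(1, "Nath"), (1, "Ke"), (2, "Someone")])
    return conn

def delete_book(conn, tuple_id):
    """Delete one book and return the author names that remain."""
    conn.execute("DELETE FROM Book WHERE tuple_id = ?", (tuple_id,))
    rows = conn.execute("SELECT name FROM Author ORDER BY name")
    return [row[0] for row in rows]
```

Deleting book 1 removes its two Author tuples through the trigger, so the application issues a single DELETE statement.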

Insertion Methods

Book
tid   Title            Price
1     IrisNet Stinks   75
…     …                …
17    IrisNet Stinks   70

Author
parent_id   Name
…           …
17          Nath

Need to remember parent mapping (1, 17) when copying children!

• Copying is the hard (interesting) case

• Challenge: assign new tuple id’s so that the new parent/child pairs are linked correctly

Tuple-based Insert

• Retrieve source data using a pre-order traversal.

• Use a stack to keep track of parent id mapping.

• Execute a separate SQL INSERT to copy each tuple – very slow!

Table-based Insert

• Copy source tuples into a “wide” auxiliary table – creates a snapshot

• Find min and max tuple id values

• Copy tuples while re-mapping ids

• Add offset nextID – minID to each ID

• Advance nextID by maxID – minID + 1

• Only one SQL statement per table

• Disadvantages:

• Extra data copying

• Id space exhaustion (?)

Example: source tuple ids 8, 7, 5, 9 with nextID = 20 are copied as 23, 22, 20, 24; afterwards nextID = 25
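The id re-mapping step can be sketched in a few lines of Python. The function names and the (tuple_id, parent_id, payload) row layout are assumptions for illustration; a real implementation would do the same arithmetic inside one SQL statement per table.

```python
def remap_ids(source_ids, next_id):
    """Return (old -> new id mapping, advanced next_id) for a set of copied tuples."""
    min_id, max_id = min(source_ids), max(source_ids)
    offset = next_id - min_id                      # added to every copied id
    mapping = {old: old + offset for old in source_ids}
    return mapping, next_id + (max_id - min_id + 1)

def copy_rows(rows, mapping):
    """Copy (tuple_id, parent_id, payload) rows, rewriting ids through the mapping.

    Parent ids outside the copied set (e.g. the parent of the copied root)
    are left unchanged in this sketch.
    """
    return [(mapping[tid], mapping.get(pid, pid), payload)
            for tid, pid, payload in rows]
```

With the ids from the slide, remap_ids([8, 7, 5, 9], 20) uses offset 20 - 5 = 15 and advances nextID by 9 - 5 + 1 = 5. The gaps left in the id range (21 here) are one source of the id space exhaustion concern.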

INSERT (Copy) Performance

[Figure: INSERT (copy) performance of the tuple-based and table-based methods as a function of XML doc size (100K, 500K, 1M)]

Authors’ Summary

• Designed a set of XML update operations

• Developed language extensions for XQuery

• Resolved subtle issues in the semantics

• Proposed a number of alternative implementations of XML updates

• Performed a comprehensive experimental evaluation

• Identified most efficient methods

What do you think of this paper?

Will Suman & Yan’s book be a bestseller?

Adapted from slides ©Igor Tatarinov

Outline

• Sliding window

Historical Queries

•What is the average occupancy of this parking lot?

•… from 9 to 5?

•… at 9 am Monday – Friday?

•… in the last hour?

•What are the peak load hours on PlanetLab nodes?

•How many distinct IP addresses requested by nodes?

•What is the most frequently requested IP address?

Storing Historical Data

•View as a Data Stream of values

• Usually, not practical to store ALL past database values

• Want to limit the amount of storage used

•Often a detailed, exact answer is not interesting:

• Prefer summarized data (aggregates, samples)

• Prefer to focus primarily on recent data

• Suffices to get the leading digits of aggregates correct

=> Keys to staying within the memory limitations

Outline

• Sliding window

Sampling: Basics

• Idea: A small random sample S of the data often well-represents all the data

• For a fast approx answer, apply the query to S & “scale” the result

• E.g., R.a is {0,1}, S is a 20% sample

select count(*) from R where R.a = 0

select 5 * count(*) from S where S.a = 0

[Figure: the 30 values of R.a, with the values in S marked in red]

R.a: 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0

Est. count = 5*2 = 10, Exact count = 10

Easy to collect sample of a data stream
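The scale-up estimate can be sketched as follows. This is a minimal illustration that assumes a Bernoulli sample (each item kept independently with the given rate); the function name and the seed parameter are made up for the example.

```python
import random

def approx_count(stream, predicate, rate, seed=0):
    """Estimate how many stream items satisfy predicate from a Bernoulli sample.

    Returns (estimate, sample). Each item is kept with probability `rate`
    in a single pass, and the count over the sample is scaled by 1/rate.
    """
    rng = random.Random(seed)
    sample = [x for x in stream if rng.random() < rate]
    hits = sum(1 for x in sample if predicate(x))
    return hits / rate, sample
```

For a 20% sample the scale factor is 1/0.2 = 5, matching the rewritten query on the slide. The estimate varies from run to run with the sample drawn; only its expected value equals the exact count.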

Sampling: Basics

•Unbiased: For expressions involving count, sum, avg: the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates!

•Can leverage extensive literature on confidence intervals for sampling

• Actual answer is within the interval [a,b] with a given probability

E.g., 54,000 ± 600 with prob 90%

• Use Chebychev, Chernoff, and Hoeffding inequalities

Biased Sampling

• Often, advantageous to sample different data at different rates (Stratified Sampling)

• E.g., outliers can be sampled at a higher rate to ensure they are accounted for; better accuracy for small groups in group-by queries

• Each tuple j in the relation is selected for the sample S with some probability Pj (can depend on values in tuple j)

• If selected, added to S along with its scale factor sf = 1/Pj

select sum(R.a) from R where R.b < 5

select sum(S.a * S.sf) from S where S.b < 5

R.a    10    10    10    50    50
Pj     1/3   1/3   1/3   1/2   1/2
S.sf   ---   3     ---   ---   2

Sum(R.a) = 130

Sum(S.a*S.sf) = 10*3 + 50*2 = 130
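A minimal sketch of biased sampling with per-tuple scale factors. The function names are hypothetical; rows can be any Python objects, and prob is a caller-supplied function giving Pj for a row.

```python
import random

def biased_sample(rows, prob, seed=0):
    """Keep each row with probability prob(row); store it with scale factor 1/prob."""
    rng = random.Random(seed)
    sample = []
    for row in rows:
        p = prob(row)
        if rng.random() < p:
            sample.append((row, 1.0 / p))
    return sample

def estimate_sum(sample, value, predicate=lambda row: True):
    """Scaled sum over the sample: sum(S.a * S.sf) over rows passing the predicate."""
    return sum(value(row) * sf for row, sf in sample if predicate(row))
```

Feeding in the sample from the table above (value 10 with sf 3 and value 50 with sf 2) gives 10*3 + 50*2 = 130. In general the estimate equals the true sum only in expectation.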

How Many Distinct Values in Stream?

Stream: 3, 1, 4, 1, 5, 9, 2, 6, 5, 4 (7 distinct values)

Number of distinct values may be linear in the length of the stream, so can’t afford space for each distinct value

Sampling-based approaches:

1. Collect and store a uniform random sample of the data

2. At query time, estimate based on a function of the frequency distribution

[Figure: a 10% sample containing 7 3 3 7 9 1. Does the full data have 5 distinct values? 50 distinct values?]

Theorem in [Charikar et al, PODS ’00] shows that one must examine (sample) almost the entire table to guarantee the estimate is within a factor of 10 with probability > ½, regardless of the function used!

Flajolet & Martin’s 1983 Algorithm

1. Select a hash function H mapping each domain value to a bit position according to an exponential distribution

2. Scan data set, and for each attr value v, set bit position H(v)

3. Estimate based on the least-significant position j with 0-bit

[Figure: Pr(H(v) = k) plotted against bit position k = 0, 1, …, u-1]

E.g., 00001011111 j = 5

Estimate: d’ = (2^j)/.77351

Alon, Matias, Szegedy ’96/’99

Estimate: 2^r, where r is most-significant position with 1-bit (r=6)

+ Explicit construction of H (with only pairwise independence)

+ Stronger guarantees
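A minimal sketch of both estimators over a single bitmap. It assumes SHA-256 (via hashlib) as a stand-in for the hash function H, and takes the number of trailing zero bits of the hash as the bit position, so position k is hit with probability 2^-(k+1). Function names are hypothetical; practical versions combine many bitmaps to reduce variance.

```python
import hashlib

PHI = 0.77351  # correction constant from the slide

def bit_position(value, num_bits=32):
    """Map a value to bit position k = number of trailing zero bits of its hash."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    k = 0
    while k < num_bits - 1 and (h >> k) & 1 == 0:
        k += 1
    return k

def fm_bitmap(stream, num_bits=32):
    """Scan the stream once, setting bit H(v) for every value v."""
    bitmap = 0
    for v in stream:
        bitmap |= 1 << bit_position(v, num_bits)
    return bitmap

def fm_estimate(bitmap):
    """Flajolet-Martin: 2^j / PHI, where j is the least-significant 0-bit position."""
    j = 0
    while (bitmap >> j) & 1:
        j += 1
    return (2 ** j) / PHI

def ams_estimate(bitmap):
    """Alon-Matias-Szegedy style: 2^r, where r is the most-significant 1-bit position."""
    return 0 if bitmap == 0 else 2 ** (bitmap.bit_length() - 1)
```

For the slide's bitmap 00001011111, fm_estimate uses j = 5 and ams_estimate uses r = 6. Duplicates only re-set bits that are already set, so the bitmap depends on the set of distinct values alone.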

Distinct-Values Queries

Example:

select count(distinct o_custkey)
from orders
where o_orderdate >= ‘2002-01-01’

• How many distinct customers have placed orders this year?

• Stream “orders” has many rows for each customer, but must only count each customer once & only if has an order this year

Template:

select count(distinct target-attr)
from Stream
where predicate

Solve using a “Distinct Sample” [Gibbons, VLDB ’01]

Counting Samples [Gibbons & Matias, SIGMOD ’98]

• Effective for identifying the most popular items

• Without keeping track of all items

• With an adaptive threshold for what’s popular

Sample S is a set of <value, count> pairs

• For each new stream element

• If element value in S, increment its count

• Otherwise, add to S with probability 1/T

Counting Samples

• If size of sample S exceeds the target space bound, select new threshold T’ > T

• For each value (with count C) in S, decrement count in repeated tries until C tries or a try in which count is not decremented

• First try, decrement count with prob 1 - T/T’

• Next tries, decrement count with prob 1 - 1/T’

• Subject each subsequent stream element to higher threshold T’

• Estimate of frequency for value in S: count in S + 0.418*T
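The insertion and threshold-raising rules can be sketched as below. This is an illustrative reading of the two slides, not the authors' code: the class name, the seed, and the policy T’ = growth * T for choosing the new threshold are assumptions.

```python
import random

class CountingSample:
    """Set of <value, count> pairs kept within max_entries by raising threshold T."""

    def __init__(self, max_entries, seed=0, growth=2.0):
        self.max_entries = max_entries
        self.threshold = 1.0        # T; with T = 1 every new value is admitted
        self.growth = growth        # assumed policy: T' = growth * T
        self.counts = {}            # value -> count
        self.rng = random.Random(seed)

    def add(self, value):
        if value in self.counts:
            self.counts[value] += 1
        elif self.rng.random() < 1.0 / self.threshold:
            self.counts[value] = 1
            while len(self.counts) > self.max_entries:
                self._raise_threshold()

    def _raise_threshold(self):
        old_t, new_t = self.threshold, self.threshold * self.growth
        for value in list(self.counts):
            # First try decrements with prob 1 - T/T'; later tries with prob 1 - 1/T'.
            p = 1.0 - old_t / new_t
            while self.counts[value] > 0 and self.rng.random() < p:
                self.counts[value] -= 1
                p = 1.0 - 1.0 / new_t
            if self.counts[value] == 0:
                del self.counts[value]
        self.threshold = new_t

    def estimate(self, value):
        """Frequency estimate from the slide: count in S + 0.418 * T."""
        if value not in self.counts:
            return 0.0
        return self.counts[value] + 0.418 * self.threshold
```

Because counts are only incremented on actual occurrences, a stored count never exceeds the true frequency; the 0.418*T term compensates for occurrences missed before the value entered the sample.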

Comparison of Hot List Algorithms

[Figure: chart comparing exact counts with the counting, concise, and traditional hot list algorithms; x-axis 1 to 24, y-axis up to 250000. Data: 500K values in [1,500], Zipf parameter = 1.5, footprint = 100]

New Stream Algorithms for

• Histograms

• Equi-Depth Histograms (Quantiles)

• Most popular items, V-Opt Histograms

• Wavelets

• Data Mining

• Stream Clustering (e.g. k-medians)

• Decision Trees

• Frequency moments, Lp Norms of two streams

• Relational DB operators

• Join size estimation

Papers in STOC, FOCS, SODA, SIGMOD, VLDB, PODS, etc

Outline

• Sliding window

Sliding Window

•Maintain the aggregate / statistic over a sliding window of the N most recent stream elements

• Motivation: Only the most recent data is important

Position: 1 2 … 20 21 22 23 24 25 26 27 28 29
Stream:   0 1 …  1  0  1  0  0  1  1  0  1  0

[Figure: two placements of a window of N = 10 elements over the stream; the number of 1’s is 5 in one and 4 in the other]

[Datar, Gionis, Indyk, Motwani, SODA’02]

Exponential Histograms [Datar et al, SODA ’02]

•Used to maintain the number of 1’s in sliding window, to within a relative error of ε, using O(1/ε * log^2 N) bits of storage

•Data structure invariant:

• Bucket sizes are non-decreasing powers of 2

• For every bucket size other than that of the last bucket, there are at least k and at most k+1 buckets of that size

• Example: k=2: (1,1,2,2,2,4,4,4,8,8,..)

Algorithm

Data structures:

• For each bucket: timestamp of most recent 1, size

• LAST: size of the last bucket

• TOTAL: Total size of the buckets

New element arrives at time t

1. If last bucket expired, update LAST and TOTAL

2. If (element == 1) Create new bucket with size 1; update TOTAL

3. Merge buckets if there are more than k+1 buckets of the same size; repeat as needed

4. Update LAST if changed

Example Run

1. If last bucket expired, update LAST and TOTAL

2. If (element == 1) Create new bucket with size 1; update TOTAL

3. Merge buckets if there are more than k+1 buckets of the same size; repeat as needed

4. Update LAST if changed

32,16,8,8,4,4,2,1,1 <- 1

32,16,8,8,4,4,2,2,1 <- 1

32,16,8,8,4,4,2,2,1,1 <- 1
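The four steps can be sketched in Python as follows. The class and method names are assumptions, buckets are kept newest-first in a plain list for brevity, and LAST is derived from the list instead of being stored. The run shown above is consistent with k = 1 (at most two buckets per size).

```python
class ExponentialHistogram:
    """Approximate count of 1's among the last `window` stream elements."""

    def __init__(self, window, k):
        self.window = window
        self.k = k
        self.buckets = []   # newest first: (timestamp of most recent 1, size)
        self.total = 0      # TOTAL
        self.time = 0

    @property
    def last(self):
        """LAST: size of the last (oldest) bucket, or 0 if there are no buckets."""
        return self.buckets[-1][1] if self.buckets else 0

    def add(self, bit):
        self.time += 1
        # Step 1: expire the last bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.total -= self.buckets.pop()[1]
        if bit != 1:
            return
        # Step 2: new bucket of size 1.
        self.buckets.insert(0, (self.time, 1))
        self.total += 1
        # Step 3: while some size has more than k+1 buckets, merge its two oldest.
        i = 0
        while i < len(self.buckets):
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.k + 1:
                break
            newer_ts = self.buckets[j - 2][0]   # merged bucket keeps the newer timestamp
            del self.buckets[j - 1]
            self.buckets[j - 2] = (newer_ts, 2 * size)
            i = j - 2                           # check the next size for a cascade

    def estimate(self):
        """Estimate from the slides: TOTAL - LAST/2."""
        return self.total - self.last / 2

    def sizes(self):
        """Bucket sizes, oldest first, as written in the example run."""
        return [size for _, size in reversed(self.buckets)]
```

Starting from 32,16,8,8,4,4,2,1,1 with k = 1, two arriving 1's reproduce the next two lines of the run, and a third triggers a cascade of merges that ends at 32,16,16,8,4,2,1.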

Approximation Guarantee

• Estimate: TOTAL – (LAST/2)

• For every bucket other than the last bucket, there are at least k and at most k+1 buckets of that size

• Example: k=2: (1,1,2,2,2,4,4,4,8,8,..)

• Implies:

• Case 1: S_i > S_(i-1) : S_i = 2^j, S_(i-1) = 2^(j-1)

S_(i-1) + … + S_2 + S_1 + 1 >= k*(1 + 2 + 4 + … + 2^(j-1)) + 1 = k*(2^j - 1) + 1, roughly k*2^j = k*S_i

• Case 2: S_i = S_(i-1) : S_i = 2^j, S_(i-1) = 2^j is similar

Expire a bucket only when its most recent 1 moves out of window

Thus actual number of 1’s is between TOTAL–LAST and TOTAL

Relative error is at most (LAST/2)/(TOTAL-LAST)

Distributed Streams Algorithms for Sliding Windows [Gibbons & Tirthapura, SPAA ’02]

•Number of 1’s, Union of distributed streams

• Lower Bound: Any deterministic approx scheme for error 1/64 requires Ω(N) space, even for 2 streams

• First randomized (ε,δ)-approx scheme for sliding window using only logarithmic memory words per stream

•Number of distinct values, Distributed streams, Sliding window

• First randomized (ε,δ)-approx scheme using only logarithmic memory words per stream

• O(log N log(1/δ) / ε^2) words per stream, O(log(1/δ)) expected update time

Outline

• Sliding window

Time-Decaying Aggregates

•Random Early Detection (RED)

• Weighted average of previous queue lengths used to determine what fraction of packets to discard

•Holding-time policies for ATM virtual circuits

• Time-decaying weighted averages of previous idle times used to determine which circuits to close

•Internet gateway selection products

• Time-decaying average of previous reliability measurements used in path selection

•Telecom usage patterns

Decay Functions

•Focused on the Decayed Count Problem:

• C(T) = Σ_{t < T s.t. v(t) = 1} g(T – t)

•Exponential: g(x) = exp(-λx), λ > 0

• C := v + exp(-λ) * C

•Sliding window of size W: g(x) = 1 for x <= W and otherwise g(x) = 0

•Polynomial: g(x) = (1/x)^α

•Also: Polyexponential, Chordal, Polygonal
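A minimal sketch of the decayed count, comparing direct evaluation with the one-register update for exponential decay. The function names are hypothetical, and as an assumption of this sketch the sum includes the current element (t = T, weight g(0)), which is what the update C := v + exp(-λ) * C computes.

```python
import math

def decayed_count(values, g):
    """Direct evaluation: sum of g(T - t) over times t with v(t) = 1, T = last index."""
    T = len(values) - 1
    return sum(g(T - t) for t, v in enumerate(values) if v == 1)

def exponential(lam):
    """Exponential decay function g(x) = exp(-lam * x)."""
    return lambda x: math.exp(-lam * x)

def sliding_window(W):
    """Sliding-window decay function: g(x) = 1 for x <= W, otherwise 0."""
    return lambda x: 1 if x <= W else 0

def exponential_decayed_count(values, lam):
    """Streaming evaluation for exponential decay using a single register C."""
    c = 0.0
    for v in values:
        c = v + math.exp(-lam) * c
    return c
```

The streaming version touches each element once and stores only C, which is why exponential decay is the cheapest case; other decay functions have no such one-register recurrence in general.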

Decay Functions

• N = min value after which the decay function nullifies

• Exponential: g(x) = exp(-λx), λ > 0

Θ(log N) bits

• Sliding window of size W: g(x) = 1 for x <= W, g(x) = 0 o.w.

Θ(log^2 N) bits [Datar et al, SODA ’02]

• Any decay function:

• O(log^2 N) bits

• Polynomial: g(x) = (1/x)^α

• O(log N loglog N) bits

• Polyexponential, Chordal, Polygonal:

Θ(log N) bits

Cascaded Exponential Histograms

•Exponential histogram for a sliding window of size W can produce an estimate for all windows of size ≤ W

•Linear combination of these estimates results in an estimate for ANY decay function g(x), that is within relative error ε

Example: g(0)=8, g(1)=5, g(2)=3, g(3)=2, & g(x)=0 for x > 3

Decayed count is 8*v(T) + 5*v(T-1) + 3*v(T-2) + 2*v(T-3)

= (8-5)*v(T) + (5-3)*v[T-1..T] + (3-2)*v[T-2..T] + 2*v[T-3..T]

Plug in sliding window estimate for each v[t..T] above
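The rewrite into window sums can be checked with a short sketch. Here exact window counts stand in for the sliding-window estimates, the decay function is given as a list g[0..m] (zero beyond it), and the function names are made up for the example.

```python
def window_count(values, T, width):
    """Exact number of 1's in v[T-width+1 .. T]; stands in for a sliding-window estimate."""
    lo = max(0, T - width + 1)
    return sum(values[lo:T + 1])

def decayed_count_direct(values, T, g):
    """Sum of g[T - t] * v(t) over the last len(g) positions ending at T."""
    return sum(g[T - t] * values[t] for t in range(max(0, T - len(g) + 1), T + 1))

def decayed_count_cascaded(values, T, g):
    """Same quantity as a linear combination of window counts."""
    total = 0
    for x in range(len(g)):
        next_weight = g[x + 1] if x + 1 < len(g) else 0
        # Window of x+1 elements gets weight g(x) - g(x+1).
        total += (g[x] - next_weight) * window_count(values, T, x + 1)
    return total
```

For v = 1,0,1,1,0,1,1 and T at the last position, the direct sum is 8 + 5 + 0 + 2 = 15 and the cascaded form is 3*1 + 2*2 + 1*2 + 2*3 = 15. When g is non-increasing all the weights are non-negative, so relative errors in the window estimates do not cancel badly.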

IrisLog uses Decaying Samples

•We have implemented a Host and Network Monitoring Service on IrisNet running on PlanetLab

• To support historical queries on the monitored data, we use multi-resolution vectors to store each monitored metric

• These vectors provide higher resolution samples of recent data than of older data

Currently, an ad hoc decay function is used

We have also explored an exponential-decay type sampling procedure that at each step,

• adds the current value to the sample, and

• discards a random item from the sample
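The exponential-decay sampling procedure can be sketched as a fixed-capacity sample. This is an illustration, not the IrisLog code: the class name and capacity parameter are assumptions, and the sketch chooses the discarded item among the older items so that the newest value is always retained.

```python
import random

class DecayingSample:
    """Fixed-size sample in which older items survive with decaying probability."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []                  # (timestamp, value), oldest first
        self.rng = random.Random(seed)
        self.time = 0

    def add(self, value):
        self.time += 1
        self.items.append((self.time, value))      # add the current value
        if len(self.items) > self.capacity:
            # Discard one of the older items, chosen uniformly at random.
            victim = self.rng.randrange(len(self.items) - 1)
            self.items.pop(victim)
```

Once the sample is full, each stored item survives a step with probability 1 - 1/capacity, so the chance that an item of age a is still present is (1 - 1/capacity)^a, roughly exp(-a/capacity).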
