getting the most out of your sample
DESCRIPTION
Getting the Most out of Your Sample. Edith Cohen Haim Kaplan. Tel Aviv University. Why data is sampled. Lots of Data: measurements, Web searches, tweets, downloads, locations, content, social networks, and all this both historic and current… - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/1.jpg)
Getting the Most out of Your Sample
Edith Cohen Haim Kaplan Tel Aviv University
![Page 2: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/2.jpg)
Why data is sampled• Lots of Data: measurements, Web
searches, tweets, downloads, locations, content, social networks, and all this both historic and current…
• To get value from data we need to be able to process queries over it
• But resources are constrained: Data too large to: transmit, store in full, to process even if stored in full…
![Page 3: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/3.jpg)
Random samples• A compact synopsis/summary of the
data. Easier to store, transmit, update, and manipulate
• Aggregate queries over the data can be approximately (and efficiently) answered from the sample
• Flexible: Same sample supports many types of queries (which do not have to be known in advance)
![Page 4: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/4.jpg)
Queries and estimatorsThe value of our sampled data hinges on the quality of our estimators• Estimation of some basic (sub)population
statistics is well understood (sample mean, variance,…)
• We need to better understand how to estimate other basic queries:– difference norms (anomaly/change detection) – distinct counts
![Page 5: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/5.jpg)
Our Aim We want to estimate sum aggregates over sampled data
1 2( ( ), ( ),....)h
f v h v h• Classic sampling schemes• Seek “variance optimal” estimators• Understand (and meet) limits of what
can be done
![Page 6: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/6.jpg)
Example: sensor measurements time sensor1 sensor27:00 3 4
7:01 13 5
7:02 8 12
7:03 4 13
7:04 2 9
7:05 1 37:06 10 5
7:07 6 14
7:08 13 11
7:09 12 13
7:10 6 7
7:11 20 21
7:12 10 10
Could be • radiation/pollution levels • sugar levels• temperature readings
![Page 7: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/7.jpg)
Sensor measurementstime sensor1 sensor2 max7:00 3 4 47:01 13 5 137:02 8 12 127:03 4 13 137:04 2 9 97:05 1 3 37:06 10 5 107:07 6 14 147:08 13 11 137:09 12 13 137:10 6 7 77:11 20 21 217:12 10 10 10
We are interested in the maximum reading in each timestamp
![Page 8: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/8.jpg)
Sensor measurements: Sum of maximums over selected subset
time sensor1 sensor2 max7:00 3 4 47:01 13 5 137:02 8 12 127:03 4 13 137:04 2 9 9
7:05 1 3 37:06 10 5 107:07 6 14 147:08 13 11 137:09 12 13 137:10 6 7 77:11 20 21 217:12 10 10 10
2
1
max( 1, 2)t
t
sensor sensor
![Page 9: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/9.jpg)
But we only have sampled datatime sensor1 sensor2 max
7:00 3 4 47:01 5 ?7:02 8 12 127:03 4 13 137:04 ?7:05 1 ?7:06 5 ?7:07 6 ?7:08 11 ?7:09 12 13 137:10 ?7:11 21 ?7:12 10 ?
![Page 10: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/10.jpg)
Example: distinct count of IP addresses
Interface1: Interface2:
Active IP addresses
132.169.1.1
216.115.108.245
……………
132.67.192.131
170.149.173.130
74.125.39.105
Active IP addresses
132.169.1.1
216.115.108.245
74.125.39.105
………………….
87.248.112.181
74.125.39.105
![Page 11: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/11.jpg)
Distinct Count Queries
Interface1 Interface2
Query: How many distinct IP addresses from a certain AS accessed the router through one of the interfaces during this time period ?
I1 Active IP addresses
132.169.1.1
216.115.108.245
……………
132.67.192.131
170.149.173.130
74.125.39.105
I2 Active IP addresses
132.169.1.1
216.115.108.245
74.125.39.105
………………….
87.248.112.181
74.125.39.105
Distinct in AS
132.169.1.1
132.67.192.131
![Page 12: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/12.jpg)
This is also a sum aggregate
( , i )nterinte frface1 ace2IPaddresses AS
OR
We do not have all IP addresses going through each interface but just a sample
![Page 13: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/13.jpg)
Example: Difference queries IP traffic between time periods
IP prefix kBytes 03/05/2012
132.169.1.0/24 2534875
216.115.0.0/16 4326
132.67.192.0/20 6783467
170.149.173.130 784632
74.125.39.105/10 2573580
IP prefix kBytes 03/06/2012
132.169.1.0/24 4679235
216.115.0.0/16 1243
74.125.39.105/32 462534
87.248.112.181/20 29865
74.125.39.105/10 4572456
day1 day2
![Page 14: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/14.jpg)
Difference QueriesQuery: Difference between day1 and day2 ∑
𝑖∈ 𝐼𝑃 𝑝𝑟𝑒𝑓𝑖𝑥𝑒𝑠𝑖𝑛 𝐴𝑆¿𝑘𝐵𝑖 (𝑑𝑎𝑦 1 )−𝑘𝐵𝑖 (𝑑𝑎𝑦 2 )∨¿𝑝¿
Query: Upward change from day1 to day2
∑𝑖∈ 𝐼𝑃 𝑝𝑟𝑒𝑓𝑖𝑥𝑒𝑠𝑖𝑛 𝐴𝑆
¿𝑘𝐵𝑖 (𝑑𝑎𝑦 1 )−𝑘𝐵𝑖 (𝑑𝑎𝑦 2 )∨¿+¿𝑝
¿¿
¿ ¿1/𝑝
¿ ¿1/𝑝
![Page 15: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/15.jpg)
Sum AggregatesWe want to estimate sum aggregates from samples
1 2( ( ), ( ),....)h
f v h v h
![Page 16: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/16.jpg)
Single-tuple “reduction”To estimate the sum we sum single-tuple estimates
1 2( ( ), ( ),....)h
f v h v h
∑h𝑓 (𝑣1 (h ) ,𝑣2 (h ) ,…)
Each tuple estimate has high variance, but relative error of sum of unbiased (pairwise) independent estimates decreases with aggregation
Almost WLOG since tuples are sampled (nearly) independently
![Page 17: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/17.jpg)
Estimator properties
• Unbiased• Nonnegative
We seek estimators that are:
• Pareto-optimal: No estimator has smaller variance on all data vectors (Tailor to data?)
• “Variance competitive”: Not far from minimum for all data
• Monotone: increases with information
eith
er
or b
oth
mus
t ha
ve
Sometimes:
![Page 18: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/18.jpg)
Sampling schemes• Weighted vs. Random Access (weight-oblivious)
– Sampling probability does/does not depend on value• Independent vs. Coordinated sampling
– In both cases sampling of an entry is independent of values of other entries.
– Independent: joint inclusion probability is product of individual inclusion probabilities.
– Coordinated: sharing random bits so that joint probability is higher
• Poisson / bottom-k sampling
Start with a warm-up: Random Access Sampling
![Page 19: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/19.jpg)
Random Access: Suppose each entry is sampled with probability p
time sensor1 sensor2 max
7:00 3 4 47:01 5 ?7:02 8 12 127:03 4 13 137:04 ?7:05 1 ?7:06 5 ?7:07 6 ?7:08 11 ?7:09 12 13 137:10 ?7:11 21 ?7:12 10 ?
2
4 12 13 13p
7:12
7:00
max( 1, 2)sensor sensorWe know the maximum only when both entries are sampled = probability p2
Tuple estimate: max/p2 if both sampled. 0 otherwise.
estimate
![Page 20: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/20.jpg)
It’s a sum of unbiased estimatorstime sensor1 sensor2 e(max)
7:00 3 4 4/p2
7:01 5 07:02 8 12 12/p2
7:03 4 13 13/p2
7:04 07:05 1 07:06 5 07:07 6 07:08 11 07:09 12 13 13/p2
7:10 07:11 21 07:12 10 0
1 22
max( , )v vp
This is the Horvitz-Thompson estimator:
• Nonnegative• Unbiased:2 21 2
1 22
max( , ) (1 )0 max( , )v vp p v vp
![Page 21: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/21.jpg)
The weakness which we address:time sensor1 sensor2 max
7:00 3 4 47:01 5 ?7:02 8 12 127:03 4 13 137:04 ?7:05 1 ?7:06 5 ?7:07 6 ?7:08 11 ?7:09 12 13 137:10 ?7:11 21 ?7:12 10 ?
Not optimal
Ignores a lot of information in the sample
![Page 22: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/22.jpg)
Doing the best for equal entriestime sensor1 sensor2 max
7:00 4 4 4
If none is sampled (1-p)2:
time sensor1 sensor2 e(max)
7:00 ? ? 0
If we sample at least one value 2p-p2: time sensor1 sensor2 e(max)
7:00 4 ? 4/(2p-p2)
time sensor1 sensor2 e(max)
7:00 ? 4 4/(2p-p2)
time sensor1 sensor2 e(max)
7:00 4 4 4/(2p-p2)
![Page 23: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/23.jpg)
What if entry values are different ?time sensor1 sensor2 max
7:00 3 4 4
time sensor1 sensor2 e(max)
7:00 3 ? 3/(2p-p2)
time sensor1 sensor2 e(max)
7:00 ? 4 4/(2p-p2)
If we sample both time sensor1 sensor2 e(max)
7:00 3 4 ?
Unbiasedness determines value
If none is sampled: time sensor1 sensor2 e(max)
7:00 ? ? 0
If we sample one value:
![Page 24: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/24.jpg)
What if entry values are different ?time sensor1 sensor2 e(max)
7:00 ? ? 0
time sensor1 sensor2 e(max)
7:00 3 ? 3/(2p-p2)
time sensor1 sensor2 e(max)
7:00 ? 4 4/(2p-p2)
time sensor1 sensor2 e(max)
7:00 3 4 X
22 2
3 4(1 ) (1 ) 42 2
p p p p pp p p
Xp
2
1 14 (4 3)2pX
p p
Unbiased:
X>0 → nonnegativeX> 4/(2p-p2) → monotone
![Page 25: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/25.jpg)
… The L estimatortime sensor1 sensor2 e(max)
7:00 ? ? 0
time sensor1 sensor2 e(max)
7:00 v1 ? v1/(2p-p2)
time sensor1 sensor2 e(max)
7:00 ? v2 v2/(2p-p2)
time sensor1 sensor2 e(max)
7:00 v1 v2 X 1 1 22
1 1 ( )2pX v v v
p p
• Nonnegative• Unbiased• Pareto optimal• Min possible variance for two equal
entries• Monotone (unique) and symmetric
![Page 26: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/26.jpg)
Back to our sum aggregatetime sensor1 sensor2 e(max)
7:00 3 47:01 57:02 8 127:03 4 137:04 07:05 17:06 57:07 6
7:08 117:09 12 137:10 07:11 217:12 10
2
1 14 (4 3)2p
p p
25 /(2 )p p
2
1 112 (12 8)2p
p p
2
1 113 (13 4)2p
p p
25/(2 )p p
21/(2 )p p
26 /(2 )p p
211/(2 )p p
2
1 113 (13 12)2p
p p
221/(2 )p p
210 /(2 )p p
7:12
7:00
max( 1, 2)sensor sensor
![Page 27: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/27.jpg)
The L estimatortime sensor1 sensor2 e(max)
7:00 ? ? 0
time sensor1 sensor2 e(max)
7:00 v1 ? v1/(2p-p2)
time sensor1 sensor2 e(max)
7:00 ? v2 v2/(2p-p2)
time sensor1 sensor2 e(max)
7:00 v1 v2 X 1 1 22
1 1 ( )2pX v v v
p p
• Nonnegative• Unbiased• Pareto optimal• Min possible variance for two equal
entries• Monotone (unique) and symmetric
![Page 28: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/28.jpg)
The U estimator
time sensor1 sensor2 e(max)
7:00 ? ? 0
time sensor1 sensor2 e(max)
7:00 v1 ? v1/(p(1+[1-2p]+)
time sensor1 sensor2 e(max)
7:00 ? v2 v2/(p(1+[1-2p]+)
time sensor1 sensor2 e(max)
7:00 v1 v2 (max(v1,v2)-(v1+v2)(1-p)/(1+[1-2p]+))/p2
Q: What if we want to optimize the estimator for sparse vectors ?U estimator: Nonnegative, Unbiased, symmetric, Pareto optimal, not monotone
![Page 29: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/29.jpg)
Variance ratio U/L vs. HTp=½
![Page 30: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/30.jpg)
Order-based variance optimality
An estimator is ‹-optimal if any estimator with lower variance on v has strictly higher variance on some z<v.
The L max estimator is <-optimal for (v,v) < (v,v-x)The U max estimator is <-optimal for (v,0) < (v,x)
We can construct unbiased <-optimal estimators for any other order <
Fine-grained tailoring of estimator to data patterns
![Page 31: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/31.jpg)
Weighted sampling: Estimating distinct count
Interface1: Interface2: IP addresses
132.169.1.1
216.115.108.245
132.67.192.131
170.149.173.130
74.125.39.105
IP addresses
132.169.1.1
216.115.108.245
74.125.39.105
87.248.112.181
74.125.39.105
Q: How many distinct IP addresses accessed the router through one of the interfaces ?
( , i )nterinte frface1 ace2IPaddresses AS
OR
![Page 32: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/32.jpg)
Sampling is “weighted”Interface1: Interface2:
If the IP address did not access the interface then we sample it with probability 0, otherwise with probability p
IP addresses
132.169.1.1
216.115.108.245
132.67.192.131
170.149.173.130
74.125.39.105
IP addresses
132.169.1.1
216.115.108.245
74.125.39.105
87.248.112.181
74.125.39.105
![Page 33: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/33.jpg)
→ No unbiased nonnegative estimator when p<½
Independent “weighted” sampling with probability p IP add Interface1 Interface2 e(OR)
132.169.1.1 ? ? 0
IP add Interface1 Interface2 e(OR)
132.169.1.1 1 ? 1/p
IP add Interface1 Interface2 e(OR)132.169.1.1 ? 1 1/p
IP add Interface1 Interface2 e(OR)132.169.1.1 1 1 x
p2 x + 2p(1-p)/p=1 → x=(2p-1)/ p2
To be unbiased for (1,1)
To be nonnegative for (0,0)
To be unbiased for (1,0)
To be unbiased for (0,1)
<0
OR estimation
![Page 34: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/34.jpg)
Negative result (??)Independent “weighted” sampling: There is no non-negative unbiased estimator for OR (related to M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya (PODS 2000), negative results for distinct element counts)
Same holds for other functions like the ℓ-th largest value (ℓ < d), and range (Lp difference)
Known seeds: Same sampling scheme but we make random bits “public:” can sometimes get information on values also when entry is not sampled (and get good estimators!)
![Page 35: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/35.jpg)
Estimate OR
How do we know if 132.67.192.131 did not occur in interface2 or was not sampled by interface2 ?
Sampled with probability p
Sample not empty with probability 2p-p2
132.67.192.131
Interface1 Interface2
132.67.192.131
![Page 36: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/36.jpg)
Known seeds
132.67.192.131
Interface2 samples active IP iff H2(132.67.192.131) < p
H2(132.67.192.131) < p 132.67.192.131 interface2
Interface1 Interface2
H2(132.67.192.131) > p We do not know
Interface1 samples active IP iff H1(132.67.192.131) < p
![Page 37: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/37.jpg)
HT estimator+known seeds
132.67.192.131
With probability p2, H1(132.67.192.131) < pand H2(132.67.192.131) < p.We know if 132.67.192.131 occurred in both interfaces. If sampled in either interface then HT estimate is 1/p2 .
Interface1 Interface2
H2(132.67.192.131) > p or H1(132.67.192.131) > p, the HT estimate is 0
We only use outcomes for which we know everything?
![Page 38: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/38.jpg)
Our L estimator:
132.67.192.131
Interface1 Interface2
?
• Nonnegative• Unbiased• Monotone (unique) and symmetric• Pareto optimal• Min possible variance for (1,1) (both
interfaces are accessed)
1𝑝2(2−𝑝)
1𝑝 (2−𝑝)
![Page 39: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/39.jpg)
Unbiased
22 2
1(1 12 (2 )) 1pp p p
pp p
p
For items in the intersection (minimum possible variance):
22
12 1( )2
p pp p
For items in one interface:
![Page 40: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/40.jpg)
OR (ind weighted+known seeds) variance of L, U, HT
![Page 41: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/41.jpg)
Independent Sampling + Known Seeds
• Method to construct unbiased <-optimal estimators. • Nonnegative either through explicit constraints
or smart selection of “<“ on vectors.
Results:
• L estimator: use an order < such that <-optimal estimator is nonnegative.• Maximum • Exponentiated range (sum aggregate is L )
Take home: use known seeds with your classic weighted sampling scheme
![Page 42: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/42.jpg)
Estimating max sum: Independent, pps, known seeds
# IP flows to dest IP address in two one-hour periods:
![Page 43: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/43.jpg)
Coordinated SamplingShared seeds coordinated sampling: Seeds are known and shared across instances: • More similar instances have more similar samples.• Allows for tighter estimates of some functions
(difference norms, distinct counts)
• Precise characterization of functions for which nonnegative and unbiased estimators exist. (Also, bounded, bounded variances)
Results:
• Construction of nonnegative order-optimal estimators
• L estimator: nonnegative, unbiased, monotone both variance+ optimal and 4-competitive (on all data, all functions ] at most 4 times minimum possible)
![Page 44: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/44.jpg)
Estimating L1 difference Independent / Coordinated, pps, known seeds destination IP addresses: #IP flows in two time periods
![Page 45: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/45.jpg)
Estimating Ldifference destination IP addresses: #IP flows in two time periods
Independent / Coordinated, pps, Known seeds
![Page 46: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/46.jpg)
Estimating Ldifference Surname occurrences in 2007, 2008 books (Google ngrams)
Independent / Coordinated, pps, Known seeds
![Page 47: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/47.jpg)
Estimating Ldifference Surname occurrences in 2007, 2008 books (Google ngrams)
Independent / Coordinated, pps, Known seeds
![Page 48: Getting the Most out of Your Sample](https://reader035.vdocuments.us/reader035/viewer/2022070500/5681684a550346895dde3c80/html5/thumbnails/48.jpg)
Conclusion• We present estimators for sum aggregates over
samples: unbiased, nonnegative, variance optimal– Tailoring estimator to data (<-optimality)– Classic sampling schemes:
independent/coordinated weighted/weight-oblivious• Perform well: tight estimates with small fraction
sampled– considerable improvement over HT– can estimate functions for which there were no
known unbiased nonnegative estimators, like L Open/future: independent sampling (weighted+known
seeds): precise characterization, derivation for >2 instances