![Page 1: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/1.jpg)
The Web Changes Everything: Understanding the Dynamics of Web Content
Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas
University of Washington, Microsoft Research, and Carnegie Mellon University
WSDM’09
![Page 2: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/2.jpg)
Who Cares About Web Change?
• Revisitation
– Monitoring
• Page Structure
– Fragility
• Dynamic language
– Search engine design
![Page 3: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/3.jpg)
Quantifying Change
• Dynamics of the Web is well researched
– Fetterly et al. (150 million pages): 65% stay the same
– Koehler et al. (5 years): stabilization
– Ntoulas et al. (turnover): 50% new content a year
– And many others (see the paper for a summary)
• But: eye towards systems issues
– Crawl rates, indexing, storage needs, etc.
– Always random samples
• What about the visited Web?
– Slow (every day at best)
![Page 4: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/4.jpg)
Outline
• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications
![Page 5: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/5.jpg)
![Page 6: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/6.jpg)
Behavior Driven Sampling
• Can we measure dynamics of the actually used Web?
• Usage Logs
– Live Toolbar
  • 600k from August of ‘06
  • Subset of total
![Page 7: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/7.jpg)
Sampling URLs
[Figure: sampling dimensions: inter-arrival time, unique users (popularity), visits per user]
468 (avg), 650 (med)
X 120 = 54788
All crawlable, min 2 users, 2 times
Full details: Adar et al., CHI08
![Page 8: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/8.jpg)
Behavior Driven Sampling
• Can we measure dynamics of the actually used Web?
• Usage Logs
– Live Toolbar
  • 600k from August of ‘06
  • Subset of total
– Sampled URLs
  • Around 55k (use the 40k that had revisits in May/June)
  • Crawled hourly (and sub-hourly) for a year
– May/June ’07
![Page 9: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/9.jpg)
URL Annotations
• Visitation properties
– Revisits, popularity, etc.
• Broad type
– News, Sports, Personal, Adult, etc.
• Structural location
– Top level page or deep within site?
![Page 10: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/10.jpg)
![Page 11: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/11.jpg)
Basic measures of change
[Figure: timeline showing Page version 1 and Page version 2]
• How long? (inter-version time)
• How much? Dice: $2\,|A \cap B| \,/\, (|A| + |B|)$
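The Dice measure on this slide can be sketched in a few lines. This is a minimal illustration, assuming A and B are the sets of terms in two page versions; the `terms` tokenizer and the sample strings are hypothetical and not taken from the talk.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: 2*|A ∩ B| / (|A| + |B|); 1.0 means identical sets."""
    if not a and not b:
        return 1.0  # assumption: two empty pages count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def terms(text: str) -> set:
    """Hypothetical tokenizer: lowercase, whitespace-separated words."""
    return set(text.lower().split())


v1 = terms("breaking news sports weather")
v2 = terms("breaking news sports markets")
similarity = dice(v1, v2)  # 3 shared terms, 4 terms in each version
```

A value near 1 means little changed between the two versions; a value near 0 means the versions share almost no terms.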
![Page 12: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/12.jpg)
66% displayed change in 5-week sample (every 123 hours on average)
Random web: 35% change after 11 weeks
![Page 13: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/13.jpg)
More visitors = faster change
[Figure: Average Inter-version Time by Page Popularity. Average hours to change as a function of number of visitors; x-axis: visitors (2, 3-6, 7-38, 39+); y-axis: hours (0 to 150)]
![Page 14: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/14.jpg)
More shallow (closer to homepage) = faster change
[Figure: Average Inter-version Time by Page “Depth”. Hours per change as a function of URL depth; x-axis: URL depth (5+, 4, 3, 2, 1, 0); y-axis: hours (0 to 250)]
![Page 15: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/15.jpg)
Change Plot by Type
[Figure: x-axis: mean inter-version time (hours), 0 to 250; y-axis: mean Dice coefficient, 0.5 to 0.95; labeled categories: Industry/Trade, Music, Adult, Personal Pages, Sports/Recreation, News/Magazine]
![Page 16: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/16.jpg)
Inter-Version Distribution
![Page 17: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/17.jpg)
Sub-hourly crawls
• Over 60% of pages displayed some change when crawled every 60 minutes.
• What is the “true” change rate of the page?
![Page 18: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/18.jpg)
Sub-hourly crawls
• Round-robin crawling
• 8 samples over 3 (week)days
• shifted by at least 4 hours
[Figure: controller with crawl stages: Original crawl 1, Original crawl 2, 2 minute delay, 16 minute delay, 32 minute delay, 60 minute delay]
![Page 19: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/19.jpg)
Range of Changes in Sub-hourly crawls
[Figure: x-axis: crawl delay (0, 2, 16, 32, 60 minutes); y-axis: pages (0 to 40,000); series: changed at least once, changed every sample; secondary axis: mean Dice (0.8 to 0.95); bar labels: 6%, 11%, 9%, 19%, 11%, 23%, 42%, 66%, 12%, 24%]
![Page 20: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/20.jpg)
Range of Changes in Sub-hourly crawls
[Figure: same chart as on the previous slide, shown with example page text:]
“623 Users Online”
“Page generated in .6 ms”
“Served to IP address…”
![Page 21: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/21.jpg)
![Page 22: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/22.jpg)
![Page 23: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/23.jpg)
Measuring change
[Figure: change curve over versions t0 to t5; x-axis: Time (hours); y-axis: Dice]
• Pages are equally (dis)similar
• Similarity based on
– navigation elements
– base language model
![Page 24: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/24.jpg)
Two Segment Model
• 2 segment (linear) – hockey stick
[Figure labels: knot point; dynamic versus static; steady state]
Time at which proportion of dynamic to static remains constant
![Page 25: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/25.jpg)
Calculating the Knot Point
• Optimization problem
Knot point
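One way to read "optimization problem" here is as a search for the split that lets two straight lines fit the change curve best. The sketch below is an assumption about how that could be done, not the authors' procedure: it tries each interior sample as the knot, fits an independent least-squares line to each side, and keeps the split with the lowest total squared error. The function names and the synthetic data are hypothetical.

```python
def fit_line_sse(xs, ys):
    """Least-squares line through (xs, ys); returns the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def find_knot(xs, ys):
    """Try every interior point as the knot; keep the split with the lowest total error."""
    best_x, best_err = None, float("inf")
    for k in range(1, len(xs) - 1):
        # Both segments include the candidate knot point k.
        err = fit_line_sse(xs[:k + 1], ys[:k + 1]) + fit_line_sse(xs[k:], ys[k:])
        if err < best_err:
            best_x, best_err = xs[k], err
    return best_x


# Synthetic hockey stick: similarity falls until hour 10, then stays flat.
hours = list(range(21))
dice_values = [1 - 0.05 * h if h <= 10 else 0.5 for h in hours]
knot = find_knot(hours, dice_values)
```

The two segments are fitted independently, so they are not forced to meet at the knot; a constrained fit would be a reasonable refinement for noisy curves.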
![Page 26: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/26.jpg)
![Page 27: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/27.jpg)
![Page 28: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/28.jpg)
![Page 29: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/29.jpg)
Types of Change Curves
• 3 main types
– Knotted (two-segment)
– Sloped
– Unchanging
• Automatic classification (93% accuracy*)
– 70% are knotted
  • 145 hours mean, 92 median
– 28% sloped
– 2% unchanging (flat)
*Consistent with the proportions of hand labeled data
![Page 30: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/30.jpg)
Change curves
Different stable segments mean different ratios of dynamic to stable content
http://www.nytimes.com http://www.allrecipes.com
![Page 31: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/31.jpg)
Change curves
[Figure: change curves for Craigslist, Anchorage, AK (AK) and Craigslist, Los Angeles, CA (LA); x-axis: hours (10 to 40); y-axis: dice (.4 to 1)]
![Page 32: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/32.jpg)
![Page 33: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/33.jpg)
![Page 34: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/34.jpg)
Nature of the Text
What terms vanish here?
Or are still here?
![Page 35: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/35.jpg)
Term Longevity Plot
[Figure: term longevity plot; x-axis: Time (Sep., Oct., Nov., Dec.)]
![Page 36: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/36.jpg)
Term Longevity Plot
• Term level representation of change curve
– Pick a vertical (t0)
– Compare overlap of terms to next vertical
![Page 37: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/37.jpg)
Features of Terms
• Divergence
– Which terms distinguish current document from the collection (at a point in time)
– $\mathrm{Div}(w, D) = P(w \mid D)\,\log\dfrac{P(w \mid D)}{P(w \mid C)}$
• Staying power (σ)
– Likelihood of observing a word (w) at two different times, t and t+α in document D
– $\sigma(w, D) \approx P(t)\,P(\alpha)\,P(w \mid D_t, D_{t+\alpha})$
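The divergence score on this slide, P(w|D) times the log of P(w|D) over P(w|C), can be sketched as follows. This is an illustration under stated assumptions: probabilities are plain relative frequencies, the collection C includes the document D so that P(w|C) is never zero, and the toy term lists are hypothetical.

```python
import math
from collections import Counter


def divergence(doc_terms, collection_terms):
    """Div(w, D) = P(w|D) * log(P(w|D) / P(w|C)) for each word w in D."""
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    doc_total = sum(doc_counts.values())
    coll_total = sum(coll_counts.values())
    scores = {}
    for word, count in doc_counts.items():
        p_doc = count / doc_total
        # Assumption: the collection includes the document, so p_coll > 0.
        p_coll = coll_counts[word] / coll_total
        scores[word] = p_doc * math.log(p_doc / p_coll)
    return scores


doc = ["pork", "pork", "home"]
collection = doc + ["home", "home", "search"]
scores = divergence(doc, collection)
```

In this toy input "pork" is more frequent in the document than in the collection and scores positive, while "home" is relatively more common in the collection and scores negative.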
![Page 38: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/38.jpg)
Distribution of terms by staying power (σ)
Low staying power (allrecipes.com)
High Div.: bbq, salads, sandwiches, pork, cheese, cool
High staying power (allrecipes.com)
cooks, cookbooks, ingredient, desserts, home, you, search…
Axis labels: High Div., Low Div.
![Page 39: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/39.jpg)
![Page 40: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/40.jpg)
![Page 41: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/41.jpg)
DOM Level Changes
• How long does structure hold?
• Applications with assumed stability
– Programming by Demonstration (PbD)
– Mashups
– Scrapers, etc.
[UIST08] Adar et al., “Zoetrope: Interacting with the Ephemeral Web”
DOM Structure
![Page 42: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/42.jpg)
Tree Isomorphism
• The “general” approach:
– Compare the DOM structure of 2 trees
– Produces alignment, edit distances, etc. [Grandi’04]
– But: somewhat inefficient for large scale
• We want:
– A method for comparing many (1000s) of versions of the same page at the same time
![Page 43: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/43.jpg)
The Idea
• Serialize each DOM structure
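[Figure: DOM tree for `<a>foo <b>bar</b></a>` at time = 0: node a with children “foo” and b; b with child “bar”]

| full path | type path | node hash | subtree hash | version |
|---|---|---|---|---|
| /a[0]/b[0] | (/a/b)[0] | H(“bar”) | H(“bar”) | 0 |
| /a[0] | (/a)[0] | H(“foo”) | H(“foo bar”) | 0 |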
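The serialization on this slide can be sketched with the standard library. This is a minimal illustration, not the authors' implementation: it assumes well-formed markup that `xml.etree.ElementTree` can parse, uses a truncated SHA-1 as a stand-in for H(), and numbers the type path by order of appearance. The helper names `h` and `serialize` are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import Counter


def h(text):
    """Stand-in for the slide's H(): truncated SHA-1 of whitespace-normalized text."""
    return hashlib.sha1(" ".join(text.split()).encode("utf-8")).hexdigest()[:8]


def serialize(markup, version):
    """Flatten a DOM tree into (full path, type path, node hash, subtree hash, version) rows."""
    rows, type_seen = [], Counter()

    def visit(el, full_prefix, type_prefix, index):
        full_path = f"{full_prefix}/{el.tag}[{index}]"
        type_path = f"{type_prefix}/{el.tag}"
        # Text directly inside this element, excluding text of descendants.
        own_text = (el.text or "") + "".join(child.tail or "" for child in el)
        subtree_text = "".join(el.itertext())
        type_index = type_seen[type_path]
        type_seen[type_path] += 1
        rows.append((full_path, f"({type_path})[{type_index}]",
                     h(own_text), h(subtree_text), version))
        sibling_counts = Counter()
        for child in el:
            visit(child, full_path, type_path, sibling_counts[child.tag])
            sibling_counts[child.tag] += 1

    visit(ET.fromstring(markup), "", "", 0)
    return rows


rows = serialize("<a>foo <b>bar</b></a>", version=0)
```

Each page version contributes one row per element, so rows from many versions can be appended to a single table and compared with ordinary sort and group operations.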
![Page 44: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/44.jpg)
The Idea
• Serialize each DOM structure
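[Figure: DOM tree for `<a>foo <b>jar</b></a>` at time = 1: node a with children “foo” and b; b with child “jar”]

| full path | type path | node hash | subtree hash | version |
|---|---|---|---|---|
| /a[0]/b[0] | (/a/b)[0] | H(“bar”) | H(“bar”) | 0 |
| /a[0] | (/a)[0] | H(“foo”) | H(“foo bar”) | 0 |
| /a[0]/b[0] | (/a/b)[0] | H(“jar”) | H(“jar”) | 1 |
| /a[0] | (/a)[0] | H(“foo”) | H(“foo jar”) | 1 |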
![Page 45: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/45.jpg)
The Idea
• Serialize each DOM structure
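[Figure: DOM tree for `<a><b>jar</b></a>` at time = 2: node a with child b; b with child “jar”]

| full path | type path | node hash | subtree hash | version |
|---|---|---|---|---|
| /a[0]/b[0] | (/a/b)[0] | H(“bar”) | H(“bar”) | 0 |
| /a[0] | (/a)[0] | H(“foo”) | H(“foo bar”) | 0 |
| /a[0]/b[0] | (/a/b)[0] | H(“jar”) | H(“jar”) | 1 |
| /a[0] | (/a)[0] | H(“foo”) | H(“foo jar”) | 1 |
| /a[0]/b[0] | (/a/b)[0] | H(“jar”) | H(“jar”) | 2 |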
![Page 46: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/46.jpg)
Operators on Serialized Data
• sort(columns)
– Sorts by the variables
• reduce(columns)
– Generates a set of sets
• Look familiar?
![Page 47: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/47.jpg)
sort(full_path, version)
S = reduce(full_path)
foreach s in S: calculate the difference between the minimum version id and last reported id

Input rows (full path … version):
/a[0]/b[0] … 0
/a[0] … 0
/a[0]/b[0] … 1
/a[0] … 1
/a[0]/b[0] … 2

After sort(full_path, version):
/a[0]/b[0] … 0
/a[0]/b[0] … 1
/a[0]/b[0] … 2
/a[0] … 0
/a[0] … 1

After reduce(full_path), with the version difference for each set:
/a[0]/b[0] … 0, 1, 2 → 2
/a[0] … 0, 1 → 1
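The sort and reduce steps on this slide map directly onto a sort followed by a group-by. The sketch below is one possible rendering, assuming rows are reduced to (full_path, version) pairs with the hash columns omitted; the function name `survival` is hypothetical.

```python
from itertools import groupby

# (full_path, version) pairs matching the rows on the slide; other columns omitted.
rows = [
    ("/a[0]/b[0]", 0), ("/a[0]", 0),
    ("/a[0]/b[0]", 1), ("/a[0]", 1),
    ("/a[0]/b[0]", 2),
]


def survival(rows):
    """Sort by (full_path, version), group by full_path, report last minus first version."""
    ordered = sorted(rows)  # tuples sort by full_path, then version
    result = {}
    for path, group in groupby(ordered, key=lambda row: row[0]):
        versions = [version for _, version in group]
        result[path] = versions[-1] - versions[0]
    return result


lifetimes = survival(rows)
```

Because the work is a sort plus an independent computation per group, the same pattern spreads naturally across many pages and machines.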
![Page 48: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/48.jpg)
Structure Survival Over Time
[Figure: structure survival over time; x-axis: 2 hour, 1 day, 1 week, 2 weeks, 4 weeks, 5 weeks; y-axis: 0% to 100%]
Smaller dataset ([UIST’08]) shows that mean survival after a year is only 23%
![Page 49: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/49.jpg)
Frequencies and Motion
• Frequency of change of DOM elements
![Page 50: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/50.jpg)
Frequencies and Motion
• Frequency of change of DOM elements
• Motion of elements on a page
– Can we predict the motion of a page element?
![Page 51: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/51.jpg)
![Page 52: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/52.jpg)
![Page 53: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/53.jpg)
Revisitation and Change (to appear [CHI’09])
[Figure: plots over Time for NYT, Woot, and Costco; y-axis: 0 to 1.2]
[CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”
• NYT: implying interest in the newest news
• Woot: implying interest in the newest deal (once every 24 hours)
• Costco: implying interest in the stable (slow changing) content
![Page 54: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/54.jpg)
[CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”
Inferred intent
• Filter content by removing content changing faster or slower than peak revisitation
![Page 55: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/55.jpg)
Improve Search Over Time
• Changing content
– Likelihood that document still contains term
– Improved ranking
• Different term weights based on page dynamics
• Differential snippet generation
– Take into account survivability of terms
![Page 56: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/56.jpg)
Summary
• Behavior driven sampling gives us different statistics than what we’ve seen
• Fine grained analyses (time, terms, DOM)
• Data is useful for modeling and simulation (today)
• The techniques are useful going forward
– Change curves and the hockey-stick model
– Language models and term longevity plots
– Comparison of many DOM structures simultaneously
![Page 57: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/57.jpg)
Thanks!
?
![Page 58: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/58.jpg)
Sampling URLs
[Figure: visits to a URL over time; dimensions: inter-arrival time, unique users (popularity), visits per user]
User 1 = 10 times
User 2 = 12 times
…
523 unique users
Full details: Adar et al., CHI08
![Page 59: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/59.jpg)
Sampling URLs
Inter-arrival time
Unique Users (popularity)
Visits Per User
Full details: Adar et al., CHI08
![Page 60: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/60.jpg)
Change curve implementation
• Which t0 do we pick?
• Any specific t0 results in different (noisy) curve
– Solution: pick multiple t0 at random, calculate average change curve
[Figure: x-axis: Log minutes]
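The averaging idea on this slide can be sketched as below. This is an illustration under assumptions: each page version is represented as a set of terms, versions are equally spaced in time, and the names `average_change_curve`, `max_lag`, `samples`, and `seed` are hypothetical.

```python
import random


def dice(a, b):
    """Dice coefficient between two term sets."""
    if not a and not b:
        return 1.0  # assumption: two empty versions count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def average_change_curve(versions, max_lag, samples, seed=0):
    """Average Dice(versions[t0], versions[t0 + lag]) over randomly chosen t0 values."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(versions) - max_lag) for _ in range(samples)]
    curve = []
    for lag in range(1, max_lag + 1):
        total = sum(dice(versions[t0], versions[t0 + lag]) for t0 in starts)
        curve.append(total / samples)
    return curve


# Toy page: "nav" and "home" are static, one headline term changes every snapshot.
versions = [{"nav", "home", f"headline{i}"} for i in range(50)]
curve = average_change_curve(versions, max_lag=5, samples=10)
```

Averaging over several starting points smooths out the noise that any single t0 introduces, at the cost of computing more pairwise comparisons.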
![Page 61: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/61.jpg)
Operators on Serialized Data
• Combining sorts/reduces and intermediate outputs can answer many questions
• Can’t identify “changes” in text – shingles can help
– Can identify when a hash “vanishes” and a new one appears
  • Test only on those
![Page 62: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/62.jpg)
Overall Distributions (Dice)
![Page 63: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/63.jpg)
Change Plot for a Page
![Page 64: The Web Changes Everything: Understanding the Dynamics of Web Content](https://reader036.vdocuments.us/reader036/viewer/2022062310/56816413550346895dd5c1c7/html5/thumbnails/64.jpg)
Change Plot by Domain
[Figure: x-axis: mean inter-version time, 0 to 200; y-axis: mean Dice coefficient, 0.74 to 0.9; labeled domains: .gov, .edu, .com, .net, .org]