The Web Changes Everything: Understanding the Dynamics of Web Content
Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas
University of Washington, Microsoft Research, and Carnegie Mellon University
WSDM’09
Who Cares About Web Change?
• Revisitation
– Monitoring
• Page Structure
– Fragility
• Dynamic language
– Search engine design
Quantifying Change
• Web dynamics are well researched
– Fetterly et al. (150 million pages): 65% stay the same
– Koehler et al. (5 years): stabilization
– Ntoulas et al. (turnover): 50% new content a year
– And many others (see the paper for a summary)
• But: eye towards systems issues
– Crawl rates, indexing, storage needs, etc.
– Always random samples
• What about the visited Web?
– Slow (every day at best)
Outline
• Behavior-Driven Sampling & Crawling
• Measuring change
– Basic change behavior
– Page evolution
– Text changes
– DOM level changes
• Applications
Behavior Driven Sampling
• Can we measure dynamics of the actually used Web?
• Usage Logs
– Live Toolbar
  • 600k from August of ’06
  • Subset of total
Sampling URLs
[Figure: URLs sampled along three dimensions: inter-arrival time, unique users (popularity), and visits per user; annotations: 468 (avg), 650 (med); X 120 = 54788; all crawlable, min 2 users, 2 times]
Full details: Adar et al., CHI08
Behavior Driven Sampling
• Usage Logs
– Sampled URLs
  • Around 55k (use the 40k that had revisits in May/June)
  • Crawled hourly (and sub-hourly) for a year
    – May/June ’07
URL Annotations
• Visitation properties
– Revisits, popularity, etc.
• Broad type
– News, Sports, Personal, Adult, etc.
• Structural location
– Top level page or deep within site?
Basic measures of change
[Figure: timeline showing page version 1 followed by page version 2]
• How long? (inter-version time)
• How much?
– Dice: 2*|A ∩ B| / (|A| + |B|)
• 66% displayed change in 5 week sample (every 123 hours on average)
• Random web: 35% change after 11 weeks
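The Dice measure above can be sketched in a few lines. This is a minimal illustration, assuming each page version is represented as a set of words; the function name `dice` and the two sample versions are made up for the example and do not come from the talk.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) between two sets."""
    if not a and not b:
        # Assumption: two empty versions count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical page versions, tokenized on whitespace (an assumption).
v1 = set("breaking news about the web".split())
v2 = set("older news about the web".split())

similarity = dice(v1, v2)
print(similarity)
```

A value of 1 means the two versions share all of their terms, and a value of 0 means they share none.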
More visitors = faster change
[Figure: Average Inter-version Time by Page Popularity. Average hours to change as a function of # of visitors; x-axis: visitors (2, 3-6, 7-38, 39+); y-axis: hours (0 to 150)]
More shallow (closer to homepage) = faster change
[Figure: Average Inter-version Time by Page “Depth”. Hours per change as a function of URL depth; x-axis: URL depth (5+, 4, 3, 2, 1, 0); y-axis: hours (0 to 250)]
Change Plot by Type
[Figure: mean Dice coefficient (0.5 to 0.95) versus mean inter-version time (0 to 250 hours), by page type: Industry/Trade, Music, Adult, Personal Pages, Sports/Recreation, News/Magazine]
Inter-Version Distribution
Sub-hourly crawls
• Over 60% of pages displayed some change when crawled every 60 minutes.
• What is the “true” change rate of the page?
Sub-hourly crawls
• Round-robin crawling
• 8 samples over 3 (week)days
• Shifted by at least 4 hours
[Figure: crawl schedule with a controller, original crawl 1, original crawl 2, and crawls at 2 minute, 16 minute, 32 minute, and 60 minute delays]
Range of Changes in Sub-hourly crawls
[Figure: number of pages (0 to 40,000) that changed at least once versus pages that changed on every sample, at delays of 0, 2, 16, 32, and 60 minutes, with mean Dice (0.8 to 0.95)]
“623 Users Online”
“Page generated in .6 ms”
“Served to IP address…”
Measuring change
[Figure: page versions at t0, t1, t2, t3, t4, t5; change curve of Dice versus time (hours)]
• Pages are equally (dis)similar
• Similarity based on
– navigation elements
– base language model
Two Segment Model
• 2 segment (linear) – hockey stick
[Figure: change curve annotated with the knot point, dynamic versus static content, and the steady state]
Time at which proportion of dynamic to static remains constant
Calculating the Knot Point
• Optimization problem
Knot point
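One way to read "optimization problem" is as a search for the split that lets two straight lines fit the change curve best. The sketch below is only an illustration under that assumption: it tries every candidate split, fits a least-squares line to each side, and keeps the split with the smallest total squared error. The names `find_knot` and `_line_sse`, the `min_points` parameter, and the sample curve are hypothetical, and the talk does not say which objective or solver was actually used.

```python
from typing import List, Tuple


def _line_sse(xs: List[float], ys: List[float]) -> float:
    """Sum of squared errors of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def find_knot(hours: List[float], dice: List[float],
              min_points: int = 2) -> Tuple[float, float]:
    """Return (knot_hour, total_sse) for the best two-segment split.

    The knot is reported as the first time point of the second segment
    (a convention chosen for this sketch).
    """
    best = None
    for k in range(min_points, len(hours) - min_points + 1):
        sse = _line_sse(hours[:k], dice[:k]) + _line_sse(hours[k:], dice[k:])
        if best is None or sse < best[1]:
            best = (hours[k], sse)
    if best is None:
        raise ValueError("not enough points for two segments")
    return best


# Made-up hockey-stick curve: steep drop, then a nearly flat tail.
hours = list(range(10))
curve = [1.0, 0.9, 0.8, 0.7, 0.62, 0.61, 0.6, 0.59, 0.58, 0.57]
knot_hour, error = find_knot(hours, curve)
print(knot_hour)
```

The exhaustive search costs one pair of line fits per candidate split, which is affordable for curves with a few hundred points.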
Types of Change Curves
• 3 main types
– Knotted (two-segment)
– Sloped
– Unchanging
• Automatic classification (93% accuracy*)
– 70% are knotted
  • 145 hours mean, 92 median
– 28% sloped
– 2% unchanging (flat)
*Consistent with the proportions of hand labeled data
Change curves
Different stable segment = different ratios of dynamic to stable content
[Figure: change curves for http://www.nytimes.com and http://www.allrecipes.com]
Change curves
[Figure: change curves (Dice from .4 to 1, hours from 10 to 40) for Craigslist, Anchorage, AK (AK) and Craigslist, Los Angeles, CA (LA)]
Nature of the Text
What terms vanish here?
Or are still here?
Term Longevity Plot
[Figure: term longevity plot; x-axis: time (Sep., Oct., Nov., Dec.)]
• Term level representation of change curve
– Pick a vertical (t0)
– Compare overlap of terms to next vertical
Features of Terms
• Divergence
– Which terms distinguish current document from the collection (at a point in time)
– Div(w, D) = P(w|D) log( P(w|D) / P(w|C) )
• Staying power (σ)
– Likelihood of observing a word (w) at two different times, t and t+α in document D
– σ(w, D) ≈ P(t) P(α) P(w | D_t, D_{t+α})
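The two term features can be illustrated with a small sketch. Several things here are assumptions rather than details from the talk: P(w|D) and P(w|C) are plain relative frequencies, P(t) and P(α) are taken as uniform over all pairs of crawl times, and P(w | D_t, D_{t+α}) is treated as 1 when the word appears in both versions and 0 otherwise. The function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set


def divergence(word: str, doc_tokens: List[str],
               collection_tokens: List[str]) -> float:
    """Div(w, D) = P(w|D) * log(P(w|D) / P(w|C)) with relative frequencies."""
    p_doc = Counter(doc_tokens)[word] / len(doc_tokens)
    p_col = Counter(collection_tokens)[word] / len(collection_tokens)
    if p_doc == 0:
        return 0.0  # a word absent from the document contributes nothing
    if p_col == 0:
        raise ValueError("word must occur in the collection")
    return p_doc * math.log(p_doc / p_col)


def staying_power(word: str, versions: List[Set[str]]) -> float:
    """Fraction of (t, t+alpha) version pairs that both contain the word.

    Assumes uniform P(t) and P(alpha) and a 0/1 indicator for
    P(w | D_t, D_{t+alpha}); this is a simplification for illustration.
    """
    pairs = [(t, t + a)
             for t in range(len(versions))
             for a in range(1, len(versions) - t)]
    if not pairs:
        raise ValueError("need at least two versions")
    hits = sum(1 for i, j in pairs
               if word in versions[i] and word in versions[j])
    return hits / len(pairs)


# Made-up data: "home" persists across crawls, "bbq" is seasonal.
doc = ["bbq", "bbq", "home", "home"]
collection = ["home"] * 6 + ["bbq"] * 2
versions = [{"home", "bbq"}, {"home"}, {"home", "pork"}]

print(divergence("bbq", doc, collection), staying_power("bbq", versions))
print(divergence("home", doc, collection), staying_power("home", versions))
```

In this toy data "bbq" has positive divergence but no staying power, while "home" stays on every version yet does not distinguish the document from the collection.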
Distribution of terms by staying power (σ)
• Low staying power (allrecipes.com)
– High Div.: bbq, salads, sandwiches, pork, cheese, cool
• High staying power (allrecipes.com)
– cooks, cookbooks, ingredient, desserts, home, you, search… (split in the figure into High Div. and Low Div. groups)
DOM Level Changes
• How long does structure hold?
• Applications with assumed stability
– Programming by Demonstration (PbD)
– Mashups
– Scrapers, etc.
[UIST08] Adar et al., “Zoetrope: Interacting with the Ephemeral Web”
DOM Structure
Tree Isomorphism
• The “general” approach:
– Compare the DOM structure of 2 trees
– Produces alignment, edit distances, etc. [Grandi’04]
– But: somewhat inefficient for large scale
• We want:
– A method for comparing many (1000s of) versions of the same page at the same time
The Idea
• Serialize each DOM structure

@time = 0: <a>foo <b>bar</b></a>
@time = 1: <a>foo <b>jar</b></a>
@time = 2: <a><b>jar</b></a>

full path     type path    node hash    subtree hash    version
/a[0]/b[0]    (/a/b)[0]    H(“bar”)     H(“bar”)        0
/a[0]         (/a)[0]      H(“foo”)     H(“foo bar”)    0
/a[0]/b[0]    (/a/b)[0]    H(“jar”)     H(“jar”)        1
/a[0]         (/a)[0]      H(“foo”)     H(“foo jar”)    1
/a[0]/b[0]    (/a/b)[0]    H(“jar”)     H(“jar”)        2
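The serialization step can be sketched with Python's standard XML parser. This is an illustration, not the authors' implementation: the helper names `H` and `serialize` are hypothetical, the hash is a truncated SHA-1 chosen for the example, indices in the full path count same-tag siblings, indices in the type path count earlier nodes with the same tag path, and the sketch emits one row for every element. Real pages would need an HTML parser rather than a strict XML one.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List, Tuple

Row = Tuple[str, str, str, str, int]


def H(text: str) -> str:
    """Short content hash (truncated SHA-1, an arbitrary choice here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def serialize(markup: str, version: int) -> List[Row]:
    """Flatten a DOM tree into rows of
    (full path, type path, node hash, subtree hash, version)."""
    root = ET.fromstring(markup)
    rows: List[Row] = []
    type_counts: Counter = Counter()

    def visit(elem: ET.Element, full_path: str, type_path: str) -> None:
        # Text owned by this node: its leading text plus text after children.
        own_parts = [elem.text or ""] + [child.tail or "" for child in elem]
        own_text = " ".join(" ".join(own_parts).split())
        # All text in the subtree, with whitespace normalized.
        subtree_text = " ".join("".join(elem.itertext()).split())
        type_index = type_counts[type_path]
        type_counts[type_path] += 1
        rows.append((full_path, f"({type_path})[{type_index}]",
                     H(own_text), H(subtree_text), version))
        sibling_counts: Counter = Counter()
        for child in elem:
            index = sibling_counts[child.tag]
            sibling_counts[child.tag] += 1
            visit(child, f"{full_path}/{child.tag}[{index}]",
                  f"{type_path}/{child.tag}")

    visit(root, f"/{root.tag}[0]", f"/{root.tag}")
    return rows


rows = serialize("<a>foo <b>bar</b></a>", 0)
for row in rows:
    print(row)
```

Because each row carries its version number, rows from many crawls of the same page can be appended to one flat file and processed together.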
Operators on Serialized Data
• sort(columns)
– Sorts by the variables
• reduce(columns)
– Generates a set of sets
• Look familiar?

sort(full_path,version)
S = reduce(full_path)
foreach s in S: calculate the difference between the minimum version id and last reported id

Serialized rows:
/a[0]/b[0] … 0
/a[0] … 0
/a[0]/b[0] … 1
/a[0] … 1
/a[0]/b[0] … 2

After sort(full_path,version):
/a[0]/b[0] … 0
/a[0]/b[0] … 1
/a[0]/b[0] … 2
/a[0] … 0
/a[0] … 1

After reduce(full_path), with the computed difference:
/a[0]/b[0] … 0, 1, 2 → 2
/a[0] … 0, 1 → 1
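The sort and reduce steps above map naturally onto sorting followed by grouping. The sketch below is a hedged illustration using the example rows from the slide, reduced to (full path, version) pairs; the function name `lifetimes` is hypothetical, and Python's default string ordering places `/a[0]` before `/a[0]/b[0]`, which differs from the order drawn on the slide but does not affect the result.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List, Tuple

# (full path, version) pairs taken from the slide's example.
rows: List[Tuple[str, int]] = [
    ("/a[0]/b[0]", 0),
    ("/a[0]", 0),
    ("/a[0]/b[0]", 1),
    ("/a[0]", 1),
    ("/a[0]/b[0]", 2),
]


def lifetimes(rows: List[Tuple[str, int]]) -> Dict[str, int]:
    """Difference between the first and last version seen for each path."""
    ordered = sorted(rows, key=itemgetter(0, 1))              # sort(full_path, version)
    result: Dict[str, int] = {}
    for path, group in groupby(ordered, key=itemgetter(0)):   # reduce(full_path)
        versions = [version for _, version in group]
        result[path] = versions[-1] - versions[0]
    return result


survival = lifetimes(rows)
print(survival)
```

Sorting by other columns, such as the subtree hash, would group rows by content instead of position, which is one way the same operators can answer different questions.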
Structure Survival Over Time
[Figure: percentage of structure surviving (0% to 100%) after 2 hour, 1 day, 1 week, 2 weeks, 4 weeks, and 5 weeks]
Smaller dataset ([UIST’08]) shows that mean survival after a year is only 23%
Frequencies and Motion
• Frequency of change of DOM elements
• Motion of elements on a page
– Can we predict the motion of a page element?
Revisitation and Change (to appear [CHI’09])
[Figure: four panels of revisitation curves over time (y-axis 0 to 1.2) for NYT, Woot, and Costco]
• Implying interest in the newest news
• Implying interest in the newest deal (once every 24 hours)
• Implying interest in the stable (slow changing) content
[CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”
Inferred intent
• Filter content by removing content changing faster or slower than peak revisitation
Improve Search Over Time
• Changing content
– Likelihood that document still contains term
– Improved ranking
• Different term weights based on page dynamics
• Differential snippet generation
– Take into account survivability of terms
Summary
• Behavior driven sampling gives us different statistics than what we’ve seen
• Fine grained analyses (time, terms, DOM)
• Data is useful for modeling and simulation (today)
• The techniques are useful going forward
– Change curves and the hockey-stick model
– Language models and term longevity plots
– Comparison of many DOM structures simultaneously
Thanks!
?
Sampling URLs
[Figure: visit timeline for a URL illustrating inter-arrival time, unique users (popularity), and visits per user; example labels: User 1 = 10 times, User 2 = 12 times, …, 523 unique users]
Full details: Adar et al., CHI08
Change curve implementation
• Which t0 do we pick?
• Any specific t0 results in different (noisy) curve
– Solution: pick multiple t0 at random, calculate average change curve
[Figure: change curve; x-axis: log minutes]
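The averaging idea above can be sketched as follows. The sketch assumes each crawl is stored as a set of terms at evenly spaced times; the names `average_change_curve`, `max_offset`, `num_starts`, and `seed` are hypothetical parameters introduced for the example, and the sample versions are made up.

```python
import random
from typing import List, Set


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient between two term sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def average_change_curve(versions: List[Set[str]], max_offset: int,
                         num_starts: int, seed: int = 0) -> List[float]:
    """Mean Dice similarity at offsets 1..max_offset, averaged over
    randomly chosen starting versions t0."""
    last_start = len(versions) - max_offset - 1
    if last_start < 0:
        raise ValueError("not enough versions for the requested offset")
    rng = random.Random(seed)
    starts = [rng.randint(0, last_start) for _ in range(num_starts)]
    curve = []
    for offset in range(1, max_offset + 1):
        sims = [dice(versions[t0], versions[t0 + offset]) for t0 in starts]
        curve.append(sum(sims) / len(sims))
    return curve


# Made-up crawl: stable navigation text plus one story replaced every crawl.
versions = [{"nav", f"story{i}"} for i in range(10)]
curve = average_change_curve(versions, max_offset=3, num_starts=5)
print(curve)
```

Fixing the seed keeps the sampled starting points reproducible between runs; in practice the number of starting points trades smoothness of the curve against computation.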
Operators on Serialized Data
• Combining sorts/reduces and intermediate outputs can answer many questions
• Can’t identify “changes” in text – shingles can help
– Can identify when a hash “vanishes” and a new one appears
  • Test only on those
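As a rough illustration of how shingles could help, the sketch below compares the text behind a vanished hash with the text behind a newly appeared one. It assumes word-level shingles of a fixed width and uses Jaccard overlap as the resemblance score; both choices, along with the names `shingles` and `resemblance` and the sample strings, are assumptions made for this example rather than details from the talk.

```python
from typing import Set, Tuple


def shingles(text: str, width: int = 2) -> Set[Tuple[str, ...]]:
    """Set of overlapping word sequences of the given width."""
    words = text.split()
    return {tuple(words[i:i + width])
            for i in range(len(words) - width + 1)}


def resemblance(old_text: str, new_text: str, width: int = 2) -> float:
    """Jaccard overlap of the two shingle sets (1.0 means identical sets)."""
    old, new = shingles(old_text, width), shingles(new_text, width)
    if not old and not new:
        return 1.0
    return len(old & new) / len(old | new)


# Hypothetical text of a node whose hash vanished, and of the node
# whose hash appeared in its place.
before = "the quick brown fox"
after = "the quick red fox"
score = resemblance(before, after)
print(score)
```

A score well above zero suggests the new node is an edited version of the old one rather than unrelated content.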
Overall Distributions (Dice)
Change Plot for a Page
Change Plot by Domain
[Figure: mean Dice coefficient (0.74 to 0.9) versus mean inter-version time (0 to 200), by domain: .gov, .edu, .com, .net, .org]