Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani Web Search and Data Mining Stanford, California February 11, 2008


Page 1: Ranking Web Sites with Real User Traffic

Ranking Web Sites with Real User Traffic

Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani

Web Search and Data Mining, Stanford, California, February 11, 2008

Page 2: Ranking Web Sites with Real User Traffic

Outline

•Data collection

•Structural properties

•Behavioral patterns

•PageRank validation

•Temporal patterns

Page 3: Ranking Web Sites with Real User Traffic

Sources for Ranking Data: The Link Graph

Page 4: Ranking Web Sites with Real User Traffic

Sources for Ranking Data: Dynamic Sources

• Network flow data

• Web server logs

• Toolbars and plugins

Page 5: Ranking Web Sites with Real User Traffic

Sources for Ranking Data: Packet Inspection

• ISP

• ~100 K users

Page 6: Ranking Web Sites with Real User Traffic

Data Collection

• HTTP (80) GET requests, from IU only (30% @ peak)

• Fields recorded: Host, Path, Referer, User-Agent, Timestamp

• anonymizer

• Resulting data sets: FULL (h/p/r/a/t) and HUMAN (h/p/r/a/t)
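As a rough illustration of how records with these five fields could be turned into a weighted host graph, here is a minimal Python sketch. The record layout, the function names and the `example` hosts are assumptions made for illustration; the slides do not describe the actual collection code.

```python
from collections import Counter
from urllib.parse import urlsplit


def referer_host(referer):
    """Return the host part of a Referer URL, or None when it is empty or unparsable."""
    if not referer:
        return None
    return urlsplit(referer).hostname


def build_host_graph(records):
    """Count clicks per (source host, target host) pair.

    Each record is a hypothetical (host, path, referer, agent, timestamp) tuple.
    Requests with an empty referrer have no source host, so they are tallied
    separately instead of becoming edges.
    """
    weights = Counter()
    empty_referrer = Counter()
    for host, _path, referer, _agent, _timestamp in records:
        src = referer_host(referer)
        if src is None:
            empty_referrer[host] += 1
        else:
            weights[(src, host)] += 1
    return weights, empty_referrer


# Toy records, invented for the example.
records = [
    ("b.example", "/", "http://a.example/index.html", "agent", 0),
    ("b.example", "/x", "http://a.example/", "agent", 1),
    ("c.example", "/", "", "agent", 2),
]
weights, empty_referrer = build_host_graph(records)
```

Keeping empty-referrer requests in their own counter makes it easy to report them as a separate share of traffic later on.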

Page 7: Ranking Web Sites with Real User Traffic
Page 8: Ranking Web Sites with Real User Traffic

Outline

•Data collection

•Structural properties

•Behavioral patterns

•PageRank validation

•Temporal patterns

Page 9: Ranking Web Sites with Real User Traffic

Structural properties: Degree

Page 10: Ranking Web Sites with Real User Traffic

Caveat: Sampling Bias

Page 11: Ranking Web Sites with Real User Traffic

Structural properties: Strength (Site Traffic)

Page 12: Ranking Web Sites with Real User Traffic

Structural properties: Weights (Link Traffic)
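The three structural quantities on these slides (degree, strength, weight) can be derived from a dictionary of link weights. The sketch below assumes a hypothetical `weights` mapping from `(source, target)` host pairs to click counts; the names and the toy numbers are illustrative only.

```python
from collections import defaultdict


def degree_and_strength(weights):
    """Return (k_in, k_out, s_in, s_out) for a weighted directed graph.

    Degree counts distinct links; strength sums the traffic on those links.
    """
    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(int), defaultdict(int)
    for (src, dst), w in weights.items():
        k_out[src] += 1
        k_in[dst] += 1
        s_out[src] += w
        s_in[dst] += w
    return dict(k_in), dict(k_out), dict(s_in), dict(s_out)


# Toy link weights (clicks), invented for the example.
toy_weights = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2}
k_in, k_out, s_in, s_out = degree_and_strength(toy_weights)
```

In this toy graph, host `c` has in-degree 2 but in-strength 3, which shows how degree and traffic can diverge.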

Page 13: Ranking Web Sites with Real User Traffic

Outline

•Data collection

•Structural properties

•Behavioral patterns

•PageRank validation

•Temporal patterns

Page 14: Ranking Web Sites with Real User Traffic

Behavioral patterns (HUMAN)

(Proportion of total out-strength)

• Empty Referrer: 54%

• Search: 5%

• Other: 40%

• Webmail: 1%

Page 15: Ranking Web Sites with Real User Traffic

Ratios are stable

[Figure: Requests (x 10^6) and percentage shares (0% to 100%) by month, Sep06 to May07]

Page 16: Ranking Web Sites with Real User Traffic

Ratios are stable

[Figure: Requests (x 10^6) and percentage shares (0% to 100%) over 0 to 23 (hour of day)]

Page 17: Ranking Web Sites with Real User Traffic

Outline

•Data collection

•Structural properties

•Behavioral patterns

•PageRank validation

•Temporal patterns

Page 18: Ranking Web Sites with Real User Traffic

Validation of PageRank

• PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph

• Compare with actual site traffic (in-strength)

• From an application perspective, we care about the resulting ranking of sites rather than the actual values

Page 19: Ranking Web Sites with Real User Traffic

Kendall’s Rank Correlation
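Kendall's rank correlation compares two rankings by counting pairs of items that the rankings order the same way (concordant) or the opposite way (discordant). The following is a minimal pure-Python sketch of the simplest variant (tau-a, which does not correct for ties); which variant the authors used is not stated on the slide, and the function name is an assumption.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists.

    Returns (concordant - discordant) / (number of pairs). Tied pairs count
    as neither, so heavily tied data calls for a tie-corrected variant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy scores: e.g. PageRank values versus observed traffic for four sites.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

This quadratic-time loop is fine for small examples; for large site lists a library routine such as SciPy's `kendalltau` would be the more practical choice.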

Page 20: Ranking Web Sites with Real User Traffic

PageRank Assumptions

1. Equal probability of teleporting to each of the nodes

2. Equal probability of teleporting from each of the nodes

3. Equal probability of following each link from any given node

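$$PR_W(j) = \frac{\alpha}{N} + (1 - \alpha) \sum_{i:\, w_{ij} > 0} \frac{w_{ij}}{s_{out}(i)} \, PR_W(i)$$

where $\alpha$ is the jump (teleportation) probability and $N$ the number of nodes.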
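A weighted PageRank of this kind, where the walker follows link i→j with probability w_ij / s_out(i) rather than uniformly, can be approximated by power iteration. The sketch below is an illustration under several assumptions: `alpha` is taken to be the jump probability with an illustrative default of 0.15, nodes without outgoing traffic spread their rank uniformly, and all names are hypothetical.

```python
def weighted_pagerank(weights, alpha=0.15, iterations=100):
    """Power iteration for a traffic-weighted PageRank.

    weights maps (source, target) pairs to positive link weights.
    alpha is assumed to be the jump probability; non-positive weights are ignored.
    """
    edges = {edge: w for edge, w in weights.items() if w > 0}
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    s_out = {v: 0.0 for v in nodes}
    for (src, _dst), w in edges.items():
        s_out[src] += w
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing traffic is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if s_out[v] == 0)
        new_rank = {v: alpha / n + (1 - alpha) * dangling / n for v in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += (1 - alpha) * rank[src] * w / s_out[src]
        rank = new_rank
    return rank


# Toy graph: "a" sends three times more traffic to "b" than to "c".
toy = {("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 1, ("c", "a"): 1}
pr_w = weighted_pagerank(toy)
```

A fixed iteration count keeps the sketch short; a convergence test on the change in `rank` would be a natural refinement.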

Page 21: Ranking Web Sites with Real User Traffic

Kendall’s Rank Correlation

Page 22: Ranking Web Sites with Real User Traffic

Local Link Heterogeneity

HH Index of concentration or disparity:

$$Y_i = \sum_j \left( \frac{w_{ij}}{s_{out}(i)} \right)^2$$

[Figure: plot with reference annotations "perfect concentration" and "perfect homogeneity"]
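The disparity index for a node is the sum of squared shares of its outgoing traffic: it equals 1 when all traffic leaves over a single link and 1/k when traffic is split evenly over k links. A minimal sketch, assuming a hypothetical `weights` mapping from `(source, target)` pairs to positive link weights:

```python
from collections import defaultdict


def disparity(weights):
    """Return Y_i = sum_j (w_ij / s_out(i))**2 for every node with outgoing traffic."""
    edges = {edge: w for edge, w in weights.items() if w > 0}
    s_out = defaultdict(float)
    for (src, _dst), w in edges.items():
        s_out[src] += w
    y = defaultdict(float)
    for (src, _dst), w in edges.items():
        y[src] += (w / s_out[src]) ** 2
    return dict(y)


# Toy weights: "even" splits traffic equally, "hub" favours one link heavily.
toy = {
    ("even", "x"): 1, ("even", "y"): 1,
    ("hub", "x"): 9, ("hub", "y"): 1,
    ("single", "x"): 5,
}
y = disparity(toy)
```

Here `even` sits at the homogeneous limit for two links (0.5), `single` at full concentration (1.0), and `hub` in between.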

Page 23: Ranking Web Sites with Real User Traffic

Teleportation Target Heterogeneity

Page 24: Ranking Web Sites with Real User Traffic

Teleportation Source Heterogeneity (“hubness”)

$s_{out} < s_{in}$: teleport sources, browsing sinks

$s_{out} > s_{in}$: popular hubs

[Figure annotation: -2]

Page 25: Ranking Web Sites with Real User Traffic

Navigation vs. Jumps: Sources of Popularity

Page 26: Ranking Web Sites with Real User Traffic

Outline

•Data collection

•Structural properties

•Behavioral patterns

•PageRank validation

•Temporal patterns

Page 27: Ranking Web Sites with Real User Traffic

Temporal patterns

How predictable are traffic patterns?

-- Cache refreshing

(e.g. proxies)

-- Capacity allocation

(e.g. peering and provisioning for spikes)

-- Site design

(e.g. expose content based on time of day)

Page 28: Ranking Web Sites with Real User Traffic

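Temporal patterns

• Predict future host graph (clicks) from current one, as a function of delay

• Generalized temporal precision and recall:

$$P(\delta) = \left\langle \frac{\sum_{ij} \min\left[ w_{ij}(t),\, w_{ij}(t+\delta) \right]}{\sum_{ij} w_{ij}(t)} \right\rangle_{t \in T}$$

$$R(\delta) = \left\langle \frac{\sum_{ij} \min\left[ w_{ij}(t),\, w_{ij}(t+\delta) \right]}{\sum_{ij} w_{ij}(t+\delta)} \right\rangle_{t \in T}$$

where $\delta$ is the delay and $T$ the set of time steps considered.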
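For a single pair of snapshots, generalized precision and recall can be computed from the per-link minimum of the two weight sets. The sketch below assumes the usual convention that precision normalizes by the predicting (current) graph and recall by the predicted (future) graph; the slide text is too garbled to confirm which denominator belongs to which measure, and all names are hypothetical. Averaging the two values over many starting times would give the curve as a function of delay.

```python
def temporal_precision_recall(current, future):
    """Generalized precision and recall of `current` as a predictor of `future`.

    Both arguments map (source, target) pairs to non-negative link weights.
    The overlap is the traffic that both snapshots agree on, link by link.
    """
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    total_current = sum(current.values())
    total_future = sum(future.values())
    precision = overlap / total_current if total_current else 0.0
    recall = overlap / total_future if total_future else 0.0
    return precision, recall


# Toy snapshots of the host graph at time t and at time t + delay.
now = {("a", "b"): 4, ("a", "c"): 2}
later = {("a", "b"): 2, ("b", "c"): 2}
precision, recall = temporal_precision_recall(now, later)
```

With binary weights this reduces to the ordinary set-based precision and recall over links.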

Page 29: Ranking Web Sites with Real User Traffic

HUMAN host graph (FULL is about 10% more predictable)

Page 30: Ranking Web Sites with Real User Traffic

Summary

•Heterogeneity: incoming and outgoing site traffic, link traffic

• Less than half of traffic is from following links

•Only 5% of traffic is directly from search engines

•High temporal regularity

•PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated

Page 31: Ranking Web Sites with Real User Traffic

Next

•Sampling bias and search bias

•From host graph to page graph

•Modeling traffic: Beyond random walk?

Page 32: Ranking Web Sites with Real User Traffic

THANKS!

Mark Meiss

Filippo Menczer

Santo Fortunato

Alessandro Vespignani

Alessandro Flammini

??