
Mining the Web Traces: Workload Characterization, Performance Diagnosis, and Applications

Lili Qiu
Microsoft Research

Performance’2002, Rome, Italy
September 2002

Motivation

Why do we care about Web traces?

Content providers
    How do users come to visit the Web site?
    Why do users leave the Web site? Is poor performance the cause?
    What content are users interested in?
    How does users’ interest vary over time?
    How does users’ interest vary across different geographical regions?
    Where are the performance bottlenecks?

Motivation (Cont.)

Web hosting companies
    Accounting & billing
    Server selection
    Provisioning server farms: where to place servers
ISPs
    How to save bandwidth by deploying proxy caches?
    Traffic engineering & provisioning
Researchers
    Where are the performance bottlenecks?
    How to improve Web performance?
    Examples: traffic measurements have influenced the design of HTTP (e.g., persistent connections and pipelining) and TCP (e.g., initial congestion window)

Tutorial Outline

Background
Web workload characterization
Performance diagnosis
Applications of traces
Bibliography

Part I: Background

Web software components
Web semantic components
Web protocols
Types of Web traces

Web Software Components

Web clients
    An application that establishes connections to send Web requests
    E.g., Mosaic, Netscape Navigator, Microsoft IE
Web servers
    An application that accepts connections to service requests by sending back responses
    E.g., Apache, IIS
Web proxies (optional)
Web replicas (optional)

[Figure: Web clients reach Web servers across the Internet, optionally through proxies and replicas]

Web Semantic Components

Uniform Resource Identifier (URI)
    An identifier for a Web resource
    Name of protocol: http, https, ftp, ...
    Name of the server
    Name of the resource on the server
    E.g., http://www.foobar.com/info.html
Hypertext Markup Language (HTML)
    Platform-independent styles (indicated by markup tags) that define the various components of a Web document
Hypertext Transfer Protocol (HTTP)
    Defines the syntax and semantics of messages exchanged between Web software components

Example of a Web Transaction

[Figure: message exchange among a browser, a DNS server, and a Web server]
    1. DNS query
    2. Set up TCP connection
    3. HTTP request
    4. HTTP response

Internet Protocol Stack

Application layer: application programs (HTTP, Telnet, FTP, DNS)
Transport layer: error control + flow control (TCP, UDP)
Network layer: routing (IP)
Datalink layer: handle hardware details (Ethernet, ATM)
Physical layer: moving bits (coaxial cable, optical fiber)

Web Protocols

[Figure, taken from [KR01]: two end hosts each run the HTTP / TCP / IP / Ethernet stack and exchange HTTP messages carried in TCP segments; the IP packets cross intermediate routers over Ethernet and Sonet links]

Web Protocols (Cont.)

DNS [AL01]
    An application layer protocol responsible for translating hostnames to IP addresses and vice versa (e.g., perf2002.uniroma2.it <-> 160.80.2.140)
TCP [JK88]
    A transport layer protocol that does error control and flow control
Hypertext Transfer Protocol (HTTP)
    HTTP 1.0 [BLFF96]
        The most widely used HTTP version
        A “stop and wait” protocol
    HTTP 1.1 [GMF+99]
        Adds persistent connections, pipelining, caching, compression

HTTP 1.0

HTTP request
    Request = Simple-Request | Full-Request
    Simple-Request = "GET" SP Request-URI CRLF
    Full-Request = Request-Line
                   *( General/Request/Entity Header )
                   CRLF
                   [ Entity-Body ]
    Request-Line = Method SP Request-URI SP HTTP-Version CRLF
    Method = "GET" | "HEAD" | "POST" | extension-method
Example: GET /info.html HTTP/1.0
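The request grammar can be illustrated with a small parser. This is a minimal sketch, not a full HTTP implementation: the names `RequestLine` and `parse_request_line` are hypothetical, headers and the entity body are ignored, and a Simple-Request is reported with version "HTTP/0.9" on the assumption that it is the HTTP/0.9 request form.

```python
from typing import NamedTuple


class RequestLine(NamedTuple):
    method: str
    uri: str
    version: str  # "HTTP/0.9" is assumed here for a Simple-Request


def parse_request_line(line: str) -> RequestLine:
    """Parse the first line of an HTTP/1.0 request (hypothetical helper)."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 2 and parts[0] == "GET":
        # Simple-Request = "GET" SP Request-URI CRLF
        return RequestLine("GET", parts[1], "HTTP/0.9")
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
        return RequestLine(parts[0], parts[1], parts[2])
    raise ValueError(f"malformed request line: {line!r}")
```

For the example on the slide, `parse_request_line("GET /info.html HTTP/1.0\r\n")` yields the method `GET`, the URI `/info.html`, and the version `HTTP/1.0`.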

HTTP 1.0 (Cont.)

HTTP response
    Response = Simple-Response | Full-Response
    Simple-Response = [ Entity-Body ]
    Full-Response = Status-Line
                    *( General/Response/Entity Header )
                    CRLF
                    [ Entity-Body ]
Example:
    HTTP/1.0 200 OK
    Date: Mon, 09 Sep 2002 06:07:53 GMT
    Server: Apache/1.3.20 (Unix) (Red-Hat/Linux) PHP/4.0.6
    Last-Modified: Mon, 29 Jul 2002 10:58:59 GMT
    Content-Length: 21748
    Content-Type: text/html
    …
    <21748 bytes of the current version of info.html>

HTTP 1.1

Connection management
    Persistent connections [Mogul95]
        Use one TCP connection for multiple HTTP requests
        Pros:
            Reduce the overhead of connection setup and teardown
            Avoid TCP slow start
        Cons:
            Head-of-line blocking
            Increased server state
    Pipelining [Pad95]
        Send multiple requests without waiting for a response between requests
        Pros: avoid the round-trip delay of waiting for each response
        Cons: connection aborts are harder to deal with

HTTP 1.1 (Cont.)

Caching
    Continues to support the notion of expiration used in HTTP 1.0
    Adds a Cache-Control header to handle the issues of cacheability and semantic transparency [KR01]
    E.g., no-cache, only-if-cached, no-store, max-age, max-stale, min-fresh, …
Others
    Range requests
    Content negotiation
    Security
    …
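To make the cache-control directives concrete, the sketch below splits a Cache-Control header value into directive names and optional arguments. The function name `parse_cache_control` is hypothetical, and the parser is deliberately simplified: it assumes no commas inside quoted arguments.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control header value into {directive: argument or None}.

    Simplifying assumption: quoted arguments never contain commas.
    """
    directives = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, arg = item.partition("=")
        # Directives without "=" (e.g., no-cache) map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives
```

A trace analysis could use such a helper to count how often responses carry `no-store` or a short `max-age`, which bounds the benefit a cache can deliver.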

Types of Web Traces

Application level traces
    Collection method:
    Available tools:
    Concerns:
Flow level traces
    Collection method:
    Available tools:
    Concerns:
Packet level traces
    Collection method: monitor a network link
    Available tools: tcpdump, libpcap
    Concerns: packet dropping, timestamp accuracy

Tutorial Outline

Background
Web workload characterization
Performance diagnosis
Applications of traces
Bibliography

Part II: Web Workload Characterization

Overview
Content dynamics
Access dynamics
Common pitfalls
Case studies

Overview

Process of trace analyses
Common analysis techniques
Common analysis tools
Challenges in workload characterization

Process of Trace Analyses

Collect traces
    Where to monitor, how to collect (e.g., efficiency, privacy, accuracy)
Determine key metrics to characterize
Process traces
Draw inferences from the data
Apply the traces or insights gained from the trace analyses to design better protocols & systems

Common Analysis Techniques - Statistics

Mean
Median
Variance and standard deviation
    $var(x) = \frac{1}{N} \sum_{i=1}^{N} (x_i - u)^2$, $std(x) = \sqrt{var(x)}$, where $u$ is the mean
Geometric mean: less sensitive to outliers
    $GM(x) = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = e^{E(\log x)}$
Confidence interval
    A range of values that has a specified probability of containing the parameter being estimated
    Example: 95% confidence interval $10 \le x \le 20$
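The summary statistics listed on this slide can be computed with Python's standard library. This is a minimal sketch under stated assumptions: the function name `summarize` is hypothetical, the variance uses the 1/N (population) form, and the 95% confidence interval for the mean uses a normal approximation with z = 1.96, which is only reasonable for fairly large samples.

```python
import math
import statistics


def summarize(samples, z=1.96):
    """Return basic summary statistics for positive samples (at least two)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    variance = statistics.pvariance(samples, mu=mean)  # 1/N form
    # Geometric mean as exp(E[log x]); requires all samples > 0.
    geometric_mean = math.exp(statistics.fmean([math.log(x) for x in samples]))
    # Normal-approximation confidence interval for the mean (an assumption).
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "variance": variance,
        "std": math.sqrt(variance),
        "geometric_mean": geometric_mean,
        "ci95": (mean - half_width, mean + half_width),
    }
```

For the samples 1, 2, 4, 8 the mean is 3.75 while the geometric mean is about 2.83, which illustrates why the geometric mean is less sensitive to large outliers.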

Common Analysis Techniques – Statistics (Cont.)

Cumulative distribution function (CDF)
    $P(x \le a)$
Probability density function (PDF)
    Derivative of CDF: $f(x) = dF(x)/dx$
Check for heavy-tailed distribution
    Log-log complementary distribution plot: plot $1 - F(x)$ against $x$ on log-log axes, and check its tail
    Example: Pareto distribution
        $F(x) = 1 - (a/x)^{\alpha}$, $a, \alpha > 0$, $x \ge a$
        If $\alpha \le 2$, the distribution has infinite variance (a heavy tail)
        If $\alpha \le 1$, the distribution has infinite mean
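A heavy-tail check can be sketched in two steps: compute the empirical complementary distribution (the points one would plot on log-log axes), then estimate the Pareto shape parameter. The names below are hypothetical, and the estimator is the standard maximum-likelihood estimate for a Pareto tail with a known lower bound `a`, chosen here for illustration rather than taken from the tutorial.

```python
import math
from bisect import bisect_right


def empirical_ccdf(samples):
    """Return [(x, P[X > x])] for each distinct sample value, in increasing x."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, 1.0 - bisect_right(xs, x) / n) for x in sorted(set(xs))]


def pareto_alpha_mle(samples, a):
    """Maximum-likelihood estimate of the Pareto shape for samples >= a.

    Assumes at least one sample is strictly greater than a.
    """
    tail = [x for x in samples if x >= a]
    return len(tail) / sum(math.log(x / a) for x in tail)
```

An estimate at or below 2 suggests infinite variance, and at or below 1 infinite mean, matching the thresholds on the slide. In practice `a` is usually set to the point where the log-log plot becomes roughly linear.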

Common Analysis Techniques – Data Fitting

Visually compare the empirical distribution with a standard distribution
Chi-squared tests [AS86,Jain91]
    If $X^2 \le k$, then the two distributions are close, where
        $X^2 = \sum_{i=1}^{k} \frac{(x_i - E_i)^2}{E_i}$
    Need enough samples
Kolmogorov-Smirnov tests [AS86,Jain91]
    Compare two distributions by finding the maximum difference between the two variables’ cumulative distribution functions
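Both test statistics are short to compute. The sketch below is illustrative only: the function names are hypothetical, `observed` and `expected` are per-bin counts, and the Kolmogorov-Smirnov variant shown is the two-sample form. Turning either statistic into an accept/reject decision requires critical values from statistical tables, which are not included here.

```python
from bisect import bisect_right


def chi_squared(observed, expected):
    """Sum over bins of (x_i - E_i)^2 / E_i; expected counts must be nonzero."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def ks_statistic(a, b):
    """Maximum absolute difference between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum is attained at one of the sample points
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

For example, `chi_squared([10, 20, 30], [20, 20, 20])` evaluates to 10.0, and two samples with no overlap give a Kolmogorov-Smirnov statistic of 1.0.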

Common Analysis Techniques – Data Fitting (Cont.)

Quantile-quantile plots [AS86,Jain91]
    Compare two distributions by plotting the inverse of the cumulative distribution function $F^{-1}(x)$ for the two variables, and find the best fitting line
    If the slope of the line is close to 1 and the y-intercept is close to 0, the two data sets are almost identically distributed
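The quantile-quantile comparison can be sketched as follows: take matching quantiles from both samples and fit a least-squares line through the resulting points. The names `quantiles_at` and `qq_fit`, the choice of 99 evenly spaced quantile levels, and the simple index-based quantile rule are all assumptions made for illustration.

```python
def quantiles_at(samples, probs):
    """Pick empirical quantiles by index (a simple rule, no interpolation)."""
    xs = sorted(samples)
    n = len(xs)
    return [xs[min(n - 1, int(p * n))] for p in probs]


def qq_fit(a, b, points=99):
    """Return (slope, intercept) of the least-squares line through Q-Q points.

    Assumes sample a is not constant, so the fit is well defined.
    """
    probs = [(i + 1) / (points + 1) for i in range(points)]
    qa, qb = quantiles_at(a, probs), quantiles_at(b, probs)
    mean_a = sum(qa) / len(qa)
    mean_b = sum(qb) / len(qb)
    sxx = sum((x - mean_a) ** 2 for x in qa)
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(qa, qb))
    slope = sxy / sxx
    return slope, mean_b - slope * mean_a
```

A slope near 1 with an intercept near 0 indicates the two samples are close to identically distributed; a straight line with a different slope or intercept suggests the same family with a different scale or location.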

Common Analysis Tools

Scripting languages
    Perl, awk, UNIX shell scripts, VB
Databases
    SQL, DB2, …
Statistics packages
    Matlab, S+, R, SAS, …
Write our own low level programs
    C, C++, C#, Java, …

Challenges in Workload Characterization

Each of the Web components provides a limited perspective on the functioning of the Web
Workload characteristics vary both in space and in time

[Figure: clients reach servers across the Internet, optionally through proxies and replicas]

Views from Clients

Capture clients’ requests to all servers
Pros
    Know details of client activities, such as requests satisfied by browser caches, client abortion
    The ability to record detailed information, as this does not impose significant load on a client browser
Cons
    Need to modify browser software
    Hard to deploy for a large number of clients

Views from Web Servers

Capture most clients’ requests (excluding those satisfied by caches) to a single server
Pros
    Relatively easy to deploy/change logging software
Cons
    Requests satisfied by browser & proxy caches will not appear in the logs
    May not log detailed information to ensure fast processing of client requests

Views from Web Proxies

Depending on the proxy’s location
    A proxy close to clients sees requests from a small client group to a large number of servers [KR00]
    A proxy close to the servers sees requests from a large client group to a small number of servers [KR00]
Pros
    More diverse …?
Cons
    Requests satisfied by browser caches will not appear in the logs
    May not log detailed information to ensure fast processing of requests
    Does not have full information …?

Workload Variation

Vary with measurement points
Vary with sites being measured
    Information servers (news sites), e-commerce servers, query servers, streaming servers, upload servers
    US vs. Italy, …
Vary with the clients being measured
    Internet clients vs. wireless clients
    University clients vs. home users
    US vs. Italy, …
Vary in time
    Day vs. night
    Weekday vs. weekend
    Changes with new applications, recent events
    Evolve over time, …

Part II: Web Workload

Overview
Content dynamics
Access dynamics
Common pitfalls
Case studies

Content Dynamics

File size distribution
File update patterns
    How often files are updated
    How much files are updated

File Size Distribution

Two definitions
    D1: Size of all files on a Web server
    D2: Size of all files transferred by a Web server
    D1 ≠ D2, because some files can be transferred multiple times or not to completion, and other files are not transferred
Studies show that the distribution of file sizes under both definitions exhibits heavy tails (i.e., $P[F > x] \sim x^{-\alpha}$, $0 < \alpha \le 2$)

File Update Interval

Varies in time
    Hot events and fast changing events require more frequent updates, e.g., the World Cup
Varies across sites
    Depending on server update policy
    Depending on the nature of content (e.g., university sites have a slower update rate than news sites)
Recent studies
    Study of the proxy traces collected at DEC and AT&T in 1996 showed that the rate of change depended on content type, top-level domains, etc. [DFK+97]
    Study of 1999 MSNBC logs shows that modification history yields a rough predictor of future modification interval [PQ00]

Extent of Change upon Modifications

Varies in time
Varies across sites
Recent studies
    Study of 1996 DEC and AT&T proxy traces shows that ??? [MDF+97]
    Study of 1999 MSNBC log shows that most file modifications are small, so delta encoding can be very useful [PQ00]

Part II: Web Workload

Motivation
Limitations of workload measurements
Content dynamics
Access dynamics
    File popularity distribution
    Temporal stability
    Spatial locality
    User session and request arrivals & duration
    Synthetic workload generation
Common pitfalls
Case studies

Document Popularity

Web requests follow a Zipf-like distribution
    Request frequency $\propto 1/i^{\alpha}$, where $i$ is a document’s ranking
The value of $\alpha$ depends on the point of measurement
    Between 0.6 and 1 for client traces and proxy traces
    Close to or larger than 1 for server traces [ABC+96, PQ00]
The value of $\alpha$ varies over time (e.g., larger during hot events)

[Figure: bar chart of $\alpha$ (axis from 0 to 2) for MSNBC, proxies, and less popular servers]
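One common way to estimate the Zipf exponent from a trace is a least-squares fit of log(request count) against log(rank). The sketch below assumes the trace has already been reduced to a list of document identifiers, one per request; the function name `zipf_alpha` is hypothetical, and a plain regression over all ranks is a simplification (the low-popularity tail often needs to be trimmed in practice).

```python
import math
from collections import Counter


def zipf_alpha(requests):
    """Estimate alpha in frequency ~ 1 / rank^alpha via log-log regression.

    Assumes at least two distinct documents appear in the trace.
    """
    counts = sorted(Counter(requests).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the fitted slope is -alpha
```

Running the same estimator on client, proxy, and server logs is a direct way to reproduce the comparison described on this slide.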

Impact of the $\alpha$ Value

Larger $\alpha$ means more concentrated accesses on popular documents, so caching is more beneficial
90% of the accesses are accounted for by
    Top 36% files in proxy traces [BCF+99, PQ00]
    Top 10% files in small departmental server logs reported in [AW96]
    Top 2-4% files in MSNBC traces

[Figure: percentage of requests vs. percentage of documents (sorted by popularity), for 12/17/98 server traces, 08/01/99 server traces, and 10/06/99 proxy traces]
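The concentration figures quoted here (the top X% of files accounting for 90% of accesses) can be computed from a trace by sorting documents by popularity and accumulating request counts. This is a minimal sketch; the function name `docs_fraction_for_coverage` is hypothetical and the input is assumed to be a list of document identifiers, one per request.

```python
from collections import Counter


def docs_fraction_for_coverage(requests, coverage=0.9):
    """Smallest fraction of documents (most popular first) that accounts
    for at least `coverage` of all requests."""
    counts = sorted(Counter(requests).values(), reverse=True)
    total = sum(counts)
    running = 0
    for k, count in enumerate(counts, start=1):
        running += count
        if running >= coverage * total:
            return k / len(counts)
    return 1.0
```

A smaller returned fraction corresponds to a larger Zipf exponent, which is the relationship the slide summarizes.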

Temporal Stability Metrics

Coarse-grained: likely duration that a currently popular file remains popular
    e.g., overlap between the set of popular documents on day 1 and day 2
Fine-grained: how soon a requested file will be requested again
    e.g., LRU stack distance [ABC+96]

[Figure: an LRU stack holding, top to bottom, File 5, File 4, File 3, File 2, File 1; after a request for File 2 it holds File 2, File 5, File 4, File 3, File 1. Stack distance = 4]
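The LRU stack distance in the figure can be computed by replaying the request sequence against a stack. The sketch below is a straightforward list-based version with a hypothetical function name; it costs time proportional to the stack depth per request, which is adequate for illustration but slow for very large traces.

```python
def stack_distances(requests):
    """Return the LRU stack distance of each request.

    The distance is the 1-based depth of the document in the stack at the
    time of the request, or None for a first-time reference.
    """
    stack = []  # index 0 is the top (most recently used)
    distances = []
    for doc in requests:
        if doc in stack:
            depth = stack.index(doc) + 1
            stack.remove(doc)
        else:
            depth = None
        distances.append(depth)
        stack.insert(0, doc)  # move (or push) the document to the top
    return distances
```

Replaying files 1 through 5 followed by a second request for file 2 reproduces the figure: the final request finds file 2 at depth 4.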

Spatial Locality

Refers to whether users in the same geographical location or the same organization tend to request the same documents
    E.g., degree to which a request is locally shared vs. globally shared

Spatial Locality (Cont.)

Domain membership is significant except when there is a “hot” event of global interest

[Figure: two panels, “Normal Day” and “Dec. 17, 1998”, each plotting the fraction of requests shared against domain ID for the Trace and Random series]

User Request Arrivals & Duration

User workload at three levels
    Session: a consecutive series of requests from a user to a Web site
    Click: a user action to request a page, submit a form, etc.
    Request: each click generates one or more HTTP requests
Exponential distribution [LNJV99,KR01]
    Session duration
Heavy-tail distribution [KR01]
    # clicks in a session, most in the range of 4-6 [Mah97]
    # embedded references in a Web page
    Think time: time between clicks
    Active time: time to download a Web page and its embedded images

Common Pitfalls

Trace analyses are all about writing scripts & plotting nice graphs
    Challenges
        Trace collection: where to monitor, how to collect (e.g., efficiency, privacy, accuracy)
        Identify important metrics, and understand why they are important
        Sound measurements require discipline [Pax97]
        Draw implications from data analyses
Understanding the limitations of the traces
    No representative traces: workload changes in time and in space
    Try to diversify data sets (e.g., collect traces at different places and different sites) before jumping to conclusions
Drawing more inferences than the data support

Part II: Web Workload

Motivation
Limitations of workload measurements
Content dynamics
Access dynamics
Common pitfalls
Case studies
    Boston University client log study
    UW proxy log study
    MSNBC server log study
    Mobile log study

Case Study I: BU Client Log Study

Overview
    One of the few client log studies
    Analyze clients’ browsing patterns and their impact on network traffic [CBC95]
Approaches
    Trace collection
        Modify Mosaic and distribute it to machines in the CS Dept. at Boston Univ. to collect client traces in 1995
        Log format: <client machine, request time, user id, URI, document size, retrieval time>
    Data analyses
        Distribution of document size, document popularity
        Relationship between retrieval latency and response size
        Implications on caching strategies

Major Findings

Power law distributions
    Distribution of document sizes
    Distribution of user requests for documents
    # requests to documents as a function of their popularity
Caching strategies should take document size into account (i.e., give preference to smaller documents)

Case Study II: UW Proxy Log Study

Overview
    Proxy traces collected at the University of Washington and Microsoft
Approaches [WVS+99a, WVS+99b]
    Trace collection: deploy a passive network sniffer between the Univ. of Washington and the rest of the Internet in May 1999
    Set well-defined objectives
        Understand the extent of document sharing within an organization and across different organizations
        Understand the performance benefit of cooperative proxy caching

Major Findings

Members of an organization are more likely to request the same documents than a random set of clients

Most popular documents are globally popular

Cooperative caching is most beneficial for small organizations

Cooperative caching among large organizations yields minor improvement, if any

Case Study III: MSNBC Server Log Study

Overview of MSNBC server site
    A large news site
    Server cluster with 40 nodes
    25 million accesses a day (HTML content alone)
    Period studied: Aug. – Oct. 99 & Dec. 17, 98 flash crowd

Approaches

Trace collection
    HTTP access logs
    Content Replication System (CRS) logs
    HTML content logs
Data analyses
    Content dynamics
        How often are files modified?
        How to predict the modification interval?
        How much does a file change upon modification?
    Access dynamics
        Document popularity
        Temporal stability
        Spatial locality
        Correlation between document age and popularity

Major Findings

Content dynamics
    Modification history is a rough predictor
    Frequent but minimal file modifications
Access dynamics
    Set of popular files remains stable for days
    Domain membership has a significant bearing on client accesses except during a flash crowd of global interest
    Zipf-like distribution of file popularity but with a much larger $\alpha$ than at proxies
    Accesses to old documents account for most first-time misses, so it is hard to anticipate such accesses and eliminate these first-time misses

Case Study IV: Mobile Log Study

Overview of a popular commercial Web site for mobile clients
    Content
        News, weather, stock quotes, email, yellow pages, travel reservations, entertainment, etc.
    Services
        Notification
        Browse
    Period studied
        3.25 million notifications in Aug. 20 – 26, 2000
        33 million browse requests in Aug. 15 – 26, 2000

Approaches

Analyze by user categories
    Cellular users
        Browse the Web in real time using cellular technologies
    Offline users
        Download content onto their PDAs for later (offline) browsing, e.g. AvantGo
    Desktop users
        Sign up for services and specify preferences
Analyze by Web services
    Browse
    Notifications
Use SQL database to manage data

Major Findings

Notification services
    Popularity of notification messages follows a Zipf-like distribution
    Top 1% of notification objects account for 54-64% of total messages
    Exhibits geographical locality
Browse services
    0.1% - 0.5% of URLs account for 90% of requests
    The set of popular URLs remains stable
Correlation between the two services
    Correlation is limited

Tutorial Outline

Background
Web Workload
Performance Diagnosis
    Overview
    Infer the causes of high end-to-end delay in Web transfers [BC00]
    Infer the causes of high end-to-end loss rate in Web transfers [PQW02a, PQW02b]
Applications of traces

Overview

Goal: Determine internal network characteristics using end-to-end Web traces
Metrics of interest
    Delay
    Loss rate
    Raw bandwidth
    Available bandwidth
    Traffic rate
Why interesting
    Resolve the trouble spots
    Server selection
    Placement of mirror servers

[Figure: a Web server connected through ISP networks (Sprint, AT&T, UUNET, MCI, Qwest, AOL, Earthlink), with the question “Why so slow?”]

Finding the Sources of Delays

Goal
    Why is my Web transfer slow? Is it the server or the network or the client?
Sources of delay in Web transfer
    DNS lookup
    Server delays
    Client delays
    Network delays
        Propagation delays
        Network variation delays
        Delays introduced by packet losses (e.g., signaled by the fast retransmit mechanism or TCP timeouts)

TCPEval Tool

Inputs: “tcpdump” packet traces taken at the communicating Web server and client
Generates a variety of statistics for file transactions
    File and packet transfer latencies
    Packet drop characteristics
    Packet and byte counts per unit time
Generates both timeline and sequence plots for transactions
Generates critical path profiles and statistics for transactions

Critical Path Analysis Tool

[Figure: two client-server timelines, one showing the data flow and one the critical path, with segments labeled network delay, server delay, client delay, and network delay due to packet loss]

Finding Sources of Packet Losses

Goal
    Determine link loss rate or identify lossy links

[Figure: a tree rooted at the server with links l1 to l8, leading to clients with observed end-to-end loss rates p1 to p5]

    (1-l1)*(1-l2)*(1-l4) = (1-p1)
    (1-l1)*(1-l2)*(1-l5) = (1-p2)
    …
    (1-l1)*(1-l3)*(1-l8) = (1-p5)

Under-constrained system of equations
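The loss equations say that a packet survives a path only if it survives every link on it. The sketch below computes the end-to-end loss rate from per-link loss rates, using only the three paths whose equations are spelled out on the slide; the names `PATHS` and `path_loss` are hypothetical, and independent losses across links are assumed.

```python
# Link sequences for the three paths written out explicitly on the slide.
PATHS = {
    "p1": ["l1", "l2", "l4"],
    "p2": ["l1", "l2", "l5"],
    "p5": ["l1", "l3", "l8"],
}


def path_loss(link_loss, path):
    """End-to-end loss rate of a path, assuming independent link losses."""
    success = 1.0
    for link in path:
        success *= 1.0 - link_loss[link]
    return 1.0 - success
```

The system is under-constrained because different link assignments can produce the same observations: a 10% loss rate on l1 alone and a 10% loss rate on both l2 and l3 (with l1 lossless) yield the same loss rate on all three paths.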

Approaches

Active probing
    Probing
        Multicast probes
        Striped unicast probes
    Inference technique: EM

[Figure: two probe trees, each from a source S to receivers A and B]

Approaches (Cont.)

Passive monitoring
    Random sampling
        Randomly sample the solution space, and draw conclusions based on the samples
        Akin to Monte Carlo sampling
    Linear optimization
        Determine a unique solution by optimizing an objective function
    Gibbs sampling
        Sampling from $P(\Theta|D)$, where $\Theta$ is the ensemble of loss rates of links in the network, and $D$ is the observed packet transmissions and losses at the clients
    EM

Tutorial Outline

Background
Web Workload
Performance Diagnosis
Applications of traces
Bibliography

Part IV: Applications of Traces

Synthetic workload generation
Cache design
    Cache replacement policies [CI97,BCF+99]
    Cache consistency algorithms [LC97, YBS99,YAD+01]
    Cooperative cache or not [WVS+99]
    Cache infrastructure
Pre-fetching algorithms [CB98, FJC+99]
Placement of Web proxies/replicas [QPV01]
Other optimizations
    Improving TCP for Web transfers [Mah97,PK98,ZQK00]
    Concurrent downloads, pipelining, compression, …

Synthetic Workload Generation

Generate user requests
  Generate user sessions using a Poisson arrival process
  For each user session, determine the number of clicks using a Pareto distribution
  Assign each click to a request for a Web page, while making sure that
    The popularity of files follows a Zipf-like distribution [BC98]
    The temporal locality of successive requests for the same resource is captured
  Generate the next click from the same user with a think time following a Pareto distribution
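The request-generation steps above can be sketched in Python using only the standard library. All parameter values (arrival rate, Pareto shapes, Zipf exponent) and the function name `generate_requests` are assumptions chosen for illustration. The sketch draws pages independently per click, so it does not model temporal locality.

```python
import itertools
import random


def generate_requests(num_sessions=100, num_pages=1000, seed=0,
                      session_rate=0.5, clicks_shape=1.5,
                      think_shape=1.4, zipf_alpha=0.8):
    """Return a time-sorted list of (time, session_id, page_rank) tuples."""
    rng = random.Random(seed)
    pages = range(num_pages)
    # Zipf-like popularity: the page of rank i gets weight 1 / i**alpha.
    cum_weights = list(itertools.accumulate(
        1.0 / (rank + 1) ** zipf_alpha for rank in pages))

    requests = []
    session_start = 0.0
    for session_id in range(num_sessions):
        # Poisson arrival process: exponential gaps between session starts.
        session_start += rng.expovariate(session_rate)
        # Heavy-tailed number of clicks; paretovariate returns values >= 1.
        num_clicks = int(rng.paretovariate(clicks_shape))
        t = session_start
        for _ in range(num_clicks):
            page = rng.choices(pages, cum_weights=cum_weights)[0]
            requests.append((t, session_id, page))
            # Pareto-distributed think time before the next click.
            t += rng.paretovariate(think_shape)

    requests.sort()
    return requests
```

For example, `generate_requests(num_sessions=10, num_pages=50)` yields a short trace in which low-ranked (popular) pages appear most often.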

Synthetic Workload Generation (Cont.)

Generate Web pages
  Determine the number of Web pages
  Generate the size of each Web page using the file size distribution (log-normal)
  Associate a page with some number of embedded pages using an empirical distribution (heavy-tail)
Generate file modification events
Examples of generators
  WebBench [Wbe], WebStone [TS95], SURGE [BC98], SPECweb99 [SP99], Web Polygraph [WP], …
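A minimal Python sketch of the page-generation step follows. The log-normal parameters are assumed values, and a Pareto draw stands in for the empirical heavy-tailed distribution of embedded items, which the slides do not specify. File modification events are not modelled here.

```python
import random


def generate_pages(num_pages=1000, seed=0,
                   size_mu=9.0, size_sigma=1.5, embed_shape=1.2):
    """Return a list of synthetic page descriptions (hypothetical schema)."""
    rng = random.Random(seed)
    pages = []
    for page_id in range(num_pages):
        # Log-normal file size in bytes, clamped to at least one byte.
        size = max(1, int(rng.lognormvariate(size_mu, size_sigma)))
        # Heavy-tailed count of embedded items; shifted so it can be zero.
        embedded = int(rng.paretovariate(embed_shape)) - 1
        pages.append({"id": page_id, "size_bytes": size, "embedded": embedded})
    return pages
```

In a real generator, the Pareto stand-in would be replaced by a distribution fitted to the trace being modelled.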

Cache Replacement Policies

Problem formulation
  Given a fixed-size cache, how to evict pages to maximize the hit ratio once the cache is full?
Hit ratio
  Fraction of requests satisfied by the cache (request hit ratio)
  Fraction of the total size of requested data satisfied by the cache (byte hit ratio)
Factors to consider
  Request frequency
  Modification frequency
  Benefit of caching: reduction in latency & BW
  Cost of caching: storage

Cache Replacement Policies (Cont.)

Approaches
  Least recently used (LRU)
  Least frequently used (LFU)
    Perfect: maintain counters for all pages seen
    In-cache: maintain counters only for pages that are in cache
  GreedyDual-Size [CI97]
    Assign a utility value to each object, and replace the one with the lowest utility
Use of traces
  Evaluate the algorithms using trace-driven simulations or synthetic workloads
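A trace-driven comparison of two of these policies can be sketched in Python as follows. The trace format (a sequence of `(url, size)` pairs with a fixed size per URL), the function names, and the unit cost used for GreedyDual-Size are assumptions for illustration; LFU is omitted to keep the sketch short.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Replay (url, size) requests; return (hit ratio, byte hit ratio)."""
    cache = OrderedDict()  # url -> size, least recently used first
    used = hits = hit_bytes = total = total_bytes = 0
    for url, size in trace:
        total += 1
        total_bytes += size
        if url in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(url)  # mark as most recently used
            continue
        if size > capacity:
            continue  # the object can never fit, so it is not cached
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[url] = size
        used += size
    return hits / total, hit_bytes / total_bytes


def simulate_gds(trace, capacity, cost=lambda size: 1.0):
    """GreedyDual-Size: utility H = L + cost/size; evict the lowest H."""
    cache = {}        # url -> (H value, size)
    inflation = 0.0   # the running "L" value, raised on every eviction
    used = hits = hit_bytes = total = total_bytes = 0
    for url, size in trace:
        total += 1
        total_bytes += size
        if url in cache:
            hits += 1
            hit_bytes += size
            cache[url] = (inflation + cost(size) / size, size)  # refresh H
            continue
        if size > capacity:
            continue
        while used + size > capacity:
            victim = min(cache, key=lambda u: cache[u][0])
            inflation, victim_size = cache.pop(victim)
            used -= victim_size
        cache[url] = (inflation + cost(size) / size, size)
        used += size
    return hits / total, hit_bytes / total_bytes
```

With a unit cost, GreedyDual-Size prefers to evict large objects, which tends to raise the request hit ratio; a linear scan for the victim keeps the code short, whereas a priority queue would be the usual choice for large caches.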

Placement of Web Proxies/Replicas

Problem formulation [JJK+01, QPV01]
  How to place a fixed number of proxies/replicas to minimize users’ request latency
Factors to consider
  Spatial distribution of requests
  Temporal stability of requests
    Stability in popularity of objects
    Stability in spatial distribution of requests

Placement of Web Proxies/Replicas (Cont.)

Approaches
  Random placement
  Greedy placement
  Hot-spot placement
Use of traces
  Trace-driven simulations
  High concentration of requests to a small number of objects: focus on replicating only popular objects
  Temporal stability in requests: no need to frequently change the locations of proxies/replicas
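Greedy placement can be sketched in Python as below: replicas are added one at a time, each time choosing the candidate site that most reduces the demand-weighted latency to the nearest replica. The latency matrix, the demand vector, and the function name are hypothetical inputs for illustration; in a trace-driven study the demand would come from the request trace.

```python
def greedy_placement(dist, demand, num_replicas):
    """Pick replica sites greedily.

    dist[s][c] is the latency from candidate site s to client c, and
    demand[c] is the request volume of client c.
    Returns (chosen sites in pick order, total demand-weighted latency).
    """
    chosen = []
    best = [float("inf")] * len(demand)  # latency to nearest chosen replica

    for _ in range(num_replicas):
        def total_cost(site):
            # Cost if this site were added to the replicas chosen so far.
            return sum(w * min(b, d)
                       for w, b, d in zip(demand, best, dist[site]))

        candidates = [s for s in range(len(dist)) if s not in chosen]
        site = min(candidates, key=total_cost)
        chosen.append(site)
        best = [min(b, d) for b, d in zip(best, dist[site])]

    return chosen, sum(w * b for w, b in zip(demand, best))


# Toy example with three candidate sites and three clients.
example_dist = [[1, 5, 9], [5, 1, 5], [9, 5, 1]]
example_demand = [10, 1, 10]
print(greedy_placement(example_dist, example_demand, 2))
```

Greedy placement is a heuristic: each step is locally best, so the final set of sites is not guaranteed to be the optimal one.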

References

[AS86] R. B. D’Agostino and M. A. Stephens. Goodness-of-Fit Techniques. Marcel Dekker, New York, NY, 1986.

[ABC+96] Virgilio Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira. Characterizing reference locality in the WWW. In Proceedings of 1996 International Conference on Parallel and Distributed Information Systems (PDIS'96), December 1996.

[ABQ01] A. Adya, P. Bahl, and L. Qiu. Analyzing Browse Patterns of Mobile Clients. In Proc. of SIGCOMM Measurement Workshop, Nov. 2001.

[ABQ02] A. Adya, P. Bahl, and L. Qiu. Characterizing Alert and Browse Services for Mobile Clients. In Proc. of USENIX, Jun. 2002.

[AL01] P. Albitz and C. Liu. DNS and BIND (4th Edition). O’Reilly & Associates, Apr. 2001.

[AW97] M. Arlitt and C. Williamson. Internet Web Servers: Workload Characterization and Performance Implications. IEEE/ACM Transactions on Networking, Vol. 5, No. 5, pp. 631-645, October 1997.

[BC98] P. Barford and M. Crovella. Generating representative workloads for network and server performance evaluation. In Proc. of SIGMETRICS, 1998.

[BBC+98] P. Barford, A. Bestavros, M. Crovella, and A. Bradley. Changes in Web Client Access Patterns: Characteristics and Caching Implications. Special Issue on World Wide Web Characterization and Performance Evaluation, World Wide Web Journal, December 1998.

References (Cont.)

[BCF+99] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proc. of INFOCOM, Mar. 1999.

[BC00] P. Barford and M. Crovella. Critical Path Analysis of TCP Transactions. In Proc. of ACM SIGCOMM, Aug. 2000.

[BLFF96] T. Berners-Lee, R. Fielding, and H. Frystyk. Hypertext Transfer Protocol -- HTTP/1.0. RFC 1945, May 1996.

[BPS+98] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, M. Stemm, and R. H. Katz. TCP Behavior of a Busy Internet Server: Analysis and Improvements. In Proc. of IEEE INFOCOM, San Francisco, CA, USA, March 1998.

[CB98] M. Crovella and P. Barford. The network effects of prefetching. In Proc. of INFOCOM, 1998.

[CBC95] C. R. Cunha, A. Bestavros, and M. E. Crovella. Characteristics of WWW client-based traces. Technical Report BU-CS-95-010, CS Dept., Boston University, 1995.

[CI97] P. Cao and S. Irani. Cost-Aware WWW proxy caching algorithms. In Proc. of USITS, Dec. 1997.

[DFK+97] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. Mogul. Rate of change and other metrics: a live study of the World Wide Web. In Proc. of USITS, 1997.

[FCD+99] A. Feldmann, R. Caceres, F. Douglis, and M. Rabinovich. Performance of Web Proxy Caching in heterogeneous bandwidth environments. In Proc. of IEEE INFOCOM, March 1999.

References (Cont.)

[FJC+99] L. Fan, Q. Jacobson, P. Cao, and W. Lin. Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance. In Proc. of SIGMETRICS, 1999.

[GMF+99] J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1. RFC 2616, Jun. 1999.

[JK88] V. Jacobson and M. J. Karels. Congestion Avoidance and Control. In Proc. SIGCOMM, Aug. 1988.

[JJK+01] S. Jamin, C. Jin, A. R. Kurc, D. Raz, and Y. Shavitt. Constrained Mirror Placement on the Internet. In Proc. of INFOCOM, Apr. 2001.

[Jain91] R. Jain. The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991.

[Kel02] T. Kelly. Thin-Client Web Access Patterns: Measurements from a Cache-Busting Proxy. Computer Communications, Vol. 25, No. 4 (March 2002), pages 357-366.

[KR01] B. Krishnamurthy and J. Rexford. Web Protocols and Practice, HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement. Addison-Wesley, May 2001.

[LC97] C. Liu and P. Cao. Maintaining Strong Cache Consistency in the World-Wide Web. In Proc. of ICDCS'97, pp. 12-21, May 1997.

[LNJV99] Z. Liu, N. Niclausse, and C. Jalpa-Villanueva. Web Traffic Modeling and Performance Comparison Between HTTP 1.0 and HTTP 1.1. In Erol Gelenbe, editor, System Performance Evaluation: Methodologies and Applications. CRC Press, Aug. 1999.

References (Cont.)

[Mah97] Bruce Mah. An empirical model of HTTP network traffic. In Proc. of INFOCOM, April 1997.

[Mogul95] Jeffrey C. Mogul. The Case for Persistent-Connection HTTP. In Proc. SIGCOMM '95, pages 299-313, Cambridge, MA, August 1995.

[MDF+97] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy. Potential benefits of delta-encoding and data compression for HTTP. In Proc. of SIGCOMM, September 1997.

[Pad95] V. N. Padmanabhan. Improving World Wide Web Latency. Technical Report UCB/CSD-95-875, University of California, Berkeley, May 1995.

[PQ00] V. N. Padmanabhan and L. Qiu. The Content and Access Dynamics of a Busy Web Server. In Proc. of SIGCOMM, Aug. 2000.

[PQW02a] V. N. Padmanabhan and L. Qiu. Network Tomography using Passive End-to-End Measurements. DIMACS Workshop on Internet and WWW Measurement, Mapping and Modeling, Feb. 2002.

[PQW02b] V. N. Padmanabhan, L. Qiu, and H. J. Wang. Passive Network Tomography using Bayesian Inference. Internet Measurement Workshop, Nov. 2002.

[QPV01] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the Placement of Web Server Replicas. In Proc. of INFOCOM, Apr. 2001.

[SP99] SPECweb99 Benchmark. http://www.spec.org/osg/web99/.

References (Cont.)

[Pax98] V. Paxson. An Introduction to Internet Measurement and Modeling. SIGCOMM’98 tutorial, August 1998.

[TS95] G. Trent and M. Sake. WebStone: The First Generation in HTTP Server Benchmarking, Feb. 1995. http://www.mindcraft.com/webstone/paper.html.

[Wbe] WebBench. http://www.zdnet.com/etestinglabs/stories/benchmarks/0,8829,2326243,00.html.

[WP] Web Polygraph: Proxy performance benchmark. http://polygraph.ircache.net/.

[WVS+99a] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, and H. Levy. Organization-Based Analysis of Web-Object Sharing and Caching. In Proc. of the Second USENIX Symposium on Internet Technologies and Systems, Boulder, CO, October 1999.

[WVS+99b] A. Wolman, G. M. Voelker, N. Sharma, N. Cardwell, A. Karlin, and H. M. Levy. On the scale and performance of cooperative Web proxy caching. In Proc. of the 17th ACM Symposium on Operating Systems Principles, Kiawah Island, SC, Dec. 1999.

[YAD+01] J. Yin, L. Alvisi, M. Dahlin, and A. Iyengar. Engineering server-driven consistency for large scale dynamic services.

[YBS99] H. Yu, L. Breslau, and S. Shenker. A Scalable Web Cache Consistency Architecture. In Proc. of SIGCOMM, August 1999.

Acknowledgement

Thank you!