1
Mining the Web Traces: Workload Characterization, Performance Diagnosis, and Applications
Lili Qiu, Microsoft Research
Performance’2002, Rome, Italy, September 2002
2
Motivation
Why do we care about Web traces?
Content providers
  How do users get to the Web site?
  Why do users leave the Web site? Is poor performance the cause?
  What content are users interested in?
  How do users’ interests vary over time?
  How do users’ interests vary across different places?
  Where are the performance bottlenecks?
3
Motivation (Cont.)
Web hosting companies
  Accounting & billing
  Server selection
  Provisioning server farms: where to place servers
ISPs
  How to save bandwidth by deploying proxy caches?
  Traffic engineering & provisioning
Researchers
  Where are the performance bottlenecks?
  How to improve Web performance?
  Examples: traffic measurements have influenced the design of HTTP (e.g., persistent connections and pipelining) and TCP (e.g., the initial congestion window)
4
Tutorial Outline
Background Web workload characterization Performance diagnosis Applications of traces Bibliography
5
Part I: Background
Web components Web transaction Web protocols Types of Web traces
6
Web Components
Web clients
  An application that establishes connections to send Web requests
  E.g., Mosaic, Netscape Navigator, IE
Web servers
  An application that accepts connections to service requests by sending back responses
  E.g., Apache, IIS
Web proxies (optional)
Web replicas (optional)
[Figure: Web clients connect across the Internet to Web servers, optionally through proxies and replicas]
7
Internet Protocol Stack
Application layer (HTTP, Telnet, FTP, DNS)
Transport layer (TCP,UDP)
Network layer (IP)
Datalink layer (Ethernet, ATM)
Physical layer(coaxial cable, optical fiber)
8
Web Protocols
[Figure, taken from [KR01]: two end hosts each run an HTTP / TCP / IP / Ethernet stack. HTTP messages are carried in TCP segments, which are carried in IP packets that cross Ethernet and Sonet links through intermediate IP routers]
9
Web Protocols (Cont.)
DNS [AL01]
  An application-layer protocol responsible for translating hostnames to IP addresses and vice versa (e.g., perf2002.uniroma2.it → 160.80.2.140)
TCP [JK88]
  A transport-layer protocol that provides error control and flow control
Hypertext Transfer Protocol (HTTP)
  HTTP 1.0 [BLFF96]
    The most widely used HTTP version
    A “stop and wait” protocol
  HTTP 1.1 [GMF+99]
    Adds persistent connections, pipelining, caching, compression
10
Example of a Web Transaction
Participants: browser, DNS server, Web server
1. DNS query (browser to DNS server)
2. Set up TCP connection (browser to Web server)
3. HTTP request
4. HTTP response
11
HTTP 1.0
HTTP request
  Request = Simple-Request | Full-Request
  Simple-Request = "GET" SP Request-URI CRLF
  Full-Request = Request-Line
                 *( General-Header | Request-Header | Entity-Header )
                 CRLF
                 [ Entity-Body ]
  Request-Line = Method SP Request-URI SP HTTP-Version CRLF
  Method = "GET" | "HEAD" | "POST" | extension-method
Example: GET /info.html HTTP/1.0
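The request grammar can be exercised with a small parser. The following is a minimal sketch, not a complete HTTP parser: the function name `parse_request_line` and the returned dictionary keys are illustrative choices, and only the two request forms in the grammar are recognized.

```python
def parse_request_line(line: str) -> dict:
    """Parse an HTTP/1.0 Request-Line or Simple-Request into its parts."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 2 and parts[0] == "GET":
        # Simple-Request = "GET" SP Request-URI CRLF (no version field)
        return {"method": "GET", "uri": parts[1], "version": None}
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
        return {"method": parts[0], "uri": parts[1], "version": parts[2]}
    raise ValueError(f"malformed request line: {line!r}")


print(parse_request_line("GET /info.html HTTP/1.0\r\n"))
```

When processing server logs, the same split usually recovers the method, URI, and protocol version recorded for each request.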
12
HTTP 1.0 (Cont.)
HTTP response
  Response = Simple-Response | Full-Response
  Simple-Response = [ Entity-Body ]
  Full-Response = Status-Line
                  *( General-Header | Response-Header | Entity-Header )
                  CRLF
                  [ Entity-Body ]
Example:
  HTTP/1.0 200 OK
  Date: Mon, 09 Sep 2002 06:07:53 GMT
  Server: Apache/1.3.20 (Unix) (Red-Hat/Linux) PHP/4.0.6
  Last-Modified: Mon, 29 Jul 2002 10:58:59 GMT
  Content-Length: 21748
  Content-Type: text/html
  …
  <21748 bytes of the current version of info.html>
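Trace analysis often needs the status code and a few headers (Content-Length, Last-Modified) from each response. A minimal sketch of splitting a Full-Response head follows; `parse_response_head` is a hypothetical helper, and header names are lower-cased only as a convenience for lookups.

```python
def parse_response_head(head: str):
    """Split a Full-Response head into (version, status code, reason, headers)."""
    lines = head.split("\r\n")
    version, code, reason = lines[0].split(" ", 2)   # Status-Line
    headers = {}
    for line in lines[1:]:
        if not line:                                 # empty line ends the headers
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers


HEAD = (
    "HTTP/1.0 200 OK\r\n"
    "Date: Mon, 09 Sep 2002 06:07:53 GMT\r\n"
    "Last-Modified: Mon, 29 Jul 2002 10:58:59 GMT\r\n"
    "Content-Length: 21748\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
)
version, code, reason, headers = parse_response_head(HEAD)
print(code, reason, headers["content-length"])
```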
13
HTTP 1.1
Connection management
  Persistent connections [Mogul95]
    Use one TCP connection for multiple HTTP requests
    Pros:
      Reduce the overhead of connection setup and teardown
      Avoid repeated TCP slow start
    Cons:
      Head-of-line blocking
      Increased server state
  Pipelining [Pad95]
    Send multiple requests without waiting for a response between requests
    Pros: avoids the round-trip delay of waiting for each response
    Cons: aborted connections are harder to deal with
14
HTTP 1.1 (Cont.)
Caching
  Continues to support the notion of expiration used in HTTP 1.0
  Adds a Cache-Control header to handle the issues of cacheability and semantic transparency [KR01]
  E.g., no-cache, only-if-cached, no-store, max-age, max-stale, min-fresh, …
Others
  Range requests
  Content negotiation
  Security
  …
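To make the Cache-Control directives concrete, here is a small sketch that parses a header value and applies a deliberately simplified freshness rule. Both helper names are hypothetical, and the rule covers only no-store, no-cache, and max-age; a real cache has to follow the full HTTP 1.1 specification [GMF+99].

```python
def parse_cache_control(value: str) -> dict:
    """Turn 'max-age=3600, no-store' into {'max-age': '3600', 'no-store': True}."""
    directives = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, arg = item.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else True
    return directives


def is_fresh(directives: dict, age_seconds: int) -> bool:
    """Simplified check: may a cached copy be served without revalidation?"""
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    return max_age is not None and age_seconds < int(max_age)


print(is_fresh(parse_cache_control("max-age=60"), 30))
```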
15
Types of Web Traces
Application level traces
Flow level traces
Packet level traces
  Collection method: monitor a network link
  Available tools: tcpdump, libpcap
  Concerns: packet dropping, timestamp accuracy
16
Tutorial Outline
Background Web workload characterization Performance diagnosis Applications of traces Bibliography
17
Part II: Web Workload Characterization
Overview Content dynamics Access dynamics Common pitfalls Case studies
18
Overview Process of trace analyses Common analysis techniques Common analysis tools Challenges in workload characterization
19
Process of trace analyses
Collect traces Define key metrics to characterize Process traces Draw inferences from the data Apply the traces or insights gained
from the trace analyses to design better protocols & systems
20
Common Analysis Techniques - Statistics
Mean
Median
Variance and standard deviation
  $\mathrm{var}(x) = \frac{1}{N} \sum_{i=1}^{N} (x_i - u)^2$, where $u$ is the mean
  $\mathrm{std}(x) = \sqrt{\mathrm{var}(x)}$
Geometric mean
Confidence interval
  A range of values that has a specified probability of containing the parameter being estimated
  Example: 95% confidence interval $10 \le x \le 20$
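The summary statistics above can be computed with the Python standard library. This is a minimal sketch: `summarize` is an illustrative name, and the confidence interval assumes a normal approximation (z = 1.96 for 95%), which is only reasonable for fairly large samples that are not heavy-tailed.

```python
import math
import statistics


def summarize(samples, z=1.96):
    """Return the basic statistics used to characterize a trace metric."""
    n = len(samples)
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu=mean)     # (1/N) * sum((x_i - u)^2)
    # Normal-approximation confidence interval for the mean (an assumption).
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "variance": var,
        "std": math.sqrt(var),
        "geometric_mean": statistics.geometric_mean(samples),
        "ci95": (mean - half_width, mean + half_width),
    }


print(summarize([2, 4, 4, 4, 5, 5, 7, 9]))
```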
21
Common Analysis Techniques – Statistics (Cont.)
Cumulative distribution function (CDF): $F(a) = P(x \le a)$
Probability density function (PDF)
  Derivative of the CDF: $f(x) = dF(x)/dx$
Check for a heavy-tailed distribution
  Make a log-log complementary distribution plot and check its tail
  Example: Pareto distribution
    $F(x) = 1 - (a/x)^{\alpha}$, with $a, \alpha > 0$ and $x \ge a$
    If $\alpha \le 2$, the distribution has infinite variance (a heavy tail)
    If $\alpha \le 1$, the distribution has infinite mean
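The heavy-tail check can be sketched numerically. For a Pareto distribution, log P[X > x] is linear in log x with slope equal to minus alpha, so a least-squares slope on the log-log complementary plot gives a rough estimate of the tail index. The function names, the seed, and the choice to drop the noisiest top 1% of points are assumptions for illustration.

```python
import math
import random


def pareto_sample(alpha, a, rng):
    """Inverse-transform sample from F(x) = 1 - (a/x)^alpha, x >= a."""
    return a / (1.0 - rng.random()) ** (1.0 / alpha)


def llcd_slope(samples, drop_top=0.01):
    """Least-squares slope of log(empirical P[X >= x]) against log(x)."""
    xs = sorted(samples)
    n = len(xs)
    keep = int(n * (1.0 - drop_top))       # ignore the sparse extreme tail
    pts = [(math.log(xs[i]), math.log((n - i) / n)) for i in range(keep)]
    mean_x = sum(p[0] for p in pts) / len(pts)
    mean_y = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    return sxy / sxx


rng = random.Random(1)
data = [pareto_sample(1.2, 1.0, rng) for _ in range(20000)]
print("estimated alpha:", -llcd_slope(data))
```

A slope between -2 and 0 on real trace data suggests a heavy tail, though the fit should be confirmed visually.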
22
Common Analysis Techniques – Data Fitting
Visually compare the empirical distribution with standard distributions
Chi-squared tests [AS86, Jain91]
  Compute $X^2 = \sum_{i=1}^{k} \frac{(x_i - E_i)^2}{E_i}$ over $k$ bins, where $x_i$ is the observed count and $E_i$ the expected count in bin $i$
  If $X^2 < \chi^2_k$, then the two distributions are close
Kolmogorov-Smirnov tests [AS86, Jain91]
  Compare two distributions by finding the maximum difference between their cumulative distribution functions
Quantile-quantile plots [AS86, Jain91]
  Compare two distributions by plotting the inverse cumulative distribution function $F^{-1}(x)$ of one against that of the other and finding the best-fit line
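Both test statistics are short to compute. The sketch below gives the chi-squared statistic for binned counts and the two-sample Kolmogorov-Smirnov statistic; the function names are illustrative, and the comparison against a critical value from a table is left out.

```python
import bisect


def chi_squared_statistic(observed, expected):
    """X^2 = sum over bins of (x_i - E_i)^2 / E_i."""
    return sum((x - e) ** 2 / e for x, e in zip(observed, expected))


def ks_statistic(sample_a, sample_b):
    """Maximum distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        return bisect.bisect_right(sorted_values, x) / len(sorted_values)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


print(chi_squared_statistic([18, 22, 20], [20, 20, 20]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```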
23
Common Analysis Tools
Scripting languages Perl, awk, …
Databases SQL, …
Statistics packages Matlab, S, …
Programs
24
Challenges in Workload Characterization
Each of the Web components provides a limited perspective on the functioning of the Web
Workload characteristics vary both in space and in time
[Figure: clients connect across the Internet to servers, optionally through proxies and replicas]
25
Views from Clients
Capture clients’ requests to all servers
Pros
  Know details of client activities, such as requests satisfied by browser caches and client aborts
  Can record detailed information, as this does not impose significant load on a client browser
Cons
  Need to modify browser software
  Hard to deploy for a large number of clients
26
Views from Web Servers
Capture all clients’ requests (except those satisfied by caches) to a single server
Pros
  Relatively easy to deploy/change logging software
Cons
  Requests satisfied by browser & proxy caches will not appear in the logs
  May not log detailed information, to ensure fast processing of requests
27
Views from Web Proxies
Depending on the proxy’s location
  A proxy close to clients sees requests from a small client group to a large number of servers
  A proxy close to the servers sees requests from a large client group to a small number of servers
Pros
  More diverse …?
Cons
  Requests satisfied by browser caches will not appear in the logs
  May not log detailed information, to ensure fast processing of requests
  Does not have full information …?
28
Workload Variation
Varies with measurement points
Varies with the sites being measured
  Information servers (news sites), e-commerce servers, query servers, streaming servers, upload servers
  US vs. Italy
  …
Varies with the clients being measured
  Internet clients vs. wireless clients
  University clients vs. home users
  …
Varies in time
  Day vs. night
  Weekday vs. weekend
  Changes with new applications, recent events
  …
29
Part II: Web Workload Overview Content dynamics Access dynamics Common pitfalls Case studies
30
Content Dynamics
File size distribution File update patterns
How often files are updated How much files are updated
31
File Size Distribution
Two definitions
  D1: Sizes of all files on a Web server
  D2: Sizes of all files transferred by a Web server
  D1 ≠ D2, because some files are transferred multiple times or not to completion, and other files are not transferred at all
Studies show that the file size distributions under both definitions exhibit heavy tails (i.e., $P[F > x] \sim x^{-\alpha}$, $0 < \alpha \le 2$)
32
File Update Interval
Varies in time
  Hot events and fast-changing events require more frequent updates, e.g., the World Cup
Varies across sites
  Depends on the server’s update policy
Predictability
  A study of MSNBC logs shows that modification history yields a rough predictor of the future modification interval, which is useful for TTL-based schemes [PQ00]
33
Extent of Change upon Modifications
Studies show that most file modifications are small, so delta encoding can be very useful
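A small experiment shows why a delta can be much smaller than the full file when a modification is minor. The sketch uses a line-based unified diff from Python's `difflib` as a stand-in for a delta encoder; it is not the delta format evaluated in the cited studies, and the page content is made up for illustration.

```python
import difflib


def delta(old: str, new: str) -> str:
    """Line-based delta between two versions of a file (unified diff, no context)."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True), new.splitlines(keepends=True), n=0))


# Hypothetical page: 200 headlines, of which one is edited.
old = "".join(f"<p>headline {i}</p>\n" for i in range(200))
new = old.replace("headline 42<", "headline 42 (updated)<")
d = delta(old, new)
print("full size:", len(new), "delta size:", len(d))
```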
34
Part II: Web Workload Motivation Limitations of workload measurements Content dynamics Access Dynamics
File popularity distribution Temporal stability Spatial locality User session and request arrivals & duration Synthetic workload generation
Common pitfalls Case studies
35
Document Popularity
Web requests follow a Zipf-like distribution
  Request frequency $\propto 1/i^{\alpha}$, where $i$ is a document’s popularity rank
The value of $\alpha$ depends on the point of measurement
  Between 0.6 and 1 for client traces and proxy traces
  Close to or larger than 1 for server traces [ABC+96, PQ00]
The value of $\alpha$ varies over time (e.g., larger during hot events)
[Figure: values of $\alpha$ (axis from 0 to 2) for MSNBC, proxies, and less popular servers]
36
Impact of the $\alpha$ value
Larger $\alpha$ means accesses are more concentrated on popular documents, so caching is more beneficial
90% of the accesses are accounted for by
  Top 36% of files in proxy traces [BCF+99, PQ00]
  Top 10% of files in small departmental server logs reported in [AW96]
  Top 2-4% of files in MSNBC traces
[Figure: percentage of requests vs. percentage of documents (sorted by popularity), both axes from 0 to 1, for the 12/17/98 server traces, the 08/01/99 server traces, and the 10/06/99 proxy traces]
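The effect of the Zipf parameter on request concentration can be checked with a few lines of code. This sketch builds an idealized Zipf-like popularity distribution and reports what fraction of documents is needed to cover 90% of requests; the document count and the parameter values are illustrative, not taken from the traces.

```python
def zipf_shares(num_docs, alpha):
    """Request share of each document, most popular first (share ~ 1/rank^alpha)."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_docs + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def docs_needed(shares, target=0.9):
    """Fraction of documents (most popular first) that covers `target` of requests."""
    covered = 0.0
    for rank, share in enumerate(shares, start=1):
        covered += share
        if covered >= target:
            return rank / len(shares)
    return 1.0


for alpha in (0.7, 1.0, 1.4):
    print(alpha, docs_needed(zipf_shares(10000, alpha)))
```

A larger parameter value shrinks the set of documents a cache has to hold to capture most requests.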
37
Temporal Stability Metrics
Coarse-grained: how long a currently popular file is likely to remain popular
  e.g., overlap between the set of popular documents on day 1 and day 2
Fine-grained: how soon a requested file will be requested again
  e.g., LRU stack distance [ABC+96]
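[Figure: an LRU stack holding, from top to bottom, File 5, File 4, File 3, File 2, File 1. After File 2 is requested it moves to the top, giving File 2, File 5, File 4, File 3, File 1. Stack distance = 4]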
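The LRU stack distance of each request can be computed by replaying the trace against a stack. This is a minimal sketch with a list-based stack, which is quadratic in the worst case and meant only to illustrate the definition; the function name is an assumption.

```python
def lru_stack_distances(requests):
    """Return (distance of each request, final stack); None marks a first reference."""
    stack = []                               # most recently used file first
    distances = []
    for item in requests:
        if item in stack:
            depth = stack.index(item) + 1    # 1-based position from the top
            stack.remove(item)
        else:
            depth = None
        distances.append(depth)
        stack.insert(0, item)                # requested file moves to the top
    return distances, stack


# Files 1..5 are requested in order, then File 2 is requested again.
print(lru_stack_distances([1, 2, 3, 4, 5, 2]))
```

Small stack distances across a trace indicate strong temporal locality, which favors small caches.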
38
Spatial Locality
Refers to whether users in the same geographical location or the same organization tend to request the same documents
  E.g., the degree to which a request is locally shared vs. globally shared
39
Spatial Locality (Cont.)
[Figure: fraction of requests shared vs. domain ID on a normal day, comparing the trace against a random assignment of clients to domains]
Domain membership is significant except when there is a “hot” event of global interest
[Figure: the same plot for Dec. 17, 1998, trace vs. random]
40
User session and request arrivals & duration
User workload at three levels
  Session: a consecutive series of requests from a user to a Web site
  Click: a user action to request a page, submit a form, etc.
  Request: each click generates one or more HTTP requests
Exponential distribution [LNJV99, KR01]
  Session inter-arrival times
Heavy-tailed distribution [KR01]
  # clicks in a session, most in the range of 4-6 [Mah97]
  # embedded references in a Web page
  Think time: time between clicks
  Active time: time to download a Web page and its embedded images
41
Common Pitfalls
Misconception: trace analyses are all about writing scripts & plotting nice graphs
  Trace analyses are for better design of systems, and the implications are often more important than the raw data
  Challenges
    Trace collection: where to monitor, how to collect (e.g., efficiency, privacy, accuracy)
    Identifying important metrics, and understanding why they are important
    Sound measurement requires discipline
    Drawing implications from data analyses
Understanding the limitations of the traces
  No trace is representative: workload changes in time and in space
  Try to diversify data sets (e.g., collect at different places and different sites) before jumping to conclusions
  Do not draw inferences beyond what the data show
42
Part II: Web Workload Motivation Limitations of workload measurements Content dynamics Access dynamics Common pitfalls Case studies
Boston University client log study UW proxy log study MSNBC server log study Mobile log study
43
Case Study I: BU Client Log Study
Analyzes clients’ browsing patterns and their impact on network traffic [CBC95]
  One of the few client log studies
Approach
  Modify Mosaic and distribute it to machines in the CS Dept. at Boston Univ. to collect client traces in 1995
Major findings
  Power law distributions
    Distribution of document sizes
    Distribution of user requests for documents
    # requests to documents as a function of their popularity
44
Case Study II: UW Proxy Log Study
Approaches [WVS+99a, WVS+99b]
  Deploy a passive network sniffer between the Univ. of Washington and the rest of the Internet in May 1999
45
Approaches
46
Major Findings
Members of an organization are more likely to request the same documents than a random set of clients
Significant fraction of uncacheable documents
Significant fraction of audio/video content
47
Case Study III: MSNBC Server Log Study
MSNBC server site
  A large news site
  Server cluster with 40 nodes
  25 million accesses a day (HTML content alone)
  Period studied: Aug. – Oct. 99 & the Dec. 17, 98 flash crowd
Server logs
  HTTP access logs
  Content Replication System (CRS) logs
  HTML content logs
Data analysis
  Content dynamics
  Access dynamics
48
Analysis Approaches
49
Major Findings
Content dynamics
  Modification history is a rough predictor of future modifications
  Frequent but minimal file modifications
Access dynamics
  The set of popular files remains stable for days
  Domain membership has a significant bearing on client accesses, except during a flash crowd of global interest
  Zipf-like distribution of file popularity, but with a much larger $\alpha$ than at proxies
  Accesses to old documents account for most first-time misses, so it is hard to anticipate such accesses and eliminate these first-time misses
50
Case Study IV: Mobile Log Study
A popular commercial Web site for mobile clients
Content
  News, weather, stock quotes, email, yellow pages, travel reservations, entertainment, etc.
Services
  Notification
  Browse
Period studied
  3.25 million notifications in Aug. 20 – 26, 2000
  33 million browse requests in Aug. 15 – 26, 2000
51
Analysis Approaches
Analyze by user categories
  Cellular users
    Browse the Web in real time using cellular technologies
  Offline users
    Download content onto their PDAs for later (offline) browsing, e.g., AvantGo
  Desktop users
    Sign up for services and specify preferences
Analyze by Web services
  Browse
  Notifications
52
Major Findings
Notification services
  Popularity of notification messages follows a Zipf-like distribution
  The top 1% of notification objects account for 54-64% of total messages
  Exhibits geographical locality
Browse services
  0.1% - 0.5% of URLs account for 90% of requests
  The set of popular URLs remains stable
Correlation between the two services
  Correlation is limited
53
Tutorial Outline
Background Web Workload Performance Diagnosis
Overview Infer the causes of high end-to-end delay in
Web transfers [BC00] Infer the causes of high end-to-end loss rate
in Web transfers [PQW02a, PQW02b] Applications of traces
54
Overview
Goal: determine trouble spots using end-to-end Web traces
Metrics of interest
  Delay
  Loss rate
  Raw bandwidth
  Available bandwidth
  Traffic rate
Why interesting
  Resolve the trouble spots
  Server selection
  Placement of mirror servers
[Figure: a Web server reached through several ISP networks (Sprint, AT&T, UUNET, MCI, Qwest, AOL, Earthlink), with the question “Why so slow?”]
55
Finding the Sources of Delays
Goal
  Why is my Web transfer slow? Is it the server, the network, or the client?
Sources of delay in Web transfer
  DNS lookup
  Server delays
  Client delays
  Network delays
    Propagation delays
    Network variation delays
    Delays introduced by packet losses (e.g., signaled by the fast retransmit mechanism or TCP timeouts)
56
TCPEval Tool
Inputs: “tcpdump” packet traces taken at the communicating Web server and client
Generates a variety of statistics for file transactions File and packet transfer latencies Packet drop characteristics Packet and byte counts per unit time
Generates both timeline and sequence plots for transactions
Generates critical path profiles and statistics for transactions
57
Critical Path Analysis Tool
[Figure: two client-server timelines, one showing the data flow and one showing its critical path, with segments labeled network delay, server delay, client delay, and network delay due to packet loss]
58
Finding Sources of Packet Losses
Goal
  Identify lossy links rather than determine the exact loss rate
  Passive observation of existing traffic
  Active probing to discover the network topology can be done infrequently in the background
[Figure: a tree rooted at the server with links l1 to l8 and clients at the leaves; p1 to p5 are the clients’ end-to-end loss rates]
(1-l1)*(1-l2)*(1-l4) = (1-p1)
(1-l1)*(1-l2)*(1-l5) = (1-p2)
…
(1-l1)*(1-l3)*(1-l8) = (1-p5)
Under-constrained system of equations
59
#1: Random Sampling
Randomly sample the solution space
Repeat this several times
Draw conclusions based on overall statistics
How to do random sampling?
  Determine a loss rate bound for each link using the best downstream client
  Iterate over all links:
    Pick a loss rate at random within the bounds
    Update the bounds for other links
Problem: little tolerance for estimation error
[Figure: the server-rooted tree with links l1 to l8 and client loss rates p1 to p5]
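The random sampling procedure can be sketched as follows. The client paths are an assumption read off the figure and the loss equations (the paths of the third and fourth clients are elided on the slide), the client loss rates are hypothetical, and forcing each last-hop link to absorb the remaining loss is one simple way to make every sample satisfy the equations exactly; it may differ from the original method.

```python
import random

# Assumed topology: the links on each client's path from the server.
PATHS = {
    "c1": ["l1", "l2", "l4"],
    "c2": ["l1", "l2", "l5"],
    "c3": ["l1", "l3", "l6"],
    "c4": ["l1", "l3", "l7"],
    "c5": ["l1", "l3", "l8"],
}
ORDER = ["l1", "l2", "l3", "l4", "l5", "l6", "l7", "l8"]   # top-down


def sample_solution(client_loss, rng):
    """Draw one set of link loss rates consistent with the client loss rates."""
    # residual[c]: success probability on c's path not yet explained by chosen links
    residual = {c: 1.0 - p for c, p in client_loss.items()}
    link_loss = {}
    for link in ORDER:
        downstream = [c for c, path in PATHS.items() if link in path]
        # The best downstream client (highest residual) bounds this link's loss.
        bound = max(0.0, 1.0 - max(residual[c] for c in downstream))
        last_hop = all(PATHS[c][-1] == link for c in downstream)
        loss = bound if last_hop else rng.uniform(0.0, bound)
        link_loss[link] = loss
        for c in downstream:                 # update bounds for links below
            residual[c] /= (1.0 - loss)
    return link_loss


def average_loss(client_loss, runs=1000, seed=0):
    """Repeat the sampling and report the mean loss rate per link."""
    rng = random.Random(seed)
    totals = dict.fromkeys(ORDER, 0.0)
    for _ in range(runs):
        for link, loss in sample_solution(client_loss, rng).items():
            totals[link] += loss
    return {link: total / runs for link, total in totals.items()}


LOSSES = {"c1": 0.01, "c2": 0.02, "c3": 0.10, "c4": 0.12, "c5": 0.11}
print(average_loss(LOSSES))
```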
60
#2: Linear Optimization
Goals
  Parsimonious explanation
  Robust to error in client loss rate estimates
$L_i = \log(1/(1-l_i))$, $P_j = \log(1/(1-p_j))$
minimize $\sum_i L_i + \sum_j |S_j|$
subject to
  $L_1 + L_2 + L_4 + S_1 = P_1$
  $L_1 + L_2 + L_5 + S_2 = P_2$
  …
  $L_1 + L_3 + L_8 + S_5 = P_5$
Can be turned into a linear program
[Figure: the server-rooted tree with links l1 to l8 and client loss rates p1 to p5]
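The log transform is what makes the constraints linear: the product of per-link success rates becomes a sum. The sketch below only checks this identity and computes the slack for one path; it does not solve the linear program, and the loss rates are hypothetical.

```python
import math


def to_log_domain(rate):
    """Map a loss rate l to L = log(1 / (1 - l))."""
    return math.log(1.0 / (1.0 - rate))


def path_slack(link_rates, observed_path_loss):
    """Slack S such that sum(L_i) + S = P for one client path."""
    return to_log_domain(observed_path_loss) - sum(to_log_domain(l) for l in link_rates)


# Hypothetical loss rates on the three links of one client's path.
l1, l2, l4 = 0.01, 0.02, 0.05
p1 = 1.0 - (1.0 - l1) * (1.0 - l2) * (1.0 - l4)   # exact end-to-end loss rate
print(path_slack([l1, l2, l4], p1))               # close to zero
print(path_slack([l1, l2, l4], p1 + 0.01))        # positive: observed loss is higher
```

A nonzero slack absorbs measurement error in the client loss rate, which is why the objective penalizes its absolute value.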
61
# 3: Gibbs Sampling
$D$: observed packet transmissions and losses at the clients
$\Theta$: ensemble of loss rates of links in the network
Goal
  Determine the posterior distribution $P(\Theta \mid D)$
Approach
  Use Markov Chain Monte Carlo with Gibbs sampling to obtain samples from $P(\Theta \mid D)$
  Draw conclusions based on the samples
62
# 3: Gibbs Sampling (Cont.)
Applying Gibbs sampling to network tomography
  1) Initialize link loss rates arbitrarily
  2) For j = 1 : warmup
       for each link i
         compute $P(l_i \mid D, \{l_i'\})$ and draw a new $l_i$ from it, where $l_i$ is the loss rate of link $i$ and $\{l_i'\} = \bigcup_{k \ne i} l_k$
  3) For j = 1 : realSamples
       for each link i
         compute $P(l_i \mid D, \{l_i'\})$ and draw a new $l_i$ from it
  Use all the samples obtained at step 3 to approximate $P(\Theta \mid D)$
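The three steps can be sketched with a discretized ("griddy") Gibbs sampler. Everything specific here is an assumption made for illustration: the topology (read off the earlier figure), the packet counts in `DATA`, the uniform prior, the binomial likelihood per client, and the grid of candidate loss rates from 0.00 to 0.50. The original method may differ in all of these.

```python
import math
import random

PATHS = {
    "c1": ["l1", "l2", "l4"], "c2": ["l1", "l2", "l5"],
    "c3": ["l1", "l3", "l6"], "c4": ["l1", "l3", "l7"], "c5": ["l1", "l3", "l8"],
}
LINKS = ["l1", "l2", "l3", "l4", "l5", "l6", "l7", "l8"]
GRID = [i / 100 for i in range(51)]            # candidate loss rates
# Hypothetical observations per client: (packets sent, packets lost).
DATA = {"c1": (2000, 0), "c2": (2000, 0),
        "c3": (2000, 200), "c4": (2000, 200), "c5": (2000, 200)}


def log_likelihood(loss, data, clients):
    """Binomial log-likelihood of the observations at the given clients."""
    total = 0.0
    for c in clients:
        sent, lost = data[c]
        ok = 1.0
        for link in PATHS[c]:
            ok *= 1.0 - loss[link]             # end-to-end success probability
        if lost > 0:
            if ok >= 1.0:
                return float("-inf")           # losses impossible on a loss-free path
            total += lost * math.log(1.0 - ok)
        total += (sent - lost) * math.log(ok)
    return total


def gibbs(data, warmup=200, real_samples=300, seed=0):
    rng = random.Random(seed)
    loss = {link: 0.05 for link in LINKS}      # 1) arbitrary initialization
    samples = []
    for sweep in range(warmup + real_samples): # 2) warmup, then 3) real samples
        for link in LINKS:
            clients = [c for c in PATHS if link in PATHS[c]]
            log_w = []
            for candidate in GRID:             # conditional of l_i given the rest
                loss[link] = candidate
                log_w.append(log_likelihood(loss, data, clients))
            top = max(log_w)
            weights = [math.exp(w - top) for w in log_w]
            loss[link] = rng.choices(GRID, weights=weights)[0]
        if sweep >= warmup:
            samples.append(dict(loss))
    return samples


def posterior_mean(samples, link):
    return sum(s[link] for s in samples) / len(samples)


if __name__ == "__main__":
    result = gibbs(DATA)
    for link in LINKS:
        print(link, round(posterior_mean(result, link), 3))
```

With these counts the sampler attributes essentially no loss to l1, l2, l4, and l5, while the loss seen by the other three clients is split between l3 and their last-hop links, which the data alone cannot separate.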
63
Simulation Experiments
Advantage: no uncertainty about link loss rate!
Methodology
  Topologies used:
    Randomly generated: 20 - 3000 nodes, max degree = 5-50
    Real topology obtained by tracing paths to microsoft.com clients
  Randomly generated packet loss events at each link
    A fraction f of the links are good, and the rest are “bad”
    LM1: good links: 0 – 1%, bad links: 5 – 10%
    LM2: good links: 0 – 1%, bad links: 1 – 100%
  Goodness metrics:
    Coverage: # correctly inferred lossy links
    False positive: # incorrectly inferred lossy links
64
Random Topologies
[Figure: 1000-node random topologies (d=10, f=0.5). Bars for Random, LP, and Gibbs showing # true lossy links, # correctly identified lossy links, and # false positives; y-axis: # links]
[Figure: 1000-node random topologies (d=10, f=0.95). Same bars and axis]
Technique        Coverage   False positive   Computation
Random           High       High             Low
LP               Modest     Low              Medium
Gibbs sampling   High       Low              High
65
Trace-driven Validation
Validation approach
  Divide client traces into two sets: tomography and validation
  Tomography set => loss inference
  Validation set => check whether clients downstream of the inferred lossy links experience high loss
Experimental setup
  Real topologies and loss traces collected with traceroute and tcpdump at microsoft.com during Dec. 20, 2000 and Jan. 11, 2002
Results
  False positive rate is between 5 – 30%
  Likely candidates for lossy links:
    Links crossing an inter-AS boundary
    Links having a large delay (e.g., transcontinental links)
    Links that terminate at clients
66
Tutorial Outline
Background Web Workload Performance Diagnosis Applications of traces Bibliography
67
Part IV: Applications of Traces
Synthetic workload generation
Cache design
  Cache replacement policies [CI97, BCF+99]
  Cache consistency algorithms [LC97, YBS99, YAD01]
  Cooperative cache or not [WVS+99b]
  Cache infrastructure
Pre-fetching algorithms [CB98, FJC+99]
Placement of Web proxies/replicas [QPV01]
Other optimizations
  Improving TCP for Web transfers [Mah97, PK98, ZQK00]
  Concurrent downloads, pipelining, compression, …
…
68
Synthetic Workload Generation
Generate user requests
  Generate user sessions using a Poisson arrival process
  For each user session, determine # clicks using a Pareto distribution
  Assign each click to a request for a Web page, while making sure that
    The popularity of files follows a Zipf-like distribution [BC98]
    The temporal locality of successive requests for the same resource is captured
  Generate the next click from the same user with a think time following a Pareto distribution
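The request generation steps can be sketched as below. All parameter values and names are hypothetical, the cap on clicks per session is added only to keep the example bounded, and temporal locality is not modeled, so this is a starting point rather than a faithful reimplementation of any of the cited generators.

```python
import bisect
import itertools
import random


def pareto(rng, alpha, minimum):
    """Inverse-transform sample from a Pareto distribution."""
    return minimum / (1.0 - rng.random()) ** (1.0 / alpha)


def generate_clicks(num_sessions, num_pages, rng, session_rate=1.0,
                    zipf_alpha=1.0, clicks_alpha=1.5, think_alpha=1.5,
                    think_min=1.0, max_clicks=100):
    """Return a time-sorted list of (time, session id, page id) clicks."""
    weights = [1.0 / rank ** zipf_alpha for rank in range(1, num_pages + 1)]
    cumulative = list(itertools.accumulate(weights))
    clicks = []
    start = 0.0
    for session in range(num_sessions):
        start += rng.expovariate(session_rate)          # Poisson session arrivals
        n_clicks = min(max_clicks, int(pareto(rng, clicks_alpha, 1.0)))
        t = start
        for _ in range(n_clicks):
            # Zipf-like popularity: page 0 is the most popular.
            page = bisect.bisect_left(cumulative, rng.random() * cumulative[-1])
            clicks.append((t, session, page))
            t += pareto(rng, think_alpha, think_min)    # Pareto think time
    clicks.sort()
    return clicks


trace = generate_clicks(200, 50, random.Random(3))
print(len(trace), trace[:3])
```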
69
Synthetic Workload Generation (Cont.)
Generate Web pages
  Determine the number of Web pages
  Generate the size of each Web page using a file size distribution (log-normal)
  Associate each page with some number of embedded references using an empirical distribution (heavy-tailed)
Generate file modification events
Examples of generators
  Webbench [Wbe], WebStone [TS95], Surge [BC98], SPECweb99 [SP99], Web Polygraph [WP], …
70
Cache Replacement Policies
Problem formulation
  Given a fixed-size cache, which pages should be evicted once the cache is full so as to maximize the hit ratio?
Hit ratio
  Fraction of requests satisfied by the cache
  Fraction of the total size of requested data satisfied by the cache (byte hit ratio)
Factors to consider
  Request frequency
  Modification frequency
  Benefit of caching: reduction in latency & bandwidth
  Cost of caching: storage
71
Cache Replacement Policies (Cont.)
Approaches
  Least recently used (LRU)
  Least frequently used (LFU)
    Perfect: maintain counters for all pages seen
    In-cache: maintain counters only for pages that are in the cache
  GreedyDual-Size [CI97]
    Assign a utility value to each object, and replace the one with the lowest utility
Use of traces
  Evaluate the algorithms using trace-driven simulations or synthetic workloads
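A trace-driven simulation of a replacement policy is only a few lines for LRU. The sketch below replays a trace of (URL, size) pairs through a byte-limited LRU cache and reports both hit ratios; the trace is made up, and modifications, cacheability rules, and other policies are left out.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity_bytes):
    """Replay (url, size) requests; return (hit ratio, byte hit ratio)."""
    cache = OrderedDict()            # url -> size, least recently used first
    used = hits = hit_bytes = total_bytes = 0
    for url, size in trace:
        total_bytes += size
        if url in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(url)   # mark as most recently used
            continue
        if size > capacity_bytes:
            continue                 # too large to cache at all
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)   # evict the LRU object
            used -= evicted_size
        cache[url] = size
        used += size
    return hits / len(trace), hit_bytes / total_bytes


TRACE = [("a", 2), ("b", 1), ("a", 2), ("c", 1), ("b", 1)]
print(simulate_lru(TRACE, capacity_bytes=3))
```

Running the same trace through several policies and cache sizes gives the hit ratio curves typically used to compare them.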
72
Cache Consistency Algorithms
Problem formulation Factors to consider Approaches Use of traces
73
Placement of Web Proxies/Replicas
Problem formulation [JJK+01, QPV01]
  How to place a fixed number of proxies/replicas to minimize users’ request latency
Factors to consider
  Spatial distribution of requests
  Temporal stability of requests
    Stability in popularity of objects
    Stability in spatial distribution of requests
74
Placement of Web Proxies/Replicas (Cont.)
Approaches
  Random placement
  Greedy placement
  Hot-spot placement
Use of traces
  Trace-driven simulations
  High concentration of requests on a small number of objects, so focus on replicating only popular objects
  Temporal stability in requests, so there is no need to frequently change the locations of proxies/replicas
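Greedy placement can be sketched as repeatedly adding the site that lowers the total demand-weighted distance the most. This is an illustration under assumed inputs (a distance matrix and per-client request counts taken from a trace); the details of the algorithms in [JJK+01, QPV01] may differ.

```python
def greedy_placement(dist, demand, k):
    """dist[s][c]: cost from candidate site s to client c; demand[c]: request count.

    Returns (chosen site indices in pick order, total demand-weighted cost).
    """
    chosen = []
    best_dist = [float("inf")] * len(demand)     # distance to nearest chosen site
    for _ in range(k):
        best_site, best_cost = None, float("inf")
        for s in range(len(dist)):
            if s in chosen:
                continue
            cost = sum(d * min(cur, dist[s][c])
                       for c, (d, cur) in enumerate(zip(demand, best_dist)))
            if cost < best_cost:
                best_site, best_cost = s, cost
        chosen.append(best_site)
        best_dist = [min(cur, dist[best_site][c]) for c, cur in enumerate(best_dist)]
    total = sum(d * cur for d, cur in zip(demand, best_dist))
    return chosen, total


# Hypothetical example on a line: three candidate sites and four clients.
sites = [0, 5, 10]
clients = [0, 1, 9, 10]
demand = [2, 1, 1, 1]
dist = [[abs(s - c) for c in clients] for s in sites]
print(greedy_placement(dist, demand, k=2))
```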
75
References
[AS86] R. B. D’Agostino and M. A. Stephens. Goodness-of-Fit Techniques. Marcel Dekker, New York, NY, 1986.
[ABC+96] Virgilio Almeida, Azer Bestavros, Mark Crovella, and Adriana de Oliveira. Characterizing reference locality in the WWW. In Proceedings of 1996 International Conference on Parallel and Distributed Information Systems (PDIS'96), December 1996.
[ABQ01] A. Adya, P. Bahl, and L. Qiu. Analyzing Browse Patterns of Mobile Clients. In Proc. of SIGCOMM Measurement Workshop, Nov. 2001.
[ABQ02] A. Adya, P. Bahl, and L. Qiu. Characterizing Alert and Browse Services for Mobile Clients. In Proc. of USENIX, Jun. 2002.
[AL01] P. Albitz and C. Liu. DNS and BIND (4th Edition). O’Reilly & Associates, Apr. 2001.
[AW97] M. Arlitt and C. Williamson. Internet Web Servers: Workload Characterization and Performance Implications. IEEE/ACM Transactions on Networking, Vol. 5, No. 5, pp. 631-645, October 1997.
[BC98] P. Barford and M. Crovella. Generating representative workloads for network and server performance evaluation. In Proc. of SIGMETRICS, 1998.
[BBC+98] P. Barford, A. Bestavros, M. Crovella, and A. Bradley. Changes in Web Client Access Patterns: Characteristics and Caching Implications. Special Issue on World Wide Web Characterization and Performance Evaluation, World Wide Web Journal, December 1998.
76
References (Cont.)
[BCF+99] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proc. of INFOCOM, Mar. 1999.
[BC00] P. Barford and M. Crovella. Critical Path Analysis of TCP Transactions. In Proc. of ACM SIGCOMM, Aug. 2000.
[BLFF96] T. Berners-Lee, R. Fielding, and H. Frystyk. Hypertext Transfer Protocol -- HTTP/1.0. RFC 1945, May 1996.
[BPS+98] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, M. Stemm, and R. H. Katz. TCP Behavior of a Busy Internet Server: Analysis and Improvements. In Proc. IEEE Infocom, San Francisco, CA, USA, March 1998.
[CB98] M. Crovella and P. Barford. The network effects of prefetching. In Proc. of INFOCOM, 1998.
[CBC95] C. R. Cunha, A. Bestavros, and M. E. Crovella. Characteristics of WWW client-based traces. Technical Report BU-CS-95-010, CS Dept., Boston University, 1995.
[CI97] P. Cao and S. Irani. Cost-Aware WWW proxy caching algorithms. In Proc. of USITS, Dec. 1997.
[DFK+97] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. Mogul. Rate of change and other metrics: a live study of the World Wide Web. In Proc. of USITS, 1997.
[FCD+99] A. Feldmann, R. Caceres, F. Douglis, and M. Rabinovich. Performance of Web Proxy Caching in heterogeneous bandwidth environments. In Proc. of IEEE INFOCOM, March 1999.
77
References (Cont.)
[FJC+99] L. Fan, Q. Jacobson, P. Cao, and W. Lin. Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance. In Proc. of SIGMETRICS, 1999.
[GMF+99] J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol – HTTP 1.1. RFC 2616, Jun. 1999.
[JK88] V. Jacobson and M. J. Karels. Congestion Avoidance and Control. In Proc. SIGCOMM, Aug. 1988.
[JJK+01] S. Jamin, C. Jin, A. R. Kurc, D. Raz, and Y. Shavitt. Constrained Mirror Placement on the Internet. In Proc. of INFOCOM, Apr. 2001.
[Jain91] R. Jain. The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991.
[Kel02] T. Kelly. Thin-Client Web Access Patterns: Measurements from a Cache-Busting Proxy. Computer Communications, Vol. 25, No. 4 (March 2002), pages 357-366.
[KR01] B. Krishnamurthy and J. Rexford. Web Protocols and Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement. Addison-Wesley, May 2001.
[LC97] C. Liu and P. Cao. Maintaining Strong Cache Consistency in the World-Wide Web. In Proc. of ICDCS'97, pp. 12-21, May 1997.
[LNJV99] Z. Liu, N. Niclausse, and C. Jalpa-Villaneuva. Web Traffic Modeling and Performance Comparison Between HTTP 1.0 and HTTP 1.1. In Erol Gelenbe, editor, System Performance Evaluation: Methodologies and Applications. CRC Press, Aug. 1999.
78
References (Cont.)
[Mah97] Bruce Mah. An empirical model of HTTP network traffic. In Proc. of INFOCOM, April 1997.
[Mogul95] Jeffrey C. Mogul. The Case for Persistent-Connection HTTP. In Proc. SIGCOMM '95, pages 299-313, Cambridge, MA, August 1995.
[MDF+97] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy. Potential benefits of delta-encoding and data compression for HTTP. In Proc. of SIGCOMM, September 1997.
[Pad95] V. N. Padmanabhan. Improving World Wide Web Latency. Technical Report UCB/CSD-95-875, University of California, Berkeley, May 1995.
[PQ00] V. N. Padmanabhan and L. Qiu. The Content and Access Dynamics of a Busy Web Server. In Proc. of SIGCOMM, Aug. 2000.
[PQW02a] V. N. Padmanabhan and L. Qiu. Network Tomography using Passive End-to-End Measurements. DIMACS on Internet and WWW Measurement, Mapping and Modeling, Feb. 2002.
[PQW02b] V. N. Padmanabhan, L. Qiu, and H. J. Wang. Passive Network Tomography using Bayesian Inference. Internet Measurement Workshop, Nov. 2002.
[QPV01] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the Placement of Web Server Replicas. In Proc. of INFOCOM, Apr. 2001.
[SP99] SPECWeb99 Benchmark. http://www.spec.org/osg/web99/.
[Pax98] V. Paxson. An Introduction to Internet Measurement and Modeling. SIGCOMM’98 tutorial, August 1998.
[TS95] G. Trent and M. Sake. WebStone: The First Generation in HTTP Server Benchmarking, Feb. 1995. http://www.mindcraft.com/webstone/paper.html.
[Wbe] Webbench. http://www.zdnet.com/etestinglabs/stories/benchmarks/0,8829,2326243,00.html.
79
References (Cont.)
[WP] Web Polygraph: Proxy performance benchmark. http://polygraph.ircache.net/.
[WVS+99a] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, and H. Levy. Organization-Based Analysis of Web-Object Sharing and Caching. In Proc. of the Second USENIX Symposium on Internet Technologies and Systems, Boulder, CO, October 1999.
[WVS+99b] A. Wolman, G. M. Voelker, N. Sharma, N. Cardwell, A. Karlin, and H. M. Levy. On the scale and performance of cooperative Web proxy caching. In Proc. of the 17th ACM Symposium on Operating Systems Principles, Kiawah Island, SC, Dec. 1999.
[YAD01] J. Yin, L. Alvisi, M. Dahlin, and A. Iyengar. Engineering server-driven consistency for large scale dynamic services.
[YBS99] H. Yu, L. Breslau, and S. Shenker. A Scalable Web Cache Consistency Architecture. In Proc. of SIGCOMM, August 1999.
80
Thank you!
81
Tutorial Outline Background Web Workload Characterization
Motivation Data Analyses and fittings Understanding the limitations Content dynamics Access dynamics Case Studies Synthetic workload generation
Performance Diagnosis Infer causes of high end-to-end delay in Web transfers Infer causes of high end-to-end loss in Web transfers
Applications of traces
82
Part II: Web Workload Characterization Overview
Process of trace analyses Common analysis techniques & tools Challenges in workload characterization
Content dynamics File size distribution File update patterns
Access dynamics File popularity distribution Temporal stability Spatial locality Browser sessions: length & arrival pattern
Common pitfalls Case studies
Boston University client log study, UW proxy log study, MSNBC server log study, a mobile log study