12 November 2003
Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan
Microsoft Research Cambridge
James Bulpin, University of Cambridge
Magpie: Distributed request tracking for realistic performance modelling

Performance in distributed systems
• Faults in distributed systems are notoriously hard to diagnose
• Performance problems are even more subtle to debug
  - Often transient or affect only a subset of requests / users
  - Frequently involve complex interactions between multiple machines
  - Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie Approach
• Track individual requests end to end
  - Observe control flow (causality)
  - Monitor resource consumption: CPU, bandwidth, disk
  - Debug performance “in the small”
• Build a probabilistic workload model from the aggregate requests
  - Cluster similar requests according to their observed behaviour
  - Debug performance “in the large”

How do we use this information?
• Performance debugging
  - Why did this request take much longer than that request?
  - Fault detection
  - Configuration and management
• Performance prediction
  - Realistic workload models for capacity planning
  - Obtain automatically on a “live” system

Magpie components
• Instrumentation
  - System activity recorded to logs
• Generic request parser
  - Extract individual requests from logs according to an event schema
• Model construction
  - Behavioural clusters
  - Probabilistic state machine

Outline
• Introduction
• What is a request?
• Instrumentation
• Request extraction
• Modelling
• Current status

What is a request?
• System activity which takes place in response to an action initiated by the application being traced
  - HTTP request
  - Database query
  - File open request
• We describe a request as
  - The sequence of application components involved in its processing
  - The resources consumed at each stage
    - CPU, bandwidth, disk transfer size, (latency)
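
As an illustration of this description, a request could be held as an ordered list of stages, each naming a component and the resources it consumed. This is only a sketch: the class names, field names and units are assumptions for the example and are not taken from Magpie itself.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One step of a request: which component ran and what it consumed."""
    component: str            # e.g. "IIS", "ASP.NET", "SQL Server"
    cpu_ms: float = 0.0       # CPU time used in this stage
    net_bytes: int = 0        # bytes sent or received
    disk_bytes: int = 0       # bytes read or written
    latency_ms: float = 0.0   # elapsed time, listed as optional on the slide


@dataclass
class Request:
    """A request as an ordered sequence of stages (hypothetical layout)."""
    kind: str                                   # e.g. "HTTP request"
    stages: list = field(default_factory=list)

    def components(self):
        """Sequence of components involved in processing the request."""
        return [s.component for s in self.stages]

    def total(self, resource):
        """Sum one resource field (such as "cpu_ms") over all stages."""
        return sum(getattr(s, resource) for s in self.stages)


req = Request("HTTP request", [
    Stage("IIS", cpu_ms=5.0, net_bytes=1024),
    Stage("ASP.NET", cpu_ms=6.0),
    Stage("SQL Server", cpu_ms=3.0, disk_bytes=24 * 1024),
])
print(req.components(), req.total("cpu_ms"))
```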

A typical e-commerce site (2)
[Figure: block diagram of the site. Web Server labels: Kernel, http.sys, Filter, IIS, Static Content, CLR, ASP.NET, ADO.NET, Application Logic, WinSock2 API. SQL Server labels: Kernel, WinSock2 API, Stored procedures, Data.]

HTTP request: detailed view
[Figure: timeline of one HTTP request from 10.051s to 10.155s, with rows for web server threads WEB.eec and WEB.398 and SQL Server thread SQL.9c4, plus Disk, Net RX and Net TX lanes. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other.]
Annotated events:
• HTTP request packet
• IIS worker thread picks up request from http.sys
• Sync WinSock send to SQL Server
• ASP.NET thread blocks after RPC to database
• ASP.NET worker thread takes over
• TDS request and reply packets sent and received
• SQL thread unblocks
• HTTP response packets sent back to client
• IIS worker thread wakes up to write log

Why is request tracking hard?
• Many components, multiple machines
  - Must track control flow across machines
• No globally unique request ID
  - Components are developed independently
• Multiple thread pools
  - Many threads participate in processing a request
• Asynchronous communication
  - Must match send/recvs between threads/machines
• Hand-rolled synchronization primitives
  - SQL Server has user-mode scheduler

Event Tracing for Windows
• Low-overhead event mechanism
  - Events timestamped with cycle counter
  - Global ordering on events on a single machine
  - Can enable/disable sets of events at runtime
• Using ETW in Magpie
  - Each instrumentation point posts an event
  - Events are logged to disk
  - Logs are post-processed to extract requests
  - Can also consume events in real time

Instrumentation points
• Existing ETW event providers
  - IIS, kernel
• App-specific hooks
  - IIS, ASP.NET, SQL Server
• Detours
  - Wrap DLLs to trap Win32 and WinSock2 calls
• WinPcap
  - Capture packets on the wire

CPU usage from kernel events
• The ETW kernel logger records every context switch
  - How do we know which cycles are used for which request?
• We can attribute cycles to a request by
  - An application-specific event which occurs within a delimited sector of CPU time, or
  - The current context of execution, e.g. thread id
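
The second option, attribution by thread id, can be sketched as follows. The sketch assumes a list of context-switch records for one CPU and a fixed thread-to-request binding; the function name, the record layout and the binding table are all hypothetical, and real bindings would change over the course of a trace.

```python
from collections import defaultdict


def attribute_cycles(cswitches, thread_to_request, end_cycle):
    """Charge the cycles between consecutive context switches to a request.

    cswitches: list of (cycle_count, incoming_thread_id) for a single CPU,
               sorted by cycle_count.
    thread_to_request: dict mapping thread id -> request id (assumed fixed
               for the whole trace, which is a simplification).
    end_cycle: cycle count at which the trace ends.
    """
    usage = defaultdict(int)
    for i, (start, tid) in enumerate(cswitches):
        # The incoming thread runs until the next context switch.
        stop = cswitches[i + 1][0] if i + 1 < len(cswitches) else end_cycle
        request = thread_to_request.get(tid, "unattributed")
        usage[request] += stop - start
    return dict(usage)


trace = [(100, 0xA), (250, 0xB), (400, 0xA), (450, 0xC)]
bindings = {0xA: "request-1", 0xB: "request-2"}
usage = attribute_cycles(trace, bindings, end_cycle=500)
print(usage)
```

Cycles spent in threads with no known binding are kept under a separate "unattributed" key rather than dropped, so the totals still add up to the length of the trace.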

Example: protocol processing in a DPC
[Figure: kernel events (cswitch, DPC start, pkt recv, DPC end, cswitch) plotted against time, with the cycle counts between them attributed to Request 1 and Request 2.]

Application and middleware events
• Cover points where flow of control moves between components
• Cover points where resources are multiplexed and demultiplexed
  - E.g. user-level scheduling primitives
• Propagation of a global request id is not required!
  - Magpie used to do this but not any more

Instrumenting a web service
[Figure: the e-commerce site diagram annotated with instrumentation points. Instrumentation labels: ISAPI Filter, HTTPModule, CLR profiler, WinSock2 API Intercept, Wrappers, Extended SPs, Event Tracing for Windows, Packet capture. Component labels: Web Server (Kernel, http.sys, Filter, IIS, Static Content, CLR, ASP.NET, ADO.NET, Application Logic) and SQL Server (Stored procedures, Data).]

Generic request extraction
• No inbuilt assumptions about the system or the application
  - No common unique identifier
• Schema specifies semantics of events
  - Easy to add new event types
• Parser stitches events into requests based on event semantics

Terminology
• Namespace
  - Event parameter which references an entity in the system, e.g. thread id
• Timeline
  - Instantiation of a namespace with a unique value, e.g. thread id = 0xa
• Events bind or unbind requests to timelines
  - Bindings capture the semantics of each event for a particular request type
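
To make the bind/unbind idea concrete, here is a toy stitcher. It is a sketch under stated assumptions, not Magpie's parser: the schema format, the event names (`pkt_recv`, `recv_returns`, `cpu_work`, `send_reply`) and the function `stitch` are invented for the example. A timeline is represented as a `(namespace, value)` pair, and each event joins whichever request is currently bound to one of the timelines it mentions.

```python
# Hypothetical schema: for each event type, the namespaces it mentions,
# the ones it binds to the request, the ones it unbinds, and whether the
# event may start a new request.
SCHEMA = {
    "pkt_recv":     {"namespaces": ["connid"], "bind": ["connid"],
                     "unbind": [], "starts_request": True},
    "recv_returns": {"namespaces": ["connid", "tid"], "bind": ["tid"],
                     "unbind": [], "starts_request": False},
    "cpu_work":     {"namespaces": ["tid"], "bind": [],
                     "unbind": [], "starts_request": False},
    "send_reply":   {"namespaces": ["connid", "tid"], "bind": [],
                     "unbind": ["connid", "tid"], "starts_request": False},
}


def stitch(events, schema):
    """Group events into requests using timeline bindings only."""
    bound = {}      # timeline (namespace, value) -> request id
    requests = {}   # request id -> list of event type names
    next_id = 1
    for ev in events:
        rule = schema[ev["type"]]
        timelines = [(ns, ev[ns]) for ns in rule["namespaces"]]
        # Find a request already bound to a timeline this event touches.
        rid = next((bound[t] for t in timelines if t in bound), None)
        if rid is None:
            if not rule["starts_request"]:
                continue  # event belongs to no tracked request
            rid = next_id
            next_id += 1
            requests[rid] = []
        requests[rid].append(ev["type"])
        for ns in rule["bind"]:
            bound[(ns, ev[ns])] = rid
        for ns in rule["unbind"]:
            bound.pop((ns, ev[ns]), None)
    return requests


events = [
    {"type": "pkt_recv", "connid": 0xD},
    {"type": "pkt_recv", "connid": 0xE},
    {"type": "recv_returns", "connid": 0xD, "tid": 0xA},
    {"type": "recv_returns", "connid": 0xE, "tid": 0xB},
    {"type": "cpu_work", "tid": 0xB},
    {"type": "cpu_work", "tid": 0xA},
    {"type": "send_reply", "connid": 0xD, "tid": 0xA},
    {"type": "cpu_work", "tid": 0xA},   # thread 0xa is no longer bound
    {"type": "send_reply", "connid": 0xE, "tid": 0xB},
]
requests = stitch(events, SCHEMA)
print(requests)
```

The two interleaved requests are separated without any shared request identifier: the connection timeline carries the request to a thread, and the reply event releases both timelines.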

Example: connecting events
[Figure: events (TCP pkt, DPC start, DPC end, cswitch, Enter Recv, Recv returns) placed on the timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, and joined into Request 1 and Request 2.]

End-to-end request extraction
• An instance of the request parser runs on each machine in the distributed system
  - Online or offline mode
• Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
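
The offline join could look roughly like the sketch below, which groups fragments that share at least one packet identifier using a small union-find. The fragment layout (`machine`, `local_id`, `packet_ids`) and the function name are assumptions made for the example.

```python
def join_fragments(fragments):
    """Group per-machine request fragments that share a packet identifier.

    fragments: list of dicts with keys "machine", "local_id" and
               "packet_ids" (a set of identifiers seen on the wire).
    Returns a sorted list of groups, each a sorted list of
    (machine, local_id) pairs.
    """
    parent = list(range(len(fragments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # packet id -> index of first fragment mentioning it
    for i, frag in enumerate(fragments):
        for pid in frag["packet_ids"]:
            if pid in first_seen:
                parent[find(i)] = find(first_seen[pid])
            else:
                first_seen[pid] = i

    groups = {}
    for i, frag in enumerate(fragments):
        key = (frag["machine"], frag["local_id"])
        groups.setdefault(find(i), []).append(key)
    return sorted(sorted(g) for g in groups.values())


fragments = [
    {"machine": "web", "local_id": "r1", "packet_ids": {101, 102}},
    {"machine": "sql", "local_id": "q7", "packet_ids": {102, 103}},
    {"machine": "web", "local_id": "r2", "packet_ids": {201}},
    {"machine": "sql", "local_id": "q8", "packet_ids": {201, 202}},
]
joined = join_fragments(fragments)
print(joined)
```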

Clustering for workload generation
• Target the Indy performance modelling tool
  - Calculates throughput, bottlenecks
  - Needs transaction mix, resource consumption
• Previously: microbenchmark approach
  - Run 10000 of each “transaction type” (URL)
  - Divide aggregate resource usage by 10000
• Aim: provide realistic workload models
  - From real, mixed workloads
  - Derive transaction “types” automatically
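
Single request: cartoon view
• Partial ordering of events
  - Annotated with resource usage
[Figure: one request drawn as segments on the lanes IIS CPU, ASP.NET CPU, SQL Server CPU, Disk and Network. Segments are annotated with CPU times (1ms to 6ms), disk reads (192k and 24k) and network transfer sizes (6k/1k and 12k/1k).]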

Behavioural clustering of requests
• Represent requests as event strings
  - “Flatten” out any concurrency
• Use Levenshtein string edit distance
  - Modified to factor in resource usage vectors
• Cluster requests based on this distance
  - Linear-time algorithm
• Each cluster is a request “type”
  - Select representative from near centroid
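
The slides do not say how resource usage is folded into the distance or which clustering algorithm is used, so the following is only one plausible reading. It combines plain Levenshtein distance with a weighted L1 gap between resource vectors (an assumption), and uses a single-pass "leader" clustering, which makes one pass over the requests. The first member of each cluster stands in as its representative, which is simpler than choosing one near the centroid.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def request_distance(r1, r2, weight=0.1):
    """Edit distance on event strings plus a weighted resource gap.

    Each request is (event_string, resource_vector). The additive
    combination and the weight are assumptions for this sketch.
    """
    events1, usage1 = r1
    events2, usage2 = r2
    resource_gap = sum(abs(x - y) for x, y in zip(usage1, usage2))
    return edit_distance(events1, events2) + weight * resource_gap


def leader_cluster(requests, threshold):
    """Single pass: join the nearest leader within threshold, else lead."""
    clusters = []  # list of (leader, members)
    for req in requests:
        best = None
        for leader, members in clusters:
            d = request_distance(req, leader)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, members)
        if best is None:
            clusters.append((req, [req]))
        else:
            best[1].append(req)
    return clusters


# Event letters are made up: I = IIS, A = ASP.NET, S = SQL, D = disk.
# Resource vectors are (cpu_ms, kilobytes transferred).
reqs = [("IASAI", (14, 6)), ("IASAI", (15, 6)), ("ID", (2, 192)),
        ("IASSAI", (16, 7)), ("ID", (2, 190))]
clusters = leader_cluster(reqs, threshold=3)
print([len(members) for _, members in clusters])
```

Leader clustering depends on the order in which requests arrive and on the threshold, so it is a stand-in for whatever linear-time algorithm the authors used rather than a reconstruction of it.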

Build a workload model by clustering similar requests
Requests in the same cluster often have different URLs, and one URL may appear in many clusters
[Figure: a representative request for each of five clusters, drawn in the cartoon view with per-stage CPU times, disk reads and network transfer sizes. Percentage labels: A 7%, B 10%, C 15%, D 5%, E 63%.]

Taking it further: work-in-progress
• Online and incremental modelling:
  - Detect component failure
  - Detect sudden shifts in workload
• More sophisticated models
  - Learn the probabilistic state machine for each request
  - cf. flowcharts annotated with performance information
• “Bayesian watchdogs”
  - Compute the likelihood of a request’s behaviour as it moves through the system
  - Deal with “unlikely” requests appropriately
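
The slides give no detail on how a watchdog would score a request. One simple reading, sketched here, tracks the cumulative log-likelihood of the observed event sequence under a first-order transition model and flags the first step at which it drops below a threshold. The transition table, the `floor` probability for unseen transitions and the function names are all assumptions.

```python
import math


def running_log_likelihood(events, transitions, floor=1e-6):
    """Yield the cumulative log-likelihood after each observed event.

    transitions: dict mapping (previous_event, event) -> probability,
                 with "START" as the initial state.
    floor: probability assigned to transitions the model has never seen.
    """
    total = 0.0
    prev = "START"
    for ev in events:
        total += math.log(transitions.get((prev, ev), floor))
        yield total
        prev = ev


def first_unlikely_step(events, transitions, threshold):
    """Index of the first event at which the request looks unlikely."""
    for step, ll in enumerate(running_log_likelihood(events, transitions)):
        if ll < threshold:
            return step
    return None


model = {
    ("START", "IIS"): 1.0,
    ("IIS", "ASP.NET"): 0.8,
    ("IIS", "Disk"): 0.2,
    ("ASP.NET", "SQL"): 0.9,
    ("ASP.NET", "IIS"): 0.1,
    ("SQL", "ASP.NET"): 1.0,
}
threshold = math.log(1e-3)
normal = ["IIS", "ASP.NET", "SQL", "ASP.NET"]
odd = ["IIS", "SQL", "ASP.NET"]
print(first_unlikely_step(normal, model, threshold))
print(first_unlikely_step(odd, model, threshold))
```

Because the score is updated per event, a request can be flagged while it is still in flight, which matches the "as it moves through the system" wording.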

Current status
• Recent focus has been developing a generic request extraction scheme
  - Prototype for 2-machine e-commerce site
  - TPC-W style workload
• Prototype for single machine SQL Server 2000
  - Challenge is user mode scheduler
  - TPC-C workload
• Other applications on the way
  - Large-scale
  - “Real” systems with “real” performance problems

Conclusion
• Magpie is a tool for performance analysis in a distributed system
• Bottom up, per-request approach
  - Complementary to existing techniques:
    - Performance counters
    - Program profiling
• Feeds into performance debugging and prediction tools

Work-in-progress: learning the probabilistic state machine
• Infer a stochastic context-free grammar from a sample set of strings
  - Each state transition emits a character and has an associated probability
  - Use the Alergia algorithm (Carrasco & Oncina '94)
    - Construct a prefix tree from the sample set
    - Merge similar subtrees
• Apply to Magpie requests
  - “Just” event strings…
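
The first step, building a prefix tree from sample strings, is easy to sketch. The code below counts how many samples pass through and end at each node, then turns those counts into per-node transition probabilities. The merge step that Alergia applies afterwards is deliberately left out, and the node layout and the `"$"` end marker are choices made for this example.

```python
def new_node():
    return {"count": 0, "ends": 0, "next": {}}


def build_prefix_tree(samples):
    """Build a frequency prefix tree from a list of event strings.

    Each node is a dict with:
      "count": number of samples passing through the node,
      "ends":  number of samples ending at the node,
      "next":  dict mapping a character to the child node.
    """
    root = new_node()
    for s in samples:
        node = root
        node["count"] += 1
        for ch in s:
            node = node["next"].setdefault(ch, new_node())
            node["count"] += 1
        node["ends"] += 1
    return root


def transition_probabilities(node):
    """Return {char: probability} for leaving a node, "$" meaning stop."""
    probs = {ch: child["count"] / node["count"]
             for ch, child in node["next"].items()}
    probs["$"] = node["ends"] / node["count"]
    return probs


# Made-up event strings: I = IIS, A = ASP.NET, S = SQL, D = disk.
tree = build_prefix_tree(["IAS", "IAS", "IA", "ID"])
after_i = tree["next"]["I"]
print(transition_probabilities(after_i))
```

Each node's outgoing probabilities, including the stop probability, sum to one, so the unmerged tree is already a probabilistic state machine that reproduces the sample frequencies exactly.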