TRANSCRIPT
Capturing Comprehensive Storage Workload Traces in Windows
Dr. Bruce Worthington
Windows Server Performance
Microsoft Corporation
My Motivation
• I'm tired of seeing storage research and performance analysis limited by real-world trace availability.
– It's not all that much better now than it was when I was in grad school in the early '90s…
• I've been saying I would help supply researchers with long-term real-world traces and post-processing tools for almost a decade.
– I'm finally following through on the promise to supply production traces.
– More importantly, Microsoft has made it easy for anyone to capture detailed storage workload traces on Windows systems (along with many other types of traces and profiles).
Outline
• The tools
– Event Tracing for Windows (ETW)
– New: xperf, xperfinfo
– Old: logman, trace*
• The traces
– Benchmark (steady-state) workloads
– Production workloads
• The challenge
Event Tracing for Windows (ETW)
• ETW has been the core Windows tracing component since Windows 2000 and is continually improved
• Many Windows components, including the kernel, produce events describing their behavior
• Events from user-mode applications and kernel-mode drivers can be logged
• High performance, low overhead, highly scalable
– Efficient buffering and non-blocking logging mechanisms using per-processor buffers written to disk by a separate thread
• Tracing can be enabled/disabled dynamically without requiring system reboots or application restarts
Event Tracing for Windows (ETW)
• Events can be sample-based, but most are single-instance (event A occurred at time T)
• Support for real-time consumption and file-based tracing
• Configurable logging mode, buffer size, buffer count
– Sequential traces
– Circular traces
– Circular traces in memory (black-box flight recorder) [Vista]
• Adding custom events enables better correlation of application activity with low-level resource usage
• On a standard Vista computer, the logging API (EventWrite) takes about 5,000 cycles, mostly spent acquiring the timestamp via QueryPerformanceCounter (QPC)
– About 2.5% processor overhead for a sustained rate of 10,000 events/second on a 2GHz processor, not including the cost of flushing trace buffers to disk
• Postprocessing the binary disk log correlates events with context and domain-specific knowledge
ETW Architecture
• Provider
– Provides event traces. Can be a user-mode app, kernel-mode driver, or the kernel itself.
– Providers use ETW APIs to register with the ETW framework and send event traces from various points in the code.
– When enabled, a provider sends event traces to a specific trace session designated by the controller.
• Controller
– Starts, stops, or updates trace sessions in the kernel, and enables or disables providers
– Sets trace session properties such as sequential or circular file logging, or direct delivery to consumers
• Consumer
– Reads trace files or listens to active trace sessions and processes logged events
– Not aware of the providers
– Only receives event traces from the trace sessions or log files
Introduction
• Event Trace Session infrastructure
– Brokers the event traces from provider(s) to consumer(s) and adds data to each event (e.g., TimeStamp, Thread, Process, CPU)
"NT Kernel Provider"
• Events
– Process, thread, and image
– Sampled Profile
– Context Switch
– Dispatcher (Ready Thread)
– DPC (Deferred Procedure Call)
– ISR (Interrupt Service Routine)
– Disk I/O
– File I/O
– Registry
– Hard fault
– Page fault
– Driver delay
– TCP/UDP
– Power
– ALPC
– Virtual Allocation
– Heap
– Memory
– …
• Related providers
– Thread Pool
– Power Transition
– Winlogon
– Services
– Prefetch
• Other providers
– Shell
– Internet Explorer
– Media Foundation
– Media Center
– …
System Config: ETW Instrumentation
• Automatically added to kernel traces
• Rundown of system configuration at trace start/stop time
– CPU (number of logical and physical processors, frequency)
– Memory (memory size, page size, allocation granularity)
– Disk (physical disks, partitions, volumes)
– Video adapters
– Network adapters (IPv4, IPv6)
– Services (including service tag)
– Plug-and-Play information
– IRQ assignment
– Power capabilities (S1-S5)
– Network identity (computer name, domain name)
– Group masks (which kernel flags are enabled)
Storage-Related Instrumentation
• Disk Events:
– Read, Write, Flush Initiation/Completion
• File Events:
– Filename Create, Delete, Rundown (when trace stopped)
– File I/O Initiation, Hard Fault
• Create, Cleanup, Close, Flush, Read, Write, Set Info, Query Info, FSCTL, Delete, Rename, Directory Enumeration, Directory Notification
– File I/O Completion
• Driver Events:
– Driver Call, Return (Major Function)
– Driver Complete Request, Complete Request Return
– Driver Completion Routine
• Binary storage-related event sizes range from ~30-80 bytes (not counting events that dump unique filenames)
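The quoted 30-80 byte range is plausible given the fields these events carry. The packed layout below is purely hypothetical (it is not the real ETW record format); it only illustrates how a disk I/O record with a timestamp, identifiers, offset, and size lands in that range:

```python
import struct

# Hypothetical packed layout for a disk I/O completion record
# (NOT the real ETW format; fields chosen from those the slides mention):
# 8B timestamp, 4B thread ID, 4B disk number, 8B byte offset,
# 4B transfer size, 4B flags, 8B request pointer, 8B response time
DISK_EVENT_FMT = "<QIIQIIQQ"
event_size = struct.calcsize(DISK_EVENT_FMT)
print(event_size)  # 48 bytes, comfortably inside the ~30-80 byte range
```

Variable-length data such as filenames is what pushes some event types well past this, which is why the slide excludes filename-dumping events from the range.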
The new tools: xperf & xperfinfo
• Extensible performance analysis toolset
• High-level control and decoding of ETW traces
– Emphasis on kernel events and system-wide resource usage
– Support for 3rd-party events, primarily in conjunction with kernel events
• Cross-platform
– Windows XP SP1+, Vista
– Windows Server 2003, Windows Server 2008
• Cross-architecture (x86, x64, ia64)
• Capture-anywhere, process-anywhere
xperf
• Detailed interactive analysis of performance traces
• High-level resource usage graphs on a common trace timeline with zoom capability
• Low-level discrete graphs for resource state transitions
– Individual context switch and disk I/O events
• Powerful interactive summary tables with dynamic grouping, sorting, and aggregation capability
Currently Available Graphs and Summary Tables
– Disk I/O Counts
– Disk I/O Detail
– Disk Utilization
– File I/O
– DPC (Deferred Procedure Call)
– ISR (Interrupt Service Routine)
– Hard fault
– Page fault
– Driver Delay
– Sample Profile
– CPU Availability
– CPU Scheduling
– Process Lifetime
– Registry Counts
– Services
– Plug 'n' Play
– Marks
– Generic
– …
Overview of xperf
Storage Activity Notes
• Disk Reads, Writes, and Flushes
– Vista introduced low-priority I/Os, which are deferred in a special queue to allow current and near-term future normal-priority I/Os to complete
– Flushes may be "completed" by low-level storage drivers under certain conditions
• Hard Faults
– Synchronous I/Os that block execution of the issuing thread
– Page-in from disk of pages not currently present in memory
• Communication aspect: system read-ahead and write-back
– Asynchronous just-in-time prefetch for sequentially read buffered files
– Asynchronous buffered writes
– Issued from the "System (4)" process
• xperf & xperfinfo infer disk "queue" parameters (wait time, service time, queue depths, skip behavior) assuming a single serialized spindle
– Not a valid assumption for disk arrays
– Queue depth can be thought of as "number of requests in flight"
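Under that single-serialized-spindle assumption, wait time, service time, and queue depth can all be reconstructed from initiation/completion timestamp pairs. A minimal sketch of the inference (the tuple-list input format is an assumption for illustration, not the xperf trace layout):

```python
def infer_queue_params(ios):
    """ios: list of (init_time, completion_time) pairs for one disk,
    sorted by initiation time. Assumes a single serialized spindle:
    a request's service begins only after the previous one completes."""
    results = []
    prev_completion = 0.0
    for init, completion in ios:
        # Service starts at arrival or when the spindle frees up, whichever is later.
        service_start = max(init, prev_completion)
        wait = service_start - init            # time spent queued
        service = completion - service_start   # time "on the spindle"
        # Queue depth at arrival = number of requests in flight,
        # i.e. already initiated but not yet completed.
        depth = sum(1 for i2, c2 in ios if i2 <= init < c2)
        results.append({"wait": wait, "service": service, "queue_depth": depth})
        prev_completion = completion
    return results

ios = [(0.0, 5.0), (1.0, 9.0), (2.0, 12.0)]
stats = infer_queue_params(ios)
# Second request: waits 4.0 (until t=5), served for 4.0; depth 2 at arrival.
```

On a disk array several requests are genuinely in service at once, so the wait/service split computed this way is wrong there, exactly as the slide cautions.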
xperf Demo
• System Config
• CPU
• DPC & ISR
• Process
• Disk Summary Table
• Disk I/O Detail
• Disk I/O Detail Summary
• Hard Faults
Sample xperf Screenshots
• Sidebar chart selection
• Selecting a time range
• CPU Usage Summary Table
• DPC and ISR CPU Usage frames
• Disk I/O Summary Table
• Disk I/O Detail
• Disk I/O Detail Summary Table
[Screenshot slides: sidebar frame selector and frame scrollbar; selecting a time range on the CPU Usage frame; context-menu navigation to the CPU Usage Summary Table (status bar report, % of time excluding DPC and ISR, % total time, selected time interval, closing the summary table); DPC and Interrupt CPU Usage frames; Disk I/O Summary Table, expanded to individual I/Os; Disk I/O Detail graphs for Disk #0 and Disk #1, with disk and process selection; Disk I/O Detail Summary Table with its default sort field.]
xperfinfo
• High-level control and decoding
• Merging and dumping of ETW traces
• Many command-line actions to analyze and report on various aspects of a trace
• Various buffering and log file options
• Multiple timer sources
• Traces of boot activity
xperfinfo Demo
• Start/stop
• Provider list
• Dump
• Postprocessing summaries
Taking a Kernel Trace
• Start kernel trace; run scenario; stop and merge:
C:\analysis> xperfinfo -on base+FILE_IO+INTERRUPT
C:\analysis> MyTestApp.exe
C:\analysis> xperfinfo -d trace.etl
• Start user trace; run scenario; stop:
C:\analysis> xperfinfo -start MySession -on Kerberos+MRxSmb -f kerberos.etl
C:\analysis> MyTestApp.exe
C:\analysis> xperfinfo -stop MySession
• Hint: Retrieve all known kernel flags and groups:
C:\analysis> xperfinfo -help providers
Dumping a Trace
C:\analysis> xperfinfo -i trace.etl -o trace.txt
[1/2] 100.0%
[2/2] 100.0%
C:\analysis> notepad trace.txt
xperfinfo Named Providers
• DISK_IO: Disk I/O
• DISK_IO_INIT: Disk I/O initiation
• SPLIT_IO: Split I/O
• FILE_IO: File system op end times/results
• FILE_IO_INIT: Create/open/close/read/write
• FILENAME: Create/delete/rundown
• HARD_FAULTS: Hard page faults
• ALL_FAULTS: All page faults, including hard, copy-on-write, demand-zero faults
• DPC: Deferred Procedure Calls
• INTERRUPT: Interrupts
• DRIVERS: Driver events
• PROC_THREAD: Create/delete
• CSWITCH: Context switch
• COMPACT_CSWITCH
• DISPATCHER: CPU scheduler
• PREFETCH: Prefetching
• LOADER: Image load/unload
• SYSCALL: System calls
• PROFILE: CPU sample profile
• MEMORY: Memory tracing
• POOL: Memory pool tracing
• VIRT_ALLOC: Virtual alloc reserve and release
• NETWORKTRACE: TCP/UDP send/receive
• REGISTRY: Registry tracing
• POWER: Power management
• WORKER_THREAD: System worker thread
• PERF_COUNTER: Process perf counters
• ALPC: Advanced Local Procedure Call
• …
Available xperfinfo Reports (“actions”)
• tracestats
• sysconfig
• dumper
• diskio
• filename
• hardfault
• pagefault
• dpcisr
• process
• cswitch
• drvdelay
• marks
• perfctrs
• profile
• registry
• boot
• suspend
• shutdown
• …
The old tools
• Installed with Windows
– Logman: Collects performance counters
– Tracerpt: Processes ETW log files or real-time sessions
• Installed with the Driver Development Kit
– http://www.microsoft.com/whdc/devtools/ddk/default.mspx
– Tracelog: Starts, stops, or enables trace logging
• http://msdn2.microsoft.com/en-us/library/ms797927.aspx
– Tracefmt: Dumps ETW binary files into text files
• http://msdn2.microsoft.com/en-us/library/ms797564.aspx
– Traceview: Controls and displays ETW information
• http://msdn2.microsoft.com/en-us/library/ms797981.aspx
The traces: Benchmark Workloads
• Easy to capture and make available
– TPC-C, TPC-E, TPC-H, TPC-DS?
– SAP-SD
– Terminal Server
– NetBench
– SPC?
– …
Example: TPC-C Trace
• Windows Server 2008 / SQL Server 2005
• ~32 minutes
• 93.5 million disk I/Os (58.6M read, 35.9M write)
• 16-socket, dual-core 3.4 GHz Intel Xeon (16 MB L3 cache)
• 256 GB RAM
• 1106 15Krpm FC SCSI disks
– 79 database LUNs
Example: TPC-C Request Sizes
• 94.8% 8KB requests; 2.1% 16KB requests
• Remaining requests:
Example: TPC-C Trace Locality
• ~3% of all writes are within 128 sectors of the previous write to the same disk
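A locality figure like this is easy to compute from a trace dump: for each write, compare its starting sector against the previous write to the same disk. A sketch, assuming writes arrive as time-ordered (disk, start_sector) pairs (an illustrative layout, not the xperf dump schema):

```python
def near_write_fraction(writes, window=128):
    """writes: time-ordered list of (disk, start_sector) for write I/Os.
    Returns the fraction of writes (among those with a predecessor on
    the same disk) that land within `window` sectors of that predecessor."""
    last_sector = {}   # disk -> starting sector of its previous write
    near = total = 0
    for disk, sector in writes:
        if disk in last_sector:
            total += 1
            if abs(sector - last_sector[disk]) <= window:
                near += 1
        last_sector[disk] = sector
    return near / total if total else 0.0

# Two of the three comparable writes are within 128 sectors of their
# predecessor on the same disk, so the fraction is 2/3.
writes = [(0, 1000), (0, 1064), (1, 5000), (0, 900000), (1, 5100)]
frac = near_write_fraction(writes)
```

The same loop, with the comparison window or read/write filter swapped, yields the "~25% sequential" and "within 100,000 sectors" style statistics quoted for the RADIUS traces later in the deck.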
Example: Terminal Server Knowledge Worker (TS-KW) Trace
• Windows Server 2008, Office 2007
– Word, Excel, and Outlook activity
– 180 concurrent users
• ~45 seconds
– >750,000 context switches
• 16 thousand disk writes (in-memory working set)
– Mostly sequential
• 2-socket, quad-core 2.66 GHz Intel Xeon (4 MB L2 cache)
• 32 GB RAM (holds working set)
• One 120GB ATA disk
Example: TS-KW Interarrival Times
The traces: Production Workloads
• First set of internal Microsoft targets:
– SQL Server
– Exchange
– SharePoint
– File Server
– Web Server
– Media Server
– SAP
– Active Directory
– Security Server
– Backup
– Search
– Office Desktops, Laptops, Tablets
Example: SQL Server Replica for RADIUS Authentication Data for RAS & Wireless (worldwide)
• Windows Server 2003 / SQL Server 2005
• Three sequential 1-hour traces
– 126.9 thousand I/Os (16.3K read, 110.5K write)
– 122.6 thousand I/Os (10.8K read, 111.8K write)
– 101.6 thousand I/Os (5.9K read, 95.8K write)
– Locality:
• ~25% sequential requests
• More than half of all read and write requests are within 100,000 sectors of the immediately previous request (to the same disk)
– Mostly 512KB reads in first trace; no 512KB reads in other traces
• 4-socket, hyperthreaded 1.9 GHz Intel Xeon
• 8 GB RAM
• Dual-port Gb network card
• Five 4GB "disks" (configuration unknown)
xperf Demo
• Find region of 512KB reads in Trace 1
– Identify file being read
– Examine corresponding Disk I/O Detail
• Hard Fault frame
• Hard Fault Summary Table
– File + file offset → disk + disk offset (top-to-bottom disk I/O tracing)
[Screenshot slides: selecting the read-heavy region; Disk I/O Summary Table highlighting one particular *.mdf file; Disk I/O Detail for Disk #0 and Disk #3 (most of this activity is to the mdf file indicated in the Disk I/O Summary Table); selecting the hard fault region; Hard Fault Summary Table expanded to show file offsets and timestamps.]
RADIUS SQL Server Replica, Trace 2: Disk Offset Distribution for Disk 0
RADIUS SQL Server Replica, Trace 2: Disk Offset Distributions
RADIUS SQL Server Replica, Trace 2: Interarrival Times (within each disk)
The traces: System Configurations
• From mobile devices to datacenter servers
– Scale-out and scale-up environments
• 1-32 sockets
• 1-64 cores
• 1-1000 GB RAM
• NTFS, FAT, Raw
• ATA, SATA, SCSI, SAS, FC
• Solid state drives
The traces: Postprocessing
• Simple scripts and programs (e.g., perl and C#) will be available to:
– "Sanitize" traces by replacing some or all file, directory, and process names with generic strings
– Extract basic statistics from xperf dumps on an overall, R/W, per-disk, or per-size basis
• Request sizes
• Spatial distributions
• Queue lengths
• Interarrival times
• …
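Both kinds of scripts are simple in spirit. The sketch below shows a name sanitizer (identical names map to identical generic tokens, so access patterns survive anonymization) and an interarrival-time extractor; the input conventions are assumptions for illustration, not the actual dump schema or the released scripts:

```python
from itertools import count

def make_sanitizer():
    """Return a function mapping file/directory/process names to stable
    generic tokens: the same input name always yields the same token."""
    mapping = {}
    counter = count(1)
    def sanitize(name):
        if name not in mapping:
            mapping[name] = "name_%04d" % next(counter)
        return mapping[name]
    return sanitize

def interarrival_times(timestamps):
    """Differences between consecutive I/O initiation timestamps
    (assumed already sorted, for a single disk)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

san = make_sanitizer()
print(san(r"C:\Users\alice\secret.docx"))  # name_0001
print(san(r"C:\Users\alice\secret.docx"))  # name_0001 again (stable mapping)
print(interarrival_times([0.0, 1.5, 2.0, 6.0]))  # [1.5, 0.5, 4.0]
```

Keeping the name-to-token mapping stable (rather than random per occurrence) is the design choice that makes sanitized traces still useful for locality and working-set analysis.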
The tools and traces: Availability
• xperf & xperfinfo will be released with the next Windows SDK (in conjunction with Windows Vista SP1 and Windows Server 2008)
• Benchmark traces will be provided to the SNIA IOTTA group in Sept '07
– Scripts for sanitization and basic stats analysis included
• Production traces will be provided as they are captured and sanitized, hopefully on a monthly basis for years to come
– Captures are in progress on multiple Microsoft IT servers with varying workloads
– Traces will be dumped in manageable chunks
• All tools and traces have standard Microsoft disclaimers
• Microsoft would like to thank Seagate for providing disk drives to store the internal Microsoft trace repository!!!
xperf/xperfinfo Future Enhancements
• Equivalent file-block to disk-block event correlation for write requests (à la hard fault reads)
– Mapped file writes
– Lazy writer
– Dirty page writer
– Unbuffered writes
– In the meantime, write-after-reads and (in some cases) sequential writes can be translated
• Built-in process/file/directory sanitization
• Extensibility
• …and much more!
Summary
• Event Tracing for Windows (ETW) = the engine
– Instrumentation built into the retail Windows operating system
– The NT Kernel Provider provides coverage of kernel-level activity
• xperf = the interactive browser
– High-level graphs
– Summary tables
– Individual event detail
• xperfinfo = the command-line automation tool
– ETW controller and decoder
– Exports human-readable decoding of all trace events
– Many custom actions distilling various aspects of the trace
• Alpha version of xperf/xperfinfo can be requested from:
– [email protected]
– [email protected]
• Additional resources
– Event Tracing for Windows on MSDN: http://msdn2.microsoft.com/en-us/library/aa363787.aspx
– "Windows Internals, 4th Edition" by Russinovich and Solomon
The challenge: Capture and Share Traces!
• Microsoft is committed to gathering long-term (weeks/months) traces on many production systems within the corporate IT environment
• Start the wheels rolling in your organization to allow similar traces to be captured, sanitized, and published; use existing tools to start with and xperf/xperfinfo when they become available
• Create and share post-processing tools, simulators, models, etc., via the SNIA IOTTA repository
– http://iotta.snia.org
• Provide feedback on xperf & xperfinfo
– Be patient, as this is an engineering analysis tool (not an MS product) and is supported as such.
Q & A
Backup Slides
What is ETW used for?
• Debug application bugs, including hangs, crashes, or unexpected behavior
• Diagnose performance problems
• Track computing resource consumption at the application transaction level for capacity planning
ETW vs. Performance Counters
ETW
• Individual events described using multiple standard/custom attributes
• Each event requires a timestamp
• Each event requires additional space
• An ETW trace can be used to compute aggregations on any group of events (filtered by time or any attributes) at post-processing time
– Various perspectives
• Can zoom down to individual events

Performance Counters
• Aggregate information about groups of events
• Each sample requires a timestamp
• Each sample requires additional space
• Very light; events aggregated in place
• A sampled performance counter trace provides a bottom aggregation level
– Information below the bottom aggregation level is lost
• Individual event information is lost
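The distinction can be made concrete: with per-event data, any grouping can be chosen after the fact, whereas a counter sample has already collapsed that information at capture time. A toy illustration (the event records and process names are invented, not real ETW or PerfMon data):

```python
# Per-event trace: every disk I/O survives with all its attributes.
events = [
    {"t": 0.1, "process": "sql",    "bytes": 8192},
    {"t": 0.2, "process": "backup", "bytes": 65536},
    {"t": 0.3, "process": "sql",    "bytes": 8192},
]

# ETW-style post-processing: aggregate on ANY attribute, chosen later.
by_process = {}
for e in events:
    by_process[e["process"]] = by_process.get(e["process"], 0) + e["bytes"]

# Counter-style sample: the aggregation level was fixed at capture time,
# so the per-process breakdown cannot be recovered from it.
counter_sample = {"t": 0.3, "total_bytes": sum(e["bytes"] for e in events)}

print(by_process)      # {'sql': 16384, 'backup': 65536}
print(counter_sample)  # {'t': 0.3, 'total_bytes': 81920}
```

This is the trade the slides describe: ETW pays per-event space for the ability to re-slice later; counters are far lighter but freeze the aggregation level.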