TRANSCRIPT
Capturing Comprehensive Storage Workload Traces in Windows
Dr. Bruce Worthington
Windows Server Performance
Microsoft Corporation
My Motivation
• I'm tired of seeing storage research and performance analysis limited by real-world trace availability.
– It's not all that much better now than it was when I was in grad school in the early '90s…
• I've been saying I would help supply researchers with long-term real-world traces and post-processing tools for almost a decade.
– I'm finally following through on the promise to supply production traces.
– More importantly, Microsoft has made it easy for anyone to capture detailed storage workload traces on Windows systems (along with many other types of traces and profiles).
Outline
• The tools
– Event Tracing for Windows (ETW)
– New: xperf, xperfinfo
– Old: logman, trace*
• The traces
– Benchmark (steady-state) workloads
– Production workloads
• The challenge
Event Tracing for Windows (ETW)
• ETW has been the core Windows tracing component since Windows 2000 and is continually improved
• Many Windows components, including the kernel, produce events describing their behavior
• Events from user-mode applications and kernel-mode drivers can be logged
• High performance, low overhead, highly scalable
– Efficient buffering and non-blocking logging mechanisms using per-processor buffers written to disk by a separate thread
• Tracing can be enabled/disabled dynamically without requiring system reboots or application restarts
Event Tracing for Windows (ETW)
• Events can be sample-based, but most are single-instance (event A occurred at time T)
• Support for real-time consumption and file-based tracing
• Configurable logging mode, buffer size, buffer count
– Sequential traces
– Circular traces
– Circular traces in memory (black-box flight recorder) [Vista]
• Adding custom events enables better correlation of application activity with low-level resource usage
• On a standard Vista computer, the logging API (EventWrite) takes about 5,000 cycles, mostly spent acquiring the timestamp via QueryPerformanceCounter (QPC)
– About 2.5% processor overhead for a sustained rate of 10,000 events/second on a 2GHz processor, not including the cost of flushing trace buffers to disk
• Postprocessing the binary disk log correlates events with context and domain-specific knowledge
ETW Architecture
• Provider
– Provides event traces. Can be a user-mode app, kernel-mode driver, or the kernel itself.
– Providers use ETW APIs to register with the ETW framework and send event traces from various points in the code.
– When enabled, a provider sends event traces to a specific trace session designated by the controller.
• Controller
– Starts, stops, or updates trace sessions in the kernel, and enables or disables providers
– Sets trace session properties such as sequential or circular file logging, or direct delivery to consumers
• Consumer
– Reads trace files or listens to active trace sessions and processes logged events
– Not aware of the providers
– Only receives event traces from the trace sessions or log files
Introduction
• Event Trace Session infrastructure
– Brokers the event traces from provider(s) to consumer(s) and adds data to each event (e.g., TimeStamp, Thread, Process, CPU)
"NT Kernel Provider"
• Events
– Process, thread, and image
– Sampled Profile
– Context Switch
– Dispatcher (Ready Thread)
– DPC (Deferred Procedure Call)
– ISR (Interrupt Service Routine)
– Disk I/O
– File I/O
– Registry
– Hard fault
– Page fault
– Driver delay
– TCP/UDP
– Power
– ALPC
– Virtual Allocation
– Heap
– Memory
– …
• Related providers
– Thread Pool
– Power Transition
– Winlogon
– Services
– Prefetch
• Other providers
– Shell
– Internet Explorer
– Media Foundation
– Media Center
– …
System Config: ETW Instrumentation
• Automatically added to kernel traces
• Rundown of system configuration at trace start/stop time
– CPU (number of logical and physical processors, frequency)
– Memory (memory size, page size, allocation granularity)
– Disk (physical disks, partitions, volumes)
– Video adapters
– Network adapters (IPv4, IPv6)
– Services (including service tag)
– Plug-and-Play information
– IRQ assignment
– Power capabilities (S1-S5)
– Network identity (computer name, domain name)
– Group masks (which kernel flags are enabled)
Storage-Related Instrumentation
• Disk Events:
– Read, Write, Flush Initiation/Completion
• File Events:
– Filename Create, Delete, Rundown (when trace stopped)
– File I/O Initiation, Hard Fault
• Create, Cleanup, Close, Flush, Read, Write, Set Info, Query Info, FSCTL, Delete, Rename, Directory Enumeration, Directory Notification
– File I/O Completion
• Driver Events:
– Driver Call, Return (Major Function)
– Driver Complete Request, Complete Request Return
– Driver Completion Routine
• Binary storage-related event sizes range from ~30-80 bytes (not counting events that dump unique filenames)
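The quoted 30-80 byte range is plausible given the fields these events carry. The packed layout below is purely hypothetical (it is not the real ETW record format); it only illustrates how a disk I/O record with a timestamp, identifiers, offset, and size lands in that range:

```python
import struct

# Hypothetical packed layout for a disk I/O completion record
# (NOT the real ETW format; fields chosen from those the slides mention):
# 8B timestamp, 4B thread ID, 4B disk number, 8B byte offset,
# 4B transfer size, 4B flags, 8B request pointer, 8B response time
DISK_EVENT_FMT = "<QIIQIIQQ"
event_size = struct.calcsize(DISK_EVENT_FMT)
print(event_size)  # 48 bytes, comfortably inside the ~30-80 byte range
```

Variable-length data such as filenames is what pushes some event types well past this, which is why the slide excludes filename-dumping events from the range.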
The new tools: xperf & xperfinfo
• Extensible performance analysis toolset
• High-level control and decoding of ETW traces
– Emphasis on kernel events and system-wide resource usage
– Support for 3rd-party events, primarily in conjunction with kernel events
• Cross-platform
– Windows XP SP1+, Vista
– Windows Server 2003, Windows Server 2008
• Cross-architecture (x86, x64, ia64)
• Capture-anywhere, process-anywhere
xperf
• Detailed interactive analysis of performance traces
• High-level resource usage graphs on a common trace timeline with zoom capability
• Low-level discrete graphs for resource state transitions
– Individual context switch and disk I/O events
• Powerful interactive summary tables with dynamic grouping, sorting, and aggregation capability
Currently Available Graphs and Summary Tables
– Disk I/O Counts
– Disk I/O Detail
– Disk Utilization
– File I/O
– DPC (Deferred Procedure Call)
– ISR (Interrupt Service Routine)
– Hard fault
– Page fault
– Driver Delay
– Sample Profile
– CPU Availability
– CPU Scheduling
– Process Lifetime
– Registry Counts
– Services
– Plug 'n' Play
– Marks
– Generic
– …
Overview of xperf
Storage Activity Notes
• Disk Reads, Writes, and Flushes
– Vista introduced low-priority I/Os, which are deferred in a special queue to allow current and near-term future normal-priority I/Os to complete
– Flushes may be "completed" by low-level storage drivers under certain conditions
• Hard Faults
– Synchronous I/Os that block execution of the issuing thread
– Page-in from disk of pages not currently present in memory
• Communication aspect: system read-ahead and write-back
– Asynchronous just-in-time prefetch for sequentially read buffered files
– Asynchronous buffered writes
– Issued from the "System (4)" process
• xperf & xperfinfo infer disk "queue" parameters (wait time, service time, queue depths, skip behavior) assuming a single serialized spindle
– Not a valid assumption for disk arrays
– Queue depth can be thought of as "number of requests in flight"
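Under that single-serialized-spindle assumption, wait time, service time, and queue depth can all be reconstructed from initiation/completion timestamp pairs. A minimal sketch of the inference (the tuple-list input format is an assumption for illustration, not the xperf trace layout):

```python
def infer_queue_params(ios):
    """ios: list of (init_time, completion_time) pairs for one disk,
    sorted by initiation time. Assumes a single serialized spindle:
    a request's service begins only after the previous one completes."""
    results = []
    prev_completion = 0.0
    for init, completion in ios:
        # Service starts at arrival or when the spindle frees up, whichever is later.
        service_start = max(init, prev_completion)
        wait = service_start - init            # time spent queued
        service = completion - service_start   # time "on the spindle"
        # Queue depth at arrival = number of requests in flight,
        # i.e. already initiated but not yet completed.
        depth = sum(1 for i2, c2 in ios if i2 <= init < c2)
        results.append({"wait": wait, "service": service, "queue_depth": depth})
        prev_completion = completion
    return results

ios = [(0.0, 5.0), (1.0, 9.0), (2.0, 12.0)]
stats = infer_queue_params(ios)
# Second request: waits 4.0 (until t=5), served for 4.0; depth 2 at arrival.
```

On a disk array several requests are genuinely in service at once, so the wait/service split computed this way is wrong there, exactly as the slide cautions.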
xperf Demo
• System Config
• CPU
• DPC & ISR
• Process
• Disk Summary Table
• Disk I/O Detail
• Disk I/O Detail Summary
• Hard Faults
Sample xperf Screenshots
• Sidebar chart selection
• Selecting a time range
• CPU Usage Summary Table
• DPC and ISR CPU Usage frames
• Disk I/O Summary Table
• Disk I/O Detail
• Disk I/O Detail Summary Table
[Screenshot slides: sidebar frame selector and frame scrollbar; selecting a time range on the CPU Usage frame; context-menu navigation to the CPU Usage Summary Table (status bar report, % of time excluding DPC and ISR, % total time, selected time interval, closing the summary table); DPC and Interrupt CPU Usage frames; Disk I/O Summary Table, expanded to individual I/Os; Disk I/O Detail graphs for Disk #0 and Disk #1, with disk and process selection; Disk I/O Detail Summary Table with its default sort field.]
xperfinfo
• High-level control and decoding
• Merging and dumping of ETW traces
• Many command-line actions to analyze and report on various aspects of a trace
• Various buffering and log file options
• Multiple timer sources
• Traces of boot activity
xperfinfo Demo
• Start/stop
• Provider list
• Dump
• Postprocessing summaries
Taking a Kernel Trace
• Start kernel trace; run scenario; stop and merge:
C:\analysis> xperfinfo -on base+FILE_IO+INTERRUPT
C:\analysis> MyTestApp.exe
C:\analysis> xperfinfo -d trace.etl
• Start user trace; run scenario; stop:
C:\analysis> xperfinfo -start MySession -on Kerberos+MRxSmb -f kerberos.etl
C:\analysis> MyTestApp.exe
C:\analysis> xperfinfo -stop MySession
• Hint: Retrieve all known kernel flags and groups:
C:\analysis> xperfinfo -help providers
Dumping a Trace
C:\analysis> xperfinfo -i trace.etl -o trace.txt
[1/2] 100.0%
[2/2] 100.0%
C:\analysis> notepad trace.txt
xperfinfo Named Providers
• DISK_IO: Disk I/O
• DISK_IO_INIT: Disk I/O initiation
• SPLIT_IO: Split I/O
• FILE_IO: File system op end times/results
• FILE_IO_INIT: Create/open/close/read/write
• FILENAME: Create/delete/rundown
• HARD_FAULTS: Hard page faults
• ALL_FAULTS: All page faults, including hard, copy-on-write, demand-zero faults
• DPC: Deferred Procedure Calls
• INTERRUPT: Interrupts
• DRIVERS: Driver events
• PROC_THREAD: Create/delete
• CSWITCH: Context switch
• COMPACT_CSWITCH
• DISPATCHER: CPU scheduler
• PREFETCH: Prefetching
• LOADER: Image load/unload
• SYSCALL: System calls
• PROFILE: CPU sample profile
• MEMORY: Memory tracing
• POOL: Memory pool tracing
• VIRT_ALLOC: Virtual alloc reserve and release
• NETWORKTRACE: TCP/UDP send/receive
• REGISTRY: Registry tracing
• POWER: Power management
• WORKER_THREAD: System worker thread
• PERF_COUNTER: Process perf counters
• ALPC: Advanced Local Procedure Call
• …
Available xperfinfo Reports (“actions”)
• tracestats
• sysconfig
• dumper
• diskio
• filename
• hardfault
• pagefault
• dpcisr
• process
• cswitch
• drvdelay
• marks
• perfctrs
• profile
• registry
• boot
• suspend
• shutdown
• …
The old tools
• Installed with Windows
– Logman: Collects performance counters
– Tracerpt: Processes ETW log files or real-time sessions
• Installed with the Driver Development Kit
– http://www.microsoft.com/whdc/devtools/ddk/default.mspx
– Tracelog: Starts, stops, or enables trace logging
• http://msdn2.microsoft.com/en-us/library/ms797927.aspx
– Tracefmt: Dumps ETW binary files into text files
• http://msdn2.microsoft.com/en-us/library/ms797564.aspx
– Traceview: Controls and displays ETW information
• http://msdn2.microsoft.com/en-us/library/ms797981.aspx
The traces: Benchmark Workloads
• Easy to capture and make available
– TPC-C, TPC-E, TPC-H, TPC-DS?
– SAP-SD
– Terminal Server
– NetBench
– SPC?
– …
Example: TPC-C Trace
• Windows Server 2008 / SQL Server 2005
• ~32 minutes
• 93.5 million disk I/Os (58.6M read, 35.9M write)
• 16-socket, dual-core 3.4 GHz Intel Xeon (16 MB L3 cache)
• 256 GB RAM
• 1106 15Krpm FC SCSI disks
– 79 database LUNs
Example: TPC-C Request Sizes
• 94.8% 8KB requests; 2.1% 16KB requests
• Remaining requests:
Example: TPC-C Trace Locality
• ~3% of all writes are within 128 sectors of the previous write to the same disk
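A locality figure like this is easy to compute from a trace dump: for each write, compare its starting sector against the previous write to the same disk. A sketch, assuming writes arrive as time-ordered (disk, start_sector) pairs (an illustrative layout, not the xperf dump schema):

```python
def near_write_fraction(writes, window=128):
    """writes: time-ordered list of (disk, start_sector) for write I/Os.
    Returns the fraction of writes (among those with a predecessor on
    the same disk) that land within `window` sectors of that predecessor."""
    last_sector = {}   # disk -> starting sector of its previous write
    near = total = 0
    for disk, sector in writes:
        if disk in last_sector:
            total += 1
            if abs(sector - last_sector[disk]) <= window:
                near += 1
        last_sector[disk] = sector
    return near / total if total else 0.0

# Two of the three comparable writes are within 128 sectors of their
# predecessor on the same disk, so the fraction is 2/3.
writes = [(0, 1000), (0, 1064), (1, 5000), (0, 900000), (1, 5100)]
frac = near_write_fraction(writes)
```

The same loop, with the comparison window or read/write filter swapped, yields the "~25% sequential" and "within 100,000 sectors" style statistics quoted for the RADIUS traces later in the deck.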
Example: Terminal Server Knowledge Worker (TS-KW) Trace
• Windows Server 2008, Office 2007
– Word, Excel, and Outlook activity
– 180 concurrent users
• ~45 seconds
– >750,000 context switches
• 16 thousand disk writes (in-memory working set)
– Mostly sequential
• 2-socket, quad-core 2.66 GHz Intel Xeon (4 MB L2 cache)
• 32 GB RAM (holds working set)
• One 120GB ATA disk
Example: TS-KW Interarrival Times
The traces: Production Workloads
• First set of internal Microsoft targets:
– SQL Server
– Exchange
– SharePoint
– File Server
– Web Server
– Media Server
– SAP
– Active Directory
– Security Server
– Backup
– Search
– Office Desktops, Laptops, Tablets
Example: SQL Server Replica for RADIUS Authentication Data for RAS & Wireless (worldwide)
• Windows Server 2003 / SQL Server 2005
• Three sequential 1-hour traces
– 126.9 thousand I/Os (16.3K read, 110.5K write)
– 122.6 thousand I/Os (10.8K read, 111.8K write)
– 101.6 thousand I/Os (5.9K read, 95.8K write)
– Locality:
• ~25% sequential requests
• More than half of all read and write requests are within 100,000 sectors of the immediately previous request (to the same disk)
– Mostly 512KB reads in first trace; no 512KB reads in other traces
• 4-socket, hyperthreaded 1.9 GHz Intel Xeon
• 8 GB RAM
• Dual-port Gb network card
• Five 4GB "disks" (configuration unknown)
xperf Demo
• Find region of 512KB reads in Trace 1
– Identify file being read
– Examine corresponding Disk I/O Detail
• Hard Fault frame
• Hard Fault Summary Table
– File + file offset → disk + disk offset (top-to-bottom disk I/O tracing)
[Screenshot slides: selecting the read-heavy region; Disk I/O Summary Table highlighting one particular *.mdf file; Disk I/O Detail for Disk #0 and Disk #3 (most of this activity is to the mdf file indicated in the Disk I/O Summary Table); selecting the hard fault region; Hard Fault Summary Table expanded to show file offsets and timestamps.]
RADIUS SQL Server Replica, Trace 2: Disk Offset Distribution for Disk 0
RADIUS SQL Server Replica, Trace 2: Disk Offset Distributions
RADIUS SQL Server Replica, Trace 2: Interarrival Times (within each disk)
The traces: System Configurations
• From mobile devices to datacenter servers
– Scale-out and scale-up environments
• 1-32 sockets
• 1-64 cores
• 1-1000 GB RAM
• NTFS, FAT, Raw
• ATA, SATA, SCSI, SAS, FC
• Solid state drives
The traces: Postprocessing
• Simple scripts and programs (e.g., perl and C#) will be available to:
– "Sanitize" traces by replacing some or all file, directory, and process names with generic strings
– Extract basic statistics from xperf dumps on an overall, R/W, per-disk, or per-size basis
• Request sizes
• Spatial distributions
• Queue lengths
• Interarrival times
• …
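Both kinds of scripts are simple in spirit. The sketch below shows a name sanitizer (identical names map to identical generic tokens, so access patterns survive anonymization) and an interarrival-time extractor; the input conventions are assumptions for illustration, not the actual dump schema or the released scripts:

```python
from itertools import count

def make_sanitizer():
    """Return a function mapping file/directory/process names to stable
    generic tokens: the same input name always yields the same token."""
    mapping = {}
    counter = count(1)
    def sanitize(name):
        if name not in mapping:
            mapping[name] = "name_%04d" % next(counter)
        return mapping[name]
    return sanitize

def interarrival_times(timestamps):
    """Differences between consecutive I/O initiation timestamps
    (assumed already sorted, for a single disk)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

san = make_sanitizer()
print(san(r"C:\Users\alice\secret.docx"))  # name_0001
print(san(r"C:\Users\alice\secret.docx"))  # name_0001 again (stable mapping)
print(interarrival_times([0.0, 1.5, 2.0, 6.0]))  # [1.5, 0.5, 4.0]
```

Keeping the name-to-token mapping stable (rather than random per occurrence) is the design choice that makes sanitized traces still useful for locality and working-set analysis.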
The tools and traces: Availability
• xperf & xperfinfo will be released with the next Windows SDK (in conjunction with Windows Vista SP1 and Windows Server 2008)
• Benchmark traces will be provided to the SNIA IOTTA group in Sept '07
– Scripts for sanitization and basic stats analysis included
• Production traces will be provided as they are captured and sanitized, hopefully on a monthly basis for years to come
– Captures are in progress on multiple Microsoft IT servers with varying workloads
– Traces will be dumped in manageable chunks
• All tools and traces have standard Microsoft disclaimers
• Microsoft would like to thank Seagate for providing disk drives to store the internal Microsoft trace repository!!!
xperf/xperfinfo Future Enhancements
• Equivalent file-block to disk-block event correlation for write requests (à la hard fault reads)
– Mapped file writes
– Lazy writer
– Dirty page writer
– Unbuffered writes
– In the meantime, write-after-reads and (in some cases) sequential writes can be translated
• Built-in process/file/directory sanitization
• Extensibility
• …and much more!
Summary
• Event Tracing for Windows (ETW) = the engine
– Instrumentation built into the retail Windows operating system
– The NT Kernel Provider provides coverage of kernel-level activity
• xperf = the interactive browser
– High-level graphs
– Summary tables
– Individual event detail
• xperfinfo = the command-line automation tool
– ETW controller and decoder
– Exports human-readable decoding of all trace events
– Many custom actions distilling various aspects of the trace
• Alpha version of xperf/xperfinfo can be requested from:
– [email protected]
– [email protected]
• Additional resources
– Event Tracing for Windows on MSDN: http://msdn2.microsoft.com/en-us/library/aa363787.aspx
– "Windows Internals, 4th Edition" by Russinovich and Solomon
The challenge: Capture and Share Traces!
• Microsoft is committed to gathering long-term (weeks/months) traces on many production systems within the corporate IT environment
• Start the wheels rolling in your organization to allow similar traces to be captured, sanitized, and published; use existing tools to start with and xperf/xperfinfo when they become available
• Create and share post-processing tools, simulators, models, etc., via the SNIA IOTTA repository
– http://iotta.snia.org
• Provide feedback on xperf & xperfinfo
– Be patient, as this is an engineering analysis tool (not an MS product) and is supported as such.
Q & A
Backup Slides
What is ETW used for?
• Debug application bugs, including hangs, crashes, or unexpected behavior
• Diagnose performance problems
• Track computing resource consumption at the application transaction level for capacity planning
ETW vs. Performance Counters
ETW
• Individual events described using multiple standard/custom attributes
• Each event requires a timestamp
• Each event requires additional space
• An ETW trace can be used to compute aggregations on any group of events (filtered by time or any attributes) at post-processing time
– Various perspectives
• Can zoom down to individual events

Performance Counters
• Aggregate information about groups of events
• Each sample requires a timestamp
• Each sample requires additional space
• Very light; events aggregated in place
• A sampled performance counter trace provides a bottom aggregation level
– Information below the bottom aggregation level is lost
• Individual event information is lost
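The distinction can be made concrete: with per-event data, any grouping can be chosen after the fact, whereas a counter sample has already collapsed that information at capture time. A toy illustration (the event records and process names are invented, not real ETW or PerfMon data):

```python
# Per-event trace: every disk I/O survives with all its attributes.
events = [
    {"t": 0.1, "process": "sql",    "bytes": 8192},
    {"t": 0.2, "process": "backup", "bytes": 65536},
    {"t": 0.3, "process": "sql",    "bytes": 8192},
]

# ETW-style post-processing: aggregate on ANY attribute, chosen later.
by_process = {}
for e in events:
    by_process[e["process"]] = by_process.get(e["process"], 0) + e["bytes"]

# Counter-style sample: the aggregation level was fixed at capture time,
# so the per-process breakdown cannot be recovered from it.
counter_sample = {"t": 0.3, "total_bytes": sum(e["bytes"] for e in events)}

print(by_process)      # {'sql': 16384, 'backup': 65536}
print(counter_sample)  # {'t': 0.3, 'total_bytes': 81920}
```

This is the trade the slides describe: ETW pays per-event space for the ability to re-slice later; counters are far lighter but freeze the aggregation level.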