Workload Selection and Characterization

Andy Wang
CIS 5930-03
Computer Systems Performance Analysis

Workloads

• Types of workloads
• Workload selection

Types of Workloads

• What is a Workload?
• Instruction Workloads
• Synthetic Workloads
• Real-World Benchmarks
• Application Benchmarks
• “Standard” Benchmarks
• Exercisers and Drivers

What is a Workload?

• Workload: anything a computer is asked to do

• Test workload: any workload used to analyze performance

• Real workload: any workload observed during normal operations

• Synthetic workload: any workload created for controlled testing

Real Workloads

• Advantage: represent reality
• Disadvantage: uncontrolled
  – Can’t be repeated
  – Can’t be described simply
  – Difficult to analyze
• Nevertheless, often useful for “final analysis” papers
  – E.g., “We ran system foo and it works well”

Synthetic Workloads

• Advantages:
  – Controllable
  – Repeatable
  – Portable to other systems
  – Easily modified
• Disadvantage: can never be sure real world will be the same

Instruction Workloads

• Useful only for CPU performance
  – But teach useful lessons for other situations
• Development over decades
  – “Typical” instruction (ADD)
  – Instruction mix (by frequency of use)
    • Sensitive to compiler, application, architecture
    • Still used today (GFLOPS)
  – Processor clock rate
    • Only valid within processor family
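An instruction mix reduces to a frequency-weighted average. The sketch below shows that arithmetic only; the instruction classes, frequencies, and cycle counts are invented for illustration and do not describe any real processor.

```python
# Hypothetical instruction mix: class -> (relative frequency, cycles per instruction).
# All numbers are made up for illustration, not measured.
MIX = {
    "load/store": (0.35, 2.0),
    "add/sub": (0.40, 1.0),
    "branch": (0.20, 1.5),
    "multiply": (0.05, 4.0),
}

def average_cpi(mix):
    """Frequency-weighted average cycles per instruction."""
    total_freq = sum(freq for freq, _ in mix.values())
    return sum(freq * cpi for freq, cpi in mix.values()) / total_freq

def mips(mix, clock_hz):
    """Millions of instructions per second implied by the mix at a given clock."""
    return clock_hz / average_cpi(mix) / 1e6

print(average_cpi(MIX), mips(MIX, 1.6e9))
```

Changing the frequencies (a different compiler or application) changes the result, which is why a mix is only as good as the workload it was measured on.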

Instruction Workloads (cont’d)

• Modern complexity makes mixes invalid
  – Pipelining
  – Data/instruction caching
  – Prefetching
• Kernel is inner loop that does useful work:
  – Sieve, matrix inversion, sort, etc.
  – Ignores setup, I/O, so can be timed by analysis if desired (at least in theory)

Synthetic Workloads

• Complete programs
  – Designed specifically for measurement
  – May do real or “fake” work
  – May be adjustable (parameterized)
• Two major classes:
  – Benchmarks
  – Exercisers

Real-World Benchmarks

• Pick a representative application
• Pick sample data
• Run it on system to be tested
• Modified Andrew Benchmark (MAB) is a real-world benchmark
• Easy to do, accurate for that sample data
• Fails to consider other applications, data

Application Benchmarks

• Variation on real-world benchmarks
• Choose most important subset of functions
• Write benchmark to test those functions
• Tests what computer will be used for
• Need to be sure important characteristics aren’t missed
• Mix of functions must reflect reality

“Standard” Benchmarks

• Often need to compare general-purpose computer systems for general-purpose use
  – E.g., should I buy a Compaq or a Dell PC?
  – Tougher: Mac or PC?
• Desire for an easy, comprehensive answer
• People writing articles often need to compare tens of machines

“Standard” Benchmarks (cont’d)

• Often need to make comparisons over time
  – Is this year’s PowerPC faster than last year’s Pentium?
    • Probably yes, but by how much?
• Don’t want to spend time writing own code
  – Could be buggy or not representative
  – Need to compare against other people’s results
• “Standard” benchmarks offer solution

Popular “Standard” Benchmarks

• Sieve, 8 queens, etc.
• Whetstone
• Linpack
• Dhrystone
• Debit/credit
• TPC
• SPEC
• MAB
• Winstone, Webstone, etc.
• ...

Sieve, etc.

• Prime number sieve (Eratosthenes)
  – Nested for loops
  – Often such small array that it’s silly
• 8 queens
  – Recursive
• Many others
• Generally not representative of real problems

Whetstone

• Dates way back (can compare against results from the ’70s)
• Based on real observed frequencies
• Entirely synthetic (no useful result)
  – Modern optimizers may delete code
• Mixed data types, but best for floating point
• Be careful of incomparable variants!

LINPACK

• Based on real programs and data
• Developed by supercomputer users
• Great if you’re doing serious numerical computation

Dhrystone

• Bad pun on “Whetstone”
• Motivated by Whetstone’s perceived excessive emphasis on floating point
• Dates to when microprocessors (μPs) were integer-only
• Very popular in PC world
• Again, watch out for version mismatches

Debit/Credit Benchmark

• Developed for transaction processing environments
  – CPU processing is usually trivial
  – Remarkably demanding I/O, scheduling requirements
• Models real TPS workloads synthetically
• Modern version is TPC benchmark

SPEC Suite

• Result of multi-manufacturer consortium
• Addresses flaws in existing benchmarks
• Uses 10 real applications, trying to characterize specific real environments
• Considers multiple CPUs
• Geometric mean gives SPECmark for system
• Becoming standard comparison method
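The geometric-mean step can be sketched as follows. This is only an illustration of the arithmetic: the run times are invented, and forming each ratio as reference time over measured time is an assumption made here rather than a statement of the official SPEC procedure.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed via logs for stability."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be non-empty and positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical run times in seconds, one entry per application in the suite.
reference_times = [100.0, 200.0, 400.0]  # assumed reference machine
measured_times = [50.0, 50.0, 50.0]      # assumed system under test

ratios = [ref / t for ref, t in zip(reference_times, measured_times)]
score = geometric_mean(ratios)
print(score)
```

Unlike the arithmetic mean, the geometric mean of ratios gives the same ranking no matter which machine is chosen as the reference.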

Modified Andrew Benchmark

• Used in research to compare file system, operating system designs
• Based on software engineering workload
• Exercises copying, compiling, linking
• Probably ill-designed, but common use makes it important
• Needs scaling up for modern systems

Winstone, Webstone, etc.

• “Stone” has become suffix meaning “benchmark”

• Many specialized suites to test specialized applications
  – Too many to review here
• Important to understand strengths & drawbacks
  – Bias toward certain workloads
  – Assumptions about system under test

Exercisers and Drivers

• For I/O, network, non-CPU measurements

• Generate a workload, feed to internal or external measured system
  – I/O on local OS
  – Network
• Sometimes uses dedicated system, interface hardware

Advantages of Exercisers

• Easy to develop, port
• Can incorporate measurement
• Easy to parameterize, adjust

Disadvantages of Exercisers

• High cost if external
• Often too small compared to real workloads
  – Thus not representative
  – E.g., may use caches “incorrectly”
• Internal exercisers often don’t have real CPU activity
  – Affects overlap of CPU and I/O
• Synchronization effects caused by loops

Workload Selection

• Services exercised
• Completeness
  – Sample service characterization
• Level of detail
• Representativeness
• Timeliness
• Other considerations

Services Exercised

• What services does system actually use?
  – Network performance useless for matrix work
• What metrics measure these services?
  – MIPS/GIPS for CPU speed
  – Bandwidth/latency for network, I/O
  – TPS for transaction processing

Completeness

• Computer systems are complex
  – Effect of interactions hard to predict
    • Dynamic voltage scaling can speed up heavy loads (e.g., accessing encrypted files)
  – So must be sure to test entire system
• Important to understand balance between components
  – I.e., don’t use 90% CPU mix to evaluate I/O-bound application

Component Testing

• Sometimes only individual components are compared
  – Would a new CPU speed up our system?
  – How does IPv6 affect Web server performance?
• But component may not be directly related to performance
  – So be careful, do ANOVA (analysis of variance), don’t extrapolate too much

Service Testing

• May be possible to isolate interfaces to just one component
  – E.g., instruction mix for CPU
• Consider services provided and used by that component
• System often has layers of services
  – Can cut at any point and insert workload

Characterizing a Service

• Identify service provided by major subsystem

• List factors affecting performance
• List metrics that quantify demands and performance
• Identify workload provided to that service

Example: Web Server

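[Figure: layered service diagram. Components, top to bottom: Web Client, Network, Web Server, File System, Disk Drive. Workloads passed between layers: Web Page Visits, TCP/IP Connections, HTTP Requests, Web Page Accesses, Disk Transfers.]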

Web Client Analysis

• Services: visit page, follow hyperlink, display page information

• Factors: page size, number of links, fonts required, embedded graphics, sound

• Metrics: response time (both definitions)
• Workload: a list of pages to be visited and links to be followed

Network Analysis

• Services: connect to server, transmit request, transfer data

• Factors: bandwidth, latency, protocol used

• Metrics: connection setup time, response latency, achieved bandwidth

• Workload: a series of connections to one or more servers, with data transfer

Web Server Analysis

• Services: accept and validate connection, fetch & send HTTP data

• Factors: Network performance, CPU speed, system load, disk subsystem performance

• Metrics: response time, connections served

• Workload: a stream of incoming HTTP connections and requests

File System Analysis

• Services: open file, read file (writing often doesn’t matter for Web server)

• Factors: disk drive characteristics, file system software, cache size, partition size

• Metrics: response time, transfer rate
• Workload: a series of file-transfer requests

Disk Drive Analysis

• Services: read sector, write sector
• Factors: seek time, transfer rate
• Metrics: response time
• Workload: a statistically-generated stream of read/write requests

Level of Detail

• Detail trades off accuracy vs. cost
• Highest detail is complete trace
• Lowest is one request, usually most common
• Intermediate approach: weight by frequency
• We will return to this when we discuss workload characterization

Representativeness

• Obviously, workload should represent desired application
  – Arrival rate of requests
  – Resource demands of each request
  – Resource usage profile of workload over time
• Again, accuracy and cost trade off
• Need to understand whether detail matters

Timeliness

• Usage patterns change over time
  – File size grows to match disk size
  – Web pages grow to match network bandwidth
• If using “old” workloads, must be sure user behavior hasn’t changed
• Even worse, behavior may change after test, as result of installing new system
  – “Latent demand” phenomenon

Other Considerations

• Loading levels
  – Full capacity
  – Beyond capacity
  – Actual usage
• External components not considered as parameters
• Repeatability of workload

Workload Characterization

• Terminology
• Averaging
• Specifying dispersion
• Single-parameter histograms
• Multi-parameter histograms
• Principal-component analysis
• Markov models
• Clustering

Workload Characterization Terminology

• User (maybe nonhuman) requests service
  – Also called workload component or workload unit
• Workload parameters or workload features model or characterize the workload

Selecting Workload Components

• Most important: components should be external: at interface of SUT (system under test)
• Components should be homogeneous
• Should characterize activities of interest to the study

Choosing Workload Parameters

• Select parameters that depend only on workload (not on SUT)
• Prefer controllable parameters
• Omit parameters that have no effect on system, even if important in real world

Averaging

• Basic character of a parameter is its average value

• Not just arithmetic mean
• Good for uniform distributions or gross studies

Specifying Dispersion

• Most parameters are non-uniform
• Specifying variance or standard deviation brings major improvement over average
• Average and s.d. (or C.O.V.) together allow workloads to be grouped into classes
  – Still ignores exact distribution
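A minimal sketch of summarizing one workload parameter by its average, standard deviation, and coefficient of variation (C.O.V. = s.d. / mean). The sample values are invented for illustration.

```python
import statistics

def summarize(samples):
    """Return mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample s.d. (divides by n - 1)
    return {"mean": mean, "sd": sd, "cov": sd / mean}

# Hypothetical CPU seconds consumed per job.
cpu_seconds = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = summarize(cpu_seconds)
print(stats)
```

A C.O.V. near zero suggests the average alone describes the parameter well; a large C.O.V. suggests the components should be split into classes.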

Single-Parameter Histograms

• Make histogram or kernel density estimate

• Fit probability distribution to shape of histogram

• Chapter 27 (not covered in course) lists many useful shapes

• Ignores multiple-parameter correlations
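A single-parameter histogram can be built with fixed-width bins, as in the sketch below. The request sizes and the bin width of 5 are assumptions chosen for illustration; fitting a distribution to the resulting shape is a separate step not shown here.

```python
from collections import Counter

def histogram(samples, bin_width):
    """Count samples per bin; the bin starting at s covers [s, s + bin_width)."""
    counts = Counter(int(x // bin_width) for x in samples)
    return {k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical request sizes in kilobytes.
request_kb = [1, 3, 4, 4, 7, 12, 13, 13, 14, 31]
bins = histogram(request_kb, bin_width=5)

for start, count in bins.items():
    print(f"{start:>3}-{start + 5:<3} {'#' * count}")
```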

Multi-Parameter Histograms

• Use 3-D plotting package to show 2 parameters
  – Or plot each datum as 2-D point and look for “black spots”
• Shows correlations
  – Allows identification of important parameters
• Not practical for 3 or more parameters

Principal-Component Analysis (PCA)

• How to analyze more than 2 parameters?

• Could plot endless pairs
  – Still might not show complex relationships
• Principal-component analysis solves problem mathematically
  – Rotates parameter set to align with axes
  – Sorts axes by importance
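For just two parameters, the rotation can be worked out by hand, which makes a compact illustration. The sketch below assumes both parameters are first standardized to zero mean and unit variance, so the analysis works on the 2x2 correlation matrix; the measurements are invented. Real studies with more parameters would use a numerical eigenvalue routine instead.

```python
import math
import statistics

def standardize(values):
    """Scale to zero mean and unit (sample) variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def pca_two_params(xs, ys):
    """Return [(fraction of variance, scores), ...] sorted by importance."""
    zx, zy = standardize(xs), standardize(ys)
    r = sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)  # correlation
    # The correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r
    # with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    root2 = math.sqrt(2)
    sum_axis = [(a + b) / root2 for a, b in zip(zx, zy)]
    diff_axis = [(a - b) / root2 for a, b in zip(zx, zy)]
    components = [(1 + r, sum_axis), (1 - r, diff_axis)]
    components.sort(key=lambda c: c[0], reverse=True)
    return [(eig / 2, scores) for eig, scores in components]

# Hypothetical per-interval measurements of two correlated parameters.
packets_sent = [10, 20, 30, 40, 50]
packets_received = [12, 19, 33, 38, 52]
components = pca_two_params(packets_sent, packets_received)
print([round(frac, 3) for frac, _ in components])
```

When the two parameters are strongly correlated, the first component carries nearly all the variance, so the pair can be replaced by a single combined variable.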

Advantages of PCA

• Handles more than two parameters
• Insensitive to scale of original data (when parameters are normalized first)
• Detects dispersion
• Combines correlated parameters into single variable
• Identifies variables by importance

Disadvantages of PCA

• Tedious computation (if no software)
• Still requires hand analysis of final plotted results
• Often difficult to relate results back to original parameters

Markov Models

• Sometimes, distribution isn’t enough
• Requests come in sequences
• Sequencing affects performance
• Example: disk bottleneck
  – Suppose jobs need 1 disk access per CPU slice
  – CPU slice is much faster than disk
  – Strict alternation uses CPU better
  – Long disk access strings slow system

Introduction to Markov Models

• Represent model as state diagram
• Probabilistic transitions between states
• Requests generated on transitions

[Figure: state diagram with states CPU, Network, and Disk; edges labeled with transition probabilities 0.6, 0.4, 0.4, 0.3, 0.3, 0.8, 0.2]

Creating a Markov Model

• Observe long string of activity
• Use matrix to count pairs of states
• Normalize rows to sum to 1.0

CPU Network Disk
CPU 0.6 0.4
Network 0.3 0.4 0.3
Disk 0.8 0.2

Example Markov Model

• Reference string of opens, reads, closes: ORORRCOORCRRRRCC
• Pairwise frequency matrix (rows: current operation; columns: next operation):

        Open  Read  Close  Sum
Open    1     3     0      4
Read    1     4     3      8
Close   1     1     1      3

Markov Model for I/O String

• Divide each row by its sum to get transition matrix:

        Open  Read  Close
Open    0.25  0.75  0.00
Read    0.13  0.50  0.37
Close   0.33  0.33  0.34

• Model:

[Figure: state diagram with states Open, Read, and Close; edges labeled with the transition probabilities from the matrix]
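The counting and normalizing steps for the open/read/close string can be sketched as below. The function names are hypothetical, and the `generate` helper (which replays the model to produce a synthetic request string) is an addition for illustration, not something the slides prescribe.

```python
import random
from collections import Counter

def transition_counts(reference):
    """Count each adjacent pair (current state, next state) in the string."""
    return Counter(zip(reference, reference[1:]))

def transition_matrix(reference, states):
    """Divide each row of pair counts by its sum."""
    counts = transition_counts(reference)
    matrix = {}
    for src in states:
        row_sum = sum(counts[(src, dst)] for dst in states)
        matrix[src] = {
            dst: counts[(src, dst)] / row_sum if row_sum else 0.0 for dst in states
        }
    return matrix

def generate(matrix, start, length, seed=0):
    """Produce a synthetic request string by walking the model."""
    rng = random.Random(seed)
    state, out = start, [start]
    for _ in range(length - 1):
        row = matrix[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
        out.append(state)
    return "".join(out)

reference = "ORORRCOORCRRRRCC"  # O = open, R = read, C = close
matrix = transition_matrix(reference, "ORC")
synthetic = generate(matrix, "O", 50)
print(matrix)
print(synthetic)
```

Computed exactly, the Read row is 0.125, 0.5, 0.375; the slide's 0.13 and 0.37 are those values rounded so the row still sums to 1.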

Clustering

• Often useful to break workload into categories

• “Canonical example” of each category can be used to represent all samples

• If many samples, generating categories is difficult

• Solution: clustering algorithms

Steps in Clustering

• Select sample
• Choose and transform parameters
• Drop outliers
• Scale observations
• Choose distance measure
• Do clustering
• Use results to adjust parameters, repeat
• Choose representative components

Selecting A Sample

• Clustering algorithms are often slow– Must use subset of all observations

• Can test sample after clustering: does every observation fit into some cluster?

• Sampling options
  – Random
  – Heaviest users of component under study

Choosing and Transforming Parameters

• Goal is to limit complexity of problem
• Concentrate on parameters with high impact, high variance
  – Use principal-component analysis
  – Drop a parameter, re-cluster, see if different
• Consider transformations such as Sec. 15.4 (logarithms, etc.)

Dropping Outliers

• Must get rid of observations that would skew results
  – Need great judgment here
  – No firm guidelines
• Drop things that you know are “unusual”
• Keep things that consume major resources
  – E.g., daily backups

Scale Observations

• Cluster analysis is often sensitive to parameter ranges, so scaling affects results

• Options:
  – Scale to zero mean and unit variance
  – Weight based on importance or variance
  – Normalize range to [0, 1]
  – Normalize 95% of data to [0, 1]
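Two of the scaling options listed above can be sketched as follows. The function names are hypothetical; the first uses the sample standard deviation, which is one reasonable reading of "unit variance".

```python
import statistics

def z_score(values):
    """Scale to zero mean and unit (sample) variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def range_normalize(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(z_score([1.0, 2.0, 3.0]))
print(range_normalize([2, 4, 10]))
```

Range normalization is sensitive to a single extreme value, which is the motivation for the variant that maps only the central 95% of the data to [0, 1].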

Choosing a Distance Measure

• Endless possibilities available
• Represent observations as vectors in k-space
• Popular measures include:
  – Euclidean distance, weighted or unweighted
  – Chi-squared distance
  – Rectangular distance
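A minimal sketch of two of these measures on k-dimensional observations. "Rectangular distance" is taken here to mean the city-block (sum of absolute differences) distance, which is an assumption about the intended definition.

```python
import math

def euclidean(a, b, weights=None):
    """Euclidean distance; optional per-parameter weights (unweighted if None)."""
    weights = weights or [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def rectangular(a, b):
    """City-block distance: sum of absolute differences per parameter."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))                  # unweighted
print(euclidean([0, 0], [3, 4], weights=[1, 0]))  # second parameter ignored
print(rectangular([0, 0], [3, 4]))
```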

Clustering Methods

• Many algorithms available
• Computationally expensive (NP-hard)
• Can be simple or hierarchical
• Many require you to specify number of desired clusters
• Minimum Spanning Tree is not only option!

Minimum Spanning Tree Clustering

• Start with each point in its own cluster
• Repeat until single cluster:
  – Compute centroid of each cluster
  – Compute intercluster distances
  – Find smallest distance
  – Merge clusters with smallest distance
• Method produces stable results
  – But not necessarily optimum
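The merge loop above can be sketched as follows, using Euclidean distance between centroids. The `target` parameter, which lets the loop stop before everything collapses into one cluster, is an addition for illustration; with the default of 1 it follows the steps as listed.

```python
import math

def centroid(points):
    """Mean of the points, coordinate by coordinate."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def merge_clusters(points, target=1):
    """Repeatedly merge the two clusters whose centroids are closest."""
    clusters = [[p] for p in points]  # start: each point in its own cluster
    merge_distances = []
    while len(clusters) > target:
        centers = [centroid(c) for c in clusters]
        dist, i, j = min(
            (math.dist(centers[i], centers[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merge_distances.append(dist)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters, merge_distances

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters, merge_distances = merge_clusters(points, target=2)
print(clusters, merge_distances)
```

The recorded merge distances are what a dendrogram would plot; a sudden jump in them is one hint about where to stop merging.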

K-Means Clustering

• One of most popular methods
• Number of clusters is input parameter
• First randomly assign points to clusters
• Repeat until no change:
  – Calculate center of each cluster: (x̄, ȳ)
  – Assign each point to cluster with nearest center
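A minimal sketch of this loop, following the random-assignment start described above. The re-seeding of an empty cluster with a random point and the `max_iter` cap are assumptions added to keep the sketch well-defined; the data points are invented.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centers) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # random initial assignment
    centers = []
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # assumption: re-seed an empty cluster
                members = [rng.choice(points)]
            centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # no change: converged
            break
        labels = new_labels
    return labels, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

Different seeds can give different results on less cleanly separated data, so in practice the algorithm is often run several times.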

Interpreting Clusters

• Art, not science
• Drop small clusters (if little impact on performance)
• Try to find meaningful characterizations
• Choose representative components
  – Number proportional to cluster size or to total resource demands

Drawbacks of Clustering

• Clustering is basically AI problem
• Humans will often see patterns where computer sees none
• Result is extremely sensitive to:
  – Choice of algorithm
  – Parameters of algorithm
  – Minor variations in points clustered
• Results may not have functional meaning
