Post on 21-Dec-2015
Workload Modeling and its Effect on Performance Evaluation
Dror Feitelson
Hebrew University
Performance Evaluation
• In system design
  – Selection of algorithms
  – Setting parameter values
• In procurement decisions
  – Value for money
  – Meet usage goals
• For capacity planning
The Good Old Days…
• The skies were blue
• The simulation results were conclusive
• Our scheme was better than theirs
Feitelson & Jette, JSSPP 1997
But in their papers,
Their scheme was better than ours!
How could they be so wrong?
Performance evaluation depends on:
• The system’s design (what we teach in algorithms and data structures)
• Its implementation (what we teach in programming courses)
• The workload to which it is subjected
• The metric used in the evaluation
• Interactions between these factors
Outline for Today
• Three examples of how workloads affect performance evaluation
• Workload modeling
  – Getting data
  – Fitting, correlations, stationarity…
  – Heavy tails, self similarity…
• Research agenda
In the context of parallel job scheduling
Example #1
Gang Scheduling and
Job Size Distribution
Gang What?!?
Time slicing parallel jobs with coordinated context switching
Ousterhout matrix
Ousterhout, ICDCS 1982
Optimization: alternative scheduling
Packing Jobs
Use a buddy system for allocating processors
Feitelson & Rudolph, Computer 1990
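A minimal sketch of buddy allocation of processors, assuming the machine size is a power of two. The names `buddy_size` and `BuddyAllocator` are illustrative and not taken from the cited paper, and releasing and merging of buddies is omitted for brevity. Each request is rounded up to the next power of two, so a 5-processor job occupies a block of 8.

```python
def buddy_size(request):
    """Return the smallest power of two that is >= request."""
    size = 1
    while size < request:
        size *= 2
    return size


class BuddyAllocator:
    """Allocate processors in power-of-two blocks (allocation only)."""

    def __init__(self, machine_size):
        # free[s] holds the start offsets of free blocks of size s.
        # machine_size is assumed to be a power of two.
        self.free = {machine_size: [0]}

    def allocate(self, request):
        """Return (start, size) of the allocated block, or None if none fits."""
        size = buddy_size(request)
        # Pick the smallest free block that is large enough.
        candidates = [s for s, blocks in self.free.items() if s >= size and blocks]
        if not candidates:
            return None
        block = min(candidates)
        start = self.free[block].pop()
        # Split the block in half repeatedly, keeping the lower half
        # and returning the upper half (the buddy) to the free lists.
        while block > size:
            block //= 2
            self.free.setdefault(block, []).append(start + block)
        return (start, size)


allocator = BuddyAllocator(128)
start, size = allocator.allocate(5)
wasted = size - 5  # processors lost to rounding up
```

In this run the 5-processor request receives the block starting at processor 0 with 8 processors, so 3 processors are allocated but unused.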
The Question:
• The buddy system leads to internal fragmentation
• But it also improves the chances of alternative scheduling, because processors are allocated in predefined groups
Which effect dominates the other?
The Answer (part 1):
Feitelson & Rudolph, JPDC 1996
The Answer (part 2):
• Many small jobs
• Many sequential jobs
• Many power of two jobs
• Practically no jobs use full machine
Conclusion: buddy system should work well
Verification
Feitelson, JSSPP 1996
Example #2
Parallel Job Scheduling
and Job Scaling
Variable Partitioning
• Each job gets a dedicated partition for the duration of its execution
• Resembles 2D bin packing
• Packing large jobs first should lead to better performance
• But what about correlation of size and runtime?
Scaling Models
• Constant work
  – Parallelism for speedup: Amdahl’s Law
  – Large first is equivalent to SJF (larger jobs run for less time)
• Constant time
  – Size and runtime are uncorrelated
• Memory bound
  – Large first is equivalent to LJF (larger jobs run longer)
  – Full-size jobs lead to blockout
Worley, SIAM JSSC 1990
“Scan” Algorithm
• Keep jobs in separate queues according to size (sizes are powers of 2)
• Serve the queues Round Robin, scheduling all jobs from each queue (they pack perfectly)
• Assuming constant work model, large jobs only block the machine for a short time
• But the memory bound model would lead to excessive queueing of small jobs
Krueger et al., IEEE TPDS 1994
The Data
Data: SDSC Paragon, 1995/6
Conclusion
• Parallelism used for better results, not for faster results
• Constant work model is unrealistic
• Memory bound model is reasonable
• Scan algorithm will probably not perform well in practice
Example #3
Backfilling and
User Runtime Estimation
Backfilling
• Variable partitioning can suffer from external fragmentation
• Backfilling optimization: move jobs forward to fill in holes in the schedule
• Requires knowledge of expected job runtimes
Variants
• EASY backfilling
  – Make reservation for first queued job
• Conservative backfilling
  – Make reservation for all queued jobs
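A minimal sketch of one EASY scheduling step, under stated assumptions: jobs are tuples `(job_id, size, estimate)`, running jobs are `(end_time, size)` with end times taken from user estimates, and every job fits on the machine. The function name and data layout are illustrative, not from any particular scheduler. Only the first queued job gets a reservation; later jobs may start if they fit now and either finish before that reservation or use only processors the reserved job will not need.

```python
def easy_backfill(now, free, running, queue):
    """Return the ids of queued jobs that may start at time `now`.

    free    -- number of idle processors
    running -- list of (end_time, size) for running jobs
    queue   -- list of (job_id, size, estimate) in arrival order
    """
    started = []
    running = sorted(running)
    queue = list(queue)

    # Start jobs from the head of the queue for as long as they fit.
    while queue and queue[0][1] <= free:
        job_id, size, estimate = queue.pop(0)
        free -= size
        running.append((now + estimate, size))
        started.append(job_id)
    if not queue:
        return started

    # Reservation for the first queued job: find the earliest time
    # (the shadow time) at which enough processors will be free.
    head_size = queue[0][1]
    available = free
    shadow_time = None
    for end_time, size in sorted(running):
        available += size
        if available >= head_size:
            shadow_time = end_time
            break
    if shadow_time is None:
        return started  # head job is larger than the machine
    extra = available - head_size  # processors the head job will not need

    # Backfill: a later job may start if it fits now and does not
    # delay the reservation of the head job.
    for job_id, size, estimate in queue[1:]:
        if size > free:
            continue
        if now + estimate <= shadow_time:
            free -= size
            started.append(job_id)
        elif size <= extra:
            free -= size
            extra -= size
            started.append(job_id)
    return started
```

Conservative backfilling would instead compute a reservation for every queued job and admit a backfill candidate only if none of those reservations is delayed.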
User Runtime Estimates
• Lower estimates improve chance of backfilling and better response time
• Too low estimates run the risk of having the job killed
• So estimates should be accurate, right?
They Aren’t
Mu’alem & Feitelson, IEEE TPDS 2001
Surprising Consequences
• Inaccurate estimates actually lead to improved performance
• Performance evaluation results may depend on the accuracy of runtime estimates
  – Example: EASY vs. conservative
  – Using different workloads
  – And different metrics
EASY vs. Conservative
Using CTC SP2 workload
EASY vs. Conservative
Using Jann workload model
EASY vs. Conservative
Using Feitelson workload model
Conflicting Results Explained
• Jann uses accurate runtime estimates
• This leads to a tighter schedule
• EASY is not affected too much
• Conservative manages less backfilling of long jobs, because it respects more reservations
Conservative is bad for the long jobs, but good for short ones whose reservations are respected
Conservative
EASY
Conflicting Results Explained
• Response time sensitive to long jobs, which favor EASY
• Slowdown sensitive to short jobs, which favor conservative
• All this does not happen at CTC, because estimates are so loose that backfill can occur even under conservative
Verification
Run CTC workload with accurate estimates
But What About My Model?
Simply does not have such small long jobs
Workload Data Sources
No Data
• Innovative unprecedented systems
  – Wireless
  – Hand-held
• Use an educated guess
  – Self similarity
  – Heavy tails
  – Zipf distribution
Serendipitous Data
• Data may be collected for various reasons
  – Accounting logs
  – Audit logs
  – Debugging logs
  – Just-so logs
• Can lead to wealth of information
NASA Ames iPSC/860 log
42050 jobs from Oct-Dec 1993
user      job     nodes  runtime  date      time
user4     cmd8    32     70       11/10/93  10:13:17
user4     cmd8    32     70       11/10/93  10:19:30
user42    nqs450  32     3300     11/10/93  10:22:07
user41    cmd342  4      54       11/10/93  10:22:37
sysadmin  pwd     1      6        11/10/93  10:22:42
user4     cmd8    32     60       11/10/93  10:25:42
sysadmin  pwd     1      3        11/10/93  10:30:43
user41    cmd342  4      126      11/10/93  10:31:32
Feitelson & Nitzberg, JSSPP 1995
Distribution of Job Sizes
Distribution of Resource Use
Degree of Multiprogramming
System Utilization
Job Arrivals
Arriving Job Sizes
Distribution of Interarrival Times
Distribution of Runtimes
User Activity
Repeated Execution
Application Moldability
Distribution of Run Lengths
Predictability in Repeated Runs
Recurring Findings
• Many small and serial jobs
• Many power-of-two jobs
• Weak correlation of job size and duration
• Job runtimes are bounded but have CV>1
• Inaccurate user runtime estimates
• Non-stationary arrivals (daily/weekly cycle)
• Power-law user activity, run lengths
Instrumentation
• Passive: snoop without interfering
• Active: modify the system
  – Collecting the data interferes with system behavior
  – Saving or downloading the data causes additional interference
  – Partial solution: model the interference
Data Sanitation
• Strange things happen
• Leaving them in is “safe” and “faithful” to the real data
• But it risks situations in which a non-representative situation dominates the evaluation results
Arrivals to SDSC SP2
Arrivals to LANL CM-5
Arrivals to CTC SP2
Arrivals to SDSC Paragon
What are they doing at 3:30 AM?
3:30 AM
• Nearly every day, a set of 16 jobs are run by the same user
• Most probably the same set, as they typically have a similar pattern of runtimes
• Most probably these are administrative jobs that are executed automatically
Arrivals to CTC SP2
Arrivals to SDSC SP2
Arrivals to LANL CM-5
Arrivals to SDSC Paragon
Are These Outliers?
• These large activity outbreaks are easily distinguished from normal activity
• They last for several days to a few weeks
• They appear at intervals of several months to more than a year
• They are each caused by a single user!– Therefore easy to remove
Two Aspects
• In workload modeling, should you include this in the model?
  – In a general model, probably not
  – Conduct separate evaluation for special conditions (e.g. DoS attack)
• In evaluations using raw workload data, there is a danger of bias due to unknown special circumstances
Automation
• The idea:
  – Cluster daily data based on various workload attributes
  – Remove days that appear alone in a cluster
  – Repeat
• The problem:
  – Strange behavior often spans multiple days
Cirne & Berman, Wkshp Workload Charact. 2001
Workload Modeling
Statistical Modeling
• Identify attributes of the workload
• Create empirical distribution of each attribute
• Fit empirical distribution to create model
• Synthetic workload is created by sampling from the model distributions
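A minimal sketch of the last step, assuming the simplest possible model: each attribute's empirical distribution is used directly as the model, with no fitting to a functional form. The function names are illustrative, and the input pairs are the `(nodes, runtime)` values of the eight NASA Ames iPSC/860 log records shown earlier. Attributes are sampled independently here, which ignores any correlation between them.

```python
import bisect
import random


def empirical_cdf(observations):
    """Return (sorted values, cumulative probabilities) for one attribute."""
    values = sorted(observations)
    n = len(values)
    return values, [(i + 1) / n for i in range(n)]


def sample(model, rng):
    """Draw one value by inverting the empirical CDF."""
    values, cdf = model
    return values[bisect.bisect_left(cdf, rng.random())]


def synthetic_workload(jobs, count, seed=0):
    """Generate `count` synthetic (size, runtime) jobs.

    Size and runtime are sampled independently of each other.
    """
    rng = random.Random(seed)
    size_model = empirical_cdf([size for size, _ in jobs])
    runtime_model = empirical_cdf([runtime for _, runtime in jobs])
    return [(sample(size_model, rng), sample(runtime_model, rng))
            for _ in range(count)]


observed = [(32, 70), (32, 70), (32, 3300), (4, 54),
            (1, 6), (32, 60), (1, 3), (4, 126)]
workload = synthetic_workload(observed, count=100)
```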
Fitting by Moments
• Calculate model parameters to fit moments of empirical data
• Problem: does not fit the shape of the distribution
Jann et al, JSSPP 1997
• Problem: very sensitive to extreme data values
Effect of Extreme Runtime Values
Change when top records omitted
omit    mean    CV
0.01%   -2.1%   -29%
0.02%   -3.0%   -35%
0.04%   -3.7%   -39%
0.08%   -4.6%   -39%
0.16%   -5.7%   -42%
0.31%   -7.1%   -42%
Downey & Feitelson, PER 1999
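A minimal sketch of how such a sensitivity check could be computed, assuming the data is simply a list of runtimes. The function names and the toy data set are hypothetical, chosen only to make the effect visible: one extreme value among 100 records.

```python
import statistics


def mean_and_cv(data):
    """Return the mean and the coefficient of variation (std / mean)."""
    mean = statistics.fmean(data)
    return mean, statistics.pstdev(data) / mean


def change_when_omitted(data, fraction):
    """Relative change in mean and CV when the top `fraction` of records is dropped."""
    ordered = sorted(data)
    k = max(1, round(len(ordered) * fraction))  # drop at least one record
    mean_all, cv_all = mean_and_cv(ordered)
    mean_cut, cv_cut = mean_and_cv(ordered[:-k])
    return (mean_cut - mean_all) / mean_all, (cv_cut - cv_all) / cv_all


# Hypothetical data: 99 jobs of 1 second and a single job of 1000 seconds.
runtimes = [1.0] * 99 + [1000.0]
mean_change, cv_change = change_when_omitted(runtimes, 0.01)
```

In this toy case removing a single record cuts the mean by about 91% and reduces the CV to zero, which is why parameters fitted to moments can hinge on a handful of records.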
Alternative: Fit to Shape
• Maximum likelihood: what distribution parameters were most likely to lead to the given observations
  – Needs initial guess of functional form
• Phase type distributions
  – Construct the desired shape
• Goodness of fit
  – Kolmogorov-Smirnov: difference in CDFs
  – Anderson-Darling: added emphasis on tail
  – May need to sample observations
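A minimal sketch combining two of these ideas, assuming the initial guess of functional form is an exponential distribution (chosen here only because its maximum likelihood estimate has a closed form: the rate is the reciprocal of the sample mean). The Kolmogorov-Smirnov statistic is then the largest gap between the empirical CDF and the fitted CDF. Function names are illustrative, and the runtimes are the eight values from the NASA Ames log excerpt.

```python
import math


def fit_exponential_mle(data):
    """Maximum likelihood rate of an exponential distribution: n / sum(x)."""
    return len(data) / sum(data)


def ks_statistic(data, cdf):
    """Largest vertical distance between the empirical CDF and `cdf`."""
    ordered = sorted(data)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        model = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, (i + 1) / n - model, model - i / n)
    return distance


runtimes = [70, 70, 3300, 54, 6, 60, 3, 126]
rate = fit_exponential_mle(runtimes)
distance = ks_statistic(runtimes, lambda x: 1.0 - math.exp(-rate * x))
```

A large distance here would suggest that the exponential guess does not capture the shape, even though its mean matches the data exactly.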
Correlations
• Correlation can be measured by the correlation coefficient
• It can be modeled by a joint distribution function
• Both may not be very useful
Correlation Coefficient
system         CC
CTC SP2        -0.029
KTH SP2         0.011
SDSC SP2        0.145
LANL CM-5       0.211
SDSC Paragon    0.305
Gives low results for correlation of runtime and size in parallel systems
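$$\mathrm{CC} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$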
Distributions
A restricted version of a joint distribution
Modeling Correlation
• Divide range of one attribute into sub-ranges
• Create a separate model of other attribute for each sub-range
• Models can be independent, or model parameter can depend on sub-range
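A minimal sketch of the sub-range approach, assuming job size is the attribute that is divided and runtime is modeled separately within each size range. The per-range model is just the list of observed runtimes (an empirical model); the function names and the boundaries are illustrative, and each sub-range is assumed to contain at least one observation.

```python
import random


def build_conditional_model(jobs, boundaries):
    """Group runtimes by size sub-range.

    jobs       -- (size, runtime) pairs
    boundaries -- ascending upper limits of the size sub-ranges
    """
    model = [[] for _ in boundaries]
    for size, runtime in jobs:
        for index, limit in enumerate(boundaries):
            if size <= limit:
                model[index].append(runtime)
                break
    return model


def sample_runtime(model, boundaries, size, rng):
    """Sample a runtime from the model of the sub-range that contains `size`."""
    for index, limit in enumerate(boundaries):
        if size <= limit:
            return rng.choice(model[index])
    raise ValueError("size exceeds the largest sub-range")


# (nodes, runtime) pairs from the NASA Ames iPSC/860 log excerpt.
observed = [(32, 70), (32, 70), (32, 3300), (4, 54),
            (1, 6), (32, 60), (1, 3), (4, 126)]
boundaries = [1, 8, 128]  # serial, small, large (illustrative choice)
model = build_conditional_model(observed, boundaries)
runtime = sample_runtime(model, boundaries, 4, random.Random(0))
```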
Stationarity
• Problem of daily/weekly activity cycle
  – Not important if unit of activity is very small (network packet)
  – Very meaningful if unit of work is long (parallel job)
How to Modify the Load
• Multiply interarrivals or runtimes by a factor
  – Changes the effective length of the day
• Multiply machine size by a factor
  – Modifies packing properties
• Add users
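A minimal sketch of the first option, assuming jobs are `(arrival, size, runtime)` tuples and that load is measured as total work divided by machine capacity over the span of arrivals. Both the definition and the function names are simplifying assumptions for illustration.

```python
def offered_load(jobs, machine_size):
    """Total work divided by (machine size * span of arrival times)."""
    arrivals = [arrival for arrival, _, _ in jobs]
    span = max(arrivals) - min(arrivals)
    work = sum(size * runtime for _, size, runtime in jobs)
    return work / (machine_size * span)


def scale_interarrivals(jobs, factor):
    """Multiply all interarrival times by `factor` (factor < 1 raises the load)."""
    start = min(arrival for arrival, _, _ in jobs)
    return [(start + (arrival - start) * factor, size, runtime)
            for arrival, size, runtime in jobs]


# Hypothetical jobs on a 16-processor machine.
jobs = [(0, 4, 100), (100, 8, 50), (200, 2, 300)]
base = offered_load(jobs, 16)
doubled = offered_load(scale_interarrivals(jobs, 0.5), 16)
```

Halving the interarrival times doubles the load, but it also compresses any daily cycle in the arrivals into twelve hours, which is the side effect noted in the first bullet.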
Stationarity
• Problem of new/old system
  – Immature workload
  – Leftover workload
Heavy Tails
Tail Types
When a distribution has mean m, what is the distribution of samples that are larger than x?
• Light: expected to be smaller than x+m
• Memoryless: expected to be x+m
• Heavy: expected to be larger than x+m
Formal Definition
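Tail decays according to a power law:
$$\bar{F}(x) = \Pr(X > x) \propto x^{-a}, \qquad 0 < a \le 2$$
Test: log-log complementary distribution (LLCD), which is linear in the tail:
$$\log \bar{F}(x) = -a \log x + \text{const}$$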
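A minimal sketch of the LLCD test, assuming positive samples. The empirical survival function is taken as the fraction of samples greater than or equal to each value (rather than strictly greater) so that the largest sample can still be plotted, and the tail slope is estimated by ordinary least squares over the top 10% of samples. Both choices, and the function names, are illustrative.

```python
import math


def llcd_points(samples):
    """Return (log10 x, log10 of fraction of samples >= x) for each sample."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(math.log10(x), math.log10((n - i) / n))
            for i, x in enumerate(ordered)]


def tail_slope(points, tail_fraction=0.1):
    """Least-squares slope of the LLCD over the largest samples."""
    start = int(len(points) * (1.0 - tail_fraction))
    tail = points[start:]
    mean_x = sum(x for x, _ in tail) / len(tail)
    mean_y = sum(y for _, y in tail) / len(tail)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in tail)
    denominator = sum((x - mean_x) ** 2 for x, _ in tail)
    return numerator / denominator


# Synthetic check: exact quantiles of a power-law tail with a = 1.5.
a, n = 1.5, 1000
samples = [((n - i) / n) ** (-1.0 / a) for i in range(n)]
estimated_a = -tail_slope(llcd_points(samples))
```

On real data the tail of the plot is noisy, so the estimate depends on where the fit starts; a roughly straight line over several orders of magnitude is the qualitative sign of a heavy tail.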
Consequences
• Large deviations from the mean are realistic
• Mass disparity
  – Small fraction of samples responsible for large part of total mass
  – Most samples together account for negligible part of mass
Crovella, JSSPP 2001
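A minimal sketch of how mass disparity could be quantified, assuming it is measured as the share of the total mass held by the largest samples. The function name and the toy data are hypothetical.

```python
def mass_share_of_top(samples, fraction):
    """Share of the total mass contributed by the top `fraction` of samples."""
    ordered = sorted(samples, reverse=True)
    k = max(1, round(len(ordered) * fraction))  # at least one sample
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical file sizes: 99 files of 1 unit and one file of 901 units.
sizes = [1] * 99 + [901]
top_share = mass_share_of_top(sizes, 0.01)
rest_share = 1.0 - top_share
```

Here the top 1% of the files hold about 90% of the bytes, while the remaining 99% together account for about 10%.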
Unix File Sizes Survey, 1993
Unix File Sizes LLCD
Consequences
• Infinite moments
  – For $a \le 1$ the mean is undefined
  – For $a \le 2$ the variance is undefined
Crovella, JSSPP 2001
Pareto Distribution
With parameter $a = 1$ the density is proportional to $x^{-2}$
The expectation is then
$$E[x] = \int c\,x\,\frac{1}{x^{2}}\,dx = c \ln x$$
i.e. it grows with the number of samples
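A worked version of this claim, assuming for illustration that the density is normalized on $x \ge 1$, so that $c = 1$ and $f(x) = 1/x^2$. The largest of $N$ samples is then typically of order $N$, because the probability of exceeding $N$ is $1/N$. Cutting the integral off at that point shows the logarithmic growth:

```latex
\Pr(X > x) = \int_x^\infty \frac{1}{t^2}\,dt = \frac{1}{x}
\qquad\Longrightarrow\qquad
\Pr(X > N) = \frac{1}{N}

\int_1^N x \cdot \frac{1}{x^2}\,dx = \ln N
```

So the average of $N$ samples behaves roughly like $\ln N$ and never settles, in contrast to a distribution with a finite mean.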
Pareto Samples
Effect of Samples from Tail
• In simulation:
  – A single sample may dominate results
  – Example: response times of processes
• In analysis:
  – Average long-term behavior may never happen in practice
Real Life
• Data samples are necessarily bounded
• The question is how to generalize to the model distribution
  – Arbitrary truncation
  – Lognormal or phase-type distributions
  – Something in between
Solution 1: Truncation
• Postulate an upper bound on the distribution
• Question: where to put the upper bound
• Probably OK for qualitative analysis
• May be problematic for quantitative simulations
Solution 2: Model the Sample
• Approximate the empirical distribution using a mixture of exponentials (e.g. phase-type distributions)
• In particular, exponential decay beyond highest sample
• In some cases, a lognormal distribution provides a good fit
• Good for mathematical analysis
Solution 3: Dynamic
• Place an upper bound on the distribution
• Location of bound depends on total number of samples required
• Example: $F(B) = 1 - \frac{1}{2N}$
Note: does not change during simulation
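A minimal sketch of dynamic truncation, assuming the model is a Pareto distribution with CDF $F(x) = 1 - (k/x)^a$ for $x \ge k$ (an assumption for illustration; the parameter names `alpha` and `k` are not from the slides). Solving $F(B) = 1 - 1/(2N)$ gives $B = k\,(2N)^{1/a}$. The bound is computed once from the planned number of samples $N$ and then held fixed while sampling.

```python
import random


def dynamic_bound(num_samples, alpha, k=1.0):
    """Upper bound B with F(B) = 1 - 1/(2N) for a Pareto(alpha, k) distribution."""
    return k * (2.0 * num_samples) ** (1.0 / alpha)


def truncated_pareto_samples(num_samples, alpha, k=1.0, seed=0):
    """Draw N samples from the Pareto distribution truncated at the dynamic bound."""
    rng = random.Random(seed)
    cdf_at_bound = 1.0 - 1.0 / (2.0 * num_samples)
    samples = []
    for _ in range(num_samples):
        # Inverse-transform sampling restricted to the range [0, F(B)).
        u = rng.random() * cdf_at_bound
        samples.append(k * (1.0 - u) ** (-1.0 / alpha))
    return samples


bound = dynamic_bound(1000, alpha=1.0)
samples = truncated_pareto_samples(1000, alpha=1.0)
```

With $N = 1000$ and $a = 1$ the bound is 2000, so a longer simulation (larger $N$) is allowed to see proportionally larger values.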
Self Similarity
The Phenomenon
• The whole has the same structure as certain parts
• Example: fractals
• In workloads: burstiness at many different time scales
Note: relates to a time series
Job Arrivals to SDSC Paragon
Process Arrivals to SDSC Paragon
Long-Range Correlation
• A burst of activity implies that values in the time series are correlated
• A burst covering a large time frame implies correlation over a long range
• This is contrary to assumptions about the independence of samples
Aggregation
• Replace each subsequence of m consecutive values by their mean
• If self-similar, the new series will have statistical properties that are similar to the original (i.e. bursty)
• If independent, will tend to average out
Poisson Arrivals
Tests
• Essentially based on the burstiness-retaining nature of aggregation
• Rescaled range (R/s) metric: the range covered by the cumulative sums of deviations from the mean of n samples, normalized by their standard deviation s, as a function of n
R/s Metric
Tests
• Variance-time metric: the variance of an aggregated time series as a function of the aggregation level
Variance Time Metric
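A minimal sketch of aggregation and the variance-time test. For independent samples the variance of the aggregated series falls as $1/m$, giving a slope of $-1$ on a log-log plot; a self-similar series decays more slowly. The slope is commonly converted to a Hurst parameter as $H = 1 + \text{slope}/2$. Function names and the choice of aggregation levels are illustrative.

```python
import math
import random
import statistics


def aggregate(series, m):
    """Replace each run of m consecutive values by their mean (partial tail dropped)."""
    blocks = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(blocks)]


def variance_time_slope(series, levels):
    """Least-squares slope of log10(variance) against log10(aggregation level)."""
    xs = [math.log10(m) for m in levels]
    ys = [math.log10(statistics.pvariance(aggregate(series, m))) for m in levels]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def hurst_from_slope(slope):
    """H = 0.5 for independent data; H approaching 1 indicates self similarity."""
    return 1.0 + slope / 2.0


# Independent samples as a baseline: the slope should be close to -1.
rng = random.Random(0)
independent = [rng.random() for _ in range(16384)]
slope = variance_time_slope(independent, [1, 2, 4, 8, 16, 32, 64])
hurst = hurst_from_slope(slope)
```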
Modeling Self Similarity
• Generate workload by an on-off process
  – During on period, generate work at steady pace
  – During off period do nothing
• On and off period lengths are heavy tailed
• Multiplex many such sources
• Leads to long-range correlation
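A minimal sketch of this construction in discrete time, assuming Pareto-distributed on and off period lengths (via `random.paretovariate`) and one unit of work per time slot while a source is on. The shape parameter `alpha=1.5`, the number of sources, and the function names are illustrative choices.

```python
import math
import random


def on_off_source(length, alpha, rng):
    """Return a 0/1 series of `length` slots with heavy-tailed on and off periods."""
    series = []
    active = rng.random() < 0.5  # random initial state
    while len(series) < length:
        # Pareto period length, rounded up and clipped to the remaining slots.
        period = min(math.ceil(rng.paretovariate(alpha)), length - len(series))
        series.extend([1 if active else 0] * period)
        active = not active
    return series


def multiplexed_workload(num_sources, length, alpha=1.5, seed=0):
    """Sum the work generated by many independent on-off sources per slot."""
    rng = random.Random(seed)
    sources = [on_off_source(length, alpha, rng) for _ in range(num_sources)]
    return [sum(column) for column in zip(*sources)]


workload = multiplexed_workload(num_sources=20, length=1000)
```

Each value of `workload` is the number of sources active in that slot; replacing the Pareto periods with exponential ones would be a natural baseline for comparison.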
Research Areas
Effect of Users
• Workload is generated by users
• Human users do not behave like a random sampling process
  – Feedback based on system performance
  – Repetitive working patterns
Feedback
• User population is finite
• Users back off when performance is inadequate
Negative feedback leads to better system stability
• Need to explicitly model this behavior
Locality of Sampling
• Users display different levels of activity at different times
• At any given time, only a small subset of users is active
Active Users
Locality of Sampling
• These users repeatedly do the same thing
• Workload observed by system is not a random sample from long-term distribution
SDSC Paragon Data
Growing Variability
SDSC Paragon Data
Locality of Sampling
The questions:
• How does this affect the results of performance evaluation?
• Can this be exploited by the system, e.g. by a scheduler?
Hierarchical Workload Models
• Model of user population
  – Modify load by adding/deleting users
• Model of a single user’s activity
  – Built-in self similarity using heavy-tailed on/off times
• Model of application behavior and internal structure
  – Capture interaction with system attributes
A Small Problem
• We don’t have data for these models
• Especially for user behavior such as feedback
  – Need interaction with cognitive scientists
• And for distribution of application types and their parameters
  – Need detailed instrumentation
Final Words…
We like to think that we design systems based on solid foundations…
But beware:
the foundations might be unfounded assumptions!
We should have more “science” in computer science:
• Collect data rather than make assumptions
• Run experiments under different conditions
• Make measurements and observations
• Make predictions and verify them
• Share data and programs to promote good practices and ensure comparability
Computer Systems are Complex
Advice from the Experts
“Science is built of facts as a house is built of stones. But a collection of facts is no more a science than a heap of stones is a house”
-- Henri Poincaré
“Everything should be made as simple as possible, but not simpler”
-- Albert Einstein
Acknowledgements
• Students: Ahuva Mu’alem, David Talby, Uri Lublin
• Larry Rudolph / MIT
• Data in Parallel Workloads Archive
  – Joefon Jann / IBM
  – Allen Downey / Wellesley
  – CTC SP2 log / Steven Hotovy
  – SDSC Paragon log / Reagan Moore
  – SDSC SP2 log / Victor Hazlewood
  – LANL CM-5 log / Curt Canada
  – NASA iPSC/860 log / Bill Nitzberg