Active Sampling for Accelerated Learning of Performance Models

Piyush Shivam, Shivnath Babu, Jeff Chase

Duke University

Networked Computing Utility

[Figure: clusters C1, C2, C3 at Site A, Site B, Site C; task scheduler; task workflow.]

• A network of clusters or grid sites.
• Each site is a pool of heterogeneous resources (e.g., CPU, memory, storage, network), managed as a shared utility.
• Jobs are task/data workflows.
• Challenge: choose the ‘best’ resource mapping/schedule for the job mix.
– Instance of “utility resource planning”.
– Solution under construction: NIMO

Subproblem: Predict Job Completion Time

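Sample | CPU speed | Memory size | Network latency | Disk spindles | Execution time
s1     | 2.4 GHz   | 2 GB        | 1 ms            | 10            | 2 hours
. . .  | . . .     | . . .       | . . .           | . . .         | . . .
. . .  | . . .     | . . .       | . . .           | . . .         | . . .

(CPU speed, memory size, network latency, and disk spindles are the attributes of each sample.)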

Premises (Limitations)

• Important batch applications are run repeatedly.
– Most resources are consumed by applications we have seen in the past.
• Behavior is predictable across data sets.
– …given some attributes associated with the data set.
– Stable behavior per unit of data processed (D)
– D is predictable from data set attributes.
• Behavior depends only on resource attributes.
– CPU type and clock, seek time, spindle count.
• Utility controls the resources assigned to each job.
– Virtualization enables precise control.
• Your mileage may vary.

NIMO: NonInvasive Modeling for Optimization

• NIMO learns end-to-end performance models
– Models predict performance as a function of (a) application profile, (b) data set profile, and (c) resource profile of candidate resource assignment
• NIMO is active
– NIMO collects training data for learning models by conducting proactive experiments on a ‘workbench’
• NIMO is noninvasive

The Big Picture

[Figure: NIMO architecture. Labels: scheduler; model (“What if…”); app/data profiles; candidate resource profiles; (target) performance; application profiler; resource profiler; training set database; active learning; jobs, benchmarks; clusters C1, C2, C3 at Site A, Site B, Site C.]

• Pervasive instrumentation
• Correlate metrics with job logs

Generic End-to-End Model

$$T = D \times (O_a + O_n + O_d), \qquad O_s = O_n + O_d$$

• T: completion time
• D: total data
• O_a: compute occupancy (compute phases: compute resource busy)
• O_s: stall occupancy (stall phases: compute resource stalled on I/O)
• O_n: network occupancy
• O_d: storage occupancy

Occupancy: average time consumed per unit of data; directly observable.

Statistical Learning

[Figure: statistical learning relates the independent variables (resource profile ( ), data profile ( )) to the dependent variables.]

Complexity (e.g., latency hiding, concurrency, arm contention) is captured implicitly in the training data rather than in the structure of the model.
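The generic end-to-end model can be written down directly. The sketch below is only an illustration: the class and function names are hypothetical, and the numbers are made up rather than taken from any NIMO measurement.

```python
from dataclasses import dataclass


@dataclass
class Occupancies:
    """Average time consumed per unit of data, split by resource."""
    compute: float  # O_a: compute resource busy
    network: float  # O_n: compute resource stalled on network I/O
    storage: float  # O_d: compute resource stalled on storage I/O

    @property
    def stall(self) -> float:
        # O_s = O_n + O_d
        return self.network + self.storage


def completion_time(total_data: float, occ: Occupancies) -> float:
    """T = D * (O_a + O_s), with O_s = O_n + O_d."""
    return total_data * (occ.compute + occ.stall)


# Made-up values for illustration: 1000 units of data,
# occupancies in seconds per unit of data.
occ = Occupancies(compute=2.0, network=0.5, storage=1.5)
t = completion_time(1000.0, occ)
print(t)
```

In NIMO the occupancies themselves are what gets learned from training samples; here they are simply given.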

Sampling Challenges

• Full system operating range
– Samples must cover space of candidate resource assignments
• Cost of sample acquisition
– Acquiring a sample has a non-negligible cost, e.g., time to acquire a sample, or opportunity cost for the application
• Curse of dimensionality
– Too many parameters!
– E.g., 10 dimensions × 10 values per dimension
– 5 minutes for each sample => 951 years for 1% samples!
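The 951-year figure can be checked with a few lines of arithmetic. This sketch assumes samples are acquired one after another and a 365-day year; the constant names are ours.

```python
DIMENSIONS = 10
VALUES_PER_DIMENSION = 10
MINUTES_PER_SAMPLE = 5
SAMPLE_PERCENT = 1  # sample 1% of the space

# Every combination of attribute values is one point in the sample space.
total_points = VALUES_PER_DIMENSION ** DIMENSIONS
samples_needed = total_points * SAMPLE_PERCENT // 100

total_minutes = samples_needed * MINUTES_PER_SAMPLE
MINUTES_PER_YEAR = 60 * 24 * 365  # assumption: 365-day year
years = total_minutes / MINUTES_PER_YEAR

print(f"{samples_needed} samples, about {years:.0f} years")
```

The space has 10^10 points, so 1% is 10^8 samples, or 5 × 10^8 minutes of sequential experiments.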

Active Learning in NIMO

[Figure: accuracy of current model (up to 100%) vs. number of training samples, with curves for passive sampling and active sampling.]

• Passive sampling might not expose the system operating range
• Active sampling using “design of experiments” collects most relevant training data
• Automatic and quick

How to learn accurate models quickly?

Sample Carefully

[Figure: accuracy of current model (up to 100%) vs. number of training samples, with curves for passive sampling, active sampling without acceleration, and active sampling with acceleration.]

Active Sampling Challenges

• How to expose the main factors and interactions in the shortest time?
– Which dimensions/attributes to perturb?
– What values to choose for the attributes?
• Where to conduct the experiment?
– On a separate system (“workbench”) or “live”?

Planning ‘active’ experiments

1. Choose a predictor function to refine
• Focus in on the most significant/relevant predictors … or … the least accurate
• Example: CPU-intensive app needs an accurate compute time predictor
2. Choose attribute (if any) to add to the predictor
• Example: CPU speed
3. Choose the values of the attributes
4. Conduct the experiment
5. Compute current prediction error; go to Step 1
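The five steps above can be sketched as a loop over a toy workbench. Everything concrete here is an assumption made for illustration, not NIMO's implementation: the two hidden "true" occupancy functions, the nearest-sample predictor, the attribute levels, and the rule for Step 2 (add the next attribute once every untried assignment for the current set is used up, a simplification of an error-reduction threshold).

```python
from itertools import product
from typing import Callable, Dict

Assignment = Dict[str, float]

# Hypothetical workbench: hidden "true" occupancies, invented for this toy.
TRUTH: Dict[str, Callable[[Assignment], float]] = {
    "compute": lambda r: 10.0 / r["cpu_ghz"],
    "stall": lambda r: 1.0 + 2.0 * r["latency_ms"],
}
LEVELS = {"cpu_ghz": [1.0, 2.0, 3.0], "latency_ms": [0.0, 5.0, 10.0]}
REFERENCE = {"cpu_ghz": 1.0, "latency_ms": 0.0}
# Static attribute order per predictor (assumed to be known in advance).
ATTRIBUTE_ORDER = {
    "compute": ["cpu_ghz", "latency_ms"],
    "stall": ["latency_ms", "cpu_ghz"],
}


def run_experiment(assignment):
    """Step 4 stand-in: one run yields every occupancy."""
    return {name: f(assignment) for name, f in TRUTH.items()}


def predict(samples, name, attrs, x):
    """Toy predictor: copy the training sample nearest to x on attrs."""
    if not samples:
        return 0.0
    nearest = min(samples, key=lambda s: sum(abs(s[0][a] - x[a]) for a in attrs))
    return nearest[1][name]


def mean_rel_error(samples, name, attrs, test_set):
    errs = [abs(predict(samples, name, attrs, x) - y[name]) / y[name]
            for x, y in test_set]
    return sum(errs) / len(errs)


def untried(attrs, tried):
    """Assignments that vary only attrs; other attributes stay at REFERENCE."""
    out = []
    for combo in product(*(LEVELS[a] for a in attrs)):
        assignment = {**REFERENCE, **dict(zip(attrs, combo))}
        if tuple(sorted(assignment.items())) not in tried:
            out.append(assignment)
    return out


def active_sampling(target_error=0.05, budget=9):
    grid = [dict(zip(LEVELS, c)) for c in product(*LEVELS.values())]
    test_set = [(x, run_experiment(x)) for x in grid]
    samples, tried = [], set()
    attrs = {name: [] for name in TRUTH}
    errors = {n: mean_rel_error(samples, n, attrs[n], test_set) for n in TRUTH}
    while len(samples) < budget and max(errors.values()) > target_error:
        # Step 1: refine the least accurate predictor.
        name = max(errors, key=errors.get)
        # Step 2: add the next attribute once the current set is used up.
        candidates = untried(attrs[name], tried)
        if not candidates and len(attrs[name]) < len(ATTRIBUTE_ORDER[name]):
            attrs[name].append(ATTRIBUTE_ORDER[name][len(attrs[name])])
            candidates = untried(attrs[name], tried)
        if not candidates:
            break
        # Step 3: choose the attribute values (here: first untried assignment).
        assignment = candidates[0]
        tried.add(tuple(sorted(assignment.items())))
        # Step 4: conduct the experiment.
        samples.append((assignment, run_experiment(assignment)))
        # Step 5: recompute the prediction error of every predictor.
        errors = {n: mean_rel_error(samples, n, attrs[n], test_set) for n in TRUTH}
    return len(samples), attrs, errors


n_experiments, chosen_attrs, final_errors = active_sampling()
print(n_experiments, chosen_attrs, final_errors)
```

In this toy the loop stops after 5 of the 9 possible assignments, because each predictor ends up using only the one attribute it actually depends on. Real applications will not be this clean.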

Choosing the Next Predictor

• Learn the most significant/relevant predictors first.
– Static vs. dynamic ordering
– Static: define total order, e.g., a priori or by pre-estimates of influence (Plackett-Burman).
• Cycle through the order: round-robin vs. improvement threshold
– Dynamic: choose the predictor with maximum current error
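The two ordering policies can be contrasted in a short sketch. The function names, predictor names, and error values are hypothetical, and the reading of "improvement threshold" (stay on the current predictor while its last experiment still improved accuracy by at least the threshold) is our assumption.

```python
from typing import Dict, List, Optional


def next_static(order: List[str], current: Optional[str],
                last_improvement: float = 0.0,
                threshold: Optional[float] = None) -> str:
    """Static ordering over a fixed total order of predictors.

    threshold=None  -> plain round-robin: always advance.
    threshold=value -> stay on the current predictor while its last
                       improvement is at least the threshold.
    """
    if current is None:
        return order[0]
    if threshold is not None and last_improvement >= threshold:
        return current
    return order[(order.index(current) + 1) % len(order)]


def next_dynamic(errors: Dict[str, float]) -> str:
    """Dynamic ordering: pick the predictor with maximum current error."""
    return max(errors, key=errors.get)


order = ["compute", "network", "storage"]          # assumed total order
errors = {"compute": 0.10, "network": 0.40, "storage": 0.20}  # made up

print(next_static(order, "compute"))                      # round-robin
print(next_static(order, "compute", 0.10, threshold=0.05))  # keeps refining
print(next_dynamic(errors))
```

Static ordering needs no error estimates at decision time, while dynamic ordering requires the current error of every predictor after each experiment.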

Choosing New Attributes

• Include the most significant/relevant attributes
– Choose attributes to expose main factors and interactions
• Add an attribute when error reduction from further training with the current set falls below threshold.
• Choose the attribute with maximum potential improvement in accuracy.
– Establish total order using pre-estimate of relevance using Plackett-Burman.
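A Plackett-Burman pre-estimate of relevance can be sketched with the standard 8-run design, which screens up to seven two-level attributes. The response function and its coefficients below are invented purely to exercise the code, and the helper names are ours.

```python
from typing import Dict, List, Sequence

# Standard Plackett-Burman generator row for 8 runs (+1 = high, -1 = low).
PB8_GENERATOR = [+1, +1, +1, -1, +1, -1, -1]


def pb8_design() -> List[List[int]]:
    """8-run design: 7 cyclic shifts of the generator plus an all-low row."""
    n = len(PB8_GENERATOR)
    rows = [[PB8_GENERATOR[(j - i) % n] for j in range(n)] for i in range(n)]
    rows.append([-1] * n)
    return rows


def main_effects(design: Sequence[Sequence[int]],
                 responses: Sequence[float],
                 names: Sequence[str]) -> Dict[str, float]:
    """Effect = mean response at high level minus mean response at low level."""
    half = len(design) / 2
    return {
        name: sum(row[j] * y for row, y in zip(design, responses)) / half
        for j, name in enumerate(names)
    }


def fake_run(row: Sequence[int]) -> float:
    """Hypothetical execution time for coded levels (made-up coefficients)."""
    cpu, mem, lat = row[0], row[1], row[2]
    return 100 - 30 * cpu - 2 * mem + 10 * lat


design = pb8_design()
names = ["cpu_speed", "memory_size", "network_latency"]  # first 3 columns
responses = [fake_run(row) for row in design]
effects = main_effects(design, responses, names)

# Total order of attributes by estimated relevance (largest |effect| first).
ranking = sorted(names, key=lambda a: abs(effects[a]), reverse=True)
print(effects, ranking)
```

Eight runs are enough to rank the attributes here because the design columns are balanced and mutually orthogonal, so each main effect is estimated independently of the others.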

Choosing New Values

• Select a new value sample to train the selected predictor function with the chosen set of attributes.
• A range of approaches balances coverage vs. interactions
– Binary search/bracket
– PB to identify interactions
– La-Ib: a = #levels for value, b = degree of interactions
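One plausible reading of "binary search/bracket" is to sample the two extremes of an attribute's range first and then keep bisecting the gaps. The sketch below implements that reading only; it is our interpretation, not a description of NIMO's actual procedure, and the function name is hypothetical.

```python
from collections import deque
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bracket_then_bisect(levels: Sequence[T]) -> List[T]:
    """Order sorted levels: both extremes, then midpoints of shrinking gaps."""
    n = len(levels)
    if n == 0:
        return []
    if n == 1:
        return [levels[0]]
    order = [0, n - 1]            # bracket the range first
    queue = deque([(0, n - 1)])   # gaps still to bisect, widest first
    while queue:
        lo, hi = queue.popleft()
        if hi - lo < 2:
            continue              # no untried level strictly inside this gap
        mid = (lo + hi) // 2
        order.append(mid)
        queue.append((lo, mid))
        queue.append((mid, hi))
    return [levels[i] for i in order]


# Five levels of one attribute, identified by index only.
print(bracket_then_bisect([1, 2, 3, 4, 5]))
```

Stopping early at any point in this order still leaves the sampled values spread across the whole range, which favors coverage over interactions.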

Experimental Results

• Biomedical applications
– BLAST, fMRI, NAMD, CardioWave
• Resources
– 5 CPU speeds, 6 network latencies, 5 memory sizes
– 5 × 6 × 5 = 150 resource assignments
• Goal: learn the execution time model with the least number of training assignments
• Separate test set to evaluate the accuracy of the current model
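The size of the experimental space can be checked by enumerating it. The slides do not list the actual CPU speeds, latencies, or memory sizes, so this sketch identifies each level by index only.

```python
from itertools import product

# Levels are identified by index; the real values are not given in the slides.
CPU_LEVELS = range(5)
LATENCY_LEVELS = range(6)
MEMORY_LEVELS = range(5)

assignments = list(product(CPU_LEVELS, LATENCY_LEVELS, MEMORY_LEVELS))
total = len(assignments)

# Number of assignments covered by a given percentage of the space.
two_percent = total * 2 // 100
ten_percent = total * 10 // 100

print(total, two_percent, ten_percent)
```

A 2% sample of this space is 3 assignments and a 10% sample is 15.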

BLAST Application

• Total time for 150 assignments: 130 hrs

• Active sampling: 5 hrs

• Sample space: 2%

• Incorrect order of predictor refinement

• 12 hrs

• 10% sample space

BLAST Application

• Total time for 150 assignments: 130 hrs

• Active sampling: 5 hrs

• Sample space: 2%

• Incorrect order of attribute refinement

• 12 hrs

• 10% sample space

Summary/Conclusions

• Current SLT – given the right data, learn the right model
• Use active sampling to acquire the right data
• Ongoing experiments demonstrate the importance/potential of guided active sampling
– 2% sample space, >= 90% model accuracy

• Upcoming VLDB paper…
