Simulation Remzi Arpaci-Dusseau, Peter Druschel, Vivek Pai, Karsten Schwan, David Clark, John Jannotti, Liuba Shrira, Mike Dahlin, Miguel Castro, Barbara Liskov, Jeff Mogul


Page 1

Simulation

Remzi Arpaci-Dusseau, Peter Druschel, Vivek Pai, Karsten Schwan, David Clark, John Jannotti, Liuba Shrira, Mike Dahlin, Miguel Castro, Barbara Liskov, Jeff Mogul

Page 2

What was the session about?

• How do you know whether your system design or implementation meets your goals?
– At scales larger than you can actually test
– Over longer time frames
– With loads/faults/changes that aren’t normally seen

• “Simulation” as a way around this
– Stretch reality beyond what you can test directly (e.g., on a testbed)

Page 3

Problems with simulation

• We don’t trust our simulators
• Can we get “real scale” rather than oversimplified simulation of scale?
• Focus has been on mathematical properties of workloads rather than on
– Errors
– Unanticipated uses (attacks, semi-attacks)
– Secondary behaviors

Page 4

Two main issues

1. Engines to run simulation

2. Workloads/faultloads/changeloads/topologies

Page 5

Engines to run simulation

• Scale issues
• Expressibility (e.g., delays, misbehavior, heterogeneity)
• Performance
• Repeatability/controllability
• Pluggability of components
• Not a good match for clusters
– Need to fix this, because big CPUs are history
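Repeatability and controllability usually come down to driving every random choice in the engine from one explicit seed, so a run can be replayed bit-for-bit. A minimal sketch of that idea (the event model and exponential delay distribution are invented for illustration, not from any real simulator):

```python
import heapq
import random

def run(seed, n_events=5):
    """Tiny discrete-event loop: schedule n message arrivals with
    random link delays, then deliver them in timestamp order."""
    rng = random.Random(seed)          # one explicit seed => repeatable run
    clock, queue, trace = 0.0, [], []
    for msg in range(n_events):
        delay = rng.expovariate(1.0)   # hypothetical link-delay model
        heapq.heappush(queue, (clock + delay, msg))
    while queue:                       # pop events in simulated-time order
        clock, msg = heapq.heappop(queue)
        trace.append((round(clock, 6), msg))
    return trace
```

Calling `run(42)` twice yields identical traces, while a different seed explores a different schedule, which is the repeatability-plus-controllability combination the bullet asks for.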

Page 6

Workloads/faultloads/changeloads/topologies

• What range of things do we have to cover?
• How do we find out what happens in real life?
• Anticipating things that haven’t happened before
• Simulating security threats (unknown worms, botnets, etc.)
• How do you manage a system that has failed?
• Metaphor: preparing the surfaces before painting the house

Page 7

Approaches for engines

• Use SETI-at-home approach
– To get scale and some exposure to errors
– PlanetLab is too small, too well-connected
– “honeypots-at-home”?

• Ask for access to Windows Update

• Fault injection

• Trace-driven or model-based
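In its simplest form, the fault-injection approach listed above is just a wrapper that makes a chosen operation fail on demand. A sketch; the failure probability, the exception type, and the wrapped disk-read stub are all hypothetical:

```python
import random

class FaultInjector:
    """Wrap an operation so it fails with a configured probability.
    Seeding the RNG keeps an injection run repeatable."""
    def __init__(self, op, fail_prob, seed=0):
        self.op = op
        self.fail_prob = fail_prob
        self.rng = random.Random(seed)   # seeded => same fault schedule each run

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.fail_prob:
            raise IOError("injected fault")   # stand-in for a real disk error
        return self.op(*args, **kwargs)

# Hypothetical operation under test: a disk read that normally succeeds.
read_block = FaultInjector(lambda block: b"data", fail_prob=0.3, seed=1)
```

Driving the system through such a wrapper exercises error-handling paths that trace-driven replay of normal operation never reaches.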

Page 8

Simulation tools require community consensus

• Otherwise reviewers don’t trust results

• Provides shared “shorthand” for what a published simulation result means

• Need some sort of consensus-building process
– Requires lots of effort, testing, bug-fixing
– Tends to draw community into one standard

Page 9

*loads

• Need “repeatable reality”

• Need some diversity

• Need enough detail
– E.g., link bandwidths, error rates

• Need well-documented assumptions
– And a way to describe the range of these
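One way to make assumptions well-documented is to force every parameter into an explicit, validated spec rather than burying defaults inside the engine. A sketch with hypothetical field names and units:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkSpec:
    """One simulated link, with every assumption stated explicitly.
    Field names, units, and defaults here are illustrative placeholders."""
    bandwidth_mbps: float   # e.g. 100.0 for a fast-Ethernet-class link
    latency_ms: float       # one-way propagation delay
    loss_rate: float        # fraction of packets dropped, in [0, 1]

    def __post_init__(self):
        # Reject nonsensical assumptions at construction time.
        if not 0.0 <= self.loss_rate <= 1.0:
            raise ValueError("loss_rate must be in [0, 1]")
        if self.bandwidth_mbps <= 0 or self.latency_ms < 0:
            raise ValueError("bandwidth must be positive, latency non-negative")

# Example: an assumed backbone link, stated up front rather than implied.
backbone = LinkSpec(bandwidth_mbps=100.0, latency_ms=20.0, loss_rate=0.001)
```

Because the spec is a plain value object, a paper can publish it verbatim, and varying one field at a time gives exactly the "range of these" assumptions the bullet asks to describe.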

Page 10

So we can almost do “networks”; can we do “distributed systems”?

• What do you need beyond network details?
– Content-sensitive behavior
– Fault models
– How does user behavior change?
• Especially in response to system behavior changes
– I/O, memory, CPU, other resource constraints
– Changes in configuration
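The point about user behavior responding to system behavior is a feedback loop: slow responses trigger retries, retries add load, and the extra load slows responses further. A toy closed-loop model of just that effect (the timeout, the retry formula, and the cap are invented numbers, not measurements):

```python
def offered_load(base_rate, latency_ms, timeout_ms=1000.0, max_retries=3.0):
    """Requests/sec actually hitting the system once impatient users
    start retrying on slow responses. Purely illustrative."""
    if latency_ms <= timeout_ms:
        return base_rate                  # users are patient; no extra load
    # Crude assumption: users retry in proportion to how far past the
    # timeout the response is, capped to avoid an unbounded retry storm.
    retries = min(latency_ms / timeout_ms, max_retries)
    return base_rate * retries
```

An open-loop trace replay would keep offering `base_rate` no matter what the system does; even a model this crude changes which designs look stable under overload.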

Page 11

Fault injection

• Problem: disk vendors don’t tell you what they know
– Bigger users share anecdotes, not data
– Look at what they do, infer what problem is being solved
– “disk-fault-at-home”
– Microsoft has Watson data – would they share?
– Linux ought to be gathering similar data!

• Also need “behavior injection” for pre-fault sequences
– Need more methodology

• Crazy subsystems after unexpected requests
• Better-than-random fault injection
• How do we model (collect data on) correlated faults?
– Does scale help or hinder independence?
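Correlated faults can be modeled by sampling shared failure causes before independent ones, e.g. a rack-level event (power, top-of-rack switch) taking down every machine in the rack at once. The probabilities and topology below are invented for illustration:

```python
import random

def sample_failures(racks, p_rack=0.05, p_machine=0.02, seed=1):
    """Draw one correlated fault set: each rack fails as a unit with
    probability p_rack, and each machine also fails independently with
    probability p_machine. All parameters are illustrative."""
    rng = random.Random(seed)            # seeded for repeatable faultloads
    failed = set()
    for rack, machines in racks.items():
        rack_down = rng.random() < p_rack     # shared cause, sampled first
        for m in machines:
            if rack_down or rng.random() < p_machine:
                failed.add(m)
    return failed
```

Under this kind of model, independent-failure assumptions break: replicas placed in the same rack die together, which is precisely what purely independent fault injection never exercises.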

Page 12

Things to spend money on

• Obtain topologies/faultloads/changeloads
– Increase realism (more detailed)
– Maintain/evolve these as the world changes

• Pay for the maintenance costs of a community-consensus simulator
– Or more than one, for different purposes

• Enough resources for repeatable results
– Don’t ignore storage and storage bandwidth

Page 13

Things that cannot be solved with “more money” alone

• Scale beyond what we can plausibly afford or manage

• Time scales

• Dynamic behaviors

• Access to real-world fault/behavior data