Simulation Remzi Arpaci-Dusseau, Peter Druschel, Vivek Pai, Karsten Schwan, David Clark, John Jannotti, Liuba Shrira, Mike Dahlin, Miguel Castro, Barbara Liskov, Jeff Mogul


Page 1

Simulation

Remzi Arpaci-Dusseau, Peter Druschel, Vivek Pai, Karsten Schwan, David Clark, John Jannotti, Liuba Shrira, Mike Dahlin, Miguel Castro, Barbara Liskov, Jeff Mogul

Page 2

What was the session about?

• How do you know whether your system design or implementation meets your goals?
– At scales larger than you can actually test
– Over longer time frames
– With loads/faults/changes that aren’t normally seen

• “Simulation” as a way around this
– Stretch reality beyond what you can test directly (e.g., on a testbed)

Page 3

Problems with simulation

• We don’t trust our simulators
• Can we get “real scale” rather than oversimplified simulation of scale?
• Focus has been on mathematical properties of workloads rather than on
– Errors
– Unanticipated uses (attacks, semi-attacks)
– Secondary behaviors

Page 4

Two main issues

1. Engines to run simulation

2. Workloads/faultloads/changeloads/topologies

Page 5

Engines to run simulation

• Scale issues
• Expressibility (e.g., delays, misbehavior, heterogeneity)
• Performance
• Repeatability/controllability
• Pluggability of components
• Not a good match for clusters
– Need to fix this, because big CPUs are history
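Repeatability and controllability usually come down to driving every random choice in the engine from one explicit seed, so a run can be replayed bit-for-bit. A minimal sketch of that idea (the event model and exponential delay distribution are invented for illustration, not from any real simulator):

```python
import heapq
import random

def run(seed, n_events=5):
    """Tiny discrete-event loop: schedule n message arrivals with
    random link delays, then deliver them in timestamp order."""
    rng = random.Random(seed)          # one explicit seed => repeatable run
    clock, queue, trace = 0.0, [], []
    for msg in range(n_events):
        delay = rng.expovariate(1.0)   # hypothetical link-delay model
        heapq.heappush(queue, (clock + delay, msg))
    while queue:                       # pop events in simulated-time order
        clock, msg = heapq.heappop(queue)
        trace.append((round(clock, 6), msg))
    return trace
```

Calling `run(42)` twice yields identical traces, while a different seed explores a different schedule, which is the repeatability-plus-controllability combination the bullet asks for.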

Page 6

Workloads/faultloads/changeloads/topologies

• What range of things do we have to cover?
• How do we find out what happens in real life?
• Anticipating things that haven’t happened before
• Simulating security threats (unknown worms, botnets, etc.)
• How do you manage a system that has failed?
• Metaphor: preparing the surfaces before painting the house

Page 7

Approaches for engines

• Use SETI-at-home approach
– To get scale and some exposure to errors
– PlanetLab is too small, too well-connected
– “honeypots-at-home”?

• Ask for access to Windows Update

• Fault injection

• Trace-driven or model-based
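In its simplest form, the fault-injection approach listed above is just a wrapper that makes a chosen operation fail on demand. A sketch; the failure probability, the exception type, and the wrapped disk-read stub are all hypothetical:

```python
import random

class FaultInjector:
    """Wrap an operation so it fails with a configured probability.
    Seeding the RNG keeps an injection run repeatable."""
    def __init__(self, op, fail_prob, seed=0):
        self.op = op
        self.fail_prob = fail_prob
        self.rng = random.Random(seed)   # seeded => same fault schedule each run

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.fail_prob:
            raise IOError("injected fault")   # stand-in for a real disk error
        return self.op(*args, **kwargs)

# Hypothetical operation under test: a disk read that normally succeeds.
read_block = FaultInjector(lambda block: b"data", fail_prob=0.3, seed=1)
```

Driving the system through such a wrapper exercises error-handling paths that trace-driven replay of normal operation never reaches.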

Page 8

Simulation tools require community consensus

• Otherwise reviewers don’t trust results

• Provides shared “shorthand” for what a published simulation result means

• Need some sort of consensus-building process
– Requires lots of effort, testing, bug-fixing
– Tends to draw community into one standard

Page 9

*loads

• Need “repeatable reality”

• Need some diversity

• Need enough detail
– E.g., link bandwidths, error rates

• Need well-documented assumptions
– And a way to describe the range of these
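One way to make assumptions well-documented is to force every parameter into an explicit, validated spec rather than burying defaults inside the engine. A sketch with hypothetical field names and units:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkSpec:
    """One simulated link, with every assumption stated explicitly.
    Field names, units, and defaults here are illustrative placeholders."""
    bandwidth_mbps: float   # e.g. 100.0 for a fast-Ethernet-class link
    latency_ms: float       # one-way propagation delay
    loss_rate: float        # fraction of packets dropped, in [0, 1]

    def __post_init__(self):
        # Reject nonsensical assumptions at construction time.
        if not 0.0 <= self.loss_rate <= 1.0:
            raise ValueError("loss_rate must be in [0, 1]")
        if self.bandwidth_mbps <= 0 or self.latency_ms < 0:
            raise ValueError("bandwidth must be positive, latency non-negative")

# Example: an assumed backbone link, stated up front rather than implied.
backbone = LinkSpec(bandwidth_mbps=100.0, latency_ms=20.0, loss_rate=0.001)
```

Because the spec is a plain value object, a paper can publish it verbatim, and varying one field at a time gives exactly the "range of these" assumptions the bullet asks to describe.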

Page 10

So we can almost do “networks”; can we do “distributed systems”?

• What do you need beyond network details?
– Content-sensitive behavior
– Fault models
– How does user behavior change?
• Especially in response to system behavior changes
– I/O, memory, CPU, other resource constraints
– Changes in configuration
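The point about user behavior responding to system behavior is a feedback loop: slow responses trigger retries, retries add load, and the extra load slows responses further. A toy closed-loop model of just that effect (the timeout, the retry formula, and the cap are invented numbers, not measurements):

```python
def offered_load(base_rate, latency_ms, timeout_ms=1000.0, max_retries=3.0):
    """Requests/sec actually hitting the system once impatient users
    start retrying on slow responses. Purely illustrative."""
    if latency_ms <= timeout_ms:
        return base_rate                  # users are patient; no extra load
    # Crude assumption: users retry in proportion to how far past the
    # timeout the response is, capped to avoid an unbounded retry storm.
    retries = min(latency_ms / timeout_ms, max_retries)
    return base_rate * retries
```

An open-loop trace replay would keep offering `base_rate` no matter what the system does; even a model this crude changes which designs look stable under overload.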

Page 11

Fault injection

• Problem: disk vendors don’t tell you what they know
– Bigger users share anecdotes, not data
– Look at what they do, infer what problem is being solved
– “disk-fault-at-home”
– Microsoft has Watson data – would they share?
– Linux ought to be gathering similar data!

• Also need “behavior injection” for pre-fault sequences
– Need more methodology

• Crazy subsystems after unexpected requests
• Better-than-random fault injection
• How do we model (collect data on) correlated faults?
– Does scale help or hinder independence?
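Correlated faults can be modeled by sampling shared failure causes before independent ones, e.g. a rack-level event (power, top-of-rack switch) taking down every machine in the rack at once. The probabilities and topology below are invented for illustration:

```python
import random

def sample_failures(racks, p_rack=0.05, p_machine=0.02, seed=1):
    """Draw one correlated fault set: each rack fails as a unit with
    probability p_rack, and each machine also fails independently with
    probability p_machine. All parameters are illustrative."""
    rng = random.Random(seed)            # seeded for repeatable faultloads
    failed = set()
    for rack, machines in racks.items():
        rack_down = rng.random() < p_rack     # shared cause, sampled first
        for m in machines:
            if rack_down or rng.random() < p_machine:
                failed.add(m)
    return failed
```

Under this kind of model, independent-failure assumptions break: replicas placed in the same rack die together, which is precisely what purely independent fault injection never exercises.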

Page 12

Things to spend money on

• Obtain topologies/faultloads/changeloads
– Increase realism (more detailed)
– Maintain/evolve these as the world changes

• Pay for the maintenance costs of a community-consensus simulator
– Or more than one, for different purposes

• Enough resources for repeatable results
– Don’t ignore storage and storage bandwidth

Page 13

Things that cannot be solved with “more money” alone

• Scale beyond what we can plausibly afford or manage

• Time scales

• Dynamic behaviors

• Access to real-world fault/behavior data