poster chep2012 reduced_original1


Upload: daniela-remenska

Post on 11-Jul-2015


Analysing DIRAC's Behavior using Model Checking with Process Algebra

Motivation

Why Formal Methods?

Language & Toolset

From DIRAC to mCRL2

State-space generation

Analysis & Issues

Daniela Remenska - Jeff Templon - Tim Willemse - Henri Bal - Kees Verstoep - Wan Fokkink

Philippe Charpentier - Ricardo Graciani - Elisa Lanciotti - Krzysztof Daniel Ciba - Stefan Roiser

Some drawbacks...

Abstraction of the "real" behavior is needed. This means one must build a sound model.

Expertise in formal methods and the system domain is necessary.

The state-space of the model can explode.
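A rough way to see why the state-space can explode: for components running in parallel, the number of global states is bounded by the product of the local state counts. The sketch below only computes that upper bound; the figures in it (10 jobs, 6 states each) are made-up illustration values, not measurements from the DIRAC model.

```python
from math import prod

def max_global_states(local_state_counts):
    """Upper bound on global states: the product of the local state counts."""
    return prod(local_state_counts)

# Hypothetical example: 10 jobs, each with 6 possible local states.
bound = max_global_states([6] * 10)
print(bound)  # 60466176
```

Adding one more six-state job multiplies the bound by six, which is why abstraction is needed before state-space generation.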

Actions: atomic building blocks; can carry data parameters

Processes: composed of actions, using algebra operators

Built-in data types: integers, booleans, lists, sets, bags

Abstract data types

Agents and storage become processes.

Control-flow is abstracted using mCRL2 non-deterministic choice and if-then-else constructs.

States of entities are described using custom abstract data types.
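The abstraction steps above can be illustrated in Python rather than mCRL2. This is a minimal sketch under stated assumptions: the `JobStatus` values and the action names are hypothetical simplifications, not the actual DIRAC job state machine or the model from the poster. The entity state becomes an enumerated type, the agent becomes a function, and non-deterministic choice is represented by yielding every transition the agent may take.

```python
from enum import Enum
from typing import Iterator, Tuple

class JobStatus(Enum):
    # Hypothetical, simplified states; the real job state machine is richer.
    RECEIVED = "Received"
    WAITING = "Waiting"
    RUNNING = "Running"
    DONE = "Done"
    FAILED = "Failed"

def agent_step(status: JobStatus) -> Iterator[Tuple[str, JobStatus]]:
    """Yield every (action, next_status) pair the abstract agent may take."""
    if status is JobStatus.RECEIVED:
        yield ("check_job", JobStatus.WAITING)
    elif status is JobStatus.WAITING:
        yield ("match_job", JobStatus.RUNNING)
    elif status is JobStatus.RUNNING:
        # Non-deterministic choice: the model does not fix which outcome occurs,
        # so both are offered and a state-space explorer follows each branch.
        yield ("finish_ok", JobStatus.DONE)
        yield ("finish_error", JobStatus.FAILED)
    # DONE and FAILED offer no further transitions in this sketch.

successors = list(agent_step(JobStatus.RUNNING))
```

The details of how the payload actually finishes are abstracted away; only the two observable outcomes remain in the model.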

Future Work

Automate (to some degree) the translation from code to model.

Verification

Figure 3: State-space visualisation with LTSView

DIRAC background

▪ production activities and user analysis for LHCb

▪ distributed services and light-weight agents

▪ "blackboard" or "shared-memory" paradigm

jobs often get into incorrect (or inconsistent) states

staging requests become stuck

difficult to trace the root of such unexpected behavior: many scenarios and components

manual intervention necessary

Based on process algebra laws: no ambiguity

Model checking tools give full control over the execution of parallel processes. This way one gains more insight into the system behavior.

Stronger than testing

There are formal or systematic approaches to tackle this!

Automatically explore the entire state-space and check if some "interesting" properties hold.
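The idea of exhaustive exploration can be sketched as a breadth-first search over all reachable states, checking an invariant in each one and returning the shortest action trace to a violation. This is a generic explicit-state sketch, not how the mCRL2 toolset is implemented; `find_violation` and the toy counter system are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

State = Hashable

def find_violation(
    initial: State,
    successors: Callable[[State], Iterable[Tuple[str, State]]],
    invariant: Callable[[State], bool],
) -> Optional[List[str]]:
    """Explore every reachable state; return the action trace to the first
    state that breaks the invariant, or None if the invariant always holds."""
    parent = {initial: None}  # state -> (previous state, action) or None
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, action = parent[state]
                trace.append(action)
            return trace[::-1]
        for action, nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = (state, action)
                queue.append(nxt)
    return None

# Toy system: a counter that may grow by 1 or 2, capped at 5.
def toy_successors(n: int):
    for step in (1, 2):
        if n + step <= 5:
            yield (f"add{step}", n + step)

trace = find_violation(0, toy_successors, lambda n: n != 4)
print(trace)  # ['add2', 'add2']
```

Unlike a test run, which follows one execution, the search visits every reachable state, so a `None` result means the property holds for the whole (finite) model.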

DIRAC (Python) ~150000 loc

Abstracting the implementation depends on the focus of the analysis.

Check for race conditions: agents update the state of shared entities.
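A race of this kind can be made concrete by enumerating every interleaving of two agents that touch the same shared entity. The sketch below is an assumption-laden illustration, not the actual DIRAC agents: a hypothetical `killer` agent writes the status "Killed", while a hypothetical `starter` agent reads the status and, in a separate step, writes "Running" if it saw "Waiting".

```python
def kill(shared, local):
    shared["status"] = "Killed"

def read_status(shared, local):
    local["seen"] = shared["status"]

def write_running(shared, local):
    # Acts on the value read earlier, which may be stale by now.
    if local["seen"] == "Waiting":
        shared["status"] = "Running"

# Each step: (agent name, step name, function). All names are hypothetical.
KILLER = [("killer", "write_killed", kill)]
STARTER = [("starter", "read_status", read_status),
           ("starter", "write_running", write_running)]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each agent's own step order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    shared = {"status": "Waiting"}
    local = {"killer": {}, "starter": {}}
    for agent, _name, step in schedule:
        step(shared, local[agent])
    return shared["status"]

# The kill always executes, so any final status other than "Killed" is a race.
bad = [[f"{agent}.{name}" for agent, name, _ in schedule]
       for schedule in interleavings(KILLER, STARTER)
       if run(schedule) != "Killed"]
print(bad)
# [['starter.read_status', 'killer.write_killed', 'starter.write_running']]
```

Only one of the three schedules misbehaves, which is what makes such bugs hard to reproduce by testing and a good fit for exhaustive exploration.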

Systems: Storage and Workload Mgmt

Entities: Jobs, Cache-Replicas, Tasks

Figure 1: DIRAC subsystems

Figure 2: Job state machine

Problems can be discovered while building and debugging the model:

Properties (Safety / Progress / Deadlock)

Model checker automatically probes them.

Property violated: counter-example trace is provided.
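Deadlock and a weak form of progress can be checked directly on a small labelled transition system. The sketch below uses an invented staging-request graph (the state and action names are hypothetical, not taken from DIRAC) and treats "progress" as "a terminal state is still reachable", which is a simplification of what a temporal-logic formula would express.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical transition system: state -> list of (action, next state).
LTS: Dict[str, List[Tuple[str, str]]] = {
    "New": [("submit", "Submitted")],
    "Submitted": [("stage_ok", "Staged"),
                  ("timeout", "Polling"),
                  ("lost", "Orphaned")],
    "Polling": [("poll", "Polling")],  # loops forever, never finishes
    "Orphaned": [],                    # no agent picks it up again
    "Staged": [],                      # intended terminal state
}
TERMINAL = {"Staged"}

def reachable(lts: Dict[str, List[Tuple[str, str]]], start: str) -> Set[str]:
    """Return all states reachable from start (including start itself)."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for _action, nxt in lts[state]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

states = reachable(LTS, "New")

# Deadlock: no outgoing transition, yet not an intended terminal state.
deadlocks = sorted(s for s in states if not LTS[s] and s not in TERMINAL)

# Weak progress violation: no terminal state can be reached any more.
no_progress = sorted(s for s in states if not (reachable(LTS, s) & TERMINAL))

print(deadlocks)    # ['Orphaned']
print(no_progress)  # ['Orphaned', 'Polling']
```

The `Polling` state shows why deadlock freedom alone is not enough: the request keeps moving but never completes, so only the progress check reports it.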

Figure 5: State-transition visualisation with DiaGraphica

Conclusions

Formal methods are a more rigorous addition to testing, as a way to improve software quality.

A sound model needs to be written manually. This requires experience and can be error-prone.

Similar techniques can be re-applied to similar systems, once the learning curve has been overcome.

Distributed systems are difficult to reason about: many components, all running in parallel.

Figure 6: Violation of progress and safety requirements

Figure 7: "Zombie" job starts running after being killed

Figure 4a: XSim simulator trace of a job workflow

Figure 4b: DIRAC logging info of a job workflow