automatic undo for cloud management via ai planning

19
NICTA Copyright 2012 From imagination to impact Automatic Undo for Cloud Management via AI Planning Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu and Len Bass Based on presentation at USENIX HotDep ‘12

Upload: hiroshi-wada

Post on 07-Jul-2015

119 views

Category:

Documents


0 download

DESCRIPTION

presented at Berkeley AMPLab Seminar on Oct 17, 2012 by Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu, Len Bass

TRANSCRIPT

Page 1: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Automatic Undo for

Cloud Management via

AI Planning

Ingo Weber, Hiroshi Wada, Alan Fekete,

Anna Liu and Len Bass

Based on presentation at USENIX HotDep ‘12

Page 2: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

NICTA in Brief

• Australia’s National Centre of Excellence in

Information and Communication Technology

• Five Research Labs:

– ATP: Australian Technology Park, Sydney

– NRL: UNSW, Sydney

– CRL: ANU, Canberra

– VRL: Uni. Melbourne

– QRL: Uni. Queensland and QUT

• 700 staff including 270 PhD students

• Budget: ~$90M/yr from Fed/State Gov and

industry

• ~600 research papers/year, ~150 patents total

Page 3: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

our spin-out

3

Page 4: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Motivation of this work

• Yuruware Bolt: site-replication across regions

– Executes long-running operations - create, delete and

update variety resources via AWS API

• Issues we face

– High cost of writing unit tests

• Preparing a test bed, reset after each test, and error recovery

• “Why there is no DBUnit for cloud!”

– High cost of error handling

• Resources often get stuck in an unexpected state

• Well-coordinated and tailored clean up for each case

4

Page 5: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Our Goal

• Provide an “undo” button to cloud users

– Allow for rolling back to a previous state

– e.g., undelete deleted resources and reconstruct the

relations among resources

• We’re users - cannot alter API or implementation

– Some operations are undoable, e.g., detach a VIP

– Some operations are semi-undoable, e.g., stop a VM

– Some operations are not undoable, e.g., delete a disk

• Less invasive

– Minimum changes in existing code or scripts

5

Page 6: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Status quo

6

dodoOperations(API calls)

Administrator/script

Cloud resources

API calls

• Administrators and/or scripts talk to cloud via API

Page 7: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Goal

7

Checkpoint

Rollback

dodoOperations(API calls)

Administrator/scriptCommit

Cloud resources

API calls

• Provide the ability to go back to a checkpoint

• No change in scripts except checkpoint and commit

Page 8: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Our approach

• “undo one by one in reverse order“ does not

always work

– No undo action is available

• e.g., no “undeleteing a deleted resouce “

– Undo requires a different sequence of API calls

• e.g., “deleteing an autoscaling group“ does not work

– Simple undo could results in a different state

• e.g., Undo “stopping an instance with a VIP“

– Undo operation may fail

• e.g., detaching a volume could fail. Need an alternative.

– Not optimal

• e.g., Bolt‘s operations (creating many temporary resouces)

• Tedious to hand-code for all possible cases! 8

Page 9: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Pseudo-delete for not-undoable ops

9

Checkpoint

Rollback

dodoOperations(API calls)

Administrator/scriptCommit

API Wrapper

Logical state(e.g., pseudo-delete flags)

Cloud resources

Wrapped API calls

API calls

Apply changes

Execute deletes

• Execute API calls if they are undoable

• Defer the execution of non-undoable calls until commit

Page 10: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

AI planning for the rest of cases

10 Undo System

Checkpoint

Rollback

dodoOperations(API calls)

Sense cloud resources

states

Sense cloud resources

states

AI Planner

Administrator/script

Resource state

(PDDL)

Generate

Input as

goal state

Commit

API Wrapper

Logical state(e.g., pseudo-delete flags)

Cloud resources

Wrapped API calls

API calls

Apply changes

Execute deletes

Resource state

(PDDL)

Input as

initial state

Generate

Domain model (PDDL)

InputGenerate

Compen-sation plan

Compen-sation script

Code generator

Execute rollback

• Changes made by

(semi-)undoable API

calls are compensated

by an AI planner

• AI planner finds ways

to handle errors

potentially occur during

undo as well

Page 11: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

AI Planning 101

• Given the initial state of the world, goal state,

and a set of available actions, find a sequence of

actions that leads from the initial to the goal

• Highly optimized heuristics tame the problem for

practical purposes

11

http://en.wikipedia.org/wiki/Tower_of_Hanoi

Page 12: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Planning under uncertainty

• In our problem

– Initial state: state at rollback

– Goal state: state at checkpoint

– Actions: AWS APIs

• We use FF [*] with an extension to handle

actions with alternative outcomes

• Finds “maximal“ contingency plans

– e.g., if detachnig a volume fails, stop the attached

instance if possible. If a planner cannot solve, ask

human intervention

12

[*] H OFFMANN , J., and N EBEL , B. “The FF planning system: Fast plan generation

through heuristic search.” Journal of Artificial Intelligence Research, 14 (2001),

Page 13: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Domain model: example

• Action to delete a disk volume in PDDL (:action Delete-Volume

:parameters (?vol - tVolume)

:precondition

(and

(volumeAvailable ?vol)

(not (unrecoverableFailure ?vol)))

:effect

(oneof

(and

(deleted ?vol)

(not (volumeAvailable ?vol)))

(unrecoverableFailure ?vol)))

13

Page 14: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Domain model: actions

Resource type Actions

Virtual machine launch, terminate, start, stop, change

VM size

Disk volume create, delete, create-from-snapshot,

attach, detach

Disk snapshot create, delete

Elastic IP address allocate, release, associate, disassociate

Security group create, delete

Auto-scaling group

create, delete, change-sizes, change-

launch-config

Auto-scaling launch config create, delete

Tag create, delete

Others AZ, cluster online/offline, ...

14

Page 15: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Evaluation

• Scalability of the planner based on an internally

released prototype

– AWS cmd line tool replacement

• Use cases: ~70 different planning settings tested

• Evaluation 1: against plan length

• Evaluation 2: against # of unrelated resources

15

Page 16: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Evaluation 1: vs plan length

16

0

0.5

1

1.5

2

2.5

0 10 20 30 40 50 60 70

Seco

nd

s

Plan length

20 length is the maximum we need in our problem

Executing a plan with 10 steps takes ~145 sec

Page 17: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Evaluation 2: vs # of unrelated resources

17

0

50

100

150

200

250

300

350

400

450

500

35 350 3500

Seco

nd

s

Facts + Actions (in 1000s)

Basis: most difficult problem from previous slide

Planner‘s cost is small unless having 1000s of resouces

+ 500

unrelated

instances and

volumes

~50 related

resources

Page 18: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Conclusion & future work

• Rollback in cloud using AI planner

– Formalized part of AWS APIs in a planning domain

– Planning execution time is marginal for practical

system sizes

• Future work

– Extending checkpoints to capture internal resource

state

– Parallelizing plans

– Finding forward plans with constraints

18

Page 19: Automatic Undo for Cloud Management via AI Planning

NICTA Copyright 2012 From imagination to impact

Questions?

The paper is available at www.nicta.com.au/pub?doc=5994

19