Automation of Deep Learning Training with AWS Step Functions

Posted on 14-Apr-2017


TRANSCRIPT

AUTOMATED DEEP LEARNING TRAINING WITH AWS STEP FUNCTIONS / AWS LAMBDA

@mizti

PROBLEM WITH DEEP LEARNING TRAINING

1. SERVER WITH GPU REQUIRED

SERVERS WITH GPUs ARE EXPENSIVE

• Several thousand dollars / month with on-demand instances
• Spot instances with a bidding system: much cheaper, but still not a negligible price for me

NOT NEGLIGIBLE PRICE?

• It costs about as much as 1 or 2 “Tirol choco”* per server per hour
• Not much, but I still worry about it…

* WELL KNOWN IN JAPAN AS A BYWORD FOR CHEAP CANDY

AND IT TAKES A VERY LONG TIME

Half a day, a day, occasionally several days

SO, I WANT TO TERMINATE SERVERS ONCE TRAINING IS COMPLETED

PROBLEM WITH DEEP LEARNING TRAINING

2. ANNOYING COLLECTION OF TRAINING RESULTS

WITH ONE SERVER, IT TAKES ONLY A FEW MINUTES WITH SCP

WITH MANY SERVERS, IT TAKES A LONG TIME

WHAT IS WORSE, WE DON’T KNOW WHEN THE TASK ON EACH SERVER WILL COMPLETE

AND I GET CONFUSED: “WHAT WAS THE SETTING FOR THIS SERVER?”

IN THE END, I TERMINATE THE SERVER WITHOUT EXTRACTING THE DATA

SO, I WANT TO GATHER DATA INTO ONE PLACE AUTOMATICALLY

AND I WANT TO LABEL IT WITH THE TRAINING CONDITIONS…

SERVER-LESS ARCHITECTURE

• Serverless computing (as I understand it) means:
  • Servers are created when I need them and terminated once the task is completed
  • No server is used to control the above
  • Thus, I usually don’t need to keep any server running, and can create as many servers as I need, whenever I need them
• (Becoming a buzzword these days?)

SERVER-LESS SERVICES IN AWS

• AWS Lambda
  • Users can register code written in Node.js / Python / Java / C#
  • Registered code can be triggered by events from inside AWS (and can also be invoked by hand, of course)
  • Users can automate AWS control with the AWS SDK for each language (like boto3 for Python)
  • No special library is needed for AWS Lambda; in other words, AWS Lambda is just a mechanism for registering and starting code
  • One Lambda invocation can run only for a few minutes at most (300 seconds when these slides were written), so AWS Lambda is not suitable for long-running / multi-state jobs
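To make the "just a mechanism for starting code" point concrete, here is a minimal sketch of a Python Lambda handler for a first step that creates one S3 bucket per execution. The bucket-name normalisation, the optional `region` key and the returned `bucket` key are hypothetical; the extra `s3` parameter exists only so the function can be tested without AWS (inside the Lambda Python runtime, boto3 is available without packaging it). This is not the author's code.

```python
import re


def make_bucket_name(exec_name: str) -> str:
    """Derive an S3-compatible bucket name from exec_name (simplified rules)."""
    # Bucket names: 3-63 chars of lowercase letters, digits, hyphens and dots.
    name = re.sub(r"[^a-z0-9.-]", "-", exec_name.lower()).strip("-.")
    if not 3 <= len(name) <= 63:
        raise ValueError(f"cannot derive a valid bucket name from {exec_name!r}")
    return name


def lambda_handler(event, context, s3=None):
    """Create the per-execution bucket; the returned dict feeds the next state."""
    if s3 is None:  # real invocation: build a client from the runtime's boto3
        import boto3
        s3 = boto3.client("s3")
    bucket = make_bucket_name(event["exec_name"])
    kwargs = {"Bucket": bucket}
    region = event.get("region")  # hypothetical key in the execution input
    if region and region != "us-east-1":
        # Outside us-east-1, S3 requires an explicit location constraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3.create_bucket(**kwargs)
    return {**event, "bucket": bucket}
```

Lambda calls `lambda_handler(event, context)` with two arguments, so the defaulted `s3` parameter does not interfere with a real deployment.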

SERVER-LESS SERVICES IN AWS

• AWS Step Functions
  • Users can define a state machine with multiple states (like an automaton)
  • Forks / parallel processes can also be defined
  • Each state passes data to and receives data from AWS Lambda functions
  • You can visually check the status of each state (process) in the web UI
  • Users can control long-running / multi-branch processes
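Step Functions can keep long waits out of Lambda by letting a Wait state do the waiting while each Lambda call stays short. The sketch below (Python emitting Amazon States Language JSON) shows a Task → Choice → Wait polling loop; the state names, the `$.done` field and the Lambda ARN are hypothetical placeholders, not the author's definitions.

```python
import json


def polling_states(check_arn: str, done_state: str, wait_seconds: int = 300) -> dict:
    """ASL states that call a Lambda repeatedly until it reports done."""
    return {
        "CheckStatus": {
            "Type": "Task",
            "Resource": check_arn,  # Lambda expected to return {"done": true/false, ...}
            "Next": "IsDone",
        },
        "IsDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.done", "BooleanEquals": True, "Next": done_state}
            ],
            "Default": "WaitAWhile",  # not done yet: wait, then check again
        },
        "WaitAWhile": {
            "Type": "Wait",
            "Seconds": wait_seconds,
            "Next": "CheckStatus",
        },
    }


definition = {
    "Comment": "Poll until a long-running job finishes",
    "StartAt": "CheckStatus",
    "States": {
        # Placeholder ARN: substitute your own region, account and function.
        **polling_states("arn:aws:lambda:REGION:ACCOUNT_ID:function:check-status",
                         "Finished"),
        "Finished": {"Type": "Succeed"},
    },
}

definition_json = json.dumps(definition, indent=2)
```

The resulting `definition_json` string is the kind of document you would paste into the Step Functions console or pass as `definition` to boto3's `create_state_machine`.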

WHAT I WANTED TO MAKE:

1. Create an S3 bucket for each execution
2. Bid for a spot instance
3. If the bid succeeds and a spot instance is launched:
   • Notify via AWS SNS (email or SMS)
   • Prepare for training (downloading training data, etc.)
   • Start training
   • Periodically upload model dumps / output data / logs to the S3 bucket
4. Once training is completed:
   • Notify via AWS SNS (email or SMS)
   • Terminate the instance after a certain period of time
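On the spot instance itself, the "periodically upload" step could be a loop like the following minimal sketch. It assumes a boto3 S3 client (`boto3.client("s3")`), a bucket created earlier for this execution, and a caller-supplied `is_finished` callback; the `output/` key prefix and the 600-second interval are hypothetical choices.

```python
import os
import time


def upload_outputs(s3, bucket: str, output_dir: str, prefix: str = "output") -> int:
    """Upload every file under output_dir to s3://bucket/prefix/...; return the count."""
    count = 0
    for root, _dirs, files in os.walk(output_dir):
        for fname in files:
            path = os.path.join(root, fname)
            # S3 keys always use "/" regardless of the local OS separator.
            rel = os.path.relpath(path, output_dir).replace(os.sep, "/")
            s3.upload_file(path, bucket, f"{prefix}/{rel}")
            count += 1
    return count


def upload_periodically(s3, bucket, output_dir, is_finished, interval_s=600):
    """Re-upload outputs every interval_s seconds until is_finished() is true,
    then upload once more so the final model dump is not lost."""
    while not is_finished():
        upload_outputs(s3, bucket, output_dir)
        time.sleep(interval_s)
    upload_outputs(s3, bucket, output_dir)
```

For example, start training with `proc = subprocess.Popen(cmd, shell=True)` and pass `is_finished=lambda: proc.poll() is not None`. The upload after the loop ensures the last results reach S3 before the instance is terminated.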

I MADE:

• Create S3 bucket
• Request Spot Instance
• Check if the bidding succeeded
• Notify bidding success
• Check if the task completed
• Wait for the task to complete
• Notify task completed
• Terminate Spot instance
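Each of these states can be backed by a small Lambda function. Below is a minimal sketch of four of them, taking boto3 EC2 / SNS clients as arguments so they can be tested without AWS; the AMI ID, instance type, maximum price, and the `user_data`, `request_id`, `instance_id` and `topic_arn` keys are hypothetical, and this is not the code in the author's repository.

```python
import base64

# Hypothetical settings: pick your own AMI, GPU instance type and maximum bid.
LAUNCH_SPEC = {"ImageId": "ami-xxxxxxxx", "InstanceType": "p2.xlarge"}
MAX_PRICE = "0.50"  # USD per hour, passed as a string as the EC2 API expects


def request_spot(event, ec2):
    """'Request Spot Instance': submit a one-time spot request."""
    spec = dict(LAUNCH_SPEC)  # copy so the module-level dict is not mutated
    # UserData in a spot LaunchSpecification must already be base64-encoded.
    spec["UserData"] = base64.b64encode(event["user_data"].encode()).decode()
    resp = ec2.request_spot_instances(
        SpotPrice=MAX_PRICE, InstanceCount=1, Type="one-time",
        LaunchSpecification=spec)
    request_id = resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
    return {**event, "request_id": request_id}


def check_bidding(event, ec2):
    """'Check if the bidding succeeded': set done=True once the request is fulfilled."""
    resp = ec2.describe_spot_instance_requests(
        SpotInstanceRequestIds=[event["request_id"]])
    req = resp["SpotInstanceRequests"][0]
    fulfilled = req["Status"]["Code"] == "fulfilled"
    return {**event, "done": fulfilled, "instance_id": req.get("InstanceId")}


def notify(event, sns, subject):
    """'Notify ...': publish to an SNS topic (email / SMS subscribers receive it)."""
    sns.publish(TopicArn=event["topic_arn"], Subject=subject,
                Message=f"{event['exec_name']}: {subject}")
    return event


def terminate(event, ec2):
    """'Terminate Spot instance'."""
    ec2.terminate_instances(InstanceIds=[event["instance_id"]])
    return event
```

In a real deployment each function would be wrapped in its own `lambda_handler(event, context)` that creates the client with `boto3.client("ec2")` or `boto3.client("sns")`.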

USAGE

• Input a JSON object with the following keys to start the Step Function:
  • exec_name: name of this execution (also becomes the name of the S3 bucket)
  • repository url: git repository of the code to execute (used like git clone {repository url})
  • data_dir / output_dir: directories of the training data and the output data
  • data_get_command: command executed before training (typically, getting training data for machine learning)
  • exec_command: command executed for training
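A minimal sketch of how these parameters could be turned into a boot script (EC2 user data) for the spot instance. The key name `repository_url` (the slide writes "repository url"), the working directory `/home/ubuntu/work`, and all example values are hypothetical.

```python
import shlex


def build_user_data(params: dict) -> str:
    """Turn the execution input into a shell script run at instance boot."""
    repo = params["repository_url"]  # assumed key name
    workdir = "/home/ubuntu/work"    # hypothetical checkout location
    lines = [
        "#!/bin/bash",
        "set -eu",  # stop at the first failing command
        f"git clone {shlex.quote(repo)} {workdir}",
        f"cd {workdir}",
        f"mkdir -p {shlex.quote(params['data_dir'])} {shlex.quote(params['output_dir'])}",
        params["data_get_command"],  # e.g. download the dataset
        params["exec_command"],      # the actual training run
    ]
    return "\n".join(lines) + "\n"


# Hypothetical execution input.
example_input = {
    "exec_name": "mnist-run1",
    "repository_url": "https://github.com/example/train-repo.git",
    "data_dir": "data",
    "output_dir": "result",
    "data_get_command": "python get_data.py",
    "exec_command": "python train.py --out result",
}
```

`data_get_command` and `exec_command` are deliberately left unquoted, because they are meant to be full shell commands chosen by the user.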

Input the JSON, and just push "Start Execution"

• Progress can be checked in the web UI
• Output results are automatically carried into the S3 bucket
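The same start-and-check cycle can also be driven from code. A minimal sketch with a boto3 Step Functions client (`boto3.client("stepfunctions")`); the state machine ARN is a placeholder, and reusing `exec_name` as the execution name is an assumption (execution names must be unique within a state machine).

```python
import json
import time


def start_training(sfn, state_machine_arn: str, params: dict) -> str:
    """Programmatic equivalent of pushing "Start Execution" in the console."""
    resp = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=params["exec_name"],  # must be unique within the state machine
        input=json.dumps(params),
    )
    return resp["executionArn"]


def wait_until_done(sfn, execution_arn: str, poll_s: int = 60) -> str:
    """Poll the status the web UI shows until the execution stops running."""
    while True:
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status != "RUNNING":
            return status  # e.g. SUCCEEDED, FAILED, TIMED_OUT or ABORTED
        time.sleep(poll_s)
```

Polling is optional: in the "start and forget" workflow, `start_training` alone is enough and the SNS notifications report completion.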

BENEFIT

• Start and forget. Sleep peacefully.
• Makes it easy to run many hyperparameter patterns in parallel
• No need to modify training / model code
• Could also be used for many other kinds of batch-like processes
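For a hyperparameter sweep, one approach is to generate one execution input per combination and start each as a separate execution. A minimal sketch, assuming the training script already accepts options such as `--lr` and `--batchsize` (hypothetical flags), so no model code needs to change; encoding the settings into `exec_name` also labels each S3 bucket with its training conditions.

```python
import itertools


def grid_inputs(base: dict, grid: dict) -> list:
    """Return one execution input per combination of the grid's values."""
    keys = sorted(grid)  # fixed order so names and commands are reproducible
    inputs = []
    for values in itertools.product(*(grid[k] for k in keys)):
        combo = dict(zip(keys, values))
        suffix = "-".join(f"{k}{v}" for k, v in combo.items())
        args = " ".join(f"--{k} {v}" for k, v in combo.items())
        # Make the name bucket-friendly: lowercase, no dots or underscores.
        name = f"{base['exec_name']}-{suffix}".replace(".", "p").replace("_", "-").lower()
        inputs.append({**base, "exec_name": name,
                       "exec_command": f"{base['exec_command']} {args}"})
    return inputs


base = {"exec_name": "mnist", "exec_command": "python train.py"}
runs = grid_inputs(base, {"lr": [0.1, 0.01], "batchsize": [32, 64]})
```

Here `runs` holds four inputs, the first named `mnist-batchsize32-lr0p1` with the command `python train.py --batchsize 32 --lr 0.1`; each can be started as its own execution.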

MISC

• Author: @mizti (any comments / questions welcome)
• Details: written up in my blog (but in Japanese ;) http://mizti.hatenablog.com/entry/deeplearningwithawsstepfunction
• Code repository: https://github.com/mizti/aws_stepfunc_chainer
• Illustrations in these slides: http://www.irasutoya.com/
