batchjobs and batchexperiments

30

Upload: others

Post on 08-May-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BatchJobs and BatchExperiments

BatchJobs and BatchExperiments

Abstraction mechanisms for using R in batch environments

M. Lang B. Bischl O. Mersmann J. Rahnenführer C. Weihs

Faculty of Statistics � TU Dortmund

1 / 30

Page 2: BatchJobs and BatchExperiments

Links and references

I http://batchjobs.googlecode.comI Installation infosI R documentationI Wiki + FAQI Mailing listI Recent development version in SVN

I Our tech report:Bischl, Lang, Mersmann, Rahnenführer, Weihs:�Computing on high performance clusters with R: Packages BatchJobs

and BatchExperiments�

Available on project page

I Article submitted to JSS

I Material for this tutorial:http://www.statistik.tu-dortmund.de/~bischl/reisensburg2012

2 / 30

Page 3: BatchJobs and BatchExperiments

Available computational resources for statisticians in Dortmund

Faculty

I 2 general �compute� machines (6 cores each for users)

I 5 student pool machines (6 cores each)

I Individual servers of di�erent chairs

I All run Linux and are SSH-accessible

I Some people use multicore or snowfall here, others start R processesmanually to parallelize their code...

LIDO

I Linux HPC cluster of the university

I Runs MAUI / TORQUE

I More than 3000 cores

I Have used it successfully for several years nowMany studies added up to years of total computation time.

D-Grid

I Have access in principle but we have not tried it yet.3 / 30

Page 4: BatchJobs and BatchExperiments

Batch Systems

OXYGEN

Gateway

#1 Task cyan with.

#2 Task red and.

#3 Task blue that.

#4 Task black, for.

#5 Task that.

Scheduler

Node

Node

Node

www.oxygen-icons.org

LIDO overviewhttp://lidong.itmc.tu-dortmund.de/ldw/index.php/System_overview

http://lidong.itmc.tu-dortmund.de/ldw/status.html

4 / 30

Page 5: BatchJobs and BatchExperiments

Batch Systems

+ They are pretty fast!

+ Many statistical tasks are embarrassingly parallel

- Job description �les needed

- Requesting many nodes at once increases time spend in queue

- Auxiliary scripts to create �les and submit jobs necessary

- Functions to collect results can get complicated and lengthy

- If some jobs fail (e. g., singularities), debugging is awful

5 / 30

Page 6: BatchJobs and BatchExperiments

Usual work�ow on a batch system

I Unroll your R loop(s) so that your script computes a single iteration

I Write a script that writes R scripts for each iteration setting theiteration counter(s) at the beginning

I Write a script that writes job description �les for each R script

I Write a script that submits your job description �les

I Crawl through �le system checking for existence of results or log �les

I Write a script that combines your scattered result �les

I Found a bug in your code? Write a script that kills all running jobs,�x the bug, submit everything again

I Some jobs have hit the wall time? Write a script that �nds outwhich jobs you need to resubmit with weaker constraints

I Want to try your model on another data set or using otherparameters? Eventually start from scratch, it might get ugly

6 / 30

Page 7: BatchJobs and BatchExperiments

Demo: Batch computing with R and PBS

7 / 30

Page 8: BatchJobs and BatchExperiments

BatchJobs

BatchJobs

I Basic infrastructure to communicate with batch systems from within R

I De�ne, submit, supervise and kill batch jobs

I Implementation of higher order functionsMap, Reduce and Filter

to de�ne jobs and collect results

I Con�gurable integrated status mailer

I Supported batch systems: Torque/PBS, LSF, SGE

I Other modes: Interactive, Multicore, SSH

8 / 30

Page 9: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

9 / 30

Page 10: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

10 / 30

Page 11: BatchJobs and BatchExperiments

(1) Create a registry

Registry

I Object used to access and exchange informations: �le paths, jobparameters, computational events, . . .

I Manages SQLite and �le system backend

I All information is stored in a single, portable directory

I Initialization of a new registry:

l i b r a r y ( BatchJobs )reg <− makeRegistry (

id = " tryout " , # name o f your studyf i l e . d i r = "~/myProject " , # a c c e s s i b l e on a l l nodesseed = 1) # i n i t i a l seed f o r f i r s t job

I loadRegistry(dir) to resume working with an existing registry

11 / 30

Page 12: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

12 / 30

Page 13: BatchJobs and BatchExperiments

(2) De�ne Jobs

batchMap

I Like lapply or mapply

I (x1, x2)× (y1, y2)→ (f(x1, y1), f(x2, y2))

I 10 Jobs to calculate 1 + 9, 2 + 8, . . . , 9 + 1

map <− f unc t i on ( i , j ) i+jbatchMap ( reg , map , i =1:9 , k=9:1)

I Stores function on �le system

I Inserts job de�nitions into the database

I Parameters serialized into the database for fast access

I All jobs get unique positive integers as IDs

13 / 30

Page 14: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

14 / 30

Page 15: BatchJobs and BatchExperiments

(3) Subset Jobs

I Query job IDs by computational status: find* functionsfindSubmitted, findRunning, findDone, . . .

I Query job IDs by parameters: findJobs(reg, pars)

i d s <− f i ndJobs ( reg , pars=( i ==1))

I Set operations on ID vectors: intersect, setdiff, union

I Vector of IDs can be passed to all functions interacting with the batchsystem

15 / 30

Page 16: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

16 / 30

Page 17: BatchJobs and BatchExperiments

(4) Submit Jobs

submitJobs

I Creates R script �les and job description �les on the �y

I Resources can be provided as named list

# 1 hour maximal execut ion time , about 2 GB of RAMre s = l i s t ( wal l t ime=60∗ 60 , memory=2000)

# . . . and submitsubmitJobs ( reg , r e s ou r c e s=r e s )

I Submits all jobs not yet submitted per default

I Jobs can be grouped into chunks by providing a list of ID vectors.Each chunk gets executed sequentially as single batch job

17 / 30

Page 18: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

18 / 30

Page 19: BatchJobs and BatchExperiments

(5) Supervise and debug

I Quick overview of what is going on: showStatus(reg)

Status f o r j obs : 10Submitted : 10 (100.0%)Started : 10 (100.0%)Errors : 0 ( 0.0%)Running : 2 ( 20.0%)Expired : 0 ( 0.0%)Done : 8 ( 80.0%)Time : min=1.50 s avg=5.20 s max=8.80 s

I Display log �les with a customizable pager (less, vi, ...):showLog(reg, findErrors(reg)[1])

I Found a bug? killJobs(reg, findRunning(reg))

I Run a job in the current R session: testJob(reg, id)

19 / 30

Page 20: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

20 / 30

Page 21: BatchJobs and BatchExperiments

(6) Collect results

Reduce

# combine in numeric vec to rreduceResu l t s ( reg , i d s = findDone ( reg ) , i n i t = numeric ( 0 ) ,

fun = func t i on ( aggr , job , r e s ) c ( aggr , r e s ) )

I Convenience wrappers around reduceResults:reduceResults[Vector|Matrix|DataFrame|List]

I batchReduceResults to reduce results on the nodes

Iterate over results

i d s <− f indDone ( reg )l app ly ( ids , f unc t i on ( id ) { r e s <− l oadResu l t ( reg , id ) ; . . . })

21 / 30

Page 22: BatchJobs and BatchExperiments

Demo: BatchJobs

I Con�guration

I Brew scripts for job �les

I Some toy examples

22 / 30

Page 23: BatchJobs and BatchExperiments

BatchExperiments

I Builds on top of BatchJobs

I Intended as abstraction for typical statistical tasks:

Applying algorithms on problems

I More aimed at the end user

I Convenient for simulation studies, comparison and benchmarkexperiments, sensitivity analysis, . . .

I Work�ow di�ers only in job de�nition

23 / 30

Page 24: BatchJobs and BatchExperiments

BatchJobs' functions Common functions BatchExperiments' functions

(1) Create Registry makeRegistry makeExperimentRegistry

(2) De�ne Jobs

batchMap

batchReduce

batchExpandGrid

batchMapResults

batchReduceResults

addProblem

addAlgorithm

makeDesign

addExperiments

(3) Subset Jobs findJobs

findDone

findErrors

...

findExperiments

(4) Submit Jobs submitJobs

(5) Status & Debugging

showStatus

testJob

showLog

summarizeExperiments

(6) Collect Results

loadResult[s]

reduceResults

filterResults

reduceResults[AggrType]

reduceResultsExperiments

24 / 30

Page 25: BatchJobs and BatchExperiments

Job de�nition steps in BatchExperiments

1. Add problems to registry: addProblemI E�cient storage: Separation of static and dynamic problem partsI Can be connected with an experimental design

2. Add algorithms to registry: addAlgorithmI Problem instance gets passed to algorithmI Can be connected with an experimental designI Return value will be saved on the �le system

3. Connect experimental designs: makeDesign

4. Add experiments to registry: addExperimentsI Experiment: problem instance + algorithm + algorithm parametersI Job: Experiment + replication number

25 / 30

Page 26: BatchJobs and BatchExperiments

Function interplay

result

static problem part

static in addProblem()

dynamic problem function

dynamic in addProblem()

algorithm function

algorithm in addAlgorithm()

problem iterator algorithm iterator

problem design

design/exhaustive in makeDesign()

algorithm design

design/exhaustive in makeDesign()

problem

instance

problem parameters algorithm parameters

.

26 / 30

Page 27: BatchJobs and BatchExperiments

What you get

I Reproducibility: Every computation is seeded, seeds are stored in adatabase

I Portability: Data, algorithms, results and job information reside in asingle directory

I Extensibility: Add more problems or algorithms, try di�erentparameters or increase the replication numbers at any computationalstate

I Exchangeability: Share your �le directory to allow others to extendyour study with their data sets and algorithms

27 / 30

Page 28: BatchJobs and BatchExperiments

Demo: BatchExperiments

I Toy classi�cation example

I Comparing some models on the iris data

I Bootstrapping

28 / 30

Page 29: BatchJobs and BatchExperiments

Advanced Stu�

I Transforming results in parallel on the nodes

I Problem synchronization: all algorithms see the same probleminstances (per replication)

I Using MPI jobs as batch jobs

I Chunking small jobs to avoid �ooding of the scheduler

I Optional splitting of result �les for fast access of partial results

I Integrating your own batch system

29 / 30

Page 30: BatchJobs and BatchExperiments

Conclusions

I Greatly simpli�es the work with batch systems

I Interactively control batch systems from within R (no shell required)

I Do reproducible research

I Exchange code and results with others

Outlook

I Support for more batch systems: Slurm, Condor, Amazon EC2, . . .

I Support for systems with unshared �le systems

I Enhance caching of queries

I Support for dependent jobs

http://batchjobs.googlecode.com

30 / 30