Download - BatchJobs and BatchExperiments
BatchJobs and BatchExperiments
Abstraction mechanisms for using R in batch environments
M. Lang B. Bischl O. Mersmann J. Rahnenführer C. Weihs
Faculty of Statistics � TU Dortmund
1 / 30
Links and references
I http://batchjobs.googlecode.comI Installation infosI R documentationI Wiki + FAQI Mailing listI Recent development version in SVN
I Our tech report:Bischl, Lang, Mersmann, Rahnenführer, Weihs:�Computing on high performance clusters with R: Packages BatchJobs
and BatchExperiments�
Available on project page
I Article submitted to JSS
I Material for this tutorial:http://www.statistik.tu-dortmund.de/~bischl/reisensburg2012
2 / 30
Available computational resources for statisticians in Dortmund
Faculty
I 2 general �compute� machines (6 cores each for users)
I 5 student pool machines (6 cores each)
I Individual servers of di�erent chairs
I All run Linux and are SSH-accessible
I Some people use multicore or snowfall here, others start R processesmanually to parallelize their code...
LIDO
I Linux HPC cluster of the university
I Runs MAUI / TORQUE
I More than 3000 cores
I Have used it successfully for several years nowMany studies added up to years of total computation time.
D-Grid
I Have access in principle but we have not tried it yet.3 / 30
Batch Systems
OXYGEN
Gateway
#1 Task cyan with.
#2 Task red and.
#3 Task blue that.
#4 Task black, for.
#5 Task that.
Scheduler
Node
Node
Node
www.oxygen-icons.org
LIDO overviewhttp://lidong.itmc.tu-dortmund.de/ldw/index.php/System_overview
http://lidong.itmc.tu-dortmund.de/ldw/status.html
4 / 30
Batch Systems
+ They are pretty fast!
+ Many statistical tasks are embarrassingly parallel
- Job description �les needed
- Requesting many nodes at once increases time spend in queue
- Auxiliary scripts to create �les and submit jobs necessary
- Functions to collect results can get complicated and lengthy
- If some jobs fail (e. g., singularities), debugging is awful
5 / 30
Usual work�ow on a batch system
I Unroll your R loop(s) so that your script computes a single iteration
I Write a script that writes R scripts for each iteration setting theiteration counter(s) at the beginning
I Write a script that writes job description �les for each R script
I Write a script that submits your job description �les
I Crawl through �le system checking for existence of results or log �les
I Write a script that combines your scattered result �les
I Found a bug in your code? Write a script that kills all running jobs,�x the bug, submit everything again
I Some jobs have hit the wall time? Write a script that �nds outwhich jobs you need to resubmit with weaker constraints
I Want to try your model on another data set or using otherparameters? Eventually start from scratch, it might get ugly
6 / 30
Demo: Batch computing with R and PBS
7 / 30
BatchJobs
BatchJobs
I Basic infrastructure to communicate with batch systems from within R
I De�ne, submit, supervise and kill batch jobs
I Implementation of higher order functionsMap, Reduce and Filter
to de�ne jobs and collect results
I Con�gurable integrated status mailer
I Supported batch systems: Torque/PBS, LSF, SGE
I Other modes: Interactive, Multicore, SSH
8 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
9 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
10 / 30
(1) Create a registry
Registry
I Object used to access and exchange informations: �le paths, jobparameters, computational events, . . .
I Manages SQLite and �le system backend
I All information is stored in a single, portable directory
I Initialization of a new registry:
l i b r a r y ( BatchJobs )reg <− makeRegistry (
id = " tryout " , # name o f your studyf i l e . d i r = "~/myProject " , # a c c e s s i b l e on a l l nodesseed = 1) # i n i t i a l seed f o r f i r s t job
I loadRegistry(dir) to resume working with an existing registry
11 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
12 / 30
(2) De�ne Jobs
batchMap
I Like lapply or mapply
I (x1, x2)× (y1, y2)→ (f(x1, y1), f(x2, y2))
I 10 Jobs to calculate 1 + 9, 2 + 8, . . . , 9 + 1
map <− f unc t i on ( i , j ) i+jbatchMap ( reg , map , i =1:9 , k=9:1)
I Stores function on �le system
I Inserts job de�nitions into the database
I Parameters serialized into the database for fast access
I All jobs get unique positive integers as IDs
13 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
14 / 30
(3) Subset Jobs
I Query job IDs by computational status: find* functionsfindSubmitted, findRunning, findDone, . . .
I Query job IDs by parameters: findJobs(reg, pars)
i d s <− f i ndJobs ( reg , pars=( i ==1))
I Set operations on ID vectors: intersect, setdiff, union
I Vector of IDs can be passed to all functions interacting with the batchsystem
15 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
16 / 30
(4) Submit Jobs
submitJobs
I Creates R script �les and job description �les on the �y
I Resources can be provided as named list
# 1 hour maximal execut ion time , about 2 GB of RAMre s = l i s t ( wal l t ime=60∗ 60 , memory=2000)
# . . . and submitsubmitJobs ( reg , r e s ou r c e s=r e s )
I Submits all jobs not yet submitted per default
I Jobs can be grouped into chunks by providing a list of ID vectors.Each chunk gets executed sequentially as single batch job
17 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
18 / 30
(5) Supervise and debug
I Quick overview of what is going on: showStatus(reg)
Status f o r j obs : 10Submitted : 10 (100.0%)Started : 10 (100.0%)Errors : 0 ( 0.0%)Running : 2 ( 20.0%)Expired : 0 ( 0.0%)Done : 8 ( 80.0%)Time : min=1.50 s avg=5.20 s max=8.80 s
I Display log �les with a customizable pager (less, vi, ...):showLog(reg, findErrors(reg)[1])
I Found a bug? killJobs(reg, findRunning(reg))
I Run a job in the current R session: testJob(reg, id)
19 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
20 / 30
(6) Collect results
Reduce
# combine in numeric vec to rreduceResu l t s ( reg , i d s = findDone ( reg ) , i n i t = numeric ( 0 ) ,
fun = func t i on ( aggr , job , r e s ) c ( aggr , r e s ) )
I Convenience wrappers around reduceResults:reduceResults[Vector|Matrix|DataFrame|List]
I batchReduceResults to reduce results on the nodes
Iterate over results
i d s <− f indDone ( reg )l app ly ( ids , f unc t i on ( id ) { r e s <− l oadResu l t ( reg , id ) ; . . . })
21 / 30
Demo: BatchJobs
I Con�guration
I Brew scripts for job �les
I Some toy examples
22 / 30
BatchExperiments
I Builds on top of BatchJobs
I Intended as abstraction for typical statistical tasks:
Applying algorithms on problems
I More aimed at the end user
I Convenient for simulation studies, comparison and benchmarkexperiments, sensitivity analysis, . . .
I Work�ow di�ers only in job de�nition
23 / 30
BatchJobs' functions Common functions BatchExperiments' functions
(1) Create Registry makeRegistry makeExperimentRegistry
(2) De�ne Jobs
batchMap
batchReduce
batchExpandGrid
batchMapResults
batchReduceResults
addProblem
addAlgorithm
makeDesign
addExperiments
(3) Subset Jobs findJobs
findDone
findErrors
...
findExperiments
(4) Submit Jobs submitJobs
(5) Status & Debugging
showStatus
testJob
showLog
summarizeExperiments
(6) Collect Results
loadResult[s]
reduceResults
filterResults
reduceResults[AggrType]
reduceResultsExperiments
24 / 30
Job de�nition steps in BatchExperiments
1. Add problems to registry: addProblemI E�cient storage: Separation of static and dynamic problem partsI Can be connected with an experimental design
2. Add algorithms to registry: addAlgorithmI Problem instance gets passed to algorithmI Can be connected with an experimental designI Return value will be saved on the �le system
3. Connect experimental designs: makeDesign
4. Add experiments to registry: addExperimentsI Experiment: problem instance + algorithm + algorithm parametersI Job: Experiment + replication number
25 / 30
Function interplay
result
static problem part
static in addProblem()
dynamic problem function
dynamic in addProblem()
algorithm function
algorithm in addAlgorithm()
problem iterator algorithm iterator
problem design
design/exhaustive in makeDesign()
algorithm design
design/exhaustive in makeDesign()
problem
instance
problem parameters algorithm parameters
.
26 / 30
What you get
I Reproducibility: Every computation is seeded, seeds are stored in adatabase
I Portability: Data, algorithms, results and job information reside in asingle directory
I Extensibility: Add more problems or algorithms, try di�erentparameters or increase the replication numbers at any computationalstate
I Exchangeability: Share your �le directory to allow others to extendyour study with their data sets and algorithms
27 / 30
Demo: BatchExperiments
I Toy classi�cation example
I Comparing some models on the iris data
I Bootstrapping
28 / 30
Advanced Stu�
I Transforming results in parallel on the nodes
I Problem synchronization: all algorithms see the same probleminstances (per replication)
I Using MPI jobs as batch jobs
I Chunking small jobs to avoid �ooding of the scheduler
I Optional splitting of result �les for fast access of partial results
I Integrating your own batch system
29 / 30
Conclusions
I Greatly simpli�es the work with batch systems
I Interactively control batch systems from within R (no shell required)
I Do reproducible research
I Exchange code and results with others
Outlook
I Support for more batch systems: Slurm, Condor, Amazon EC2, . . .
I Support for systems with unshared �le systems
I Enhance caching of queries
I Support for dependent jobs
http://batchjobs.googlecode.com
30 / 30