Parallel Computing with R
DESCRIPTION
A quick review and demonstration of how to get started with parallel computing in R. Includes an example of setting up a SNOW cluster in the departmental lab.
TRANSCRIPT
Parallel Computing with R
Literature Seminar
Abhirup [email protected]
School of Statistics, University of Minnesota
November 15, 2013
Why Parallel?
- R does not take advantage of multiple cores by default.
- R does not support passing by reference.
- R cannot read files dynamically, etc.
What is Parallel computing with R
What is Parallel?
- 'Parallel': doing more than one task at the same time.
- Use different cores of the same CPU for different tasks.
- Use different computers in a cluster for different tasks.
How to go Parallel?
Using Multicore (Implicit Parallelism)
The main process forks child processes, which run in parallel on different cores.
    library(parallel)
    mclapply(X, FUN, ...)
Or use
    library(parallel)
    # ... setup stuff ...
    for (isplit in 1:nsplit) {
        mcparallel(some R expression involving isplit)
    }
    out <- mccollect()
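The two multicore templates above can be made concrete with a toy task. This is a minimal sketch, not the code from the talk: the squaring task, the job expression `isplit * 10`, and the names `n_cores`, `squares` and `jobs` are made up for illustration, and two cores are assumed to be available.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mclapply(): a parallel counterpart of lapply() with the same calling style.
squares <- mclapply(1:8, function(i) i^2, mc.cores = n_cores)

# mcparallel()/mccollect(): launch jobs by hand, then gather their results.
if (.Platform$OS.type != "windows") {
  jobs <- lapply(1:3, function(isplit) mcparallel(isplit * 10))
  out <- mccollect(jobs)  # a list with one element per job
}
```

`mclapply()` is usually the more convenient of the two; `mcparallel()` is useful when the jobs are not simply one function mapped over a vector.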
Warnings:
- All child processes compete for memory.
- Closing the terminal or any graphical window only kills the parent.
- Ctrl+C kills the parent, not the children.
- Kill the children yourself if they become unresponsive.
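One possible way to kill a stuck child from within R is sketched below, for Unix-alike systems only. It assumes the job was started with `mcparallel()`, whose return value records the child's process ID. The 60-second sleep merely stands in for an unresponsive task, and the names `job` and `killed` are illustrative.

```r
library(parallel)

if (.Platform$OS.type != "windows") {
  # Start a long-running child; the returned job object records its PID.
  job <- mcparallel(Sys.sleep(60))

  # Send SIGTERM to that PID, as one might for an unresponsive worker.
  killed <- tools::pskill(job$pid, tools::SIGTERM)
}
```

From a shell, the same effect can be had by passing the child's PID to the `kill` command.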
Using SNOW (Explicit Parallelism)
Make a cluster with any one of these options:
    cl <- makeCluster(spec, type, ...)
    cl <- makePSOCKcluster(names, ...)
    cl <- makeForkCluster(nnodes, ...)
Export essential objects to the cluster (pass their names as strings):
    clusterExport(cl, c("var1", "fun1", ...))
Evaluate on the cluster:
    clusterEvalQ(cl, expr)
    parLapply(cl = NULL, X, fun, ...)
    parSapply(cl = NULL, X, FUN, ...)
Stop the cluster:
    stopCluster(cl)
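Put together, the make / export / evaluate / stop steps might look like the sketch below. It runs two socket workers on the local machine. The objects `offset` and `add_offset` are hypothetical and exist only to show why exporting is needed, and the `library(stats)` call stands in for whatever package the workers really require.

```r
library(parallel)

# Two local worker processes; PSOCK clusters work on every platform.
cl <- makePSOCKcluster(2)

# Hypothetical objects that the workers will need.
offset <- 100
add_offset <- function(x) x + offset

# Workers start as fresh R sessions, so ship the objects over by name.
clusterExport(cl, c("offset", "add_offset"))

# Run an expression on every worker, for example to load a package.
invisible(clusterEvalQ(cl, library(stats)))

res_list <- parLapply(cl, 1:4, function(i) add_offset(i))  # returns a list
res_vec  <- parSapply(cl, 1:4, function(i) add_offset(i))  # simplified to a vector

stopCluster(cl)
```

Without the `clusterExport()` call the workers would fail with an "object not found" style error, because nothing from the master session is visible to them by default.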
Demonstration
Using the Swiss fertility data from 1888 (the swiss data set that ships with base R).
    > str(swiss)
    'data.frame': 47 obs. of 6 variables:
     $ Fertility       : num 80.2 83.1 92.5 85.8 76.9 76.1 ...
     $ Agriculture     : num 17 45.1 39.7 36.5 43.5 35.3 ...
     $ Examination     : int 15 6 5 12 17 9 16 14 12 16 ...
     $ Education       : int 12 9 5 7 15 7 7 8 7 13 ...
     $ Catholic        : num 9.96 84.84 93.4 33.77 5.16 ...
     $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 ...
10-fold cross-validation
    # Assign each row of swiss to one of 10 folds at random.
    fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)
Cross-validation for the i-th fold
    library(randomForest)

    fold.cv <- function(i) {
        # Hold out fold i for testing and train on the remaining folds.
        train <- swiss[fold != i, ]
        test <- swiss[fold == i, ]
        swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                                 data = train)
        predict.test <- predict(swiss.rf, test, type = "response")
        actual.test <- sqrt(test$Fertility)
        err <- predict.test - actual.test
        sum(err * err)  # sum of squared errors on the held-out fold
    }
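Drawing fold labels with replacement, as above, gives folds of unequal size and can occasionally leave a fold empty with only 47 rows. A common alternative is to shuffle a balanced vector of labels. The sketch below is that alternative, not what the slides do; the names `k`, `n`, `fold_balanced` and `fold_sizes` and the seed are illustrative.

```r
# Base R only; swiss ships with the datasets package.
set.seed(2013)  # arbitrary seed, only so the sketch is repeatable

k <- 10
n <- nrow(swiss)

# Repeat the labels 1..k until there are n of them, then shuffle.
fold_balanced <- sample(rep(seq_len(k), length.out = n))

# Every fold now holds either floor(n / k) or ceiling(n / k) rows.
fold_sizes <- table(fold_balanced)
```

Using `fold_balanced` in place of `fold` would leave the rest of the cross-validation code unchanged.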
How to create a cluster?
Create a local cluster of 4 workers (parallel socket):
    cl <- makePSOCKcluster(4)
Create a local cluster by forking onto different cores of the CPU (8 cores here; forking is not available on Windows):
    cl <- makeForkCluster(8)
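The worker counts 4 and 8 above are specific to the presenter's machine. One way to pick a count for whatever machine the code runs on is `detectCores()`, as in this sketch. Leaving one core free and capping the demo at two workers are assumptions made here, not recommendations from the talk.

```r
library(parallel)

# detectCores() can return NA on some platforms, so guard against it.
n_detected <- detectCores()
if (is.na(n_detected)) n_detected <- 1L

# Leave one core free for the master session; cap at 2 to keep this demo light.
n_workers <- max(1L, min(2L, n_detected - 1L))

cl <- makePSOCKcluster(n_workers)
length(cl)  # number of workers in the cluster object
stopCluster(cl)
```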
How to create a cluster in our LAB?
Create a password-less login using ssh-keygen (from the shell):
    ssh-keygen -t dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Check which computers are running:
    grephosts LAB
Then ssh once into each computer you want to connect to, and it will be remembered for the session.
Now we are ready to make a cluster:
    library(parallel)
    machines <- c("crab", "sugar", "strike", "hyland", "lovejoy", "driller")
    # Look up each host's IP address with nsl() and flatten the result to a character vector.
    address <- rapply(lapply(machines, nsl), c)
    cl <- makePSOCKcluster(address)
If you are connecting to stat.umn.edu from your own computer, create a password-less ssh session as follows:
    ssh-keygen -t dsa
    # Then use scp to copy id_dsa.pub to ~/.ssh/authorized_keys
Comparison
On cluster:
    > system.time({
    + garbage <- clusterEvalQ(cl, data(swiss))
    + garbage <- clusterEvalQ(cl, library(randomForest))
    + clusterExport(cl, c("fold", "fold.cv"))
    + clusterSetRNGStream(cl, 123)
    + res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    + stopCluster(cl)
    + })
       user  system elapsed
      0.008   0.000   0.838
On multicore:
    > system.time({
    + res1 <- do.call(c, mclapply(1:10, fold.cv, mc.cores = 8))
    + })
       user  system elapsed
      0.386   0.162   0.120
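The `clusterSetRNGStream(cl, 123)` call in the cluster run seeds the workers' random number streams so that results can be reproduced. A small sketch of that effect, using a made-up task `draw` rather than the random forest fit, and a two-worker local cluster as an assumption:

```r
library(parallel)

cl <- makePSOCKcluster(2)

draw <- function(i) rnorm(1)  # hypothetical task that uses random numbers

# Seed the workers' random number streams, then draw.
clusterSetRNGStream(cl, 123)
first <- parSapply(cl, 1:4, draw)

# Re-seeding with the same value replays the same streams.
clusterSetRNGStream(cl, 123)
second <- parSapply(cl, 1:4, draw)

stopCluster(cl)

identical(first, second)  # the two runs should agree
```

The same seed gives the same numbers only when the number of workers and the way tasks are split across them stay the same.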
Using Fork cluster:
    > system.time({
    + cl <- makeForkCluster(8)
    + garbage <- clusterEvalQ(cl, data(swiss))
    + garbage <- clusterEvalQ(cl, library(randomForest))
    + clusterExport(cl, c("fold", "fold.cv"))
    + clusterSetRNGStream(cl, 123)
    + res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    + stopCluster(cl)
    + })
       user  system elapsed
      0.010   0.054   0.153
Without any parallelization:
    > system.time({
    + res2 <- do.call(c, lapply(1:10, fold.cv))
    + })
       user  system elapsed
      0.233   0.000   0.235
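The comparison can be repeated without the randomForest package by timing a stand-in task. In the sketch below, `slow_task` and its 0.2-second sleep are invented to mimic the cost of fitting one fold, and the cluster has two local workers. No particular timings are claimed, since they depend on the machine.

```r
library(parallel)

slow_task <- function(i) {
  Sys.sleep(0.2)  # stand-in for real work such as fitting one fold
  i
}

# Serial baseline.
t_serial <- system.time(res_serial <- lapply(1:4, slow_task))

# Same tasks on a 2-worker socket cluster, including start-up and shutdown cost.
t_cluster <- system.time({
  cl <- makePSOCKcluster(2)
  res_cluster <- parLapply(cl, 1:4, slow_task)
  stopCluster(cl)
})

# Compare the "elapsed" column, which is wall-clock time.
rbind(serial = t_serial, cluster = t_cluster)
```

The `elapsed` column is the one to compare across the runs. CPU time spent inside socket workers may not show up under `user`.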
When to go Parallel?
- When the gain from parallelization is much larger than the cost of data transfer, network delays, etc.
- When the problem is embarrassingly parallel: there is no dependency between the parallel tasks.
- Cross-validation and bootstrapping are examples where going parallel works well.
- For iterative numerical methods such as coordinate descent or Newton-Raphson, where each step depends on the previous one, going parallel may not be possible.
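Bootstrapping shows what embarrassingly parallel means in practice: each replicate can be computed without knowing anything about the others. The sketch below bootstraps the mean of `swiss$Fertility` on a two-worker local cluster. The statistic, the 200 replicates and the names `boot_mean` and `boot` are illustrative choices, not taken from the talk.

```r
library(parallel)

cl <- makePSOCKcluster(2)
clusterSetRNGStream(cl, 123)  # reproducible random numbers on the workers

# One bootstrap replicate: resample Fertility with replacement, take the mean.
# datasets::swiss is available in every R session, so nothing needs exporting.
boot_mean <- function(b) {
  x <- datasets::swiss$Fertility
  mean(sample(x, replace = TRUE))
}

# Replicates do not depend on each other, so they can run in any order.
boot <- parSapply(cl, 1:200, boot_mean)
stopCluster(cl)

sd(boot)  # bootstrap estimate of the standard error of the mean
```

A Newton-Raphson iteration, by contrast, cannot be split this way, because each update needs the result of the previous update.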
To infinity and beyond
What is beyond the wall?
- Parallelization in a big-data framework: RHadoop
- Other and related implementations of parallelization: MPI, NWS, etc.
- Other cool libraries: foreach, snowfall, etc.
- GPU!
Where to get the codes?
All the code in this presentation is available at: https://github.com/abhirupkgp/parallelseminar/blob/master/cv.R
Acknowledgements and References
- Sincere thanks to Charles Geyer.
- Resourceful slides by Ryan Rosario.
- Some other, more resourceful slides.
- The Parallel R book.
Thank you!