Parallel Computing with R
Literature Seminar
Abhirup Mallik, [email protected]
School of Statistics, University of Minnesota
November 15, 2013

Upload: abhirup-mallik

Post on 28-Dec-2014

DESCRIPTION

A quick review and demonstration on how to get started on parallel computing with R. Includes an example of SNOW cluster set up in the departmental lab.

TRANSCRIPT

Why Parallel?

- R does not take advantage of multiple cores by default.
- R does not support passing by reference.
- R cannot read files dynamically, etc.

What is Parallel?

- 'Parallel': doing more than one task at the same time.
- Use different cores of the same CPU for different tasks.
- Use different computers in a cluster for different tasks.

How to go Parallel?

Using Multicore (Implicit Parallelism)

The main process forks child processes, which run in parallel on different cores.

    library(parallel)
    mclapply(X, FUN, ...)

Or use

    library(parallel)
    # ... setup stuff ...
    for (isplit in 1:nsplit) {
        mcparallel(some R expression involving isplit)
    }
    # In the parallel package the collector is mccollect()
    # (it was collect() in the older multicore package)
    out <- mccollect()
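The two forking routes above can be sketched as runnable code. This is only an illustration under stated assumptions: `slow_square_sum`, `inputs`, and the choice of two cores are hypothetical and do not come from the slides, and forking is skipped on Windows, where it is not available.

```r
library(parallel)

# Hypothetical stand-in for real work: sum of squares 1^2 + ... + n^2.
slow_square_sum <- function(n) sum(as.numeric(seq_len(n))^2)

inputs <- c(10, 100, 1000, 10000)

# Forking is not available on Windows, so fall back to one core there.
on_windows <- .Platform$OS.type == "windows"
n_cores <- if (on_windows) 1L else 2L   # 2 is an arbitrary choice

# Route 1: mclapply() is used exactly like lapply().
res_serial <- lapply(inputs, slow_square_sum)
res_forked <- mclapply(inputs, slow_square_sum, mc.cores = n_cores)

# Route 2: start each job explicitly, then collect all the results.
if (on_windows) {
  res_jobs <- res_serial
} else {
  jobs <- lapply(inputs, function(n) mcparallel(slow_square_sum(n)))
  res_jobs <- mccollect(jobs)   # list of results, in the order of 'jobs'
}
```

Both routes should give the same values as the plain `lapply()` call; only the way the work is scheduled differs.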

Warnings:

- All child processes compete for memory.
- Closing the terminal or any graphical window kills only the parent.
- 'Ctrl + C' kills the parent, not the children.
- Kill the children if they are unresponsive.

Using SNOW (Explicit Parallelism)

Make a cluster with any one of these options:

    cl <- makeCluster(spec, type, ...)
    cl <- makePSOCKcluster(names, ...)
    cl <- makeForkCluster(nnodes = , ...)

Export essential objects to the cluster (by name, as character strings):

    clusterExport(cl, c("var1", "fun1", ...))

Evaluate on the cluster:

    clusterEvalQ(cl, expr)
    parLapply(cl = NULL, X, fun, ...)
    parSapply(cl = NULL, X, fun, ...)

Stop the cluster:

    stopCluster(cl)
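The make / export / evaluate / stop sequence can be put together in a small end-to-end sketch. The objects `offset` and `add_offset` and the choice of two local workers are hypothetical, chosen only to show why the export step is needed.

```r
library(parallel)

# Two local worker processes; the count is an arbitrary choice.
cl <- makePSOCKcluster(2)

# Hypothetical objects that the workers will need.
offset <- 10
add_offset <- function(x) x + offset

# Socket workers start with empty workspaces, so ship the objects by name.
clusterExport(cl, c("offset", "add_offset"), envir = environment())

# Run one expression on every worker, for example to load a package.
invisible(clusterEvalQ(cl, library(stats)))

res_list <- parLapply(cl, 1:5, add_offset)   # returns a list
res_vec  <- parSapply(cl, 1:5, add_offset)   # simplified to a vector

# Always release the workers when done.
stopCluster(cl)
```

Passing `envir = environment()` is a design choice for the sketch: it exports from wherever the code is run rather than assuming the global environment.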

Demonstration

Using the Swiss fertility data from 1888 (the swiss data set included with base R).

    > str(swiss)
    'data.frame': 47 obs. of 6 variables:
     $ Fertility       : num 80.2 83.1 92.5 85.8 76.9 76.1 ...
     $ Agriculture     : num 17 45.1 39.7 36.5 43.5 35.3 ...
     $ Examination     : int 15 6 5 12 17 9 16 14 12 16 ...
     $ Education       : int 12 9 5 7 15 7 7 8 7 13 ...
     $ Catholic        : num 9.96 84.84 93.4 33.77 5.16 ...
     $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 ...

10-fold cross validation

    # Assign each row of swiss to one of 10 folds at random
    fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)

Cross validation for the i-th fold:

    library(randomForest)

    fold.cv <- function(i) {
        # Hold out fold i for testing and train on the rest
        train <- swiss[fold != i, ]
        test <- swiss[fold == i, ]
        swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                                 data = train)
        predict.test <- predict(swiss.rf, test, type = "response")
        actual.test <- sqrt(test$Fertility)
        err <- predict.test - actual.test
        sum(err * err)  # sum of squared errors on the held-out fold
    }
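Sampling fold labels with replacement can give folds of quite different sizes. A common alternative, sketched here, shuffles a repeated 1..10 vector so the folds are balanced. To keep the sketch runnable with base R only, `lm()` stands in for `randomForest()`; that substitution, the seed, the explicit predictor list, and the names `fold_balanced` and `fold_cv_lm` are assumptions made for illustration, not part of the slides.

```r
set.seed(1)   # arbitrary seed, only to make the sketch reproducible

# Balanced folds: repeat 1..10 to cover all rows, then shuffle.
n <- nrow(swiss)
fold_balanced <- sample(rep(seq_len(10), length.out = n))
table(fold_balanced)   # every fold holds 4 or 5 of the 47 rows

# Same shape as fold.cv, with lm() standing in for randomForest().
# The predictors are spelled out: all columns except Fertility, with
# Catholic replaced by the indicator I(Catholic < 50).
fold_cv_lm <- function(i, fold) {
  train <- swiss[fold != i, ]
  test  <- swiss[fold == i, ]
  fit <- lm(sqrt(Fertility) ~ Agriculture + Examination + Education +
              Infant.Mortality + I(Catholic < 50), data = train)
  err <- predict(fit, newdata = test) - sqrt(test$Fertility)
  sum(err * err)   # sum of squared errors on the held-out fold
}

sse_by_fold <- vapply(seq_len(10), fold_cv_lm, numeric(1),
                      fold = fold_balanced)
cv_mse <- sum(sse_by_fold) / n   # cross-validated mean squared error
```

Passing `fold` as an argument, instead of reading it from the calling workspace, means nothing besides the function itself has to be exported when the function is later sent to cluster workers.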

How to create a cluster?

Create a local cluster of size 4 (parallel socket):

    cl <- makePSOCKcluster(4)

Create a local cluster on different cores of the CPU (8 cores):

    cl <- makeForkCluster(8)
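The sizes 4 and 8 above fit the presenter's machine. As a hedged sketch, the worker count can instead be derived from `detectCores()`; the cap of 4 and the rule of leaving one core free are arbitrary assumptions, not recommendations from the slides.

```r
library(parallel)

# detectCores() can return NA on some platforms, so guard against that.
n_avail <- detectCores()
if (is.na(n_avail)) n_avail <- 1L

# Leave one core for the parent session; cap at 4 workers (arbitrary).
n_workers <- max(1L, min(4L, n_avail - 1L))

cl <- makePSOCKcluster(n_workers)

# Each worker is a separate R process: ask every one for its process ID.
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

stopCluster(cl)
```

The process IDs differ from each other and from the parent's `Sys.getpid()`, which is a quick way to confirm that the workers really are separate processes.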

How to create a cluster in our LAB?

Create a password-less login using ssh-keygen (from the shell):

    ssh-keygen -t dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

    # Check which computers are running
    grephosts LAB
    # Then ssh to each computer you want to connect to once,
    # and it will be remembered for the session.

Now we are ready to make a cluster:

    library(parallel)
    machines <- c("crab", "sugar", "strike", "hyland", "lovejoy", "driller")
    # Look up the IP address of each host name with nsl()
    address <- rapply(lapply(machines, nsl), c)
    cl <- makePSOCKcluster(address)

If you are connecting to stat.umn.edu from your own computer, to create a password-less ssh session:

    ssh-keygen -t dsa
    # Then use scp to copy id_dsa.pub to ~/.ssh/authorized_keys

Comparison

On cluster:

    > system.time({
    +   garbage <- clusterEvalQ(cl, data(swiss))
    +   garbage <- clusterEvalQ(cl, library(randomForest))
    +   clusterExport(cl, c("fold", "fold.cv"))
    +   clusterSetRNGStream(cl, 123)
    +   res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    +   stopCluster(cl)
    + })
       user  system elapsed
      0.008   0.000   0.838

On Multicore:

    > system.time({
    +   res1 <- do.call(c, mclapply(1:10, fold.cv, mc.cores = 8))
    + })
       user  system elapsed
      0.386   0.162   0.120

Using Fork cluster:

    > system.time({
    +   cl <- makeForkCluster(8)
    +   garbage <- clusterEvalQ(cl, data(swiss))
    +   garbage <- clusterEvalQ(cl, library(randomForest))
    +   clusterExport(cl, c("fold", "fold.cv"))
    +   clusterSetRNGStream(cl, 123)
    +   res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    +   stopCluster(cl)
    + })
       user  system elapsed
      0.010   0.054   0.153

Without any parallelization:

    > system.time({
    +   res2 <- do.call(c, lapply(1:10, fold.cv))
    + })
       user  system elapsed
      0.233   0.000   0.235
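The timings above depend on the machine and on the size of the problem. A sketch for repeating the comparison without the randomForest dependency follows; `busy_task` is a hypothetical CPU-bound stand-in for `fold.cv`, the worker count is arbitrary, and no particular timings should be expected from it.

```r
library(parallel)

# Hypothetical CPU-bound stand-in for fold.cv: one number per task.
busy_task <- function(i) {
  x <- 0
  for (k in seq_len(2e5)) x <- x + sin(k * i)
  x
}

tasks <- 1:10
n_workers <- 2L   # arbitrary; adjust to the machine

# No parallelization.
t_serial <- system.time(res_serial <- lapply(tasks, busy_task))

# Socket cluster; start-up and shutdown are timed as well.
t_psock <- system.time({
  cl <- makePSOCKcluster(n_workers)
  res_psock <- parLapply(cl, tasks, busy_task)
  stopCluster(cl)
})

# Forking with mclapply(); falls back to one core on Windows.
mc <- if (.Platform$OS.type == "windows") 1L else n_workers
t_fork <- system.time(res_fork <- mclapply(tasks, busy_task, mc.cores = mc))

# Wall-clock seconds for each variant.
elapsed <- rbind(serial = t_serial, psock = t_psock, fork = t_fork)[, "elapsed"]
elapsed
```

For small tasks the cost of starting workers and moving data can outweigh the gain, which is consistent with the socket-cluster run above being the slowest of the four.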

When to go Parallel?

- When the gain from parallelization is much more than the cost of data transfer, network delays, etc.
- If the problem is embarrassingly parallel: no dependency between the parallel tasks.
- Cross validation and bootstrapping are examples where going parallel works.
- For iterative numerical methods like coordinate descent or Newton-Raphson, going parallel may not be possible, because each iteration depends on the result of the previous one.
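The contrast between the two kinds of problem can be sketched with a bootstrap, whose replicates are independent, next to a Newton-Raphson iteration, whose steps are not. The statistic (the mean of `swiss$Fertility`), the number of replicates, the worker count, and the square-root example are all illustrative assumptions.

```r
library(parallel)

x <- swiss$Fertility
n_boot <- 200   # arbitrary number of bootstrap replicates

# One replicate: resample with replacement, recompute the statistic.
# The index b is unused; it only lets parSapply() hand out the tasks.
boot_mean <- function(b, x) mean(sample(x, replace = TRUE))

cl <- makePSOCKcluster(2)      # worker count is an arbitrary choice
clusterSetRNGStream(cl, 123)   # separate random-number stream per worker
boot_means <- parSapply(cl, seq_len(n_boot), boot_mean, x = x)
stopCluster(cl)

se_boot <- sd(boot_means)      # bootstrap standard error of the mean

# By contrast, Newton-Raphson for sqrt(2): step k needs the result of
# step k - 1, so the steps cannot be handed out to different workers.
r <- 1
for (k in 1:6) r <- r - (r^2 - 2) / (2 * r)
```

No replicate needs anything from another, so the bootstrap splits cleanly across workers; the Newton loop has to run in order.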

What is beyond the wall?

- Parallelization in a Big Data framework: RHadoop
- Other and related implementations of parallelization: MPI, NWS, etc.
- Other cool libraries: foreach, snowfall, etc.
- GPU!!

Where to get the codes?

All the code in this presentation is available at: https://github.com/abhirupkgp/parallelseminar/blob/master/cv.R

Acknowledgements and References

- Sincere thanks to Charles Geyer.
- Resourceful slides by Ryan Rosario.
- Some other and more resourceful slides.
- Parallel R Book

Thank You!