Cluster and SGE (March 3rd, 2006, Chen Peng, Lilly System Biology)


Page 1: Cluster and SGE

March 3rd, 2006, Chen Peng, Lilly System Biology

Page 2: Objectives

- To raise awareness of our resources
- To explore the possibilities

- Not hands-on training for SGE
- No details on parallel programming or the bix environment

Page 3: Agenda

- Introduction
- Cluster@LSB
  - Hardware/software; what can we do and what can we not?
- Working with the cluster
  - SGE: power and limitations; how can the cluster help us?
  - Using SGE to manage jobs as a non-privileged user; why can the cluster be evil?
- Q/A

Page 4: Cluster, cheap super-computing

Features:
- Composed of commodity PCs
- Inter-connected with a high-speed network
- Resources managed by the OS or software
- Handles complex tasks as a single unit
- High scalability
- Low cost: a top-10 HPC cluster (2,400 CPUs) for less than 6 million dollars

Page 5: A cluster in Singapore

- 64 nodes (2 CPUs/node)
- Gigabit Ethernet
- Extended to 75 and then 80 nodes
- Managed by LSF
- Implemented for less than 400K SGD

Page 6: Clusters: quick overview

Various types:
- High Availability (DB, file server)
- Load Balancing (web server, search engine)
- High Performance Computing (our focus today)

Page 7: HPC Cluster: Beowulf

Beowulf:
- The first implementation of cluster computing
- Targets parallel computing tasks
- Needs fast inter-connections
- Requires developing parallel code:
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)

Page 8: HPC Cluster: Mosix

Mosix/OpenMosix/OpenSSI:
- Based on the Beowulf implementation
- SSI: Single System Image
  - OS shared by all the nodes
  - Kernel-level integration
- Automatic process migration among nodes
- Ideal for parallel tasks and IO-intensive work
- Still need to develop parallel code

Page 9: HPC Cluster: Compute farm (1)

Jobs are managed by a Distributed Resource Management (DRM) system (LSF, SGE, PBS, Torque).

Page 10: HPC Cluster: Compute Farm (2)

Embarrassingly parallel:
- Large numbers of nearly identical jobs, with different inputs or parameters
- The user or a user-space application prepares the inputs or parameters
- Little inter-job communication; mostly independent tasks
- Example: High Throughput Blast (see the sketch below)
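As a sketch of what "preparing the jobs" can look like in SGE, an embarrassingly parallel batch maps naturally onto an array job. run_blast.sh and its input-file naming are hypothetical here:

    # Submit 100 near-identical tasks as a single array job (task IDs 1..100).
    # SGE sets $SGE_TASK_ID for each task; the hypothetical run_blast.sh
    # uses it to pick its own input, e.g. input.$SGE_TASK_ID.fasta.
    qsub -t 1-100 -cwd run_blast.sh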

Page 11: Introduction: summary

- Features
- Various types
- HPC architectures compared:
  - Beowulf: inter-connected PCs running parallel code via MPI/PVM
  - Mosix: kernel integration for process migration
  - Compute farm: runs many similar jobs in an "embarrassingly parallel" manner

Page 12: Agenda

- Introduction
- Cluster@LSB
  - Hardware/software; what can we do and what can we not?
- Working with the cluster
  - SGE: power and limitations; how can the cluster help us?
  - Using SGE to manage jobs as a non-privileged user; why can the cluster be evil?
- Q/A

Page 13: Hardware

It is a compute farm:
- Head node (pecos): 2x Intel P3 1.2 GHz, 4 GB RAM
- 32 compute nodes: 2x Intel P3 1.2 GHz, 2 GB RAM each
- 100 Mb Ethernet

Page 14: Software

- OS: RedHat 9
- Sun Grid Engine (SGE)
- MPI libraries: LAM/MPI, MPICH (see the sketch after this list)
- PVM libraries: PVM-v3
- Parallel computing packages for R: Rmpi, rpvm, SNOW
- Matlab for distributed computing

Coming soon...
- Parallel Blast implemented by "Scalable System"
- A general command line interface
- Wrappers for HT Blast and mpiBLAST
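On how the MPI libraries and SGE fit together: MPI jobs go through an SGE parallel environment, whose name is configured per site. A sketch assuming a PE named "lam" for LAM/MPI (the actual names can be listed with `qconf -spl`); mpi_job.sh and my_mpi_prog are hypothetical:

    # Ask SGE for 4 slots in the "lam" parallel environment.
    qsub -pe lam 4 -cwd mpi_job.sh

    # Inside mpi_job.sh, the MPI program is started on the granted slots;
    # SGE exports the slot count as $NSLOTS:
    #   mpirun -np $NSLOTS ./my_mpi_prog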

Page 15: About current configuration

- 64 CPUs are available
- Limited memory on each node, not shared
- Slow inter-connections

What does all this mean?
- Capable of managing a large number of jobs, each with little communication with the others
- Not recommended for serious parallel jobs, but may work for proof-of-concept tasks

Page 16: Agenda

- Introduction
- Cluster@LSB
  - Hardware/software; what can we do and what can we not?
- Working with the cluster
  - SGE: power and limitations; how can the cluster help us?
  - Using SGE to manage jobs as a non-privileged user; why can the cluster be evil?
- Q/A

Page 17: Sun Grid Engine

Sun Grid Engine is Distributed Resource Management (DRM) software. It is helpful to (see the sketch below):
- optimally place computing tasks
- allow users to queue more computing tasks
- ensure that tasks are executed fairly with respect to priority
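A minimal sketch of what submitting a task looks like for a non-privileged user: wrap the command in a small job script and hand it to qsub. The script name and the my_analysis command are hypothetical:

    #!/bin/sh
    #$ -S /bin/sh   # run the job under /bin/sh
    #$ -cwd         # execute in the submission directory
    ./my_analysis input.txt > output.txt

Saved as, say, myjob.sh, this is submitted with `qsub myjob.sh`; SGE queues it and runs it on whichever node the scheduler picks.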

Page 18: You might have been like this...

- User 1 has 100 analyses. They would take 20 hours to run serially on volga, so he moves some of the jobs to hudson...
- Evil user 2 is running heavy programs on hudson, and they hog all the CPUs!

Page 19: Life could be easier with a cluster

- The user submits the 100 analyses to the cluster using SGE
- SGE finds the best available node to run each analysis
- The results could come back in less than one hour!

Page 20: How can the cluster help us?

Use SGE to help manage analyses, e.g. running SIG3 with 23 cell lines (90+ runs).

What we did:
- Three people, each managing 30+ runs (around 3 hours)

What we could do (see the sketch below):
- One person prepares the SGE jobs in 10 minutes, submits them, and gets the results in half an hour
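A sketch of that "10 minutes of preparation", assuming a hypothetical per-run script run_sig3.sh that takes a cell line name, and a file cell_lines.txt listing the 23 lines:

    # One SGE job per cell line; -N gives each job a readable name.
    while read cell_line; do
        qsub -cwd -N "sig3_$cell_line" run_sig3.sh "$cell_line"
    done < cell_lines.txt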

Page 21: How can the cluster help us? (cont'd)

- Use the cluster/SGE to speed up analyses: the annotation pipeline, SRS indexing
- Explore parallel code: the random forest algorithm, etc.

Page 22: Commonly used SGE cmds

- Submit a job: qsub (see the example session below)
- Cancel a job: qdel
- Check my job: qstat -j jobID
- Check queue status: qstat
- Check cluster status: qhost
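A short example session with these commands, assuming a hypothetical job script myjob.sh and an example job ID of 1234:

    qsub -cwd myjob.sh    # submit; SGE prints the assigned job ID
    qstat                 # overview of pending and running jobs
    qstat -j 1234         # detailed status of job 1234
    qhost                 # load and memory on each node
    qdel 1234             # cancel job 1234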

Page 23: SGE Demo

Ramneek's script: extract_gene.pl
- It took 3.5-4 minutes to run with four genes.
- We will run the script in parallel and complete the analysis in one minute.

Page 24: Demo: script "sge_run"

It is a (4-line) shell wrapper for "extract_gene.pl" (a sketch follows):
- Split the input file into smaller pieces
- Submit the jobs to SGE
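The wrapper itself is not reproduced in the transcript; a plausible sketch under stated assumptions (one record per line in the input, extract_gene.pl takes an input file as its argument, and SGE accepts -b y to submit a command directly), not the actual sge_run:

    #!/bin/sh
    # Split the input into one-line pieces named piece.aa, piece.ab, ...
    split -l 1 "$1" piece.
    # Submit one SGE job per piece, running from the shared current directory.
    for f in piece.*; do
        qsub -cwd -b y ./extract_gene.pl "$f"
    done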

Page 25: Demo: run the script

Page 26: Sun Grid Engine: Limitations

- Embarrassingly parallel: jobs need to be prepared.
- Submission host != execution host: the user needs to redirect output to files in a shared location (see the sketch below).
- It is difficult to debug cluster jobs.
- Limited support for automated job recovery.
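Because the job executes on a different host than it was submitted from, its output should land on storage every node can see. A minimal sketch, where /shared/results is an assumed shared (e.g. NFS-mounted) path and myjob.sh is hypothetical:

    # -o/-e send the job's stdout and stderr to files on shared storage.
    qsub -cwd -o /shared/results/job.out -e /shared/results/job.err myjob.sh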

Page 27: The evil side

Potential threats to the entire system:
- Concurrent requests may burn the file server or database server
- Network traffic (switch, router, etc.)

How to fix?
- User education
- Good management practices
- Admin validates in-house developed code

Page 28: Cluster and SGE: summary

- SGE: powerful, but with limitations
- How the cluster may help us
- The importance of using the cluster correctly

Page 29: Cluster and SGE