Gridbus2003 University of Melbourne, Australia, June 7, 2003 OpenSCE Middleware and Tools set for Cluster and Grid System Putchong Uthayopas Director of High Performance Computing and Networking Center Associate Professor in Computer Engineering Faculty of Engineering, Kasetsart University Bangkok, Thailand

Gridbus2003 University of Melbourne, Australia, June 7, 2003

OpenSCE Middleware and Tools set for Cluster and Grid System

Putchong Uthayopas

Director of High Performance Computing and Networking Center
Associate Professor in Computer Engineering
Faculty of Engineering, Kasetsart University

Bangkok, Thailand

OpenSCE: Scalable Cluster Environment

• An open source project that intends to deliver an integrated open source cluster environment

• Phase 1: 1997-2000 as the SMILE project
– Scalable Multicomputer Implemented using Low-cost Equipment

• Phase 2: 2001-2003 OpenSCE project

• www.opensce.org

SCE Components

• MPview – MPI program visualization

• MPITH – Quick and simple MPI runtime

• SQMS – Batch scheduler for cluster

• SCMS/SCMSWEB – cluster management tool

• Beowulf Builder (BB, SBB) – cluster builder

• KSIX – cluster middleware

SCE Structures

[Figure: SCE structure diagram. Blocks shown: KSIX Middleware, SCMS System Management, SQMS Scheduler, Beowulf Builder Tool, Real Time Monitoring, MPITH, MPVIEW, Hardware and Interconnection network]

KSIX Middleware

• Presenting a single system image to application
– Unify process space, process group
– Distributed signal management
– Membership services
– Simple I/O redirection

KSIX User Level Process Migration

• LibMIG
– Checkpointing
– Migration
– Pure user level code
– No recompilation

• Next version of KSIX will support load balancing

• Algorithm?

AMATA HA architecture

• AMATA is a project to build
– a scalable high availability extension to Linux clustering

• AMATA
– Defines a uniform HA architecture on Linux
– Services, API, Signal

SQMS: Queuing Management System

• Batch scheduler for sequential and parallel MPI tasks

• Static and dynamic load balancing

• Reconfigurable scheduling policy

• Multiple resource and policy view

• Simple accounting and economic modeling support (Cluster Bank server)

[Figure: SQMS structure. Components shown: Submitter, Task, TaskQueue, NodeAllocator, Scheduler, RemoteQueue, Cluster Nodes]

SCMS: Cluster Management Tool for Beowulf Cluster

• A collection of system management tools for Beowulf cluster

• Package includes
– Portable real-time monitoring
– Parallel Unix command
– Alarm system
– Large collection of graphical user interface tools for users and system administrator

MPITH

• Small MPI runtime (40-50 functions)
– OO design
– C++ Language
– More than 15000 lines of C++ code
– Linux operating system

• Architecture

• Selected implementation issue

Preliminary Study

• Only 20-30 functions are used by most developers

[Figure: bar chart of MPI function count per application (axis 0 to 60). Applications: PETSc, MPI Blacs, MPI Povray, HPL, PGAPack; bar labels in extraction order: 52, 38, 11, 21, 14]

MPITH

[Figure: MPITH architecture. Labels shown: API, Engine, Controller, Communicator, Protocol, Buffer manager, Algorithm, Device Handler, Device, P IC A, and the transports UDP, DP, TCP, VIA]

Broadcast Performance

[Figure: Broadcast time of 1 B - 1 MB on 16 nodes. x-axis: message size (kByte), y-axis: time (microsecond), both logarithmic; series: MPITH, MPICH, LAM]

[Figure: Broadcast time of 64 kB on 2 - 16 nodes. x-axis: number of nodes, y-axis: time (microsecond); series: MPITH, MPICH, LAM]

[Figure: Send/Receive time of 1 B to 1 MB. x-axis: message size (kByte), y-axis: time (microsecond), both logarithmic; series: MPITH, MPICH, LAM]

Parallel Gaussian Elimination

[Figure: Speed up of Gaussian elimination running time of 2400 variables on 1 to 16 nodes. x-axis: number of nodes, y-axis: speed up; series: MPITH, MPICH, LAM]

[Figure: Gaussian elimination running time of 400 to 2400 variables on 16 nodes. x-axis: problem size, y-axis: time (second); series: MPITH, MPICH, LAM]

Energy Model for Implicit Coscheduling

• Each process has stored “Energy”

• A process charges/discharges “energy” while it executes

• Charge/discharge rate is calculated from process statistics
– Communication frequency
– Message size
– Number of running processes in the system

• The charging and discharging state changes when communication state changes

• Local scheduling priority is calculated from
– Static priority
– Energy level

[Figure: energy versus time for a process, with state changes marked along the curve; two of them are labeled "State Change (SwitchS triggered)"]
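The slide names the inputs to the energy model but not its formulas, so the following is only a toy sketch of the idea. The rate formula, the rule that a process charges while communicating and discharges otherwise, the energy cap, and every name (`Proc`, `rate`, `tick`, `local_priority`) are assumptions made for illustration, not the authors' kernel code.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    static_priority: int
    energy: float = 0.0
    communicating: bool = False   # current communication state
    msgs_per_tick: float = 0.0    # communication frequency
    avg_msg_bytes: float = 0.0    # message size


def rate(proc, n_running):
    """Hypothetical charge/discharge rate built from the three statistics on the slide."""
    traffic = proc.msgs_per_tick * (1.0 + proc.avg_msg_bytes / 1024.0)
    return traffic / max(n_running, 1)


def tick(proc, n_running, max_energy=100.0):
    """Advance one timer tick: charge while communicating, discharge otherwise (assumed)."""
    delta = rate(proc, n_running)
    if proc.communicating:
        proc.energy = min(max_energy, proc.energy + delta)
    else:
        proc.energy = max(0.0, proc.energy - delta)


def local_priority(proc):
    """Local scheduling priority from static priority and energy level (assumed additive)."""
    return proc.static_priority + proc.energy


p = Proc(static_priority=10, communicating=True, msgs_per_tick=4, avg_msg_bytes=1024)
tick(p, n_running=2)
boosted = local_priority(p)

p.communicating = False   # a communication state change flips charging to discharging
tick(p, n_running=2)
relaxed = local_priority(p)
```

In this sketch a process that communicates often rises in local priority, which is the effect implicit coscheduling wants; the real rates and direction would have to come from the authors' implementation.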

Implementation Details

• Implemented at kernel level as a Linux Kernel Module (LKM)
– Kernel version 2.4.19 (the latest at the time)
– Uses the Linux timer mechanism to periodically inspect the kernel task queue and adjust the value of each task_struct
– The user needs to tell the system which processes to coschedule, using the command line
– The _exit system call is trapped to ensure that all internal variables are cleared when a process exits

Runtime of parallel application against sequential workload

• Single MG against 1-10 sequential workload

Efficient Collective Communication Algorithm over Grid system

• Genetic Algorithms-based Dynamic Tree (GADT)
– Heuristic based on genetic algorithm
– Total transmission time is used as fitness value

[Figure: example broadcast tree over nodes 0-7 with numbered edges, encoded as two arrays of length n-1. Parent array (n-1): 0 1 7 0 2 02. Priority array (n-1): 5 9 7 20 8 215]
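As a rough illustration of how a tree encoded by a parent array and a priority array could be scored, the sketch below computes a broadcast completion time that a genetic algorithm could use as its fitness value. The communication model is an assumption: each node forwards the message to its children one at a time, in increasing priority order, and `cost[a][b]` is a hypothetical per-link transmission time. The slide does not say how GADT itself interprets the priority array.

```python
from collections import defaultdict


def broadcast_time(parent, priority, cost, root=0):
    """Completion time of a broadcast along the tree given by `parent`.

    parent[node]   -> parent of each non-root node
    priority[node] -> order in which a parent serves its children (lower first, assumed)
    cost[a][b]     -> time to send the message from a to b (hypothetical input)
    """
    children = defaultdict(list)
    for node, par in parent.items():
        children[par].append(node)

    arrival = {root: 0.0}
    pending = [root]
    while pending:
        sender = pending.pop()
        clock = arrival[sender]
        # The sender transmits to one child at a time, in priority order.
        for child in sorted(children[sender], key=lambda c: priority[c]):
            clock += cost[sender][child]
            arrival[child] = clock
            pending.append(child)
    return max(arrival.values())


cost = [[1.0] * 4 for _ in range(4)]   # uniform links between 4 nodes
two_level = broadcast_time({1: 0, 2: 0, 3: 1}, {1: 1, 2: 2, 3: 1}, cost)
chain = broadcast_time({1: 0, 2: 1, 3: 2}, {1: 1, 2: 1, 3: 1}, cost)
```

With uniform links the two-level tree finishes sooner than the chain, so a genetic search minimizing this value would prefer it; on a grid with uneven links the ranking can differ, which is the case a dynamic tree is meant to handle.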

Algorithms Comparison

[Figure: Transmission of 3 algorithms. x-axis: pattern (1 to 10), y-axis: time (0 to 30000); series: Optimal, GADT, Binomial]

OpenSCE and Grid Computing

• Software
– Grid Observer
– SCEGrid Grid scheduler
– HyperGrid Simulator

[Figure: diagram with blocks labeled OpenSCE (twice), Globus, SCE/Grid, and GridObserver]

SCE/Grid Architecture

• Distributed resource manager

• Running on top of Globus

• Automatically discovering resources

• Automatically choosing target site

[Figure: GRID containing Site A, Site B, and Site C, each with SCEGrid]

Structure

[Figure: SCE/Grid structure across a Submission Machine and a Remote Execution Machine. Components shown: SCE/Grid Scheduler, Job Queue, Policy Module, SCE/Grid Dispatcher, Launcher (Globus), Globus GASS Server, Globus Gate Keeper, Globus GRAM, Local Job Scheduler (PBS, SGE, ..), SCE/Grid REM, Job]

Grid Observer (KU)

• Building technology to monitor the grid

• Software is now used by APGrid Test Bed

[Figure: Grid Observer structure. Components shown: Sensors, Collector, Analyser, Presenter, Data, and other monitoring systems (SNMP, NWS, Ganglia, etc.)]

Grid CFD

[Figure: Grid CFD on ThaiGrid. Labels shown, each appearing twice: Front End, Sequential Solver, Visualization; Parallel CFD Solver]

Grid Scheduling

• Problem
– How to use distributed/heterogeneous resources
  • Efficiently
  • Cost effectively

• Approach
– Model the grid scheduling problem
– Find good heuristic algorithms

• Grid Scheduling
– Partial State Scheduling
– C-Sufferage with cost scheduling
– Vector Space Modeling of computational Grid
– CFD Task mapping using GA

Grid Model

• Grid
– Collection of autonomous systems

• Autonomous system
– Collection of computing nodes
– Contains a local scheduler

• Local Scheduler
– Resource manager
– Maintains local task queue and manages resource pool, e.g. computing nodes

[Figure: GRID containing System A, System B, and System C]

Grid Vector Space Model

• Each node has m resources

• Each system has n nodes

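$$N = \begin{bmatrix} R_1 & R_2 & R_3 & \cdots & R_m \end{bmatrix}$$

$$S = \begin{bmatrix}
R_{11} & R_{12} & R_{13} & \cdots & R_{1m} \\
R_{21} & R_{22} & R_{23} & \cdots & R_{2m} \\
R_{31} & R_{32} & R_{33} & \cdots & R_{3m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R_{n1} & R_{n2} & R_{n3} & \cdots & R_{nm}
\end{bmatrix}$$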

Execution Model

• Each task has W works to be done

• Estimated execution time depends on execution rate of each node

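[Equation not recoverable from the extraction: it gives the estimated execution time T_exec of a task with work W on node i, with terms labeled execution rate, load, and speed]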

Resource Commerce Model (RC)

• Proposed task allocation model on Grid system
– Batch scheduling
– Sequential job
– Economic model: rental cost structure, objective function
– Framework for several proposed heuristics

RC for On-line scheduling

• Single task
– On-line
– Let Ci be rental cost of running the task t on node Si
– Result: On-line minimum cost assignment is O(n log n)

• Multiple task
– Batch
– Parallel
– Let Cij be rental cost of running task tj on node Si

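$$C_i = \underbrace{\begin{bmatrix} R_1 & R_2 & R_3 & \cdots & R_m \end{bmatrix}}_{\text{cost rate vector}} \; \underbrace{\begin{bmatrix} R_{1t} \\ R_{2t} \\ R_{3t} \\ \vdots \\ R_{mt} \end{bmatrix}}_{\text{amount of required resources vector}}$$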
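A minimal sketch of the single-task case, assuming the rental cost Ci is the inner product of node Si's cost rate vector and the task's required-resources vector, which is how the two labeled vectors above appear to combine. The function names and the sample numbers are hypothetical.

```python
def rental_cost(cost_rate, required):
    """Cost of one task on one node: sum over the m resources of rate * amount (assumed form)."""
    if len(cost_rate) != len(required):
        raise ValueError("both vectors must have one entry per resource")
    return sum(rate * amount for rate, amount in zip(cost_rate, required))


def rank_nodes(node_cost_rates, required):
    """Return (cost, node index) pairs sorted from cheapest to most expensive."""
    costs = [(rental_cost(rates, required), i)
             for i, rates in enumerate(node_cost_rates)]
    return sorted(costs)


# Three nodes, two resources each (say CPU and memory), made-up cost rates.
node_cost_rates = [[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]]
required = [4.0, 1.0]          # the task's required-resources vector
ranking = rank_nodes(node_cost_rates, required)
cheapest_cost, cheapest_node = ranking[0]
```

Sorting the n nodes by cost takes O(n log n) time for a fixed number of resources, which matches the bound quoted on the slide, although the slide does not say that this is how the bound was obtained.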

Objective function for RC model

• pij = priority index of running job i on machine j

• eij = execution time of job i on machine j

• Let rj be ready time of machine j

• Let ft be time factor

• Let ftb be time balance factor

• Let fc be cost factor

• Let fcb be cost balance factor

$$p_{ij} = f_t e_{ij} + f_{tb} r_j + f_c C_{ij} + f_{cb} c_j$$

Some Algorithms

• C-Max/Min

• C-Min/Min

• C-Sufferage

• C-Sufferage with Deadline
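The slides list these heuristics by name only. As a hedged illustration of the general idea, the sketch below runs the classic Min-Min greedy loop, but ranks job/machine pairs by a priority index that mixes time and cost terms. It is not the authors' C-Min/Min: the weighted-sum form of the index, the reading of the balance term as the cost already accumulated on a machine, and all names are assumptions.

```python
def cost_aware_min_min(exec_time, cost, f_t=1.0, f_tb=1.0, f_c=1.0, f_cb=1.0):
    """Greedy Min-Min-style mapping driven by a time-plus-cost index.

    exec_time[i][j] -> execution time of job i on machine j
    cost[i][j]      -> rental cost of job i on machine j
    Returns (schedule, makespan, total_cost).
    """
    n_machines = len(exec_time[0])
    ready = [0.0] * n_machines   # ready time of each machine
    spent = [0.0] * n_machines   # cost accumulated on each machine (assumed meaning)
    unassigned = set(range(len(exec_time)))
    schedule = {}

    def index(i, j):
        # Assumed weighted sum; lower is better.
        return (f_t * exec_time[i][j] + f_tb * ready[j]
                + f_c * cost[i][j] + f_cb * spent[j])

    while unassigned:
        # Min-Min step: each job's best machine, then the job with the smallest
        # best index, which equals the smallest index over all remaining pairs.
        _, job, machine = min((index(i, j), i, j)
                              for i in unassigned for j in range(n_machines))
        schedule[job] = machine
        ready[machine] += exec_time[job][machine]
        spent[machine] += cost[job][machine]
        unassigned.remove(job)

    return schedule, max(ready), sum(spent)


exec_time = [[2.0, 4.0], [3.0, 1.0]]
cost = [[1.0, 1.0], [1.0, 1.0]]
schedule, makespan, total_cost = cost_aware_min_min(
    exec_time, cost, f_t=1.0, f_tb=1.0, f_c=0.0, f_cb=0.0)
```

Setting the two cost weights to zero, as in the usage lines, reduces the loop to plain Min-Min on completion time; raising them trades makespan for a cheaper mapping.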

Cost

[Figure: Cost ($G, logarithmic axis from 1.00E+00 to 1.00E+11) versus number of machines (100 to 10000); series: CMax-Min, CMin-Min, CSufferage, Max-Min, Min-Min, Sufferage]

Hypersim Simulator

• Discrete event simulation engine from AIT/KU Collaboration
– C++ Class
– Event-based Model
– Fast event processing

• Concept
– User defines the system using an event graph
  • When A occurs and condition (i) is true, event B is scheduled to occur at current time + t
– Hypersim maintains event state, state transition

[Figure: event graph edge from A to B, labeled with delay t and condition (i)]
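The event-graph rule above (when A occurs and condition (i) holds, schedule B at current time + t) can be sketched with a priority queue. This is a small Python illustration of the concept only; Hypersim itself is described as a C++ class library, and the `EventGraph` class and its methods here are hypothetical names, not its API.

```python
import heapq
import itertools


class EventGraph:
    """Toy event-graph simulator: each edge carries a delay and a condition."""

    def __init__(self):
        self.edges = {}                  # source event -> [(target, delay, condition)]
        self.queue = []                  # min-heap of (time, sequence number, event)
        self._seq = itertools.count()    # tie-breaker for events at the same time

    def add_edge(self, source, target, delay, condition=lambda state: True):
        self.edges.setdefault(source, []).append((target, delay, condition))

    def schedule(self, event, time):
        heapq.heappush(self.queue, (time, next(self._seq), event))

    def run(self, state):
        """Process events in time order and return the (time, event) trace."""
        trace = []
        while self.queue:
            now, _, event = heapq.heappop(self.queue)
            trace.append((now, event))
            for target, delay, condition in self.edges.get(event, []):
                if condition(state):
                    self.schedule(target, now + delay)
        return trace


graph = EventGraph()
graph.add_edge("A", "B", delay=5.0, condition=lambda s: s["ready"])
graph.schedule("A", 0.0)
trace = graph.run({"ready": True})
```

A graph with a cycle would keep scheduling events, so a real engine needs a stop condition such as an end time; the sketch leaves that out to stay short.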

Grid Model

[Figure: event graph of the grid model. Labels shown: INPUT, ENTER, SCHEDULE, START, STAGE_IN, STAGE_EXE, EXECUTE, STAGE_OUT, FINISH; EENTER, ESCHEDULE, ESTART, EFINISH; Tie, Tsl, Tse]

Some Results

[Figure: Run time (seconds, logarithmic axis from 0.01 to 100000) versus number of tasks (0 to 20000); series: GridSim, SimGrid, HyperSim]

Future Work

• More understanding about Grid economy
• Complete our MPI, use it on the grid (before SC2003)
• Many new algorithms
• Tools for ApGrid/PRAGMA
• Collaboration
– GridBank Grid Market Interface for OpenSCE scheduler
– GridScape for our portal

The End

Kasetsart University

• Leading multidisciplinary academic institute in Thailand

• Second oldest university in Thailand

• About 25000 students in 5 campuses around the country

• Leading in
– Biotechnology
– Computational chemistry
– Computer science and engineering
– Agricultural technology

KU HPC Research

• Much advanced research is being pursued by KU researchers
– Computer-Aided Molecular Modeling and Design of HIV-1 Inhibitors
– Bioinformatics research to improve rice quality
– Computational Fluid Dynamics for CAD/CAM, vehicle design, clean room
– VLSI test simulation
– Massive information and knowledge, analysis, storage, retrieval

• All this research requires a massive amount of computing power!

KU Cluster Evolution

[Figure: bar chart of cluster performance in Mflops (axis 0 to 40000). Systems: SMILE, SMILE2, SMILE3, AMATA1, AMATA2, PIRUN, GASS, MAEKA; bar labels in extraction order: 2, 200, 660, 900, 6000, 10000, 20000, 40000]

Since 1999, KU has always owned the fastest computing system in Thailand

MAEKA System: Massive Adaptable Environment for Kasetsart Applications

• Collaboration with AMD Inc.

• Initial Phase
– 32 processors (16 dual-processor nodes) Opteron system
– Gigabit Ethernet
– Massive and scalable storage
– 50-80 Gigaflops

• Fastest computing system in Thailand.

• Much larger system will be built this year

Structures and Components

[Figure: components shown: User, Scheduler, Dispatcher, GIIS/GRIS, LDAP, Gatekeeper, GRAM, jobmanager, Local Scheduler (PBS, Condor, SQMS, ...), GRID]

[1] a user submits a job
[2] queries available resources
[3] chooses the target site and dispatches the job
[4] submits the job to the target site
[5] waits until finish
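The five numbered steps can be mimicked with a small in-memory model. Everything in this sketch is hypothetical: the `Site` record, the "most free nodes" selection policy, and the function names are stand-ins for illustration, and no Globus call (GIIS/GRIS query, GRAM submission) is modeled.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_nodes: int


def choose_site(sites, nodes_needed):
    """Steps 2-3: look at available resources and pick a target site (assumed policy)."""
    candidates = [s for s in sites if s.free_nodes >= nodes_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.free_nodes)


def submit(nodes_needed, sites):
    """Steps 1-4: accept a job, choose a site, and hand the job to it."""
    target = choose_site(sites, nodes_needed)
    if target is None:
        return None                       # nothing fits; the job would stay queued
    target.free_nodes -= nodes_needed     # the site's local scheduler takes the nodes
    return target.name


sites = [Site("Site A", 4), Site("Site B", 16), Site("Site C", 8)]
first = submit(8, sites)     # goes to the site with the most free nodes
second = submit(8, sites)    # Site B and Site C now tie at 8 free nodes
third = submit(32, sites)    # no site is large enough
```

Step 5 (waiting for completion) is left out; in a real deployment the dispatcher would poll or be notified by the remote job manager.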