TRANSCRIPT
Gridbus2003 University of Melbourne, Australia, June 7, 2003
OpenSCE: Middleware and Tool Set for Cluster and Grid Systems
Putchong Uthayopas
Director, High Performance Computing and Networking Center
Associate Professor in Computer Engineering
Faculty of Engineering, Kasetsart University
Bangkok, Thailand
OpenSCE: Scalable Cluster Environment
• An open source project that aims to deliver an integrated open source cluster environment
• Phase 1 (1997-2000): the SMILE project – Scalable Multicomputer Implemented using Low-cost Equipment
• Phase 2 (2001-2003): the OpenSCE project
• www.opensce.org
SCE Components
• MPview – MPI program visualization
• MPITH – Quick and simple MPI runtime
• SQMS – Batch scheduler for clusters
• SCMS/SCMSWEB – Cluster management tools
• Beowulf Builder (BB, SBB) – Cluster builder
• KSIX – Cluster middleware
SCE Structures
[Architecture diagram: KSIX middleware, SCMS system management, SQMS scheduler, Beowulf Builder tool, real-time monitoring, MPITH, and MPVIEW layered over the hardware and interconnection network]
KSIX Middleware
• Presents a single system image to applications
  – Unified process space and process groups
  – Distributed signal management
  – Membership services
  – Simple I/O redirection
KSIX User Level Process Migration
• LibMIG (a generic checkpoint/restart sketch follows below)
  – Checkpointing
  – Migration
  – Pure user-level code
  – No recompilation
• The next version of KSIX will support load balancing
• Which load-balancing algorithm to use is still an open question
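To make the idea of pure user-level checkpointing concrete, here is a generic sketch (not LibMIG's actual API or code): the application serializes the state it needs to a file and restores it when restarted, possibly on a different node.

```cpp
// Generic user-level checkpoint/restart sketch (not the LibMIG implementation).
// The application itself saves and restores its state; no kernel support or
// recompilation of system libraries is required.
#include <cstddef>
#include <cstdio>
#include <vector>

struct SolverState {
    long iteration = 0;
    std::vector<double> data;
};

// Write the state to a checkpoint file.
bool checkpoint(const SolverState& s, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t n = s.data.size();
    std::fwrite(&s.iteration, sizeof s.iteration, 1, f);
    std::fwrite(&n, sizeof n, 1, f);
    std::fwrite(s.data.data(), sizeof(double), n, f);
    std::fclose(f);
    return true;
}

// Restore the state; returns false if no usable checkpoint exists.
bool restore(SolverState& s, const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::size_t n = 0;
    bool ok = std::fread(&s.iteration, sizeof s.iteration, 1, f) == 1 &&
              std::fread(&n, sizeof n, 1, f) == 1;
    if (ok) {
        s.data.resize(n);
        ok = std::fread(s.data.data(), sizeof(double), n, f) == n;
    }
    std::fclose(f);
    return ok;
}
```

Migration then amounts to copying the checkpoint file to another node and resuming from the saved iteration.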
AMATA HA architecture
• AMATA is a project to build a scalable high-availability extension to Linux clustering
• AMATA defines a uniform HA architecture on Linux
  – Services, API, signals
SQMS: Queuing Management System
• Batch scheduler for sequential and parallel MPI tasks
• Static and dynamic load balancing
• Reconfigurable scheduling policy
• Multiple resource and policy view
• Simple accounting and economic modeling support (Cluster Bank server)
[Diagram: Submitter → Task → Task Queue → Scheduler → Node Allocator → Cluster Nodes, plus a Remote Queue (a simplified sketch of this flow follows)]
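A highly simplified sketch of that flow (illustrative only; the class names and FIFO policy are assumptions, not the actual SQMS code):

```cpp
// Minimal batch-queue sketch (illustrative; not the actual SQMS code).
#include <deque>
#include <iostream>
#include <string>

struct Task { std::string name; int nodesNeeded; };

class NodeAllocator {
    int freeNodes_;
public:
    explicit NodeAllocator(int total) : freeNodes_(total) {}
    bool tryAllocate(int n) { if (n > freeNodes_) return false; freeNodes_ -= n; return true; }
    void release(int n) { freeNodes_ += n; }
};

class Scheduler {
    std::deque<Task> queue_;   // task queue filled by submitters
    NodeAllocator& nodes_;
public:
    explicit Scheduler(NodeAllocator& nodes) : nodes_(nodes) {}
    void submit(const Task& t) { queue_.push_back(t); }
    // One scheduling pass: launch queued tasks while nodes are available (FIFO policy).
    void schedule() {
        while (!queue_.empty() && nodes_.tryAllocate(queue_.front().nodesNeeded)) {
            std::cout << "launching " << queue_.front().name << "\n";
            queue_.pop_front();
        }
    }
};

int main() {
    NodeAllocator nodes(8);
    Scheduler sqms(nodes);
    sqms.submit({"mpi-job", 4});
    sqms.submit({"serial-job", 1});
    sqms.schedule();
}
```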
SCMS: Cluster Management Tool for Beowulf Cluster
• A collection of system management tools for Beowulf cluster
• Package includes
  – Portable real-time monitoring
  – Parallel Unix commands
  – Alarm system
  – A large collection of graphical user interface tools for users and system administrators
MPITH
• Small MPI runtime (40-50 functions)
  – OO design
  – C++ language
  – More than 15,000 lines of C++ code
  – Linux operating system
• Architecture
• Selected implementation issues
Preliminary Study
• Only 20-30 functions are used by most developers
MPI function count per application:
  PETSc      52
  MPI Blacs  38
  Povray     11
  HPL        21
  PGAPack    14
MPITH
[Architecture diagram: the MPITH API sits on top of an engine (communicator, protocol, buffer manager, algorithm, controller), which drives device handlers over devices such as UDP, DP, TCP, and VIA]
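The layering above suggests a pluggable device abstraction beneath the engine. A generic sketch of such a design (class and method names are assumptions for illustration, not MPITH's actual classes):

```cpp
// Generic sketch of a layered message-passing design with pluggable devices
// (illustrative only; these are not MPITH's actual classes).
#include <cstddef>
#include <memory>

// Lowest layer: a device moves raw bytes to/from a peer (e.g. over TCP, UDP or VIA).
class Device {
public:
    virtual ~Device() = default;
    virtual void send(int peer, const void* buf, std::size_t len) = 0;
    virtual std::size_t recv(int peer, void* buf, std::size_t len) = 0;
};

class TcpDevice : public Device {
public:
    void send(int, const void*, std::size_t) override { /* write to a socket */ }
    std::size_t recv(int, void*, std::size_t) override { /* read from a socket */ return 0; }
};

// Middle layer: the engine adds protocol/buffer management and collective algorithms.
class Engine {
    std::unique_ptr<Device> dev_;
public:
    explicit Engine(std::unique_ptr<Device> dev) : dev_(std::move(dev)) {}
    void pointToPoint(int peer, const void* buf, std::size_t len) { dev_->send(peer, buf, len); }
    // A simple flat broadcast from rank 0; a real engine would pick an algorithm here.
    void broadcast(int myRank, int nRanks, void* buf, std::size_t len) {
        if (myRank == 0)
            for (int r = 1; r < nRanks; ++r) dev_->send(r, buf, len);
        else
            dev_->recv(0, buf, len);
    }
};
```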
Broadcast Performance
[Charts comparing MPITH, MPICH, and LAM:
  – Broadcast time of 1 B - 1 MB messages on 16 nodes (time in microseconds vs. message size in kByte)
  – Broadcast time of 64 kB messages on 2-16 nodes (time in microseconds vs. number of nodes)
  – Send/receive time of 1 B to 1 MB messages (time in microseconds vs. message size in kByte)]
Parallel Gaussian Elimination
[Charts comparing MPITH, MPICH, and LAM:
  – Speedup of Gaussian elimination with 2400 variables on 1-16 nodes
  – Gaussian elimination running time for 400-2400 variables on 16 nodes (time in seconds vs. problem size)]
Energy Model for Implicit Coscheduling
• Each process has stored "energy"
• A process charges/discharges energy while it executes
• The charge/discharge rate is calculated from process statistics
  – Communication frequency
  – Message size
  – Number of running processes in the system
• The charging/discharging state changes when the communication state changes
• The local scheduling priority is calculated from
  – Static priority
  – Energy level
(a sketch of this bookkeeping follows the chart below)
[Chart: a process's energy level over time, with state changes (SwitchS triggered) at communication transitions]
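A user-space sketch of the energy bookkeeping described above (the rate formula, the clamping range, and the weight are assumptions for illustration; the actual module is kernel code and derives its rates from the process statistics listed on the slide):

```cpp
// Sketch of the energy-based priority idea (illustrative; the real code lives
// in a kernel module and derives rates from observed process statistics).
#include <algorithm>

struct CoschedProcess {
    double staticPriority;   // base priority of the process
    double energy;           // stored "energy", kept within [0, maxEnergy]
    bool   charging;         // flips when the communication state changes
};

const double maxEnergy = 100.0;

// Assumed rate model: more frequent/larger communication and fewer competing
// processes make the energy change faster.
double rate(double commFreqHz, double avgMsgBytes, int runningProcs) {
    return (commFreqHz * avgMsgBytes) / (1.0 + runningProcs);
}

// Called when the process switches between communicating and computing.
void onCommStateChange(CoschedProcess& p) { p.charging = !p.charging; }

// Periodic update (e.g. from a timer): charge or discharge the energy.
void tick(CoschedProcess& p, double dtSec, double r) {
    p.energy += (p.charging ? r : -r) * dtSec;
    p.energy = std::min(maxEnergy, std::max(0.0, p.energy));
}

// Local scheduling priority combines static priority and energy level.
double effectivePriority(const CoschedProcess& p, double energyWeight) {
    return p.staticPriority + energyWeight * p.energy;
}
```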
Implementation Details
• Implemented at kernel level as a Linux Kernel Module (LKM)
  – Kernel version 2.4.19 (the latest at the time)
  – Uses the Linux timer mechanism to periodically inspect the kernel task queue and adjust the values in each task_struct
  – The user tells the system which processes to coschedule via the command line
  – The _exit system call is trapped to ensure that all internal variables are cleared when a process exits
Runtime of a parallel application against a sequential workload
• A single MG run against 1-10 sequential workloads
Efficient Collective Communication Algorithms over Grid Systems
• Genetic Algorithm-based Dynamic Tree (GADT)
  – Heuristic based on a genetic algorithm
  – Total transmission time is used as the fitness value (see the sketch after the diagram below)
[Diagram: an 8-node broadcast tree and its chromosome encoding as a parent array and a priority array, each of length n-1]
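A sketch of how a tree encoded this way can be scored by its total transmission time, i.e. the fitness value (the cost model of one time unit per message and the inclusion of the root in the arrays are simplifying assumptions; GADT's actual evaluation may differ):

```cpp
// Sketch: score a broadcast tree encoded as parent/priority arrays by its
// completion time (assumed cost model: a parent sends to its children one at a
// time, one time unit per message, higher-priority children first).
#include <algorithm>
#include <iostream>
#include <vector>

// parent[i] = parent of node i (the root has parent[i] == i),
// priority[i] = send-order hint for node i.
double broadcastTime(const std::vector<int>& parent, const std::vector<int>& priority) {
    int n = parent.size();
    std::vector<std::vector<int>> children(n);
    int root = 0;
    for (int i = 0; i < n; ++i) {
        if (parent[i] == i) root = i;
        else children[parent[i]].push_back(i);
    }
    for (auto& c : children)
        std::sort(c.begin(), c.end(),
                  [&](int a, int b) { return priority[a] > priority[b]; });
    std::vector<double> arrival(n, 0.0);
    double makespan = 0.0;
    // Walk the tree from the root; arrival[child] = arrival[parent] + send slot.
    std::vector<int> stack{root};
    while (!stack.empty()) {
        int u = stack.back(); stack.pop_back();
        int slot = 0;
        for (int c : children[u]) {
            arrival[c] = arrival[u] + ++slot;   // one time unit per message
            makespan = std::max(makespan, arrival[c]);
            stack.push_back(c);
        }
    }
    return makespan;   // used as the fitness value (lower is better)
}

int main() {
    std::vector<int> parent{0, 0, 0, 1, 1, 2, 2, 3};   // example 8-node tree
    std::vector<int> prio  {0, 5, 9, 7, 20, 8, 2, 15}; // example priorities
    std::cout << "fitness = " << broadcastTime(parent, prio) << "\n";
}
```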
Algorithms Comparison
[Chart: transmission time of the Optimal, GADT, and Binomial algorithms over 10 communication patterns]
OpenSCE and Grid Computing
• Software
  – Grid Observer
  – SCEGrid Grid scheduler
  – HyperGrid simulator
[Diagram: OpenSCE clusters connected through Globus, with SCE/Grid and Grid Observer on top]
SCE/Grid Architecture
• Distributed resource manager
• Running on top of Globus
• Automatically discovering resources
• Automatically choosing target site
[Diagram: SCEGrid instances at Site A, Site B, and Site C connected through the GRID]
Structure
[Diagram: on the submission machine, the SCE/Grid scheduler (job queue, policy modules) feeds the SCE/Grid dispatcher, which drives launchers including a Globus launcher; the remote execution machine runs the Globus GASS server, gatekeeper, and GRAM, a local job scheduler (PBS, SGE, ...), and the SCE/Grid REM]
Grid Observer (KU)
• Building technology to monitor the grid
• The software is now used by the ApGrid testbed
[Diagram: sensors and other monitoring systems (SNMP, NWS, Ganglia, etc.) feed data to collectors, which pass it to analysers and presenters]
Grid CFD
[Diagram: ThaiGrid connecting a front end (sequential solver, visualization) to parallel CFD solvers]
Grid Scheduling
• Problem
  – How to use distributed/heterogeneous resources
    • Efficiently
    • Cost effectively
• Approach
  – Model the grid scheduling problem
  – Find good heuristic algorithms
• Grid scheduling work
  – Partial state scheduling
  – C-Sufferage with cost scheduling
  – Vector space modeling of the computational grid
  – CFD task mapping using GA
Grid Model
• Grid
  – A collection of autonomous systems
• Autonomous system
  – A collection of computing nodes
  – Contains a local scheduler
• Local scheduler
  – Resource manager
  – Maintains the local task queue and manages the resource pool (e.g., computing nodes)
[Diagram: autonomous systems A, B, and C connected through the GRID]
Grid Vector Space Model
• Each node has m resources
• Each system has n nodes
Node resource vector:
$$N_i = \begin{bmatrix} R_{i1} & R_{i2} & R_{i3} & \cdots & R_{im} \end{bmatrix}$$
System matrix:
$$S = \begin{bmatrix}
R_{11} & R_{12} & R_{13} & \cdots & R_{1m} \\
R_{21} & R_{22} & R_{23} & \cdots & R_{2m} \\
R_{31} & R_{32} & R_{33} & \cdots & R_{3m} \\
\vdots &        &        &        & \vdots \\
R_{n1} & R_{n2} & R_{n3} & \cdots & R_{nm}
\end{bmatrix}$$
Execution Model
• Each task has $W$ units of work to be done
• The estimated execution time depends on the execution rate of each node:
$$T_i^{exec} = \frac{W}{\rho_i}$$
where $\rho_i$ is the execution rate of node $i$, determined by the node's speed and current load.
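A small worked example (the numbers are assumed purely for illustration):
$$W = 2000~\text{Mflop}, \quad \rho_i = 500~\text{Mflop/s} \;\Rightarrow\; T_i^{exec} = \frac{2000}{500} = 4~\text{s}.$$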
Resource Commerce Model (RC)
• Proposed task allocation model for Grid systems
  – Batch scheduling
  – Sequential jobs
  – Economic model: rental cost structure, objective function
  – Framework for several proposed heuristics
RC for On-line scheduling
• Single task
  – On-line
  – Let Ci be the rental cost of running task t on node Si
  – Result: on-line minimum cost assignment is O(n log n)
• Multiple tasks
  – Batch
  – Parallel
  – Let Cij be the rental cost of running task tj on node Si
The rental cost is the product of the node's cost rate vector and the task's required resource vector:
$$C_i = \begin{bmatrix} R_{i1} & R_{i2} & R_{i3} & \cdots & R_{im} \end{bmatrix}
\begin{bmatrix} R^t_1 \\ R^t_2 \\ R^t_3 \\ \vdots \\ R^t_m \end{bmatrix}$$
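One way the stated O(n log n) on-line bound can be realized is to compute each node's rental cost for the arriving task and sort the nodes by that cost; the sketch below does this (an illustrative reading of the result, not necessarily the algorithm used in the RC work):

```cpp
// Sketch of a minimum-cost assignment for a single arriving task
// (illustrative; the exact O(n log n) algorithm in the RC work may differ).
#include <algorithm>
#include <numeric>
#include <vector>

// Cost of running the task on node i: dot product of the node's cost-rate
// vector and the task's required-resource vector (as in the slide).
double rentalCost(const std::vector<double>& costRate,
                  const std::vector<double>& required) {
    return std::inner_product(costRate.begin(), costRate.end(),
                              required.begin(), 0.0);
}

// Sort nodes by cost (O(n log n)) and pick the cheapest feasible one.
int assign(const std::vector<std::vector<double>>& nodeCostRates,
           const std::vector<double>& required,
           const std::vector<bool>& feasible) {
    int n = nodeCostRates.size();
    std::vector<double> cost(n);
    for (int i = 0; i < n; ++i) cost[i] = rentalCost(nodeCostRates[i], required);
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cost[a] < cost[b]; });
    for (int i : order)
        if (feasible[i]) return i;   // cheapest feasible node
    return -1;                        // no node can host the task
}
```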
Objective function for RC model
• pij = priority index of running job i on machine j
• eij = execution time of job i on machine j
• Let rj be the ready time of machine j
• Let ft be the time factor
• Let ftb be the time balance factor
• Let fc be the cost factor
• Let fcb be the cost balance factor
$$p_{ij} = f_t\, e_{ij} + f_{tb}\, r_j + f_c\, C_{ij} + f_{cb}\, c_j$$
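The reconstructed priority index maps directly onto a small helper function (illustrative; c_j is read here as a per-machine cost-balance term, which is an assumption):

```cpp
// Priority index for the RC model as reconstructed above (illustrative only;
// cj is taken as a per-machine cost-balance term, which is an assumption).
double priorityIndex(double eij,  // execution time of job i on machine j
                     double rj,   // ready time of machine j
                     double Cij,  // rental cost of job i on machine j
                     double cj,   // cost-balance term for machine j
                     double ft, double ftb, double fc, double fcb) {
    return ft * eij + ftb * rj + fc * Cij + fcb * cj;
}
```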
Some Algorithms
• C-Max/Min
• C-Min/Min
• C-Sufferage
• C-Sufferage with Deadline (a sketch of the underlying Sufferage heuristic follows below)
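For reference, a simplified one-task-per-round sketch of the classic Sufferage heuristic that these C- variants build on (the cost-aware versions presumably fold in cost, e.g. via the priority index above; that substitution, and the one-task-per-round simplification, are assumptions):

```cpp
// Simplified Sufferage sketch: repeatedly pick the unmapped task whose
// "sufferage" (second-best minus best completion time) is largest and map it
// to its best machine. The original heuristic resolves per-machine conflicts
// each round; this sketch assigns one task per round for brevity.
#include <limits>
#include <vector>

// exec[t][m] = execution time of task t on machine m.
std::vector<int> sufferage(const std::vector<std::vector<double>>& exec) {
    int nTasks = exec.size();
    int nMachines = nTasks ? exec[0].size() : 0;
    std::vector<int> assignment(nTasks, -1);
    if (nMachines == 0) return assignment;
    std::vector<bool> done(nTasks, false);
    std::vector<double> ready(nMachines, 0.0);
    for (int round = 0; round < nTasks; ++round) {
        int bestTask = -1, bestMachine = -1;
        double bestSufferage = -1.0;
        for (int t = 0; t < nTasks; ++t) {
            if (done[t]) continue;
            double best = std::numeric_limits<double>::max(), second = best;
            int bestM = -1;
            for (int m = 0; m < nMachines; ++m) {
                double c = ready[m] + exec[t][m];          // completion time
                if (c < best) { second = best; best = c; bestM = m; }
                else if (c < second) { second = c; }
            }
            double suf = second - best;                    // the "sufferage"
            if (suf > bestSufferage) { bestSufferage = suf; bestTask = t; bestMachine = bestM; }
        }
        assignment[bestTask] = bestMachine;
        done[bestTask] = true;
        ready[bestMachine] += exec[bestTask][bestMachine];
    }
    return assignment;
}
```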
Cost
[Chart: total cost ($G, log scale) vs. number of machines (100-10000) for CMax-Min, CMin-Min, CSufferage, Max-Min, Min-Min, and Sufferage]
Hypersim Simulator
• Discrete event simulation engine from an AIT/KU collaboration
  – C++ classes
  – Event-based model
  – Fast event processing
• Concept
  – The user defines the system as an event graph: when event A occurs and condition (i) is true, event B is scheduled to occur at the current time + t (a minimal sketch follows the diagram below)
  – Hypersim maintains the event state and state transitions
[Event graph edge: A --(t, condition i)--> B]
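A tiny event-graph-style engine illustrating that rule (a generic sketch, not Hypersim's actual classes):

```cpp
// Minimal event-graph style discrete event engine (illustrative; not the
// actual Hypersim classes). When an event fires, its handler can schedule
// follow-up events at now + delay, optionally guarded by a condition.
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Event {
    double time;
    std::string name;
    bool operator>(const Event& o) const { return time > o.time; }
};

class Engine {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pending_;
    double now_ = 0.0;
public:
    double now() const { return now_; }
    void schedule(const std::string& name, double delay) {
        pending_.push({now_ + delay, name});
    }
    void run(const std::function<void(Engine&, const Event&)>& onEvent) {
        while (!pending_.empty()) {
            Event e = pending_.top(); pending_.pop();
            now_ = e.time;
            onEvent(*this, e);   // the handler encodes the event-graph edges
        }
    }
};

int main() {
    Engine sim;
    bool resourceFree = true;                    // example condition (i)
    sim.schedule("ENTER", 0.0);
    sim.run([&](Engine& s, const Event& e) {
        std::cout << s.now() << ": " << e.name << "\n";
        if (e.name == "ENTER" && resourceFree)   // edge ENTER -> EXECUTE if (i)
            s.schedule("EXECUTE", 2.0);
        else if (e.name == "EXECUTE")            // edge EXECUTE -> FINISH
            s.schedule("FINISH", 5.0);
    });
}
```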
Grid Model
[Event graph of the simulated grid: a job ENTERs, is SCHEDULEd, STARTs, then goes through STAGE_IN, EXECUTE, STAGE_OUT, and FINISH, driven by the events EENTER, ESCHEDULE, ESTART, and EFINISH with transition delays]
Some Results
[Chart: simulator run time in seconds (log scale) vs. number of tasks (up to 20000) for GridSim, SimGrid, and HyperSim]
Future Work
• Better understanding of the Grid economy
• Complete our MPI and use it on the grid (before SC2003)
• Many new algorithms
• Tools for ApGrid/PRAGMA
• Collaboration
  – GridBank Grid Market interface for the OpenSCE scheduler
  – GridScape for our portal
The End
Kasetsart University
• A leading multidisciplinary academic institution in Thailand
• The second oldest university in Thailand
• About 25,000 students on 5 campuses around the country
• Leading in
  – Biotechnology
  – Computational chemistry
  – Computer science and engineering
  – Agricultural technology
KU HPC Research
• Many advanced research projects are being pursued by KU researchers
  – Computer-aided molecular modeling and design of HIV-1 inhibitors
  – Bioinformatics research to improve rice quality
  – Computational fluid dynamics for CAD/CAM, vehicle design, and clean rooms
  – VLSI test simulation
  – Massive information and knowledge analysis, storage, and retrieval
• All of these projects require a massive amount of computing power!
KU Cluster Evolution
Peak performance (Mflops) by cluster:
  SMILE      2
  SMILE2     200
  SMILE3     660
  AMATA1     900
  AMATA2     6000
  PIRUN      10000
  GASS       20000
  MAEKA      40000
Since 1999, KU has always owned the fastest computing system in Thailand.
MAEKA System: Massive Adaptable Environment for Kasetsart Applications
• Collaboration with AMD Inc.
• Initial phase
  – 32-processor (16 dual-processor nodes) Opteron system
  – Gigabit Ethernet
  – Massive and scalable storage
  – 50-80 Gigaflops
• Fastest computing system in Thailand
• A much larger system will be built this year
Structures and Components
[Diagram: the user submits to the scheduler/dispatcher, which talks to GIIS/GRIS (LDAP) and the gatekeeper/jobmanager (GRAM), which in turn hands the job to a local scheduler (PBS, Condor, SQMS, ...) on the GRID]
[1] A user submits a job
[2] The scheduler queries the available resources
[3] It chooses the target site and dispatches the job
[4] The dispatcher submits the job to the target site
[5] It waits until the job finishes