TRANSCRIPT
Gridbus2003 University of Melbourne, Australia, June 7, 2003
OpenSCE: Middleware and Tool Set for Cluster and Grid Systems
Putchong Uthayopas
Director, High Performance Computing and Networking Center
Associate Professor in Computer Engineering
Faculty of Engineering, Kasetsart University
Bangkok, Thailand
OpenSCE: Scalable Cluster Environment
• An open source project that aims to deliver an integrated open source cluster environment
• Phase 1 (1997-2000): the SMILE project – Scalable Multicomputer Implemented using Low-cost Equipment
• Phase 2 (2001-2003): the OpenSCE project
• www.opensce.org
SCE Components
• MPview – MPI program visualization
• MPITH – Quick and simple MPI runtime
• SQMS – Batch scheduler for clusters
• SCMS/SCMSWEB – Cluster management tools
• Beowulf Builder (BB, SBB) – Cluster builder
• KSIX – Cluster middleware
SCE Structures
[Architecture diagram: KSIX middleware, SCMS system management, SQMS scheduler, Beowulf Builder tool, real-time monitoring, MPITH, and MPVIEW layered over the hardware and interconnection network]
KSIX Middleware
• Presents a single system image to applications
  – Unified process space and process groups
  – Distributed signal management
  – Membership services
  – Simple I/O redirection
KSIX User Level Process Migration
• LibMIG (a generic checkpoint/restart sketch follows below)
  – Checkpointing
  – Migration
  – Pure user-level code
  – No recompilation
• The next version of KSIX will support load balancing
• Which load-balancing algorithm to use is still an open question
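To make the idea of pure user-level checkpointing concrete, here is a generic sketch (not LibMIG's actual API or code): the application serializes the state it needs to a file and restores it when restarted, possibly on a different node.

```cpp
// Generic user-level checkpoint/restart sketch (not the LibMIG implementation).
// The application itself saves and restores its state; no kernel support or
// recompilation of system libraries is required.
#include <cstddef>
#include <cstdio>
#include <vector>

struct SolverState {
    long iteration = 0;
    std::vector<double> data;
};

// Write the state to a checkpoint file.
bool checkpoint(const SolverState& s, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t n = s.data.size();
    std::fwrite(&s.iteration, sizeof s.iteration, 1, f);
    std::fwrite(&n, sizeof n, 1, f);
    std::fwrite(s.data.data(), sizeof(double), n, f);
    std::fclose(f);
    return true;
}

// Restore the state; returns false if no usable checkpoint exists.
bool restore(SolverState& s, const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::size_t n = 0;
    bool ok = std::fread(&s.iteration, sizeof s.iteration, 1, f) == 1 &&
              std::fread(&n, sizeof n, 1, f) == 1;
    if (ok) {
        s.data.resize(n);
        ok = std::fread(s.data.data(), sizeof(double), n, f) == n;
    }
    std::fclose(f);
    return ok;
}
```

Migration then amounts to copying the checkpoint file to another node and resuming from the saved iteration.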
AMATA HA architecture
• AMATA is a project to build a scalable high-availability extension to Linux clustering
• AMATA defines a uniform HA architecture on Linux
  – Services, API, signals
SQMS: Queuing Management System
• Batch scheduler for sequential and parallel MPI tasks
• Static and dynamic load balancing
• Reconfigurable scheduling policy
• Multiple resource and policy view
• Simple accounting and economic modeling support (Cluster Bank server)
[Diagram: Submitter → Task → Task Queue → Scheduler → Node Allocator → Cluster Nodes, plus a Remote Queue (a simplified sketch of this flow follows)]
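A highly simplified sketch of that flow (illustrative only; the class names and FIFO policy are assumptions, not the actual SQMS code):

```cpp
// Minimal batch-queue sketch (illustrative; not the actual SQMS code).
#include <deque>
#include <iostream>
#include <string>

struct Task { std::string name; int nodesNeeded; };

class NodeAllocator {
    int freeNodes_;
public:
    explicit NodeAllocator(int total) : freeNodes_(total) {}
    bool tryAllocate(int n) { if (n > freeNodes_) return false; freeNodes_ -= n; return true; }
    void release(int n) { freeNodes_ += n; }
};

class Scheduler {
    std::deque<Task> queue_;   // task queue filled by submitters
    NodeAllocator& nodes_;
public:
    explicit Scheduler(NodeAllocator& nodes) : nodes_(nodes) {}
    void submit(const Task& t) { queue_.push_back(t); }
    // One scheduling pass: launch queued tasks while nodes are available (FIFO policy).
    void schedule() {
        while (!queue_.empty() && nodes_.tryAllocate(queue_.front().nodesNeeded)) {
            std::cout << "launching " << queue_.front().name << "\n";
            queue_.pop_front();
        }
    }
};

int main() {
    NodeAllocator nodes(8);
    Scheduler sqms(nodes);
    sqms.submit({"mpi-job", 4});
    sqms.submit({"serial-job", 1});
    sqms.schedule();
}
```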
SCMS: Cluster Management Tool for Beowulf Cluster
• A collection of system management tools for Beowulf cluster
• Package includes
  – Portable real-time monitoring
  – Parallel Unix commands
  – Alarm system
  – A large collection of graphical user interface tools for users and system administrators
MPITH
• Small MPI runtime (40-50 functions)
  – OO design
  – C++ language
  – More than 15,000 lines of C++ code
  – Linux operating system
• Architecture
• Selected implementation issues
Preliminary Study
• Only 20-30 functions are used by most developers
MPI function count per application:
  PETSc      52
  MPI Blacs  38
  Povray     11
  HPL        21
  PGAPack    14
MPITH
[Architecture diagram: the MPITH API sits on top of an engine (communicator, protocol, buffer manager, algorithm, controller), which drives device handlers over devices such as UDP, DP, TCP, and VIA]
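The layering above suggests a pluggable device abstraction beneath the engine. A generic sketch of such a design (class and method names are assumptions for illustration, not MPITH's actual classes):

```cpp
// Generic sketch of a layered message-passing design with pluggable devices
// (illustrative only; these are not MPITH's actual classes).
#include <cstddef>
#include <memory>

// Lowest layer: a device moves raw bytes to/from a peer (e.g. over TCP, UDP or VIA).
class Device {
public:
    virtual ~Device() = default;
    virtual void send(int peer, const void* buf, std::size_t len) = 0;
    virtual std::size_t recv(int peer, void* buf, std::size_t len) = 0;
};

class TcpDevice : public Device {
public:
    void send(int, const void*, std::size_t) override { /* write to a socket */ }
    std::size_t recv(int, void*, std::size_t) override { /* read from a socket */ return 0; }
};

// Middle layer: the engine adds protocol/buffer management and collective algorithms.
class Engine {
    std::unique_ptr<Device> dev_;
public:
    explicit Engine(std::unique_ptr<Device> dev) : dev_(std::move(dev)) {}
    void pointToPoint(int peer, const void* buf, std::size_t len) { dev_->send(peer, buf, len); }
    // A simple flat broadcast from rank 0; a real engine would pick an algorithm here.
    void broadcast(int myRank, int nRanks, void* buf, std::size_t len) {
        if (myRank == 0)
            for (int r = 1; r < nRanks; ++r) dev_->send(r, buf, len);
        else
            dev_->recv(0, buf, len);
    }
};
```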
Broadcast Performance
[Charts comparing MPITH, MPICH, and LAM:
  – Broadcast time of 1 B - 1 MB messages on 16 nodes (time in microseconds vs. message size in kByte)
  – Broadcast time of 64 kB messages on 2-16 nodes (time in microseconds vs. number of nodes)
  – Send/receive time of 1 B to 1 MB messages (time in microseconds vs. message size in kByte)]
Parallel Gaussian Elimination
[Charts comparing MPITH, MPICH, and LAM:
  – Speedup of Gaussian elimination with 2400 variables on 1-16 nodes
  – Gaussian elimination running time for 400-2400 variables on 16 nodes (time in seconds vs. problem size)]
Energy Model for Implicit Coscheduling
• Each process has stored "energy"
• A process charges/discharges energy while it executes
• The charge/discharge rate is calculated from process statistics
  – Communication frequency
  – Message size
  – Number of running processes in the system
• The charging/discharging state changes when the communication state changes
• The local scheduling priority is calculated from
  – Static priority
  – Energy level
(a sketch of this bookkeeping follows the chart below)
[Chart: a process's energy level over time, with state changes (SwitchS triggered) at communication transitions]
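A user-space sketch of the energy bookkeeping described above (the rate formula, the clamping range, and the weight are assumptions for illustration; the actual module is kernel code and derives its rates from the process statistics listed on the slide):

```cpp
// Sketch of the energy-based priority idea (illustrative; the real code lives
// in a kernel module and derives rates from observed process statistics).
#include <algorithm>

struct CoschedProcess {
    double staticPriority;   // base priority of the process
    double energy;           // stored "energy", kept within [0, maxEnergy]
    bool   charging;         // flips when the communication state changes
};

const double maxEnergy = 100.0;

// Assumed rate model: more frequent/larger communication and fewer competing
// processes make the energy change faster.
double rate(double commFreqHz, double avgMsgBytes, int runningProcs) {
    return (commFreqHz * avgMsgBytes) / (1.0 + runningProcs);
}

// Called when the process switches between communicating and computing.
void onCommStateChange(CoschedProcess& p) { p.charging = !p.charging; }

// Periodic update (e.g. from a timer): charge or discharge the energy.
void tick(CoschedProcess& p, double dtSec, double r) {
    p.energy += (p.charging ? r : -r) * dtSec;
    p.energy = std::min(maxEnergy, std::max(0.0, p.energy));
}

// Local scheduling priority combines static priority and energy level.
double effectivePriority(const CoschedProcess& p, double energyWeight) {
    return p.staticPriority + energyWeight * p.energy;
}
```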
Implementation Details
• Implemented at kernel level as a Linux Kernel Module (LKM)
  – Kernel version 2.4.19 (the latest at the time)
  – Uses the Linux timer mechanism to periodically inspect the kernel task queue and adjust the values in each task_struct
  – The user tells the system which processes to coschedule via the command line
  – The _exit system call is trapped to ensure that all internal variables are cleared when a process exits
Runtime of a parallel application against a sequential workload
• A single MG run against 1-10 sequential workloads
Efficient Collective Communication Algorithms over Grid Systems
• Genetic Algorithm-based Dynamic Tree (GADT)
  – Heuristic based on a genetic algorithm
  – Total transmission time is used as the fitness value (see the sketch after the diagram below)
[Diagram: an 8-node broadcast tree and its chromosome encoding as a parent array and a priority array, each of length n-1]
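A sketch of how a tree encoded this way can be scored by its total transmission time, i.e. the fitness value (the cost model of one time unit per message and the inclusion of the root in the arrays are simplifying assumptions; GADT's actual evaluation may differ):

```cpp
// Sketch: score a broadcast tree encoded as parent/priority arrays by its
// completion time (assumed cost model: a parent sends to its children one at a
// time, one time unit per message, higher-priority children first).
#include <algorithm>
#include <iostream>
#include <vector>

// parent[i] = parent of node i (the root has parent[i] == i),
// priority[i] = send-order hint for node i.
double broadcastTime(const std::vector<int>& parent, const std::vector<int>& priority) {
    int n = parent.size();
    std::vector<std::vector<int>> children(n);
    int root = 0;
    for (int i = 0; i < n; ++i) {
        if (parent[i] == i) root = i;
        else children[parent[i]].push_back(i);
    }
    for (auto& c : children)
        std::sort(c.begin(), c.end(),
                  [&](int a, int b) { return priority[a] > priority[b]; });
    std::vector<double> arrival(n, 0.0);
    double makespan = 0.0;
    // Walk the tree from the root; arrival[child] = arrival[parent] + send slot.
    std::vector<int> stack{root};
    while (!stack.empty()) {
        int u = stack.back(); stack.pop_back();
        int slot = 0;
        for (int c : children[u]) {
            arrival[c] = arrival[u] + ++slot;   // one time unit per message
            makespan = std::max(makespan, arrival[c]);
            stack.push_back(c);
        }
    }
    return makespan;   // used as the fitness value (lower is better)
}

int main() {
    std::vector<int> parent{0, 0, 0, 1, 1, 2, 2, 3};   // example 8-node tree
    std::vector<int> prio  {0, 5, 9, 7, 20, 8, 2, 15}; // example priorities
    std::cout << "fitness = " << broadcastTime(parent, prio) << "\n";
}
```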
Algorithms Comparison
[Chart: transmission time of the Optimal, GADT, and Binomial algorithms over 10 communication patterns]
OpenSCE and Grid Computing
• Software
  – Grid Observer
  – SCEGrid Grid scheduler
  – HyperGrid simulator
[Diagram: OpenSCE clusters connected through Globus, with SCE/Grid and Grid Observer on top]
SCE/Grid Architecture
• Distributed resource manager
• Running on top of Globus
• Automatically discovering resources
• Automatically choosing target site
[Diagram: SCEGrid instances at Site A, Site B, and Site C connected through the GRID]
Structure
[Diagram: on the submission machine, the SCE/Grid scheduler (job queue, policy modules) feeds the SCE/Grid dispatcher, which drives launchers including a Globus launcher; the remote execution machine runs the Globus GASS server, gatekeeper, and GRAM, a local job scheduler (PBS, SGE, ...), and the SCE/Grid REM]
Grid Observer (KU)
• Building technology to monitor the grid
• The software is now used by the ApGrid testbed
[Diagram: sensors and other monitoring systems (SNMP, NWS, Ganglia, etc.) feed data to collectors, which pass it to analysers and presenters]
Grid CFD
[Diagram: ThaiGrid connecting a front end (sequential solver, visualization) to parallel CFD solvers]
Grid Scheduling
• Problem
  – How to use distributed/heterogeneous resources
    • Efficiently
    • Cost effectively
• Approach
  – Model the grid scheduling problem
  – Find good heuristic algorithms
• Grid scheduling work
  – Partial state scheduling
  – C-Sufferage with cost scheduling
  – Vector space modeling of the computational grid
  – CFD task mapping using GA
Grid Model
• Grid
  – A collection of autonomous systems
• Autonomous system
  – A collection of computing nodes
  – Contains a local scheduler
• Local scheduler
  – Resource manager
  – Maintains the local task queue and manages the resource pool (e.g., computing nodes)
[Diagram: autonomous systems A, B, and C connected through the GRID]
Grid Vector Space Model
• Each node has m resources
• Each system has n nodes
Node resource vector:
$$N_i = \begin{bmatrix} R_{i1} & R_{i2} & R_{i3} & \cdots & R_{im} \end{bmatrix}$$
System matrix:
$$S = \begin{bmatrix}
R_{11} & R_{12} & R_{13} & \cdots & R_{1m} \\
R_{21} & R_{22} & R_{23} & \cdots & R_{2m} \\
R_{31} & R_{32} & R_{33} & \cdots & R_{3m} \\
\vdots &        &        &        & \vdots \\
R_{n1} & R_{n2} & R_{n3} & \cdots & R_{nm}
\end{bmatrix}$$
Execution Model
• Each task has $W$ units of work to be done
• The estimated execution time depends on the execution rate of each node:
$$T_i^{exec} = \frac{W}{\rho_i}$$
where $\rho_i$ is the execution rate of node $i$, determined by the node's speed and current load.
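A small worked example (the numbers are assumed purely for illustration):
$$W = 2000~\text{Mflop}, \quad \rho_i = 500~\text{Mflop/s} \;\Rightarrow\; T_i^{exec} = \frac{2000}{500} = 4~\text{s}.$$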
Resource Commerce Model (RC)
• Proposed task allocation model for Grid systems
  – Batch scheduling
  – Sequential jobs
  – Economic model: rental cost structure, objective function
  – Framework for several proposed heuristics
RC for On-line scheduling
• Single task
  – On-line
  – Let Ci be the rental cost of running task t on node Si
  – Result: on-line minimum cost assignment is O(n log n)
• Multiple tasks
  – Batch
  – Parallel
  – Let Cij be the rental cost of running task tj on node Si
The rental cost is the product of the node's cost rate vector and the task's required resource vector:
$$C_i = \begin{bmatrix} R_{i1} & R_{i2} & R_{i3} & \cdots & R_{im} \end{bmatrix}
\begin{bmatrix} R^t_1 \\ R^t_2 \\ R^t_3 \\ \vdots \\ R^t_m \end{bmatrix}$$
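One way the stated O(n log n) on-line bound can be realized is to compute each node's rental cost for the arriving task and sort the nodes by that cost; the sketch below does this (an illustrative reading of the result, not necessarily the algorithm used in the RC work):

```cpp
// Sketch of a minimum-cost assignment for a single arriving task
// (illustrative; the exact O(n log n) algorithm in the RC work may differ).
#include <algorithm>
#include <numeric>
#include <vector>

// Cost of running the task on node i: dot product of the node's cost-rate
// vector and the task's required-resource vector (as in the slide).
double rentalCost(const std::vector<double>& costRate,
                  const std::vector<double>& required) {
    return std::inner_product(costRate.begin(), costRate.end(),
                              required.begin(), 0.0);
}

// Sort nodes by cost (O(n log n)) and pick the cheapest feasible one.
int assign(const std::vector<std::vector<double>>& nodeCostRates,
           const std::vector<double>& required,
           const std::vector<bool>& feasible) {
    int n = nodeCostRates.size();
    std::vector<double> cost(n);
    for (int i = 0; i < n; ++i) cost[i] = rentalCost(nodeCostRates[i], required);
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cost[a] < cost[b]; });
    for (int i : order)
        if (feasible[i]) return i;   // cheapest feasible node
    return -1;                        // no node can host the task
}
```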
Objective function for RC model
• pij = priority index of running job i on machine j
• eij = execution time of job i on machine j
• Let rj be the ready time of machine j
• Let ft be the time factor
• Let ftb be the time balance factor
• Let fc be the cost factor
• Let fcb be the cost balance factor
$$p_{ij} = f_t\, e_{ij} + f_{tb}\, r_j + f_c\, C_{ij} + f_{cb}\, c_j$$
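The reconstructed priority index maps directly onto a small helper function (illustrative; c_j is read here as a per-machine cost-balance term, which is an assumption):

```cpp
// Priority index for the RC model as reconstructed above (illustrative only;
// cj is taken as a per-machine cost-balance term, which is an assumption).
double priorityIndex(double eij,  // execution time of job i on machine j
                     double rj,   // ready time of machine j
                     double Cij,  // rental cost of job i on machine j
                     double cj,   // cost-balance term for machine j
                     double ft, double ftb, double fc, double fcb) {
    return ft * eij + ftb * rj + fc * Cij + fcb * cj;
}
```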
Some Algorithms
• C-Max/Min
• C-Min/Min
• C-Sufferage
• C-Sufferage with Deadline (a sketch of the underlying Sufferage heuristic follows below)
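For reference, a simplified one-task-per-round sketch of the classic Sufferage heuristic that these C- variants build on (the cost-aware versions presumably fold in cost, e.g. via the priority index above; that substitution, and the one-task-per-round simplification, are assumptions):

```cpp
// Simplified Sufferage sketch: repeatedly pick the unmapped task whose
// "sufferage" (second-best minus best completion time) is largest and map it
// to its best machine. The original heuristic resolves per-machine conflicts
// each round; this sketch assigns one task per round for brevity.
#include <limits>
#include <vector>

// exec[t][m] = execution time of task t on machine m.
std::vector<int> sufferage(const std::vector<std::vector<double>>& exec) {
    int nTasks = exec.size();
    int nMachines = nTasks ? exec[0].size() : 0;
    std::vector<int> assignment(nTasks, -1);
    if (nMachines == 0) return assignment;
    std::vector<bool> done(nTasks, false);
    std::vector<double> ready(nMachines, 0.0);
    for (int round = 0; round < nTasks; ++round) {
        int bestTask = -1, bestMachine = -1;
        double bestSufferage = -1.0;
        for (int t = 0; t < nTasks; ++t) {
            if (done[t]) continue;
            double best = std::numeric_limits<double>::max(), second = best;
            int bestM = -1;
            for (int m = 0; m < nMachines; ++m) {
                double c = ready[m] + exec[t][m];          // completion time
                if (c < best) { second = best; best = c; bestM = m; }
                else if (c < second) { second = c; }
            }
            double suf = second - best;                    // the "sufferage"
            if (suf > bestSufferage) { bestSufferage = suf; bestTask = t; bestMachine = bestM; }
        }
        assignment[bestTask] = bestMachine;
        done[bestTask] = true;
        ready[bestMachine] += exec[bestTask][bestMachine];
    }
    return assignment;
}
```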
Cost
[Chart: total cost ($G, log scale) vs. number of machines (100-10000) for CMax-Min, CMin-Min, CSufferage, Max-Min, Min-Min, and Sufferage]
Hypersim Simulator
• Discrete event simulation engine from an AIT/KU collaboration
  – C++ classes
  – Event-based model
  – Fast event processing
• Concept
  – The user defines the system as an event graph: when event A occurs and condition (i) is true, event B is scheduled to occur at the current time + t (a minimal sketch follows the diagram below)
  – Hypersim maintains the event state and state transitions
[Event graph edge: A --(t, condition i)--> B]
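A tiny event-graph-style engine illustrating that rule (a generic sketch, not Hypersim's actual classes):

```cpp
// Minimal event-graph style discrete event engine (illustrative; not the
// actual Hypersim classes). When an event fires, its handler can schedule
// follow-up events at now + delay, optionally guarded by a condition.
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Event {
    double time;
    std::string name;
    bool operator>(const Event& o) const { return time > o.time; }
};

class Engine {
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pending_;
    double now_ = 0.0;
public:
    double now() const { return now_; }
    void schedule(const std::string& name, double delay) {
        pending_.push({now_ + delay, name});
    }
    void run(const std::function<void(Engine&, const Event&)>& onEvent) {
        while (!pending_.empty()) {
            Event e = pending_.top(); pending_.pop();
            now_ = e.time;
            onEvent(*this, e);   // the handler encodes the event-graph edges
        }
    }
};

int main() {
    Engine sim;
    bool resourceFree = true;                    // example condition (i)
    sim.schedule("ENTER", 0.0);
    sim.run([&](Engine& s, const Event& e) {
        std::cout << s.now() << ": " << e.name << "\n";
        if (e.name == "ENTER" && resourceFree)   // edge ENTER -> EXECUTE if (i)
            s.schedule("EXECUTE", 2.0);
        else if (e.name == "EXECUTE")            // edge EXECUTE -> FINISH
            s.schedule("FINISH", 5.0);
    });
}
```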
Grid Model
[Event graph of the simulated grid: a job ENTERs, is SCHEDULEd, STARTs, then goes through STAGE_IN, EXECUTE, STAGE_OUT, and FINISH, driven by the events EENTER, ESCHEDULE, ESTART, and EFINISH with transition delays]
Some Results
[Chart: simulator run time in seconds (log scale) vs. number of tasks (up to 20000) for GridSim, SimGrid, and HyperSim]
Future Work
• Better understanding of the Grid economy
• Complete our MPI and use it on the grid (before SC2003)
• Many new algorithms
• Tools for ApGrid/PRAGMA
• Collaboration
  – GridBank Grid Market interface for the OpenSCE scheduler
  – GridScape for our portal
The End
Kasetsart University
• A leading multidisciplinary academic institution in Thailand
• The second oldest university in Thailand
• About 25,000 students on 5 campuses around the country
• Leading in
  – Biotechnology
  – Computational chemistry
  – Computer science and engineering
  – Agricultural technology
KU HPC Research
• Many advanced research projects are being pursued by KU researchers
  – Computer-aided molecular modeling and design of HIV-1 inhibitors
  – Bioinformatics research to improve rice quality
  – Computational fluid dynamics for CAD/CAM, vehicle design, and clean rooms
  – VLSI test simulation
  – Massive information and knowledge analysis, storage, and retrieval
• All of these projects require a massive amount of computing power!
KU Cluster Evolution
Peak performance (Mflops) by cluster:
  SMILE      2
  SMILE2     200
  SMILE3     660
  AMATA1     900
  AMATA2     6000
  PIRUN      10000
  GASS       20000
  MAEKA      40000
Since 1999, KU has always owned the fastest computing system in Thailand.
MAEKA System: Massive Adaptable Environment for Kasetsart Applications
• Collaboration with AMD Inc.
• Initial phase
  – 32-processor (16 dual-processor nodes) Opteron system
  – Gigabit Ethernet
  – Massive and scalable storage
  – 50-80 Gigaflops
• Fastest computing system in Thailand
• A much larger system will be built this year
Structures and Components
[Diagram: the user submits to the scheduler/dispatcher, which talks to GIIS/GRIS (LDAP) and the gatekeeper/jobmanager (GRAM), which in turn hands the job to a local scheduler (PBS, Condor, SQMS, ...) on the GRID]
[1] A user submits a job
[2] The scheduler queries the available resources
[3] It chooses the target site and dispatches the job
[4] The dispatcher submits the job to the target site
[5] It waits until the job finishes