IBM Labs in Haifa © 2004 IBM Corporation
Resource allocation and utilization in the Blue Gene/L supercomputer
Tamar Domany, Y. Aridor, O. Goldshmidt, Y. Kliteynik, E. Shmueli, U. Silbershtein
Agenda
- Blue Gene/L Background
- Blue Gene/L Topology
- Resource Allocation
- Simulation Results
Blue Gene/L - Overview
- First member of the IBM Blue Gene family of supercomputers
- Machine configurations range from 1,000 to 64,000 nodes
- The world's fastest supercomputer
  - Ranked first in the latest TOP500 list (November 2004)
  - Machine size of 16K nodes
- Selected customers:
  - Lawrence Livermore National Laboratory
  - Japan's National Institute of Advanced Industrial Science and Technology
  - LOFAR radio telescope, run by ASTRON in the Netherlands
  - Argonne National Laboratory
Blue Gene/L Philosophy
- Designed for highly parallel applications
- Traditional Linux and MPI programming models
- Extendable and manageable
  - Simple to build and operate
- Vastly improved price/performance
  - Choosing a simple, low-power building block
  - Highest possible single-threaded performance is not relevant; aggregate performance is!
- Floor space and power efficiency
- BlueGene/L = cellular architecture + aggressive packaging + scalable software
BlueGene/L cellular architecture
- Very large number (64K) of simple identical nodes
  - Low-cost, low-power PPC microprocessors (700 MHz)
  - Geometry: 64x32x32, based on a 3D torus
  - Low-latency, high-bandwidth proprietary interconnect
- I/O physically separated from computation
  - At most one process per CPU at a time
- Scalable and extendable architecture
  - Computational power of the machine can be expanded by adding more "building blocks"
- The design of BlueGene/L is substantially different from traditional supercomputers (NEC Earth Simulator, ASCI machines) that use large clusters of SMP nodes
Jobs in Blue Gene/L
- Blue Gene/L runs parallel jobs
  - A set of tasks running together, communicating via message passing
- Each job has a set of attributes
  - Size: number of threads (and thus nodes)
  - 3D shape
    - A job of size 8 can be "slim" (e.g. 8x1x1) or "fat" (2x2x2)
  - Communication pattern: torus or mesh
[Figure: torus and mesh communication patterns]
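To make the size/shape distinction concrete, here is a minimal sketch that lists every 3D shape a job of a given size could take. The function name `shapes_for_size` is hypothetical and not part of any Blue Gene/L software; shapes are reported with sides in non-decreasing order, so 8x1x1 appears as (1, 1, 8).

```python
def shapes_for_size(size):
    """Return all 3D shapes (x, y, z) with x <= y <= z and x * y * z == size."""
    shapes = set()
    for x in range(1, size + 1):
        if size % x:
            continue
        rest = size // x
        for y in range(x, rest + 1):
            if rest % y:
                continue
            z = rest // y
            if z >= y:
                shapes.add((x, y, z))
    return sorted(shapes)


# A job of size 8 has one "slim" extreme, one "fat" extreme, and one shape in between.
print(shapes_for_size(8))  # [(1, 1, 8), (1, 2, 4), (2, 2, 2)]
```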
What is a Job Partition?
- A partition is
  - A set of nodes
  - A set of communication links
    - Which connect the nodes as a torus or a mesh
- Partitions are isolated
  - A single partition accommodates a single job
  - No sharing of nodes or links between partitions
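The isolation rule can be expressed as a simple disjointness check. The sketch below is illustrative only: the `Partition` class, its field names, and the node and link identifiers are assumptions made for the example, not the actual Blue Gene/L data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    nodes: frozenset   # node identifiers (hypothetical: any hashable ids)
    links: frozenset   # link identifiers (hypothetical)
    connection: str    # "torus" or "mesh"


def are_isolated(a, b):
    """True when two partitions share neither nodes nor links."""
    return a.nodes.isdisjoint(b.nodes) and a.links.isdisjoint(b.links)


p1 = Partition(frozenset({0, 1}), frozenset({"L0"}), "mesh")
p2 = Partition(frozenset({2, 3}), frozenset({"L1"}), "mesh")
p3 = Partition(frozenset({3, 4}), frozenset({"L2"}), "torus")
print(are_isolated(p1, p2))  # True
print(are_isolated(p2, p3))  # False: node 3 would be shared
```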
Job Management for Blue Gene/L
- Users submit jobs to the Blue Gene/L scheduler
  - The scheduler maintains a queue of submitted jobs
- The scheduler's task:
  - Choose the next job to run from the queue
  - Allocate resources for the job
  - Launch the job
  - Monitor the job until termination
  - Signals, debugging…
Job Management Challenges
- How do we scale beyond a few thousand nodes?
  - Group nodes into midplanes
- How do we maximize machine utilization?
  - Extend toroidal topology to multi-toroidal topology
Scalability via Midplanes
- Nodes are grouped into 512-node units called midplanes
  - A midplane is an 8x8x8 3D mesh
  - Each internal node is connected directly to at most six internal neighbors
  - Midplanes are connected to each other through switches
- Scalability is achieved by sacrificing granularity of management
  - The midplane is the minimal allocation unit
    - Not all nodes may be utilized for a given job
- In practice, we deal with a 128-unit machine (128 midplanes) instead of 64K nodes
  - For all aspects of job management
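The numbers on this slide follow directly from the geometry. The sketch below works through the arithmetic: a 64x32x32 machine divided into 8x8x8 midplanes gives an 8x4x4 grid of 128 midplanes. The helper names are hypothetical, and `unused_nodes` assumes a job is rounded up to whole midplanes in each dimension, which is one plausible reading of "not all nodes may be utilized".

```python
import math

MACHINE = (64, 32, 32)  # node geometry stated on the architecture slide
MIDPLANE = (8, 8, 8)    # nodes per midplane along each dimension


def midplane_grid(machine=MACHINE, midplane=MIDPLANE):
    """Number of midplanes along each dimension."""
    return tuple(m // p for m, p in zip(machine, midplane))


def midplanes_for_job(shape, midplane=MIDPLANE):
    """Midplanes needed per dimension (assumption: round up to whole midplanes)."""
    return tuple(math.ceil(s / p) for s, p in zip(shape, midplane))


def unused_nodes(shape, midplane=MIDPLANE):
    """Allocated nodes minus the nodes the job actually uses."""
    allocated = math.prod(midplanes_for_job(shape, midplane)) * math.prod(midplane)
    return allocated - math.prod(shape)


print(math.prod(MACHINE))          # 65536 nodes (64K)
print(midplane_grid())             # (8, 4, 4)
print(math.prod(midplane_grid()))  # 128 midplanes
print(unused_nodes((10, 8, 8)))    # 384 nodes idle in a 2-midplane allocation
```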
BG/L Topology
[Figure: BG/L topology. Midplanes are arranged along X-, Y-, and Z-lines. The X-line shown connects midplanes 0-7 through X-switches X0-X7.]
Line connectivity - properties
- Lines have "multi-toroidal topology"
  - Can be easily extended
[Figure: X-line switches X0-X7, extended with switches X8 and X9]
Line connectivity - properties
- Lines have "multi-toroidal topology"
  - Can be easily extended
  - Can be connected as a torus
[Figure: two diagrams of X-line switches X0-X7, one labeled "3D torus"]
Line connectivity - properties
- Lines have "multi-toroidal topology"
  - Can be easily extended
  - Can be connected as one torus
  - Multiple toroidal partitions can co-exist
[Figure: two diagrams of X-line switches X0-X7, one labeled "3D torus"]
Line connectivity - properties
- Lines have "multi-toroidal topology"
  - Can be easily extended
  - Can be connected as a torus
  - Multiple toroidal partitions can co-exist
  - More than one way to wire a set of midplanes
[Figure: X-line switches X0-X7]
Resource Allocation
- Challenges
  - High machine utilization
  - Short response time (of jobs)
  - On-line problem
- Requirements
  - Satisfy job requests for size, shape, and connectivity (torus or mesh)
  - Deal with faulty resources (nodes and wires)
- Two kinds of dedicated resources to manage
  - Node allocation
  - Link allocation
Allocation Algorithm
- Finding a partition: scan the 3D machine
  - Find all free partitions that match the shape/size of a job
  - For each candidate partition, find if and how it can be wired
  - From all wireable partitions, choose the "best" partition
    - Use flexible criteria, e.g. minimal number of links
- Wiring a partition
  - Static wire lookup tables per dimension
  - Availability of wires (previous allocation or faults) is checked
  - Find suitable links in (almost) constant time
  - Small memory footprint despite the huge number of links
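The scan-then-choose structure described above can be sketched as follows. This is not the authors' implementation: it assumes an 8x4x4 grid of midplanes, considers only axis-aligned boxes without wraparound, and delegates the wiring check to a hypothetical `wire_cost` callback that returns the number of links needed, or `None` when the box cannot be wired.

```python
from itertools import product

GRID = (8, 4, 4)  # midplane grid implied by 64x32x32 nodes and 8x8x8 midplanes


def box_cells(origin, shape):
    """All midplane coordinates covered by a box at `origin` with `shape`."""
    ranges = [range(o, o + s) for o, s in zip(origin, shape)]
    return set(product(*ranges))


def free_boxes(busy, shape, grid=GRID):
    """Yield (origin, cells) for every box of `shape` that touches no busy midplane."""
    origins = [range(g - s + 1) for g, s in zip(grid, shape)]
    for origin in product(*origins):
        cells = box_cells(origin, shape)
        if cells.isdisjoint(busy):
            yield origin, cells


def choose_partition(busy, shape, wire_cost, grid=GRID):
    """Return the origin of the wireable free box with the lowest cost, or None."""
    best = None
    for origin, _cells in free_boxes(busy, shape, grid):
        cost = wire_cost(origin, shape)  # hypothetical callback, see lead-in
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, origin)
    return None if best is None else best[1]


# Half of the machine is busy; only one 4x4x4 box of midplanes remains free.
busy = box_cells((0, 0, 0), (4, 4, 4))
print([origin for origin, _ in free_boxes(busy, (4, 4, 4))])  # [(4, 0, 0)]
```

Keeping the wiring check behind a callback mirrors the split on the slide between finding candidate partitions and deciding whether each one can be wired.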
Simulated Environment
- Faithful simulation of Blue Gene/L
  - 128 midplanes
  - Scheduler invoked when a job arrives or terminates
- Scheduling policy
  - Aggressive backfilling
    - If the job at the head of the queue cannot be accommodated, we try to allocate another job out of order
- Workloads (benchmarks)
  - Arrival times, runtimes, size, shape, torus/mesh
  - Based on real parallel systems' logs
    - This presentation: San Diego Supercomputer Center (SDSC)
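A minimal sketch of the backfilling rule, under a deliberately simplified model: each job needs only a count of midplanes, and shape and wiring are ignored. The function name `schedule` and the job tuples are hypothetical; the simulator described in the slides also has to satisfy shape and connectivity requests.

```python
from collections import deque


def schedule(queue, free_midplanes):
    """Start every queued job that fits, scanning the queue in order.

    `queue` is a deque of (name, midplanes_needed) pairs. Jobs that do not
    fit stay queued in their original order. Returns the names of the jobs
    started and the number of midplanes still free.
    """
    started = []
    waiting = deque()
    while queue:
        name, need = queue.popleft()
        if need <= free_midplanes:
            free_midplanes -= need
            started.append(name)
        else:
            waiting.append((name, need))
    queue.extend(waiting)
    return started, free_midplanes


queue = deque([("A", 100), ("B", 64), ("C", 32), ("D", 40)])
result = schedule(queue, 96)
print(result)       # (['B', 'C'], 0): A does not fit, so B and C run out of order
print(list(queue))  # [('A', 100), ('D', 40)]
```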
The benefits of multi-toroidal topology
[Figure: System Utilization vs Load, SDSC workload, fat jobs. Axes: load (x), utilization (y). Series: BlueGene/L with 100% mesh jobs; BlueGene/L with 100% torus jobs; BlueGene/L with a torus/mesh mix (percentages illegible in the transcript); 3D torus with 100% torus jobs. Axis values are not recoverable.]
The influence of job shapes on utilization
[Figure: System Utilization vs Load, SDSC workload. Axes: load (x), utilization (y). Series: slim jobs; a partially fat mix (percentage illegible in the transcript); fat jobs. Axis values are not recoverable.]
Summary
- Blue Gene/L brings with it a new level of supercomputer scalability, and many new challenges
- Scalability of system management is achieved by sacrificing granularity
  - Represent the machine as a smaller system consisting of collections of nodes
- Blue Gene/L's novel network topology has considerable advantages compared to traditional interconnects (such as 3D tori)
- The challenges are successfully met with a combined hardware and software solution
End
Link Allocation
- The problem:
  - Given a partition, find links in all the lines that participate in the partition, in all three dimensions, to wire the partition in a way that best accommodates future allocations.
- Solution main idea:
  - Build a lookup table with the partitions' wiring possibilities
  - The dimensions are independent → one table per dimension
  - All lines in a dimension are equal → the table contains information on one line
  - There are not so many ways to wire a partition → the table consumes a relatively small amount of memory
The Lookup table
- A table per topology dimension
  - The index is a possible set of midplanes
  - Each entry contains all sets of links that can wire it as a torus or as a mesh
- Built once at startup time
- Given a partition, use tables to find the link set in each dimension
  - Eliminate non-available sets, output "best" among available
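[Table: example lookup-table entries]
Index   MP set
5       0,2
6       1,2
14      1,2,3
15      0,1,2,3
120     3,4,5,6

Link sets and connection types shown on the slide (their row alignment with the indices above is not recoverable from the transcript):
12 (Mesh); an entry with no link set shown (Mesh); 3,6 (Mesh); 1,2,3 (Mesh); 1,2,3,6 (Torus); 0,1,3,6 (Torus)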
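A per-dimension lookup table of this kind can be sketched as a dictionary keyed by the midplane set. The indices on the slide (5, 6, 14, 15, 120) are consistent with encoding each midplane set as a bitmask, so the sketch assumes that encoding. The table contents, the link numbers, and the function names are invented for illustration and do not reproduce the real wiring.

```python
def mask(midplanes):
    """Encode a set of midplane positions on one line as a bitmask."""
    m = 0
    for i in midplanes:
        m |= 1 << i
    return m


# Hypothetical table for one dimension: index -> list of (links, connection).
# The link numbers are made up; the real table is built once at startup.
TABLE = {
    mask({0, 1, 2, 3}): [
        (frozenset({0, 1, 2, 7}), "torus"),
        (frozenset({0, 1, 2, 8, 9}), "torus"),
        (frozenset({0, 1, 2}), "mesh"),
    ],
    mask({1, 2}): [(frozenset({1}), "mesh")],
}


def find_links(midplanes, connection, busy_links, table=TABLE):
    """Return the smallest available link set for the request, or None."""
    candidates = [
        links
        for links, kind in table.get(mask(midplanes), [])
        if kind == connection and links.isdisjoint(busy_links)
    ]
    return min(candidates, key=len, default=None)


print(mask({0, 2}), mask({1, 2, 3}), mask({3, 4, 5, 6}))  # 5 14 120
print(sorted(find_links({0, 1, 2, 3}, "torus", busy_links={7})))  # [0, 1, 2, 8, 9]
```

Choosing the smallest link set stands in for the "best" criterion, which the slides describe as flexible; filtering against `busy_links` covers both previously allocated and faulty links.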
Y & Z lines Connectivity
- No "multi-toroidal topology"
[Figure: a Y-line connecting midplanes 0-3 through Y-switches Y0-Y3]
- Or it can be drawn that way (without the midplanes):
[Figure: Y-switches Y0-Y3 drawn without the midplanes]
- Can accommodate only one torus partition at a time