Eitan Frachtenberg MIT, 20-Sep-2004
1
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Designing Parallel Operating Systems
using Modern Interconnects
Designing Parallel Operating Systems using Modern Interconnects
Eitan Frachtenberg ([email protected])
With
Fabrizio Petrini, Juan Fernandez, Dror Feitelson, Jose-Carlos Sancho, Kei Davis
Computer and Computational Sciences DivisionLos Alamos National Laboratory
Ideas that change the world
Eitan Frachtenberg MIT, 20-Sep-2004
2
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Cluster Supercomputers
Growing in prevalence and performance, 7 out of 10 top supercomputers
Running parallel applications Advanced, high-end interconnects
Eitan Frachtenberg MIT, 20-Sep-2004
3
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Distributed vs. Parallel
Distributed and parallel applications (including operating systems) may be distinguished by their use of global and collective operations
Distributed—local information, relatively small number of point-to-point messages
Parallel—global synchronization: barriers, reductions, exchanges
Eitan Frachtenberg MIT, 20-Sep-2004
4
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
System Software Components
JobScheduling
Fault Tolerance Parallel
I/O
CommunicationLibrary
ResourceManagement
SystemSoftware
SystemSoftware
Eitan Frachtenberg MIT, 20-Sep-2004
5
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Problems with System Software
Independent single-node OS (e.g. Linux) connected by distributed dæmons: Redundant components Performance hits Scalability issues Load balancing issues
Eitan Frachtenberg MIT, 20-Sep-2004
6
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
OS’s Collective Operations
Many OS tasks are inherently global or collective operations:
Job launching, data dissemination Context switching Job termination (normal and forced) Load balancing
Eitan Frachtenberg MIT, 20-Sep-2004
7
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Local
Operating
System
Resource
Management
Parallel
I/O
Fault Tolerance
Job Scheduling
User-Level
Communication
Local
Operating
System
Resource
Management
Parallel
I/O
Fault Tolerance
Job Scheduling
User-Level
Communication
Node 1 Node 2
Global Parallel Operating System
Job Scheduling Fault Tolerance Communication Parallel I/O Resource Mgmt
Eitan Frachtenberg MIT, 20-Sep-2004
8
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
The Vision
Modern interconnects are very powerful collective operations programmable NICs on-board RAM
Use a small set of network mechanisms as parallel OS infrastructure
Build upon this infrastructure to create unified system software
System software Inherits scalability and performance from network features
Eitan Frachtenberg MIT, 20-Sep-2004
9
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Example: ASCI Q Barrier [HotI’03]
Eitan Frachtenberg MIT, 20-Sep-2004
10
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Parallel OS Primitives
System software built atop three primitives Xfer-And-Signal
Transfer block of data to a set of nodes Optionally signal local/remote event upon completion
Compare-And-Write Compare global variable on a set of nodes Optionally write global variable on the same set of nodes
Test-Event Poll local event
Eitan Frachtenberg MIT, 20-Sep-2004
11
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Core Primitives on QsNet
System software built atop three primitives Xfer-And-Signal (QsNet):
Node S transfers block of data to nodes D1, D2, D3 and D4
Events triggered at source and destinations
S D1 D2D4D3
SourceEvent
DestinationEvents
Eitan Frachtenberg MIT, 20-Sep-2004
12
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Core Primitives (cont.)
System software built atop three primitives Compare-And-Write (QsNet):
Node S compares variable V on nodes D1, D2, D3 and D4
S D1 D2D4D3
•Is V {, , >} to Value?
Eitan Frachtenberg MIT, 20-Sep-2004
13
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Core Primitives (cont.)
System software built atop three primitives Compare-And-Write (QsNet):
Node S compares variable V on nodes D1, D2, D3 and D4
Partial results are combined in the switches
S D1 D2D4D3
Eitan Frachtenberg MIT, 20-Sep-2004
14
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
System Software Components
JobScheduling
Fault Tolerance Parallel
I/O
CommunicationLibrary
ResourceResourceManagementManagement
SystemSoftware
SystemSoftware
Eitan Frachtenberg MIT, 20-Sep-2004
15
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
- Inherits scalability from network primitives:- Data dissemination and coordination- Interactive job launching speeds- Context-switching at milliseconds level
- Described in [SC’02]
Scalable Tool for Resource Management
Eitan Frachtenberg MIT, 20-Sep-2004
16
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
State of the Art in ResourceManagement
Resource managers (e.g. PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using TCP/IP—favors portability over performance, Poorly-scaling algorithms for the distribution/collection of data
and control messages Favoring development time over performance
Scalable performance not important for small clusters but crucial for large ones.
There exists a need for fast and scalable resource management.
Eitan Frachtenberg MIT, 20-Sep-2004
17
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Experimental Setup
64 nodes/256 processors ES40 Alphaserver cluster 2 independent network rails of Quadrics Elan3 Files are placed in ramdisk in order to avoid I/O
bottlenecks and expose the performance of the resource management algorithms
Eitan Frachtenberg MIT, 20-Sep-2004
18
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Launch Times (Unloaded System)
The launch time is constant when we increase the number of processors.
STORM is highly scalable
Eitan Frachtenberg MIT, 20-Sep-2004
19
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Launch Times (Loaded System, 12 MB)
Worst case: 1.5seconds to launch a 12 MB file on 256 processors
Eitan Frachtenberg MIT, 20-Sep-2004
20
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Measured and Estimated Launch Times
The model shows that in an ES40-based Alphaserver a 12MB binary can be launched in 135ms on 16,384 nodes
Eitan Frachtenberg MIT, 20-Sep-2004
21
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Comparative Evaluation(Measured & Modeled)
Eitan Frachtenberg MIT, 20-Sep-2004
22
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
System Software Components
JobJobSchedulingScheduling
Fault Tolerance Parallel
I/O
CommunicationLibrary
ResourceManagement
SystemSoftware
SystemSoftware
Eitan Frachtenberg MIT, 20-Sep-2004
23
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Job Scheduling
Controls the allocation of space and time resources to jobs
HPC apps have special requirements Multiple processing and network resources Synchronization ( < 1ms granularity) Potentially memory hogs with little locality
Has significant effect on throughput, responsiveness, and utilization
Eitan Frachtenberg MIT, 20-Sep-2004
24
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
First-Come-First-Serve (FCFS)
Eitan Frachtenberg MIT, 20-Sep-2004
25
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Gang Scheduling (GS)
Eitan Frachtenberg MIT, 20-Sep-2004
26
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Implicit CoScheduling
Eitan Frachtenberg MIT, 20-Sep-2004
27
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Hybrid Methods
Combine global synchronization & local information Rely on scalable primitives for global coordination
and information exchange First implementation of two novel algorithms:
Flexible CoScheduling (FCS) Buffered CoScheduling (BCS)
Eitan Frachtenberg MIT, 20-Sep-2004
28
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Flexible CoScheduling (FCS)
Measure communication characteristics, such as granularity and wait times
Classify processes based on synchronization requirements
Schedule processes based on class Described in [IPDPS’03]
Eitan Frachtenberg MIT, 20-Sep-2004
29
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
FCS Classification
Granularity
Block times
Fine Coarse
Short Long
CS
Always gang-scheduled
F
Preferably gang-scheduled
DC
Locally scheduled
Eitan Frachtenberg MIT, 20-Sep-2004
30
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Methodology
Synthetic, controllable MPI programs Workload
Static: all jobs start together Dynamic: different sizes, arrival and run times
Various schedulers implemented: FCFS, GS, FCS, SB (ICS), BCS
Emulation vs. simulation Actual implementation takes into account all the
overhead and factors of a real system
Eitan Frachtenberg MIT, 20-Sep-2004
31
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Hardware Environment
Environment ported to three architectures and clusters: Crescendo: 32x2 Pentium III, 1GB Accelerando: 32x2 Itanium II, 2GB Wolverine: 64x4 Alpha ES40, 8GB
Eitan Frachtenberg MIT, 20-Sep-2004
32
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Synthetic Application
Bulk synchronous, 3ms basic granularity Can control: granularity, variability and
Communication pattern
Eitan Frachtenberg MIT, 20-Sep-2004
33
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Synthetic Scenarios
Balanced Complementing Imbalanced Mixed
Eitan Frachtenberg MIT, 20-Sep-2004
34
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Turnaround Time
0
50
100
150
200
250
300
350
400
Balanced Imbalanced Complementing Mixed
FCFS GS SB FCS Optimal
Eitan Frachtenberg MIT, 20-Sep-2004
35
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Dynamic Workloads [JSSPP’03]
Static workloads are simple and offer insights, but are not realistic
Most real-life workloads are more complex Users submit jobs dynamically, of varying time and
space requirements
Eitan Frachtenberg MIT, 20-Sep-2004
36
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Dynamic Workload Methodology
Emulation using a workload model [Lublin03] 1000 jobs, approx. 12 days, shrunk to 2 hrs Varying load by factoring arrival times Using same synthetic application, with random:
Arrival time, run time, and size, based on model Granularity (fine, medium, coarse) communication pattern (ring, barrier, none)
Recent study with scientific apps (yet unpublished)
Eitan Frachtenberg MIT, 20-Sep-2004
37
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Load – Response Time
Eitan Frachtenberg MIT, 20-Sep-2004
38
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Load – Bounded Slowdown
Eitan Frachtenberg MIT, 20-Sep-2004
39
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Timeslice – Response Time
Eitan Frachtenberg MIT, 20-Sep-2004
40
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
System Software Components
JobScheduling
Fault Tolerance Parallel
I/O
CommunicationCommunicationLibraryLibrary
ResourceManagement
SystemSoftware
SystemSoftware
Eitan Frachtenberg MIT, 20-Sep-2004
41
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Buffered CoScheduling (BCS)
Buffer all communications Exchange information about pending
communication every time slice Schedule and execute communication Implemented mostly on the NIC Requires fine-grained heartbeats Described in [SC’03]
Eitan Frachtenberg MIT, 20-Sep-2004
42
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Design and Implementation
Global synchronization Strobe sent at regular intervals (time slices)
Compare-And-Write + Xfer-And-Signal (Master) Test-Event (Slaves)
All system activities are tightly coupled Global Scheduling
Exchange of communication requirements Xfer-And-Signal + Test-Event
Communication scheduling Real transmission
Xfer-And-Signal + Test-Event
Eitan Frachtenberg MIT, 20-Sep-2004
43
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Design and Implementation
Implementation in the NIC Application processes interact with NIC threads
MPI primitive Descriptor posted to the NIC Communications are buffered
Cooperative threads running in the NIC Synchronize Partial exchange of control information Schedule communications Perform real transmissions and reduce computations
Comp/comm completely overlapped
Eitan Frachtenberg MIT, 20-Sep-2004
44
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Design and Implementation
Non-blocking primitives: MPI_Isend/Irecv
Eitan Frachtenberg MIT, 20-Sep-2004
45
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Design and Implementation
Blocking primitives: MPI_Send/Recv
Eitan Frachtenberg MIT, 20-Sep-2004
46
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Performance Evaluation
BCS MPI vs. Quadrics MPI Experimental Setup
Benchmarks and Applications• NPB (IS,EP,MG,CG,LU) - Class C• SWEEP3D - 50x50x50• SAGE - timing.input
Scheduling parameters• 500μs communication scheduling time slice (1 rail)• 250μs communication scheduling time slice (2 rails)
Eitan Frachtenberg MIT, 20-Sep-2004
47
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Performance Evaluation
Benchmarks and Applications (C)
Application Slowdown
IS (32PEs) 10.40%
EP (49PEs) 5.35%
MG (32PEs) 4.37%
CG (32PEs) 10.83%
LU (32PEs) 15.04%
SWEEP3D (49PEs) -2.23%
SAGE (62PEs) -0.42%
Eitan Frachtenberg MIT, 20-Sep-2004
48
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Performance Evaluation
SAGE - timing.input (IA32)
0.5% SPEEDUP
Eitan Frachtenberg MIT, 20-Sep-2004
49
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Blocking Communication
Blocking vs. Non-blocking SWEEP3D (IA32)MPI_Send/Recv MPI_Isend/Irecv + MPI_Waitall
Eitan Frachtenberg MIT, 20-Sep-2004
50
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
System Software Components
JobScheduling
Fault Fault ToleranceTolerance Parallel
I/O
CommunicationLibrary
ResourceManagement
SystemSoftware
SystemSoftware
Eitan Frachtenberg MIT, 20-Sep-2004
51
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Fault Tolerance Today
Fault tolerance is commonly achieved, if at all, by Checkpointing Segmentation of the machine Removal of fault-prone components
Massive hardware redundancy is not considered economically feasible
Eitan Frachtenberg MIT, 20-Sep-2004
52
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Our Approach to Fault Tolerance
Recent work shows that scalable, system-level fault-tolerance is within reach with current technology, with low overhead, can be achieved through a global operating system
Two results provide the basis for this claim1. Buffered CoScheduling that enforces frequent,
global recovery lines and global control
2. Feasibility of incremental checkpoint
Eitan Frachtenberg MIT, 20-Sep-2004
53
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Checkpointing and Recovery
Simplicity Easy implementation
Cost-effective No additional hardware support
Critical aspect: Bandwidth requirements
Saving process state
Eitan Frachtenberg MIT, 20-Sep-2004
54
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Reducing Bandwidth
Incremental checkpointing Only the memory modified from the previous
checkpoint is saved to stable storage
Full
Process state
Incremental
Eitan Frachtenberg MIT, 20-Sep-2004
55
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Enabling Automatic Checkpointing
Low
User intervention Checkpoint data
Low
Hardware
Operating system
Run-time library
Application
High
High
automatic
Eitan Frachtenberg MIT, 20-Sep-2004
56
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
The Bandwidth Challenge
Does the current technology provide enough bandwidth?
• Frequent• Automatic
Eitan Frachtenberg MIT, 20-Sep-2004
57
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Methodology
Quantifying the Bandwidth Requirements Checkpoint intervals: 1s to 20s Comparing with the current bandwidth available
900 MB/s
75 MB/s
Sustained network bandwidthQuadrics QsNet II
Single sustained disk bandwidthUltra SCSI controller
Eitan Frachtenberg MIT, 20-Sep-2004
58
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Memory Footprint
Sage-1000MB 954.6MB
Sage-500MB 497.3MB
Sage-100MB 103.7MB
Sage-50MB 55MB
Sweep3D 105.5MB
SP Class C 40.1MB
LU Class C 16.6MB
BT Class C 76.5MB
FT Class C 118MB
Increasing memory footprint
64 Itanium II processors
Eitan Frachtenberg MIT, 20-Sep-2004
59
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Bandwidth Requirements
0
50
100
150
200
250
300
1 5 10 20
Maximum Average
Ban
dw
idth
(M
B/s
)
Timeslices (s)
78.8MB/s 12.1MB/
s
Decreases with the timeslicesSage-1000MB
Eitan Frachtenberg MIT, 20-Sep-2004
60
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
0
50
100
150
200
250
300
Sage1000MB
Sage 500MB
Sage 100MB
Sage 50MB
Sweepd3D SP LU BT FT
Maximum Average
Bandwidth Requirementsfor 1 second
Increases with memory footprint
Single SCSI disk performance
Most demanding
Eitan Frachtenberg MIT, 20-Sep-2004
61
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Increasing Memory Footprint Size
0102030405060708090
1 5 10 20
50MB
100MB
500MB
1000MB
Ave
rage
Ban
dw
idth
(M
B/s
)
Timeslices (s)
Increases sublinearly
Eitan Frachtenberg MIT, 20-Sep-2004
62
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Increasing Processor Count
0102030405060708090
1 5 10 20
8 16 32 64
Ave
rage
Ban
dw
idth
(M
B/s
)
Timeslices (s)
Decreases slightly with processor count
Weak-scaling
Eitan Frachtenberg MIT, 20-Sep-2004
63
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Technological Trends
0102030405060708090
100
Processor Memory Storage Network
Performance of applications bounded by memory improvements
Increases at a faster
pace
Per
form
ance
Im
pro
vem
ent
per
year
Eitan Frachtenberg MIT, 20-Sep-2004
64
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Conclusions
As clusters grow, interconnection technology advances: Better bandwidth and latency On-board programmable processor, RAM Hardware support for collective operations
Allows the development of common system infrastructure that is a parallel program in itself
Eitan Frachtenberg MIT, 20-Sep-2004
65
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Conclusions (cont.)
On top of infrastructure we built: Scalable resource management (STORM) Novel job scheduling algorithms Simplified system design and communication library Possible basis for transparent fault tolerance
Eitan Frachtenberg MIT, 20-Sep-2004
66
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Conclusions (cont.)
Experimental performance evaluation demonstrates: Scalable interactive job launching and context-
switching Multiprogramming parallel jobs is feasible Adaptive scheduling algorithms adjust to different job
requirements, improving response times and slowdown in various workloads
Transparent, frequent checkpoint within current reach
Eitan Frachtenberg MIT, 20-Sep-2004
67
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
References
Eitan’s web page
http://www.cs.huji.ac.il/~etcs/pubs/
Fabrizio’s web page
http://www.c3.lanl.gov/~fabrizio/publications.html
PAL team web page:
http://www.c3.lanl.gov/par_arch/Publications.html
Eitan Frachtenberg MIT, 20-Sep-2004
68
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Resource Overlapping
Eitan Frachtenberg MIT, 20-Sep-2004
69
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Turnaround Time
Eitan Frachtenberg MIT, 20-Sep-2004
70
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Response Time
Eitan Frachtenberg MIT, 20-Sep-2004
71
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Timeslice – Bounded Slowdown
Eitan Frachtenberg MIT, 20-Sep-2004
72
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
FCFS vs. GS and MPL
Eitan Frachtenberg MIT, 20-Sep-2004
73
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
FCFS vs. GS and MPL (2)
Eitan Frachtenberg MIT, 20-Sep-2004
74
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Backfilling
Backfilling is a technique to move jobs forward in queue
Can be combined with time-sharing schedulers such as GS when all timeslots are full
Eitan Frachtenberg MIT, 20-Sep-2004
75
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Backfilling
Backfilling is a technique to move jobs forward in queue
Can be combined with time-sharing schedulers such as GS when all timeslots are full
Eitan Frachtenberg MIT, 20-Sep-2004
76
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Effect of Backfilling
Eitan Frachtenberg MIT, 20-Sep-2004
77
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Characterization
Data initializationRegular
processing bursts
Sage-1000MB
Eitan Frachtenberg MIT, 20-Sep-2004
78
P AL
Designing Parallel Operating Systems using Modern Interconnects
CCS-3
Communication
Interleaved
Sage-1000MB
Regular communication
bursts