N. Xiong@ GSU Slide 1
Chapter 05
Clustered Systems for
Massive Parallelism
N. Xiong
Georgia State University
Chapter 05 Main Contents
- Design Objectives of Clusters and MPPs
- Cluster and MPP System Architectures
- Design Principles of Clustered Systems
- Multiple Job Scheduling and Management
- Virtual Clustering and Resource Provisioning
- Homework Problems
Design Objectives of Clustered Systems
- Scalability
- Packaging
- Control
- Homogeneity
- Security
Fundamental Cluster Design Issues
- Scalable Performance
- Single System Image
- Availability Support
- Cluster Job Management
- Internode Communication
- Fault Tolerance and Recovery
- Growth of Servers in HPC and HTC Systems
An Idealized Cluster Architecture
- Conventional databases and OLTP monitors offer users a desktop environment.
- The cluster supports parallel programming based on standard languages and communication libraries.
- A user-interface subsystem combines the advantages of the Web interface and the Windows GUI.
Node Architectures and System Packaging
Two types of cluster nodes:
- Compute nodes
- Service nodes
Design Principles of Clusters
Single-System-Image (SSI) Features
- Single System
- Single Control
- Symmetry
- Location Transparent
Design Principles of Clusters
Single-System-Image Layers
- Application Software Layer
- Hardware or Kernel Layer
- Middleware Layer
Design Principles of Clusters
Single-System-Image Composition
- Single Entry Point
- Single File Hierarchy
- Single I/O, Networking, and Memory Space
- Other Desired SSI Features
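A single entry point can be pictured as one cluster name that forwards each login to some physical host. The sketch below is only an illustration: the class `SingleEntryPoint`, the host names, and the round-robin policy are assumptions, not something the slides specify.

```python
from itertools import cycle


class SingleEntryPoint:
    """Hypothetical front end that hides the physical hosts behind one name."""

    def __init__(self, cluster_name, hosts):
        if not hosts:
            raise ValueError("a cluster needs at least one host")
        self.cluster_name = cluster_name
        self._next_host = cycle(hosts)  # round-robin iterator over the hosts

    def login(self, user):
        """Pick the next host in turn and describe the resulting session."""
        host = next(self._next_host)
        return f"{user}@{self.cluster_name} -> {host}"


# Users address only the cluster name; the host choice stays hidden.
entry = SingleEntryPoint("cluster.example.org", ["node1", "node2", "node3"])
sessions = [entry.login(u) for u in ("alice", "bob", "carol", "dave")]
for line in sessions:
    print(line)
```

Round-robin is used here only because it is the simplest policy to show; a real front end might weigh host load instead.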
Single File Hierarchy
- It is persistent.
- It is fault tolerant to some degree.
- Examples of distributed file systems: Network File System (NFS) and Andrew File System (AFS).
Single I/O, Networking, and Memory Space
- Single Input/Output
- Single Networking
- Single Point of Control
- Single Memory Space
Other Desired SSI Features
- Single Job Management System
- Single User Interface
- Single Process Space
High Availability Through Redundancy
- Reliability
- Availability
- Serviceability
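Availability is commonly defined as MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The sketch below applies that standard definition; the function names and the sample figures are illustrative assumptions.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail, hours_per_year=365 * 24):
    """Expected downtime per year implied by an availability value."""
    return (1.0 - avail) * hours_per_year


# Illustrative numbers only: a failure every 1000 hours, 10 hours to repair.
a = availability(mttf_hours=1000.0, mttr_hours=10.0)
print(f"availability = {a:.4f}")
print(f"expected downtime = {downtime_hours_per_year(a):.1f} hours/year")
```

The formula shows two ways to raise availability: lengthen MTTF (better reliability) or shorten MTTR (better serviceability).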
Fault-Tolerant Cluster Configurations
- Hot Standby
- Mutual Takeover
- Fault-Tolerance
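A hot standby configuration can be sketched as a primary node doing the work while a standby node waits to take over. The example below assumes failure is detected through missed heartbeats; the class `HotStandbyPair`, the `max_missed` parameter, and the heartbeat mechanism are hypothetical choices for illustration.

```python
class HotStandbyPair:
    """Hypothetical primary/standby pair with heartbeat-based failover."""

    def __init__(self, primary, standby, max_missed=3):
        self.active = primary
        self.standby = standby
        self.max_missed = max_missed  # consecutive misses tolerated
        self.missed = 0

    def heartbeat(self, received):
        """Record one heartbeat period and return the currently active node."""
        if received:
            self.missed = 0
            return self.active
        self.missed += 1
        if self.missed >= self.max_missed and self.standby is not None:
            # Failover: the standby becomes active and no spare remains.
            self.active, self.standby = self.standby, None
            self.missed = 0
        return self.active


pair = HotStandbyPair("node-A", "node-B")
for ok in (True, False, False, False):
    active = pair.heartbeat(ok)
print("active node:", active)
```

In a mutual takeover configuration both nodes would do useful work and each would act as the standby for the other; the sketch covers only the simpler one-way case.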
Recovery Schemes
- Backward recovery
- Forward recovery: in real-time systems
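Backward recovery means saving a consistent state periodically and rolling back to the last saved state after a failure. The sketch below shows the idea on a toy computation; the function `run_with_rollback` and the injected fault `fail_at` are hypothetical and stand in for a real application and a real failure.

```python
import copy


def run_with_rollback(steps, checkpoint_every, fail_at=None):
    """Sum the values in steps, rolling back to the last checkpoint on a fault.

    fail_at is a hypothetical step index where one transient fault is
    injected; the fault fires only once.
    """
    state = {"next_step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    rollbacks = 0
    while state["next_step"] < len(steps):
        i = state["next_step"]
        if fail_at is not None and i == fail_at:
            fail_at = None                     # the fault is transient
            state = copy.deepcopy(checkpoint)  # backward recovery
            rollbacks += 1
            continue
        state["total"] += steps[i]
        state["next_step"] = i + 1
        if state["next_step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)  # save a consistent state
    return state["total"], rollbacks


total, rollbacks = run_with_rollback(list(range(10)), checkpoint_every=3, fail_at=7)
print(total, rollbacks)
```

The work done between the last checkpoint and the fault is repeated, which is the cost that backward recovery pays for not losing the whole run.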
Checkpointing and Recovery Techniques
- Kernel, Library, and Application Levels
- Checkpoint Overheads
- Choosing an Optimal Checkpoint Interval
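One widely used estimate for the checkpoint interval is Young's first-order approximation, sqrt(2 x C x MTTF), where C is the time to write one checkpoint. The slides do not say which formula the course uses, so the sketch below is an assumption; the function name and the sample values are illustrative.

```python
import math


def young_interval(checkpoint_cost, mttf):
    """Young's approximation for the checkpoint interval: sqrt(2 * C * MTTF).

    Both arguments must use the same time unit. The estimate assumes the
    checkpoint cost is much smaller than the MTTF.
    """
    if checkpoint_cost <= 0 or mttf <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mttf)


# Illustrative numbers only: 60 s per checkpoint, one failure per day.
interval_s = young_interval(checkpoint_cost=60.0, mttf=24 * 3600.0)
print(f"checkpoint about every {interval_s / 60.0:.0f} minutes")
```

The trade-off is that frequent checkpoints add overhead on every run, while rare checkpoints lose more work after each failure.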
Cluster Job Scheduling and Management
Cluster Job Management Issues
- A user server
- A job scheduler
- A resource manager
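The three parts can be sketched as follows: users submit jobs (the user server role), a scheduler decides the order, and a resource manager tracks free nodes. The class names, the first-come-first-served policy, and the node counts below are assumptions made for illustration, not a description of any real job management system.

```python
from collections import deque


class ResourceManager:
    """Tracks how many nodes are free (hypothetical, counts only)."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes

    def allocate(self, nodes):
        if nodes > self.free_nodes:
            return False
        self.free_nodes -= nodes
        return True

    def release(self, nodes):
        self.free_nodes += nodes


class JobScheduler:
    """First-come-first-served queue in front of the resource manager."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.queue = deque()

    def submit(self, name, nodes):
        """Entry point for users; plays the role of the user server."""
        self.queue.append((name, nodes))

    def dispatch(self):
        """Start queued jobs in order until the head job does not fit."""
        started = []
        while self.queue and self.rm.allocate(self.queue[0][1]):
            started.append(self.queue.popleft()[0])
        return started


rm = ResourceManager(total_nodes=4)
scheduler = JobScheduler(rm)
for name, nodes in (("a", 2), ("b", 3), ("c", 1)):
    scheduler.submit(name, nodes)
first = scheduler.dispatch()   # only job "a" fits in front of "b"
rm.release(2)                  # job "a" finishes and returns its nodes
second = scheduler.dispatch()
print(first, second)
```

Strict FCFS leaves job "c" waiting behind "b" even though one node would have been enough for it; that kind of idle capacity is what more elaborate scheduling policies try to reduce.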
Cluster Job Types
- Serial jobs
- Parallel jobs
- Interactive jobs
- Batch jobs
- Foreign jobs
Migration Scheme Issues
- Node Availability
- Migration Overhead
- Recruitment Threshold: the amount of time a workstation stays unused before the cluster considers it an idle node
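The recruitment threshold can be read as a simple test on each workstation's idle time. The sketch below assumes the cluster knows when each workstation was last used; the function `idle_nodes`, the workstation names, and the timestamps are hypothetical.

```python
def idle_nodes(last_activity, now, recruitment_threshold):
    """Return the workstations unused for at least the recruitment threshold.

    last_activity maps a workstation name to the time (in seconds) of its
    most recent user activity; now uses the same clock.
    """
    return sorted(
        name
        for name, last_seen in last_activity.items()
        if now - last_seen >= recruitment_threshold
    )


# Illustrative timestamps only, with a 300-second threshold.
last_activity = {"ws1": 100.0, "ws2": 940.0, "ws3": 400.0}
recruited = idle_nodes(last_activity, now=1000.0, recruitment_threshold=300.0)
print(recruited)
```

A low threshold recruits workstations sooner but risks taking a machine whose owner has only paused briefly; a high threshold is safer for owners but leaves idle capacity unused for longer.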