![Page 1: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/1.jpg)
John David Eriksen, Jamie Unger-Fink
High-Performance, Dependable Multiprocessor
![Page 2: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/2.jpg)
Traditional space computing limited primarily to mission-critical applications
◦ Spacecraft control
◦ Life support
Data collected in space and processed on the ground
Data sets in space applications continue to grow
Background and Motivation
![Page 3: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/3.jpg)
Communication bandwidth not growing fast enough to cope with increasing size of data sets
◦ Instruments and sensors grow in capability
Increasing need for on-board data processing
◦ Perform data filtering and other operations on-board
◦ Autonomous systems demand more computing power
Background and Motivation
![Page 4: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/4.jpg)
Advanced Onboard Signal Processor (AOSP)
◦ Developed in the 1970s and 1980s
◦ Helped develop understanding of the effects of radiation on computing systems and components
Advanced Architecture Onboard Processor (AAOP)
◦ Engineered new approaches to onboard data processing
Related Work
![Page 5: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/5.jpg)
Space Touchstone
◦ First COTS-based, fault-tolerant (FT), high-performance system
Remote Exploration and Experimentation
◦ Extended FT techniques to parallel and cluster computing
◦ Focused on low-cost, high-performance compute cluster designs with a good power ratio
Related Work
![Page 6: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/6.jpg)
Address increased data processing requirements
Bring COTS systems to space
◦ COTS (Commercial Off-The-Shelf)
  Less expensive
  General-purpose
  Need special considerations to meet requirements of aerospace environments
    Fault-tolerance
    High reliability
    High availability
Goal
![Page 7: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/7.jpg)
A reconfigurable cluster computer with centralized control.
Dependable Multiprocessor is…
![Page 8: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/8.jpg)
A hardware architecture
◦ High-performance characteristics
◦ Scalable
◦ Upgradable (thanks to reliance on COTS)
A parallel processing environment
◦ Supports a common scientific computing development environment (FEMPI)
A fault-tolerant computing platform
◦ System controllers provide FT properties
A toolset for predicting application behavior
◦ Fault behavior, performance, availability…
Dependable Multiprocessor is…
![Page 9: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/9.jpg)
Redundant radiation-hardened system controller
Cluster of COTS-based reconfigurable data processors
Redundant COTS-based packet-switched networks
Radiation-hardened mass data store
Redundancy available in:
◦ System controller
◦ Network
◦ Configurable N-of-M sparing in compute nodes
Hardware Architecture
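A rough way to reason about N-of-M sparing is a binomial tail sum: if a job needs N working compute nodes out of M installed, the configuration is usable whenever at least N nodes are up. The sketch below assumes each node is up independently with the same probability `p`; that independence assumption, the function name, and the example numbers are illustrative and do not come from the slides.

```python
from math import comb


def n_of_m_availability(n: int, m: int, p: float) -> float:
    """Probability that at least n of m nodes are working.

    Assumes (hypothetically) that nodes fail independently and that each
    node is working with the same probability p.
    """
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial probabilities for k = n, n+1, ..., m working nodes.
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(n, m + 1))


# Example: three nodes required, four installed (one spare).
three_of_four = n_of_m_availability(3, 4, 0.9)
```

Under these assumptions, adding a spare raises the chance of having three usable nodes above what three nodes with no spare would give (0.9 cubed).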
![Page 10: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/10.jpg)
Hardware Architecture
![Page 11: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/11.jpg)
Scalability
◦ Variable number of compute nodes
◦ Cluster-of-clusters
Compute nodes
◦ IBM PowerPC 750FX general-purpose processor
◦ Xilinx Virtex-II 6000 FPGA co-processor
  Reconfigurable to fulfill various roles
    DSP processor
    Data compression
    Vector processing
  Applications implemented in hardware can be very fast
◦ Memory and other support chips
Hardware Architecture
![Page 12: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/12.jpg)
Hardware Architecture
![Page 13: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/13.jpg)
Hardware Architecture
![Page 14: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/14.jpg)
Network Interconnect
◦ Gigabit Ethernet for data exchange
◦ A low-latency, low-bandwidth bus used for control
Mission Interface
◦ Provides interface to rest of space vehicle’s computer systems
◦ Radiation-hardened
Hardware Architecture
![Page 15: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/15.jpg)
Current hardware implementation
◦ Four data processors
◦ Two redundant system controllers
◦ One mass data store
◦ Two Gigabit Ethernet networks including two network switches
◦ Software-controlled instrumented power supply
◦ Workstation running spacecraft system emulator software
Hardware Architecture
![Page 16: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/16.jpg)
Hardware Architecture
![Page 17: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/17.jpg)
![Page 18: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/18.jpg)
Platform layer is the lowest layer: it interfaces hardware to middleware and holds hardware-specific software and network drivers
◦ Uses Linux, which allows use of many existing software tools
Mission Layer:
Middleware: includes DM System Services: fault tolerance, job management, etc.
![Page 19: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/19.jpg)
![Page 20: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/20.jpg)
DM Framework is application-independent and platform-independent
API to communicate with mission layer, SAL (System Abstraction Layer) for platform layer
Allows for future applications by facilitating porting to new platforms
![Page 21: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/21.jpg)
HA Middleware foundation includes: Availability Management (AMS), Distributed Messaging (DMS), Cluster Management (CMS)
Primary functions
◦ Resource monitoring
◦ Fault detection, diagnosis, recovery and reporting
◦ Cluster configuration
◦ Event logging
◦ Distributed messaging
Based on small, cross-platform kernel
![Page 22: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/22.jpg)
Hosted on the cluster’s system controller
Managed Resources include:
◦ Applications
◦ Operating System
◦ Chassis
◦ I/O cards
◦ Redundant CPUs
◦ Networks
◦ Peripherals
◦ Clusters
◦ Other middleware
![Page 23: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/23.jpg)
Provides a reliable messaging layer for communications in DM cluster
Used for checkpointing, client/server communications, event notification, fault management, time-critical communications
Application opens a DMS connection (channel) to pass data to interested subscribers
Since messaging is in middleware instead of lower layers, the application does not have to explicitly specify a destination address
Messages are classified, and machines choose to receive messages of a certain type
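The type-based delivery described on this slide resembles a publish/subscribe pattern. The following is a minimal in-process sketch of that pattern only; the class `MessageBus`, its method names, and the message type strings are hypothetical and are not the DMS API.

```python
from collections import defaultdict
from typing import Any, Callable


class MessageBus:
    """Toy publish/subscribe bus keyed by message type (illustrative only)."""

    def __init__(self) -> None:
        # Maps a message type to the handlers that asked to receive it.
        self._subscribers: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, msg_type: str, handler: Callable[[Any], None]) -> None:
        """Register interest in one class of message."""
        self._subscribers[msg_type].append(handler)

    def publish(self, msg_type: str, payload: Any) -> int:
        """Deliver payload to every subscriber of msg_type.

        The sender names only the message type, never a destination address.
        Returns the number of handlers that received the message.
        """
        handlers = self._subscribers.get(msg_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = MessageBus()
received = []
bus.subscribe("fault", received.append)
delivered = bus.publish("fault", {"node": 3})
```

A real messaging layer would also need to cross node boundaries and guarantee delivery, which this sketch does not attempt.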
![Page 24: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/24.jpg)
Manages physical nodes or instances of HA middleware
Discovers and monitors nodes in a cluster
Passes node failures to AMS and FT Manager via DMS
![Page 25: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/25.jpg)
Database Management
Logging Services
Tracing
![Page 26: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/26.jpg)
Interface to control computer or ground station
Communicates with system via DMS
Monitors system health with FT Manager
◦ “Heartbeat”
![Page 27: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/27.jpg)
Detects and recovers from system faults
FTM consults a set of recovery policies at runtime
Relies on distributed software agents to gather system and application liveness information
◦ Avoids monitoring bottleneck
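Heartbeat-based liveness tracking can be sketched in a few lines: each agent reports periodically, and any source that has been silent for longer than a timeout is suspected to have failed. The class name, the timeout value, and the explicit `now` timestamps below are assumptions made for illustration; the slides do not describe the FT Manager's actual interface.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each source (illustrative only)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self._last_seen: dict[str, float] = {}

    def beat(self, source: str, now: float) -> None:
        """Record a heartbeat from source at time now (seconds)."""
        self._last_seen[source] = now

    def suspected_failed(self, now: float) -> list[str]:
        """Return sources whose last heartbeat is older than the timeout."""
        return sorted(
            source
            for source, seen in self._last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=2.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-a", now=3.0)          # node-b stays silent
late = monitor.suspected_failed(now=4.0)
```

Timestamps are passed in explicitly so the behaviour is deterministic; a deployed monitor would read a clock instead.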
![Page 28: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/28.jpg)
Provides application scheduling, resource allocation
Opportunistic load balancing scheduler
Jobs are registered and tracked by the JM via tables
Checkpointing to allow seamless recovery of the JM
Heartbeats to the FT Manager via middleware
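Opportunistic load balancing, as the term is commonly used, hands each waiting job to the next node that becomes free without estimating how long the job will run there. A minimal sketch under that reading follows; the class, its method names, and the dictionary used as a job table are hypothetical and not taken from the DM Job Manager.

```python
from collections import deque


class OpportunisticScheduler:
    """Assigns pending jobs to whichever node is idle next (illustrative only)."""

    def __init__(self, nodes) -> None:
        self.idle = deque(nodes)      # nodes with nothing to do
        self.pending = deque()        # jobs waiting for a node, in arrival order
        self.running = {}             # job table: job -> node

    def submit(self, job) -> None:
        """Register a job and start it if a node is free."""
        self.pending.append(job)
        self._dispatch()

    def complete(self, job) -> None:
        """Mark a job finished, free its node, and start waiting work."""
        node = self.running.pop(job)
        self.idle.append(node)
        self._dispatch()

    def _dispatch(self) -> None:
        # No cost model: the first waiting job simply takes the first idle node.
        while self.pending and self.idle:
            job = self.pending.popleft()
            self.running[job] = self.idle.popleft()


sched = OpportunisticScheduler(["n1", "n2"])
for name in ("a", "b", "c"):
    sched.submit(name)
```

After these calls, jobs `a` and `b` occupy the two nodes and `c` waits until one of them completes.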
![Page 29: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/29.jpg)
Fault-Tolerant Embedded Message Passing Interface (FEMPI)
◦ Application-independent FT middleware
◦ Message Passing Interface (MPI) Standard
◦ Built on top of HA middleware
![Page 30: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/30.jpg)
Recovery from failure should be automatic, with minimal impact
Needs to maintain global awareness of the processes in parallel applications
3 Stages:
◦ Fault Detection
◦ Notification
◦ Recovery
Process failures vs Network failures
Survives the crash of n-1 processes in an n-process job
![Page 31: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/31.jpg)
Proprietary nature of FPGA industry
USURP - USURP’s Standard for Unified Reconfigurable Platforms
◦ Standard to interact with hardware
◦ Provides middleware for portability
◦ Black box IP cores
◦ Wrappers mask FPGA board
![Page 32: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/32.jpg)
Not a universal tool for mapping high-level code to hardware designs
OpenFPGA
Adaptive Computing System (ACS) vs USURP
◦ Object Oriented Models vs Software APIs
IGOL
BLAST
CARMA
![Page 33: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/33.jpg)
Responsible for:
Unifying vendor APIs
Standardizing HW interface
Organization of data for the user application core
Exposing the developer to common FPGA resources.
![Page 34: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/34.jpg)
User level protocol for system recovery
Consists of:
◦ Server process that runs on Mass Data Store
  DMS
◦ API for applications
  C-type interfaces
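The save-and-restore cycle behind user-level checkpointing can be illustrated with a local file standing in for the checkpoint server. Everything named here (`save_checkpoint`, `load_checkpoint`, the JSON format, the file path) is an assumption for illustration; the system described on the slide uses a server process on the mass data store with C-type interfaces, which this sketch does not reproduce.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to path so that a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # Rename over the old checkpoint in one step once the new one is complete.
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

An application would call `load_checkpoint` at start-up to resume from the last saved step and call `save_checkpoint` at convenient points in its main loop.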
![Page 35: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/35.jpg)
Algorithm-based Fault Tolerance Library
Collection of mathematical routines that can detect and correct faults
BLAS-3 Library
◦ Matrix multiply, LU decomposition, QR decomposition, singular value decomposition (SVD) and fast Fourier transform (FFT).
Uses checksums
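One well-known checksum scheme for matrix multiplication appends a row of column sums to the left operand and a column of row sums to the right operand; the product then carries its own row and column checksums, and a single corrupted element shows up as exactly one failing row and one failing column. The sketch below illustrates that idea with integer matrices in pure Python. It is not the DM library's code, all function names are hypothetical, and floating-point data would need a tolerance in the comparisons.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to every row the sum of that row."""
    return [row + [sum(row)] for row in b]


def checked_product(a, b):
    """Product whose last row and last column are checksums of the data part."""
    return matmul(with_column_checksum(a), with_row_checksum(b))


def detect_and_correct(c):
    """Repair one corrupted data element of a checked product, in place.

    Returns the (row, col) that was fixed, or None if all checksums agree.
    Only a single bad data element is handled; anything else raises.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m) if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern not correctable by this sketch")
    i, j = bad_rows[0], bad_cols[0]
    # Row checksum minus the other elements of the row gives the true value.
    c[i][j] = c[i][m] - (sum(c[i][:m]) - c[i][j])
    return (i, j)


c = checked_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
c[0][1] = 99                      # simulate a fault in one result element
fixed_at = detect_and_correct(c)
```

The extra row and column cost one additional dot product per row and per column, which is small next to the multiplication itself for large matrices.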
![Page 36: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/36.jpg)
Triple Modular Redundancy
Process Level Replication
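Triple modular redundancy runs the same computation three times and accepts the value that at least two replicas agree on. A minimal voter is sketched below; the function names are hypothetical, results are assumed to be hashable, and the three runs happen sequentially in one process here, whereas process-level replication would place them in separate processes.

```python
from collections import Counter


def tmr_vote(replica_outputs):
    """Return the value produced by at least two of three replicas."""
    if len(replica_outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two replicas agree; fault cannot be masked")
    return value


def run_with_tmr(task, *args):
    """Run task three times and vote on the results (sequential stand-in)."""
    return tmr_vote([task(*args) for _ in range(3)])


masked = tmr_vote([4, 4, 5])      # one replica returned a wrong value
```

A single faulty replica is outvoted; if all three disagree the voter reports the failure instead of guessing.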
![Page 37: John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor](https://reader030.vdocuments.us/reader030/viewer/2022033107/56649d8e5503460f94a77ed5/html5/thumbnails/37.jpg)
System architecture has been defined
Testbench has been assembled
Improvements:
◦ More aggressively address power consumption issues
◦ Add support for other scientific computing platforms such as Fortran
Conclusion