Malkawi Keynote Speech: Challenges to HPCS


TRANSCRIPT

  • The 3rd International Conference on Emerging Ubiquitous Systems and Pervasive Networks
    www.iasks.org/conferences/EUSPN2011
    Amman, Jordan
    October 10-13, 2011

  • Challenges to High Productivity Computing Systems and Networks
    Mohammad Malkawi
    Dean of Engineering, Jadara University

    [email protected]

  • Outline
    High Productivity Computing Systems (HPCS) - The Big Picture
    The Challenges
    IBM PERCS
    Cray Cascade
    SUN Hero Program
    Cloud Computing

  • HPCS: The Big Picture
    Manufacture and deliver a petaflop-class computer
    Complex architecture
    High performance
    Easier to program
    Easier to use

  • HPCS Goals
    Productivity: reduce code development time
    Processing power: floating-point & integer arithmetic
    Memory: large size, high bandwidth & low latency
    Interconnection: large bisection bandwidth

  • HPCS Challenges
    High effective bandwidth: high-bandwidth, low-latency memory systems
    Balanced system architecture: processors, memory, interconnects, programming environments
    Robustness: hardware and software reliability, compute through failure, intrusion identification and resistance techniques

  • HPCS Challenges
    Performance measurement and prediction: a new class of metrics and benchmarks to measure and predict the performance of system architectures and application software
    Scalability: adapt and optimize to changing workloads and user requirements, e.g., multiple programming models, selectable machine abstractions, and configurable software/hardware architectures

  • Productivity Challenges
    Quantify productivity for code development and production
    Identify characteristics of application codes: workflow, bottlenecks and obstacles, lessons learned
    Decisions by the productivity team and the vendors should be based on real data rather than anecdotal data

  • Did Not Learn the Lessons

  • Productivity Dilemma - 1
    Diminishing productivity is alarming in:
    Coding
    Debugging
    Optimizing
    Modifying
    Over-provisioning hardware
    Running high-end applications

  • Productivity Dilemma - 2
    Not long ago, a computational scientist could personally write, debug and optimize code to run on a leadership-class high performance computing system without the help of others. Today, programming for a cluster of machines is significantly more difficult than traditional programming, and the scale of the machines and problems has grown more than 1,000-fold.

  • Productivity Dilemma - 3
    Owning and running high-end computational facilities for nuclear research, seismic modeling, gene sequencing or business intelligence takes a sizeable investment in staffing, procurement and operations. Applications achieve 5 to 10 percent of the theoretical peak performance of the system, and must be restarted from scratch every time a hardware or software failure interrupts the job (a checkpointing sketch follows).
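
The restart-from-scratch problem is what application-level checkpointing addresses. Below is a minimal, hedged sketch in C; the state array, file name and checkpoint interval are illustrative assumptions, not details from the talk.

    /* Minimal application-level checkpoint/restart sketch (illustrative;
     * state array, file name and interval are assumptions). */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000
    #define CKPT_FILE "app.ckpt"

    /* Save the iteration counter and state so a failed job can resume. */
    static void checkpoint(int step, const double *state) {
        FILE *f = fopen(CKPT_FILE, "wb");
        if (!f) return;
        fwrite(&step, sizeof step, 1, f);
        fwrite(state, sizeof *state, N, f);
        fclose(f);
    }

    /* Return the step to resume from, or 0 if no usable checkpoint. */
    static int restore(double *state) {
        FILE *f = fopen(CKPT_FILE, "rb");
        int step = 0;
        if (!f) return 0;
        if (fread(&step, sizeof step, 1, f) != 1 ||
            fread(state, sizeof *state, N, f) != N)
            step = 0;
        fclose(f);
        return step;
    }

    int main(void) {
        double *state = calloc(N, sizeof *state);
        if (!state) return 1;
        for (int step = restore(state); step < 10000; step++) {
            /* ... one unit of real computation on state ... */
            if (step % 100 == 0)
                checkpoint(step, state);   /* periodic save */
        }
        free(state);
        return 0;
    }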

  • HPCS Trends: Productivity Crisis

  • High Productivity Computing
    Scaling the program without scaling the programmer
    Bandwidth enables productivity and allows for simpler programming environments and systems with greater fault tolerance

  • Language Challenges
    MPI is a fairly low-level language: reliable, predictable, and it works; used as an extension of Fortran, C and C++ (sketched below)
    New languages with a higher level of abstraction: improve legacy applications, scale to petascale levels
    SUN Fortress
    IBM X10
    Cray Chapel
    OpenMP
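
A minimal illustration in C, not code from the talk: even moving one small buffer between two ranks requires explicit ranks, tags, buffers and matched send/receive calls.

    /* Illustrative sketch: a trivial exchange in MPI still requires
     * explicit ranks, tags, buffers, and matching send/receive calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double buf[4] = {0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            buf[0] = 3.14;
            MPI_Send(buf, 4, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 got %f\n", buf[0]);
        }
        MPI_Finalize();
        return 0;
    }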

  • Global View Programming Model
    Global-view programs present a single, global view of the program's data structures.
    They begin with a single main thread; parallel execution then spreads out dynamically as work becomes available (see the sketch below).
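
A hedged approximation in C with OpenMP (one of the models the deck lists), not the HPCS languages themselves: a single main thread, one global view of the data, and parallelism spreading dynamically as tasks are created.

    /* Global-view flavour in C/OpenMP: all threads see the same global
     * array, and work fans out dynamically from a single main thread. */
    #include <omp.h>
    #include <stdio.h>

    #define N 16

    int main(void) {
        double data[N];                /* one global view of the data */
        #pragma omp parallel           /* execution fans out from main */
        #pragma omp single             /* one thread generates the work */
        for (int i = 0; i < N; i++) {
            #pragma omp task firstprivate(i) shared(data)
            data[i] = i * i;           /* tasks picked up as threads free up */
        }                              /* implicit barrier: tasks complete */
        printf("data[5] = %f\n", data[5]);
        return 0;
    }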

  • Unprecedented Performance Leap
    Performance targets require aggressive improvements in system parameters traditionally ignored by the "Linpack" benchmark.
    Improve system performance under the most demanding benchmarks (GUPS; sketched below).
    Determine whether general applications will be written or modified to benefit from these features.
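
For readers unfamiliar with GUPS (the RandomAccess benchmark), here is a simplified single-node sketch of its kernel in C; the table size, update count and random-number generator are stand-ins for the benchmark's official choices.

    /* Simplified GUPS-style kernel: random read-modify-write updates across
     * a large table, bound by memory latency/bandwidth rather than flops. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_BITS 20                 /* 2^20 words; real runs are far larger */
    #define NUPDATE   (4u << TABLE_BITS)

    int main(void) {
        size_t size = (size_t)1 << TABLE_BITS;
        uint64_t *table = malloc(size * sizeof *table);
        uint64_t ran = 1;
        if (!table) return 1;
        for (size_t i = 0; i < size; i++) table[i] = i;
        for (uint64_t i = 0; i < NUPDATE; i++) {
            /* a simple LCG stands in for the benchmark's generator */
            ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
            table[ran & (size - 1)] ^= ran;   /* scattered update: cache-hostile */
        }
        printf("table[0] = %llu\n", (unsigned long long)table[0]);
        free(table);
        return 0;
    }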

  • Trade-Offs
    Portability versus innovation
    Abstraction versus difficulty of programming and performance overhead
    Shared memory versus message passing

  • Cost of Petascale Computing
    Requires petabytes of memory
    On the order of 10^6 processors
    Hundreds of petabytes of disk storage for capacity and bandwidth
    Power consumption and cost for DRAM and disks (tens of megawatts)
    Operational cost

  • The DARPA HPCS Program
    First major program to devote effort to making high-end computers more user-friendly
    Mask the difficulty of developing and running codes on HPCS
    Mask the challenge of getting good performance for a general code
    Fast, large, low-latency RAM
    Fast processing
    Quantitative measure of productivity

  • IBM HPCS EXAMPLE

  • IBM HPCS Program: PERCS (2011)
    Productive, Easy-to-use, Reliable Computing System
    Rich programming environment: develop new applications and maintain existing ones
    Support existing programming models and languages
    Scalability to the petascale level
    Automated performance-tuning tasks
    Rich graphical interfaces
    Automated monitoring and recovery tasks
    Fewer system administrators to handle larger systems more effectively

  • IBM Blue Gene HPCS Base

  • IBM Approach - Hardware
    Innovative processor chip design, leveraging the POWER processor server line
    Lower soft error rates (SER)
    Reduce memory-access latency by placing the processors close to large memory arrays
    Multiple chip configurations to suit different workloads

  • IBM Approach - Software
    Large set of tools integrated into a modern, user-friendly programming environment
    Support legacy programming models and languages (MPI, OpenMP, C, C++, Fortran, etc.) as well as emerging ones (PGAS)
    Design a new experimental programming language, called X10

  • X10 Features
    Designed for parallel processing from the ground up
    Falls under the Partitioned Global Address Space (PGAS) category
    Balances high-level abstraction against exposing the topology of the system
    Asynchronous interactions among the parallel threads
    Avoids the blocking synchronization style

  • CRAY HPCS EXAMPLE

  • Multiple Processing Technologies
    In high performance computing, one size does not fit all
    Heterogeneous computing using custom processing technologies
    Performance achieved via deeper pipelining and more complex microarchitectures
    Introduction of multi-core processors: further stresses processor-memory balance issues and drives up the number of processors required to solve large problems

  • Specialized Computing Technologies
    Vector processing and field-programmable gate arrays (FPGAs)
    Extract more performance from the transistors on a chip with less control overhead
    Allow higher processor performance with lower power
    Reduce the number of processors required to solve a given problem
    Vector processors tolerate memory latency extremely well

  • Specialized Computing Technologies
    Multithreading improves latency tolerance
    The Cascade design will combine multiple computing technologies:
    Pure scalar nodes, based on Opteron microprocessors
    Nodes providing vector, massively multithreaded, and FPGA-based acceleration
    Nodes that can adapt their mode of operation to the application

  • Cray: The Cascade Approach
    Scalable, high-bandwidth system
    Globally addressable memory
    Heterogeneous processing technologies: fast serial execution, massive multithreading, vector processing and FPGA-based application acceleration
    Adaptive supercomputing: the system adapts to the application rather than requiring the programmer to adapt the application to the system

  • Cascade Approach
    Builds on the Cray T3E massively parallel system
    Uses a best-of-class microprocessor: AMD's Opteron will be the base processor for Cascade
    Processors directly access global memory with very low overhead and at very high data rates
    Hierarchical address translation allows the processors to access very large data sets without suffering TLB faults

  • Cray Adaptive Supercomputing
    The system adapts to the application
    The user logs into a single system and sees one global file system
    The compiler analyzes the code to determine which processing technology best fits it
    The scheduling software automatically deploys the code on the appropriate nodes

  • Balanced Hardware Design
    Complements processor flops with memory, network and I/O bandwidth
    Scalable performance
    Improves programmability and breadth of applicability
    Balanced systems also require fewer processors to scale to a given level of performance, reducing failure rates and administrative overhead

  • Cray - System Bandwidth Challenge
    The Cascade program is attacking this problem on two fronts: signalling technology and network design
    Provide truly massive global bandwidth at an affordable cost
    A key part of the design is a common, globally addressable memory across the whole machine
    Efficient, low-overhead communication

  • Cray - System Bandwidth Challenge
    Accessing remote data is as simple as issuing a load or store instruction, rather than calling a library function to pass messages between processors
    Allows many outstanding references to be overlapped with each other and with ongoing computation

  • Cray Programming Model
    Supports MPI for legacy purposes
    Unified Parallel C (UPC) and Coarray Fortran (CAF): simpler and easier to write than MPI
    Reference memory on remote nodes as easily as memory on the local node (approximated in C below)
    Data sharing is much more natural and communication overhead is much lower
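
UPC and CAF are language extensions rather than plain C, so as a hedged stand-in this sketch uses MPI-3 one-sided put/get, which approximates the same PGAS style: each rank exposes a window of memory that other ranks read directly, without matched sends and receives.

    /* PGAS-flavoured access in plain C via MPI-3 one-sided operations:
     * a window of each rank's memory is readable by every other rank. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        double local = 0.0, fetched = 0.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        local = 100.0 + rank;          /* this rank's slice of "global" memory */
        MPI_Win_create(&local, sizeof local, sizeof local,
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        /* read the neighbour's value as if it were addressable memory */
        MPI_Get(&fetched, 1, MPI_DOUBLE, (rank + 1) % nprocs,
                0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        printf("rank %d fetched %f\n", rank, fetched);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }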

  • The Chapel Cray HPCS Language
    Support for graphs, hash tables, sparse arrays, and iterators
    Ability to separate the specification of an algorithm from the structural details of the computation, including data layouts, work decomposition and communication
    Simplifies the creation of the basic algorithms
    Allows these structural components to be gradually tuned over time

  • Cray's Programming Tools
    Reduce the complexity of working on highly scalable applications
    The Cascade debugger solution will focus on data rather than control, support application porting, and allow scaling commensurate with the application
    Integrated development environment (IDE)

  • Cascade Performance Analysis Tools
    Hardware performance counters and software introspection techniques (a counter-reading sketch follows)
    Present the user with insight rather than statistics
    Act as a parallel programming expert: provide high-level feedback on program behaviour and suggestions for program modifications to remove key bottlenecks or otherwise improve performance
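
The slides do not name a counter library, but as one hedged illustration, the sketch below reads hardware counters from C via the PAPI library (an assumption, not named in the talk), counting cycles and L1 data-cache misses around a region of interest.

    /* Hedged sketch: hardware performance counters via PAPI, assuming the
     * PAPI_TOT_CYC and PAPI_L1_DCM presets are available on the host.
     * Compile with -lpapi. */
    #include <papi.h>
    #include <stdio.h>

    int main(void) {
        int evset = PAPI_NULL;
        long long counts[2];
        volatile double sink = 0.0;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_CYC);   /* total cycles */
        PAPI_add_event(evset, PAPI_L1_DCM);    /* L1 data-cache misses */

        PAPI_start(evset);
        for (int i = 0; i < 1000000; i++) sink += i * 0.5;  /* measured region */
        PAPI_stop(evset, counts);

        printf("cycles=%lld  L1 misses=%lld\n", counts[0], counts[1]);
        return 0;
    }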

  • SUN HPCS EXAMPLE

  • Evolution of HPCS at SUN
    Grid: loosely coupled heterogeneous resources, multiple administrative domains, wide area network
    Clusters: tightly coupled high performance systems, message passing (MPI)
    Ultrascale: distributed scalable systems, high-productivity shared memory systems, high bandwidth, global address space, unified administration tools

  • SUN Approach: The Hero System
    Rich bandwidth
    Low latencies
    Very high levels of fault tolerance
    Highly integrated toolset to scale the program, not the programmers
    Multithreading technologies (> 100 concurrent threads)

  • SUN Approach: The Hero System
    Globally addressable memory
    System-level and application checkpointing
    Hardware and software telemetry for dramatically improved fault tolerance
    The system appears more like a flat memory system
    Focus on solving the problem at hand rather than making elaborate efforts to distribute data in a robust manner

  • Definition: Bisection Bandwidth
    Split a system into equal halves such that there is a minimum number of connections across the split; the bandwidth across the split is the bisection bandwidth
    A standard metric for a system's ability to move data globally
    Example: an all-to-all interconnect between 8 cabinets has 28 total connections, of which 16 cross the bisection (orange) and 12 do not (blue) - checked in code below
    High-bandwidth optical connections are key to meeting HPCS peta-scale bisection bandwidth targets
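
This small C program reproduces the slide's numbers for 8 cabinets with all-to-all links.

    /* Check of the slide's example: total links = C(8,2) = 28; splitting
     * into halves of 4, crossing links = 4 * 4 = 16, leaving 12 within. */
    #include <stdio.h>

    int main(void) {
        int n = 8, half = n / 2;
        int total = n * (n - 1) / 2;    /* all-to-all link count */
        int crossing = half * half;     /* one endpoint in each half */
        printf("total=%d crossing=%d within=%d\n",
               total, crossing, total - crossing);  /* 28, 16, 12 */
        return 0;
    }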

  • System Bandwidth Over Time
    A giant leap in productivity expected

  • High Bandwidth Required by HPCS
    Radical changes from today's architecture necessary

  • Motivation for Higher Bandwidth

  • Growing BW Demand in HPCS
    Multicore CPUs: aggregation of multiple cores is unstoppable, and copper interconnects are stressed at very large scale
    Silicon photonics is the solution, since it brings a potential of unlimited BW on the best medium, allowing for large aggregation of multicore CPUs

  • Growing BW Demand in HPCS
    Clusters are growing in number of nodes and in performance per node
    Interconnects are the limiting factor in BW, latency and distance
    Protocol overhead and copper links both add latency; silicon photonics brings high BW and low latency

  • Growing BW Demand in HPCS
    Storage I/O BW is increasing exponentially due to faster data rates and the parallelism introduced by striping technologies
    WDM will eventually allow 10 Tb of data to be transmitted down a single fiber
    Silicon photonics is at the beginning of its life cycle, with headroom for explosive BW growth without any increase in latency or reduction in reach

  • Proximity + CMOS Photonics

  • Proximity Communication -2

  • Proximity Communication -3

  • Proximity Communication
    Capacitive coupling enables high-speed data communication between neighboring chips without the need for wires of any kind
    Metal plates on one chip are aligned with metal plates on a neighboring chip, and data is transferred between them
    Improves cross-section bandwidth and reduces communication power

  • Proximity Communication - SUN
    3.6 x 4.1 mm test chip
    0.35 um technology
    50 um bit pitch
    1.35 Gbps/channel for 16 simultaneous channels
    < 10^-12 BER @ 1 Gbps
    3.6 mW/channel static power
    3.9 pJ/bit dynamic power

  • Proximity Communication -4

  • Proximity Communication -5

  • Low Cost, Low Power Optics

  • DWDM CMOS Photonics

  • CMOS Photonics Module

  • SUN Programming Model
    Simpler code with high-bandwidth shared memory
    [Chart: lines of code for the NAS Parallel Benchmark CG (Conjugate Gradient)]

  • SUN Fortress Language
    To do for Fortran what Java™ did for C
    Catch stupid mistakes
    Extensive libraries
    Platform independence
    Security model
    Type safety
    Multithreading
    Dynamic compilation

  • Object-Based Smart Storage
    With object storage file systems, for massive scalability and extreme performance

  • Ultra-scale Computing in 2010
    Simpler development environments will make HPC more accessible to a diverse range of users
    Lone researchers and small teams will once again be able to harness the computational power of leadership-class systems
    Many gaps between commercial and scientific computing will narrow

  • Cloud Computing
    Service computing: the net is the computer
    More than 100 vendors
    Growing fast
    Programming environment

  • BACKUP SLIDES

  • HPCS Technologies
    Some publicly announced projects

  • IBM HPCS - PERCS
    Open-source operating systems and hypervisors will provide HPC-oriented virtualization, security, resource management, affinity control and resource limits
    Checkpoint-restart and reliability features that will improve the robustness and availability of the system

  • MPI Paradigm
    Writing applications in MPI requires breaking up all the data and computation into a large number of discrete pieces, and then using library code to explicitly bundle up data and pass it between processors in messages whenever processors need to share data (see the halo-exchange sketch below). It is a cumbersome affair that distracts scientists from their primary focus.
    Once an application is written, it is generally a time-consuming process to debug and tune it. Traditional debugging models just don't scale well to thousands or tens of thousands of processors (try opening up 10,000 debugger windows, one for each thread!).
    Trying to figure out why your application isn't getting the performance you think it should is also exceedingly difficult at large scales. Traditional profiling and even sophisticated statistics-gathering may be insufficient to ascertain why performance is lagging, much less how to change the code to improve it.
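
A hedged sketch of the pattern described above: a 1-D domain split across ranks, where each rank must explicitly bundle its boundary values into messages for its neighbours. The domain size is an illustrative assumption.

    /* 1-D halo exchange: each rank owns LOCAL_N points plus two halo cells
     * that mirror its neighbours' boundary values. */
    #include <mpi.h>
    #include <stdio.h>

    #define LOCAL_N 1024   /* points owned by each rank (assumed size) */

    int main(int argc, char **argv) {
        int rank, nprocs;
        /* u[0] and u[LOCAL_N+1] are halo cells holding neighbours' data */
        double u[LOCAL_N + 2] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
        int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

        /* one halo exchange: send boundary cells, receive into halo cells */
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                     &u[0],       1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  1,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (rank == 0) printf("halo exchange complete\n");
        MPI_Finalize();
        return 0;
    }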

  • Productivity Challenges
    The time spent trying to structure an application to fit the attributes of the target machine.
    If the machine is a cluster with limited interconnect bandwidth, the programmer must carefully minimize communication and make sure that any sparse data to be communicated is first bundled together into larger messages to reduce communication overheads (sketched below).
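
A small, hedged illustration of that bundling step in C with MPI: scattered values are gathered into one contiguous buffer and shipped as a single message, instead of one tiny message per value. The array sizes and index pattern are arbitrary assumptions.

    /* Bundling sparse data: one message of NNZ doubles instead of NNZ
     * one-double messages. */
    #include <mpi.h>

    #define N    4096
    #define NNZ  256

    int main(int argc, char **argv) {
        int rank;
        static double data[N];
        double packed[NNZ];
        int idx[NNZ];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int k = 0; k < NNZ; k++)
            idx[k] = (k * 17) % N;         /* scattered indices (assumed) */

        if (rank == 0) {
            for (int k = 0; k < NNZ; k++)
                packed[k] = data[idx[k]];  /* gather into one buffer */
            MPI_Send(packed, NNZ, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(packed, NNZ, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (int k = 0; k < NNZ; k++)
                data[idx[k]] = packed[k];  /* scatter back on the receiver */
        }
        MPI_Finalize();
        return 0;
    }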

  • Productivity Challenges
    If the machine uses conventional microprocessors, care must be taken to maximize cache re-use and eliminate global memory references, which tend to stall the processor (a blocking sketch follows).
    If the machine looks like a hammer, you'd better make all your codes look like nails!
    This can lead to "unnatural" algorithms and data structures, which significantly reduces programmer productivity.
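
As a hedged illustration of the cache-reuse point, the sketch below blocks (tiles) a matrix multiply so each small tile stays cache-resident while it is reused; N and BLOCK are illustrative choices.

    /* Loop blocking (tiling) for cache reuse: work on one BLOCK x BLOCK
     * tile at a time so it is served from cache, not main memory. */
    #include <stdio.h>

    #define N     512
    #define BLOCK 32                        /* tile sized to fit in cache */

    static double a[N][N], b[N][N], c[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 1.0; }

        for (int ii = 0; ii < N; ii += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int jj = 0; jj < N; jj += BLOCK)
                    /* one cache-resident tile at a time */
                    for (int i = ii; i < ii + BLOCK; i++)
                        for (int k = kk; k < kk + BLOCK; k++)
                            for (int j = jj; j < jj + BLOCK; j++)
                                c[i][j] += a[i][k] * b[k][j];

        printf("c[0][0] = %f\n", c[0][0]);  /* expect 512.0 */
        return 0;
    }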