TRANSCRIPT
www.cineca.it
CINECA HPC Infrastructure: state of the art and road map
• Carlo Cavazzoni, HPC department, CINECA
Installed HPC Engines
Eurora (Eurotech): hybrid cluster, 64 nodes, 1024 SandyBridge cores, 64 K20 GPUs, 64 Xeon PHI coprocessors, 150 TFlops peak
FERMI (IBM BGQ): 10240 nodes, 163840 PowerA2 cores, 2 PFlops peak
PLX (IBM DataPlex): hybrid cluster, 274 nodes, 3288 Westmere cores, 548 NVIDIA M2070 (Fermi), 300 TFlops peak
FERMI @ CINECA: PRACE Tier-0 System
Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2, 1.6 GHz
Computing cores: 163840
Computing nodes: 10240
RAM: 1 GByte/core
Internal network: 5D torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
Available for ISCRA & PRACE call for projects
The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in efficient use of the resources is available through participating centres throughout Europe. Available resources are announced for each Call for Proposals.
Peer reviewed open access:
• PRACE Projects (Tier-0)
• PRACE Preparatory Access (Tier-0)
• DECI Projects (Tier-1)
[Pyramid: Tier 0 = European, Tier 1 = National, Tier 2 = Local]
1. Chip: 16 P cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots
6. Rack: 2 midplanes
7. System: 20 PF/s
BG/Q I/O architecture
[Diagram: BG/Q compute racks connect over PCIe to the I/O drawers and I/O nodes (8 I/O nodes), which connect over IB switches to the file system servers and the IB SAN]
At least one I/O node for each partition/job
Minimum partition/job size: 64 nodes, 1024 cores
PowerA2 chip, basic info
• 64-bit RISC processor
• Power instruction set (Power1…Power7, PowerPC)
• 4 Floating Point units per core & 4 way MT
• 16 cores + 1 + 1 (a 17th core for system functions and an 18th spare)
• 1.6GHz
• 32MByte cache
• system-on-a-chip design
• 16GByte of RAM at 1.33GHz
• Peak Perf 204.8 gigaflops
• power draw of 55 watts
• 45 nanometer copper/SOI process (same as Power7)
• Water Cooled
PowerA2 FPU
• Each FPU on each core has four pipelines
• Executes scalar floating point instructions
• Four-wide SIMD instructions
• Two-wide complex arithmetic SIMD instructions
• Six-stage pipeline
• Maximum of eight concurrent floating point operations per clock, plus a load and a store
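The 204.8 GFlops peak quoted for the chip follows directly from these numbers (eight flops per clock per core via the 4-wide SIMD FMA unit). A quick illustrative check in Python:

    # Per-chip peak performance of the PowerA2 (BG/Q) compute chip
    cores = 16                 # compute cores per chip
    flops_per_clock = 8        # 4-wide SIMD with fused multiply-add = 8 flops/cycle
    freq_ghz = 1.6             # clock frequency in GHz

    peak_gflops = cores * flops_per_clock * freq_ghz
    print(peak_gflops)         # 204.8 GFlops, matching the slide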
EURORA: #1 in the Green500 List, June 2013
What does EURORA stand for? EURopean many integrated cORe Architecture
What is EURORA? A prototype project funded by the PRACE 2IP EU project (grant agreement number RI-283493)
Co-designed by CINECA and EUROTECH
Where is EURORA? EURORA is installed at CINECA
When was EURORA installed? March 2013
Who is using EURORA? All Italian and EU researchers, through the PRACE prototype grant access program. 3,200 MFLOPS/W at 30 kW
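As a rough, illustrative cross-check (assuming the 3,200 MFLOPS/W figure refers to the sustained Linpack performance used by the Green500), the efficiency times the power envelope gives the machine's sustained performance:

    # Back-of-the-envelope: sustained performance = efficiency x power draw
    efficiency_mflops_per_watt = 3200
    power_watt = 30_000                       # 30 kW

    sustained_tflops = efficiency_mflops_per_watt * power_watt / 1e6
    print(sustained_tflops)                   # ~96 TFlops sustained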
Why EURORA? (project objectives)
Address today's HPC constraints: Flops/Watt, Flops/m², Flops/Dollar.
Efficient cooling technology: hot-water cooling (free cooling); measure power efficiency, evaluate PUE & TCO.
Improve application performance at the same rate as in the past (~Moore's law); new programming models.
Evaluate hybrid (accelerated) technology: Intel Xeon Phi; NVIDIA Kepler.
Custom interconnection technology: 3D torus network (FPGA); evaluation of accelerator-to-accelerator communications.
EURORA prototype configuration:
• 64 compute cards
• 128 Xeon SandyBridge CPUs (2.1 GHz, 95 W and 3.1 GHz, 150 W)
• 16 GByte DDR3 1600 MHz per node
• 160 GByte SSD per node
• 1 FPGA (Altera Stratix V) per node
• IB QDR interconnect
• 3D torus interconnect
• 128 accelerator cards (NVIDIA K20 and Intel PHI)
Node card [photo: Xeon PHI and K20 accelerator cards]
Node energy efficiency [chart]: decreases!
HPC Service
• FERMI (IBM BGQ)
• PLX (IBM x86 + GPU)
• Eurora (Eurotech hybrid)
HPC Data Store
• Workspace: 3.6 PByte
• Repository: 1.8 PByte
• Tape: 1.5 PByte
HPC Engines and Network
[Diagram: FERMI and EURORA on the custom networks; EURORA, PLX, Store, and Nubes on IB; infrastructure on GbE; fibre connection to the Internet, the Store, and external data sources (labs, PRACE, EUDAT projects)]
Data Processing Workloads
[Diagram: FERMI and PLX serve visualization, high-throughput, big-memory, and DB workloads; data movers feed the processing; FEC/NUBES host web services, cloud services, web archive, and FTP]
HPC Workloads
• PRACE
• ISCRA
• LISA
• Labs and industry (agreements, projects)
• Training
HPC Services
• HPC Cloud: FEC, PLX, Store, Nubes
FERMI: #12 Top500, 2 PFlops peak, 163840 cores, 163 TByte RAM, Power cores at 1.6 GHz
EURORA: #1 Green500, 0.17 PFlops peak, 1024 x86 cores, 64 Intel PHI, 64 NVIDIA K20
PLX: 0.3 PFlops peak, ~3500 x86 cores, 548 NVIDIA GPUs, 20 NVIDIA Quadro, 16 fat nodes
CINECA services
• High Performance Computing
• Computational workflow
• Storage
• Data analytics
• Data preservation (long term)
• Data access (web/app)
• Remote visualization
• HPC training
• HPC consulting
• HPC hosting
• Monitoring and metering
• …
For academia and industry
Road Map
(Data centric) Infrastructure (Q3 2014)
[Diagram: Core Data Store (Workspace 3.6 PByte, Repository 5 PByte, Tape 5+ PByte); Core Data Processing (viz, big mem, DB, data mover, processing, web services, web archive, FTP, cloud service); Scale-Out Data Processing (FERMI, x86 cluster, new analytics cluster, new storage, SaaS / analytics / parallel applications); internal data sources (laboratories); external data sources (PRACE, EUDAT, Human Brain Project, other data sources)]
High-level system requirements
Electrical power consumption: 400 kW
Physical system size: 5 racks
System peak performance (CPU+GPU): on the order of 1 PFlops
System peak performance (CPU only): on the order of 300 TFlops
New Tier 1 CINECA
Procurement Q3 2014
High-level system requirements
CPU architecture: Intel Xeon Ivy Bridge
Number of cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz
The choice of frequency and core count depends on the socket TDP, the system density, and the cooling capacity.
Number of servers: 500-600 (peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clock = 345 TFlops). The number of servers may depend on cost or on the configuration geometry, in terms of the number of CPU-only nodes and the number of CPU+GPU nodes.
GPU architecture: NVIDIA K40
Number of GPUs: >500 (peak perf = 700 * 1.43 TFlops = 1 PFlops). The number of GPU cards may depend on cost or on the configuration geometry, in terms of the number of CPU-only nodes and the number of CPU+GPU nodes.
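Both peak figures above can be reproduced from the stated counts; a small illustrative sketch of the arithmetic:

    # CPU-only peak: servers x sockets x cores x clock (GHz) x flops per clock
    cpu_peak_tflops = 600 * 2 * 12 * 3.0 * 8 / 1000        # ~345.6 TFlops
    # GPU peak: number of K40 cards x per-card peak in TFlops
    gpu_peak_pflops = 700 * 1.43 / 1000                     # ~1.0 PFlops
    print(cpu_peak_tflops, gpu_peak_pflops)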
Tier 1 CINECA
High-level system requirements
Identified vendors: IBM, Eurotech
DRAM memory: 1 GByte/core. The option of a subset of nodes with a larger amount of memory will be requested.
Local non-volatile memory: >500 GByte SSD/HD, depending on cost and system configuration
Cooling: liquid cooling system with a free-cooling option
Scratch disk space: >300 TByte (provided by CINECA)
Tier 1 CINECA
Roadmap 50 PFlops
Roadmap to Exascale (architectural trends)
HPC Architectures
Two models
Hybrid:
• Server class nodes: server class processors
• Special purpose nodes: accelerator devices (NVIDIA, Intel, AMD, FPGA)
Homogeneous:
• Server class nodes: standard processors
• Special purpose nodes: special purpose processors
Architectural trends
• Peak performance: Moore's law
• FPU performance: Dennard's law
• Number of FPUs: Moore + Dennard
• Application parallelism: Amdahl's law
Programming Models
Fundamental paradigms: message passing, multi-threading
Consolidated standards: MPI & OpenMP
New task-based programming models
Special purpose for accelerators: CUDA, Intel offload directives, OpenACC, OpenCL, etc.; no consolidated standard
Scripting: Python (a minimal message-passing sketch in Python follows below)
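A minimal illustration of the message-passing paradigm driven from Python (assuming an MPI implementation and the mpi4py package are available; this is a generic sketch, not a CINECA-specific example):

    # Each MPI rank computes a partial sum; rank 0 gathers the total.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    partial = sum(range(rank, 1000, size))        # this rank's share of the work
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"total computed by {size} ranks: {total}")

Run, for example, with: mpirun -n 4 python sum.py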
But!
Si lattice: 0.54 nm
There will still be 4~6 cycles (or technology generations) left until we reach 11~5.5 nm technologies, at which point we will hit the downscaling limit, sometime between 2020 and 2030 (H. Iwai, IWJT 2008).
14 nm VLSI: about 300 atoms!
Dennard scaling law (downscaling)
L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L'^2 = 4 * D
P' = P
does not hold anymore!
The power crisis!
L' = L / 2
V' = ~V
F' = ~F * 2
D' = 1 / L'^2 = 4 * D
P' = 4 * P
Increase the number of cores to keep architecture evolution on Moore's law.
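The two regimes above can be compared with a toy dynamic-power model (an illustrative assumption: per-device power ~ C * V^2 * F, multiplied by device density):

    # Relative chip power after one downscaling step L -> L/2
    def relative_power(l_scale, v_scale, f_scale):
        density = 1.0 / l_scale**2       # devices per area grow as 1/L^2
        capacitance = l_scale            # per-device capacitance shrinks with L
        return capacitance * v_scale**2 * f_scale * density

    print(relative_power(0.5, 0.5, 2.0))   # classic Dennard scaling: 1.0 (P' = P)
    print(relative_power(0.5, 1.0, 2.0))   # voltage no longer scales: 4.0 (P' = 4P)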
Programming crisis!
The core frequency and performance no longer grow following Moore's law.
[Chart: performance of old vs. new VLSI generations]
Moore's Law
An economic and market law: it is all about the number of chips per Si wafer!
From the WSJ: Stacy Smith, Intel's chief financial officer, later gave some more detail on the economic benefits of staying on the Moore's Law race.
The cost per chip "is going down more than the capital intensity is going up," Smith said, suggesting Intel's profit margins should not suffer because of heavy capital spending. "This is the economic beauty of Moore's Law." And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt said the company has test chips running on that technology. "We are projecting similar kinds of improvements in cost out to 10 nanometers," he said. So, despite the challenges, Holt could not be induced to say there's any looming end to Moore's Law, the invention race that has been a key driver of electronics innovation since first defined by Intel's co-founder in the mid-1960s.
What about Applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to 1 / (1 - P), where P is the parallel fraction.
Example: 1,000,000 cores, P = 0.999999, serial fraction = 0.000001 (see the sketch below).
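A short sketch of Amdahl's law with the numbers above, using the finite-core form speedup(N) = 1 / ((1 - P) + P / N):

    # Amdahl's law: speedup on N cores for parallel fraction P
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    p = 0.999999                            # parallel fraction from the slide
    print(amdahl_speedup(p, 1_000_000))     # ~500,000 on 1,000,000 cores
    print(1.0 / (1.0 - p))                  # asymptotic upper bound: 1,000,000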
HPC Architectures
Two models:
Hybrid, but…
Homogeneous, but…
Which 100 PFlops systems will we see? … my guess:
• IBM (hybrid): Power8 + NVIDIA GPU
• Cray (homo/hybrid): with Intel only!
• Intel (hybrid): Xeon + MIC
• ARM (homo): ARM chips only, but…
• NVIDIA/ARM (hybrid): ARM + NVIDIA
• Fujitsu (homo): SPARC, high density, low power
• China (homo/hybrid): with Intel only
• Room for AMD console chips
Chip Architecture
Strongly market driven: mobile, TV sets, screens, video/image processing.
• Intel: new architectures to compete with ARM; less Xeon, but PHI
• ARM: main focus on low-power mobile chips (Qualcomm, Texas Instruments, NVIDIA, ST, etc.); new HPC and server markets
• NVIDIA: GPU alone will not last long; ARM+GPU, Power+GPU
• Power: embedded market; Power+GPU the only chance for HPC
• AMD: console market; still some chance for HPC