Many Integrated Core Prototype PRACE Autumn School 2012 on Massively Parallel Architectures and Molecular Simulations Sofia, 24-28 September 2012 G. Erbacci – CINECA


Page 1

Many Integrated Core Prototype

PRACE Autumn School 2012 on Massively Parallel Architectures and Molecular Simulations, Sofia, 24-28 September 2012

G. Erbacci – CINECA

Page 2

Outline

• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC

Page 3

Many Integrated Core Prototype

• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC

Page 4

HPC at CINECA

CINECA: the National Supercomputing Centre in Italy
• manages the HPC infrastructure
• provides support to Italian and European researchers
• promotes technology transfer initiatives for industry
• CINECA is a Hosting Member in PRACE
  – PLX: Linux cluster with GPUs (Tier-1 in PRACE)
  – FERMI: IBM BG/Q (Tier-0 in PRACE)

Page 5

PLX@CINECA

IBM Linux cluster
Processor type: 2 six-core Intel Xeon (Westmere) X5645 @ 2.4 GHz, 12 MB cache
N. of nodes / cores: 274 / 3288
RAM: 48 GB per compute node (14 TB in total)
Internal network: Infiniband with 4x QDR switches (40 Gbps)
Accelerators: 2 nVIDIA M2070 GPUs per node (548 GPUs in total)
Peak performance: 32 TFlops (CPUs); 565 TFlops SP / 283 TFlops DP (GPUs)

Page 6

FERMI@CINECA

Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2 @ 1.6 GHz
Computing cores: 163,840
Computing nodes: 10,240
RAM: 1 GByte / core (163 TByte in total)
Internal network: 5D Torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
N. 7 in the Top 500 list (June 2012)
National and PRACE Tier-0 calls

Page 7

CINECA HPC Infrastructure

Page 8

Computational Sciences
Computational science, together with theory and experimentation, is the “third pillar” of scientific inquiry, enabling researchers to build and test models of complex phenomena.

Quick evolution of innovation:
- Instantaneous communication

- Geographically distributed work

- Increased productivity

- More data everywhere

- Increasing problem complexity

- Innovation happens worldwide

Page 9

Technology Evolution

More data everywhere: radar, satellites, CAT scans, sensors, micro-arrays, weather models, the human genome. The size and resolution of the problems scientists address today are limited only by the size of the data they can reasonably work with. There is a constantly increasing demand for faster processing on bigger data.

Increasing problem complexity: partly driven by the ability to handle bigger data, but also by the requirements and opportunities brought by new technologies. For example, new kinds of medical scans create new computational challenges.

HPC evolution: as technology allows scientists to handle bigger datasets and faster computations, they push to solve harder problems. In turn, the new class of problems drives the next cycle of technology innovation.

Page 10

Top 500: some facts

1976: Cray 1 installed at Los Alamos; peak performance 160 MegaFlop/s (10^6 flop/s)

1993 (1st edition of the Top 500): N. 1 at 59.7 GFlop/s (10^9 flop/s)

1997: TeraFlop/s barrier broken (10^12 flop/s)

2008: PetaFlop/s (10^15 flop/s): Roadrunner (LANL), Rmax 1026 GFlop/s, Rpeak 1375 GFlop/s; hybrid system with 6562 dual-core AMD Opteron processors accelerated with 12240 IBM Cell processors (98 TByte of RAM)

2012 (June): 16.3 PetaFlop/s: Lawrence Livermore's Sequoia supercomputer, a BlueGene/Q with 1,572,864 cores
- 4 European systems in the Top 10
- Total combined performance of all 500 systems has grown to 123.02 PFlop/s, compared to 74.2 PFlop/s six months ago
- 57 systems use accelerators

Toward Exascale

Page 11

Dennard scaling law (MOSFET):
• L' = L / 2
• V' = V / 2
• F' = F * 2
• D' = 1 / L^2 = 4 D
• P' = P

It does not hold anymore! The power crisis:
• L' = L / 2
• V' ≈ V
• F' ≈ F * 2
• D' = 1 / L^2 = 4 D
• P' = 4 P

Core frequency and performance no longer grow following Moore's law; CPU + accelerator architectures are the way to keep system evolution on the Moore's-law curve.

The programming crisis!

Page 12

Roadmap to Exascale (architectural trends)

Page 13

Heterogeneous Multi-core Architecture
• Combines different types of processors, each optimized for a different operational modality
• Performance: the synthesis favors superior performance for complex computations exhibiting distinct modalities
• Purpose-designed accelerators: integrated to significantly speed up some critical aspect of one or more important classes of computation (IBM Cell architecture, ClearSpeed SIMD attached array processor, ...)
• Conventional co-processors: graphics processing units (GPU), network controllers (NIC), Many Integrated Core (MIC); efforts are underway to apply existing special-purpose components to general applications

Page 14

Accelerators
A set (one or more) of very simple execution units that can perform a few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), they can accelerate the “nominal” speed of a system.

[Diagram: from physical integration (CPU + ACC) to architectural integration (CPU & ACC); trade-off between single-thread performance and throughput.]

Page 15

nVIDIA GPU

The Fermi implementation packs 512 processor cores.

Page 16

ATI FireStream, AMD GPU

2012: new Graphics Core Next (“GCN”) architecture, with a new instruction set and a new SIMD design.

Page 17

Intel MIC (Knights Ferry)

Page 18

Real HPC Crisis is with Software

A supercomputer application and its software are usually much longer-lived than the hardware
- Hardware life is typically four to five years at most
- Fortran and C are still the main programming models

Programming is stuck
- Arguably it hasn't changed much since the 70's

Software is a major cost component of modern technologies
- The tradition in HPC system procurement is to assume that the software is free

It's time for a change
- Complexity is rising dramatically
- Challenges for the applications on Petaflop systems
- Improvement of existing codes will become complex and partly impossible
- The use of O(100K) cores implies a dramatic optimization effort
- New paradigms such as the support of a hundred threads in one node imply new parallelization strategies
- Implementing new parallel programming methods in existing large applications does not always have a promising perspective

There is a need for new community codes

Page 19

What about parallel applications?

• In a massively parallel context, an upper limit to the scalability of parallel applications is set by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.

Example: 1,000,000 cores with P = 0.999999 (serial fraction = 0.000001); a small worked example follows.
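To make the bound concrete, here is a minimal C sketch (not from the original slides; the function name and printed output are illustrative) that evaluates Amdahl's law for the numbers above:

    #include <stdio.h>

    /* Amdahl's law: speedup on n cores when a fraction p of the work is parallel. */
    static double amdahl_speedup(double p, double n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        double p = 0.999999;   /* parallel fraction from the slide */
        double n = 1000000.0;  /* number of cores                  */

        printf("speedup on %.0f cores   : %.0f\n", n, amdahl_speedup(p, n));
        printf("asymptotic limit 1/(1-P): %.0f\n", 1.0 / (1.0 - p));
        return 0;
    }

With these numbers the speedup on one million cores is about 500,000, already only half of the 1 / (1 − P) = 1,000,000 asymptotic limit.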

Page 20

Trends

Scalar application → Vector → Distributed memory → Shared memory → Hybrid codes

• MPP systems, message passing: MPI
• Multi-core nodes: OpenMP
• Accelerators (GPGPU, FPGA): CUDA, OpenCL

Page 21

Many Integrated Core Prototype

• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC

Page 22

EURORA Prototype


• Evolution of the AURORA architecture by Eurotech (http://www.eurotech.com/)
  – Aurora rack: 256 nodes, 512 CPUs
  – 101 TFlops @ 100 kW
  – liquid cooled
• CPU: Xeon Sandy Bridge (SB); up to one full cabinet (128 nodes + 256 accelerators)
• Accelerator: Intel Many Integrated Core (MIC)
• Network architecture: IB and Torus interconnect (low latency / high bandwidth)
• Cooling: hot water

Page 23

EURORA chassis: 1 rack, 16 chassis
16 node cards, or 8 node cards + 16 accelerators

Eurora rack physical dimensions: 2133 mm (48U) h, 1095 mm w, 1500 mm d
Weight (full rack with cooling, fully loaded with water): 2000 kg
Power/cooling typical requirements: 120-130 kW @ 48 Vdc

Page 24

EURORA node
• 2 Intel Xeon E5
• 2 Intel MIC or 2 nVidia Kepler
• 16 GByte DDR3 1.6 GHz
• SSD disk

Page 25

Node card mockup

• Presented at ISC12
• Can host MIC and K20 cards
• Thermal analysis and validation performed

Page 26

EURORA Network

3D Torus custom network: FPGA (Altera Stratix V), EXTOLL, APENET, ad-hoc MPI subset

InfiniBand FDR: Mellanox ConnectX3, MPI + filesystem, synchronization

Page 27

Cooling

• Hot water, 50-80 °C
• Temperature gap 3-5 °C
• No rotating fans
• Cold plates: direct on-component liquid cooling
• Dry chillers
• Free cooling
• Temperature sensors: performance is downgraded if required
• System isolation
• Quick disconnect

Page 28

EURORA prototype (Node Accelerator)

EURopean many integrated cORe Architecture

Goal: evaluate a new architecture for next-generation Tier-0 systems

Partners: CINECA (Italy), GRNET (Greece), IPB (Serbia), NCSA (Bulgaria)
Vendor: Eurotech, Italy

Page 29

EURORA Installation Plan

Page 30

HW Procurement

• Contract with EUROTECH signed in July
  – 64 compute cards
  – 128 Xeon SandyBridge 3.1 GHz
  – 16 GByte DDR3 1600 MHz per node
  – 160 GByte SSD per node
  – 1 FPGA (Altera Stratix V) per node
  – IB FDR
  – 128 accelerator cards: Intel KNC (or NVIDIA K20)
  – Thermal sensor network

Page 31

HW Procurement and Facility

• Contract with EUROTECH signed in July
• Integration in the facility
  – First assessment of the location with EUROTECH in May
  – First integration project completed (estimated cost higher than budgeted)
  – Second assessment with EUROTECH in September (before the end)
  – Procurement of the technology: dry coolers, pipes and pumps, exchanger, tanks, filters

Page 32

Some Applications

• www.quantum-espresso.org

• www.gromacs.org

Page 33

EURORA Programming Models

• Message Passing (MPI)
• Shared Memory (OpenMP, TBB)
• MIC offload (pragmas) / native
• Hybrid: MPI + OpenMP + MIC extensions/OpenCL (a minimal hybrid skeleton is sketched below)
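As an illustration only (not part of the original slides), a minimal hybrid MPI + OpenMP skeleton of the kind targeted by these programming models; the requested thread-support level and the printed text are arbitrary choices:

    /* Minimal hybrid MPI + OpenMP sketch: one multi-threaded team per MPI rank. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Ask for an MPI library that tolerates OpenMP threads inside a rank. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank spawns an OpenMP team; the same source can be built for the
           host or, with -mmic, natively for the MIC co-processor. */
        #pragma omp parallel
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }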

Page 34

ACCELERATORS

• First K20 and KNC (dense form factor) samples in September

• KNC standard expansion module, already available to start the work on software.

Page 35

Software
• Installation of the KNC software kit
• Test of the compiler and of the node card HW
• First simple (MPI+OpenMP) application test
• First MIC-to-MIC MPI communication test
  – Intel MPI
  – within the same node
• Test of the affinity

Page 36

ACCESS

• Access will be granted upon request to the partners of the prototype project.
• Other requests will be evaluated case by case.
• We are working to grant early access to the KNC board already installed.

Page 37

Expected results

• Validate the node card design
• Density on the order of 500 TFlops/rack (BG/Q is 200 TFlops/rack)
• 3D Torus network scalability and performance vs InfiniBand
• Power Usage Effectiveness (PUE) close to or less than 1.1 (free cooling most of the year)
• Programming model for the MIC accelerator
• Improved application efficiency with respect to multi-core clusters
• Bridge the gap with exascale machines

Page 38

Many Integrated Core Prototype

• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC

Page 39

MIC Architecture

• MIC: Many Integrated Core
• Knights Corner co-processor
• Intel Xeon Phi co-processor
  – 22 nm technology
  – > 50 Intel Architecture cores, connected by a high-performance on-die bidirectional interconnect
  – I/O bus: PCIe
  – Memory type: GDDR5, with > 2x the bandwidth of KNF
  – Memory size: 8 GB GDDR5
  – Peak performance: > 1 TFlop/s (DP)
  – Single Linux image per chip

Page 40

MIC Intel Xeon Phi Ring

Each microprocessor core is a fully functional, in-order core capable of running IA instructions independently of the other cores.

Hardware multi-threaded cores: each core can concurrently run instructions from four processes or threads.

The ring interconnect connects all the components together on the chip.

Page 41

The Processor Core

- Fetches and decodes instructions from four hardware thread execution contexts
- Executes the x86 ISA and the Knights Corner vector instructions
- The core can execute 2 instructions per clock cycle, one per pipe
- 32 KB, 8-way set-associative L1 instruction and data caches
- Core Ring Interface (CRI)
- L2 cache
- Memory controllers (which access external memory devices to read and write data)
- PCI Express client: the system interface to the host CPU or PCI Express switch

Page 42

The vector processing unit

A vector processing unit (VPU) is associated with each core. It is primarily a sixteen-element-wide SIMD engine operating on 512-bit vector registers, and includes a gather/scatter unit and vector mask support.

Page 43

Vector processing

do i = 1, N
   A(i) = B(i) + C(i)
end do

[Diagram: the loop executes on the VPU as a pipelined vector operation. Elements of B and C are streamed into vector registers V1 and V2, and the floating-point add unit produces V0 = V1 + V2, delivering results every clock cycle (CP 0 ... CP 7).]
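For comparison, a minimal C sketch of the same loop (not part of the original slides): with restrict-qualified pointers the compiler is free to auto-vectorize the body into the wide SIMD operations described above.

    #include <stddef.h>

    /* Element-wise vector add, A(i) = B(i) + C(i).
       restrict promises the arrays do not overlap, which is what allows
       the compiler to emit wide SIMD (vector-register) instructions. */
    void vector_add(size_t n, double * restrict a,
                    const double * restrict b, const double * restrict c)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }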

Page 44

The L2 Cache
• Each core has a 512 KB L2 cache
• The L2 cache is part of the Core-Ring Interface block
• The L2 cache is private to the core: each core acts as a stand-alone core with 512 KB of total L2 cache space; other cores cannot directly use it as a cache
• 512 KB x > 50 cores gives > 25 MB of L2 on Knights Corner
• There is a tag directory on each core, but it is not private to the core
• A simplified way to view the many cores in Knights Corner is as a chip-level symmetric multiprocessor (SMP): > 50 such cores sharing a high-speed on-die interconnect

Page 45

The Ring Interconnect
• Knights Corner has 10 rings (5 in each direction):
  – the BL ring carries the data
  – 2 AR rings carry addresses
  – 2 AK rings carry coherence information
• Knights Corner can send data across the ring once per clock per controller

Page 46

Many Integrated Core Prototype

• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC

Page 47

Compute modes vision

A spectrum of compute modes, from Xeon-centric to MIC-centric:
• Xeon hosted: general-purpose serial and parallel computing
• Parallel co-processing (Xeon hosted, MIC co-processed): codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Scalar co-processing (MIC hosted, Xeon co-processed): highly parallel codes with scalar phases
• MIC hosted: highly-parallel codes

Page 48

[Diagram: each execution mode shown as Main( ) / Foo( ) / MPI_*( ) boxes placed on the Xeon and/or on the MIC, connected over PCIe: Xeon native; Xeon-hosted, MIC co-processed (offload); autonomous / symmetric mode; MIC-hosted, Xeon co-processed; MIC native. Offload: Intel C, C++, Fortran compilers. Native: -mmic.]

Page 49

MPI programming models for MIC

Page 50

Offload Model

This model is characterized by the MPI communications taking place only between the host processors.

The co-processors are used exclusively through the offload capabilities of products such as the Intel C, C++ and Fortran compilers for the Intel MIC Architecture, the Intel Math Kernel Library (MKL), etc.

This mode of operation is already supported by the Intel MPI Library for Linux OS.

MPI on the host devices, with offload to the co-processors.
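As an illustration (not from the original slides), a minimal offload sketch using the Intel compiler's offload pragmas for MIC; the array names and size are arbitrary, and depending on compiler settings the region may fall back to host execution when no co-processor is present:

    #include <stdio.h>
    #define N 1024

    /* Data referenced inside the offloaded region must be visible on the MIC. */
    __attribute__((target(mic))) static double a[N], b[N], c[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* The marked block is compiled for the co-processor and executed there;
           b and c are copied in, a is copied back out. */
        #pragma offload target(mic) in(b, c) out(a)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                a[i] = b[i] + c[i];
        }

        printf("a[%d] = %f\n", N - 1, a[N - 1]);
        return 0;
    }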

Page 51

Symmetric Model

The MPI processes reside on both the host and the MIC devices.

This model involves both the host CPUs and the co-processors in the execution of the MPI processes and the related MPI communications.

Message passing is supported inside the co-processor, inside the host node, and between the co-processor and the host.

This is the most general MPI view of an essentially heterogeneous cluster.
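A minimal sketch (illustrative, not from the slides) of a program run in symmetric mode: the same source is built twice, once for the host and once with -mmic for the co-processor, and all ranks, wherever they run, join a single MPI_COMM_WORLD:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* Each rank reports where it runs: a Xeon host or a MIC co-processor. */
        printf("rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }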

Page 52

Co-processor-only Model (or MPI Native)

The MPI processes reside on the MIC co-processors only.

The MPI libraries, the application, and other needed libraries are uploaded to the co-processors.

An application can be launched from the host or from the co-processor.

This can be seen as a specific case of the symmetric model.

Page 53

Page 54

FARM: Air Quality Model
• 3D Eulerian chemical-transport model (CTM); Fortran 77/90
• Used to study the transport, chemical conversion and deposition of atmospheric pollutants
• Manages multiple nested grids with different resolutions
• Can be compiled in four different ways: serial; OpenMP; MPI (master-worker strategy); hybrid (OpenMP + MPI)
• A good candidate to be tested on MIC:
  – Native compilation: all FARM processes run on the MIC
  – Symmetric compilation: the master process runs on the host; worker processes run on the MIC

Page 55

Native compilation

Page 56

Native compilation

Page 57

Porting on MIC

Pros:
• Compilation requires only an additional Intel compiler flag (-mmic)
• Scalability tests: fast and smooth
• Quick analysis with the Intel tools (VTune, ITAC - Intel Trace Analyzer and Collector)
• Porting time: one day, including validation of the numerical results
• Done by an expert developer of FARM with good knowledge of the Intel compiler but only a basic knowledge of MIC
• Best scalability with the OpenMP and hybrid versions

Cons:
• MPI_Init routine problem: CPU time increases with the number of processes
• The same problem appears when using two MICs together
• Detailed analysis of OpenMP threads is currently not available
• Execution time depends strongly on code vectorization, so the compiler's vectorization capabilities and the code structure are key to achieving satisfactory overall performance

Page 58

Conclusions

• Hybrid clusters can bridge the gap with exascale machines
• Test and monitor technologies that could influence the design of Exascale systems
• Introduce fault tolerance at system and application level
• New hybrid programming models
  – Focus on the applications
  – Big Data challenge
  – Uncertainties challenge
  – Coupling Architecture-Algorithm-Application