
Page 1:

Introduction to Many-Core Architectures

Henk Corporaal
www.ics.ele.tue.nl/~heco

ASCI Winterschool on Embedded Systems

Soesterberg, March 2010

Page 2:

Intel Trends (K. Olukotun)

[Chart: Intel processor trends over time; the most recent data point shown is the Core i7 at about 3 GHz and 100 W.]

Page 3:

System-level integration (Chuck Moore, AMD at MICRO 2008)

Single-chip CPU Era: 1986 - 2004
- Extreme focus on single-threaded performance
- Multi-issue, out-of-order execution plus moderate cache hierarchy

Chip Multiprocessor (CMP) Era: 2004 - 2010
- Early: hasty integration of multiple cores into the same chip/package
- Mid-life: address some of the HW scalability and interference issues
- Current: homogeneous CPUs plus moderate system-level functionality

System-level Integration Era: ~2010 onward
- Integration of substantial system-level functionality
- Heterogeneous processors and accelerators
- Introspective control systems for managing on-chip resources & events

Page 4:

Why many core?

Running into the
- Frequency wall
- ILP wall
- Memory wall
- Energy wall

Chip area enabler: Moore's law goes well below 22 nm
- What to do with all this area?
- Multiple processors fit easily on a single die

Application demands

Cost effective (just connect existing processors or processor cores)

Low power: parallelism may allow lowering Vdd
- Performance/Watt is the new metric!!

Page 5:

Low power through parallelism

Sequential processor: switching capacitance C, frequency f, voltage V
  P1 = f · C · V^2

Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V' < V
  P2 = (f/2) · 2C · V'^2 = f · C · V'^2 < P1

[Diagram: one sequential CPU versus two parallel CPUs (CPU1, CPU2) delivering the same throughput at a lower voltage.]
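To make the arithmetic above concrete, here is a minimal C sketch; the capacitance, frequency, and voltage values are purely illustrative assumptions, not measurements.

#include <stdio.h>

int main(void) {
    double C  = 1e-9;   /* switching capacitance in F (illustrative) */
    double f  = 1e9;    /* clock frequency in Hz (illustrative) */
    double V  = 1.2;    /* supply voltage of the sequential design */
    double Vp = 0.9;    /* reduced supply voltage of the parallel design */

    double P1 = f * C * V * V;                 /* sequential: f C V^2 */
    double P2 = (f / 2) * (2 * C) * Vp * Vp;   /* two units at half frequency */

    printf("P1 = %.2f W, P2 = %.2f W\n", P1, P2);  /* P2 < P1 because Vp < V */
    return 0;
}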

Page 6:

How low Vdd can we go?

Subthreshold JPEG encoder Vdd 0.4 – 1.2 Volt

[Plot: energy per operation (pJ, 0.0 - 8.0) vs. supply voltage (1.2 V down to 0.4 V) for the parallel engines; the labels 3.4X, 4.4X, 5.6X, and 8.3X mark the energy reductions obtained at the lower supply voltages.]

Page 7:

Computational efficiency: how many MOPS/Watt?

Yifan He e.a., DAC 2010

Page 8:

Computational efficiency: what do we need?

[Chart (Woh e.a., ISCA 2009): performance (Gops) vs. power (Watts) on a log-log scale, with power-efficiency diagonals at 1, 10, 100, and 1000 Mops/mW (better efficiency toward the upper left). Plotted designs include SODA (90 nm and 65 nm), TI C6X, Imagine, VIRAM, Pentium M, and IBM Cell; workload requirements are marked for 3G wireless, 4G wireless, and mobile HD video.]

Page 9:

Intel's opinion: 48-core x86

Page 10:

Outline

Classifications of Parallel Architectures

Examples
- Various (research) architectures
- GPUs
- Cell
- Intel multi-cores

How much performance do you really get? The Roofline model

Trends & Conclusions

Page 11:

Classifications

Performance / parallelism driven: the 4-5 dimensional design space; Flynn

Communication & memory
- Message passing / shared memory
- Shared memory issues: coherency, consistency, synchronization

Interconnect

Page 12:

Flynn's Taxonomy

SISD (Single Instruction, Single Data)
- Uniprocessors

SIMD (Single Instruction, Multiple Data)
- Vector architectures also belong to this class
- Multimedia extensions (MMX, SSE, VIS, AltiVec, ...)
- Examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, ...

MISD (Multiple Instruction, Single Data)
- Systolic arrays / stream based processing

MIMD (Multiple Instruction, Multiple Data)
- Examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin
- Flexible
- Most widely used

Page 13:

Flynn's Taxonomy

Page 14:

Enhance performance: 4 architecture methods

(Super)-pipelining

Powerful instructions
- MD-technique: multiple data operands per operation
- MO-technique: multiple operations per instruction

Multiple instruction issue
- Single stream: superscalar
- Multiple streams:
  - single core, multiple threads: simultaneous multithreading
  - multiple cores

Page 15:

Architecture methods: Pipelined Execution of Instructions

Purpose of pipelining:
- Reduce #gate_levels in the critical path
- Reduce CPI close to one (instead of a large number for the multicycle machine)
- More efficient hardware

Problems: hazards cause pipeline stalls
- Structural hazards: add more hardware
- Control hazards, branch penalties: use branch prediction
- Data hazards: bypassing required

Stages of the simple 5-stage pipeline:
- IF: Instruction Fetch
- DC: Instruction Decode
- RF: Register Fetch
- EX: Execute instruction
- WB: Write Result Register

[Diagram: four instructions flowing through the 5-stage pipeline, one entering per cycle (cycles 1-8).]

Page 16:

Architecture methods: Pipelined Execution of Instructions

Superpipelining: split one or more of the critical pipeline stages.

Superpipelining degree S:

  S(architecture) = Σ_{op ∈ I_set} f(op) · lt(op)

where f(op) is the frequency of operation op and lt(op) is the latency of operation op
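As a small illustration of this sum, the C sketch below evaluates S for an invented operation mix and latencies (all numbers are assumptions for the example only):

#include <stdio.h>

int main(void) {
    /* made-up instruction-set statistics: f(op) sums to 1 */
    const char *op[] = { "alu", "load", "store", "branch", "fmul" };
    double f[]       = { 0.50,  0.20,   0.10,    0.15,     0.05  };
    double lt[]      = { 1.0,   2.0,    1.0,     1.0,      3.0   };  /* cycles */

    double S = 0.0;
    for (int i = 0; i < 5; i++) {
        S += f[i] * lt[i];                 /* accumulate f(op) * lt(op) */
        printf("%-6s f=%.2f lt=%.1f\n", op[i], f[i], lt[i]);
    }
    printf("S = %.2f\n", S);               /* 1.30 for this mix */
    return 0;
}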

Page 17:

Architecture methods: Powerful Instructions (1): MD-technique

Multiple data operands per operation
SIMD: Single Instruction Multiple Data

Vector instruction:

for (i = 0; i < 64; i++)
    c[i] = a[i] + 5*b[i];

or

c = a + 5*b

Assembly:

set   vl, 64
ldv   v1, 0(r2)
mulvi v2, v1, 5
ldv   v1, 0(r1)
addv  v3, v1, v2
stv   v3, 0(r3)

Page 18:

Architecture methods: Powerful Instructions (1): SIMD computing

- All PEs (Processing Elements) execute the same operation
- Typical mesh or hypercube connectivity
- Exploits the data locality of e.g. image processing applications
- Dense encoding (few instruction bits needed)

[Diagram: SIMD execution method: instructions 1 ... n issued over time, each executed in lockstep on PE1 ... PEn.]

Page 19:

Architecture methods: Powerful Instructions (1): Sub-word parallelism

SIMD on a restricted scale: used for multimedia instructions

Examples: MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, TriMedia II

Example: Σ_{i=1..4} |a_i - b_i|
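The Σ|a_i - b_i| example maps directly onto the x86 PSADBW sub-word instruction. Below is a hedged C sketch using the SSE2 intrinsic _mm_sad_epu8 on 16 byte-wide elements; the data values are made up.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a[16] = { 10, 20, 30, 40, 5, 5, 5, 5, 1, 2, 3, 4, 5, 6, 7, 8 };
    uint8_t b[16] = {  5, 25, 15, 60, 5, 5, 5, 5, 8, 7, 6, 5, 4, 3, 2, 1 };

    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sad = _mm_sad_epu8(va, vb);  /* |a_i - b_i| summed per 8-byte half */

    uint64_t s[2];
    _mm_storeu_si128((__m128i *)s, sad);
    printf("sum |a-b| = %llu\n", (unsigned long long)(s[0] + s[1]));  /* 77 here */
    return 0;
}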

Page 20:

Architecture methods: Powerful Instructions (2): MO-technique, multiple operations per instruction

Two options:
- CISC (Complex Instruction Set Computer)
- VLIW (Very Long Instruction Word)

VLIW instruction example, one field per function unit:

  FU 1: sub r8, r5, 3
  FU 2: and r1, r5, 12
  FU 3: mul r6, r5, r2
  FU 4: ld r3, 0(r5)
  FU 5: bnez r5, 13

Page 21:

VLIW architecture: central Register File

[Diagram: three issue slots, each feeding three execution units (exec units 1-9), all sharing one central register file.]

Q: How many ports does the register file need for n-issue? (For a typical 3-operand ISA, each slot needs 2 read ports and 1 write port, so roughly 2n read and n write ports.)

Page 22:

Architecture methods: Multiple instruction issue (per cycle)

Who guarantees semantic correctness, i.e. which instructions can be executed in parallel?

- User: specifies multiple instruction streams. Multi-processor: MIMD (Multiple Instruction, Multiple Data)
- HW: run-time detection of ready instructions. Superscalar
- Compiler: compile into a dataflow representation. Dataflow processors

Page 23:

Four dimensional representation of the architecture design space <I, O, D, S>

- Instructions/cycle 'I'
- Operations/instruction 'O'
- Data/operation 'D'
- Superpipelining degree 'S'

[Diagram: the four axes on a log scale (roughly 0.1 to 100), with example architectures placed along them: CISC and RISC near the origin, superscalar / MIMD / dataflow along I, VLIW along O, vector and SIMD along D, and superpipelined designs along S.]

Page 24:

Architecture design space

Example values of <I, O, D, S> for different architectures, with Mpar = I · O · D · S:

Architecture    I     O     D     S     Mpar
CISC            0.2   1.2   1.1   1     0.26
RISC            1     1     1     1.2   1.2
VLIW            1     10    1     1.2   12
Superscalar     3     1     1     1.2   3.6
SIMD            1     1     128   1.2   154
MIMD            32    1     1     1.2   38
GPU             32    2     8     24    12288
Top500 Jaguar   ???

(Again S(architecture) = Σ_{op ∈ I_set} f(op) · lt(op).)

You should exploit this amount of parallelism!!!

Page 25:

Communication

Parallel architecture extends traditional computer architecture with a communication network:
- abstractions (HW/SW interface)
- an organizational structure to realize the abstraction efficiently

[Diagram: several processing nodes attached to a communication network.]

Page 26:

Communication models: Shared Memory

- Coherence problem
- Memory consistency issue
- Synchronization problem

[Diagram: processes P1 and P2 both reading and writing a shared memory.]

Page 27:

Communication models: Shared memory

Shared address space

Communication primitives: load, store, atomic swap

Two varieties:
- Physically shared => Symmetric Multi-Processors (SMP), usually combined with local caching
- Physically distributed => Distributed Shared Memory (DSM)

Page 28:

SMP: Symmetric Multi-Processor

Memory: centralized, with uniform access time (UMA) and bus interconnect, I/O

Examples: Sun Enterprise 6000, SGI Challenge, Intel

[Diagram: several processors, each with one or more cache levels, sharing main memory and an I/O system over an interconnect that can be 1 bus, N busses, or any network.]

Page 29:

DSM: Distributed Shared Memory

Non-uniform access time (NUMA) and scalable interconnect (distributed memory)

[Diagram: nodes of processor + cache + local memory, plus I/O, connected by an interconnection network.]

Page 30:

Shared Address Model Summary

Each processor can name every physical location in the machine

Each process can name all data it shares with other processes

Data transfer via load and store

Data size: byte, word, ... or cache blocks

Memory hierarchy model applies: communication moves data to local proc. cache

Page 31:

Three fundamental issues for shared memory multiprocessors

Coherence: do I see the most recent data?

Consistency: when do I see a written value? E.g., do different processors see writes at the same time (w.r.t. other memory accesses)?

Synchronization: how to synchronize processes? How to protect access to shared data?
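To make the synchronization issue concrete, here is a minimal C/pthreads sketch (the counter and iteration count are arbitrary): without the mutex, the two threads race on the shared counter and the result is non-deterministic; with it, the final value is always 2,000,000.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);     /* protect access to shared data */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 2000000 with the lock held */
    return 0;
}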

Page 32:

Communication models: Message Passing

Communication primitives: e.g. send and receive library calls; standard: MPI (Message Passing Interface), www.mpi-forum.org

Note that MP can be built on top of SM and vice versa!

[Diagram: processes P1 and P2 exchanging messages through send/receive FIFOs.]
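A minimal MPI sketch of the send/receive primitives mentioned above (assumes an MPI installation; run with e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = { 1.0, 2.0, 3.0, 4.0 };
    if (rank == 0) {
        /* send the local buffer to process 1 */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive from process 0 into the local buffer (blocking) */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}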

Page 33:

Message Passing Model

Explicit message send and receive operations

Send specifies local buffer + receiving process on remote computer

Receive specifies sending process on remote computer + local buffer to place data

Typically blocking communication, but may use DMA

Message structure: Header | Data | Trailer

Page 34:

Message passing communication

[Diagram: four nodes, each containing a processor, cache, memory, and DMA engine behind a network interface, connected by an interconnection network.]

Page 35:

Communication Models: Comparison

Shared memory:
- Compatibility with well-understood language mechanisms
- Ease of programming for complex or dynamic communication patterns
- Shared-memory applications; sharing of large data structures
- Efficient for small items
- Supports hardware caching

Message passing:
- Simpler hardware
- Explicit communication
- Implicit synchronization (with any communication)

Page 36:

Interconnect

How to connect your cores? Some options:

Connect everybody:
- Single bus
- Hierarchical bus
- NoC: multi-hop via routers; any topology possible; an easy 2D layout helps

Connect with e.g. neighbors only:
- e.g. using the shift operation in SIMD, or using dual-ported memories to connect 2 cores

Page 37:

Bus (shared) or Network (switched)

Network: claimed to be more scalable
- no bus arbitration
- point-to-point connections
- but router overhead

[Diagram: example NoC with a 2x4 mesh routing network; each node attaches to its own router R.]

Page 38:

Historical Perspective

Early machines were collections of microprocessors:
- Communication was performed using bi-directional queues between nearest neighbors

Messages were forwarded by processors on the path: "store and forward" networking

There was a strong emphasis on topology in algorithms, in order to minimize the number of hops and thereby minimize time

Page 39:

Design Characteristics of a Network

Topology (how things are connected): crossbar, ring, 2-D and 3-D meshes or tori, hypercube, tree, butterfly, perfect shuffle, ...

Routing algorithm (path used): example in a 2D torus: all east-west, then all north-south (avoids deadlock)

Switching strategy:
- Circuit switching: the full path is reserved for the entire message, like the telephone
- Packet switching: the message is broken into separately-routed packets, like the post office

Flow control and buffering (what if there is congestion):
- stall, store data temporarily in buffers
- re-route data to other nodes
- tell the source node to temporarily halt, discard, etc.

QoS guarantees, error handling, etc.

Page 40:

Switch / Network Topology

Topology determines:
- Degree: number of links from a node
- Diameter: maximum number of links crossed between nodes
- Average distance: number of links to a random destination
- Bisection: minimum number of links that separate the network into two halves
- Bisection bandwidth = link bandwidth * bisection

Page 41:

Bisection Bandwidth

Bisection bandwidth: the bandwidth across the smallest cut that divides the network into two equal halves; the bandwidth across the "narrowest" part of the network.

[Diagram: a linear array (bisection bw = link bw) and a 2D mesh (bisection bw = sqrt(n) * link bw), each showing a valid bisection cut and a cut that is not a bisection.]

Bisection bandwidth is important for algorithms in which all processors need to communicate with all others

Page 42:

Common Topologies

N = number of nodes, n = dimension

Type        Degree    Diameter        Ave Dist       Bisection
1D mesh     2         N-1             N/3            1
2D mesh     4         2(N^(1/2)-1)    2N^(1/2)/3     N^(1/2)
3D mesh     6         3(N^(1/3)-1)    3N^(1/3)/3     N^(2/3)
nD mesh     2n        n(N^(1/n)-1)    nN^(1/n)/3     N^((n-1)/n)
Ring        2         N/2             N/4            2
2D torus    4         N^(1/2)         N^(1/2)/2      2N^(1/2)
Hypercube   log2(N)   n = log2(N)     n/2            N/2
2D tree     3         2·log2(N)       ~2·log2(N)     1
Crossbar    N-1       1               1              N^2/2
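As a quick numeric check of the 2D-mesh row, a tiny C sketch for an 8 x 8 mesh (N = 64):

#include <math.h>
#include <stdio.h>

int main(void) {
    int N = 64;                            /* 8 x 8 mesh */
    double side = sqrt((double)N);
    printf("2D mesh with N = %d nodes\n", N);
    printf("degree    = 4\n");
    printf("diameter  = %.0f hops\n", 2.0 * (side - 1.0));  /* 14 */
    printf("bisection = %.0f links\n", side);               /* 8  */
    return 0;
}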

Page 43:

Topologies in Real High-End Machines (newer at the top, older at the bottom)

Machine                                      Topology
Red Storm (Opteron + Cray network, future)   3D mesh
Blue Gene/L                                  3D torus
SGI Altix                                    Fat tree
Cray X1                                      4D hypercube (approx.)
Myricom (Millennium)                         Arbitrary
Quadrics (in HP Alpha server clusters)       Fat tree
IBM SP                                       Fat tree (approx.)
SGI Origin                                   Hypercube
Intel Paragon                                2D mesh
BBN Butterfly                                Butterfly

Page 44:

Network: Performance Metrics

Network bandwidth
- Need high bandwidth in communication
- How does it scale with the number of nodes?

Communication latency
- Affects performance, since the processor may have to wait
- Affects ease of programming, since it requires more thought to overlap communication and computation

How can a mechanism help hide latency? Overlap the message send with computation, prefetch data, switch to another task or thread

Page 45:

Examples of many-core / PE architectures

SIMD: Xetal (320 PEs), IMAP (128 PEs), AnySP (Michigan Univ.)

VLIW: Itanium, TRIPS / EDGE, ADRES, ...

Multi-threaded (idea: hide long latencies): Denelcor HEP (1982), SUN Niagara (2005)

Multi-processor: RAW, PicoChip, Intel/AMD, GRID, farms, ...

Hybrid, like Imagine, GPUs, XC-Core; actually, most are hybrid!!

Page 46:

IMAP from NEC

NEC IMAP: SIMD
- 128 PEs
- Supports indirect addressing, e.g. LD r1, (r2)
- Each PE is a 5-issue VLIW

Page 47:

TRIPS (Austin Univ / IBM): a statically mapped dataflow architecture

Legend: R: register file; E: execution unit; D: data cache; I: instruction cache; G: global control

Page 48:

Compiling for TRIPS

1. Form hyperblocks (use unrolling, predication, and inlining to enlarge the scope)

2. Spatially map the operations of each hyperblock; registers are accessed at hyperblock boundaries

3. Schedule the hyperblocks

Page 49:

Multithreaded Categories

[Diagram: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading; colors distinguish threads 1-5 and idle slots.]

Intel calls this 'Hyperthreading'

Page 50:

SUN Niagara processing element

4 threads per processor: 4 copies of the PC logic, instruction buffer, store buffer, and register file

Page 51:

Really BIG: Jaguar, Cray XT5-HE, Oak Ridge National Lab

- 224,256 AMD Opteron cores
- 2.33 PetaFlop peak performance
- 299 TByte main memory
- 10 PetaByte disk
- 478 GB/s memory bandwidth
- 6.9 MegaWatt
- 3D torus
- TOP500 #1 (Nov 2009)

Page 52:

Graphic Processing Units (GPUs)

NVIDIA GT 340 (2010)

ATI 5970 (2009)

Page 53:

Why GPUs

Page 54:

In Need of TeraFlops?

3 x GTX295:
- 1440 PEs
- 5.3 TeraFlop

Page 55:

How Do GPUs Spend Their Die Area?

GPUs are designed to match the workload of 3D graphics.

J. Roca, et al., "Workload Characterization of 3D Games", IISWC 2006
T. Mitra, et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", Micro 1999

Die photo of GeForce GTX 280 (source: NVIDIA)

Page 56:

How Do CPUs Spend Their Die Area?

CPUs are designed for low latency instead of high throughput

Die photo of Intel Penryn (source: Intel)

Page 57:

GPU: Graphics Processing Unit

The Utah teapot: http://en.wikipedia.org/wiki/Utah_teapot

From polygon mesh to image pixel.

Page 58:

The Graphics Pipeline

K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498

Page 59:

The Graphics Pipeline

K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498

Page 60:

The Graphics Pipeline

K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498

Page 61:

The Graphics Pipeline

K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498

Page 62:

GPUs: what's inside?

Basically a SIMD:
- A single instruction stream operates on multiple data streams
- All PEs execute the same instruction at the same time
- PEs operate concurrently on their own piece of memory
- However, a GPU is far more complex!!

[Diagram: a control processor fetches from an instruction memory and broadcasts the current instruction (e.g., Add) to PE1 ... PE320, which reach a shared data memory through an interconnect.]

Page 63:

GPU Programming: NVIDIA CUDA example

Single-thread program:

float A[4][8];
do-all(i=0; i<4; i++){
    do-all(j=0; j<8; j++){
        A[i][j]++;
    }
}

CUDA program:

float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);

__device__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
}

- The CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
- Hardware converts TLP into DLP at run time.

Page 64:

System Architecture

Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008

Page 65:

NVIDIA Tesla Architecture (G80)

Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008

Page 66:

Texture Processor Cluster (TPC)

Page 67:

Deeply pipelined SM for high throughput

Let's start with a simple example: the execution of one instruction.

- One instruction is executed by a warp of 32 threads
- One warp is executed on 8 PEs over 4 shader cycles

Page 68:

Issue an Instruction for 32 Threads

Page 69:

Read Source Operands of 32 Threads

Page 70:

Buffer Source Operands to Op Collector

Page 71:

Execute Threads 0~7

Page 72:

Execute Threads 8~15

Page 73:

Execute Threads 16~23

Page 74:

Execute Threads 24~31

Page 75:

Write Back from Result Queue to Reg

Page 76:

Warp: Basic Scheduling Unit in Hardware

- One warp consists of 32 consecutive threads
- Warps are transparent to the programmer; they are formed at run time

Page 77:

Warp Scheduling

- Schedule at most 24 warps in an interleaved manner
- Zero overhead for the interleaved issue of warps

Page 78:

Handling Branches

Threads within a warp are free to branch.

if( $r17 > $r19 ){
    $r16 = $r20 + $r31
} else {
    $r16 = $r21 - $r32
}
$r18 = $r15 + $r16

(The assembly code on the slide is disassembled from the CUDA binary (cubin) using "decuda".)

Page 79:

Branch Divergence within a Warp

If threads within a warp diverge, both paths have to be executed.

Masks are set to filter out threads not executing on the current path.

Page 80:

GPU Programming: NVIDIA CUDA example

Single-thread program:

float A[4][8];
do-all(i=0; i<4; i++){
    do-all(j=0; j<8; j++){
        A[i][j]++;
    }
}

CUDA program:

float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);

__device__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
}

- The CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
- Hardware converts TLP into DLP at run time.

Page 81:

CUDA Programming

Both the grid and the thread block can have a two-dimensional index.

kernelF<<<(2,2),(4,2)>>>(A);

__device__ kernelF(A){
    i = gridDim.x * blockIdx.y + blockIdx.x;    /* linearized block index */
    j = blockDim.x * threadIdx.y + threadIdx.x; /* linearized thread index */
    A[i][j]++;
}

Page 82:

Mapping Thread Blocks to SMs

- One thread block can only run on one SM
- A thread block cannot migrate from one SM to another SM
- Threads of the same thread block can share data using shared memory

Example: mapping 12 thread blocks onto 4 SMs.

Page 83:

Mapping Thread Blocks (0,0)/(0,1)/(0,2)/(0,3)

Page 84:

CUDA Compilation Trajectory

- cudafe: CUDA front end
- nvopencc: customized Open64 compiler for CUDA
- ptx: high-level assembly code (documented)
- ptxas: ptx assembler
- cubin: CUDA binary

decuda: http://wiki.github.com/laanwj/decuda

Page 85:

Optimization Guide

Optimizations on memory latency tolerance
- Reduce register pressure
- Reduce shared memory pressure

Optimizations on memory bandwidth
- Global memory coalescing
- Shared memory bank conflicts
- Grouping byte accesses
- Avoid partition camping

Optimizations on computation efficiency
- Mul/Add balancing
- Increase floating point proportion

Optimizations on operational intensity
- Use tiled algorithms
- Tune thread granularity

Page 86:

Global Memory: Coalesced Access

NVIDIA, "CUDA Programming Guide"

[Figure: perfectly coalesced access patterns; coalescing still allows individual threads to skip their LD/ST.]

Page 87:

Global Memory: Non-Coalesced Access

NVIDIA, "CUDA Programming Guide"

[Figure: non-coalesced patterns: non-consecutive addresses, a starting address not aligned to 128 bytes, and a stride larger than one word.]
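The effect of access strides can be estimated on paper. The hedged C sketch below counts how many 128-byte memory segments one 32-thread warp touches for a given stride in 4-byte words, under the simplifying assumption that each touched segment costs one memory transaction (the segment size matches the coalescing rules of this GPU generation):

#include <stdio.h>

/* number of distinct 128-byte segments a warp of 32 threads touches */
static int segments_touched(int stride_words) {
    int touched[1024] = { 0 };
    int count = 0;
    for (int t = 0; t < 32; t++) {
        int byte_addr = t * stride_words * 4;   /* thread t's address */
        int seg = byte_addr / 128;
        if (!touched[seg]) { touched[seg] = 1; count++; }
    }
    return count;
}

int main(void) {
    printf("stride  1 word : %2d segment(s)\n", segments_touched(1));  /* 1, coalesced  */
    printf("stride  2 words: %2d segment(s)\n", segments_touched(2));  /* 2 */
    printf("stride 32 words: %2d segment(s)\n", segments_touched(32)); /* 32, worst case */
    return 0;
}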

Page 88:

Shared Memory: without Bank Conflict

NVIDIA, "CUDA Programming Guide"

[Figure: conflict-free patterns: one access per bank; one access per bank with shuffling; all threads reading the same address (broadcast); partial broadcast while skipping some banks.]

Page 89:

Shared Memory: with Bank Conflict

NVIDIA, "CUDA Programming Guide"

[Figure: conflicting patterns: more than one address accessed in the same bank; broadcast of more than one address per bank.]
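Bank conflicts can be predicted the same way. Below is a hedged C sketch for a G80-style shared memory with 16 banks of 32-bit words, accessed by a half-warp of 16 threads: the worst per-bank collision count tells how many times the access is serialized.

#include <stdio.h>

/* worst-case number of threads hitting the same bank for a given stride */
static int max_bank_conflict(int stride_words) {
    int hits[16] = { 0 };
    int worst = 0;
    for (int t = 0; t < 16; t++) {            /* one half-warp */
        int bank = (t * stride_words) % 16;   /* bank = word index mod 16 */
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;   /* 1 = conflict-free, k = k-way conflict */
}

int main(void) {
    printf("stride 1: %d-way\n", max_bank_conflict(1));   /* conflict-free */
    printf("stride 2: %d-way\n", max_bank_conflict(2));   /* 2-way conflict */
    printf("stride 8: %d-way\n", max_bank_conflict(8));   /* 8-way conflict */
    return 0;
}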

Page 90:

Optimizing MatrixMul

Matrix multiplication example from the 5kk70 course at TU/e. The CUDA@MIT course also provides matrix multiplication as a hands-on example.

Page 91:

ATI Cypress (RV870): 1600 shader ALUs

ref: Tom's Hardware

Page 92:

ATI Cypress (RV870): VLIW PEs

ref: Tom's Hardware

Page 93:

Intel Larrabee: x86 cores, 8/16/32 cores

Larry Seiler, et al., "Larrabee: a many-core x86 architecture for visual computing", SIGGRAPH 2008

Page 94:

CELL

PS3

[Diagram: PlayStation 3 system: the Cell Broadband Engine (3.2 GHz) connects to the NVIDIA RSX reality synthesizer (15 GB/s and 20 GB/s links), to XDR main memory (64 pins x 3.2 Gbps/pin = 25.6 GB/s), and via the south bridge (2 x 2.5 GB/s) to drives, USB, network, and media; the RSX has its own GDDR3 video memory (128 pins x 1.4 Gbps/pin = 22.4 GB/s).]

Page 95:

CELL – the architecture

1 x PPE, 64-bit PowerPC
- L1: 32 KB I$ + 32 KB D$
- L2: 512 KB

8 x SPE cores
- Local store: 256 KB each
- 128 x 128-bit vector registers

Hybrid memory model:
- PPE: Rd/Wr
- SPEs: asynchronous DMA

EIB: 205 GB/s sustained aggregate bandwidth

Processor-to-memory bandwidth: 25.6 GB/s

Processor-to-processor: 20 GB/s in each direction

Page 96:

Page 97:

Intel / AMD x86 – Historical overview

Page 98:

Nehalem architecture

In novel processors: Core i7 & Xeon 5500s

- Quad core
- 3 cache levels
- 2 TLB levels
- 2 branch predictors
- Out-of-order execution
- Simultaneous multithreading
- DVFS: dynamic voltage & frequency scaling

[Die photo with one core highlighted.]

Page 99:

Nehalem pipeline (1/2)

[Block diagram: instruction fetch and predecode -> instruction queue -> decode (fed by a micro-code ROM) -> rename/alloc -> scheduler -> three execution-unit clusters plus load and store units; a retirement unit (re-order buffer) completes instructions; loads and stores go through the L1D cache and DTLB, then the L2 cache, then an L3 cache that is inclusive and shared by all cores; QPI: Quick Path Interconnect (2 x 20 bit).]

Page 100:

Nehalem pipeline (2/2)

Page 101:

Tylersburg: connecting 2 quad-cores

[Diagram: two quad-core sockets; each core has its own L1D/L1I and a unified L2 (L2U); each socket shares an inclusive L3 (L3U) and has a memory controller to DDR3 main memory; QPI links connect the two sockets and the IOH.]

Level  Capacity     Associativity (ways)  Line size (bytes)  Access latency (clocks)  Access throughput (clocks)  Write update policy
L1D    4 x 32 KiB   8                     64                 4                        1                           Writeback
L1I    4 x 32 KiB   4                     N/A                N/A                      N/A                         N/A
L2U    4 x 256 KiB  8                     64                 10                       Varies                      Writeback
L3U    1 x 8 MiB    16                    64                 35-40                    Varies                      Writeback

Page 102:

Programming these architectures: N-tap FIR

out[i] = Σ_{j=0}^{N-1} in[i+j] · coeff[j]

C-code:

int i, j;
for (i = 0; i < M; i ++){
    out[i] = 0;
    for (j = 0; j < N; j ++)
        out[i] += in[i+j]*coeff[j];
}

Page 103:

[Diagram: vectorized 4-tap FIR dataflow: output vectors Y0 ... Y11 are formed from input vectors X0 ... X11 by multiplying successively shifted inputs with the broadcast coefficients C0-C3 and summing the four partial products.]

Page 104:

FIR with x86 SSE Intrinsics

__m128 X, XH, XL, Y, C, H;
int i, j;
for(i = 0; i < (M/4); i ++){
    XL = _mm_load_ps(&in[i*4]);
    Y  = _mm_setzero_ps();
    for(j = 0; j < (N/4); j ++){
        XH = XL;
        XL = _mm_load_ps(&in[(i+j+1)*4]);
        C  = _mm_load_ps(&coeff[j*4]);

        /* tap 0: broadcast coefficient, multiply, accumulate */
        H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(0,0,0,0));
        X = _mm_mul_ps(XH, H);
        Y = _mm_add_ps(Y, X);

        /* taps 1-3: also shift the input window with _mm_alignr_epi8
           (note: _mm_alignr_epi8 operates on integer vectors, so real
           code needs _mm_castps_si128 / _mm_castsi128_ps around it) */
        H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(1,1,1,1));
        X = _mm_alignr_epi8(XL, XH, 4);
        X = _mm_mul_ps(X, H);
        Y = _mm_add_ps(Y, X);

        H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(2,2,2,2));
        X = _mm_alignr_epi8(XL, XH, 8);
        X = _mm_mul_ps(X, H);
        Y = _mm_add_ps(Y, X);

        H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(3,3,3,3));
        X = _mm_alignr_epi8(XL, XH, 12);
        X = _mm_mul_ps(X, H);
        Y = _mm_add_ps(Y, X);
    }
    _mm_store_ps(&out[i*4], Y);
}

[Diagram: Y = X·C0 + shifted(X)·C1 + shifted(X)·C2 + shifted(X)·C3, with each coefficient broadcast across a 4-wide vector.]

Page 105:

FIR using pthreads

pthread_t fir_threads[N_THREAD];
fir_arg fa[N_THREAD];
tsize = M/N_THREAD;
for(i = 0; i < N_THREAD; i ++){
    /* ... initialize thread parameters fa[i] ... */
    rc = pthread_create(&fir_threads[i], NULL,
                        fir_kernel, (void *)&fa[i]);
}
for(i = 0; i < N_THREAD; i ++){
    rc = pthread_join(fir_threads[i], &status);
}

[Diagram: the input is split over threads T0-T3, each running the sequential or vectorized FIR kernel, followed by a join.]

Page 106:

x86 FIR speedup

- On Intel Core 2 Quad Q8300, gcc optimization level 2
- Input: ~5M samples
- #threads in the pthreads version: 4

Page 107:

FIR kernel on a CELL SPE

Vectorization is similar to SSE:

vector float X, XH, XL, Y, H;
int i, j;
for(i = 0; i < (M/4); i ++){
    XL = in[i];
    Y = spu_splats(0.0f);
    for(j = 0; j < (N/4); j ++){
        XH = XL;
        XL = in[i+j+1];
        H = spu_splats(coeff[j*4]);
        Y = spu_madd(XH, H, Y);           /* fused multiply-add */

        H = spu_splats(coeff[j*4+1]);
        X = spu_shuffle(XH, XL, SHUFFLE_X1);
        Y = spu_madd(X, H, Y);

        H = spu_splats(coeff[j*4+2]);
        X = spu_shuffle(XH, XL, SHUFFLE_X2);
        Y = spu_madd(X, H, Y);

        H = spu_splats(coeff[j*4+3]);
        X = spu_shuffle(XH, XL, SHUFFLE_X3);
        Y = spu_madd(X, H, Y);
    }
    out[i] = Y;
}

Page 108:

SPE DMA double buffering

[Timeline: get iBuf0, get iBuf1; then repeatedly: use iBuf0 / write to oBuf0 while fetching the next block into iBuf1 and putting oBuf1, and vice versa, so DMA transfers overlap computation.]

float iBuf[2][BUF_SIZE];
float oBuf[2][BUF_SIZE];
int idx = 0;
int buffers = size/BUF_SIZE;
mfc_get(iBuf[idx], argp,
        BUF_SIZE*sizeof(float),
        tag[idx], 0, 0);
for(int i = 1; i < buffers; i++){
    wait_for_dma(tag[idx]);
    next_idx = idx^1;
    mfc_get(iBuf[next_idx], argp,
            BUF_SIZE*sizeof(float), 0, 0, 0);
    fir_kernel(oBuf[idx], iBuf[idx],
               coeff, BUF_SIZE, taps);
    mfc_put(oBuf[idx], outbuf,
            BUF_SIZE*sizeof(float),
            tag[idx], 0, 0);
    idx = next_idx;
}
/* finish up the last block ... */

Page 109:

CELL FIR speedup

- On PlayStation 3: a CELL with six accessible SPEs
- Input: ~6M samples
- Speedup compared to a scalar implementation on the PPE

Page 110:

Roofline Model

Introduced by Samuel Williams and David Patterson.

[Plot: performance in GFlops/sec vs. operational intensity in Flops/Byte, log-log; a sloped peak-bandwidth roof meets the flat peak-performance roof at the ridge point, which marks the balanced architecture for a given application.]

Attainable GFlops/sec = min(peak performance, peak bandwidth x operational intensity)
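The model is simple enough to evaluate directly. Here is a small C sketch of the min() formula above, with illustrative (made-up) peak numbers:

#include <stdio.h>

/* attainable performance is the lower of the two roofs */
static double roofline(double peak_gflops, double peak_gbs, double oi) {
    double mem_bound = peak_gbs * oi;   /* bandwidth roof at this intensity */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    double peak = 100.0;   /* GFlop/s, illustrative */
    double bw   = 25.0;    /* GB/s, illustrative */
    printf("ridge point at %.1f flops/byte\n", peak / bw);        /* 4.0 */
    printf("OI 0.5 -> %5.1f GFlop/s\n", roofline(peak, bw, 0.5)); /* memory-bound */
    printf("OI 8.0 -> %5.1f GFlop/s\n", roofline(peak, bw, 8.0)); /* compute-bound */
    return 0;
}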

Page 111:

Roofline Model of GT8800 GPU

Page 112:

Roofline Model

Threads of one warp diverge into different paths at a branch.

Page 113:

Roofline Model

In the G80 architecture, a non-coalesced global memory access is separated into 16 accesses.

Page 114:

Roofline Model

Previous examples assume that memory latency can be hidden; otherwise the program can be latency-bound.

Z. Guz, et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Comp. Arch. Letters, 2009
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009

rm: fraction of memory instructions in the total instruction mix
tavg: average memory latency
CPIexe: cycles per instruction (execution)

- There is one memory instruction in every (1/rm) instructions.
- There is one memory instruction every (1/rm) x CPIexe cycles.
- It takes (tavg x rm / CPIexe) threads to hide the memory latency.
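Plugging in some assumed numbers shows the scale of the thread count this estimate asks for (all three parameters below are assumptions for illustration):

#include <stdio.h>

int main(void) {
    double rm      = 0.2;     /* fraction of memory instructions (assumed) */
    double tavg    = 400.0;   /* average memory latency in cycles (assumed) */
    double cpi_exe = 1.0;     /* execution CPI (assumed) */

    /* one memory instruction every (1/rm) * CPIexe cycles, each taking
       tavg cycles, so this many threads keep the pipeline busy: */
    printf("threads needed ~ %.0f\n", tavg * rm / cpi_exe);   /* 80 */
    return 0;
}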

Page 115:

Roofline Model

If there are not enough threads to hide the memory latency, memory latency can become the bottleneck.

Samuel Williams, "Auto-tuning Performance on Multicore Computers", PhD thesis, UC Berkeley, 2008
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009

Page 116:

Four Architectures

[Block diagrams of the four systems compared:
- AMD Barcelona: two sockets of four Opteron cores each (512 KB victim cache per core, 2 MB shared quasi-victim cache (32-way), SRI / crossbar), HyperTransport links (4 GB/s each direction) between sockets, and per socket 2 x 64b memory controllers to 667 MHz DDR2 DIMMs at 10.66 GB/s.
- Sun Victoria Falls: two sockets of eight MT SPARC cores each behind a crossbar (179 GB/s read / 90 GB/s write), 4 MB shared L2 (16-way, 64b interleaved), 4 coherency hubs (8 x 6.4 GB/s, 1 per hub per direction), and 2 x 128b controllers to 667 MHz FBDIMMs at 21.33 GB/s read / 10.66 GB/s write.
- NVIDIA G80: eight thread clusters, a 192 KB L2 (textures only), 24 ROPs, and 6 x 64b memory controllers to 768 MB of 900 MHz GDDR3 device DRAM at 86.4 GB/s.
- IBM Cell blade: two Cell processors, each with a VMT PPE (512 KB L2) and eight SPEs (256 KB local store + MFC each) on the EIB ring network, XDR memory controllers to 512 MB XDR DRAM at 25.6 GB/s per processor, and a BIF link (<20 GB/s each direction) between the two.]

Page 117:

32b Rooflines for the Four (in-core parallelism)

Single-precision Roofline models for the SMPs used in this work, based on micro-benchmarks, experience, and manuals. Ceilings = in-core parallelism. Can the compiler find all this parallelism? NOTE: log-log scale; assumes perfect SPMD.

[Four roofline plots (AMD Barcelona, Sun Victoria Falls, NVIDIA G80, IBM Cell blade): attainable GFlop/s (32b, 4 to 512) vs. flop:DRAM byte ratio (1/8 to 16). Compute ceilings shown include peak SP, mul/add imbalance, w/out FMA, w/out SIMD, and w/out ILP; bandwidth ceilings shown include w/out memory coalescing (G80), w/out NUMA, w/out SW prefetch, and w/out DMA concurrency (Cell).]

Page 118:

Let's conclude: Trends

Reliability + fault tolerance: requires run-time management, process migration

Power is the new metric: low-power management at all levels - scenarios, subthreshold operation, back biasing, ...

Virtualization (1): do not disturb other applications (composability)

Virtualization (2): one virtual target platform avoids the porting problem; one intermediate format supporting multiple targets; huge run-time management support, JIT compilation, multiple OSes

Compute servers

Transactional memory

3D: integrate different dies

Page 119:

3D using Through Silicon Vias (TSV)

- Using TSVs: face-to-back stacking (scalable)
- Flip-chip: face-to-face (limited to 2 die tiers)
- 4 um TSV pitch in 2011 (ITRS 2007)
- Can enlarge the effective device area

from Woo e.a., HPCA 2009

Page 120:

Don't forget Amdahl

However, see next slide!
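Amdahl's law bounds the speedup of a program with parallel fraction f on N cores: speedup(N) = 1 / ((1 - f) + f/N). A minimal C sketch with an illustrative f:

#include <stdio.h>

static double amdahl(double f, double n) {
    return 1.0 / ((1.0 - f) + f / n);   /* Amdahl's law */
}

int main(void) {
    double f = 0.95;   /* 95% of the work parallelizes (illustrative) */
    printf("N = 8:    %.1fx\n", amdahl(f, 8.0));    /* ~5.9x  */
    printf("N = 64:   %.1fx\n", amdahl(f, 64.0));   /* ~15.4x */
    printf("N -> inf: %.1fx\n", 1.0 / (1.0 - f));   /* 20x ceiling */
    return 0;
}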

Page 121:

Trends: homogeneous vs heterogeneous: where do we go?

Homogeneous:
- Easier to program
- Favored by DLP / vector parallelism
- Fault tolerant / task migration

Heterogeneous:
- Energy efficiency demands it
- Higher speedup: Amdahl++ (see Hill and Marty, HPCA'08, on Amdahl's law in the multi-core era)

Memory-dominated designs suggest a homogeneous sea of heterogeneous cores

A sea of reconfigurable compute or processor blocks? Many examples: Smart Memory, SmartCell, PicoChip, MathStar FPOA, Stretch, XPP, etc.

Page 122:

How does a future architecture look?

- A couple of high-performance (low-latency) cores: sequential code should also run fast
- A whole battery of wide vector processors
- Some shared memory (to reduce copying of large data structures); levels 2 and 3 in 3D technology; huge bandwidth; exploit large vectors
- Accelerators for dedicated domains
- OS support (runtime mapping, DVFS, use of accelerators)

Page 123:

But the real problem is ...

Programming parallel is the real bottleneck: new programming models, like transaction-based programming

That's what we will talk about this week...

Page 124: