Introduction to
Many-Core Architectures
Henk Corporaal, www.ics.ele.tue.nl/~heco
ASCI Winterschool on Embedded Systems
Soesterberg, March 2010
Intel Trends (K. Olukotun)
[Figure: Intel processor trends over time (transistors, clock frequency, power, ILP); a recent data point: Core i7, 3 GHz, 100 W.]
System-level integration (Chuck Moore, AMD, at MICRO 2008)
• Single-chip CPU era (1986-2004): extreme focus on single-threaded performance; multi-issue, out-of-order execution plus a moderate cache hierarchy.
• Chip Multiprocessor (CMP) era (2004-2010): early: hasty integration of multiple cores into the same chip/package; mid-life: address some of the HW scalability and interference issues; current: homogeneous CPUs plus moderate system-level functionality.
• System-level integration era (~2010 onward): integration of substantial system-level functionality; heterogeneous processors and accelerators; introspective control systems for managing on-chip resources & events.
Why many-core?
• Running into the frequency wall, the ILP wall, the memory wall, and the energy wall.
• Chip area enabler: Moore's law goes well below 22 nm; what to do with all this area? Multiple processors fit easily on a single die.
• Application demands.
• Cost effective (just connect existing processors or processor cores).
• Low power: parallelism may allow lowering Vdd; performance/Watt is the new metric!
Low power through parallelism
• Sequential processor: switching capacitance C, frequency f, voltage V, so P1 = f·C·V²
• Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V' < V, so P2 = (f/2)·2C·V'² = f·C·V'² < P1
[Figure: one CPU versus two CPUs (CPU1, CPU2) running at half the frequency.]
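To make the saving concrete, here is the same calculation with assumed numbers (V = 1.2 V and V' = 0.8 V are illustrative values, not from the slide):

    P2 / P1 = (V'/V)^2 = (0.8 / 1.2)^2 ≈ 0.44

So the two half-speed processors deliver (ideally) the same throughput at less than half the power, provided the workload actually parallelizes over both units.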
How low Vdd can we go?
Subthreshold JPEG encoder, Vdd 0.4 - 1.2 V
[Figure: energy per operation (pJ/operation, 0.0-8.0) versus supply voltage (1.2 V down to 0.4 V) for four engines, with annotated points at 3.4x, 4.4x, 5.6x, and 8.3x.]
Computational efficiency: how many MOPS/Watt?
Yifan He et al., DAC 2010
Computational efficiency: what do we need?
[Figure: performance (Gops) versus power (Watts) on log-log axes, with iso-efficiency lines at 1, 10, 100, and 1000 Mops/mW (better power efficiency toward the top-left); plotted designs include SODA (65 nm), SODA (90 nm), TI C6X, Imagine, VIRAM, Pentium M, and IBM Cell; application requirements shown for 3G wireless, 4G wireless, and mobile HD video.]
Woh et al., ISCA 2009
Intel's opinion: 48-core x86
Outline
• Classifications of parallel architectures
• Examples: various (research) architectures, GPUs, Cell, Intel multi-cores
• How much performance do you really get? The Roofline model
• Trends & conclusions
Classifications
• Performance / parallelism driven: the 4-5-dimensional design space, Flynn's taxonomy
• Communication & memory: message passing versus shared memory; shared-memory issues: coherency, consistency, synchronization
• Interconnect
Flynn's Taxonomy
• SISD (Single Instruction, Single Data): uniprocessors.
• SIMD (Single Instruction, Multiple Data): vector architectures also belong to this class; multimedia extensions (MMX, SSE, VIS, AltiVec, …); examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, …
• MISD (Multiple Instruction, Single Data): systolic arrays / stream-based processing.
• MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin; flexible and most widely used.
Flynn's Taxonomy
[Figure: the four Flynn classes in a diagram.]
Enhance performance: 4 architecture methods
• (Super)-pipelining
• Powerful instructions:
  - MD-technique: multiple data operands per operation
  - MO-technique: multiple operations per instruction
• Multiple instruction issue:
  - single stream: superscalar
  - multiple streams: a single core with multiple threads (simultaneous multithreading), or multiple cores
Architecture methods: Pipelined Execution of Instructions
Purpose of pipelining:
• Reduce the number of gate levels in the critical path.
• Reduce CPI close to one (instead of a large number, as for the multicycle machine).
• More efficient hardware.
Problems: hazards cause pipeline stalls.
• Structural hazards: add more hardware.
• Control hazards, branch penalties: use branch prediction.
• Data hazards: bypassing required.
Simple 5-stage pipeline: IF (instruction fetch), DC (instruction decode), RF (register fetch), EX (execute instruction), WB (write result register).
[Figure: four instructions flowing through the IF-DC-RF-EX-WB stages over cycles 1-8.]
Architecture methods: Pipelined Execution of Instructions - Superpipelining
Superpipelining: split one or more of the critical pipeline stages.
Superpipelining degree S:

    S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)

where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op.
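As a quick check of the definition, here is a worked instance with an assumed operation mix (the mix is illustrative, not from the slides): 50% single-cycle ALU operations, 30% two-cycle loads, and 20% three-cycle branches give

    S = 0.5·1 + 0.3·2 + 0.2·3 = 1.7

i.e., on average 1.7 operations of a single stream are in flight at once.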
Architecture methods: Powerful Instructions (1) - MD-technique
Multiple data operands per operation (SIMD: Single Instruction Multiple Data).
Vector instruction:

    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];

or

    c = a + 5*b

Assembly:

    set vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
Architecture methods: Powerful Instructions (1) - SIMD computing
• All PEs (Processing Elements) execute the same operation.
• Typical mesh or hypercube connectivity.
• Exploits the data locality of e.g. image processing applications.
• Dense encoding (few instruction bits needed).
[Figure: SIMD execution method; instructions 1, 2, 3, …, n issued over time, each executed simultaneously on PE1, PE2, …, PEn.]
Architecture methods: Powerful Instructions (1) - Sub-word parallelism
• SIMD on a restricted scale, used for multimedia instructions.
• Examples: MMX, SSE, SUN VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II.
• Example operation: a sum of absolute differences, Σ_{i=1..4} |a_i - b_i|, with the four subtractions and absolute values done on sub-words in parallel.
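A minimal C sketch of this sub-word idea, using the x86 SSE2 intrinsics named above (the 16-byte SAD instruction generalizes the 4-element example; the array contents are made up for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <emmintrin.h>                       /* SSE2 intrinsics */

    int main(void)
    {
        /* 16 unsigned-byte sub-words per operand; one instruction
           processes all of them in parallel. */
        uint8_t a[16] = { 10, 20, 30, 40 };      /* rest is zero */
        uint8_t b[16] = {  7, 25, 30, 50 };

        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);

        /* psadbw: sums |a_i - b_i| over bytes 0..7 into the low 64 bits
           and over bytes 8..15 into the high 64 bits. */
        __m128i sad = _mm_sad_epu8(va, vb);

        printf("SAD = %lld\n",
               (long long)_mm_cvtsi128_si64(sad));   /* 3+5+0+10 = 18 */
        return 0;
    }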
Architecture methods: Powerful Instructions (2) - MO-technique: multiple operations per instruction
Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example, one field per function unit:
    FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13
VLIW architecture: central Register File
[Figure: nine execution units grouped into three issue slots, all reading from and writing to one central register file.]
Q: How many ports does the register file need for n-issue? (With two source operands and one result per operation: roughly 2n read ports and n write ports; this is why the central register file becomes the bottleneck of wide VLIWs.)
Architecture methods: Multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e., which instructions can be executed in parallel?
• The user specifies multiple instruction streams: multi-processor, MIMD (Multiple Instruction, Multiple Data).
• The hardware detects ready instructions at run time: superscalar.
• The compiler compiles into a dataflow representation: dataflow processors.
Four-dimensional representation of the architecture design space <I, O, D, S>
• I: instructions/cycle
• O: operations/instruction
• D: data/operation
• S: superpipelining degree
[Figure: the four axes, with superscalar, MIMD, and dataflow along I (up to ~100), VLIW along O (~10), SIMD (~100) and vector (~10) along D, and superpipelined designs along S (~10); RISC sits near the origin and CISC below it.]
Architecture design space
Example values of <I, O, D, S> for different architectures; Mpar = I·O·D·S is the amount of parallelism you should exploit (with S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op), as before):

Architecture    I     O     D     S     Mpar
CISC            0.2   1.2   1.1   1     0.26
RISC            1     1     1     1.2   1.2
VLIW            1     10    1     1.2   12
Superscalar     3     1     1     1.2   3.6
SIMD            1     1     128   1.2   154
MIMD            32    1     1     1.2   38
GPU             32    2     8     24    12288
Top500 Jaguar   ???
Communication
A parallel architecture extends traditional computer architecture with a communication network, providing:
• abstractions (HW/SW interface);
• an organizational structure to realize the abstraction efficiently.
[Figure: processing nodes connected by a communication network.]
Communication models: Shared Memory
• Coherence problem
• Memory consistency issue
• Synchronization problem
[Figure: processes P1 and P2 both reading and writing one shared memory.]
Communication models: Shared memory
• Shared address space.
• Communication primitives: load, store, atomic swap.
• Two varieties:
  - physically shared: Symmetric Multi-Processors (SMP), usually combined with local caching;
  - physically distributed: Distributed Shared Memory (DSM).
SMP: Symmetric Multi-Processor
• Memory: centralized with uniform access time (UMA), bus interconnect, I/O.
• Examples: Sun Enterprise 6000, SGI Challenge, Intel.
[Figure: processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network.]
DSM: Distributed Shared Memory
• Non-uniform access time (NUMA) and a scalable interconnect (distributed memory).
[Figure: nodes of processor, cache, and local memory connected through an interconnection network, plus an I/O system.]
Shared Address Model Summary
• Each processor can name every physical location in the machine.
• Each process can name all data it shares with other processes.
• Data transfer via load and store.
• Data size: byte, word, …, or cache blocks.
• The memory hierarchy model applies: communication moves data into the local processor cache.
Three fundamental issues for shared-memory multiprocessors
• Coherence: do I see the most recent data?
• Consistency: when do I see a written value? E.g., do different processors see writes at the same time (w.r.t. other memory accesses)?
• Synchronization: how to synchronize processes, and how to protect access to shared data?
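A minimal C sketch of why the consistency and synchronization issues bite (names and values are illustrative; producer and consumer are meant to run on two different processors):

    #include <stdio.h>

    int data = 0;
    int flag = 0;

    void producer(void)          /* runs on processor 1 */
    {
        data = 42;               /* (1) write the payload         */
        flag = 1;                /* (2) signal "data is ready"    */
    }                            /* nothing orders (1) before (2) */

    void consumer(void)          /* runs on processor 2 */
    {
        while (flag == 0)
            ;                    /* spin until the flag is seen   */
        printf("%d\n", data);    /* may print 0 on a weakly
                                    ordered machine               */
    }

Without a synchronization primitive (lock, barrier, or memory fence) ordering the two writes with respect to the reads, processor 2 may observe flag == 1 while still seeing stale data: exactly the consistency and synchronization problems listed above.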
Communication models: Message Passing
• Communication primitives: e.g. send and receive library calls.
• Standard: MPI (Message Passing Interface), www.mpi-forum.org.
• Note that MP can be built on top of SM, and vice versa!
[Figure: processes P1 and P2 exchanging messages through FIFOs via send and receive.]
Message Passing Model
• Explicit message send and receive operations.
• Send specifies a local buffer + the receiving process on the remote computer.
• Receive specifies the sending process on the remote computer + the local buffer to place the data in.
• Typically blocking communication, but DMA may be used.
Message structure: header | data | trailer.
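A minimal MPI sketch of this send/receive pairing (a two-rank ping using only standard MPI calls; the tag value and message text are arbitrary):

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int  rank;
        char buf[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            strcpy(buf, "hello from rank 0");
            /* send: local buffer + receiving process (rank 1) */
            MPI_Send(buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive: sending process (rank 0) + local buffer */
            MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 got: %s\n", buf);
        }

        MPI_Finalize();
        return 0;
    }

Both calls here are blocking, matching the "typically blocking communication" point above; compile with mpicc and run with mpirun -np 2.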
Message passing communication
[Figure: nodes, each containing a processor, cache, memory, and DMA engine, attached through network interfaces to an interconnection network.]
Communication Models: Comparison
Shared memory:
• compatibility with well-understood language mechanisms
• ease of programming for complex or dynamic communication patterns
• shared-memory applications; sharing of large data structures
• efficient for small items
• supports hardware caching
Message passing:
• simpler hardware
• explicit communication
• implicit synchronization (with any communication)
Interconnect
How to connect your cores? Some options:
• Connect everybody:
  - single bus
  - hierarchical bus
  - NoC: multi-hop via routers; any topology possible; an easy 2D layout helps
• Connect with e.g. neighbors only, for instance using a shift operation in SIMD, or using dual-ported memories to connect 2 cores.
Bus (shared) or Network (switched)
Network: claimed to be more scalable (no bus arbitration, point-to-point connections), but with router overhead.
[Figure: example NoC with a 2x4 mesh routing network; eight nodes, each attached to its own router R.]
Historical Perspective
Early machines were collections of microprocessors:
• Communication was performed using bi-directional queues between nearest neighbors.
• Messages were forwarded by processors on the path: "store and forward" networking.
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops and thereby minimize time.
Design Characteristics of a Network
• Topology (how things are connected): crossbar, ring, 2-D and 3-D meshes or tori, hypercube, tree, butterfly, perfect shuffle, …
• Routing algorithm (path used), e.g. in a 2D torus: all east-west, then all north-south (avoids deadlock).
• Switching strategy:
  - circuit switching: the full path is reserved for the entire message, like the telephone;
  - packet switching: the message is broken into separately-routed packets, like the post office.
• Flow control and buffering (what if there is congestion): stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
• QoS guarantees, error handling, etc.
Switch / Network Topology
Topology determines:
• Degree: number of links from a node.
• Diameter: max number of links crossed between nodes.
• Average distance: number of links to a random destination.
• Bisection: minimum number of links that separate the network into two halves.
• Bisection bandwidth = link bandwidth × bisection.
Bisection Bandwidth
• Bisection bandwidth: the bandwidth across the smallest cut that divides the network into two equal halves, i.e. the bandwidth across the "narrowest" part of the network.
• [Figure: a bisection cut versus a cut that is not a bisection cut; for a linear array the bisection bw equals the link bw, while for a 2D mesh of n nodes the bisection bw = sqrt(n) × link bw.]
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others.
Common Topologies (N = number of nodes, n = dimension)

Type        Degree    Diameter         Ave. distance    Bisection
1D mesh     2         N-1              N/3              1
2D mesh     4         2(N^1/2 - 1)     2·N^1/2 / 3      N^1/2
3D mesh     6         3(N^1/3 - 1)     3·N^1/3 / 3      N^2/3
nD mesh     2n        n(N^1/n - 1)     n·N^1/n / 3      N^(n-1)/n
Ring        2         N/2              N/4              2
2D torus    4         N^1/2            N^1/2 / 2        2·N^1/2
Hypercube   log2 N    n = log2 N       n/2              N/2
2D tree     3         2·log2 N         ~2·log2 N        1
Crossbar    N-1       1                1                N^2/2
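Reading one row of the table for a concrete size, a 2D mesh with N = 64 nodes (an 8x8 grid):

    diameter      = 2(64^1/2 - 1) = 14 hops
    avg. distance = 2·64^1/2 / 3 ≈ 5.3 hops
    bisection     = 64^1/2 = 8 links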
Topologies in Real High-End Machines (listed roughly from newer to older)

Machine                                      Topology
Red Storm (Opteron + Cray network, future)   3D mesh
Blue Gene/L                                  3D torus
SGI Altix                                    fat tree
Cray X1                                      4D hypercube (approx.)
Myricom (Millennium)                         arbitrary
Quadrics (in HP Alpha server clusters)       fat tree
IBM SP                                       fat tree (approx.)
SGI Origin                                   hypercube
Intel Paragon                                2D mesh
BBN Butterfly                                butterfly
Network: Performance metrics
• Network bandwidth: need high bandwidth in communication; how does it scale with the number of nodes?
• Communication latency: affects performance, since the processor may have to wait; affects ease of programming, since it requires more thought to overlap communication and computation.
• How can a mechanism help hide latency? Overlap message send with computation, prefetch data, switch to another task or thread.
Examples of many-core / PE architectures
• SIMD: Xetal (320 PEs), IMAP (128 PEs), AnySP (Univ. of Michigan)
• VLIW: Itanium, TRIPS / EDGE, ADRES
• Multi-threaded (idea: hide long latencies): Denelcor HEP (1982), SUN Niagara (2005)
• Multi-processor: RAW, PicoChip, Intel/AMD, GRID, farms, …
• Hybrid, like Imagine, GPUs, XC-Core; actually, most are hybrid!
IMAP from NEC
NEC IMAP: a SIMD architecture
• 128 PEs
• Supports indirect addressing, e.g. LD r1, (r2)
• Each PE is a 5-issue VLIW
TRIPS (Univ. of Texas at Austin / IBM): a statically mapped dataflow architecture
[Figure legend: R: register file, E: execution unit, D: data cache, I: instruction cache, G: global control.]
Compiling for TRIPS
1. Form hyperblocks (use unrolling, predication, and inlining to enlarge the scope).
2. Spatially map the operations of each hyperblock; registers are accessed at hyperblock boundaries.
3. Schedule the hyperblocks.
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading; threads 1-5 shown in different shades, with idle slots.]
Intel calls simultaneous multithreading "Hyperthreading".
SUN Niagara processing element
• 4 threads per processor
• 4 copies of the PC logic, instruction buffer, store buffer, and register file
Really BIG: Jaguar, the Cray XT5-HE at Oak Ridge National Lab
• 224,256 AMD Opteron cores
• 2.33 PetaFlop peak performance
• 299 TByte main memory
• 10 PetaByte disk
• 478 GB/s memory bandwidth
• 6.9 MegaWatt
• 3D torus interconnect
• TOP500 #1 (Nov 2009)
Graphics Processing Units (GPUs)
[Photos: NVIDIA GT 340 (2010) and ATI 5970 (2009).]
Why GPUs
In Need of TeraFlops?
3 × GTX295:
• 1440 PEs
• 5.3 TeraFlop
How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
[Die photo of the GeForce GTX 280 (source: NVIDIA).]
J. Roca, et al., "Workload Characterization of 3D Games", IISWC 2006
T. Mitra, et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", MICRO 1999
How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high throughput.
[Die photo of Intel Penryn (source: Intel).]
GPU: Graphics Processing Unit
From polygon mesh to image pixels.
[Figure: the Utah teapot, http://en.wikipedia.org/wiki/Utah_teapot]
The Graphics Pipeline
[Figure sequence over four slides: the stages of the graphics pipeline, built up step by step.]
K. Fatahalian, et al., "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498
GPUs: what's inside?
Basically a SIMD machine:
• A single instruction stream operates on multiple data streams.
• All PEs execute the same instruction at the same time.
• PEs operate concurrently on their own piece of memory.
• However, a GPU is far more complex!
[Figure: a control processor fetches from instruction memory and broadcasts the same operation (e.g. Add) to PE1 … PE320, which reach data memory through an interconnect.]
CPU Programming: NVIDIA CUDA example
Single-thread program:

    float A[4][8];
    do-all(i=0;i<4;i++){
        do-all(j=0;j<8;j++){
            A[i][j]++;
        }
    }

CUDA program:

    float A[4][8];
    kernelF<<<dim3(4,1), dim3(8,1)>>>(A);

    __global__ void kernelF(float A[][8]){
        int i = blockIdx.x;    // block index selects the row
        int j = threadIdx.x;   // thread index selects the column
        A[i][j]++;
    }

• The CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
• Hardware converts TLP into DLP at run time.
System Architecture
Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008
NVIDIA Tesla Architecture (G80)
Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008
Texture Processor Cluster (TPC)
Deeply pipelined SM for high throughput
• One instruction is executed by a warp of 32 threads.
• One warp is executed on 8 PEs over 4 shader cycles.
Let's start with a simple example: the execution of one instruction, stepped through below.
Executing one instruction, step by step:
1. Issue an instruction for 32 threads.
2. Read the source operands of the 32 threads.
3. Buffer the source operands into the operand collector.
4. Execute threads 0~7.
5. Execute threads 8~15.
6. Execute threads 16~23.
7. Execute threads 24~31.
8. Write back from the result queue to the register file.
Warp: Basic Scheduling Unit in Hardware
• One warp consists of 32 consecutive threads.
• Warps are transparent to the programmer; they are formed at run time.
Warp Scheduling
• Schedule at most 24 warps in an interleaved manner.
• Zero overhead for interleaved issue of warps.
Handling Branch
Threads within a warp are free to branch:

    if( $r17 > $r19 ){
        $r16 = $r20 + $r31
    } else {
        $r16 = $r21 - $r32
    }
    $r18 = $r15 + $r16

The corresponding assembly code is disassembled from the CUDA binary (cubin) using "decuda".
Branch Divergence within a Warp
• If threads within a warp diverge, both paths have to be executed.
• Masks are set to filter out threads not executing on the current path.
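A scalar C sketch of this masked two-pass execution (illustrative only; the arrays model the per-thread registers of one warp, which real hardware processes in lockstep):

    enum { WARP = 32 };

    /* per-thread registers of one warp, modeled as arrays */
    int r15[WARP], r16[WARP], r17[WARP], r18[WARP],
        r19[WARP], r20[WARP], r21[WARP], r31[WARP], r32[WARP];

    void diverged_branch(void)
    {
        int mask[WARP];

        /* evaluate the branch condition for every thread */
        for (int t = 0; t < WARP; t++)
            mask[t] = (r17[t] > r19[t]);

        /* pass 1: "then" path; only threads with the mask set commit */
        for (int t = 0; t < WARP; t++)
            if (mask[t])  r16[t] = r20[t] + r31[t];

        /* pass 2: "else" path; the remaining threads commit */
        for (int t = 0; t < WARP; t++)
            if (!mask[t]) r16[t] = r21[t] - r32[t];

        /* reconvergence: all threads execute again */
        for (int t = 0; t < WARP; t++)
            r18[t] = r15[t] + r16[t];
    }

Both passes consume issue slots even if only one thread diverges, which is why divergence inside a warp costs performance.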
CUDA Programming
Both the grid and the thread block can have a two-dimensional index:

    kernelF<<<dim3(2,2), dim3(4,2)>>>(A);

    __global__ void kernelF(float A[][8]){
        int i = gridDim.x  * blockIdx.y  + blockIdx.x;   // linear block index
        int j = blockDim.x * threadIdx.y + threadIdx.x;  // linear thread index
        A[i][j]++;
    }
Mapping Thread Blocks to SMs
• One thread block can only run on one SM.
• A thread block cannot migrate from one SM to another SM.
• Threads of the same thread block can share data using shared memory.
Example: mapping 12 thread blocks onto 4 SMs.
[Figure sequence: blocks (0,0)/(0,1)/(0,2)/(0,3) and onward being assigned to the SMs.]
CUDA Compilation Trajectory
• cudafe: CUDA front end
• nvopencc: customized Open64 compiler for CUDA
• ptx: high-level assembly code (documented)
• ptxas: ptx assembler
• cubin: CUDA binary
decuda: http://wiki.github.com/laanwj/decuda
Optimization Guide
• Optimizations on memory latency tolerance: reduce register pressure, reduce shared memory pressure.
• Optimizations on memory bandwidth: global memory coalescing, avoiding shared memory bank conflicts, grouping byte accesses, avoiding partition camping.
• Optimizations on computation efficiency: mul/add balancing, increasing the floating-point proportion.
• Optimizations on operational intensity: use a tiled algorithm (see the sketch below), tune thread granularity.
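To illustrate the "tiled algorithm" point, here is a minimal C sketch of tiling (cache blocking) for matrix multiplication; the matrix and tile sizes are assumptions for illustration. Each loaded tile is reused TILE times before being evicted, which raises the operational intensity:

    #define N    512
    #define TILE 32     /* chosen so the working tiles fit in cache / shared memory */

    /* C += A * B, all N x N row-major; C must be zeroed by the caller. */
    void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N])
    {
        for (int i0 = 0; i0 < N; i0 += TILE)
          for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < N; k0 += TILE)
              /* one TILE x TILE block product: the A and B tiles are
                 reused TILE times instead of streaming from DRAM */
              for (int i = i0; i < i0 + TILE; i++)
                for (int j = j0; j < j0 + TILE; j++) {
                    float sum = C[i][j];
                    for (int k = k0; k < k0 + TILE; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
    }

On a GPU the same idea appears as staging tiles of A and B in shared memory, which is what the MatrixMul example referenced later in these slides does.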
Global Memory: Coalesced Access
[Figure: perfectly coalesced access patterns; coalescing still allows individual threads to skip their load/store.]
NVIDIA, "CUDA Programming Guide"
Global Memory: Non-Coalesced Access
[Figure: patterns that do not coalesce; non-consecutive addresses, a starting address not aligned to 128 bytes, or a stride larger than one word.]
NVIDIA, "CUDA Programming Guide"
Shared Memory: without Bank Conflicts
[Figure: one access per bank; one access per bank with shuffling; all threads accessing the same address (broadcast); partial broadcast while skipping some banks.]
NVIDIA, "CUDA Programming Guide"
Shared Memory: with Bank Conflicts
[Figure: more than one address accessed per bank; broadcast combined with more than one address per bank.]
NVIDIA, "CUDA Programming Guide"
Optimizing MatrixMul
The matrix multiplication example from the 5kk70 course at TU/e; the CUDA@MIT course also provides matrix multiplication as a hands-on example.
ATI Cypress (RV870)
• 1600 shader ALUs
(ref: Tom's Hardware)
ATI Cypress (RV870)
• VLIW PEs
(ref: Tom's Hardware)
Intel Larrabee
• x86 cores; 8/16/32 cores.
Larry Seiler, et al., "Larrabee: a many-core x86 architecture for visual computing", SIGGRAPH 2008
CELL
[Figure: PlayStation 3 block diagram. The Cell Broadband Engine (3.2 GHz) connects to the NVIDIA RSX "reality synthesizer" (20 GB/s and 15 GB/s links), to XDR DRAM main memory (64 pins × 3.2 Gbps/pin = 25.6 GB/s), and via the south bridge (2.5 GB/s each way) to drives, USB, network, and media; the RSX has its own GDDR3 video memory (128 pins × 1.4 Gbps/pin = 22.4 GB/s).]
CELL – the architecture
• 1 × PPE, 64-bit PowerPC: L1 = 32 KB I$ + 32 KB D$, L2 = 512 KB.
• 8 × SPE cores: local store of 256 KB; 128 × 128-bit vector registers.
• Hybrid memory model: the PPE reads/writes memory directly; the SPEs use asynchronous DMA.
• EIB: 205 GB/s sustained aggregate bandwidth.
• Processor-to-memory bandwidth: 25.6 GB/s.
• Processor-to-processor: 20 GB/s in each direction.
Intel / AMD x86 – Historical overview
Nehalem architecture
Found in recent processors: Core i7 & Xeon 5500s.
• Quad core
• 3 cache levels
• 2 TLB levels
• 2 branch predictors
• Out-of-order execution
• Simultaneous multithreading
• DVFS: dynamic voltage & frequency scaling
[Figure: one core.]
Nehalem pipeline (1/2)
[Figure: pipeline block diagram. Instruction fetch and predecode, instruction queue, decode (backed by a micro-code ROM), rename/alloc, scheduler, three execution-unit clusters plus load and store units, and the retirement unit (re-order buffer); L1D cache and DTLB, per-core L2 cache, an L3 cache inclusive of all cores, and QPI: Quick Path Interconnect (2×20 bit).]
Nehalem pipeline (2/2)
Tylersburg: connecting 2 quad-cores
[Figure: two quad-core sockets, each core with L1D/L1I and a private L2U, sharing an inclusive L3U per socket; each socket has its own memory controller to DDR3 main memory and QPI links to the other socket and to the IOH.]

Level  Capacity     Assoc. (ways)  Line size (bytes)  Access latency (clocks)  Access throughput (clocks)  Write update policy
L1D    4 x 32 KiB   8              64                 4                        1                           Writeback
L1I    4 x 32 KiB   4              N/A                N/A                      N/A                         N/A
L2U    4 x 256 KiB  8              64                 10                       Varies                      Writeback
L3U    1 x 8 MiB    16             64                 35-40                    Varies                      Writeback
Programming these architectures: N-tap FIR
The filter computes out[i] = Σ_{j=0}^{N-1} in[i+j] · coeff[j].
C code:

    int i, j;
    for (i = 0; i < M; i++){
        out[i] = 0;
        for (j = 0; j < N; j++)
            out[i] += in[i+j] * coeff[j];
    }
[Figure: the 4-wide vectorized FIR dataflow; outputs Y0…Y11 are built from inputs X0…X11 and coefficients C0…C3, each group of four outputs being a sum of four element-wise products with successively shifted input vectors.]
FIR with x86 SSE intrinsics (note that _mm_alignr_epi8 is an SSSE3 integer intrinsic, so the float vectors are cast to __m128i and back around it):

    __m128 X, XH, XL, Y, C, H;
    int i, j;
    for(i = 0; i < (M/4); i++){
        XL = _mm_load_ps(&in[i*4]);
        Y  = _mm_setzero_ps();
        for(j = 0; j < (N/4); j++){
            XH = XL;
            XL = _mm_load_ps(&in[(i+j+1)*4]);
            C  = _mm_load_ps(&coeff[j*4]);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(0,0,0,0));
            X = _mm_mul_ps(XH, H);
            Y = _mm_add_ps(Y, X);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(1,1,1,1));
            X = _mm_castsi128_ps(_mm_alignr_epi8(          /* shift [XL:XH] by 4 bytes */
                    _mm_castps_si128(XL), _mm_castps_si128(XH), 4));
            X = _mm_mul_ps(X, H);
            Y = _mm_add_ps(Y, X);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(2,2,2,2));
            X = _mm_castsi128_ps(_mm_alignr_epi8(
                    _mm_castps_si128(XL), _mm_castps_si128(XH), 8));
            X = _mm_mul_ps(X, H);
            Y = _mm_add_ps(Y, X);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(3,3,3,3));
            X = _mm_castsi128_ps(_mm_alignr_epi8(
                    _mm_castps_si128(XL), _mm_castps_si128(XH), 12));
            X = _mm_mul_ps(X, H);
            Y = _mm_add_ps(Y, X);
        }
        _mm_store_ps(&out[i*4], Y);
    }
[Figure: Y0..Y3 = (X0..X3 × splat C0) + (X1..X4 × splat C1) + (X2..X5 × splat C2) + (X3..X6 × splat C3), matching the shuffles and shifted loads above.]
FIR using pthreads

    pthread_t fir_threads[N_THREAD];
    fir_arg   fa[N_THREAD];
    tsize = M / N_THREAD;

    for(i = 0; i < N_THREAD; i++){
        /* ... initialize thread parameters fa[i] ... */
        rc = pthread_create(&fir_threads[i], NULL,
                            fir_kernel, (void *)&fa[i]);
    }
    for(i = 0; i < N_THREAD; i++){
        rc = pthread_join(fir_threads[i], &status);
    }
[Figure: the input is split over threads T0-T3, each running the sequential or the vectorized FIR kernel, and the results are joined.]
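The slides do not show fir_kernel itself; here is a minimal sketch of the thread function and its argument struct (the fields of fir_arg are assumptions chosen to match the split/join picture):

    #include <pthread.h>

    typedef struct {
        const float *in;      /* this thread's slice of the input  */
        float       *out;     /* this thread's slice of the output */
        const float *coeff;   /* shared coefficient array          */
        int          m;       /* number of outputs in this slice   */
        int          n;       /* number of filter taps             */
    } fir_arg;

    void *fir_kernel(void *arg)
    {
        fir_arg *fa = (fir_arg *)arg;
        for (int i = 0; i < fa->m; i++) {    /* same loops as the scalar FIR */
            fa->out[i] = 0.0f;
            for (int j = 0; j < fa->n; j++)
                fa->out[i] += fa->in[i + j] * fa->coeff[j];
        }
        return NULL;
    }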
x86 FIR speedup
On an Intel Core 2 Quad Q8300, gcc optimization level 2; input: ~5M samples; 4 threads in the pthread version.
[Figure: speedup bars.]
FIR kernel on a CELL SPE
Vectorization is similar to SSE:

    vector float X, XH, XL, Y, H;
    int i, j;
    for(i = 0; i < (M/4); i++){
        XL = in[i];
        Y  = spu_splats(0.0f);
        for(j = 0; j < (N/4); j++){
            XH = XL;
            XL = in[i+j+1];
            H  = spu_splats(coeff[j*4]);
            Y  = spu_madd(XH, H, Y);

            H = spu_splats(coeff[j*4+1]);
            X = spu_shuffle(XH, XL, SHUFFLE_X1);
            Y = spu_madd(X, H, Y);

            H = spu_splats(coeff[j*4+2]);
            X = spu_shuffle(XH, XL, SHUFFLE_X2);
            Y = spu_madd(X, H, Y);

            H = spu_splats(coeff[j*4+3]);
            X = spu_shuffle(XH, XL, SHUFFLE_X3);
            Y = spu_madd(X, H, Y);
        }
        out[i] = Y;
    }
SPE DMA double buffering
While one buffer is being processed, the next one is fetched and the previous output is written back, so DMA transfers overlap with computation (get iBuf0; then repeatedly: get the other input buffer, run the kernel, put the output buffer, swap):

    float iBuf[2][BUF_SIZE];
    float oBuf[2][BUF_SIZE];
    int idx = 0;
    int buffers = size / BUF_SIZE;

    mfc_get(iBuf[idx], argp, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
    for(int i = 1; i < buffers; i++){
        wait_for_dma(tag[idx]);
        next_idx = idx ^ 1;
        mfc_get(iBuf[next_idx], argp, BUF_SIZE*sizeof(float), tag[next_idx], 0, 0);
        fir_kernel(oBuf[idx], iBuf[idx], coeff, BUF_SIZE, taps);
        mfc_put(oBuf[idx], outbuf, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
        idx = next_idx;
    }
    /* finish up the last block ... */
CELL FIR speedup
• On a PlayStation 3: a CELL with six accessible SPEs.
• Input: ~6M samples.
• Speedup compared to a scalar implementation on the PPE.
[Figure: speedup bars.]
Roofline Model
Introduced by Samuel Williams and David Patterson.
[Figure: performance in GFlops/sec versus operational intensity in Flops/Byte on log-log axes; the slanted peak-bandwidth line meets the horizontal peak-performance line at the ridge point, the balanced architecture for a given application.]
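The model in one formula (this is the standard roofline statement; the example numbers are assumed for illustration):

    attainable GFlop/s = min( peak GFlop/s, peak GB/s × operational intensity )

For instance, a machine with an assumed 500 GFlop/s peak and 100 GB/s of memory bandwidth runs a kernel of operational intensity 2 Flops/Byte at no more than min(500, 100 × 2) = 200 GFlop/s; its ridge point sits at 500/100 = 5 Flops/Byte.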
Roofline Model of GT8800 GPU
Roofline Model
Threads of one warp diverge into different paths at branch.
Roofline Model
In the G80 architecture, a non-coalesced global memory access is separated into 16 accesses.
Roofline Model
The previous examples assume memory latency can be hidden; otherwise the program can become latency-bound.
Z. Guz, et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Computer Architecture Letters, 2009
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009
With r_m the fraction of memory instructions, t_avg the average memory latency, and CPI_exe the cycles per instruction:
• there is one memory instruction in every 1/r_m instructions,
• i.e. one memory instruction every (1/r_m) × CPI_exe cycles,
• so it takes t_avg × r_m / CPI_exe threads to hide the memory latency.
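Plugging in assumed round numbers (not from the slides): t_avg = 400 cycles, r_m = 0.2, CPI_exe = 1:

    #threads = t_avg × r_m / CPI_exe = 400 × 0.2 / 1 = 80

so roughly 80 concurrent threads are needed before this memory latency is fully hidden.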
Roofline Model
If there are not enough threads to hide the memory latency, the memory latency itself can become the bottleneck.
Samuel Williams, "Auto-tuning Performance on Multicore Computers", PhD thesis, UC Berkeley, 2008
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009
Four Architectures
[Figure: block diagrams of the four systems compared in the roofline study.
• AMD Barcelona: 2 sockets × 4 Opteron cores, each core with a 512 KB victim cache; per socket a 2 MB shared quasi-victim L3 (32-way) and SRI/crossbar; 2×64b memory controllers to 667 MHz DDR2 DIMMs at 10.66 GB/s; HyperTransport links at 4 GB/s each direction.
• Sun Victoria Falls: 2 sockets × 8 MT SPARC cores behind a crossbar (179 GB/s / 90 GB/s); per socket a 4 MB shared L2 (16-way, 64b interleaved), 4 coherency hubs, and 2×128b controllers to 667 MHz FBDIMMs (21.33 GB/s and 10.66 GB/s); 8 × 6.4 GB/s interconnect (1 per hub per direction).
• NVIDIA G80: 8 thread clusters, 192 KB L2 (textures only), 24 ROPs, 6 × 64b memory controllers to 768 MB of 900 MHz GDDR3 device DRAM at 86.4 GB/s.
• IBM Cell Blade: 2 sockets, each a VMT PPE with 512 KB L2 plus 8 SPEs (256 KB local store + MFC each) on the EIB ring network; XDR memory controllers to 512 MB XDR DRAM at 25.6 GB/s per socket; BIF link between sockets at <20 GB/s each direction.]
32b Rooflines for the Four (in-core parallelism)
Single-precision roofline models for the SMPs used in this work, based on micro-benchmarks, experience, and manuals. Ceilings = in-core parallelism: can the compiler find all this parallelism? NOTE: log-log scale; assumes perfect SPMD.
[Figure: four roofline plots of attainable GFlop/s (32b, 4 to 512) versus flop:DRAM byte ratio (1/8 to 16) for AMD Barcelona, Sun Victoria Falls, NVIDIA G80, and IBM Cell Blade; compute ceilings include peak SP, mul/add imbalance, w/out FMA, w/out SIMD, and w/out ILP; bandwidth ceilings include w/out memory coalescing, w/out NUMA, w/out SW prefetch, and w/out DMA concurrency.]
Let's conclude: Trends
• Reliability + fault tolerance: requires run-time management and process migration.
• Power is the new metric: low-power management at all levels (scenarios, subthreshold operation, back biasing, …).
• Virtualization (1): do not disturb other applications; composability.
• Virtualization (2): one virtual target platform avoids the porting problem; one intermediate format supporting multiple targets; huge run-time management support, JIT compilation, multiple OSes.
• Compute servers.
• Transactional memory.
• 3D: integrate different dies.
3D using Through-Silicon Vias (TSVs)
• Using TSVs: face-to-back stacking (scalable).
• Flip-chip: face-to-face (limited to 2 die tiers).
• 4 um pitch in 2011 (ITRS 2007).
• Can enlarge the device area.
(from Woo et al., HPCA 2009)
Don't forget Amdahl
However, see next slide!
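For reference, Amdahl's law for a fraction f of parallelizable work on n cores:

    speedup(n) = 1 / ((1 - f) + f/n),  so  speedup(∞) = 1 / (1 - f)

Even with f = 0.95, the speedup saturates at 20x no matter how many cores are added; this is what the Hill-and-Marty discussion on the next slide refines for the multi-core era.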
Trends: Homogeneous vs. heterogeneous: where do we go?
• Homogeneous: easier to program; favored by DLP / vector parallelism; fault tolerance / task migration.
• Heterogeneous: energy-efficiency demands; higher speedup; Amdahl++ (see Hill and Marty, HPCA'08, on Amdahl's law in the multi-core era).
• Memory-dominated designs suggest a homogeneous sea of heterogeneous cores.
• A sea of reconfigurable compute or processor blocks? Many examples: Smart Memory, SmartCell, PicoChip, MathStar FPOA, Stretch, XPP, … etc.
What does a future architecture look like?
• A couple of high-performance (low-latency) cores: sequential code should also run fast.
• Add a whole battery of wide vector processors.
• Some shared memory (to reduce copying of large data structures); levels 2 and 3 in 3D technology; huge bandwidth; exploit large vectors.
• Accelerators for dedicated domains.
• OS support (run-time mapping, DVFS, use of accelerators).
But the real problem is …
Programming parallel systems is the real bottleneck; we need new programming models, such as transaction-based programming.
That's what we will talk about this week…