Introduction to
Many-Core Architectures
Henk Corporaal, www.ics.ele.tue.nl/~heco
ASCI Winterschool on Embedded Systems
Soesterberg, March 2010
Intel Trends (K. Olukotun)
[Figure: Intel processor trends over time (transistors, clock frequency, power, ILP); a recent data point: Core i7, 3 GHz, 100 W.]
System-level integration (Chuck Moore, AMD, at MICRO 2008)
• Single-chip CPU era (1986-2004): extreme focus on single-threaded performance; multi-issue, out-of-order execution plus a moderate cache hierarchy.
• Chip Multiprocessor (CMP) era (2004-2010): early: hasty integration of multiple cores into the same chip/package; mid-life: address some of the HW scalability and interference issues; current: homogeneous CPUs plus moderate system-level functionality.
• System-level integration era (~2010 onward): integration of substantial system-level functionality; heterogeneous processors and accelerators; introspective control systems for managing on-chip resources & events.
Why many-core?
• Running into the frequency wall, the ILP wall, the memory wall, and the energy wall.
• Chip area enabler: Moore's law goes well below 22 nm; what to do with all this area? Multiple processors fit easily on a single die.
• Application demands.
• Cost effective (just connect existing processors or processor cores).
• Low power: parallelism may allow lowering Vdd; performance/Watt is the new metric!
Low power through parallelism
• Sequential processor: switching capacitance C, frequency f, voltage V, so P1 = f·C·V²
• Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V' < V, so P2 = (f/2)·2C·V'² = f·C·V'² < P1
[Figure: one CPU versus two CPUs (CPU1, CPU2) running at half the frequency.]
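To make the saving concrete, here is the same calculation with assumed numbers (V = 1.2 V and V' = 0.8 V are illustrative values, not from the slide):

    P2 / P1 = (V'/V)^2 = (0.8 / 1.2)^2 ≈ 0.44

So the two half-speed processors deliver (ideally) the same throughput at less than half the power, provided the workload actually parallelizes over both units.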
How low Vdd can we go?
Subthreshold JPEG encoder, Vdd 0.4 - 1.2 V
[Figure: energy per operation (pJ/operation, 0.0-8.0) versus supply voltage (1.2 V down to 0.4 V) for four engines, with annotated points at 3.4x, 4.4x, 5.6x, and 8.3x.]
Computational efficiency: how many MOPS/Watt?
Yifan He et al., DAC 2010
Computational efficiency: what do we need?
[Figure: performance (Gops) versus power (Watts) on log-log axes, with iso-efficiency lines at 1, 10, 100, and 1000 Mops/mW (better power efficiency toward the top-left); plotted designs include SODA (65 nm), SODA (90 nm), TI C6X, Imagine, VIRAM, Pentium M, and IBM Cell; application requirements shown for 3G wireless, 4G wireless, and mobile HD video.]
Woh et al., ISCA 2009
Intel's opinion: 48-core x86
Outline
• Classifications of parallel architectures
• Examples: various (research) architectures, GPUs, Cell, Intel multi-cores
• How much performance do you really get? The Roofline model
• Trends & conclusions
Classifications
• Performance / parallelism driven: the 4-5-dimensional design space, Flynn's taxonomy
• Communication & memory: message passing versus shared memory; shared-memory issues: coherency, consistency, synchronization
• Interconnect
Flynn's Taxonomy
• SISD (Single Instruction, Single Data): uniprocessors.
• SIMD (Single Instruction, Multiple Data): vector architectures also belong to this class; multimedia extensions (MMX, SSE, VIS, AltiVec, …); examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, …
• MISD (Multiple Instruction, Single Data): systolic arrays / stream-based processing.
• MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin; flexible and most widely used.
Flynn's Taxonomy
[Figure: the four Flynn classes in a diagram.]
Enhance performance: 4 architecture methods
• (Super)-pipelining
• Powerful instructions:
  - MD-technique: multiple data operands per operation
  - MO-technique: multiple operations per instruction
• Multiple instruction issue:
  - single stream: superscalar
  - multiple streams: a single core with multiple threads (simultaneous multithreading), or multiple cores
Architecture methods: Pipelined Execution of Instructions
Purpose of pipelining:
• Reduce the number of gate levels in the critical path.
• Reduce CPI close to one (instead of a large number, as for the multicycle machine).
• More efficient hardware.
Problems: hazards cause pipeline stalls.
• Structural hazards: add more hardware.
• Control hazards, branch penalties: use branch prediction.
• Data hazards: bypassing required.
Simple 5-stage pipeline: IF (instruction fetch), DC (instruction decode), RF (register fetch), EX (execute instruction), WB (write result register).
[Figure: four instructions flowing through the IF-DC-RF-EX-WB stages over cycles 1-8.]
Architecture methods: Pipelined Execution of Instructions - Superpipelining
Superpipelining: split one or more of the critical pipeline stages.
Superpipelining degree S:

    S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)

where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op.
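As a quick check of the definition, here is a worked instance with an assumed operation mix (the mix is illustrative, not from the slides): 50% single-cycle ALU operations, 30% two-cycle loads, and 20% three-cycle branches give

    S = 0.5·1 + 0.3·2 + 0.2·3 = 1.7

i.e., on average 1.7 operations of a single stream are in flight at once.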
Architecture methods: Powerful Instructions (1) - MD-technique
Multiple data operands per operation (SIMD: Single Instruction Multiple Data).
Vector instruction:

    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];

or

    c = a + 5*b

Assembly:

    set vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
Architecture methods: Powerful Instructions (1) - SIMD computing
• All PEs (Processing Elements) execute the same operation.
• Typical mesh or hypercube connectivity.
• Exploits the data locality of e.g. image processing applications.
• Dense encoding (few instruction bits needed).
[Figure: SIMD execution method; instructions 1, 2, 3, …, n issued over time, each executed simultaneously on PE1, PE2, …, PEn.]
Architecture methods: Powerful Instructions (1) - Sub-word parallelism
• SIMD on a restricted scale, used for multimedia instructions.
• Examples: MMX, SSE, SUN VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II.
• Example operation: a sum of absolute differences, Σ_{i=1..4} |a_i - b_i|, with the four subtractions and absolute values done on sub-words in parallel.
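A minimal C sketch of this sub-word idea, using the x86 SSE2 intrinsics named above (the 16-byte SAD instruction generalizes the 4-element example; the array contents are made up for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <emmintrin.h>                       /* SSE2 intrinsics */

    int main(void)
    {
        /* 16 unsigned-byte sub-words per operand; one instruction
           processes all of them in parallel. */
        uint8_t a[16] = { 10, 20, 30, 40 };      /* rest is zero */
        uint8_t b[16] = {  7, 25, 30, 50 };

        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);

        /* psadbw: sums |a_i - b_i| over bytes 0..7 into the low 64 bits
           and over bytes 8..15 into the high 64 bits. */
        __m128i sad = _mm_sad_epu8(va, vb);

        printf("SAD = %lld\n",
               (long long)_mm_cvtsi128_si64(sad));   /* 3+5+0+10 = 18 */
        return 0;
    }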
Architecture methods: Powerful Instructions (2) - MO-technique: multiple operations per instruction
Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example, one field per function unit:
    FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13
VLIW architecture: central Register File
[Figure: nine execution units grouped into three issue slots, all reading from and writing to one central register file.]
Q: How many ports does the register file need for n-issue? (With two source operands and one result per operation: roughly 2n read ports and n write ports; this is why the central register file becomes the bottleneck of wide VLIWs.)
Architecture methods: Multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e., which instructions can be executed in parallel?
• The user specifies multiple instruction streams: multi-processor, MIMD (Multiple Instruction, Multiple Data).
• The hardware detects ready instructions at run time: superscalar.
• The compiler compiles into a dataflow representation: dataflow processors.
Four-dimensional representation of the architecture design space <I, O, D, S>
• I: instructions/cycle
• O: operations/instruction
• D: data/operation
• S: superpipelining degree
[Figure: the four axes, with superscalar, MIMD, and dataflow along I (up to ~100), VLIW along O (~10), SIMD (~100) and vector (~10) along D, and superpipelined designs along S (~10); RISC sits near the origin and CISC below it.]
Architecture design space
Example values of <I, O, D, S> for different architectures; Mpar = I·O·D·S is the amount of parallelism you should exploit (with S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op), as before):

Architecture    I     O     D     S     Mpar
CISC            0.2   1.2   1.1   1     0.26
RISC            1     1     1     1.2   1.2
VLIW            1     10    1     1.2   12
Superscalar     3     1     1     1.2   3.6
SIMD            1     1     128   1.2   154
MIMD            32    1     1     1.2   38
GPU             32    2     8     24    12288
Top500 Jaguar   ???
Communication
A parallel architecture extends traditional computer architecture with a communication network, providing:
• abstractions (HW/SW interface);
• an organizational structure to realize the abstraction efficiently.
[Figure: processing nodes connected by a communication network.]
Communication models: Shared Memory
• Coherence problem
• Memory consistency issue
• Synchronization problem
[Figure: processes P1 and P2 both reading and writing one shared memory.]
Communication models: Shared memory
• Shared address space.
• Communication primitives: load, store, atomic swap.
• Two varieties:
  - physically shared: Symmetric Multi-Processors (SMP), usually combined with local caching;
  - physically distributed: Distributed Shared Memory (DSM).
SMP: Symmetric Multi-Processor
• Memory: centralized with uniform access time (UMA), bus interconnect, I/O.
• Examples: Sun Enterprise 6000, SGI Challenge, Intel.
[Figure: processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network.]
DSM: Distributed Shared Memory
• Non-uniform access time (NUMA) and a scalable interconnect (distributed memory).
[Figure: nodes of processor, cache, and local memory connected through an interconnection network, plus an I/O system.]
Shared Address Model Summary
• Each processor can name every physical location in the machine.
• Each process can name all data it shares with other processes.
• Data transfer via load and store.
• Data size: byte, word, …, or cache blocks.
• The memory hierarchy model applies: communication moves data into the local processor cache.
Three fundamental issues for shared-memory multiprocessors
• Coherence: do I see the most recent data?
• Consistency: when do I see a written value? E.g., do different processors see writes at the same time (w.r.t. other memory accesses)?
• Synchronization: how to synchronize processes, and how to protect access to shared data?
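A minimal C sketch of why the consistency and synchronization issues bite (names and values are illustrative; producer and consumer are meant to run on two different processors):

    #include <stdio.h>

    int data = 0;
    int flag = 0;

    void producer(void)          /* runs on processor 1 */
    {
        data = 42;               /* (1) write the payload         */
        flag = 1;                /* (2) signal "data is ready"    */
    }                            /* nothing orders (1) before (2) */

    void consumer(void)          /* runs on processor 2 */
    {
        while (flag == 0)
            ;                    /* spin until the flag is seen   */
        printf("%d\n", data);    /* may print 0 on a weakly
                                    ordered machine               */
    }

Without a synchronization primitive (lock, barrier, or memory fence) ordering the two writes with respect to the reads, processor 2 may observe flag == 1 while still seeing stale data: exactly the consistency and synchronization problems listed above.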
Communication models: Message Passing
• Communication primitives: e.g. send and receive library calls.
• Standard: MPI (Message Passing Interface), www.mpi-forum.org.
• Note that MP can be built on top of SM, and vice versa!
[Figure: processes P1 and P2 exchanging messages through FIFOs via send and receive.]
Message Passing Model
• Explicit message send and receive operations.
• Send specifies a local buffer + the receiving process on the remote computer.
• Receive specifies the sending process on the remote computer + the local buffer to place the data in.
• Typically blocking communication, but DMA may be used.
Message structure: header | data | trailer.
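A minimal MPI sketch of this send/receive pairing (a two-rank ping using only standard MPI calls; the tag value and message text are arbitrary):

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int  rank;
        char buf[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            strcpy(buf, "hello from rank 0");
            /* send: local buffer + receiving process (rank 1) */
            MPI_Send(buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive: sending process (rank 0) + local buffer */
            MPI_Recv(buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 got: %s\n", buf);
        }

        MPI_Finalize();
        return 0;
    }

Both calls here are blocking, matching the "typically blocking communication" point above; compile with mpicc and run with mpirun -np 2.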
Message passing communication
[Figure: nodes, each containing a processor, cache, memory, and DMA engine, attached through network interfaces to an interconnection network.]
Communication Models: Comparison
Shared memory:
• compatibility with well-understood language mechanisms
• ease of programming for complex or dynamic communication patterns
• shared-memory applications; sharing of large data structures
• efficient for small items
• supports hardware caching
Message passing:
• simpler hardware
• explicit communication
• implicit synchronization (with any communication)
Interconnect
How to connect your cores? Some options:
• Connect everybody:
  - single bus
  - hierarchical bus
  - NoC: multi-hop via routers; any topology possible; an easy 2D layout helps
• Connect with e.g. neighbors only, for instance using a shift operation in SIMD, or using dual-ported memories to connect 2 cores.
Bus (shared) or Network (switched)
Network: claimed to be more scalable (no bus arbitration, point-to-point connections), but with router overhead.
[Figure: example NoC with a 2x4 mesh routing network; eight nodes, each attached to its own router R.]
Historical Perspective
Early machines were collections of microprocessors:
• Communication was performed using bi-directional queues between nearest neighbors.
• Messages were forwarded by processors on the path: "store and forward" networking.
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops and thereby minimize time.
Design Characteristics of a Network
• Topology (how things are connected): crossbar, ring, 2-D and 3-D meshes or tori, hypercube, tree, butterfly, perfect shuffle, …
• Routing algorithm (path used), e.g. in a 2D torus: all east-west, then all north-south (avoids deadlock).
• Switching strategy:
  - circuit switching: the full path is reserved for the entire message, like the telephone;
  - packet switching: the message is broken into separately-routed packets, like the post office.
• Flow control and buffering (what if there is congestion): stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
• QoS guarantees, error handling, etc.
Switch / Network Topology
Topology determines:
• Degree: number of links from a node.
• Diameter: max number of links crossed between nodes.
• Average distance: number of links to a random destination.
• Bisection: minimum number of links that separate the network into two halves.
• Bisection bandwidth = link bandwidth × bisection.
Bisection Bandwidth
• Bisection bandwidth: the bandwidth across the smallest cut that divides the network into two equal halves, i.e. the bandwidth across the "narrowest" part of the network.
• [Figure: a bisection cut versus a cut that is not a bisection cut; for a linear array the bisection bw equals the link bw, while for a 2D mesh of n nodes the bisection bw = sqrt(n) × link bw.]
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others.
Common Topologies (N = number of nodes, n = dimension)

Type        Degree    Diameter         Ave. distance    Bisection
1D mesh     2         N-1              N/3              1
2D mesh     4         2(N^1/2 - 1)     2·N^1/2 / 3      N^1/2
3D mesh     6         3(N^1/3 - 1)     3·N^1/3 / 3      N^2/3
nD mesh     2n        n(N^1/n - 1)     n·N^1/n / 3      N^(n-1)/n
Ring        2         N/2              N/4              2
2D torus    4         N^1/2            N^1/2 / 2        2·N^1/2
Hypercube   log2 N    n = log2 N       n/2              N/2
2D tree     3         2·log2 N         ~2·log2 N        1
Crossbar    N-1       1                1                N^2/2
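Reading one row of the table for a concrete size, a 2D mesh with N = 64 nodes (an 8x8 grid):

    diameter      = 2(64^1/2 - 1) = 14 hops
    avg. distance = 2·64^1/2 / 3 ≈ 5.3 hops
    bisection     = 64^1/2 = 8 links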
Topologies in Real High-End Machines (listed roughly from newer to older)

Machine                                      Topology
Red Storm (Opteron + Cray network, future)   3D mesh
Blue Gene/L                                  3D torus
SGI Altix                                    fat tree
Cray X1                                      4D hypercube (approx.)
Myricom (Millennium)                         arbitrary
Quadrics (in HP Alpha server clusters)       fat tree
IBM SP                                       fat tree (approx.)
SGI Origin                                   hypercube
Intel Paragon                                2D mesh
BBN Butterfly                                butterfly
Network: Performance metrics
• Network bandwidth: need high bandwidth in communication; how does it scale with the number of nodes?
• Communication latency: affects performance, since the processor may have to wait; affects ease of programming, since it requires more thought to overlap communication and computation.
• How can a mechanism help hide latency? Overlap message send with computation, prefetch data, switch to another task or thread.
Examples of many-core / PE architectures
• SIMD: Xetal (320 PEs), IMAP (128 PEs), AnySP (Univ. of Michigan)
• VLIW: Itanium, TRIPS / EDGE, ADRES
• Multi-threaded (idea: hide long latencies): Denelcor HEP (1982), SUN Niagara (2005)
• Multi-processor: RAW, PicoChip, Intel/AMD, GRID, farms, …
• Hybrid, like Imagine, GPUs, XC-Core; actually, most are hybrid!
IMAP from NEC
NEC IMAP: a SIMD architecture
• 128 PEs
• Supports indirect addressing, e.g. LD r1, (r2)
• Each PE is a 5-issue VLIW
TRIPS (Univ. of Texas at Austin / IBM): a statically mapped dataflow architecture
[Figure legend: R: register file, E: execution unit, D: data cache, I: instruction cache, G: global control.]
Compiling for TRIPS
1. Form hyperblocks (use unrolling, predication, and inlining to enlarge the scope).
2. Spatially map the operations of each hyperblock; registers are accessed at hyperblock boundaries.
3. Schedule the hyperblocks.
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading; threads 1-5 shown in different shades, with idle slots.]
Intel calls simultaneous multithreading "Hyperthreading".
SUN Niagara processing element
• 4 threads per processor
• 4 copies of the PC logic, instruction buffer, store buffer, and register file
Really BIG: Jaguar, the Cray XT5-HE at Oak Ridge National Lab
• 224,256 AMD Opteron cores
• 2.33 PetaFlop peak performance
• 299 TByte main memory
• 10 PetaByte disk
• 478 GB/s memory bandwidth
• 6.9 MegaWatt
• 3D torus interconnect
• TOP500 #1 (Nov 2009)
Graphics Processing Units (GPUs)
[Photos: NVIDIA GT 340 (2010) and ATI 5970 (2009).]
Why GPUs
In Need of TeraFlops?
3 × GTX295:
• 1440 PEs
• 5.3 TeraFlop
How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
[Die photo of the GeForce GTX 280 (source: NVIDIA).]
J. Roca, et al., "Workload Characterization of 3D Games", IISWC 2006
T. Mitra, et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", MICRO 1999
How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high throughput.
[Die photo of Intel Penryn (source: Intel).]
GPU: Graphics Processing Unit
From polygon mesh to image pixels.
[Figure: the Utah teapot, http://en.wikipedia.org/wiki/Utah_teapot]
The Graphics Pipeline
[Figure sequence over four slides: the stages of the graphics pipeline, built up step by step.]
K. Fatahalian, et al., "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498
GPUs: what's inside?
Basically a SIMD machine:
• A single instruction stream operates on multiple data streams.
• All PEs execute the same instruction at the same time.
• PEs operate concurrently on their own piece of memory.
• However, a GPU is far more complex!
[Figure: a control processor fetches from instruction memory and broadcasts the same operation (e.g. Add) to PE1 … PE320, which reach data memory through an interconnect.]
CPU Programming: NVIDIA CUDA example
Single-thread program:

    float A[4][8];
    do-all(i=0;i<4;i++){
        do-all(j=0;j<8;j++){
            A[i][j]++;
        }
    }

CUDA program:

    float A[4][8];
    kernelF<<<dim3(4,1), dim3(8,1)>>>(A);

    __global__ void kernelF(float A[][8]){
        int i = blockIdx.x;    // block index selects the row
        int j = threadIdx.x;   // thread index selects the column
        A[i][j]++;
    }

• The CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
• Hardware converts TLP into DLP at run time.
System Architecture
Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008
NVIDIA Tesla Architecture (G80)
Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008
Texture Processor Cluster (TPC)
Deeply pipelined SM for high throughput
• One instruction is executed by a warp of 32 threads.
• One warp is executed on 8 PEs over 4 shader cycles.
Let's start with a simple example: the execution of one instruction, stepped through below.
Executing one instruction, step by step:
1. Issue an instruction for 32 threads.
2. Read the source operands of the 32 threads.
3. Buffer the source operands into the operand collector.
4. Execute threads 0~7.
5. Execute threads 8~15.
6. Execute threads 16~23.
7. Execute threads 24~31.
8. Write back from the result queue to the register file.
Warp: Basic Scheduling Unit in Hardware
• One warp consists of 32 consecutive threads.
• Warps are transparent to the programmer; they are formed at run time.
Warp Scheduling
• Schedule at most 24 warps in an interleaved manner.
• Zero overhead for interleaved issue of warps.
Handling Branch
Threads within a warp are free to branch:

    if( $r17 > $r19 ){
        $r16 = $r20 + $r31
    } else {
        $r16 = $r21 - $r32
    }
    $r18 = $r15 + $r16

The corresponding assembly code is disassembled from the CUDA binary (cubin) using "decuda".
Branch Divergence within a Warp
• If threads within a warp diverge, both paths have to be executed.
• Masks are set to filter out threads not executing on the current path.
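A scalar C sketch of this masked two-pass execution (illustrative only; the arrays model the per-thread registers of one warp, which real hardware processes in lockstep):

    enum { WARP = 32 };

    /* per-thread registers of one warp, modeled as arrays */
    int r15[WARP], r16[WARP], r17[WARP], r18[WARP],
        r19[WARP], r20[WARP], r21[WARP], r31[WARP], r32[WARP];

    void diverged_branch(void)
    {
        int mask[WARP];

        /* evaluate the branch condition for every thread */
        for (int t = 0; t < WARP; t++)
            mask[t] = (r17[t] > r19[t]);

        /* pass 1: "then" path; only threads with the mask set commit */
        for (int t = 0; t < WARP; t++)
            if (mask[t])  r16[t] = r20[t] + r31[t];

        /* pass 2: "else" path; the remaining threads commit */
        for (int t = 0; t < WARP; t++)
            if (!mask[t]) r16[t] = r21[t] - r32[t];

        /* reconvergence: all threads execute again */
        for (int t = 0; t < WARP; t++)
            r18[t] = r15[t] + r16[t];
    }

Both passes consume issue slots even if only one thread diverges, which is why divergence inside a warp costs performance.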
CUDA Programming
Both the grid and the thread block can have a two-dimensional index:

    kernelF<<<dim3(2,2), dim3(4,2)>>>(A);

    __global__ void kernelF(float A[][8]){
        int i = gridDim.x  * blockIdx.y  + blockIdx.x;   // linear block index
        int j = blockDim.x * threadIdx.y + threadIdx.x;  // linear thread index
        A[i][j]++;
    }
Mapping Thread Blocks to SMs
• One thread block can only run on one SM.
• A thread block cannot migrate from one SM to another SM.
• Threads of the same thread block can share data using shared memory.
Example: mapping 12 thread blocks onto 4 SMs.
[Figure sequence: blocks (0,0)/(0,1)/(0,2)/(0,3) and onward being assigned to the SMs.]
CUDA Compilation Trajectory
• cudafe: CUDA front end
• nvopencc: customized Open64 compiler for CUDA
• ptx: high-level assembly code (documented)
• ptxas: ptx assembler
• cubin: CUDA binary
decuda: http://wiki.github.com/laanwj/decuda
Optimization Guide
• Optimizations on memory latency tolerance: reduce register pressure, reduce shared memory pressure.
• Optimizations on memory bandwidth: global memory coalescing, avoiding shared memory bank conflicts, grouping byte accesses, avoiding partition camping.
• Optimizations on computation efficiency: mul/add balancing, increasing the floating-point proportion.
• Optimizations on operational intensity: use a tiled algorithm (see the sketch below), tune thread granularity.
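To illustrate the "tiled algorithm" point, here is a minimal C sketch of tiling (cache blocking) for matrix multiplication; the matrix and tile sizes are assumptions for illustration. Each loaded tile is reused TILE times before being evicted, which raises the operational intensity:

    #define N    512
    #define TILE 32     /* chosen so the working tiles fit in cache / shared memory */

    /* C += A * B, all N x N row-major; C must be zeroed by the caller. */
    void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N])
    {
        for (int i0 = 0; i0 < N; i0 += TILE)
          for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < N; k0 += TILE)
              /* one TILE x TILE block product: the A and B tiles are
                 reused TILE times instead of streaming from DRAM */
              for (int i = i0; i < i0 + TILE; i++)
                for (int j = j0; j < j0 + TILE; j++) {
                    float sum = C[i][j];
                    for (int k = k0; k < k0 + TILE; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
    }

On a GPU the same idea appears as staging tiles of A and B in shared memory, which is what the MatrixMul example referenced later in these slides does.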
Global Memory: Coalesced Access
[Figure: perfectly coalesced access patterns; coalescing still allows individual threads to skip their load/store.]
NVIDIA, "CUDA Programming Guide"
Global Memory: Non-Coalesced Access
[Figure: patterns that do not coalesce; non-consecutive addresses, a starting address not aligned to 128 bytes, or a stride larger than one word.]
NVIDIA, "CUDA Programming Guide"
Shared Memory: without Bank Conflicts
[Figure: one access per bank; one access per bank with shuffling; all threads accessing the same address (broadcast); partial broadcast while skipping some banks.]
NVIDIA, "CUDA Programming Guide"
Shared Memory: with Bank Conflicts
[Figure: more than one address accessed per bank; broadcast combined with more than one address per bank.]
NVIDIA, "CUDA Programming Guide"
Optimizing MatrixMul
The matrix multiplication example from the 5kk70 course at TU/e; the CUDA@MIT course also provides matrix multiplication as a hands-on example.
ATI Cypress (RV870)
• 1600 shader ALUs
(ref: Tom's Hardware)
ATI Cypress (RV870)
• VLIW PEs
(ref: Tom's Hardware)
Intel Larrabee
• x86 cores; 8/16/32 cores.
Larry Seiler, et al., "Larrabee: a many-core x86 architecture for visual computing", SIGGRAPH 2008
CELL
[Figure: PlayStation 3 block diagram. The Cell Broadband Engine (3.2 GHz) connects to the NVIDIA RSX "reality synthesizer" (20 GB/s and 15 GB/s links), to XDR DRAM main memory (64 pins × 3.2 Gbps/pin = 25.6 GB/s), and via the south bridge (2.5 GB/s each way) to drives, USB, network, and media; the RSX has its own GDDR3 video memory (128 pins × 1.4 Gbps/pin = 22.4 GB/s).]
CELL – the architecture
• 1 × PPE, 64-bit PowerPC: L1 = 32 KB I$ + 32 KB D$, L2 = 512 KB.
• 8 × SPE cores: local store of 256 KB; 128 × 128-bit vector registers.
• Hybrid memory model: the PPE reads/writes memory directly; the SPEs use asynchronous DMA.
• EIB: 205 GB/s sustained aggregate bandwidth.
• Processor-to-memory bandwidth: 25.6 GB/s.
• Processor-to-processor: 20 GB/s in each direction.
Intel / AMD x86 – Historical overview
Nehalem architecture
Found in recent processors: Core i7 & Xeon 5500s.
• Quad core
• 3 cache levels
• 2 TLB levels
• 2 branch predictors
• Out-of-order execution
• Simultaneous multithreading
• DVFS: dynamic voltage & frequency scaling
[Figure: one core.]
Nehalem pipeline (1/2)
[Figure: pipeline block diagram. Instruction fetch and predecode, instruction queue, decode (backed by a micro-code ROM), rename/alloc, scheduler, three execution-unit clusters plus load and store units, and the retirement unit (re-order buffer); L1D cache and DTLB, per-core L2 cache, an L3 cache inclusive of all cores, and QPI: Quick Path Interconnect (2×20 bit).]
Nehalem pipeline (2/2)
Tylersburg: connecting 2 quad-cores
[Figure: two quad-core sockets, each core with L1D/L1I and a private L2U, sharing an inclusive L3U per socket; each socket has its own memory controller to DDR3 main memory and QPI links to the other socket and to the IOH.]

Level  Capacity     Assoc. (ways)  Line size (bytes)  Access latency (clocks)  Access throughput (clocks)  Write update policy
L1D    4 x 32 KiB   8              64                 4                        1                           Writeback
L1I    4 x 32 KiB   4              N/A                N/A                      N/A                         N/A
L2U    4 x 256 KiB  8              64                 10                       Varies                      Writeback
L3U    1 x 8 MiB    16             64                 35-40                    Varies                      Writeback
Programming these architectures: N-tap FIR
The filter computes out[i] = Σ_{j=0}^{N-1} in[i+j] · coeff[j].
C code:

    int i, j;
    for (i = 0; i < M; i++){
        out[i] = 0;
        for (j = 0; j < N; j++)
            out[i] += in[i+j] * coeff[j];
    }
[Figure: the 4-wide vectorized FIR dataflow; outputs Y0…Y11 are built from inputs X0…X11 and coefficients C0…C3, each group of four outputs being a sum of four element-wise products with successively shifted input vectors.]
FIR with x86 SSE intrinsics (note that _mm_alignr_epi8 is an SSSE3 integer intrinsic, so the float vectors are cast to __m128i and back around it):

    __m128 X, XH, XL, Y, C, H;
    int i, j;
    for(i = 0; i < (M/4); i++){
        XL = _mm_load_ps(&in[i*4]);
        Y  = _mm_setzero_ps();
        for(j = 0; j < (N/4); j++){
            XH = XL;
            XL = _mm_load_ps(&in[(i+j+1)*4]);
            C  = _mm_load_ps(&coeff[j*4]);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(0,0,0,0));
            X = _mm_mul_ps(XH, H);
            Y = _mm_add_ps(Y, X);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(1,1,1,1));
            X = _mm_castsi128_ps(_mm_alignr_epi8(          /* shift [XL:XH] by 4 bytes */
                    _mm_castps_si128(XL), _mm_castps_si128(XH), 4));
            X = _mm_mul_ps(X, H);
            Y = _mm_add_ps(Y, X);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(2,2,2,2));
            X = _mm_castsi128_ps(_mm_alignr_epi8(
                    _mm_castps_si128(XL), _mm_castps_si128(XH), 8));
            X = _mm_mul_ps(X, H);
            Y = _mm_add_ps(Y, X);

            H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(3,3,3,3));
            X = _mm_castsi128_ps(_mm_alignr_epi8(
                    _mm_castps_si128(XL), _mm_castps_si128(XH), 12));
            X = _mm_mul_ps(X, H);
            Y = _mm_add_ps(Y, X);
        }
        _mm_store_ps(&out[i*4], Y);
    }
[Figure: Y0..Y3 = (X0..X3 × splat C0) + (X1..X4 × splat C1) + (X2..X5 × splat C2) + (X3..X6 × splat C3), matching the shuffles and shifted loads above.]
FIR using pthreads

    pthread_t fir_threads[N_THREAD];
    fir_arg   fa[N_THREAD];
    tsize = M / N_THREAD;

    for(i = 0; i < N_THREAD; i++){
        /* ... initialize thread parameters fa[i] ... */
        rc = pthread_create(&fir_threads[i], NULL,
                            fir_kernel, (void *)&fa[i]);
    }
    for(i = 0; i < N_THREAD; i++){
        rc = pthread_join(fir_threads[i], &status);
    }
[Figure: the input is split over threads T0-T3, each running the sequential or the vectorized FIR kernel, and the results are joined.]
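The slides do not show fir_kernel itself; here is a minimal sketch of the thread function and its argument struct (the fields of fir_arg are assumptions chosen to match the split/join picture):

    #include <pthread.h>

    typedef struct {
        const float *in;      /* this thread's slice of the input  */
        float       *out;     /* this thread's slice of the output */
        const float *coeff;   /* shared coefficient array          */
        int          m;       /* number of outputs in this slice   */
        int          n;       /* number of filter taps             */
    } fir_arg;

    void *fir_kernel(void *arg)
    {
        fir_arg *fa = (fir_arg *)arg;
        for (int i = 0; i < fa->m; i++) {    /* same loops as the scalar FIR */
            fa->out[i] = 0.0f;
            for (int j = 0; j < fa->n; j++)
                fa->out[i] += fa->in[i + j] * fa->coeff[j];
        }
        return NULL;
    }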
x86 FIR speedup
On an Intel Core 2 Quad Q8300, gcc optimization level 2; input: ~5M samples; 4 threads in the pthread version.
[Figure: speedup bars.]
FIR kernel on a CELL SPE
Vectorization is similar to SSE:

    vector float X, XH, XL, Y, H;
    int i, j;
    for(i = 0; i < (M/4); i++){
        XL = in[i];
        Y  = spu_splats(0.0f);
        for(j = 0; j < (N/4); j++){
            XH = XL;
            XL = in[i+j+1];
            H  = spu_splats(coeff[j*4]);
            Y  = spu_madd(XH, H, Y);

            H = spu_splats(coeff[j*4+1]);
            X = spu_shuffle(XH, XL, SHUFFLE_X1);
            Y = spu_madd(X, H, Y);

            H = spu_splats(coeff[j*4+2]);
            X = spu_shuffle(XH, XL, SHUFFLE_X2);
            Y = spu_madd(X, H, Y);

            H = spu_splats(coeff[j*4+3]);
            X = spu_shuffle(XH, XL, SHUFFLE_X3);
            Y = spu_madd(X, H, Y);
        }
        out[i] = Y;
    }
SPE DMA double buffering
While one buffer is being processed, the next one is fetched and the previous output is written back, so DMA transfers overlap with computation (get iBuf0; then repeatedly: get the other input buffer, run the kernel, put the output buffer, swap):

    float iBuf[2][BUF_SIZE];
    float oBuf[2][BUF_SIZE];
    int idx = 0;
    int buffers = size / BUF_SIZE;

    mfc_get(iBuf[idx], argp, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
    for(int i = 1; i < buffers; i++){
        wait_for_dma(tag[idx]);
        next_idx = idx ^ 1;
        mfc_get(iBuf[next_idx], argp, BUF_SIZE*sizeof(float), tag[next_idx], 0, 0);
        fir_kernel(oBuf[idx], iBuf[idx], coeff, BUF_SIZE, taps);
        mfc_put(oBuf[idx], outbuf, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
        idx = next_idx;
    }
    /* finish up the last block ... */
CELL FIR speedup
• On a PlayStation 3: a CELL with six accessible SPEs.
• Input: ~6M samples.
• Speedup compared to a scalar implementation on the PPE.
[Figure: speedup bars.]
Roofline Model
Introduced by Samuel Williams and David Patterson.
[Figure: performance in GFlops/sec versus operational intensity in Flops/Byte on log-log axes; the slanted peak-bandwidth line meets the horizontal peak-performance line at the ridge point, the balanced architecture for a given application.]
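The model in one formula (this is the standard roofline statement; the example numbers are assumed for illustration):

    attainable GFlop/s = min( peak GFlop/s, peak GB/s × operational intensity )

For instance, a machine with an assumed 500 GFlop/s peak and 100 GB/s of memory bandwidth runs a kernel of operational intensity 2 Flops/Byte at no more than min(500, 100 × 2) = 200 GFlop/s; its ridge point sits at 500/100 = 5 Flops/Byte.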
Roofline Model of GT8800 GPU
Roofline Model
Threads of one warp diverge into different paths at branch.
Roofline Model
In the G80 architecture, a non-coalesced global memory access is separated into 16 accesses.
Roofline Model
The previous examples assume memory latency can be hidden; otherwise the program can become latency-bound.
Z. Guz, et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Computer Architecture Letters, 2009
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009
With r_m the fraction of memory instructions, t_avg the average memory latency, and CPI_exe the cycles per instruction:
• there is one memory instruction in every 1/r_m instructions,
• i.e. one memory instruction every (1/r_m) × CPI_exe cycles,
• so it takes t_avg × r_m / CPI_exe threads to hide the memory latency.
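Plugging in assumed round numbers (not from the slides): t_avg = 400 cycles, r_m = 0.2, CPI_exe = 1:

    #threads = t_avg × r_m / CPI_exe = 400 × 0.2 / 1 = 80

so roughly 80 concurrent threads are needed before this memory latency is fully hidden.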
Roofline Model
If there are not enough threads to hide the memory latency, the memory latency itself can become the bottleneck.
Samuel Williams, "Auto-tuning Performance on Multicore Computers", PhD thesis, UC Berkeley, 2008
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009
Four Architectures
[Figure: block diagrams of the four systems compared in the roofline study.
• AMD Barcelona: 2 sockets × 4 Opteron cores, each core with a 512 KB victim cache; per socket a 2 MB shared quasi-victim L3 (32-way) and SRI/crossbar; 2×64b memory controllers to 667 MHz DDR2 DIMMs at 10.66 GB/s; HyperTransport links at 4 GB/s each direction.
• Sun Victoria Falls: 2 sockets × 8 MT SPARC cores behind a crossbar (179 GB/s / 90 GB/s); per socket a 4 MB shared L2 (16-way, 64b interleaved), 4 coherency hubs, and 2×128b controllers to 667 MHz FBDIMMs (21.33 GB/s and 10.66 GB/s); 8 × 6.4 GB/s interconnect (1 per hub per direction).
• NVIDIA G80: 8 thread clusters, 192 KB L2 (textures only), 24 ROPs, 6 × 64b memory controllers to 768 MB of 900 MHz GDDR3 device DRAM at 86.4 GB/s.
• IBM Cell Blade: 2 sockets, each a VMT PPE with 512 KB L2 plus 8 SPEs (256 KB local store + MFC each) on the EIB ring network; XDR memory controllers to 512 MB XDR DRAM at 25.6 GB/s per socket; BIF link between sockets at <20 GB/s each direction.]
32b Rooflines for the Four (in-core parallelism)
Single-precision roofline models for the SMPs used in this work, based on micro-benchmarks, experience, and manuals. Ceilings = in-core parallelism: can the compiler find all this parallelism? NOTE: log-log scale; assumes perfect SPMD.
[Figure: four roofline plots of attainable GFlop/s (32b, 4 to 512) versus flop:DRAM byte ratio (1/8 to 16) for AMD Barcelona, Sun Victoria Falls, NVIDIA G80, and IBM Cell Blade; compute ceilings include peak SP, mul/add imbalance, w/out FMA, w/out SIMD, and w/out ILP; bandwidth ceilings include w/out memory coalescing, w/out NUMA, w/out SW prefetch, and w/out DMA concurrency.]
Let's conclude: Trends
• Reliability + fault tolerance: requires run-time management and process migration.
• Power is the new metric: low-power management at all levels (scenarios, subthreshold operation, back biasing, …).
• Virtualization (1): do not disturb other applications; composability.
• Virtualization (2): one virtual target platform avoids the porting problem; one intermediate format supporting multiple targets; huge run-time management support, JIT compilation, multiple OSes.
• Compute servers.
• Transactional memory.
• 3D: integrate different dies.
3D using Through-Silicon Vias (TSVs)
• Using TSVs: face-to-back stacking (scalable).
• Flip-chip: face-to-face (limited to 2 die tiers).
• 4 um pitch in 2011 (ITRS 2007).
• Can enlarge the device area.
(from Woo et al., HPCA 2009)
Don't forget Amdahl
However, see next slide!
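For reference, Amdahl's law for a fraction f of parallelizable work on n cores:

    speedup(n) = 1 / ((1 - f) + f/n),  so  speedup(∞) = 1 / (1 - f)

Even with f = 0.95, the speedup saturates at 20x no matter how many cores are added; this is what the Hill-and-Marty discussion on the next slide refines for the multi-core era.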
Trends: Homogeneous vs. heterogeneous: where do we go?
• Homogeneous: easier to program; favored by DLP / vector parallelism; fault tolerance / task migration.
• Heterogeneous: energy-efficiency demands; higher speedup; Amdahl++ (see Hill and Marty, HPCA'08, on Amdahl's law in the multi-core era).
• Memory-dominated designs suggest a homogeneous sea of heterogeneous cores.
• A sea of reconfigurable compute or processor blocks? Many examples: Smart Memory, SmartCell, PicoChip, MathStar FPOA, Stretch, XPP, … etc.
What does a future architecture look like?
• A couple of high-performance (low-latency) cores: sequential code should also run fast.
• Add a whole battery of wide vector processors.
• Some shared memory (to reduce copying of large data structures); levels 2 and 3 in 3D technology; huge bandwidth; exploit large vectors.
• Accelerators for dedicated domains.
• OS support (run-time mapping, DVFS, use of accelerators).
But the real problem is …
Programming parallel systems is the real bottleneck; we need new programming models, such as transaction-based programming.
That's what we will talk about this week…