![Page 1: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/1.jpg)
ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Lecture 1 Early ILP Processors and Performance Bound Model
![Page 2: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/2.jpg)
ECE8833 H.-H. S. Lee 2009
Decoupled Access/Execute Computer Architectures
James E. Smith, ACM TOCS, 1984
(an earlier version was published in ISCA 1982)
![Page 3: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/3.jpg)
Background of DAE, circa 1982
• Written at a time when vector machines were dominant

LV   v1, mem[a1]
MULV v3, v2, v1
ADDV v5, v4, v3

[Figure: execution time line of LV, MULV, and ADDV, without and with vector chaining (Cray-1). Vector register: 64 elements (0 to 63) of 64 bits each, 4096 bits in total.]
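The benefit of chaining on the LV/MULV/ADDV sequence can be estimated with a very simple timing model. The sketch below is illustrative only: the startup latencies are hypothetical placeholders (not Cray-1 figures), every unit is assumed to deliver one element per cycle after its startup, and the function names are invented for this example.

```python
VECTOR_LENGTH = 64  # elements per vector register (0 to 63)

# Hypothetical startup latencies in cycles; not taken from the slides.
STARTUP = {"LV": 11, "MULV": 7, "ADDV": 6}


def unchained_cycles(ops, vl=VECTOR_LENGTH):
    """Each dependent instruction waits for the whole previous result vector."""
    return sum(STARTUP[op] + vl for op in ops)


def chained_cycles(ops, vl=VECTOR_LENGTH):
    """Each dependent instruction starts as soon as the first element arrives."""
    return sum(STARTUP[op] for op in ops) + vl


ops = ["LV", "MULV", "ADDV"]
print(unchained_cycles(ops), chained_cycles(ops))  # 216 88
```

Under these assumptions the startup latencies add up in both cases, but chaining pays the 64-element streaming time only once instead of three times.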
![Page 4: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/4.jpg)
Background of DAE, circa 1982
• Written at a time when vector machines were dominant

LV   v1, mem[a1]
MULV v3, v2, v1
ADDV v5, v4, v3

[Figure: chained datapath. Memory feeds v1; v1 and v2 feed MUL, which writes v3; v3 and v4 feed ADD, which writes v5.]

What about modern SIMD ISAs?
![Page 5: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/5.jpg)
Today’s State of the Art?
• Intel AVX
• Intel Larrabee NI
![Page 6: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/6.jpg)
DAE, circa 1982
• Fine-grained parallelism: Vector vs. Superscalar
• What about scalar performance?
  – Remember what Flynn’s bottleneck is?
Page 290
![Page 7: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/7.jpg)
Flynn’s Bottleneck
• ILP: 1.86
  – Programs on IBM 7090
  – Basically, he sort of said one cannot execute more than one instruction per cycle
  – ILP exploited within basic blocks
• [Riseman & Foster ’72]
  – Breaking control dependency
  – A perfect machine model
  – Benchmark includes numerical programs, assembler and compiler

| Jumps passed | 0 | 1 | 2 | 8 | 32 | 128 | ∞ |
|---|---|---|---|---|---|---|---|
| Average ILP | 1.72 | 2.72 | 3.62 | 7.21 | 14.8 | 24.2 | 51.2 |

[Figure: control-flow graph of basic blocks BB0 to BB4.]
![Page 8: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/8.jpg)
DAE, circa 1982, 1984
• Issues in CDC 6600 & IBM 360/91
  – Overlapping instructions out of order (OoO) needs complex control; the resulting slower clock offsets the benefit
  – Complex issue methods were abandoned by their manufacturers
    • Less determinism
    • Problems in HW debugging
    • Errors may not be reproducible
  – Complexity can be shifted to system software
![Page 9: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/9.jpg)
Decoupled Access/Execute Architecture
• An architecture with two instruction streams to break Flynn’s bottleneck
  – Access processor
  – eXecute processor
  – Hey, this was the 1980s
• Separate RFs (A0, A1, A2, ..., An-1 and X0, X1, X2, ..., Xm-1), which can be totally incompatible
  – Synchronization issue?
![Page 10: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/10.jpg)
DAE
![Page 11: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/11.jpg)
Data Movement
Data In
Data Out
paired
XLQ and XSQ are specified as registers at the ISA level
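How the queues let the access stream slip ahead of the execute stream can be sketched with a small timing model. Everything here is an assumption made for illustration: the latency constants, the one-load-per-cycle A-processor, the unbounded XLQ, and the non-pipelined X-processor. It is not the simulation methodology of the paper.

```python
from collections import deque

LOAD_LATENCY = 11     # assumed cycles from load issue to data arriving in XLQ
COMPUTE_LATENCY = 6   # assumed cycles the X-processor spends per operand


def coupled_cycles(n):
    """Single in-order stream: every load blocks until its data is consumed."""
    return n * (LOAD_LATENCY + COMPUTE_LATENCY)


def decoupled_cycles(n):
    """A-processor runs ahead and fills XLQ; X-processor drains it in order."""
    xlq = deque()
    for issue_cycle in range(n):                 # A-stream: one load per cycle
        xlq.append(issue_cycle + LOAD_LATENCY)   # cycle at which the data is ready
    x_free = 0
    while xlq:                                   # X-stream: pop operands in FIFO order
        ready = xlq.popleft()
        start = max(ready, x_free)
        x_free = start + COMPUTE_LATENCY
    return x_free


print(coupled_cycles(8), decoupled_cycles(8))  # 136 59
```

In this model the memory latency is paid once up front; after that the X-processor is never starved, which is the "slip" that the architectural queues make possible.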
![Page 12: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/12.jpg)
Register-to-Register Synch
Xi Aj
![Page 13: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/13.jpg)
Branch Synch-up
• One processor runs ahead
• The other executes an unconditional jump (BFQ instruction)
Branch outcomes in XBQ can be used to reduce I-fetch from the X-processor.
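A minimal sketch of this branch hand-off, assuming the A-processor resolves a simple counted loop and forwards each outcome through XBQ. The function names and the loop shape are hypothetical; only the queue name XBQ and the BFQ instruction come from the slide.

```python
from collections import deque


def a_processor(n_iterations, xbq):
    """A-stream: resolves the loop branch and forwards each outcome."""
    for i in range(n_iterations):
        taken = i < n_iterations - 1   # branch back except after the last iteration
        xbq.append(taken)


def x_processor(xbq):
    """X-stream: never evaluates the condition, it only branches from the queue."""
    iterations = 0
    while True:
        iterations += 1                # stands in for the loop body
        if not xbq.popleft():          # BFQ: consume one branch outcome
            break
    return iterations


xbq = deque()
a_processor(5, xbq)                    # the A-processor has run ahead
print(x_processor(xbq))                # 5
```

The X-processor follows the same control flow as the A-processor without ever computing a branch condition itself.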
![Page 14: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/14.jpg)
DAE Code Example
![Page 15: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/15.jpg)
Modern Issue Consideration
• Although it is an ’82/’84 paper, it considers
![Page 16: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/16.jpg)
Precise Exception
• Simple approach: force the instructions to complete in order
• In DAE, applied to each of the streams separately
• Example of imprecise exception issues
• Requires caution when coding the A and E programs
![Page 17: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/17.jpg)
Requirement for Precise Exception
![Page 18: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/18.jpg)
Why (and How) Does It Work?
• Avg. speedup = 1.58 for LFK (Livermore Fortran Kernels)
• Execution on the 2 processors is somewhat balanced
• Why?
  – Works nicely as shown in LFK
  – X-processor’s computation is not as fast
    • 6-cycle FP add
    • 7-cycle FP multiply
  – A-processor takes care of
    • Memory (11-cycle load)
    • Branch resolution
![Page 19: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/19.jpg)
Disadvantages of DAE Architecture
1. Writing 2 separate programs
   • What high-level language?
   • Who should do it?
2. Certain duplication in hardware
   • Instruction memory/cache
   • Instruction fetch unit
   • Decoder
![Page 20: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/20.jpg)
Interleaving Instruction Streams
• Use a bit to tag streams
• No split branch instruction
(1) X7 is XLQ or XSQ; (2) once loaded, it is used once; (3) it must be stored after the X-processor writes to it
(A)X
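The tag-bit idea can be sketched as a splitter that routes a single interleaved stream into the A and X streams. The instruction mnemonics below are hypothetical and only loosely follow the X7 convention described above; the tag is shown as a letter instead of a bit for readability.

```python
def split_streams(interleaved):
    """Split one tagged instruction stream into the A and X streams."""
    a_stream, x_stream = [], []
    for tag, instruction in interleaved:
        if tag == "A":
            a_stream.append(instruction)
        elif tag == "X":
            x_stream.append(instruction)
        else:
            raise ValueError(f"unknown stream tag: {tag!r}")
    return a_stream, x_stream


# Hypothetical interleaved program; X7 stands for the load/store queue.
program = [
    ("A", "X7 <- load q[A1]"),    # A-stream load, delivered through X7 (XLQ)
    ("X", "X2 <- X7 * X3"),       # the loaded value is used exactly once
    ("X", "X7 <- X2 + X4"),       # the X-processor writes the store data (XSQ)
    ("A", "store r[A1] <- X7"),   # stored after the X-processor has written X7
    ("A", "A1 <- A1 + 1"),
]

a_stream, x_stream = split_streams(program)
print(len(a_stream), len(x_stream))  # 3 2
```

With a single interleaved stream the programmer (or compiler) writes one program, and the hardware recovers the two streams that a pure DAE machine would otherwise need as separate programs.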
![Page 21: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/21.jpg)
Summary of DAE Architecture
• 2-wide issue per cycle
• Allows a constrained type of OoO
  – Data accesses could be done well in advance (i.e., “slip” ahead)
  – Enables a certain level of data prefetching
• Was novel in 1982!
![Page 22: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/22.jpg)
The ZS-1 Central Processor
James E. Smith, et al. in ASPLOS-II, 1987
![Page 23: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/23.jpg)
Astronautics ZS-1 Central Processor
• A realization of DAE (by the same author)
• Decouples the instruction stream into
  – Fixed point/memory
  – Floating-point operations
• Communicates via architectural queues
• Is extensively pipelined
• 22.5 MFLOPS, 45 MIPS
![Page 24: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/24.jpg)
ZS-1 Central Processor
Communicate with memory
31 A (and X) registers + 1 queue entry = 5-bit encoded operands
Hold 24 insts
Hold 4 insts
![Page 25: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/25.jpg)
ZS-1 Central Processor
+ An instruction cannot be issued unless its dependencies are resolved
+ A load may bypass independent stores
+ Maintain load-load, store-store order
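These three rules can be illustrated with a small reordering function. This is a sketch of one policy that satisfies the rules, assuming exact address comparison decides independence; it does not describe the actual ZS-1 hardware mechanism, and the function name is made up.

```python
def schedule(memory_ops):
    """Reorder (kind, addr) memory ops so loads bypass independent stores.

    Loads stay in program order relative to other loads, and stores stay in
    program order relative to other stores.
    """
    issued, pending_stores = [], []
    for kind, addr in memory_ops:
        if kind == "store":
            pending_stores.append((kind, addr))   # stores wait in FIFO order
            continue
        # A load: drain stores up to the youngest pending store it depends on.
        conflicts = [i for i, (_, a) in enumerate(pending_stores) if a == addr]
        if conflicts:
            last = conflicts[-1]
            issued.extend(pending_stores[: last + 1])
            del pending_stores[: last + 1]
        issued.append((kind, addr))
    issued.extend(pending_stores)                  # remaining stores, still in order
    return issued


ops = [("store", 0x10), ("load", 0x20), ("store", 0x30), ("load", 0x10), ("load", 0x40)]
for op in schedule(ops):
    print(op)
```

Here the load from 0x20 bypasses the store to 0x10, while the load from 0x10 forces that store to go first; the store to 0x30 is still issued after the store to 0x10.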
![Page 26: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/26.jpg)
Can Load Bypass Load?
• Why not?

Initially: (A) = 100, R3 = 25

Core 1:
(1) Load R1, (A)
(2) Load R2, (A)

Core 2:
(3) Store (A), R3

• What’s wrong with (2)(3)(1)?
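The question can be answered by replaying the three operations in different global orders. The helper below is a toy model written for this example (a single shared location and no caches), not a memory-consistency checker.

```python
def run(order):
    """Execute operations (1), (2), (3) in the given global order.

    Returns the final (R1, R2) seen by Core 1.
    """
    memory = {"A": 100}
    regs = {"R1": None, "R2": None, "R3": 25}
    for step in order:
        if step == 1:
            regs["R1"] = memory["A"]      # (1) Load R1, (A)   on Core 1
        elif step == 2:
            regs["R2"] = memory["A"]      # (2) Load R2, (A)   on Core 1
        elif step == 3:
            memory["A"] = regs["R3"]      # (3) Store (A), R3  on Core 2
        else:
            raise ValueError(f"unknown step: {step}")
    return regs["R1"], regs["R2"]


# Orders that keep Core 1's loads in program order:
print(run((1, 2, 3)), run((1, 3, 2)), run((3, 1, 2)))  # (100, 100) (100, 25) (25, 25)
# The second load bypasses the first:
print(run((2, 3, 1)))                                   # (25, 100)
```

With (2)(3)(1) the older load returns the new value 25 while the younger load returns the stale value 100, so Core 1 observes (A) going backwards in time. None of the orders that keep loads (1) and (2) in program order can produce that result.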
![Page 27: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/27.jpg)
ZS-1: Processing of Two Iterations
S: splitter; B: instruction buffer read; D: decode; I: issue; E: execution
![Page 28: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/28.jpg)
IBM RS/6000 and POWER
• Evolved from IBM ACS and 801
• Foundation of POWER architecture (Performance Optimization With Enhanced RISC)
  – 10 discrete chips in the early POWER1 system
  – Single-chip solution in RSC and in a subsequent POWER2 version called P2SC
![Page 29: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/29.jpg)
POWER2 Processor Node
• 8 discrete chips on MCM
• 66.7 MHz, 6-issue (2 reserved for br/comp)
• 2 FXUs
  – Memory, INT, Logical
  – 2 per cycle
• 3 dual-pipe FPUs can perform
  – 2 DP fma
  – 2 FP loads
  – 2 FP stores

[Figure: POWER2 processor node block diagram. Instruction Cache Unit: I-cache (32KB), dual branch processors, dispatch. Fixed-Point Unit (FXU): instruction buffer, execution unit without mult/div, execution unit with mult/div. Floating-Point Unit (FPU): instruction buffer, arithmetic execution unit, store execution unit, load execution unit. Sync between FXU and FPU. Data Cache Unit (DCU): 4 separate chips (32KB each). Storage Control Unit. Memory Unit (64MB – 512MB). Optional secondary cache (1 or 2MB).]
![Page 30: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/30.jpg)
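MACS Performance Bound Model
• To analyze achievable performance (mostly FP) in scientific applications

[Figure: hierarchy of run-time bounds, from lowest to highest: M Bound, MA Bound, MAC Bound, MACS Bound, Actual Run Time (physically measured). GAP A lies between the M and MA bounds, GAP C between the MA and MAC bounds, GAP S between the MAC and MACS bounds, and GAP P between the MACS bound and the actual run time.]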
![Page 31: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/31.jpg)
MACS Performance Bound Model
• GAP A (keeps you from attaining peak performance)
  – Excessive loads/stores (more than the essential ones, e.g., a[i] = b[i])
  – Loop bookkeeping
• GAP C (reason we may want to have 432?)
  – Hardware restriction (architectural registers)
  – Redundant instructions
  – Load/store overhead in function calls
• GAP S
  – Weak scheduling algorithm
  – Resource conflicts preventing a tighter schedule
  – Sol: modulo scheduling to compact the code
• GAP P
  – Cache misses, inter-core communication, system effects (e.g., context switches)
  – Sol: prefetch, loop blocking, loop fusion, loop exchange, etc.
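Once the four bounds and a measurement are available, the gaps show where the time is lost. The sketch below treats each gap as a difference in cycles per floating-point operation (CPF), which is an assumption of this example; the numbers are hypothetical and only the gap names come from the slides.

```python
def gaps(m, ma, mac, macs, measured):
    """Return gaps A, C, S, P between successive bounds (all in CPF)."""
    return {
        "A": ma - m,            # application: essential work vs. peak
        "C": mac - ma,          # compiler: generated code vs. essential work
        "S": macs - mac,        # schedule: scheduled code vs. instruction counts
        "P": measured - macs,   # measured run time vs. scheduled code
    }


def largest_gap(gap_table):
    """Name of the gap that costs the most, i.e. where to look first."""
    return max(gap_table, key=gap_table.get)


# Hypothetical bounds for one loop, in CPF.
table = gaps(m=0.25, ma=1.0, mac=1.5, macs=1.75, measured=2.0)
print(table, largest_gap(table))
```

For these made-up numbers GAP A dominates, which would point at the application's essential load/store and bookkeeping work instead of at the compiler or the scheduler.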
![Page 32: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/32.jpg)
POWER2 M Bound (Ideal, Ideal)
M Bound: Peak = 1 fma issued to each of the 2 FPU pipelines per cycle = 4 floating-point operations per cycle = 0.25 CPF (cycles per floating-point operation)

[Figure: Dispatch feeding the Floating-Point Unit (FPU): instruction buffer, arithmetic execution unit, store execution unit, load execution unit.]
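The arithmetic behind the 0.25 CPF figure, and the peak rate it implies at the 66.7 MHz clock quoted for the POWER2 node, can be written out as follows. The constant and function names are made up for this sketch.

```python
FMA_PIPELINES = 2     # FPU arithmetic pipelines
FLOPS_PER_FMA = 2     # one multiply and one add per fma
CLOCK_MHZ = 66.7      # POWER2 node clock from the slides


def m_bound_cpf():
    """Machine bound in cycles per floating-point operation."""
    return 1 / (FMA_PIPELINES * FLOPS_PER_FMA)


def peak_mflops():
    """Peak floating-point rate implied by the M bound."""
    return CLOCK_MHZ / m_bound_cpf()


print(m_bound_cpf())            # 0.25
print(round(peak_mflops(), 1))  # 266.8
```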
![Page 33: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/33.jpg)
POWER2 MA Bound (Ideal compiler and rest)
MA Bound
1. Given the visible workload of the high-level application
2. Calculate the essential operations that must be performed

$$t_{MA} = \frac{\mathrm{MAX}(t_{fl},\ t_{fx},\ t_m,\ t_i,\ t_d)}{f_a + f_m + 2 f_{ma} + 4 f_{div} + 4 f_{sqrt}}$$

• Numerator: time bound for all FP operations
• Denominator: essential, minimum FP operations to complete the computation
• A factor of 4 for div and sqrt is a common choice to reflect their relative weight to other computations
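The top-level MA bound is the slowest resource time divided by the weighted count of essential floating-point operations. A minimal sketch, assuming the per-resource times are already known; the dictionary keys and the example numbers are hypothetical.

```python
# Weights of the essential FP operations in the denominator of the bound.
FP_WEIGHTS = {"a": 1, "m": 1, "ma": 2, "div": 4, "sqrt": 4}


def weighted_fp_ops(f):
    """f_a + f_m + 2*f_ma + 4*f_div + 4*f_sqrt for a dict of operation counts."""
    return sum(weight * f.get(kind, 0) for kind, weight in FP_WEIGHTS.items())


def ma_bound_cpf(times, f):
    """MAX over the resource times divided by the weighted essential FP ops."""
    return max(times.values()) / weighted_fp_ops(f)


# Hypothetical loop body: one fma per iteration, limited by the memory term.
times = {"fl": 1.0, "fx": 1.5, "m": 2.0, "i": 1.0, "d": 0.0}
print(ma_bound_cpf(times, {"ma": 1}))  # 1.0
```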
![Page 34: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/34.jpg)
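POWER2 MA Bound (Ideal compiler and rest)

$$t_{fl} = \mathrm{MAX}\left(\frac{f_a + f_m + f_{ma} + 17 f_{div} + 27 f_{sqrt}}{2},\ \frac{l_{fl}}{2},\ \frac{s_{fl}}{2}\right)$$

• 2 pipelines
• Non-pipelined FP ops (div and sqrt)

$$t_i = \frac{f_a + f_m + f_{ma} + f_{div} + f_{sqrt} + l_{fl} + s_{fl}}{4}$$

• Max 4 dispatches to FPU and FXU

$$t_{fx} = \frac{l_{fl} + s_{fl}}{2}$$

• Other fixed-point considered irrelevant

$$t_m = \mathrm{MAX}(l_{fl} + s_{fl},\ l_{fx} + s_{fx})$$

• Simplified memory model

$$t_d = \mathrm{MAX}_{\text{recurrence cycles } r}\left(\frac{L_r}{I_r}\right)$$

• $L_r$: total latency of the loop-carried dependency
• $I_r$: # of iterations in recurrence $r$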
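The FPU, dispatch, FXU and recurrence terms of this slide can be sketched as small functions. This is a reading of the slide's formulas under the assumption that division and square root occupy a pipeline for 17 and 27 cycles; the memory term is left out, and the example counts are hypothetical.

```python
FP_KINDS = ("a", "m", "ma", "div", "sqrt")


def t_fl(f, l_fl, s_fl):
    """FPU time: arithmetic, FP loads and FP stores each share 2 pipelines."""
    arithmetic = (f.get("a", 0) + f.get("m", 0) + f.get("ma", 0)
                  + 17 * f.get("div", 0) + 27 * f.get("sqrt", 0))
    return max(arithmetic / 2, l_fl / 2, s_fl / 2)


def t_i(f, l_fl, s_fl):
    """Dispatch time: at most 4 instructions per cycle go to the FPU and FXU."""
    return (sum(f.get(kind, 0) for kind in FP_KINDS) + l_fl + s_fl) / 4


def t_fx(l_fl, s_fl):
    """FXU time: the 2 FXUs handle the FP loads and stores."""
    return (l_fl + s_fl) / 2


def t_d(recurrences):
    """Dependence time: worst L_r / I_r over (latency, iterations) pairs."""
    return max((latency / iterations for latency, iterations in recurrences),
               default=0.0)


# Hypothetical loop body y[i] = a * x[i] + y[i]: one fma, two loads, one store.
f = {"ma": 1}
print(t_fl(f, 2, 1), t_i(f, 2, 1), t_fx(2, 1), t_d([]))  # 1.0 1.0 1.5 0.0
```

For this loop the FXU term is the largest of the four, reflecting that every FP load and store also occupies a fixed-point unit.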
![Page 35: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/35.jpg)
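POWER2 MAC Bound
MAC Bound
Similar to computing the MA Bound, but using the actual, generated instruction counts

$$t_{MAC} = \frac{\mathrm{MAX}(t'_{fl},\ t'_{fx},\ t'_m,\ t'_b,\ t'_i,\ t'_d)}{f_a + f_m + 2 f_{ma} + 4 f_{div} + 4 f_{sqrt}}$$

$$t'_{fl} = \mathrm{MAX}\left(\frac{n_{FPU}}{2},\ \frac{l'_{fl}}{2},\ \frac{s'_{fl}}{2}\right) \quad \text{where } n_{FPU} = f'_a + f'_m + f'_{ma} + 17 f'_{div} + 27 f'_{sqrt} + \text{others}$$

$$t'_{fx} = \mathrm{MAX}\left(\frac{n_{FXU}}{2},\ n_{FXMD}\right) \quad \text{where } n_{FXU} = l'_{fx} + s'_{fx} + \text{other},\quad n_{FXMD} = \text{number of FXU mul and div}$$

$$t'_b = \frac{n_{BC}}{2} \quad \text{where } n_{BC} = \text{\# of branch and compare}$$

$$t'_i = \frac{\text{compiled code length} - n_{BC}}{4}$$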
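The MAC terms that count generated instructions can be sketched the same way. The function and parameter names are made up, the formulas are a reading of this slide, and the instruction counts in the example are hypothetical.

```python
def t_b_mac(n_bc):
    """Branch time: 2 issue slots are reserved for branch and compare."""
    return n_bc / 2


def t_i_mac(code_length, n_bc):
    """Dispatch time: the remaining instructions go 4 per cycle to FPU and FXU."""
    return (code_length - n_bc) / 4


def t_fx_mac(l_fx, s_fx, other, n_fxmd):
    """FXU time: 2 FXUs share the work, but only one can multiply and divide."""
    n_fxu = l_fx + s_fx + other
    return max(n_fxu / 2, n_fxmd)


# Hypothetical compiled loop body: 10 instructions, 2 of them branch/compare,
# 4 FXU instructions (2 loads, 1 store, 1 other) and no FXU multiply or divide.
print(t_b_mac(2), t_i_mac(10, 2), t_fx_mac(2, 1, 1, 0))  # 1.0 2.0 2.0
```

Dividing the largest of these (together with the FPU, memory and dependence terms) by the same weighted essential FP operation count as in the MA bound gives the MAC bound in CPF.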
![Page 36: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/36.jpg)
POWER2 MACS Bound
MACS Bound
Similar to computing the MAC Bound, but the numerator is taken from the actual compiler-scheduled code
![Page 37: ECE8833 Polymorphous and Many-Core Computer Architecture](https://reader030.vdocuments.us/reader030/viewer/2022032708/56812ddb550346895d932ad5/html5/thumbnails/37.jpg)
IBM SP2 Performance Bound
• Later expansion to include inter-processor communication bound