HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing
![Page 1: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/1.jpg)
HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing
Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer
![Page 2: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/2.jpg)
2
Simulating Multicores
Simulating an N-core multicore target
• Fundamentally N times the work
• Plus on-chip network
Duplicating cores will quickly fill the FPGA
Multi-FPGA will slow simulation
[Figure: nine CPUs connected by a network]
![Page 3: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/3.jpg)
3
Trading Time for Space
Can leverage separation of model clock and FPGA clock to save space
• Two techniques: serialization and time-multiplexing
But doesn’t this just slow down our simulator?
The tradeoff is a good idea if we can:
• Save a lot of space
• Improve FPGA critical path
• Improve utilization
• Slow down rare events, keep common events fast
LI approach enables a wide range of tradeoff options
![Page 4: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/4.jpg)
4
Serialization: A First Tradeoff
![Page 5: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/5.jpg)
5
Example Tradeoff: Multi-Port Register File
2 Read Ports, 2 Write Ports
• 5-bit index, 32-bit data
• Reads take zero clock cycles
Virtex 2Pro FPGA: 9242 (>25%) slices, 104 MHz
[Figure: 2R/2W register file. Inputs: rd addr 1, rd addr 2, wr addr 1 / wr val 1, wr addr 2 / wr val 2. Outputs: rd val 1, rd val 2.]
![Page 6: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/6.jpg)
6
Trading Time for Space
Simulate the circuit sequentially using BlockRAM
• 94 slices (<1%), 1 BlockRAM, 224 MHz (2.2x)
• Simulation rate is 224 / 3 ≈ 75 MHz
[Figure: an FSM in front of a 1R/1W BlockRAM. Inputs: rd addr 1, rd addr 2, wr addr 1 / wr val 1, wr addr 2 / wr val 2. Outputs: rd val 1, rd val 2.]
FPGA-cycle to Model Cycle Ratio (FMR)
• Each module may have different FMR
• A-Ports allow us to connect many such modules together
• Maintain a consistent notion of model time
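As a rough illustration of the FMR arithmetic on this slide, the Python sketch below models a 2R/2W register file served by a single memory that allows one read and one write per FPGA cycle. The three-step schedule and the names `SerializedRegFile` and `simulation_rate_mhz` are assumptions made for illustration; the slides give only the resulting ratio of three FPGA cycles per model cycle.

```python
class SerializedRegFile:
    """2R/2W register file emulated with one read and one write per FPGA cycle."""

    FMR = 3  # FPGA cycles per model cycle (assumed schedule, see model_cycle)

    def __init__(self, num_regs=32):
        self.mem = [0] * num_regs   # stands in for the 1R/1W BlockRAM
        self.fpga_cycles = 0

    def model_cycle(self, rd1, rd2, wr1=None, wr2=None):
        """Simulate one model cycle; wr1/wr2 are (addr, value) pairs or None."""
        # FPGA cycle 0: read port 1.
        val1 = self.mem[rd1]
        # FPGA cycle 1: read port 2, then apply write 1 (read-before-write).
        val2 = self.mem[rd2]
        if wr1 is not None:
            self.mem[wr1[0]] = wr1[1]
        # FPGA cycle 2: apply write 2.
        if wr2 is not None:
            self.mem[wr2[0]] = wr2[1]
        self.fpga_cycles += self.FMR
        return val1, val2


def simulation_rate_mhz(fpga_clock_mhz, fmr):
    """Model cycles per microsecond: FPGA clock divided by FMR."""
    return fpga_clock_mhz / fmr


rf = SerializedRegFile()
rf.model_cycle(0, 1, wr1=(5, 42), wr2=(6, 7))
print(rf.model_cycle(5, 6))                 # reads observe the previous cycle's writes
print(round(simulation_rate_mhz(224, 3)))   # 224 MHz FPGA clock, FMR of 3
```

Both reads are issued before either write, so a model cycle always reads the register state left by the previous model cycle, matching the zero-cycle read behavior of the target.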
![Page 7: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/7.jpg)
7
Example: Inorder Front End
[Figure: in-order front end. Modules: FET, Branch Pred, IMEM, PC Resolve, Inst Q, I$, ITLB, Line Pred. Ports are labeled with latencies of 0, 1, or 2 and include: first, deq, slot, enq or drop, fault, mispred, training, pred, rspImm, rspDel, redirect, vaddr (from Back End), paddr, inst or fault. Legend: Ready to simulate? Yes / No. Callouts: FET, Part, IMEM.]
• Modules may simulate at any wall-clock rate
• Corollary: adjacent modules may not be simulating the same model cycle
![Page 8: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/8.jpg)
8
Simulator “Slip”
Adjacent modules simulating different cycles!
• In paper: distributed resynchronization scheme
This can speed up simulation
• Case study: Achieved 17% better performance than centralized controller
• Can get performance = dynamic average
[Figure: FET and DEC joined by a latency-1 port, shown twice ("vs")]
Let’s see how...
![Page 9: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/9.jpg)
9
Traditional Software Simulation
[Table: columns are Wallclock time, FET, DEC, EXE, MEM, WB; one entry per wallclock step]
0: A
1: A
2: NOP
3: NOP
4: NOP
5: NOP
6: B
7: B
8: A
9: A
10: NOP
Legend: [marker] = model cycle
![Page 10: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/10.jpg)
10
2008.06.30
Challenges in Conducting Compelling Architecture Research
Global Controller “Barrier” Synchronization
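[Table: columns are FPGA CC, FET, DEC, EXE, MEM, WB; entries listed left to right for each FPGA clock cycle]
0: A NOP NOP NOP NOP
1: A
2: A
3: B A NOP NOP NOP
4: B A
5: A
6: C B A NOP NOP
7: B
8: D C B A NOP
9: D
10: D
Legend: [marker] = model cycle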
![Page 11: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/11.jpg)
11
A-Ports Distributed Synchronization
[Table: columns are FPGA CC, FET, DEC, EXE, MEM, WB; entries listed left to right for each FPGA clock cycle]
0: A NOP NOP NOP NOP
1: B A NOP NOP NOP
2: C B A NOP NOP
3: D B A NOP
4: E (full) B A
5: B A
6: B A
7: C B A
8: F D C B A
9: G (full) D C B
10: D C
11: D
12: D
• Long-running ops can overlap even if on different CC
• Run-ahead in time until buffering fills
Takeaway: LI makes serialization tradeoffs more appealing
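The difference between the two synchronization schemes can be sketched numerically. The Python below is a simplified timing model, not HAsim code: `cost[i][c]` (hypothetical input) is the number of FPGA cycles module `i` needs for model cycle `c`, modules form a linear pipeline, and each latency-1 port is assumed to start with one token and hold at most `capacity` tokens.

```python
def barrier_time(cost):
    """FPGA cycles when a global controller advances all modules in lock step."""
    cycles = len(cost[0])
    # Every model cycle lasts as long as its slowest module.
    return sum(max(module[c] for module in cost) for c in range(cycles))


def slip_time(cost, capacity):
    """FPGA cycles when modules synchronize only through their ports.

    A module starts model cycle c once it has finished c-1, its input
    port is non-empty, and its output port is non-full.
    """
    if capacity < 2:
        raise ValueError("this sketch needs capacity >= 2")
    n, cycles = len(cost), len(cost[0])
    finish = [[0] * cycles for _ in range(n)]

    def done(i, c):
        # Completion time of module i's model cycle c (0 before the start).
        return finish[i][c] if c >= 0 else 0

    for c in range(cycles):
        for i in range(n):
            start = done(i, c - 1)
            if i > 0:       # input token produced by the upstream module
                start = max(start, done(i - 1, c - 1))
            if i < n - 1:   # space freed by the downstream module
                start = max(start, done(i + 1, c + 1 - capacity))
            finish[i][c] = start + cost[i][c]
    return max(row[-1] for row in finish)


# Two modules, each with one rare 5-cycle event on a different model cycle.
cost = [[1, 1, 5, 1],
        [1, 5, 1, 1]]
print(barrier_time(cost), slip_time(cost, 4), slip_time(cost, 2))
```

With enough buffering, the two long events overlap in wall-clock time even though they belong to different model cycles; with minimal buffering the upstream module cannot run ahead and the benefit disappears in this example.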
![Page 12: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/12.jpg)
12
Leveraging Latency-Insensitivity
Expensive instructions
[Figure labels: CPU, EXE, FPU, LEAP Instruction Emulator (M5), RRR, FPGA; port latencies 1, 1. With Parashar, Adler.]
Modeling large caches
[Figure labels: L2$, Cache Controller, RAM 256 KB, FPGA, LEAP, LEAP Scratchpad; port latencies 1, 1. Storage levels: BRAM (KBs, 1 CC), SRAM (MBs, 10s CCs), System Memory (GBs, 100s CCs).]
![Page 13: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/13.jpg)
13
Time-Multiplexing: A Tradeoff to Scale Multicores
(resume at 3:45)
![Page 14: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/14.jpg)
14
Multicores Revisited
What if we duplicate the cores?
Benefits:
• Simple to describe
• Maximum parallelism
Drawbacks:
• Probably won’t fit
• Low utilization of functional units
[Figure: CORE 0, CORE 1, CORE 2, each with its own state]
![Page 15: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/15.jpg)
15
Module Utilization
[Figure: FET and DEC joined by a latency-1 port]
A module is unutilized on an FPGA cycle if:
• Waiting for all input ports to be non-empty or
• Waiting for all output ports to be non-full
Case Study: In-order functional units were utilized 13% of FPGA cycles on average
![Page 16: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/16.jpg)
16
Time-Multiplexing: First Approach
Duplicate state, sequentially share logic
Benefits:
• Better unit utilization
Drawbacks:
• More expensive than duplication(!)
[Figure: several copies of state (virtual instances) multiplexed onto one physical pipeline]
![Page 17: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/17.jpg)
17
Round-Robin Time Multiplexing
Fix ordering, remove multiplexors
Benefits:
• Much better area
• Good unit utilization
Drawbacks:
• Head-of-line blocking may limit performance
[Figure: copies of state fed in fixed order into one physical pipeline]
• Need to limit impact of slow events
• Pipeline at a fine granularity
• Need a distributed, controller-free mechanism to coordinate...
![Page 18: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/18.jpg)
18
Port-Based Time-Multiplexing
• Duplicate local state in each module
• Change port implementation:
– Minimum buffering: N * latency + 1
– Initialize each FIFO with: # of tokens = N * latency
• Result: Adjacent modules can be simultaneously simulating different virtual instances
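A minimal Python sketch of the port sizing rule above, assuming one producer and one consumer that both visit the N virtual instances in round-robin order. The class `MultiplexedPort`, the `run` driver, and the `consumer_period` knob (used to make the consumer slower than the producer) are illustrative names, not part of HAsim.

```python
from collections import deque

NO_MESSAGE = None  # placeholder token used to pre-fill the FIFO


class MultiplexedPort:
    def __init__(self, n_instances, latency):
        self.capacity = n_instances * latency + 1            # minimum buffering
        self.fifo = deque([NO_MESSAGE] * (n_instances * latency))  # initial tokens

    def can_enq(self):
        return len(self.fifo) < self.capacity

    def can_deq(self):
        return len(self.fifo) > 0


def run(n_instances, latency, fpga_cycles, consumer_period=1):
    """Return (instance, model_cycle, message) for every token the consumer reads."""
    port = MultiplexedPort(n_instances, latency)
    sent = 0
    received = []
    for t in range(fpga_cycles):
        # Both sides decide from the port state at the start of the FPGA cycle.
        do_enq = port.can_enq()
        do_deq = port.can_deq() and t % consumer_period == 0
        if do_deq:
            k = len(received)
            received.append((k % n_instances, k // n_instances, port.fifo.popleft()))
        if do_enq:
            # The producer tags each message with its own (instance, model cycle).
            port.fifo.append((sent % n_instances, sent // n_instances))
            sent += 1
    return received


for instance, cycle, msg in run(n_instances=3, latency=1, fpga_cycles=12, consumer_period=2):
    print(instance, cycle, msg)
```

Because the FIFO starts with exactly N * latency tokens, the message that instance k reads in model cycle c is the one instance k wrote in model cycle c - latency, regardless of how far apart the two modules drift in wall-clock time.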
![Page 19: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/19.jpg)
19
The Front End Multiplexed
[Figure: the same in-order front end (FET, Branch Pred, IMEM, PC Resolve, Inst Q, I$, ITLB, Line Pred and their latency-labeled ports), now time-multiplexed. Legend: Ready to simulate? CPU 1 / CPU 2 / No. Callouts: FET, IMEM.]
![Page 20: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/20.jpg)
20
On-Chip Networks in a Time-Multiplexed World
![Page 21: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/21.jpg)
21
Problem: On-Chip Network
[Figure: CPU 0, CPU 1, and CPU 2, each with L1/L2 $, plus Memory Control; each is attached to a router (r), and routers exchange msg and credit signals. A single multiplexed CPU with L1/L2 $ carries instances [0 1 2].]
• Problem: routing wires to/from each router
• Similar to the “global controller” scheme
• Also utilization is low
![Page 22: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/22.jpg)
22
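Multiplexing On-Chip Network Routers
[Figure: four fully connected routers (Router0, Router1, Router2, Router3) mapped onto one physical router (Router0..3). Table columns: cur, to 1, to 2, to 3, fr 1, fr 2, fr 3. Each output port feeds the matching input port through a reorder stage:]
• to 1 / fr 1: σ(x) = (x + 1) mod 4
• to 2 / fr 2: σ(x) = (x + 2) mod 4
• to 3 / fr 3: σ(x) = (x + 3) mod 4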
Simulate the network without a network
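The reorder stages can be checked with a few lines of Python. This is a list-level sketch under the assumption that port "to k" of router x targets router (x + k) mod 4, which is how the σ functions on this slide read; the helper names `sigma` and `reorder` are illustrative.

```python
N = 4  # number of virtual routers sharing the physical router


def sigma(k):
    """Permutation for the port pair (to k, fr k): sender x reaches (x + k) mod N."""
    return lambda x: (x + k) % N


def reorder(outputs, perm):
    """Place the message written by virtual router x where router perm(x) reads it."""
    inputs = [None] * len(outputs)
    for sender, msg in enumerate(outputs):
        inputs[perm(sender)] = msg
    return inputs


for k in (1, 2, 3):
    # Messages appear on port "to k" in simulation order 0, 1, 2, 3.
    outputs = [f"r{x}->r{(x + k) % N}" for x in range(N)]
    inputs = reorder(outputs, sigma(k))
    print(k, inputs)
```

After reordering, position d of each input stream holds the message addressed to router d, so the single physical router can consume it when it simulates virtual router d. No physical network links are modeled at all.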
![Page 23: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/23.jpg)
23
Ring/Double Ring Topology Multiplexed
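[Figure: four routers (Router0, Router1, Router2, Router3) in a ring, mapped onto one physical router (Router0..3) whose “to next” output feeds its “from prev” input. The stage between them, marked “???”, is the permutation σ(x) = (x + 1) mod 4. Table columns: cur, to N, fr P.]
Opposite direction: flip to/from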
![Page 24: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/24.jpg)
24
Implementing Permutations on FPGAs Efficiently
Side Buffer
• Fits networks like ring/torus (e.g. x+1 mod N)
Indirection Table
• More general, but more expensive
[Figure: a Perm Table indexing a RAM Buffer under FSM control, for σ(x) = (x + 1) mod 4; labels 1000 and 0001. Side-buffer operations shown: move first to Nth, move Nth to first, move every K to N-K.]
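The two implementation options can be compared at the level of message lists. The Python below is only a functional sketch (it does not model FPGA cycles or buffering): `reorder_with_table` stands in for the indirection table, and the two rotation helpers stand in for the side-buffer operations "move Nth to first" and "move first to Nth". All function names are illustrative.

```python
from collections import deque


def reorder_with_table(outputs, table):
    """General case: table[x] is the destination slot of sender x's message."""
    inputs = [None] * len(outputs)
    for sender, msg in enumerate(outputs):
        inputs[table[sender]] = msg
    return inputs


def move_nth_to_first(outputs):
    """Side-buffer style reorder for sigma(x) = (x + 1) mod N."""
    buf = deque(outputs)
    buf.rotate(1)      # last element becomes the first
    return list(buf)


def move_first_to_nth(outputs):
    """Side-buffer style reorder for the opposite ring direction."""
    buf = deque(outputs)
    buf.rotate(-1)     # first element becomes the last
    return list(buf)


n = 4
outputs = [f"m{x}" for x in range(n)]
forward = [(x + 1) % n for x in range(n)]
backward = [(x - 1) % n for x in range(n)]
print(reorder_with_table(outputs, forward), move_nth_to_first(outputs))
print(reorder_with_table(outputs, backward), move_first_to_nth(outputs))
```

For ring-style permutations only one message per round is out of place, which is why holding a single element aside is enough; an arbitrary permutation needs the full table lookup.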
![Page 25: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/25.jpg)
25
Torus/Mesh Topology Multiplexed
Mesh: Don’t transmit on non-existent links
![Page 26: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/26.jpg)
26
Dealing with Heterogeneous Networks
Compose “Mux Ports” with Permutation Ports
In paper: generalize to any topology
![Page 27: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/27.jpg)
27
Putting It All Together
![Page 28: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/28.jpg)
28
Typical HAsim Model Leveraging these Techniques
• 16-core chip multiprocessor
• 10-stage pipeline (speculative, bypassed)
• 64-bit Alpha ISA, floating point
• 8 KB lockup-free L1 caches
• 256 KB 4-way set associative L2 cache
• Network: 2 v. channels, 4 slots, x-y wormhole
[Figure: pipeline stages F, BP1, BP2, PCC, IQ, D, X, DM, CQ, C; supporting modules ITLB, I$, DTLB, D$, L/S Q, L2$, Route]
• Single detailed pipeline, 16-way time-multiplexed
• 64-bit Alpha functional partition, floating point
• Caches modeled with different cache hierarchy
• Single router, multiplexed, 4 permutations
[Chart: Synthesis Results, percentage of Xilinx V5 330T. Categories: Regs, LUTs, BRAM; vertical axis 0% to 100%; series: LEAP, Func, OCN, L1/L2, Core.]
![Page 29: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/29.jpg)
29
Time-Multiplexed Multicore Simulation Rate Scaling
Best Worst Avg
FMR 15.7 27.1 18.4
![Page 30: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/30.jpg)
30
Time-Multiplexed Multicore Simulation Rate Scaling
Best Worst Avg
FMR Per-Core 5.4 14.4 8.95
![Page 31: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/31.jpg)
31
Time-Multiplexed Multicore Simulation Rate Scaling
Best Worst Avg
FMR Per-Core 8.5 13.5 11.6
![Page 32: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/32.jpg)
32
Time-Multiplexed Multicore Simulation Rate Scaling
Best Worst Avg
FMR Per-Core 8.45 19.8 11.5
![Page 33: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/33.jpg)
33
Takeaways
The Latency-Insensitive approach provides a unified approach to interesting tradeoffs
Serialization: Leverage FPGA-efficient circuits at the cost of FMR
• A-Port-based synchronization can amortize cost by giving dynamic average
• Especially if long events are rare
Time-Multiplexing: Reuse datapaths and only duplicate state
• A-Port based approach means not all modules are fully utilized
• Increased utilization means that performance degradation is sublinear
• Time-multiplexing the on-chip network requires permutations
![Page 34: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/34.jpg)
34
Next Steps
Here we were able to push one FPGA to its limits
What if we want to scale farther?
Next, we’ll explore how latency-insensitivity can help us scale to multiple FPGAs with better performance than traditional techniques
Also how we can increase designer productivity by abstracting the platform
![Page 35: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/35.jpg)
![Page 36: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/36.jpg)
36
Resynchronizing Ports
Modules follow modified scheme:
• If any incoming port is heavy, or any outgoing port is light, simulate next cycle (when ready)
• Result: balanced w/o centralized coordination
Argument:
• Modules farthest ahead in time will never proceed
• Ports in (out) of this set will be light (resp. heavy)
– Therefore those modules will try to proceed, but may not be able to
• There’s also a set farthest behind in time
– Always able to proceed
– Since graph is connected, simulating only enables modules, makes progress towards quiescence
![Page 37: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/37.jpg)
37
Other Topologies
Tree
[Figure: seven routers (Router0 to Router6) in a tree, mapped onto one Physical Router. Permutation vectors shown: [1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 1], [2, 0, 1, 0, 1, 0, 1], [0, 1, 2, 1, 2, 1, 2], [0, 0, 0, 1, 1, 1, 1].]
Butterfly
[Figure: twelve routers (Router0 to Router11) connecting From Core 0..7 to To Core 0..7, mapped onto one Physical Router with From Physical Core and To Physical Core ports. Permutation vectors shown: [0, 0, 1, 1, 0, 1, 0, 1, 2, 2, 2, 2], [1, 1, 2, 2, 1, 2, 1, 2, 0, 0, 0, 0], [0, 1, 0, 1, 0, 1, 0, 1], [0, 1, 0, 1, 0, 1, 0, 1], [2, 2, 2, 2, 0, 0, 1, 1, 0, 1, 0, 1], [2, 2, 2, 2, 0, 0, 1, 1, 0, 1, 0, 1].]
![Page 38: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/38.jpg)
38
Generalizing OCN Permutations
• Represent model as Directed Graph G=(M,P)
• Label modules M with simulation order: 0..(N-1)
• Partition ports into sets P0..Pm where:
– No two ports in a set Pm share a source
– No two ports in a set Pm share a destination
• Transform each Pm into a permutation σm
– Forall {s, d} in Pm, σm(s) = d
– Holes in range represent “don’t cares”
– Always send NoMessage on those steps
• Time-Multiplex module as usual
– Associate each σm with a physical port
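The partitioning step can be sketched with a greedy pass in Python. This is one simple way to satisfy the two constraints above and is not claimed to be the method used in the paper or to produce the fewest sets; the edge list and the names `partition_ports` and `to_permutation` are chosen for illustration.

```python
def partition_ports(ports):
    """Greedily split (source, destination) ports into sets in which
    no two ports share a source and no two share a destination."""
    sets = []
    for src, dst in ports:
        for group in sets:
            if all(src != s and dst != d for s, d in group):
                group.append((src, dst))
                break
        else:
            sets.append([(src, dst)])   # no existing set fits: open a new one
    return sets


def to_permutation(group, n):
    """Turn one port set into sigma as a list; None marks a "don't care" hole,
    i.e. a step on which NoMessage is sent."""
    sigma = [None] * n
    for src, dst in group:
        sigma[src] = dst
    return sigma


# Illustrative six-module graph, modules labeled 0..5 in simulation order.
ports = [(1, 0), (3, 1), (5, 1), (1, 2), (2, 3), (4, 0), (0, 4), (4, 1)]
for m, group in enumerate(partition_ports(ports)):
    print(f"P{m}", group, to_permutation(group, 6))
```

Each resulting list `sigma` would then be attached to one physical port of the time-multiplexed module. Module 1 is the destination of three ports in this graph, so at least three sets are needed here.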
![Page 39: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/39.jpg)
39
Example: Arbitrary Network
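[Figure: an arbitrary network of six modules (0 to 5) with multiplexed modules A, B, and C. Its ports are partitioned into sets P0, P1, P2; the (source, destination) pairs shown are (1, 0), (3, 1); (5, 1); (1, 2), (2, 3), (4, 0); (0, 4), (4, 1).]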
![Page 40: HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing](https://reader035.vdocuments.us/reader035/viewer/2022062501/568162b0550346895dd3379c/html5/thumbnails/40.jpg)
40
Results: Multicore Simulation Rate
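| | FMR Min | FMR Max | FMR Avg | Simulation Rate Min | Simulation Rate Max | Simulation Rate Avg |
|---|---|---|---|---|---|---|
| Overall | 16 | 218 | 80 | 160 kHz | 3.2 MHz | 625 kHz |
| Per-Core | 5 | 27 | 11 | 1.84 MHz | 9.5 MHz | 4.54 MHz |

• Must simulate multiple cores to get full benefit of time-multiplexed pipelines
• Functional cache pressure is the rate-limiting factor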