Hardware acceleration of sequential loops
ALF team INRIA Rennes – Bretagne Atlantique
December 12, 2011
Pierre Michaud
Outline
1. Sequential accelerators
2. Hardware acceleration of sequential loops
Monster core
• Let’s assume we are given a large silicon area and power budget, and we seek to maximize single-thread performance
• What silicon area and power would we need to double the performance of a 4-wide superscalar core?
  – 4 times the area and power of a 4-wide superscalar?
  – 8 times?
• What about a monster core 128 times bigger than a 4-wide superscalar?
The multi-core era
Small single-thread performance improvement
2005: superscalar ➔ 2009: 4 cores
2013: 16 cores
The many-core era
2017: 64 cores?
2021: 256 cores?
What do we choose for 2021?
Do we prefer doubled parallel performance (maybe) or doubled sequential performance?
[Figure: chip with a monster core]
How do we build a monster core?
• 8-wide superscalar?
  – Probably the first step, but that does not make a monster core
• 16-wide superscalar?
  – Never implemented so far
  – Can we increase the IPC without impacting the clock cycle?
• Huge NUCA cache?
• What else?
Sequential accelerator
• In the remainder, “core” = “superscalar core”
• A sequential accelerator may consist of several physical cores
• A sequential accelerator is specialized for sequential execution
Monster core = sequential accelerator (the better term)
Proposition A
• Build a very aggressive superscalar core
  – 8-wide, high clock frequency
• Replicate that core (e.g., 8 times)
• Use a single core at a time, power-gate the unused ones
• Migrate the execution periodically to maintain temperature at a safe level
  – P. Michaud, Y. Sazeides, A. Seznec, “Proposition for a sequential accelerator in future general-purpose many-core processors and the problem of migration-induced cache misses”, Computing Frontiers 2010
Proposition B
• Hardware acceleration of sequential loops
• The focus of this presentation
– P. Michaud, “Hardware acceleration of sequential loops”, INRIA technical report, 2011.
– http://hal.inria.fr/hal-00641350
Hardware acceleration of sequential loops
Dynamic loop
• Dynamic loop = periodic sequence of dynamic instructions
• Dynamic loop ≠ program loop
  – A program loop does not always generate a dynamic loop
    • control-flow behavior inside the loop body may be more or less random
    • e.g., “if” statements inside the loop body
  – A dynamic loop may result from a recursive function
Idea
• Why not build a hardware accelerator for dynamic loops?
  – Loop detector = hardware mechanism for detecting dynamic loops
  – When a dynamic loop is encountered, migrate the execution to the accelerator
  – On a loop exit condition (e.g., a branch changes direction), migrate the execution back to the superscalar core
Let’s kill the suspense
[Figure: percentage of performance increase over baseline, per benchmark]
Definitions
• loop length = sequence length (in instructions)
• loop body size B = smallest period
• loop body = first B instructions
Example: ABCDE ABCDE ABCDE ABCDE ➔ body = ABCDE (B = 5), length = 20
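In software terms, the body size B is simply the smallest period of the retired-instruction sequence. A minimal sketch of the definitions above, with instructions modeled as characters (names are illustrative):

```python
# Minimal sketch of the definitions above; instructions are modeled as
# characters and all names are illustrative.

def smallest_period(seq):
    """Smallest B such that seq[i] == seq[i - B] for all i >= B."""
    n = len(seq)
    for b in range(1, n + 1):
        if all(seq[i] == seq[i - b] for i in range(b, n)):
            return b
    return n

trace = list("ABCDE" * 4)    # dynamic loop of length 20
B = smallest_period(trace)   # loop body size = 5
body = trace[:B]             # loop body = first B instructions
```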
Some questions
• What fraction of their time do programs spend in dynamic loops?
• Are dynamic loops long enough to amortize the penalty of migrating the execution to and from the accelerator?
• What is the typical loop body size?
• ➔ First, let’s define the loop detector
Loop detector
• Loop detector = loop monitor + loop table
• Loop monitor
  – Observes instructions as they retire from the processor reorder buffer
  – Detects long dynamic loops
    • when loop length > 900, count subsequent instructions as loop instructions, until periodic behavior ends
• Loop table
  – Records information about loops already encountered
Remarks
• A loop body may contain several instances of the same static instruction
  – Tiny loop unrolled, same function called multiple times, …
• There is a backward jump somewhere in the loop body
• A loop body may contain several backward jumps
Loop monitor
• Associate a monitor with each static backward jump
• For each retired instruction
  – Update a signature = hash of instruction PCs
  – Increment the body size
  – If body size > MaxBodySize, close the monitor
• Each time we encounter that same backward jump
  – Compare the signature with the body signature
  – If signatures match, add the body size to the loop length
  – Reset the signature and body size
• Monitor several backward jumps simultaneously
  – 4 monitors are enough in practice; the one that first reaches 900 “wins”
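The per-backward-jump monitor above can be sketched in software. Everything here is a hypothetical model: `MAX_BODY_SIZE` and `LOOP_THRESHOLD` are illustrative values (the talk only names a MaxBodySize parameter and the 900 threshold), and the hash is a stand-in for whatever signature function the hardware uses.

```python
# Hypothetical model of one per-backward-jump loop monitor.
MAX_BODY_SIZE = 256     # illustrative value
LOOP_THRESHOLD = 900    # loop length above which a loop counts as "long"

class Monitor:
    def __init__(self, jump_pc):
        self.jump_pc = jump_pc
        self.body_signature = None   # signature of the reference body
        self.signature = 0           # hash of PCs since the last jump
        self.body_size = 0
        self.loop_length = 0
        self.closed = False

    def retire(self, pc, is_backward_jump):
        if self.closed:
            return
        self.signature = (self.signature * 31 + pc) & 0xFFFFFFFF
        self.body_size += 1
        if self.body_size > MAX_BODY_SIZE:
            self.closed = True
            return
        if is_backward_jump and pc == self.jump_pc:
            if self.body_signature is None:
                self.body_signature = self.signature  # first body seen
                self.loop_length = self.body_size
            elif self.signature == self.body_signature:
                self.loop_length += self.body_size    # periodic: extend loop
            else:
                self.closed = True                    # periodic behavior ended
                return
            self.signature = 0
            self.body_size = 0
```

A loop is reported as long once `loop_length` exceeds `LOOP_THRESHOLD`.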
Loop table
• If the loop length was greater than 900, record an entry for that loop in the loop table
• A loop is identified by its body signature
• Next time we encounter that loop, immediately start counting instructions as loop instructions
  – don’t wait until 900 instructions have been retired
Simulations
• Trace-driven, based on Pin (x86)
• Nehalem-like microarchitecture (roughly)
• Hardware cache prefetchers
  – L1, L2, L3
• Benchmarks: SPEC CPU2006
  – 1 trace per benchmark
  – 1 trace = 40 samples of 50 million instructions (total = 2 billion)
  – Samples regularly spaced throughout the whole benchmark execution
CINT2006 loop behavior
[Figure: per-benchmark bar chart – percentage of dynamic instructions in loops, percentage of clock cycles in loops, maximum body size; average loop lengths annotated (18000, 5000, 8000, 5000, …)]
CFP2006 loop behavior
[Figure: same chart for CFP2006 – average loop lengths annotated range from 2000 up to 6M]
Summary
• Long dynamic loops represent a significant part of the execution for half of the CPU2006 benchmarks
  – Up to 90%
• The typical loop length is several thousand instructions
• Body sizes are diverse, from a few tens up to a few hundred instructions
• ➔ Worth trying to accelerate loops
  – All loops, whether with a small or a large body
How to accelerate loops?
• Modify a superscalar core?
  – instruction fetch (done), register renaming, what else?
• Loop accelerator
  – Sits beside the superscalar core
  – Specialized for executing periodic sequences of instructions
• Try to remove repetitive work from the loop execution ➔ do that work before executing the loop
  – E.g., register dependency analysis
  – When possible, store information in the loop table for future occurrences of that loop
Global view
[Figure: global view – superscalar core (DL1, arch. regs), Loop Detector fed by retired micro-ops, Loop Builder, Loop Accelerator, L2 and L3 caches]
Transitions
[Figure: mode transitions]
Normal mode: execute on the core; when a long loop is detected, enter build mode.
Build mode: for X loop iterations, still execute on the core and prepare the loop; then flush the reorder buffer, read the loop input registers, and migrate the execution to the accelerator.
Acceleration mode: execute on the loop accelerator; on a loop exit (or an early loop exit), update the architectural registers and migrate back.
Proposed loop accelerator
• Get rid of the main superscalar bottlenecks
  – Register renaming, dynamic scheduling, bypass network, …
• Scheduling is done during build mode
  – Each static micro-op is mapped to an Execution Unit (EU)
  – All dynamic instances of that micro-op will be executed by that EU, in program order
• Static micro-ops communicate through FIFO queues
  – No registers
  – Data-flow execution
Cyclic register dependency graph
Example loop ➔ body = 5 micro-ops: A B C D E A B C D E A B C …
[Figure: cyclic register dependency graph over micro-ops A–E – cycles in the graph = long dependency chains; some dependencies are latency-tolerant]
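One standard way to quantify the cost of cycles in the CRDG (not stated in the talk, but the classic recurrence-constrained minimum initiation interval from software pipelining) is the maximum over all cycles of (total latency) / (total iteration distance); the accelerator cannot sustain more than one iteration per that many clock cycles. A brute-force sketch for tiny graphs, with an assumed edge list:

```python
from itertools import permutations

# edges: (src, dst, latency, distance) -- distance = number of loop
# iterations the dependency spans (0 = same iteration, 1 = next iteration)
edges = [("A", "B", 1, 0), ("B", "C", 1, 0), ("C", "A", 1, 1),  # 3-op recurrence
         ("C", "D", 1, 0), ("D", "E", 1, 0)]

def recurrence_mii(edges, nodes):
    """Max over all simple cycles of (total latency) / (total distance)."""
    adj = {(s, d): (lat, dist) for s, d, lat, dist in edges}
    best = 0.0
    # brute force over node orderings -- fine for tiny example graphs only
    for r in range(2, len(nodes) + 1):
        for cyc in permutations(nodes, r):
            lat = dist = 0
            ok = True
            for i in range(r):
                e = adj.get((cyc[i], cyc[(i + 1) % r]))
                if e is None:
                    ok = False
                    break
                lat += e[0]
                dist += e[1]
            if ok and dist > 0:
                best = max(best, lat / dist)
    return best

mii = recurrence_mii(edges, list("ABCDE"))  # cycle A->B->C->A: latency 3, distance 1
```

Here the recurrence A → B → C → A limits the loop to one iteration every 3 cycles, however many EUs are available.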
Constraint edges
[Figure: constraint edges added to the CRDG – dynamic instances of a static micro-op execute in program order; this does not increase the length of the longest chain]
Proposed loop accelerator (cont’d)
[Figure: ring of 8 nodes – 4 I-nodes, 2 F-nodes, 1 L-node, 1 S-node; 4 EUs per node; 4-lane pipelined ring]
Proposed loop accelerator (cont’d)
I-nodes execute 1-cycle micro-ops (ALU, branch, address generation, ...)
Proposed loop accelerator (cont’d)
F-nodes execute multi-cycle micro-ops (FP add, FP mul, int mul, …)
Proposed loop accelerator (cont’d)
The L-node executes loads
Proposed loop accelerator (cont’d)
The S-node executes stores
Remarks
• Implement only a subset of the ISA
  – Some operations are rare in the loops we have found
  – E.g., integer division
• Build mode determines whether the loop can be executed by the accelerator
• Floating-point divisions are not rare
  – One specific I-node can be configured during build mode to execute FP divs only
I-node / F-node
[Figure: I-node / F-node internals – 4 lanes (lane 0 to lane 3); each lane has an Input Queue (IQ), an EU with a Cyclic Execution Controller (CEC), operand queues, an Output Queue (OQ), and a Loop Exit Queue (LEQ); a distributor connects the four lanes]
Lane 0
Up to 12 static micro-ops may be mapped on the same EU.
The Cyclic Execution Controller (CEC) schedules them in program order.
Lane 0
Dependent micro-ops can be mapped on the same EU.
Lane 0
The Input Queue (IQ) buffers copies of values traveling on the lane.
The IQ must never get full ➔ early loop exit when this happens.
Lane 0
Values coming from other nodes are stored in one of 12 Operand Queues.
A given queue holds up to 32 values from a single static micro-op.
Lane 0
Values produced by the EU are sent to other nodes through the Output Queue (OQ).
Lane 0
Thanks to the distributor, copies of values produced on a lane can be sent to EUs on other lanes.
Lane 0
The Loop Exit Queue (LEQ) contains the last 128 values produced by the EU ➔ used for updating registers after a loop exit.
L-node, lane 0
[Figure: L-node, lane 0 – Load Issue Buffer (LIB), CEC, operand queues, Load Output Queue (LOQ), LEQ, 4-port DL1 cache with MSHRs]
L-node, lane 0
The load address comes from an I-node.
L-node, lane 0
The 64-entry Load Issue Buffer (LIB) ➔ one load selected per cycle.
L-node, lane 0
The 128-entry Load Output Queue (LOQ) sends data to lane 0 in program order.
S-node, lane 0
[Figure: S-node, lane 0 – EU with CEC, operand queues, and a 64-entry Loop Store Queue (LSQ) receiving the store address and store data]
S-node, store validator
[Figure: the four lanes’ LSQs feed a Store Validator in front of the DL1 cache]
The Store Validator validates stores in program order; a validated store can write into the DL1.
The global synchronizer updates an executed-iteration count that tells which stores can be validated.
Global synchronization
• At a given time, different EUs may be in different loop iterations
• Make sure that no store is validated before older instructions have been executed
• When a loop exit occurs, make sure that the values needed to update the architectural registers have not been pushed out of the Loop Exit Queue
  – Prevent an EU from going too far ahead of the others
Global synchronizer
[Figure: the synchronizer sits at the center of the ring, connected to every node]
Global synchronization (cont’d)
• Each EU has an iteration counter (IC)
• An EU freezes when IC = ICMAX
  – ICMAX is not the same for all EUs
  – For the S-node, ICMAX = ICMIN
  – For the other nodes, ICMAX ≥ ICMIN
• When its IC reaches ICMIN, the EU sends a signal to the synchronizer
• After the synchronizer has received a signal from each EU, it sends back an unfreeze signal to all EUs
• Upon receiving the unfreeze signal, each EU subtracts ICMIN from its IC, and the store validator adds ICMIN to the executed iteration count
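The protocol above behaves like a barrier. This is a hypothetical software sketch under simplifying assumptions: ICMIN = 16 is an illustrative value, signal latencies are ignored, and the class names are invented.

```python
# Hypothetical model of the global synchronization protocol described above.
ICMIN = 16   # illustrative value

class EU:
    def __init__(self, icmax):
        self.ic = 0              # iteration counter (IC)
        self.icmax = icmax       # S-node: ICMAX = ICMIN; other nodes: ICMAX >= ICMIN
        self.signalled = False

    def finish_iteration(self, sync):
        if self.ic == self.icmax:
            return False         # frozen until the unfreeze signal arrives
        self.ic += 1
        if self.ic == ICMIN and not self.signalled:
            self.signalled = True
            sync.signal()        # tell the synchronizer we reached ICMIN
        return True

    def unfreeze(self):
        self.ic -= ICMIN         # subtract ICMIN from the IC
        self.signalled = False

class Synchronizer:
    def __init__(self, eus):
        self.eus = eus
        self.pending = 0
        self.executed_iterations = 0   # read by the store validator

    def signal(self):
        self.pending += 1
        if self.pending == len(self.eus):   # heard from every EU
            self.pending = 0
            self.executed_iterations += ICMIN
            for eu in self.eus:
                eu.unfreeze()
```

With one S-node EU (ICMAX = ICMIN) and faster EUs given more slack (ICMAX = 2·ICMIN), the executed-iteration count advances by ICMIN each time every EU has completed ICMIN iterations.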
Loop exit
• Typical loop exit: a branch changes its direction
• An EU that detects a loop exit condition sends a signal to the synchronizer
  – The signal contains the loop exit point = the IC value and the micro-op rank inside the body
• The synchronizer forwards the loop exit point to all the EUs and to the store validator
  – EUs freeze when they reach the loop exit point
  – Store validation stops at the loop exit point
Memory dependencies
• The instruction window can be very large
  – example: body size = 100, ICMAX = 32 ➔ 3200 instructions
• Exploit loop properties
  – spatial locality is often very good
  – constant-stride accesses are frequent
  – a store-load dependency generally repeats on each iteration
Memory independence checker
• Accessed by loads and stores
• At store validation, tells if the store might conflict with an already-executed load
  – exploits spatial locality and constant-stride accesses to keep structures reasonably sized
  – see the tech report for details
• If the answer is no, everything is OK
• If the answer is yes, there may be a memory-order violation ➔ trigger an early loop exit
  – the store is a safe loop exit point
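As one illustration of why constant-stride accesses help keep the structures small, a checker can summarize each static load as a stride stream rather than storing every individual address. This is not the tech report’s design; `LoadStream` and the 8-byte access size are assumptions:

```python
# Hypothetical checker structure (not the tech report's design): each static
# load is summarized as a constant-stride address stream.

class LoadStream:
    """Addresses of one static load: base + i * stride (8-byte accesses assumed)."""
    def __init__(self, base, stride, size=8):
        self.base, self.stride, self.size = base, stride, size
        self.executed = 0        # dynamic instances already executed

    def may_conflict(self, store_addr, store_size=8):
        # conservative byte-range overlap test against every executed instance
        for i in range(self.executed):
            a = self.base + i * self.stride
            if a < store_addr + store_size and store_addr < a + self.size:
                return True
        return False
```

At store validation, the store address would be checked against every stream; any True answer triggers an early loop exit. A real checker would test the stride arithmetic directly instead of looping over instances.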
Store-load bypassing
[Figure: store-load bypassing – during build mode, the loop body is transformed: for a predicted store-load dependency (store address S, load address L, store data Y), mov micro-ops forward the stored value directly, and a ‘check’ micro-op verifies the prediction]
Memory dependency (cont’d)
• Assumes the distance between the store and the load is less than one loop body
  – sufficient in most cases
  – a similar method can be used for distances between 1 and 2 bodies
    • beneficial only for 456.hmmer
• The ‘check’ micro-op is executed by an I-node; it triggers an early loop exit if it fails
• Must verify that stores between the dependent store-load pair do not conflict
• A Bypassed-Store Table is accessed by each store at validation
  – one entry per bypassed store
  – if a conflict is detected ➔ early loop exit
Loop body reduction
• Micro-ops with no input dependencies inside the loop body produce the same result on each iteration
• may remove these redundant micro-ops during build mode
• Recursive definition : micro-ops depending only on redundant micro-ops are themselves redundant – they produce constant results after a fixed number of iterations
• Keep redundant stores (memory consistency)
• Can remove redundant loads, but record their address in a Removed-Loads Table check conflicts at store validation
• Typically, between 10% and 20% of micro-ops can be removed
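The recursive definition amounts to a fixpoint computation over the body's dependency graph — a minimal sketch, assuming a hypothetical representation where each micro-op maps to the set of in-body micro-ops it reads:

```python
def redundant_ops(deps):
    """deps maps each micro-op to the set of in-body micro-ops it reads.
    Micro-ops with no in-body inputs are redundant; micro-ops whose
    inputs are all redundant are redundant too (iterate to a fixpoint)."""
    redundant = set()
    changed = True
    while changed:
        changed = False
        for op, inputs in deps.items():
            if op not in redundant and inputs <= redundant:
                redundant.add(op)
                changed = True
    return redundant
```

In the accelerator, stores would be excluded from removal and removed loads recorded in the Removed-Loads Table, as the slide states.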
61
![Page 62: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/62.jpg)
Static mapping
• On which EU do we put a static micro-op?
• Very important for performance
62
![Page 63: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/63.jpg)
Cyclic register dependency graph
63
[Diagram: cyclic register dependency graph (CRDG) over micro-ops A, B, C, D, E]
If we put micro-ops A and C on the same EU, we create an artificial CRDG cycle of size 3
![Page 64: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/64.jpg)
Artificial CRDG cycle
64
[Diagram: micro-ops A, B, C mapped onto the ring of execution units — I-nodes, F-nodes, an L-node and an S-node]
An artificial CRDG cycle may increase the loop execution time considerably
When possible, try to avoid creating artificial CRDG cycles
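One way to see why co-location creates a cycle: an EU executes its static micro-ops in a fixed order, so mapping two micro-ops onto the same EU adds a serialization edge between them. A simplified sketch with standard DFS cycle detection (the graph shapes are illustrative, not taken from the slides):

```python
def has_cycle(nodes, edges):
    """Detect a cycle in a directed dependency graph (3-color DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}
    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, ()):
            if color[m] == GRAY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in nodes)

# Register dependencies A -> B -> C: acyclic on their own.
dep = {'A': ['B'], 'B': ['C']}
# Putting A and C on the same EU serializes them (extra edge C -> A),
# closing an artificial cycle of size 3.
colocated = {'A': ['B'], 'B': ['C'], 'C': ['A']}
```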
![Page 65: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/65.jpg)
EU sharing
65
[Diagram: unbalanced mapping on the ring — micro-ops A, B, C share one EU while D and E share another]
The iteration rate is limited by the most loaded EU
![Page 66: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/66.jpg)
EU sharing
66
[Diagram: rebalanced mapping — micro-ops A, B / C / D, E spread across EUs]
The iteration rate is limited by the most loaded EU
Try to balance the number of micro-ops per EU
![Page 67: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/67.jpg)
Lane segment sharing
67
[Diagram: micro-ops A, B and C, D, E mapped so that their dependencies crowd a few lane segments]
The iteration rate is limited by the most loaded lane segment
![Page 68: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/68.jpg)
Lane segment sharing
68
[Diagram: remapped micro-ops A, B / C, D / E spreading their dependencies over the lanes]
The iteration rate is limited by the most loaded lane segment
Try to balance the number of dependencies per lane segment
![Page 69: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/69.jpg)
Mapping heuristic
• Try to balance EU sharing
  – important for loops with a small body
• Try to detect natural and artificial CRDG cycles
  – artificial CRDG cycles can be avoided if detected
  – for natural cycles, try to put the micro-ops on the same EU
• Try to put dependent micro-ops on the same EU
  – such a dependency does not use any lane segment
• Try to minimize the number of lane segments used
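As a baseline for the load-balancing criterion alone, a greedy least-loaded assignment could look as follows (a hypothetical sketch — it ignores the CRDG-cycle and lane-segment criteria above; the cap of 12 micro-ops per EU matches the simulated configuration):

```python
def greedy_map(ops, n_eus, max_per_eu=12):
    """Assign each micro-op to the currently least-loaded EU.
    Returns None if the loop body does not fit on the accelerator."""
    load = [0] * n_eus
    mapping = {}
    for op in ops:
        eu = min(range(n_eus), key=load.__getitem__)  # least-loaded EU
        if load[eu] >= max_per_eu:
            return None                               # body too large
        mapping[op] = eu
        load[eu] += 1
    return mapping
```

A real mapping heuristic would weigh all four goals at once; pure load balancing is only the starting point.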
69
![Page 70: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/70.jpg)
Configuration simulated
• Loop detector: MaxBodySize = 128
• Build mode: 5 iterations
• Transition penalty
  – loop table hit: 100 cycles
  – loop table miss: 10000 cycles
• Loop accelerator
  – 4 lanes, 8 nodes
  – 4 loads / cycle
  – 2 stores validated / cycle
  – up to 12 static micro-ops per EU
  – 12 operand queues per EU (queue depth = 32)
  – load issue buffer: 64 loads
  – unfreeze signal latency = 4 cycles
70
![Page 71: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/71.jpg)
Performance
71
[Graph: percentage of performance increase over baseline]
![Page 72: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/72.jpg)
Local loop speedup
72
[Graph: local loop speedup per benchmark — values range from 1.3 to 2.7 (1.4, 2.2, 1.8, 1.6, 2.7, 2.1, 1.8, 1.3, 1.4, 1.3)]
![Page 73: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/73.jpg)
Impact of disabling features
73
[Graph: percentage of performance loss when disabling a single feature]
![Page 74: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/74.jpg)
Some future research
• How does performance scale if we increase the number of nodes and/or the number of lanes?
• Can we over-clock the loop accelerator?
• Is a pipelined ring the best choice?
• Is it a good choice not to provide any direct data path between EUs on the same node?
• What about applications other than the SPECs?
• How could a compiler exploit a hardware loop accelerator?
74
![Page 75: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/75.jpg)
Poor man’s loop acceleration ?
• Some ideas may be useful even if you don't buy the loop accelerator
  – loop detector
  – loop body reduction
  – store-load bypassing
• You pay the overhead only once for several thousand instructions
• Can we improve dynamic instruction scheduling in a superscalar core if we know that we are in a dynamic loop?
75
![Page 76: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/76.jpg)
Conclusions
• Long dynamic loops are frequent
  – more than 30% of all the instructions executed by the SPEC CPU2006 suite
• Try doing the repetitive work only once, before executing the loop
• The proposed loop accelerator avoids some important superscalar bottlenecks
  – no registers, no register renaming
  – no dynamic scheduling
  – huge instruction window, latency tolerance
• This is preliminary research; many questions are still pending
76
![Page 77: Hardware acceleration of sequential loops · Monster core • Let’s assume we are given a large silicon area and power ... – E.g., register dependency analysis – When possible,](https://reader034.vdocuments.us/reader034/viewer/2022050205/5f5824310858ab78636bfee7/html5/thumbnails/77.jpg)
Questions?
77