1
QUIZ
2
Question
• 1) According to the study on “Simultaneous Timing Driven Clustering and Placement for FPGAs”, what is a fragment level move, and which drawbacks of the traditional FPGA CAD flow are targeted with the fragment level moves?
3
BSPlace: A BLE Swapping technique for placement
04.11.2014
Minsik Hong
George Hwang
Hemayamini Kurra
Minjun Seo
4
Outline
• SCPlace
  • Introduction
  • Algorithm flowchart
  • Net Counting Algorithm
  • Results
• BSPlace
  • Algorithm
  • Demo
• Backup Slides
  • If you guys ask minimal questions we can cover more
  • Net Weighting
  • VPR Datastructures
5
Rajavel, Senthilkumar Thoravi, and Ali Akoglu. "MO-Pack: Many-objective clustering for FPGA CAD." Proceedings of the 48th Design Automation Conference. ACM, 2011.
6
Simultaneous timing driven clustering and placement for FPGAs.
Chen, Gang, and Jason Cong. Field Programmable Logic and Application. Springer Berlin Heidelberg, 2004. 158-167.
7
Key concept
• Fragment level move
  • BLE to a new CLB
  • Check for valid CLB configuration
  • Feasibility (number of BLEs and input pins)
  • Update the cost function
• Block level move
  • CLB to CLB
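The feasibility check behind a fragment level move can be sketched as follows. This is a minimal illustration, not the SCPlace implementation: the `Ble` and `Clb` classes, the limits `n_max` (BLEs per CLB) and `i_max` (CLB input pins), and the rule that a net driven inside the CLB does not consume an input pin are all assumptions made for the example. The cost function update is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ble:
    """Hypothetical BLE: the nets it reads and the single net it drives."""
    name: str
    inputs: frozenset
    output: str


@dataclass
class Clb:
    """Hypothetical CLB: just a list of the BLEs packed into it."""
    bles: list = field(default_factory=list)

    def external_inputs(self, extra=None):
        """Nets read by the cluster that are not driven inside the cluster."""
        members = self.bles + ([extra] if extra is not None else [])
        driven = {b.output for b in members}
        used = set()
        for b in members:
            used |= b.inputs
        return used - driven


def can_accept(clb, ble, n_max, i_max):
    """Feasibility of adding `ble` to `clb`: BLE count, then input pins."""
    if len(clb.bles) + 1 > n_max:
        return False
    return len(clb.external_inputs(extra=ble)) <= i_max


def fragment_move(src, dst, ble, n_max, i_max):
    """Move one BLE from `src` to `dst` only if `dst` stays a valid CLB."""
    if not can_accept(dst, ble, n_max, i_max):
        return False
    src.bles.remove(ble)
    dst.bles.append(ble)
    return True
```

A block level move needs no such check, because a whole CLB that was valid before the move stays valid after it.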
8
BLE Level Swapping
• Advantages
  • Fix packing issues during simulated annealing
  • Better congestion mitigation
  • Better routability
• Disadvantages
  • Speed
  • Complexity
9
SCPlace Algorithm
10
11
Additional feature of Journal version SCPlace
12
Use Novel net weighting
13
A novel net weighting algorithm for timing-driven placement
Kong, Tim Tianming. Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design. ACM, 2002.
14
Accurate All Path Counting
15
Calculate F(t)
[Figure: timing graph with edge delays a→c = 7, b→c = 5, c→d = 1, d→e = 5, d→f = 3, and ARR/REQ per node: a = 0/0, b = 0/2, c = 7/7, d = 8/8, e = 13/13, f = 11/13]
• a=2, T: the longest path delay (T = 13)
• Forward slack
  • Fs(a, c) = 7 – 0 – 7 = 0
  • Fs(b, c) = 7 – 0 – 5 = 2
  • Fs(c, d) = Fs(d, e) = Fs(d, f) = 0
• Discount
  • D{Fs(a, c), T} = D{0,13} = 1
  • D{Fs(b, c), T} = D{2,13} = 0.88
  • D{Fs(c, d), T} = D{0,13} = 1
  • D{Fs(d, e), T} = D{0,13} = 1
  • D{Fs(d, f), T} = D{0,13} = 1
• F(c) = F(c) + D{Fs(a, c), T} x F(a) + D{Fs(b, c), T} x F(b) = 0 + 1x1 + 0.88x1 = 1.88
• Result: F(a) = F(b) = 1, F(c) = F(d) = F(e) = F(f) = 1.88
16
Calculate B(s)
[Figure: timing graph a, b → c → d → e, f annotated with ARR/REQ (a = 0/0, b = 0/2, c = 7/7, d = 8/8, e = 13/13, f = 11/13) and the B values]
• a=2, T: the longest path delay (T = 13)
• Backward slack
  • Bs(d, e) = 13 – 5 – 8 = 0
  • Bs(d, f) = 13 – 3 – 8 = 2
  • Bs(a, c) = Bs(b, c) = Bs(c, d) = 0
• Discount
  • D{Bs(a, c), T} = D{0,13} = 1
  • D{Bs(b, c), T} = D{0,13} = 1
  • D{Bs(c, d), T} = D{0,13} = 1
  • D{Bs(d, e), T} = D{0,13} = 1
  • D{Bs(d, f), T} = D{2,13} = 0.88
• B(d) = B(d) + D{Bs(d, e), T} x B(e) + D{Bs(d, f), T} x B(f) = 0 + 1x1 + 0.88x1 = 1.88
• Result: B(e) = B(f) = 1, B(a) = B(b) = B(c) = B(d) = 1.88
17
Calculate AP(s, t) (a=2)
[Figure: timing graph annotated with F(s)/B(t) per node (a = 1/1.88, b = 1/1.88, c = 1.88/1.88, d = 1.88/1.88, e = 1.88/1, f = 1.88/1) and slack per edge ((a, c) = 0, (b, c) = 2, (c, d) = 0, (d, e) = 0, (d, f) = 2)]
• Discount
  • D{slack(a, c), T} = D{0,13} = 1
  • D{slack(b, c), T} = D{2,13} = 0.88
  • D{slack(c, d), T} = D{0,13} = 1
  • D{slack(d, e), T} = D{0,13} = 1
  • D{slack(d, f), T} = D{2,13} = 0.88
• AP(a,c) = F(a) x B(c) x D{slack(a, c), T} = 1 x 1.88 x 1 = 1.88
• AP(b,c) = F(b) x B(c) x D{slack(b, c), T} = 1 x 1.88 x 0.88 = 1.65
• Result: AP(a,c) = 1.88, AP(b,c) = 1.65, AP(c,d) = 3.53, AP(d,e) = 1.88, AP(d,f) = 1.65
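The three steps of the worked example (F, then B, then AP) can be reproduced with a short sketch. The discount values are copied from the slides as plain inputs, since the discount function itself is not reproduced here; the function names and the dictionary layout are assumptions made for illustration, and the edge list is assumed to be in topological order.

```python
# Edges of the example timing graph, listed in topological order.
EDGES = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]

# Discount values taken from the slides (a = 2, T = 13).
D_FORWARD = {("a", "c"): 1.0, ("b", "c"): 0.88, ("c", "d"): 1.0,
             ("d", "e"): 1.0, ("d", "f"): 1.0}
D_BACKWARD = {("a", "c"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0,
              ("d", "e"): 1.0, ("d", "f"): 0.88}
D_SLACK = {("a", "c"): 1.0, ("b", "c"): 0.88, ("c", "d"): 1.0,
           ("d", "e"): 1.0, ("d", "f"): 0.88}


def forward_counts(edges, discount, sources):
    """F(t) += D * F(s), sweeping edges from inputs to outputs."""
    f = {n: 0.0 for e in edges for n in e}
    for s in sources:
        f[s] = 1.0
    for s, t in edges:
        f[t] += discount[(s, t)] * f[s]
    return f


def backward_counts(edges, discount, sinks):
    """B(s) += D * B(t), sweeping edges from outputs to inputs."""
    b = {n: 0.0 for e in edges for n in e}
    for t in sinks:
        b[t] = 1.0
    for s, t in reversed(edges):
        b[s] += discount[(s, t)] * b[t]
    return b


def all_path_counts(edges, f, b, discount):
    """AP(s, t) = F(s) x B(t) x D{slack(s, t), T}."""
    return {(s, t): f[s] * b[t] * discount[(s, t)] for s, t in edges}


F = forward_counts(EDGES, D_FORWARD, ["a", "b"])
B = backward_counts(EDGES, D_BACKWARD, ["e", "f"])
AP = all_path_counts(EDGES, F, B, D_SLACK)
```

Rounded to two decimals, `AP` gives 1.88, 1.65, 3.53, 1.88 and 1.65 for the five edges, matching the figure.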
18
Results (Only use BLE swapping)
CLB = 4
19
Results (Only use BLE swapping)
20
Results (BLE + CLB swapping)
where 0 ≤ α ≤ 1
The number of CLB moves: The number of BLE moves:
21
Results (BLE + CLB swapping)
T-Vpack+VPR vs SCPlace (α=0.5)
22
BSPlace
23
BSPlace
• BLE Level Swapping within Simulated Annealing with Rent’s Rule
• Advantages
  • Fix packing issues as they occur.
  • Potentially better routability.
  • Potentially better congestion due to combination of placement and packing.
• Disadvantages
  • Execution time: we need to do memory allocation and deallocation for any BLE swapping.
  • Code complexity: VPR is complex. We spend a lot of time on debugging and testing instead of on algorithms.
24
Rent’s Rule Threshold Value
• Calculate the k value to get threshold
• Enter simulated annealing process
  • Outer loop process
  • Inner loop process
    • Choose random CLB to move from current position to another position
    • Check Rent’s Rule Threshold
    • If we get a better result for swap
      • Queue BLE Swapping
    • Otherwise
      • Do CLB swapping: use T-VPlace
  • Loop Through BLE Swapping
    • Do BLE Swap after checking whether swap overlaps with previous swap
    • Reallocate memory and return to outer loop
Pio kBT
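The last step of the flow, looping through the queued BLE swaps and skipping any swap that overlaps a previous one, might look like the sketch below. It assumes that "overlap" means two swaps share a BLE and that a swap is a pair of BLE identifiers; both are assumptions, and `apply_queued_swaps` is a hypothetical name rather than a VPR or BSPlace function.

```python
def apply_queued_swaps(queue):
    """Return the queued swaps that do not reuse an already swapped BLE.

    `queue` is a list of (ble_a, ble_b) pairs in the order they were queued.
    """
    touched = set()   # BLEs already moved by an earlier accepted swap
    applied = []
    for ble_a, ble_b in queue:
        if ble_a in touched or ble_b in touched:
            continue  # overlaps with a previous swap, so skip it
        touched.update((ble_a, ble_b))
        applied.append((ble_a, ble_b))
    return applied
```

For example, with the queue `[("b0", "b5"), ("b5", "b7"), ("b2", "b3")]` the second swap is dropped because `b5` was already moved by the first.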
25
Current Status
• Code
  • Created our own BLE swapping mechanism using VPR data structure.
  • We have a whole suite of test fixtures to test code.
  • Testing still continuing, but we are finding minimal issues.
  • We have done a swap within placement.
  • We have started to integrate our cost function.
• Validation
  • We intend to run VPR benchmarks. Our BLE swapping solution should be better than or the same as T-VPlace.
  • Our VPR benchmarks should also be comparable to IRAC.
26
Demo
The circuit below abstracts the MUX, switchboxes, and connection boxes. The connections represent the direct connections between BLEs in CLBs. Optimize this circuit by performing one BLE swap. Explain why your optimization will result in better performance.
Architecture Parameter
• K = 2
• I = 3
• N = 2
Measurement
• Critical Path Delay = 1.182 ns
28
Demo
29
Thanks.
30
Backup Slides
31
Impact of duplication on placement
Delay = 2 Delay = 1
32
A novel net weighting algorithm for timing-driven placement
Kong, Tim Tianming. Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design. ACM, 2002.
33
A Novel Net Weighting Algorithm
• Accurate path counting algorithm
  • The first known accurate path counting algorithm that considers all paths
  • Due to the exponential number of paths present in the circuit, accurate all path counting has been considered very difficult.
• Significant performance improvement
• Little loss in total wirelength
• No runtime overhead
34
A Novel Net Weighting Algorithm
• Consider the path sharing effect
  • If two critical paths share a common segment, the edges in the common segment should receive higher weights.
• Define two variables
  • Forward path F(p): the number of different critical paths starting from PI elements, terminating at p.
  • Backward path B(p): the number of different critical paths starting from PO elements, terminating at p, if we reverse all signal flow directions.
35
Background
36
Background
37
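Example
[Figure: timing of a circuit. Edge delays: a→c = 7, b→c = 5, c→d = 1, d→e = 5, d→f = 3]
• ARR(t): a = 0, b = 0, c = 7, d = 8, e = 13, f = 11
• REQ(s): a = 0, b = 2, c = 7, d = 8, e = 13, f = 13
• The longest path delay (T) = 13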
38
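Example
[Figure: timing graph with edge delays a→c = 7, b→c = 5, c→d = 1, d→e = 5, d→f = 3]
• Slack(s, t): (a, c) = 0, (b, c) = 2, (c, d) = 0, (d, e) = 0, (d, f) = 2
• ARR/REQ: a = 0/0, b = 0/2, c = 7/7, d = 8/8, e = 13/13, f = 11/13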
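The arrival times, required times and edge slacks of the example circuit can be recomputed with a small sketch. It assumes the usual static timing definitions (ARR as the longest path from a primary input, REQ propagated back from the longest path delay T, and slack(s, t) = REQ(t) – ARR(s) – d(s, t)); the function names are illustrative only.

```python
# Edge delays of the example circuit.
DELAY = {("a", "c"): 7, ("b", "c"): 5, ("c", "d"): 1,
         ("d", "e"): 5, ("d", "f"): 3}
ORDER = ["a", "b", "c", "d", "e", "f"]  # a topological order of the nodes


def arrival_times(delay, order):
    """ARR(t) = max over fan-in edges of ARR(s) + d(s, t); inputs start at 0."""
    arr = {n: 0 for n in order}
    for node in order:
        for (s, t), d in delay.items():
            if t == node:
                arr[node] = max(arr[node], arr[s] + d)
    return arr


def required_times(delay, order, t_max):
    """REQ(s) = min over fan-out edges of REQ(t) - d(s, t); outputs get T."""
    req = {n: t_max for n in order}
    for node in reversed(order):
        for (s, t), d in delay.items():
            if s == node:
                req[node] = min(req[node], req[t] - d)
    return req


def edge_slacks(delay, arr, req):
    """slack(s, t) = REQ(t) - ARR(s) - d(s, t)."""
    return {(s, t): req[t] - arr[s] - d for (s, t), d in delay.items()}


ARR = arrival_times(DELAY, ORDER)
T = max(ARR.values())            # the longest path delay
REQ = required_times(DELAY, ORDER, T)
SLACK = edge_slacks(DELAY, ARR, REQ)
```

With these inputs `T` is 13, and the only edges with nonzero slack are (b, c) and (d, f), each with slack 2.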
39
Example
• Path a → c → d → e: d(π) = 13, slack(π) = 0
• Path b → c → d → f: d(π) = 9, slack(π) = 4
• Path a → c → d → f: d(π) = 11, slack(π) = 2
• Path b → c → d → e: d(π) = 11, slack(π) = 2
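The path delays and path slacks listed for the example can be checked by enumerating every input-to-output path. The sketch below assumes that slack(π) = T – d(π), which agrees with the numbers shown; explicit enumeration is only practical for a toy circuit like this one, because the number of paths grows exponentially in general.

```python
# Edge delays of the example circuit.
DELAY = {("a", "c"): 7, ("b", "c"): 5, ("c", "d"): 1,
         ("d", "e"): 5, ("d", "f"): 3}


def enumerate_paths(delay, sources):
    """Map every source-to-sink path (a tuple of nodes) to its total delay."""
    succ = {}
    for (s, t), d in delay.items():
        succ.setdefault(s, []).append((t, d))
    paths = {}

    def walk(node, visited, total):
        if node not in succ:      # no fan-out: a primary output was reached
            paths[visited] = total
            return
        for nxt, d in succ[node]:
            walk(nxt, visited + (nxt,), total + d)

    for src in sources:
        walk(src, (src,), 0)
    return paths


def path_slacks(paths):
    """slack(path) = T - d(path), with T the longest path delay."""
    t_max = max(paths.values())
    return {p: t_max - d for p, d in paths.items()}


PATHS = enumerate_paths(DELAY, ["a", "b"])
PATH_SLACK = path_slacks(PATHS)
```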
40
Critical Path counting
41
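Calculate F(p)
[Figure: timing graph with edge delays a→c = 7, b→c = 5, c→d = 1, d→e = 5, d→f = 3, shown in three steps]
• Step 1: F(p) = 0 for every node
• Step 2: F(a) = F(b) = 1
• Step 3: F(a) = F(b) = 1, F(c) = F(d) = F(e) = F(f) = 2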
42
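Calculate B(p)
[Figure: timing graph with edge delays a→c = 7, b→c = 5, c→d = 1, d→e = 5, d→f = 3, shown in three steps]
• Step 1: B(p) = 0 for every node
• Step 2: B(e) = B(f) = 1
• Step 3: B(e) = B(f) = 1, B(a) = B(b) = B(c) = B(d) = 2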
43
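Calculate GP(s,t)
[Figure: timing graph annotated with B(p), F(p) and the resulting GP(s,t) per edge]
• B(p): a = 2, b = 2, c = 2, d = 2, e = 1, f = 1
• F(p): a = 1, b = 1, c = 2, d = 2, e = 2, f = 2
• GP(s,t) = F(s) x B(t): GP(a,c) = 2, GP(b,c) = 2, GP(c,d) = 4, GP(d,e) = 2, GP(d,f) = 2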
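The critical path counting steps (F(p), B(p), then GP(s,t)) can be sketched as below. As in the example, every path is counted with weight 1 regardless of its slack, and GP(s,t) is taken to be F(s) x B(t), which matches the values shown; the function name and the requirement that edges be listed in topological order are assumptions made for the sketch.

```python
# Edges of the example circuit, listed in topological order.
EDGES = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]


def critical_path_counts(edges, sources, sinks):
    """Return (F, B, GP): forward counts, backward counts, paths per edge."""
    nodes = {n for e in edges for n in e}
    f = {n: 0 for n in nodes}
    b = {n: 0 for n in nodes}
    for s in sources:
        f[s] = 1                  # one path starts at each primary input
    for t in sinks:
        b[t] = 1                  # one path ends at each primary output
    for s, t in edges:            # forward sweep
        f[t] += f[s]
    for s, t in reversed(edges):  # backward sweep
        b[s] += b[t]
    gp = {(s, t): f[s] * b[t] for s, t in edges}
    return f, b, gp


F, B, GP = critical_path_counts(EDGES, ["a", "b"], ["e", "f"])
```

Edge (c, d) gets the highest count, 4, because all four input-to-output paths share it.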
44
Accurate All Path Counting
• Use discount function D{x, y} to get accurate counting result
  • ‘a’ is a positive constant number
  • x is the forward slack Fs(s,t) or the backward slack Bs(s,t)
    • Fs(s,t) = ARR(t) – ARR(s) – d(s,t)
    • Bs(s,t) = REQ(t) – REQ(s) – d(s,t)
  • y is the longest path delay (T)
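As a quick check of the two slack definitions, the sketch below evaluates Fs and Bs on every edge of the example circuit, using the ARR, REQ and delay values from the example slides. The helper names are illustrative; the discount function itself is not reproduced here.

```python
# Values of the example circuit.
ARR = {"a": 0, "b": 0, "c": 7, "d": 8, "e": 13, "f": 11}
REQ = {"a": 0, "b": 2, "c": 7, "d": 8, "e": 13, "f": 13}
DELAY = {("a", "c"): 7, ("b", "c"): 5, ("c", "d"): 1,
         ("d", "e"): 5, ("d", "f"): 3}


def forward_slack(s, t):
    """Fs(s,t) = ARR(t) - ARR(s) - d(s,t)."""
    return ARR[t] - ARR[s] - DELAY[(s, t)]


def backward_slack(s, t):
    """Bs(s,t) = REQ(t) - REQ(s) - d(s,t)."""
    return REQ[t] - REQ[s] - DELAY[(s, t)]


FS = {edge: forward_slack(*edge) for edge in DELAY}
BS = {edge: backward_slack(*edge) for edge in DELAY}
```

Only Fs(b, c) and Bs(d, f) are nonzero (both equal 2), so these are the only two edges whose discount differs from 1 in the forward and backward sweeps.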
45
Accurate All Path Counting
46
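Ex. Calculate F(t) (a=2)
[Figure: example timing graph annotated with ARR/REQ (a = 0/0, b = 0/2, c = 7/7, d = 8/8, e = 13/13, f = 11/13) and the resulting F values]
• D{Fs(a, c), T} = D{0,13} = 1
• D{Fs(b, c), T} = D{2,13} = 0.88
• D{Fs(c, d), T} = D{0,13} = 1
• D{Fs(d, e), T} = D{0,13} = 1
• D{Fs(d, f), T} = D{0,13} = 1
• F(a) = 1, F(b) = 1, F(c) = 1+0.88 = 1.88, F(d) = 1.88, F(e) = 1.88, F(f) = 1.88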
47
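Ex. Calculate B(s) (a=2)
[Figure: example timing graph annotated with ARR/REQ (a = 0/0, b = 0/2, c = 7/7, d = 8/8, e = 13/13, f = 11/13) and the resulting B values]
• D{Bs(a, c), T} = D{0,13} = 1
• D{Bs(b, c), T} = D{0,13} = 1
• D{Bs(c, d), T} = D{0,13} = 1
• D{Bs(d, e), T} = D{0,13} = 1
• D{Bs(d, f), T} = D{2,13} = 0.88
• B(e) = 1, B(f) = 1, B(d) = 1+0.88 = 1.88, B(c) = 1.88, B(a) = 1.88, B(b) = 1.88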
48
Ex. Calculate AP(s,t) (a=2)
[Figure: example timing graph shown three times, annotated with the B values, the F values and the resulting AP value per edge]
• D{slack(a, c), T} = D{0,13} = 1
• D{slack(b, c), T} = D{2,13} = 0.88
• D{slack(c, d), T} = D{0,13} = 1
• D{slack(d, e), T} = D{0,13} = 1
• D{slack(d, f), T} = D{2,13} = 0.88
• AP(a,c) = 1*1.88*1 = 1.88
• AP(b,c) = 1*1.88*0.88 = 1.65
• AP(c,d) = 1.88*1.88*1 = 3.53
• AP(d,e) = 1.88*1*1 = 1.88
• AP(d,f) = 1.88*1*0.88 = 1.65
49
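Compare results
• Accurate all path counting, AP(s,t): (a, c) = 1.88, (b, c) = 1.65, (c, d) = 3.53, (d, e) = 1.88, (d, f) = 1.65
• Critical path counting, GP(s,t): (a, c) = 2, (b, c) = 2, (c, d) = 4, (d, e) = 2, (d, f) = 2
• Using the critical path counting method (GPATH), it is difficult to get an accurate result. However, if we use the proposed algorithm, we can get a more accurate result.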
50
VPR Datastructures
• Resource Routing Graph
• Physical Block Graph
• Netlist
  • Global CLB Netlist
  • Global Atom Netlist
• Blocks
51
Blocks
• Contain the CLB
• Contain the Input Output
• Contain the Resource Routing Graph
• Contain the Physical Blocks
  • Physical Blocks represent the BLE
  • Physical Blocks represent the Flip Flop
  • Physical Blocks also contain the LUTs
52
Resource Routing Graph
• Nodes are pins
• Edges are architectural connections
• Each pin is associated with a net num
• Prev Nodes and Edges represent the actual connections per BLE.
53
Global Netlist
54
Atom Netlist