High Performance Implementation of Microtubule Modeling on FPGA using
Vivado HLS
+7 495 947 9017 www.rosta.ru
Yury [email protected]
Agenda
1. Introducing Rosta and Hardware Overview
2. Microtubule Modeling Problem
3. Vivado HLS Implementation
4. Vivado Challenges: Floorplan and Timing Closure
5. Conclusion
20 Years of Growth
• Established in 1993
• First activity: distribution. Sole distributor for Transtech (UK) and Myricom (USA)
• First design (1996) based on Transputer (Inmos, UK), TMS320C4X (Texas Instruments), SHARC (Analog Devices)
• Since 2000: Virtex family FPGAs by Xilinx
Rosta products portfolio overview
Main Design Principles
1. Largest FPGA
2. Standard Interface
3. Scalable Solutions
RB-8V7 Computing Platform
• 1U form factor
• 8 Virtex-7 FPGAs (XC7V2000T)
• 2 x PCIe x4 Gen3 upstream connections to the host
RB-8V7 Hardware
High Performance Computer RB-8V7
• 8 Xilinx Virtex-7 FPGAs
• Four 32-bit DDR3 memory banks
• 2 banks per FPGA
• 1 GB memory per FPGA
• Total memory 2GB
[Figure labels: 2x RC47 boards, 4x]
RB-8V7. Connection to Host
[Diagram: the host with an RHA-25 adapter (PCIe x8 Gen3, 8 GB/s) is linked to the two RC-47 boards of the RB-8V7 over two optical PCIe x4 Gen3 links (4 GB/s each). Switch labels in the diagram: 8732, 8732, 8725.]
Vivado HLS 2014.4
Vivado 2014.4
Board Support Package
int hls_top(
    uint32_t p1, uint32_t p2, uint32_t p3,  // scalar parameters
    volatile uint64_t *bus_ptr              // 64-bit data bus pointer
);
Agenda
1. Introducing Rosta and Hardware Overview
2. Microtubule Modeling Problem
3. Vivado HLS Implementation
4. Vivado Challenges: Floorplan and Timing Closure
5. Conclusion
Problem Overview
Model time ~ 100 s
Time step = 0.2 ns
Total steps ~ 5 * 10^11

Platform            Computation time of one step   Total compute time
Xeon CPU, 8 cores   20 us                          100 days (too long!)
FPGA                1.3 us                         6 days (15x speedup!)
Mathematical Model
[Figure: bonds of one molecule: longitudinal up, longitudinal down, lateral left, lateral right; angle Θ. Plots: lateral bond energy (k_B T) vs. r_lat (nm) and longitudinal bond energy (k_B T) vs. r_inter (nm); energy axis from -10 to 10, distance axis from 0 to 0.9 nm.]
$$ g^{bending}_{k,n} = \frac{B}{2} \left( \Theta_{k,n} - \Theta_0 \right)^2 $$
Molecule coordinates: X, Y, Θ
Number of molecules: 13 * 12 = 156
During each iteration
1. The molecule coordinates are known, so we compute the forces (gradient of the energy)
2. Update the coordinates
T = 100 s, dt = 0.2 ns, N_t = 5 * 10^11 iterations
Steps of algorithm
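Calculate with Langevin equations
$$ q^{i}_{k,n} = q^{i-1}_{k,n} - \frac{dt}{\gamma_q} \cdot \frac{\partial U_{total}}{\partial q^{i}_{k,n}} + \sqrt{\frac{2 k_B T \, dt}{\gamma_q}} \cdot N(0,1) $$
$$ v^{inter}_{k,n}(r^{inter}_{k,n}) = A \cdot \left( \frac{r^{inter}_{k,n}}{r_0} \right)^2 \cdot \exp\left( -\frac{r^{inter}_{k,n}}{r_0} \right) - b \cdot \exp\left( -\frac{(r^{inter}_{k,n})^2}{d \cdot r_0} \right) $$
$$ U_{total} = \sum_{n=1}^{13} \sum_{k=1}^{K_n} \left( v^{lat}_{k,n} + v^{inter}_{k,n} + g^{bending}_{k,n} \right) $$
[Figure: bonds of one molecule: longitudinal up, longitudinal down, lateral left, lateral right.]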
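One update of a single coordinate according to the Langevin equation can be sketched as follows. This is an illustration only: the function names are invented here, the normal sample xi is supplied by the caller, and the bending term is used as the simplest example of an energy gradient.

```cpp
#include <cmath>

// One explicit Langevin update of a single coordinate q:
//   q_new = q - (dt / gamma) * dU/dq + sqrt(2 * kBT * dt / gamma) * xi
// where xi is a sample from N(0, 1) supplied by the caller.
inline float langevin_step(float q, float dU_dq, float dt, float gamma,
                           float kBT, float xi) {
    const float drift = -(dt / gamma) * dU_dq;
    const float noise = std::sqrt(2.0f * kBT * dt / gamma) * xi;
    return q + drift + noise;
}

// Gradient of the bending energy g = (B / 2) * (theta - theta0)^2
// with respect to theta.
inline float bending_gradient(float B, float theta, float theta0) {
    return B * (theta - theta0);
}
```

Setting xi to zero leaves only the deterministic drift, which is a convenient way to check the force computation separately from the thermal noise.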
Agenda
1. Introducing Rosta and Hardware Overview
2. Microtubule Modeling Problem
3. Vivado HLS Implementation
4. Vivado Challenges: Floorplan and Timing Closure
5. Conclusion
HLS Implementation: Force Pipelines
void calc_lateral_gradients(
    float_3d m1,              // current molecule
    float_3d m2,              // left molecule
    float_3d *left_lat_r_ret,
    float_3d *c_lat_l_ret
);
HLS Implementation: Force Pipelines
void calc_longitudal_gradients(
    float_3d m1,              // current molecule
    float_3d m3,              // upper molecule
    float_3d *c_long_u_ret,
    float_3d *up_long_d_ret
);
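The two output pointers in these declarations suggest that one bond evaluation yields a gradient term for each of the two molecules it connects. The following hypothetical sketch shows that pattern only. It uses a simple harmonic bond in x and y as a stand-in, not the potentials from the talk, and `pair_gradients` and `k_spring` are invented names.

```cpp
// Same layout as the float_3d type used on the slides.
struct float_3d {
    float x;
    float y;
    float t;
};

// Stand-in bond energy: E = 0.5 * k * ((x1 - x2)^2 + (y1 - y2)^2).
// Writes dE/d(m1) to grad_m1 and dE/d(m2) to grad_m2; the two gradients
// are equal in magnitude and opposite in sign.
inline void pair_gradients(float_3d m1, float_3d m2, float k_spring,
                           float_3d *grad_m1, float_3d *grad_m2) {
    const float dx = m1.x - m2.x;
    const float dy = m1.y - m2.y;
    grad_m1->x = k_spring * dx;
    grad_m1->y = k_spring * dy;
    grad_m1->t = 0.0f;  // the stand-in energy does not depend on the angle
    grad_m2->x = -k_spring * dx;
    grad_m2->y = -k_spring * dy;
    grad_m2->t = 0.0f;
}
```

Computing both terms in one call means each bond is evaluated once, and the neighbor's contribution can be accumulated when that neighbor is processed.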
One Pipeline Computational Scheme: First Step
One Pipeline Computational Scheme: Second Step
HLS Implementation: One Pipeline Memory Requirements
The one-pipeline computation scheme requires the coordinates of three molecules each cycle: 3 * 3 * 4 = 36 bytes
typedef struct {
    float x;
    float y;
    float t;
} float_3d;
float_3d m1[13][N_d];
#pragma HLS DATA_PACK variable=m1
BRAM data bus width = 12 bytes. Using two ports we can read only 24 bytes each cycle, less than the 36 bytes required, so the array is partitioned:
#pragma HLS ARRAY_PARTITION variable=m1 cyclic factor=2 dim=2
All data is stored in BRAM: less than 4 KB for coordinates
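Cyclic partitioning interleaves the elements of the chosen dimension across several smaller arrays, each of which gets its own memory ports. A minimal sketch of the index mapping follows; `BankSlot` and `cyclic_slot` are illustrative names, not part of Vivado HLS.

```cpp
// Location of an element after cyclic partitioning of one array dimension.
struct BankSlot {
    int bank;    // which of the `factor` smaller arrays holds the element
    int offset;  // index of the element inside that smaller array
};

// Maps index j of the partitioned dimension to its bank and offset.
constexpr BankSlot cyclic_slot(int j, int factor) {
    return BankSlot{j % factor, j / factor};
}
```

With factor=2, neighboring indices in the second dimension land in different banks, so two banks with two ports each can deliver up to 4 * 12 = 48 bytes per cycle. Whether the three molecules needed in one cycle really fall into different banks depends on the access pattern, which the slides do not spell out.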
HLS Implementation: One Pipeline Utilization and Performance
Target device: XC7V2000T
Frequency = 200 MHz (period = 5 ns), II = 1, Latency = 187 cycles

              DSP     FF        LUT
Total         208     85467     133298
Available     2160    2443200   1221600
Utilization   9 %     3 %       11 %

One iteration latency
N, the number of molecules = 13 * 12 = 156
T_cycles = L + N = 187 + 156 = 343
T_it = 343 * 5 ns = 1.7 us
How to increase performance? Add more computation pipelines to process several molecules in parallel.
Three Pipelines Computational Scheme: First Step
Three Pipelines Computational Scheme: Second Step
HLS Implementation: Three Pipelines Utilization and Performance
Target device: XC7V2000T
Frequency = 200 MHz (period = 5 ns), II = 1, Latency = 187 cycles

              DSP     FF        LUT
Total         625     247349    405527
Available     2160    2443200   1221600
Utilization   28 %    10 %      33 %

Memory requirements: 7 molecules or 84 bytes each cycle
#pragma HLS ARRAY_PARTITION variable=m1 cyclic factor=4 dim=2
One iteration latency
T_cycles = L + N/3 = 187 + 52 = 239 => 239 * 5 ns = 1.2 us
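The partition factor of 4 is consistent with the stated demand of seven molecules per cycle if every partition is a dual-port BRAM that delivers one 12-byte molecule per port. A short sketch of that check, with illustrative function names:

```cpp
// Bytes that must be read every cycle.
constexpr int required_bytes(int molecules_per_cycle, int bytes_per_molecule) {
    return molecules_per_cycle * bytes_per_molecule;
}

// Smallest number of banks whose ports can deliver one molecule each,
// assuming every bank offers `ports_per_bank` independent read ports.
constexpr int min_banks(int molecules_per_cycle, int ports_per_bank) {
    return (molecules_per_cycle + ports_per_bank - 1) / ports_per_bank;
}
```

Seven molecules need 84 bytes and at least four dual-port banks; three molecules need 36 bytes and two banks, matching the factors 4 and 2 used in the two designs.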
Heat Modeling
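Calculate with Langevin equations
$$ q^{i}_{k,n} = q^{i-1}_{k,n} - \frac{dt}{\gamma_q} \cdot \frac{\partial U_{total}}{\partial q^{i}_{k,n}} + \sqrt{\frac{2 k_B T \, dt}{\gamma_q}} \cdot N(0,1) $$
The N(0,1) term requires normally distributed pseudo-random numbers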
Each cycle the coordinates of 3 molecules are updated, so we need 9 random numbers each cycle
Algorithm for generating normal numbers
1. Generate 2 uniformly distributed numbers (Mersenne Twister algorithm)
2. Apply the Box-Muller transform
3. Get 2 normal numbers
Each block yields 2 normal numbers, so we need 5 such blocks operating in parallel (5 * 2 = 10 >= 9)
We used Vivado HLS and achieved II = 1
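The generator block described in these steps can be modeled in software as shown here. This is a reference sketch, not the HLS code from the talk: it uses `std::mt19937` from the C++ standard library as the Mersenne Twister, and the function names are illustrative.

```cpp
#include <cmath>
#include <random>
#include <utility>

// Box-Muller transform: maps two independent uniform samples, u1 in (0, 1]
// and u2 in [0, 1], to two independent standard normal samples.
inline std::pair<float, float> box_muller(float u1, float u2) {
    const float two_pi = 6.28318530718f;
    const float radius = std::sqrt(-2.0f * std::log(u1));
    return {radius * std::cos(two_pi * u2), radius * std::sin(two_pi * u2)};
}

// Draws two uniforms from a Mersenne Twister and returns two normal samples.
inline std::pair<float, float> normal_pair(std::mt19937 &gen) {
    // Shift the first sample into (0, 1] so that log(u1) stays finite.
    const float u1 = static_cast<float>((gen() + 1.0) / 4294967296.0);
    const float u2 = static_cast<float>(gen() / 4294967296.0);
    return box_muller(u1, u2);
}

// Number of two-output generator blocks needed per cycle.
constexpr int generator_blocks(int numbers_per_cycle) {
    return (numbers_per_cycle + 1) / 2;
}
```

For 9 numbers per cycle `generator_blocks` returns 5, so one of the ten generated values is unused in every cycle.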
Agenda
1. Introducing Rosta and Hardware Overview
2. Microtubule Modeling Problem
3. Vivado HLS Implementation
4. Vivado Challenges: Floorplan and Timing Closure
5. Conclusion
Floorplan Scheme
Big silicon: the XC7V2000T has 4 SLRs
The HLS core doesn't fit in one SLR, which breaks the Xilinx recommendation
Need to minimize logic in the HLS core: split it into two HLS cores
1. Deterministic part (forces calculation and coordinates update): main core
2. Pseudo-random number generators: rand core
The main HLS core is still too big: it occupies two SLRs, and we can't do anything about it

              DSP     FF        LUT
Total         625     247349    405527
Available     2160    2443200   1221600
Utilization   28 %    10 %      33 %
Floorplan Scheme
pblock_base: PCIe DMA, DDR3 controller, Rand HLS core (SLR2)
pblock_hls: Main HLS core (SLR0 + SLR1)
Floorplan Scheme: Implementation Results
RED: PCIe DMA, DDR3 controller
PURPLE: Rand HLS core
CYAN: Main HLS core
Timing Closure
Problems:
1. HLS clock period: increase the HLS clock uncertainty. This shrinks the part of the clock period that HLS schedules logic into, which deepens the pipelines and increases latencies, but not dramatically
2. DSP usage: too many float operations in the design require lots of DSPs, and timing was very bad. Had to apply the HLS Resource directive to decrease the number of DSP cores
3. SLR boundary crossing: register the signals crossing SLRs
4. BRAM access latency: increase the latency to insert FFs on the BRAM address bus, thus breaking critical paths
5. Run the phys_opt_design implementation stage
Thanks to Sergei Storojev and John Blaine from Xilinx!
Timing Closure: DSP Usage
Current Vivado HLS functionality: the Resource directive applies to a specific operation, represented by an individual variable
Very inconvenient! Suggestion: to be able to apply the Resource directive to ALL cores inside a function
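The per-operation binding looks roughly as follows. This is a sketch rather than code from the talk: `scaled_sum` is an invented function, and the core names `FMul_nodsp` and `FAddSub_nodsp` are assumed to be the DSP-free floating-point cores of the Vivado HLS core library. A regular C++ compiler ignores the pragmas.

```cpp
// Every intermediate result is stored in its own variable so that a
// RESOURCE directive can select the core implementing that one operation.
inline float scaled_sum(float a, float b, float k) {
    float prod_a = a * k;
#pragma HLS RESOURCE variable=prod_a core=FMul_nodsp
    float prod_b = b * k;
#pragma HLS RESOURCE variable=prod_b core=FMul_nodsp
    float sum = prod_a + prod_b;
#pragma HLS RESOURCE variable=sum core=FAddSub_nodsp
    return sum;
}
```

Three operations already need three directives and three named temporaries, which illustrates why a function-wide setting would be more convenient for a design with hundreds of floating-point operations.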
Timing Closure: SLR Crossing
Register nets crossing SLRs: use Register Slices on AXI MM and Stream interfaces
Timing Closure: BRAM Access Latency
First synthesis results showed lots of very long combinatorial paths in front of the BRAM address inputs for HLS arrays
A good idea was to insert FFs in this path using a Vivado HLS directive
#pragma HLS RESOURCE variable=m1 core=RAM_2P_BRAM latency=5
Agenda
1. Introducing Rosta and Hardware Overview
2. Microtubule Modeling Problem
3. Vivado HLS Implementation
4. Vivado Challenges: Floorplan and Timing Closure
5. Conclusion
Conclusion
A big FPGA is capable of HPC using Vivado HLS
My experience
1. Achieving an II = 1 pipeline is a must
2. Use the Array Partition directive to feed the pipeline with data
3. Try to fit the HLS core into one SLR. Floorplanning is a must
4. Register the nets crossing SLRs
Tip
1. Try to increase BRAM access latency if facing timing issues on the address bus
Suggestion
1. To be able to apply the Resource directive to ALL cores inside a function
Future Work
We are on the verge of obtaining new scientific results using our accelerated implementation.
Future technical plans:
Implement this algorithm using SDAccel on the new Rosta board RC-4KU with Kintex UltraScale silicon
If we have to stick to the rule of one HLS core (or OpenCL kernel) per SLR, then there is an urgent need for external pipes functionality in SDAccel
Thank you!
RC-47 board: Closer Look
[Figure: annotated photo of the RC-47 board. Labels: KC1, KC2, blade connector, SD Card, USB, Life Support System, PEX 8732, C0, C1, C2, C3.]