
Page 1

High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS

Yury Rumyantsev, rumyantsev@rosta.ru

+7 495 947 9017, www.rosta.ru

Page 2

Agenda

1. Introducing Rosta and Hardware Overview

2. Microtubule Modeling Problem

3. Vivado HLS Implementation

4. Vivado Challenges: Floorplan and Timing Closure

5. Conclusion

Page 3

20 Years of Growth

• Established in 1993
• First activity: distribution. Sole distributor for Transtech (UK) and Myricom (USA)
• First design (1996) based on Transputer (Inmos, UK), TMS320C4X (Texas Instruments) and SHARC (Analog Devices)
• Since 2000: Virtex family FPGAs by Xilinx

Page 4

Rosta Products Portfolio Overview

Main Design Principles

1. Largest FPGA

2. Standard Interface

3. Scalable Solutions

Page 5

RB-8V7 Computing Platform

• 1U form factor

• 8 Virtex-7 FPGAs (XC7V2000T)

• 2 x PCIe x4 Gen3 upstream connections to the host

Page 6

RB-8V7 Hardware

High Performance Computer RB-8V7

• 8 Xilinx Virtex-7 FPGAs
• 2x RC-47 boards (4x FPGA per board)
• 4 32-bit DDR3 memory banks
• 2 banks per FPGA
• 1 GB memory per FPGA
• Total memory 2 GB

Page 7

RB-8V7. Connection to Host

[Diagram labels: Host; RHA-25 (PCIe x8 Gen3, 8 GB/s); two optical PCIe x4 Gen3 links (4 GB/s each); two RC-47 boards inside the RB-8V7; switch chips 8732, 8732 and 8725]

Page 8

Board Support Package

Vivado HLS 2014.4

Vivado 2014.4

int hls_top(
    uint32_t p1,
    uint32_t p2,
    uint32_t p3,
    volatile uint64_t *bus_ptr
);


Page 10

Problem Overview

Model time ~ 100 s
Time step = 0.2 ns
Total steps ~ 5 * 10^11

Platform | Computation time of one step | Total compute time
Xeon CPU, 8 cores | 20 us | 100 days (Too long!!)
FPGA | 1.3 us | 6 days (15x speedup!)
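The totals in the table follow from multiplying the number of steps by the time of one step. A minimal sketch of that arithmetic is given here; the helper names (`total_steps`, `total_days`, `speedup`) are illustrative and do not come from the talk.

```cpp
// Back-of-the-envelope check of the Problem Overview numbers.
// All helper names are hypothetical, not part of the original design.

// Number of time steps needed to cover model_time_s with a step of dt_s.
double total_steps(double model_time_s, double dt_s) {
    return model_time_s / dt_s;
}

// Wall-clock days needed when one step takes step_time_s seconds.
double total_days(double steps, double step_time_s) {
    const double seconds_per_day = 86400.0;
    return steps * step_time_s / seconds_per_day;
}

// Ratio of per-step times (baseline over accelerated).
double speedup(double baseline_step_s, double accelerated_step_s) {
    return baseline_step_s / accelerated_step_s;
}
```

With a 100 s model time and a 0.2 ns step, `total_steps` gives 5 * 10^11. At 20 us per step that is about 116 days, and at 1.3 us about 7.5 days, the same order as the rounded 100 and 6 days on the slide; the ratio 20 / 1.3 is roughly 15.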

Page 11

Mathematical Model

[Figure: a molecule and its four bonds: longitudinal up, longitudinal down, lateral left, lateral right; Θ is the bending angle]

[Plots: lateral bond energy (k_B T) versus r_lat (nm), and longitudinal bond energy (k_B T) versus r_inter (nm); energy axis from -10 to 10, distance axis from 0 to 0.9 nm]

Bending energy:

$$g^{bending}_{k,n} = \frac{B}{2}\left(\Theta_{k,n} - \Theta_o\right)^2$$

Molecule coordinates: X, Y, Θ

Number of molecules: 13 * 12 = 156

Page 12

Steps of algorithm

During each iteration:

1. The molecule coordinates are known, so we compute the forces (gradient of the energy)

2. Update the coordinates

T = 100 s, dt = 0.2 ns, N_t = 5 * 10^11 iterations

Calculate with Langevin equations, where q stands for any of the coordinates X, Y, Θ:

$$q^{i}_{k,n} = q^{i-1}_{k,n} - \frac{dt}{\gamma_q}\cdot\frac{\partial U_{total}}{\partial q_{k,n}} + \sqrt{\frac{2 k_B T\,dt}{\gamma_q}}\cdot N(0,1)$$

$$v^{inter}_{k,n} = A\cdot\left(\frac{r^{inter}_{k,n}}{r_o}\right)^2\cdot\exp\left(-\frac{r^{inter}_{k,n}}{r_o}\right) - b\cdot\exp\left(-\frac{\left(r^{inter}_{k,n}\right)^2}{d\cdot r_o}\right)$$

$$U_{total} = \sum_{n=1}^{13}\sum_{k=1}^{K_n}\left(v^{lat}_{k,n} + v^{inter}_{k,n} + g^{bending}_{k,n}\right)$$

[Figure: longitudinal up, longitudinal down, lateral left, lateral right bonds]
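A minimal software sketch of one update step may help to read the formulas on this slide. It assumes the bond potential has the form v(r) = A (r/r0)^2 exp(-r/r0) - b exp(-r^2 / (d r0)) and an explicit Langevin update with drag coefficient gamma. The function names, the double precision and the caller-supplied random sample `xi` are assumptions made for illustration, not the code of the FPGA design.

```cpp
#include <cmath>

// Bond energy v(r), assuming the form stated in the lead-in.
// A, b, d and r0 are model parameters; their values are not given in the slides.
double bond_energy(double r, double A, double b, double d, double r0) {
    const double x = r / r0;
    return A * x * x * std::exp(-x) - b * std::exp(-(r * r) / (d * r0));
}

// One explicit Langevin update of a single coordinate q.
//   grad  : dU_total/dq evaluated at the previous step
//   gamma : viscous drag coefficient for this coordinate
//   kBT   : thermal energy
//   xi    : a sample from N(0, 1), supplied by the caller
double langevin_step(double q, double grad, double dt,
                     double gamma, double kBT, double xi) {
    const double drift = -(dt / gamma) * grad;
    const double noise = std::sqrt(2.0 * kBT * dt / gamma) * xi;
    return q + drift + noise;
}
```

Passing `xi` in from outside keeps the deterministic part (forces and coordinate update) separate from random number generation, which makes the step easy to check against hand-computed values.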


Page 14

HLS Implementation: Force Pipelines

void calc_lateral_gradients(
    float_3d m1,  // current molecule
    float_3d m2,  // left molecule
    float_3d *left_lat_r_ret,
    float_3d *c_lat_l_ret
);

Page 15

HLS Implementation: Force Pipelines

void calc_longitudal_gradiets(
    float_3d m1,  // current molecule
    float_3d m3,  // upper molecule
    float_3d *c_long_u_ret,
    float_3d *up_long_d_ret
);
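Both declarations take a pair of molecules and return two gradient vectors, which suggests that each bond is evaluated once and contributes to both molecules of the pair. The sketch below illustrates that calling pattern only. It uses a stand-in harmonic potential instead of the real bond energy, and the name `calc_pair_gradients_demo` and the stiffness parameter `k` are hypothetical.

```cpp
// Same layout as the float_3d type used in the design.
typedef struct {
    float x;
    float y;
    float t;
} float_3d;

// Stand-in pair function: U = k/2 * |m1 - m2|^2 (NOT the real bond potential).
// Writes dU/d(m1) and dU/d(m2) through the two output pointers.
void calc_pair_gradients_demo(float_3d m1, float_3d m2, float k,
                              float_3d *m1_grad_ret, float_3d *m2_grad_ret) {
    float_3d g;
    g.x = k * (m1.x - m2.x);
    g.y = k * (m1.y - m2.y);
    g.t = k * (m1.t - m2.t);
    *m1_grad_ret = g;
    // The neighbour receives the opposite contribution from the same bond.
    m2_grad_ret->x = -g.x;
    m2_grad_ret->y = -g.y;
    m2_grad_ret->t = -g.t;
}
```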

Page 16

One Pipeline Computational Scheme: First Step

Page 17

One Pipeline Computational Scheme: Second Step

Page 18

HLS Implementation: One Pipeline Memory Requirements

The one-pipeline computation scheme requires the coordinates of three molecules each cycle: 3 * 3 * 4 = 36 bytes

typedef struct {
    float x;
    float y;
    float t;
} float_3d;

float_3d m1[13][N_d];

#pragma HLS DATA_PACK variable=m1

BRAM data bus width = 12 bytes. Using two ports we can read only 24 bytes each cycle, which is less than the 36-byte requirement, so the array is partitioned:

#pragma HLS ARRAY_PARTITION variable=m1 cyclic factor=2 dim=2

All data is stored in BRAM: less than 4 KB for coordinates
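The bandwidth argument on this slide can be written down as a small compile-time check. This is a sketch under idealized assumptions: a cyclic partition by factor F is modeled as F independent dual-port banks, every port delivers one packed 12-byte molecule per cycle, and the accesses of one cycle never collide on a bank. The function names are illustrative.

```cpp
// Bytes occupied by one molecule: three 4-byte floats (x, y, t).
constexpr int kBytesPerMolecule = 3 * 4;

// Bytes that must be read per cycle to feed the pipelines.
constexpr int required_bytes_per_cycle(int molecules_per_cycle) {
    return molecules_per_cycle * kBytesPerMolecule;
}

// Bytes readable per cycle when the array is split into `banks`
// dual-port BRAM banks, each port delivering one packed molecule.
constexpr int available_bytes_per_cycle(int banks) {
    return banks * 2 * kBytesPerMolecule;
}

// Total coordinate storage for a 13 x n_d array of molecules.
constexpr int storage_bytes(int n_d) {
    return 13 * n_d * kBytesPerMolecule;
}

static_assert(available_bytes_per_cycle(1) < required_bytes_per_cycle(3),
              "one dual-port bank cannot feed one pipeline");
static_assert(available_bytes_per_cycle(2) >= required_bytes_per_cycle(3),
              "two banks are enough for one pipeline");
```

Assuming N_d = 12 (from the 13 * 12 molecules of the model), `storage_bytes(12)` gives 1872 bytes, consistent with the "less than 4 KB" figure. The same functions can be reused to check a larger partition factor when more molecules are needed per cycle.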

Page 19

HLS Implementation: One Pipeline Utilization and Performance

Device: XC7V2000T
Frequency: 200 MHz (period = 5 ns), II = 1, latency L = 187 cycles

Resource | DSP | FF | LUT
Total | 208 | 85467 | 133298
Available | 2160 | 2443200 | 1221600
Utilization | 9 % | 3 % | 11 %

One iteration latency

N, the number of molecules = 13 * 12 = 156

T_cycles = L + N = 187 + 156 = 343

T_it = 343 * 5 ns ≈ 1.7 us

How to increase performance? Add more computation pipelines to process several molecules in parallel.

Page 20

Three Pipelines Computational Scheme: First Step

Page 21

Three Pipelines Computational Scheme: Second Step

Page 22

HLS Implementation: Three Pipelines Utilization and Performance

Device: XC7V2000T
Frequency: 200 MHz (period = 5 ns), II = 1, latency L = 187 cycles

Resource | DSP | FF | LUT
Total | 625 | 247349 | 405527
Available | 2160 | 2443200 | 1221600
Utilization | 28 % | 10 % | 33 %

Memory requirements: 7 molecules or 84 bytes each cycle

#pragma HLS ARRAY_PARTITION variable=m1 cyclic factor=4 dim=2

One iteration latency

T_cycles = L + N/3 = 187 + 52 = 239 cycles => 1.2 us
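The iteration latency for one and for three pipelines follows the same rule: pipeline depth plus the number of molecules each pipeline has to accept at II = 1. A short sketch of that rule, with illustrative function names and assuming the molecules divide evenly among the pipelines:

```cpp
// Cycles for one iteration with a pipeline of depth `latency`, II = 1,
// `molecules` items to process and `pipelines` working in parallel.
// Assumes the molecules divide evenly among the pipelines.
constexpr int iteration_cycles(int latency, int molecules, int pipelines) {
    return latency + molecules / pipelines;
}

// Iteration time in nanoseconds for a given clock period.
constexpr double iteration_time_ns(int cycles, double period_ns) {
    return cycles * period_ns;
}
```

For L = 187 and N = 156 this gives 343 cycles (1715 ns) with one pipeline and 239 cycles (1195 ns) with three. The fixed pipeline depth dominates, which is why tripling the pipelines shortens the iteration by only about 30 %.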

Page 23

Heat Modeling

Calculate with Langevin equations:

$$q^{i}_{k,n} = q^{i-1}_{k,n} - \frac{dt}{\gamma_q}\cdot\frac{\partial U_{total}}{\partial q_{k,n}} + \sqrt{\frac{2 k_B T\,dt}{\gamma_q}}\cdot N(0,1)$$

The N(0,1) term requires normally distributed pseudo-random numbers.

Each cycle the coordinates of 3 molecules are updated => we need 9 random numbers each cycle

Algorithm for generating normal numbers:

1. Generate 2 uniformly distributed numbers (Mersenne Twister algorithm)
2. Apply the Box-Muller transform
3. Get 2 normal numbers

Finally, we need 5 such blocks operating in parallel

We used Vivado HLS and achieved II = 1
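The generator steps listed on this slide can be modeled in software as follows. This is a sketch only: it uses the standard library `std::mt19937` as the Mersenne Twister and double precision, whereas the hardware blocks of the talk are not shown in the slides. The function names are illustrative.

```cpp
#include <cmath>
#include <random>
#include <utility>

// Box-Muller transform: maps two uniform samples, u1 in (0, 1] and
// u2 in [0, 1), to two independent standard normal samples.
std::pair<double, double> box_muller(double u1, double u2) {
    const double two_pi = 6.283185307179586;
    const double radius = std::sqrt(-2.0 * std::log(u1));
    const double angle = two_pi * u2;
    return {radius * std::cos(angle), radius * std::sin(angle)};
}

// Software model of one generator block: Mersenne Twister feeding Box-Muller.
std::pair<double, double> next_normal_pair(std::mt19937 &gen) {
    // Shift the 32-bit outputs so that u1 is never 0 (log(0) is undefined).
    const double scale = 1.0 / 4294967296.0;  // 2^-32
    const double u1 = (gen() + 1.0) * scale;  // (0, 1]
    const double u2 = gen() * scale;          // [0, 1)
    return box_muller(u1, u2);
}

// Blocks needed when each block yields `per_block` numbers per cycle.
constexpr int blocks_needed(int numbers_per_cycle, int per_block) {
    return (numbers_per_cycle + per_block - 1) / per_block;
}
```

Each block produces two normal numbers per call, so 9 numbers per cycle need `blocks_needed(9, 2)` = 5 blocks, with one output left unused.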


Page 25

Floorplan Scheme

Big silicon: the XC7V2000T has 4 SLRs

The HLS core doesn't fit in one SLR, which breaks the Xilinx recommendation

Need to minimize logic in the HLS core, so it is split between two HLS cores:

1. Deterministic part (force calculation and coordinate update): main core
2. Pseudo-random number generators: rand core

The main HLS core is still too big: it occupies two SLRs, and nothing can be done about it

Main HLS core | DSP | FF | LUT
Total | 625 | 247349 | 405527
Available | 2160 | 2443200 | 1221600
Utilization | 28 % | 10 % | 33 %

Page 26

Floorplan Scheme

pblock_base: PCIe DMA, DDR3 controller, Rand HLS core (SLR2)

pblock_hls: Main HLS core (SLR0 + SLR1)

Page 27

Floorplan Scheme: Implementation Results

RED: PCIe DMA, DDR3 controller

PURPLE: Rand HLS core

CYAN: Main HLS core

Page 28

Timing Closure

Problems:

1. HLS clock period: increase the HLS clock uncertainty. This effectively shortens the clock period that HLS schedules against, which increases pipeline depths and latencies, but not dramatically

2. DSP usage: the many floating-point operations in the design require lots of DSPs, and timing was very bad. We had to apply the HLS Resource directive to decrease the number of DSP cores

3. SLR boundary crossing: register the signals crossing SLRs

4. BRAM access latency: increase the latency to insert FFs in the BRAM address bus, thus breaking critical paths

5. Run the phys_opt_design implementation stage

Thanks to Sergei Storojev and John Blaine from Xilinx!

Page 29

Timing Closure: DSP Usage

Current Vivado HLS functionality: the Resource directive is applied to a specific operation, represented by an individual variable

Very inconvenient! Suggestion: make it possible to apply the Resource directive to ALL cores inside a function
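To illustrate why the per-variable form becomes tedious, here is a sketch of a tiny function in which every floating-point operation needs its own directive. The core names `FAddSub_nodsp` and `FMul_nodsp` are assumed from the Vivado HLS floating-point core list and should be checked against the tool version in use; the function itself is hypothetical.

```cpp
// Each floating-point operation gets its own directive, bound to the
// variable that holds its result. Core names are an assumption (see above);
// a regular C++ compiler ignores these pragmas.
float scaled_sum(float a, float b, float scale) {
    float sum = a + b;
#pragma HLS RESOURCE variable=sum core=FAddSub_nodsp
    float result = sum * scale;
#pragma HLS RESOURCE variable=result core=FMul_nodsp
    return result;
}
```

In a force pipeline with dozens of intermediate values, every one of them would need a named temporary and a matching pragma line.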

Page 30

Timing Closure: SLR Crossing

Register the nets crossing SLRs: use register slices on AXI MM and Stream interfaces

Page 31

Timing Closure: BRAM Access Latency

The first synthesis results showed lots of very long combinatorial paths in front of the BRAM address inputs for HLS arrays

A good idea was to insert FFs in this path using a Vivado HLS directive:

#pragma HLS RESOURCE variable=m1 core=RAM_2P_BRAM latency=5


Page 33

Conclusion

A big FPGA is capable of HPC using Vivado HLS

My experience:
1. Achieving an II = 1 pipeline is a must
2. Use the Array Partition directive to feed the pipeline with data
3. Try to fit the HLS core into one SLR. Floorplanning is a must
4. Register the nets crossing SLRs

Tip:
1. Try increasing the BRAM access latency if facing timing issues on the address bus

Suggestion:
1. Make it possible to apply the Resource directive to ALL cores inside a function

Page 34

Future Work

We are close to obtaining new scientific results using our accelerated implementation.

Future technical plans:

Implement this algorithm using SDAccel on the new Rosta board RC-4KU with Kintex UltraScale silicon

If we have to stick to the rule of one HLS core (or OpenCL kernel) per SLR, then there is an urgent need for external pipes functionality in SDAccel

Thank you!

Page 35

RC-47 board: Closer Look

[Photo labels: KC1, KC2, blade connector, SD Card, USB, Life Support System, PEX 8732, C0, C1, C2, C3]