Frank Vahid, 1

Embedding-Based Placement of Processing Element Networks on FPGAs for Physical Model Simulation

Bailey Miller*, Frank Vahid*, Tony Givargis**

*Univ. of California, Riverside

**Univ. of California, Irvine

Supported by the NSF (CNS1016792, CPS1136146), and Semiconductor Research Corporation (GRC 2143.001)

Intern at SpaceX

Bailey Miller, UCR 2

Models of the physical world that run in real time

http://www.nhlbi.nih.gov/

Test cyber-physical systems

Physical mockup

Transducer models

Environment model

Digital mockup

Issue: Real-time achieved via inaccuracy

Weibel lung complexity
4 gen: 32 ODEs
6 gen: 128 ODEs
8 gen: 512 ODEs
10 gen: 2048 ODEs
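The four data points above double with every added generation, which fits 2^(g+1) ODEs for g generations. That closed form is inferred from the listed numbers, not stated on the slide, and the function name is hypothetical. A minimal sketch:

```python
def weibel_ode_count(generations: int) -> int:
    """ODE count inferred from the slide's data points: 2 ** (generations + 1).

    Assumption: the doubling pattern seen at 4, 6, 8, and 10 generations
    holds for other generation counts as well.
    """
    return 2 ** (generations + 1)


# Reproduce the generation counts listed on the slide.
for g in (4, 6, 8, 10):
    print(f"{g} gen: {weibel_ode_count(g)} ODEs")
```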

“2-3 minutes to simulate one breath accurately”

V[1],R[1]

V[2],R[2]   V[7],R[7]

Earlier results

Speedup vs real-time
PC(1): 0.8x
PC(4): 3.1x
GPU: 1.6x
HLS/FPGA: 3.2x

Chart: PC(1), PC(4), GPU, HLS / FPGAs (axis ticks 100 to 400; labeled values 1430, 1490)

Parallel computations + neighbor communication: seems like a great match for FPGAs

Network of synchronized PEs on FPGAs

Digital mockup

General Processing Element
• Iterative ODE solver (Euler/RK4), 0.1 ms / 0.01 ms timestep
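For reference, the two solver methods named above can be sketched in software as follows. This is the textbook form of forward Euler and classical RK4, not the PE's hardware implementation. The test equation dy/dt = -y and all function names are assumptions chosen for illustration, and the 0.1 ms step is one of the two timesteps on the slide.

```python
import math


def euler_step(f, t, y, h):
    """One forward-Euler step for dy/dt = f(t, y)."""
    return y + h * f(t, y)


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(step, f, y0, h, n_steps):
    """Apply `step` n_steps times starting from y(0) = y0."""
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y


def decay(t, y):
    """Hypothetical test equation dy/dt = -y, exact solution exp(-t)."""
    return -y


h = 1e-4  # 0.1 ms timestep, expressed in seconds
steps = 10_000  # simulate 1 second
euler_error = abs(integrate(euler_step, decay, 1.0, h, steps) - math.exp(-1))
rk4_error = abs(integrate(rk4_step, decay, 1.0, h, steps) - math.exp(-1))
print(f"Euler error: {euler_error:.2e}, RK4 error: {rk4_error:.2e}")
```

At the same timestep RK4 is far more accurate but costs four evaluations of f per step instead of one.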

1 PE: 300 MHz

Earlier results

Speedup vs real-time
PC(1): 0.8x
PC(4): 3.1x
GPU: 1.6x
HLS: 3.2x
General PE: 4.9x

Chart (Weibel): PC(1), PC(4), GPU, HLS, General PEs (axis ticks 100 to 400; labeled values 1430, 1490, 1522)

Current work

Problem: More PEs → lower frequency

Inter-PE critical path

INST MEM

DATA MEM

Internal PE critical path

Frequency vs. PE network size (chart; axes 0 to 400 and 0 to 600)

11-gen Weibel model, Virtex6 240T FPGA, general PEs

Real ODEs/sec vs. lost ODEs/sec due to freq drop (chart; x-axis ticks 1, 10, 20, 50, 100, 200, 400)

Earlier approach

10K iterations

150K iterations

Convert virtual PEs to physical circuits using FPGA place-route

Map ODEs to virtual PEs using simulated annealing

This work: Main idea

Graph embedding: Map guest graph to host graph, minimizing max wire length
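To make the objective concrete, the sketch below scores an embedding by its longest wire. The guest graph is a hypothetical 7-node binary tree with heap-style numbering, the host is a 3x3 grid, and both placements are made up for illustration. Wire length is taken as Euclidean distance between grid coordinates.

```python
import math


def max_wire_length(guest_edges, placement):
    """Longest Euclidean distance between the placed endpoints of any guest edge."""
    return max(math.dist(placement[u], placement[v]) for u, v in guest_edges)


# Hypothetical guest graph: 7-node binary tree, node n has children 2n and 2n + 1.
tree_edges = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)]

# Placement A: fill the 3x3 host grid in row-major order, ignoring tree structure.
row_major = {n: divmod(n - 1, 3) for n in range(1, 8)}

# Placement B: root in the center, each child adjacent to its parent.
tree_aware = {1: (1, 1), 2: (1, 0), 3: (1, 2),
              4: (0, 0), 5: (2, 0), 6: (0, 2), 7: (2, 2)}

print(f"row-major:  {max_wire_length(tree_edges, row_major):.3f}")
print(f"tree-aware: {max_wire_length(tree_edges, tree_aware):.3f}")
```

The structure-aware placement keeps every wire at unit length, while the row-major placement stretches one wire across the grid diagonal.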

Virtual PEs

Physical PEs

Use physical model structure to avoid using FPGA placement (Phase 2)

Phase 2 – Map virtual PEs to physical PEs

Embedding algorithm

H-tree embedding, linear embedding, direct map embedding

[1] Zienicke, P. 1990. Embeddings of Treelike Graphs into 2-Dimensional Meshes. (WG '90).
[2] Aleliunas, R., and Rosenberg, A.L. 1982. On Embedding Rectangular Grids in Square Grids. (Computers '82).
[3] Berman, F., and Snyder, L. 1987. On Mapping Parallel Algorithms into Parallel Architectures. (PDC '87).

2D grid of physical PEs
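As an illustration of the first option, here is a minimal sketch of the textbook H-tree layout for a complete binary tree on a 2D grid. It is not necessarily the variant from the cited papers or the one used in this work. Heap-style node numbering (root 1, children 2n and 2n + 1) and the function name are assumptions.

```python
def h_tree_embedding(depth):
    """Place a complete binary tree (levels 0..depth) on a 2D grid, H-tree style.

    Returns {node: (x, y)} with non-negative integer coordinates.
    Edges alternate between horizontal and vertical, and the edge length
    halves every two levels so that subtrees never overlap.
    """
    placement = {}

    def place(node, x, y, level):
        placement[node] = (x, y)
        if level == depth:
            return
        step = 2 ** ((depth - level - 1) // 2)
        dx, dy = (step, 0) if level % 2 == 0 else (0, step)
        place(2 * node, x - dx, y - dy, level + 1)
        place(2 * node + 1, x + dx, y + dy, level + 1)

    place(1, 0, 0, 0)

    # Shift so the smallest coordinate in each axis is 0.
    min_x = min(x for x, _ in placement.values())
    min_y = min(y for _, y in placement.values())
    return {n: (x - min_x, y - min_y) for n, (x, y) in placement.items()}


def longest_wire(placement):
    """Longest parent-child wire; Manhattan length, since all edges are axis-aligned."""
    return max(
        abs(placement[n][0] - placement[n // 2][0])
        + abs(placement[n][1] - placement[n // 2][1])
        for n in placement
        if n > 1
    )


layout = h_tree_embedding(4)  # 31 nodes
width = 1 + max(x for x, _ in layout.values())
height = 1 + max(y for _, y in layout.values())
print(f"{len(layout)} nodes on a {width}x{height} grid, longest wire {longest_wire(layout)}")
```

Because the placement is computed directly from the tree structure, no iterative search is needed, which is the appeal of embedding over annealing-based placement.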

Figure: PEs holding equation pairs EqP1/EqV1 through EqP7/EqV7, shown as a PE network and as placed on the 2D grid

Bypass FPGA placement

(Phase 1: May require "graph folding" first to reduce #PEs)

Compare/backup: Simulated annealing

Cost function:

C = w1*sum + w2*max + w3*gaps

Sum = sum of wire distances
Max = max wire length (Euclidean dist.)
Gaps = wires across architectural features

Neighbor function: Swap PEs based on distance to neighbors
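A minimal sketch of simulated annealing with this cost function follows. Several parts are assumptions made for illustration: the neighbor move is a uniformly random swap rather than the distance-guided swap named above, "gaps" is modeled as the number of wires crossing hypothetical feature columns, and the weights, cooling schedule, and example graph are invented.

```python
import math
import random


def placement_cost(edges, pos, weights=(1.0, 1.0, 1.0), feature_columns=()):
    """C = w1*sum + w2*max + w3*gaps for a placement {pe: (x, y)}."""
    w1, w2, w3 = weights
    lengths = [math.dist(pos[u], pos[v]) for u, v in edges]  # Euclidean wire lengths
    # Assumed gap model: a wire crosses feature column c if its endpoints lie
    # on opposite sides of the boundary between grid columns c - 1 and c.
    gaps = sum(
        1
        for u, v in edges
        for c in feature_columns
        if min(pos[u][0], pos[v][0]) < c <= max(pos[u][0], pos[v][0])
    )
    return w1 * sum(lengths) + w2 * max(lengths) + w3 * gaps


def anneal(edges, start, iterations=2000, t_start=2.0, t_end=0.01, seed=0, **cost_args):
    """Simulated annealing over pairwise swaps; returns (best placement, best cost)."""
    rng = random.Random(seed)
    pos = dict(start)
    slots = list(pos)  # PEs plus placeholder keys for empty grid cells
    cost = placement_cost(edges, pos, **cost_args)
    best, best_cost = dict(pos), cost
    for i in range(iterations):
        temp = t_start * (t_end / t_start) ** (i / iterations)  # geometric cooling
        a, b = rng.sample(slots, 2)
        pos[a], pos[b] = pos[b], pos[a]
        new_cost = placement_cost(edges, pos, **cost_args)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(pos), cost
        else:
            pos[a], pos[b] = pos[b], pos[a]  # rejected: undo the swap
    return best, best_cost


# Hypothetical example: 7-PE binary tree on a 3x3 grid, row-major start.
edges = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)]
start = {pe: divmod(pe - 1, 3) for pe in range(1, 8)}
start.update({"empty0": (2, 1), "empty1": (2, 2)})  # unused grid cells
start_cost = placement_cost(edges, start, weights=(1.0, 1.0, 0.0))
best, best_cost = anneal(edges, start, weights=(1.0, 1.0, 0.0))
print(f"cost {start_cost:.2f} -> {best_cost:.2f}")
```

The placeholder keys let a swap move a PE into an empty cell without special-casing, since they never appear in `edges` and so do not affect the cost.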

Results

No placement strategy (4 generations shown)

Simulated annealing placement (5 generations shown)

Embedding placement (5 generations shown)

Results

Chart by model and #PEs: no placement strategy, simulated annealing, embedding (annotation: not routable)

Strategy   # LUTs   # BRAM   # DSP   Equivalent LUTs
None       58362    512      256     306682
SA         58567    512      256     306887
Embed      58569    512      256     306889

No impact on size

2D Neuron model - 256PE – Xilinx Virtex6

Strategy   Total power (mW)   Dynamic power (mW)   Static power (mW)
None       15525              8744                 6481
SA         16604              10013                6590
Embed      19859              12999                6859

20% more power (embedding vs. SA)

Using Xilinx XPower Analyzer

Frequency vs. PE network size (chart; axes 0 to 400 and 0 to 600)

Results/Conclusions

http://www.meti.com/

Speedup vs real-time (avg)
PC(1): 0.8x
PC(4): 3.1x
GPU: 1.6x
HLS: 3.2x
General PE: 4.9x
Graph emb. (GPE): 11.2x
Custom PE: 6.1x
Heterog PE: 34.5x
(Graph emb. (HPE): 48.5x)

• Graph emb. gives additional speedup
  – FPGAs: Fastest cost-effective execution of physical models
  – http://www.youtube.com/watch?v=ThUKVhqoA3Q
• Future
  – Manycore device
  – Beyond testing CPS
    • Implement end-products

Speedup / dollar

Heterog PE
• 3.0x better than PC(4)
• 4.5x better than GPU

CPU (I7-950 + Intel X58 board): $480
GPU (GTX460 + I3-540 + H55 board): $380
FPGA (Xilinx Virtex6 240T-2 board): $1800
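The speedup-per-dollar ratios can be checked from the figures on this slide. The pairing used below is an assumption: PC(4) with the $480 CPU platform, the GPU result with the $380 GPU platform, and the Heterog PE result (34.5x, without graph embedding) with the $1800 FPGA board.

```python
# Average speedup vs real-time, taken from the slide.
speedup = {"PC(4)": 3.1, "GPU": 1.6, "Heterog PE": 34.5}

# Platform prices in USD, taken from the slide; the pairing is an assumption.
price_usd = {"PC(4)": 480, "GPU": 380, "Heterog PE": 1800}

per_dollar = {name: speedup[name] / price_usd[name] for name in speedup}

vs_pc4 = per_dollar["Heterog PE"] / per_dollar["PC(4)"]
vs_gpu = per_dollar["Heterog PE"] / per_dollar["GPU"]
print(f"Heterog PE vs PC(4): {vs_pc4:.2f}x")
print(f"Heterog PE vs GPU:   {vs_gpu:.2f}x")
```

Under that pairing the ratios come out at about 2.97 and 4.55, in line with the 3.0x and 4.5x quoted on the slide.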

Questions?
