A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resources

Shinya Takamaeda-Yamazaki†, Kenji Kise‡

†Nara Institute of Science and Technology (NAIST), Japan ‡Tokyo Institute of Technology (Tokyo Tech), Japan

ReConFig2014 Session 4B (PE) 10:15-10:40, Dec 9, 2014

Abstract

• flipSyrup: A framework for FPGA-based rapid prototyping with abstract memory blocks and inter-FPGA interfaces
  • Available at PyPI (https://pypi.python.org/pypi/flipsyrup)

2 ReConFig2014 Shinya T-Y. NAIST

[Figure: User logic reads and writes Syrup Memory instances (BRAMs without capacity limit) and exchanges data to/from other channels through Syrup Channel instances.]

Contents

• Background
  • FPGA-based rapid prototyping
• New framework: flipSyrup
  • Design flow with flipSyrup
  • Abstract objects for memory and inter-FPGA interface
  • Automatic RTL conversion by static analysis
• Evaluation
  • Multicore on a single FPGA platform
  • Manycore on a multi-FPGA platform
• Conclusion



Background: Multicores to Manycores

• Now: Multicore (2~8 cores per chip)
  • Intel Core i7 (8-core, x86)
  • ARM Cortex-A9 (4-core, ARM)
• (Now and) Future: Many-core (32+ cores per chip)
  • TILERA TILE-Gx100 (100-core, MIPS)
  • Intel Xeon Phi (54-core, x86)

FPGA-based Hardware Prototyping

• A major way to evaluate a new architectural idea
  • Pro: Fast simulation speed, 100x to 1000x faster than software simulators
  • Con: Very difficult and complicated to develop the system

Problem: Lack of abstractions for FPGA resources

Problems for Prototyping on FPGAs

[Figure: Memory, Target Processor, and Host Computer, annotated with the problems listed below.]

• Inter-FPGA communication
• Small on-chip memory
• Complex off-chip DRAM
• Cycle-level accuracy
• Capacity limitation of FPGAs
• Partition for multiple FPGAs
• Long synthesis time

Summary: Lack of Scalability, Lack of Abstraction

Goal of This Research

• Abstraction for Memory System
  • For comprehensive management of the entire memory system, spanning on-chip SRAM and off-chip DRAM
    • Simply adding off-chip memory expands the memory capacity, but it also increases the system complexity
• Abstraction for Inter-FPGA Communication
  • For managing cycle accuracy across multiple FPGAs
    • Simply using multiple FPGAs expands the logic capacity, but it requires design partitioning and a synchronization mechanism to keep cycle accuracy

To provide efficient abstractions that simplify the development of FPGA-based prototypes


flipSyrup

• A framework for FPGA-based rapid prototyping with abstract memory blocks and inter-FPGA interfaces
  • Syrup Memory: Presents an ideal, abstracted memory system to the user design (the processor RTL implementation)
    • For easy memory system implementation
  • Syrup Channel: Presents ideal inter-FPGA communication to the user design for multi-FPGA prototyping
    • For easy design partitioning of the simulated processor RTL

[Figure: User logic reads and writes Syrup Memory instances (BRAMs without capacity limit) and exchanges data to/from other channels through Syrup Channel instances.]

Development Flow with flipSyrup

[Figure: development flow, from a pure RTL design to the simulation result.]

• Manual RTL Modification: Pure RTL Design -> Partitioned Design with Abstract Objects
• Framework Tool-chain: Instance Hierarchy Analysis -> Control Signal Insertion -> Memory/Channel System Synthesis -> IP-core Packing (RTL and setting file) -> IP-cores for Simulation
  • Additional input: FPGA Memory Specifications (example: BRAM size = 128K, DRAM width = 128)
• Vendor EDA Tool-chain: IP-core Integration on EDK -> Synthesis by EDA -> Simulation System Bit Files
• FPGA-based Hardware Simulation: Simulation on FPGAs -> Simulation Result

RTL Modeling with Abstract Objects

• In advance, RAM objects and logic segments are identified by using the abstract objects of flipSyrup

[Figure: (a) Original Target Design: logic that reads and writes 1-cycle RAMs without capacity limits. (b) RTL Design with Abstract Objects: RAMs and I/Os are replaced with abstract objects. Sub-logic 0 to Sub-logic N-1 are placed in Region 0 to Region N-1, each with Syrup Memory and Syrup Channel instances, and the regions are linked by virtual connections. Together the sub-logic blocks equal the entire original logic.]

Complete Cycle-Accurate Simulation System

• The tool-chain generates a complete IP-core for cycle-accurate simulation of the target hardware

[Figure: FPGA Region 0 to FPGA Region N-1. Each region holds an automatically generated flipSyrup IP-core that contains the controlled simulation target (Sub-logic 0, ...) and the flipSyrup system: Memory I/Fs backed by caches, Channel I/Fs backed by FIFOs, and a Cycle-Accuracy Manager that drives the Stall signal. The IP-core attaches through an on-chip bus interface (AXI4 or Handshake) to the on-chip interconnect, which reaches off-chip DRAM, an I/O interface (Ser/Des) connected to other FPGAs, and another IP-core or CPU if needed.]
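The role of the Cycle-Accuracy Manager for inter-FPGA channels can be illustrated with a small software model. The sketch below is not flipSyrup code: the `Region`, `Link`, and `simulate` names, the toy state update, and the random link latency are all assumptions made for illustration. It shows one plausible policy, in which a region advances one target cycle only when the word it needs from the other region has arrived, and stalls otherwise.

```python
from collections import deque
import random


class Link:
    """In-order link whose per-word latency varies from 1 to max_delay host cycles."""

    def __init__(self, rng, max_delay):
        self.rng = rng
        self.max_delay = max_delay
        self.queue = deque()      # (arrival_time, word), arrival times never decrease
        self.last_arrival = 0

    def send(self, now, word):
        arrival = max(self.last_arrival, now + self.rng.randint(1, self.max_delay))
        self.last_arrival = arrival
        self.queue.append((arrival, word))

    def deliver(self, now, inbox):
        while self.queue and self.queue[0][0] <= now:
            inbox.append(self.queue.popleft()[1])


class Region:
    """One FPGA region: toy target logic plus a stall-on-empty-FIFO rule."""

    def __init__(self, state):
        self.state = state
        self.inbox = deque()      # words received from the other region
        self.target_cycle = 0
        self.trace = []           # state after each target cycle

    def host_step(self):
        """Run one host clock. Returns the word to send, or None when stalled."""
        if not self.inbox:
            return None           # stall: the input for this target cycle is missing
        received = self.inbox.popleft()
        self.state = (self.state * 3 + received) & 0xFFFF
        self.target_cycle += 1
        self.trace.append(self.state)
        return self.state


def simulate(max_delay, target_cycles, seed=0):
    """Return (trace of region A, trace of region B, host cycles used)."""
    rng = random.Random(seed)
    a, b = Region(1), Region(2)
    a.inbox.append(0)             # reset values so both regions can run cycle 0
    b.inbox.append(0)
    a_to_b, b_to_a = Link(rng, max_delay), Link(rng, max_delay)
    host = 0
    while a.target_cycle < target_cycles or b.target_cycle < target_cycles:
        a_to_b.deliver(host, b.inbox)
        b_to_a.deliver(host, a.inbox)
        for region, link in ((a, a_to_b), (b, b_to_a)):
            if region.target_cycle < target_cycles:
                word = region.host_step()
                if word is not None:
                    link.send(host, word)
        host += 1
    return a.trace, b.trace, host


fast = simulate(max_delay=1, target_cycles=50)
slow = simulate(max_delay=5, target_cycles=50)
print("same target behavior:", fast[:2] == slow[:2])
print("host cycles:", fast[2], "vs", slow[2])
```

In this model the link latency changes only the number of host cycles, never the per-target-cycle trace, which is the property the slide calls cycle accuracy.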

flipSyrup Abstract Objects in User RTL

• Syrup Memory: Abstract Memory
  • Behaves as an ideal block RAM in user RTL
• Syrup Channel: Abstract Inter-FPGA Interconnect
  • Behaves as a FIFO for inter-FPGA communications in user RTL

Syrup Memory (1-port)

    SyrupMemory1P #
    (
      .DOMAIN("domain"),
      .ID(0),
      .ADDR_WIDTH(W_A),
      .DATA_WIDTH(W_D),
      .WAY(1),
      .LINEWIDTH(128),
      .BYTE_ENABLE(0)
    )
    inst_mem0
    (
      .CLK(CLK),
      .ADDR(addr),
      .D(data_in),
      .WE(wen),
      .Q(data_out),
      .RE(ren),
      .BE()
    );

Syrup Out Channel

    SyrupOutChannel #
    (
      .DOMAIN("domain"),
      .ID(0),
      .DATA_WIDTH(W_D)
    )
    inst_outchannel
    (
      .CLK(CLK),
      .D(data_in),
      .WE(wen)
    );

Syrup In Channel

    SyrupInChannel #
    (
      .DOMAIN("domain"),
      .ID(0),
      .DATA_WIDTH(W_D)
    )
    inst_inchannel
    (
      .CLK(CLK),
      .Q(data_out),
      .RE(ren)
    );
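To make "behaves as an ideal block RAM" concrete, here is a minimal behavioral sketch in Python. It is an illustration, not the flipSyrup implementation: the class name `IdealBlockRam1P` is hypothetical, and the timing (synchronous write, read data registered into Q on the clock edge, old data returned when the same address is read and written together) is an assumption about typical block RAM behavior rather than something the slides specify.

```python
class IdealBlockRam1P:
    """Behavioral model of a 1-port RAM with the port names used by SyrupMemory1P.

    Assumed timing: on a clock edge, RE latches mem[ADDR] into Q and WE stores
    D at ADDR. A simultaneous read and write returns the old data.
    """

    def __init__(self, addr_width, data_width):
        self.depth = 1 << addr_width
        self.mask = (1 << data_width) - 1
        self.mem = {}     # sparse storage, standing in for "no capacity limit"
        self.q = 0        # registered read data (the Q port)

    def clock(self, addr, d=0, we=False, re=False):
        """Advance one clock edge and return the new value of Q."""
        addr %= self.depth
        if re:
            self.q = self.mem.get(addr, 0)
        if we:
            self.mem[addr] = d & self.mask
        return self.q


ram = IdealBlockRam1P(addr_width=20, data_width=32)
ram.clock(addr=0x12345, d=0xDEADBEEF, we=True)
value = ram.clock(addr=0x12345, re=True)
old = ram.clock(addr=0x12345, d=7, we=True, re=True)   # read-before-write
print(hex(value), hex(old))
```

The user RTL sees only this simple contract; how the data is actually held (on-chip cache, off-chip DRAM) is hidden behind the abstract object.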


[Figure: User logic reads and writes Syrup Memory instances (BRAMs without capacity limit) and exchanges data to/from other channels through Syrup Channel instances.]

Automatic RTL Conversion by Static Analysis

• Our Verilog HDL compiler automatically inserts
  • (1) a throttling signal (DRIVE) and
  • (2) external memory/channel ports
  • while preserving complete cycle accuracy of the RTL behavior
• We developed an original RTL analyzer in Python
  • Pyverilog: https://pypi.python.org/pypi/pyverilog/
    • You can install it by typing "pip install pyverilog"

[Figure: (a) Input: sub userlogic containing Abst Memory and Abst Channel. (b) Converted: the same sub userlogic with added Memory Ports, Channel Ports, and a DRIVE (= !stall) input.]
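The effect of the inserted DRIVE signal can be sketched in software. The model below is a hedged illustration, not flipSyrup's generated RTL: `TargetLogic`, `run`, the direct-mapped cache, and the 4-cycle miss penalty are all hypothetical. It assumes that every state update in the converted logic is guarded by DRIVE, and that DRIVE is deasserted while a cache miss is being served.

```python
class TargetLogic:
    """Toy user logic: each target cycle it reads mem[pc] and accumulates it."""

    def __init__(self, memory):
        self.memory = memory
        self.pc = 0
        self.acc = 0
        self.trace = []           # accumulator value after each target cycle

    def clock(self, drive):
        # Converted logic: every state update is guarded by DRIVE (= !stall).
        if not drive:
            return
        self.acc = (self.acc + self.memory[self.pc]) & 0xFFFFFFFF
        self.pc = (self.pc * 5 + 3) % len(self.memory)
        self.trace.append(self.acc)


def run(memory, target_cycles, cache_lines=None, miss_penalty=4):
    """Return (trace, host_cycles). cache_lines=None models the ideal FPGA."""
    logic = TargetLogic(memory)
    cached = {}                   # cache index -> address currently held
    refill_left = 0
    host_cycles = 0
    while len(logic.trace) < target_cycles:
        host_cycles += 1
        addr = logic.pc
        hit = cache_lines is None or cached.get(addr % cache_lines) == addr
        if hit:
            logic.clock(drive=True)
            continue
        if refill_left == 0:
            refill_left = miss_penalty          # start the off-chip fetch
        refill_left -= 1
        if refill_left == 0:
            cached[addr % cache_lines] = addr   # line arrives, hit on next cycle
        logic.clock(drive=False)                # target is frozen meanwhile
    return logic.trace, host_cycles


memory = [(i * 37 + 11) % 256 for i in range(64)]
ideal_trace, ideal_cycles = run(memory, target_cycles=100)
actual_trace, actual_cycles = run(memory, target_cycles=100, cache_lines=8)
print("traces match:", ideal_trace == actual_trace)
print("host cycles:", ideal_cycles, "ideal vs", actual_cycles, "with stalls")
```

Because the target state changes only while DRIVE is high, the stalled run produces the same trace as the ideal run and differs only in how many host cycles it takes.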

Automatic RTL Conversion in Verilog HDL

• flipSyrup automatically inserts additional signals and some "generate" conditions

(a) Instantiation Declaration of Memory in Input RTL

    generate for(i=0; i<2; i=i+1) begin: loop
      SyrupMemory1P
      #(
        .DOMAIN("domain"),
        .ID(i),
        .ADDR_WIDTH(W_A),
        .DATA_WIDTH(W_D)
      )
      inst_memory_name
      (
        .CLK(CLK),
        .ADDR(mem_addr),
        .D(mem_d),
        .WE(mem_we),
        .Q(mem_q)
      );
    end endgenerate

(b) Instantiation Declaration in Converted RTL

    generate for(i=0; i<2; i=i+1) begin: loop
      if((i == 0)) begin
        SyrupMemory1P
        #(
          .DOMAIN("domain"),
          .ID(i),
          .ADDR_WIDTH(W_A),
          .DATA_WIDTH(W_D)
        )
        inst_memory_name
        (
          .CLK(CLK),
          .ADDR(mem_addr),
          .D(mem_d),
          .WE(mem_we),
          .Q(mem_q),
          .p0_addr(domain_syrupmemory_0_addr),
          .p0_d(domain_syrupmemory_0_d),
          .p0_we(domain_syrupmemory_0_we),
          .p0_q(domain_syrupmemory_0_q),
          .DRIVE(DRIVE)
        );
      end else if((i == 1)) begin
        SyrupMemory1P
        #(
          .DOMAIN("domain"),
          .ID(i),
          .ADDR_WIDTH(W_A),
          .DATA_WIDTH(W_D)
        )
        inst_memory_name
        (
          .CLK(CLK),
          .ADDR(mem_addr),
          .D(mem_d),
          .WE(mem_we),
          .Q(mem_q),
          .p0_addr(domain_syrupmemory_1_addr),
          .p0_d(domain_syrupmemory_1_d),
          .p0_we(domain_syrupmemory_1_we),
          .p0_q(domain_syrupmemory_1_q),
          .DRIVE(DRIVE)
        );
      end
    end endgenerate
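The naming pattern visible in listing (b) can be reproduced with a few lines of Python. This is a toy string generator written for illustration only: it does not use Pyverilog, the helper names `external_ports` and `unroll` are hypothetical, and it emits only the added connections per branch, not the full instance. The wire naming `<domain>_syrupmemory_<id>_<port>` is inferred from the converted listing.

```python
def external_ports(domain, kind, inst_id, ports=("addr", "d", "we", "q")):
    """Build the extra port connections added to one abstract-object instance."""
    prefix = f"{domain}_{kind}_{inst_id}"
    connections = [f".p0_{port}({prefix}_{port})" for port in ports]
    connections.append(".DRIVE(DRIVE)")
    return connections


def unroll(domain, count):
    """Emit one if / else-if branch per generate-loop index, as in listing (b)."""
    lines = []
    for i in range(count):
        keyword = "if" if i == 0 else "end else if"
        lines.append(f"{keyword}((i == {i})) begin")
        lines.extend("  " + c for c in external_ports(domain, "syrupmemory", i))
    lines.append("end")
    return lines


print("\n".join(unroll("domain", 2)))
```

One likely reason for the per-index branches is that each instance needs a distinct external wire name, and an identifier cannot be built from a genvar inside the loop body.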


Evaluation: Multicore on Single FPGA Platform

• Prototyping Target
  • NoC-based multicore (like Heracles [5])
• FPGA Board
  • Xilinx VC707 (Virtex-7 XC7VX485T)

Evaluation Setup

  Core          MIPS32, 6-stage, single-issue, in-order
  DMAC          32-bit, 2-port
  Local Memory  32-bit, 4-port, 512KB/node, 1-cycle
  Router        4-stage, 3-virtual-channel
  # nodes       8 (flipSyrup cache: 256KB/node)
                16 (flipSyrup cache: 128KB/node)
                24 (flipSyrup cache: 64KB/node)
  Benchmark     Dhrystone (dh), N-Queen (nq), Matrix mult (mm), 5-point stencil (st)

[Figure: Target Multicore Processor with NoC on the Xilinx VC707 Board. Node labels: Core, DMAC, Local Memory, Router.]

[5] Michel A. Kinsy et al., Heracles: A Tool for Fast RTL-Based Design Space Exploration of Multicore Processors, ACM FPGA'13
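The setup table implies a simple trade-off between node count and cache size per node. The following sketch just works through that arithmetic from the table's numbers; the helper name `summarize` is hypothetical, and it assumes the flipSyrup cache is the only on-chip storage backing each node's 512KB local memory.

```python
LOCAL_MEMORY_KB = 512                           # simulated local memory per node
CACHE_KB_PER_NODE = {8: 256, 16: 128, 24: 64}   # node count -> flipSyrup cache [KB]


def summarize(nodes):
    """Totals for one configuration of the evaluation setup."""
    cache_kb = CACHE_KB_PER_NODE[nodes]
    return {
        "total_cache_kb": nodes * cache_kb,
        "total_simulated_kb": nodes * LOCAL_MEMORY_KB,
        "cached_fraction": cache_kb / LOCAL_MEMORY_KB,
    }


for nodes in CACHE_KB_PER_NODE:
    print(nodes, "nodes:", summarize(nodes))
```

As the node count grows, the simulated memory grows linearly while the on-chip cache budget does not, so a smaller fraction of each local memory is held on chip.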

Evaluation Methodology

• Comparison with an ideal FPGA with infinite BRAM
• On the ideal (non-existent) FPGA, ...
  • Since all simulated memory systems can be implemented as BRAM, there are no system-level cache misses, so there are no stalls
• On the actual FPGA, ...
  • Actual implementations incur stall events on cache misses to maintain cycle-accurate consistency
  • Cycles are classified into 4 types using performance counters
    • Hit: No stall (= # cycles on an ideal FPGA)
    • Miss: Cache miss on own cache
    • Conflict: Port conflict on own cache
    • Wait: Cache miss on other caches
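The four counters above can be turned into a normalized cycle count by dividing by the Hit count, since Hit cycles equal the cycle count on the ideal FPGA. The sketch below shows that arithmetic; the function names are hypothetical and the counter values are invented purely to illustrate the calculation, not measured results.

```python
STALL_CLASSES = ("hit", "miss", "conflict", "wait")


def normalized_breakdown(counters):
    """Scale each cycle class by the Hit count (the ideal-FPGA cycle count)."""
    ideal = counters["hit"]
    return {name: counters[name] / ideal for name in STALL_CLASSES}


def normalized_cycle_count(counters):
    """Total cycles on the actual FPGA divided by cycles on the ideal FPGA."""
    return sum(counters[name] for name in STALL_CLASSES) / counters["hit"]


# Hypothetical counter values, chosen only to illustrate the arithmetic.
example = {"hit": 1000, "miss": 450, "conflict": 150, "wait": 400}
breakdown = normalized_breakdown(example)
slowdown = normalized_cycle_count(example)
print(breakdown, slowdown)
```

With this normalization the Hit component is always 1.0, and everything above it is overhead introduced by running on a real FPGA.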


Simulation Speed

• Hardware using abstract memories can be simulated in about 2x the time required on an ideal FPGA
• The speed degrades as the core count increases
  • Due to waiting for other cache components and port conflicts

[Figure: Normalized cycle count (axis 0.00 to 2.50) for ds, nq, mm, st, and gmean at 8, 16, and 24 cores. Each bar stacks Hit (ideal) with the overhead components Miss, Conflict, and Wait.]

Resource Utilization

• The overhead is not serious
  • But Block RAM utilization is somewhat poor
    • Due to the sub-banking structure that supports byte-enable operations

[Figure: Utilization [%] (axis 0.0% to 90.0%) of Reg, LUT, and BRAM at 8, 16, and 24 cores, split into flipSyrup and Target.]


Evaluation: Manycore on Multi-FPGA Platform

• Prototyping Target: NoC-based manycore
  • The number of cores: 16-core (4x4), 32-core (4x8), 64-core (8x8), 128-core (16x8)
  • 2 benchmarks: nq (N-Queen), mm (matrix-matrix multiplication)
• FPGA environment: Multi-FPGA Platform
  • ScalableCore System (16~128 FPGA nodes) [8]
    • FPGA: Xilinx Spartan-6 LX16
    • 576Kbit BRAM + 512KB external SRAM
  • Operation frequency
    • Target and flipSyrup: 40MHz, Inter-FPGA Ser/Des: 80MHz
  • Design Tool: Xilinx ISE 14.6

[9] Shinya Takamaeda et al., ScalableCore System: Scalable Many-core Simulator by Employing over 100 FPGAs, ARC'12

Manycore on ScalableCore System

• Local memory and inter-FPGA communications are abstracted by flipSyrup

[Figure: ScalableCore System [6] (FPGA: Xilinx Spartan-6 ×128). Block labels: Target Core (Core, DMAC, Local Memory, R), System Functions, DRAM Controller.]

Simulation Speed

• Almost the same as the hand-tuned system
  • Original hand-tuned ScalableCore system: 1142 [kHz]
  • Acceptable simulation performance under the abstraction

[Figure: Simulation speed [kHz] (axis 0 to 1200) for N-Queen and Matrix Multiply at 16, 32, 64, and 128 cores. All eight bars are labeled 1111.]
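A quick calculation puts "almost the same" in numbers. This sketch rests on two assumptions: that the bar labels in the chart (1111) are the flipSyrup simulation speed in kHz, and that simulation speed means simulated target cycles per second, so that dividing the 40MHz operating frequency by it gives host cycles per simulated cycle. The variable names are illustrative only.

```python
HAND_TUNED_KHZ = 1142      # original hand-tuned ScalableCore system
FLIPSYRUP_KHZ = 1111       # assumed reading of the bar labels in the chart
HOST_CLOCK_KHZ = 40_000    # target and flipSyrup run at 40MHz

relative_speed = FLIPSYRUP_KHZ / HAND_TUNED_KHZ
slowdown_percent = (1 - relative_speed) * 100
host_cycles_per_target_cycle = HOST_CLOCK_KHZ / FLIPSYRUP_KHZ

print(f"relative speed: {relative_speed:.3f}")
print(f"slowdown: {slowdown_percent:.1f}%")
print(f"host cycles per simulated cycle: {host_cycles_per_target_cycle:.1f}")
```

Under these assumptions the abstraction costs a few percent of simulation speed relative to the hand-tuned system.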



Conclusion

• flipSyrup: A framework for FPGA-based rapid prototyping with abstract memory blocks and inter-FPGA interfaces
  • Available at PyPI (https://pypi.python.org/pypi/flipsyrup)
    • Install it by typing "pip install flipsyrup"

