A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resources
Presented at ReConFig2014 (Cancun, Mexico). flipSyrup, a new framework for rapid prototyping, is proposed.
Shinya Takamaeda-Yamazaki†, Kenji Kise‡
†Nara Institute of Science and Technology (NAIST), Japan ‡Tokyo Institute of Technology (Tokyo Tech), Japan
ReConFig2014 Session 4B (PE) 10:15-10:40, Dec 9, 2014
Abstract
■ flipSyrup: A framework for FPGA-based rapid prototyping with abstract memory blocks and inter-FPGA interfaces
  ● Available at PyPI (https://pypi.python.org/pypi/flipsyrup)
2 ReConFig2014 Shinya T-Y. NAIST
[Figure: user logic reads and writes Syrup Memory (BRAMs without capacity limit) and exchanges data with other channels through Syrup Channel.]
Contents
■ Background
  ● FPGA-based rapid prototyping
■ New framework: flipSyrup
  ● Design flow with flipSyrup
  ● Abstract objects for memory and inter-FPGA interface
  ● Automatic RTL conversion by static analysis
■ Evaluation
  ● Multicore on a single FPGA platform
  ● Manycore on a multi-FPGA platform
■ Conclusion
Background: Multicores to Manycores
Now: Multicore (2 to 8 cores per chip)
  ● Intel Core i7 (8-core, x86)
  ● ARM Cortex-A9 (4-core, ARM)
(Now and) Future: Many-core (32+ cores per chip)
  ● TILERA TILE-Gx100 (100-core, MIPS)
  ● Intel Xeon Phi (54-core, x86)
FPGA-based Hardware Prototyping
■ A major way of evaluating a new architectural idea
  ● Advantage: fast simulation speed, 100x to 1000x faster than software simulators
  ● Drawback: the system is very difficult and complicated to develop
Problem: Lack of abstractions for FPGA resources
Problems for Prototyping on FPGAs
[Figure: target processor, memory, and host computer, annotated with the following problems]
● Inter-FPGA communication
● Small on-chip memory, complex off-chip DRAM
● Cycle-level accuracy
● Capacity limitation of FPGAs
● Partition for multiple FPGAs
● Long synthesis time
Summary: Lack of Scalability, Lack of Abstraction
Goal of This Research
■ Abstraction for Memory System
  ● For comprehensive management of the entire memory system, spanning on-chip SRAM and off-chip DRAM
    • Just adding off-chip memory expands the memory capacity, but also increases the system complexity
■ Abstraction for Inter-FPGA Communication
  ● For cycle-accuracy management across multiple FPGAs
    • Just using multiple FPGAs expands the logic capacity, but requires design partitioning and a synchronization mechanism for cycle accuracy
Goal: to provide efficient abstractions that simplify the development of FPGA-based prototypes
flipSyrup
■ A framework for FPGA-based rapid prototyping with abstract memory blocks and inter-FPGA interfaces
  ● Syrup Memory: presents an ideal, abstracted memory system to the user design for processor RTL implementation
    • For easy memory system implementation
  ● Syrup Channel: presents ideal inter-FPGA communication to the user design for multi-FPGA prototyping
    • For easy design partitioning of simulated processor RTL
Development Flow with flipSyrup
[Figure: development flow, from a pure RTL design to a simulation result]
● Manual RTL Modification: Pure RTL Design -> Partitioned Design with Abstract Objects
● Framework Tool-chain: Instance Hierarchy Analysis, Control Signal Insertion, Memory/Channel System Synthesis, IP-core Packing (RTL and setting file) -> IP-cores for Simulation
  • Input: FPGA Memory Specifications (BRAM size = 128K, DRAM width = 128)
● Vendor EDA Tool-chain: IP-core Integration on EDK, Synthesis by EDA -> Simulation System Bit Files
● FPGA-based Hardware Simulation: Simulation on FPGAs -> Simulation Result
RTL Modeling with Abstract Objects
■ In advance, RAM objects and logic segments are identified by using the abstract objects of flipSyrup
[Figure: (a) Original Target Design: logic that reads and writes 1-cycle RAMs without capacity limits. (b) RTL Design with Abstract Objects: RAMs and I/Os are replaced with abstract objects (Syrup Memory and Syrup Channel); sub-logic 0 to sub-logic N-1 are placed in region 0 to region N-1 and joined by virtual connections, and together they equal the entire original logic.]
Complete Cycle-Accurate Simulation System
■ The tool-chain generates a complete IP-core for cycle-accurate simulation of the target hardware
[Figure: flipSyrup system. In FPGA region 0, the automatically generated flipSyrup IP-core (region 0) wraps the controlled simulation target (sub-logic 0). Its memory interfaces connect to caches and its channel interfaces connect to FIFOs, and a cycle-accuracy manager drives the stall signal. The IP-core attaches through an on-chip bus interface (AXI4 or handshake) to the on-chip interconnect, off-chip DRAM, and another IP-core or CPU (if needed). An I/O interface (Ser/Des) connects it to the other FPGAs, up to FPGA region N-1.]
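The role of the stall signal can be illustrated with a small software model. This is only a sketch under stated assumptions, not flipSyrup's implementation: the direct-mapped cache, the `MISS_PENALTY` value, and every name below are hypothetical. It shows the general idea that the target consumes one target cycle per access, while the FPGA spends extra cycles whenever an access misses the cache.

```python
# Hypothetical model: a direct-mapped cache in front of slow off-chip memory.
# The target logic issues one access per *target* cycle; a miss costs extra
# *FPGA* cycles, during which the target is stalled.

MISS_PENALTY = 10  # assumed number of FPGA cycles needed to refill one line


class DirectMappedCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines  # one tag per line, None = invalid

    def access(self, addr):
        """Return True on a hit; on a miss, refill the line and return False."""
        index = addr % self.num_lines
        tag = addr // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False


def simulate(addresses, num_lines):
    """Return (target_cycles, fpga_cycles) for one access per target cycle."""
    cache = DirectMappedCache(num_lines)
    target_cycles = 0
    fpga_cycles = 0
    for addr in addresses:
        if not cache.access(addr):
            fpga_cycles += MISS_PENALTY  # target is stalled during the refill
        fpga_cycles += 1                 # the cycle in which the target advances
        target_cycles += 1
    return target_cycles, fpga_cycles


trace = [0, 1, 0, 1, 4, 0]
print(simulate(trace, num_lines=4))
```

In this model the target always observes a 1-cycle memory; only the ratio of FPGA cycles to target cycles changes with the miss rate.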
flipSyrup Abstract Objects in User RTL
■ Syrup Memory: Abstract Memory
  ● Behaves as an ideal block RAM in user RTL
■ Syrup Channel: Abstract Inter-FPGA Interconnect
  ● Behaves as a FIFO for inter-FPGA communications in user RTL

Syrup Memory (1-port):

    SyrupMemory1P #
    (
      .DOMAIN("domain"),
      .ID(0),
      .ADDR_WIDTH(W_A),
      .DATA_WIDTH(W_D),
      .WAY(1),
      .LINEWIDTH(128),
      .BYTE_ENABLE(0)
    )
    inst_mem0
    (
      .CLK(CLK),
      .ADDR(addr),
      .D(data_in),
      .WE(wen),
      .Q(data_out),
      .RE(ren),
      .BE()
    );

Syrup Out Channel:

    SyrupOutChannel #
    (
      .DOMAIN("domain"),
      .ID(0),
      .DATA_WIDTH(W_D)
    )
    inst_outchannel
    (
      .CLK(CLK),
      .D(data_in),
      .WE(wen)
    );

Syrup In Channel:

    SyrupInChannel #
    (
      .DOMAIN("domain"),
      .ID(0),
      .DATA_WIDTH(W_D)
    )
    inst_inchannel
    (
      .CLK(CLK),
      .Q(data_out),
      .RE(ren)
    );
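The behavior that user RTL sees from these objects can be approximated in software. The sketch below is a hypothetical behavioral model, not code from flipSyrup: the class names, the read-before-write ordering within one clock edge, and the omission of the byte-enable port (BE) are all assumptions made for illustration. The port names in the comments follow the instantiations above.

```python
from collections import deque


class IdealMemory:
    """Synchronous RAM model: a read issued at one clock edge appears on Q."""

    def __init__(self):
        self.cells = {}  # sparse storage, so no capacity is fixed up front
        self.q = 0       # registered read data (port Q)

    def clock(self, addr, d=0, we=False, re=False):
        """Model one rising edge of CLK (assumed read-first behavior)."""
        if re:
            self.q = self.cells.get(addr, 0)  # unwritten cells read as 0
        if we:
            self.cells[addr] = d


class Channel:
    """FIFO model joining an out-channel (D, WE) to an in-channel (Q, RE)."""

    def __init__(self):
        self.fifo = deque()

    def write(self, d):
        """Out-channel side: enqueue one word."""
        self.fifo.append(d)

    def read(self):
        """In-channel side: dequeue the oldest word."""
        return self.fifo.popleft()


mem = IdealMemory()
mem.clock(addr=5, d=42, we=True)  # write 42 to address 5
mem.clock(addr=5, re=True)        # read it back on the next edge
print(mem.q)

ch = Channel()
ch.write(1)
ch.write(2)
print(ch.read(), ch.read())
```

A dictionary stands in for "BRAMs without capacity limit": the model accepts any address without declaring a size, which is the property the abstraction offers to the user design.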
Automatic RTL Conversion by Static Analysis
■ Our Verilog HDL compiler automatically inserts
  ● (1) a throttling signal (DRIVE) and
  ● (2) external memory/channel ports
  ● while keeping the complete cycle accuracy of the RTL behavior
■ We developed an original RTL analyzer in Python
  ● Pyverilog: https://pypi.python.org/pypi/pyverilog/
    • You can install it by typing "pip install pyverilog"
[Figure: (a) Input: user logic and its sub-module containing an abstract memory and an abstract channel. (b) Converted: the same hierarchy with memory ports, channel ports, and a DRIVE (= !stall) input added.]
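Why a throttling signal preserves cycle accuracy can be seen in a toy model. The sketch below is an illustration under assumptions, not the converter's output: the two-register design and the function name `run` are hypothetical. It treats DRIVE as a clock enable, so registers update only in cycles where DRIVE is 1, and checks that the sequence of register states is the same with and without stalls.

```python
def run(drive_pattern):
    """Clock a two-register design; state updates only when DRIVE is 1."""
    counter, delayed = 0, 0
    history = []  # register states after each *enabled* clock edge
    for drive in drive_pattern:
        if drive:  # DRIVE = !stall acts as a clock enable
            # Both registers update together from the old values,
            # as non-blocking assignments would in RTL.
            counter, delayed = counter + 1, counter
            history.append((counter, delayed))
    return history


no_stall = run([1, 1, 1, 1])
with_stall = run([1, 0, 0, 1, 1, 0, 1])

# Stalls add FPGA cycles but leave the observed state sequence unchanged.
print(no_stall == with_stall)
```

The stalled run takes seven FPGA cycles instead of four, yet the target passes through exactly the same four states.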
Automatic RTL Conversion in Verilog HDL
■ flipSyrup automatically inserts additional signals and some "generate" conditions

(a) Instantiation Declaration of Memory in Input RTL:

    generate for(i=0; i<2; i=i+1) begin: loop
      SyrupMemory1P
      #(
        .DOMAIN("domain"),
        .ID(i),
        .ADDR_WIDTH(W_A),
        .DATA_WIDTH(W_D)
      )
      inst_memory_name
      (
        .CLK(CLK),
        .ADDR(mem_addr),
        .D(mem_d),
        .WE(mem_we),
        .Q(mem_q)
      );
    end endgenerate

(b) Instantiation Declaration in Converted RTL:

    generate for(i=0; i<2; i=i+1) begin: loop
      if((i == 0)) begin
        SyrupMemory1P
        #(
          .DOMAIN("domain"),
          .ID(i),
          .ADDR_WIDTH(W_A),
          .DATA_WIDTH(W_D)
        )
        inst_memory_name
        (
          .CLK(CLK),
          .ADDR(mem_addr),
          .D(mem_d),
          .WE(mem_we),
          .Q(mem_q),
          .p0_addr(domain_syrupmemory_0_addr),
          .p0_d(domain_syrupmemory_0_d),
          .p0_we(domain_syrupmemory_0_we),
          .p0_q(domain_syrupmemory_0_q),
          .DRIVE(DRIVE)
        );
      end else if((i == 1)) begin
        SyrupMemory1P
        #(
          .DOMAIN("domain"),
          .ID(i),
          .ADDR_WIDTH(W_A),
          .DATA_WIDTH(W_D)
        )
        inst_memory_name
        (
          .CLK(CLK),
          .ADDR(mem_addr),
          .D(mem_d),
          .WE(mem_we),
          .Q(mem_q),
          .p0_addr(domain_syrupmemory_1_addr),
          .p0_d(domain_syrupmemory_1_d),
          .p0_we(domain_syrupmemory_1_we),
          .p0_q(domain_syrupmemory_1_q),
          .DRIVE(DRIVE)
        );
      end
    end endgenerate
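The converted RTL gives each loop iteration its own branch because the added external ports have a different name per memory ID. The sketch below reproduces only that naming pattern as seen in the converted code; the helper `external_ports` is hypothetical and is not part of flipSyrup or Pyverilog.

```python
def external_ports(domain, mem_id):
    """Return the port connections added to memory instance `mem_id`."""
    prefix = f"{domain}_syrupmemory_{mem_id}"
    ports = [f".p0_{sig}({prefix}_{sig})" for sig in ("addr", "d", "we", "q")]
    ports.append(".DRIVE(DRIVE)")  # throttling signal shared by all instances
    return ports


# One branch per value of the generate loop variable (i = 0, 1).
for i in range(2):
    print(f"// branch for i == {i}")
    print(",\n".join(external_ports("domain", i)))
```

Since the signal names depend on the loop index, a single instantiation inside the loop cannot express them, which is one plausible reason the loop body is split into per-index conditions.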
Evaluation: Multicore on Single FPGA Platform
■ Prototyping Target
  ● NoC-based multicore (like Heracles [5])
■ FPGA Board
  ● Xilinx VC707 (Virtex-7 XC7VX485T)
Evaluation Setup
  Core:          MIPS32, 6-stage, single-issue, in-order
  DMAC:          32-bit, 2-port
  Local Memory:  32-bit, 4-port, 512KB/node, 1-cycle
  Router:        4-stage, 3-virtual-channel
  # nodes:       8 (flipSyrup cache: 256KB/node), 16 (flipSyrup cache: 128KB/node), 24 (flipSyrup cache: 64KB/node)
  Benchmark:     Dhrystone (dh), N-Queen (nq), Matrix mult (mm), 5-point stencil (st)
[Figure: Target Multicore Processor with NoC on the Xilinx VC707 Board; each node consists of a core, a DMAC, a local memory, and a router.]
[5] Michel A. Kinsy et al., Heracles: A Tool for Fast RTL-Based Design Space Exploration of Multicore Processors, ACM FPGA'13
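The setup table implies how much of the simulated local memory is held in on-chip cache for each configuration. The short calculation below uses only the figures from the table; the function name `capacity` is a hypothetical helper, and the interpretation (the cache covers a fraction of each node's 512KB local memory) is an assumption drawn from the table.

```python
LOCAL_MEMORY_KB = 512                 # simulated local memory per node
CACHE_KB = {8: 256, 16: 128, 24: 64}  # flipSyrup cache per node, by node count


def capacity(nodes):
    """Return (simulated MB, on-chip cache MB, cache / local-memory ratio)."""
    cache_kb = CACHE_KB[nodes]
    simulated_mb = nodes * LOCAL_MEMORY_KB / 1024
    cache_mb = nodes * cache_kb / 1024
    return simulated_mb, cache_mb, cache_kb / LOCAL_MEMORY_KB


for nodes in CACHE_KB:
    simulated_mb, cache_mb, ratio = capacity(nodes)
    print(f"{nodes:2d} nodes: simulated {simulated_mb:.1f} MB, "
          f"cache {cache_mb:.1f} MB ({ratio:.1%} of local memory)")
```

The simulated memory grows with the node count while the per-node cache shrinks, so the fraction of local memory held on chip falls from one half to one eighth.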
Evaluation Methodology
■ Comparison with an ideal FPGA with infinite BRAM
■ On the ideal (non-existent) FPGA, …
  ● Since all simulated memory systems can be implemented as BRAM, there are no system-level cache misses, and therefore no stalls
■ On the actual FPGA, …
  ● Actual implementations have stall events due to cache misses, which are needed to keep cycle-accurate consistency
  ● The causes of stall events are classified into 4 types using performance counters
    • Hit: No stall (= # cycles on an ideal FPGA)
    • Miss: Cache miss on own cache
    • Conflict: Port conflict on own cache
    • Wait: Cache miss on other caches
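Given the four counters, the slowdown relative to the ideal FPGA follows directly, because the Hit count equals the cycle count on the ideal FPGA. The sketch below shows that calculation; the function names and the counter values in the example are hypothetical and are not measured results.

```python
def normalized_cycles(hit, miss, conflict, wait):
    """Total FPGA cycles divided by the cycles needed on an ideal FPGA."""
    total = hit + miss + conflict + wait
    return total / hit


def breakdown(hit, miss, conflict, wait):
    """Each category's share of the ideal cycle count (Hit is always 1.0)."""
    counters = {"Hit": hit, "Miss": miss, "Conflict": conflict, "Wait": wait}
    return {name: value / hit for name, value in counters.items()}


# Hypothetical counter values, for illustration only.
example = dict(hit=1000, miss=300, conflict=100, wait=200)
print(normalized_cycles(**example))
print(breakdown(**example))
```

A normalized value of 1.0 would mean no stalls at all; everything above 1.0 is overhead introduced by the cache-based memory system.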
Simulation Speed
■ Hardware using abstract memories can be simulated in about 2x the time required on an ideal FPGA
■ The speed degrades as the core count increases
  ● Due to waiting for other cache components and port conflicts
[Figure: stacked bar chart of normalized cycle count (y-axis, 0.00 to 2.50) for ds, nq, mm, st, and gmean on the 8-core, 16-core, and 24-core configurations. Each bar is split into Hit (ideal) and the overhead components Miss, Conflict, and Wait.]
Resource Utilization
■ The overhead is not serious
  ● But Block RAM utilization is somewhat poor
    • Due to the sub-banking structure needed to support byte-enable operations
[Figure: bar chart of utilization [%] (y-axis, 0.0% to 90.0%) of Reg, LUT, and BRAM on the 8-core, 16-core, and 24-core configurations, split into flipSyrup and Target.]
Evaluation: Manycore on Multi-FPGA Platform
■ Prototyping Target: NoC-based manycore
  ● The number of cores: 16-core (4x4), 32-core (4x8), 64-core (8x8), 128-core (16x8)
  ● 2 benchmarks: nq (N-queen), mm (matrix-matrix multiplication)
■ FPGA environment: Multi-FPGA Platform
  ● ScalableCore System (16 to 128 FPGA nodes) [9]
    • FPGA: Xilinx Spartan-6 LX16
    • 576Kbit BRAM + 512KB external SRAM
  ● Operation frequency
    • Target and flipSyrup: 40MHz, Inter-FPGA Ser/Des: 80MHz
  ● Design Tool: Xilinx ISE 14.6
[9] Shinya Takamaeda et al., ScalableCore System: Scalable Many-core Simulator by Employing over 100 FPGAs, ARC'12
Manycore on ScalableCore System
■ Local memory and inter-FPGA communications are abstracted by flipSyrup
[Figure: one FPGA node, containing the target core (core, DMAC, local memory, and router R) and the system functions, including DRAM controllers.]
[Photo: ScalableCore System [9], FPGA: Xilinx Spartan-6 ×128]
Simulation Speed
■ Almost the same as the hand-tuned system
  ● Original hand-tuned ScalableCore system: 1142 [KHz]
  ● Acceptable simulation performance under the abstraction
[Figure: bar chart of simulation speed [KHz] (y-axis, 0 to 1200) for N-Queen and Matrix Multiply on the 16-core, 32-core, 64-core, and 128-core configurations; every bar is labeled 1111.]
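The two speeds can be put side by side with a little arithmetic. The calculation below uses only numbers from the slides (1142 KHz, the 1111 bar labels, and the 40MHz clock); reading the simulation speed as simulated target cycles per second, and therefore deriving FPGA cycles per target cycle from it, is an assumption made here for illustration.

```python
HAND_TUNED_KHZ = 1142    # original hand-tuned ScalableCore system
FLIPSYRUP_KHZ = 1111     # bar label in the chart, all configurations
FPGA_CLOCK_KHZ = 40_000  # target and flipSyrup clock (40 MHz)

# Speed of the flipSyrup version relative to the hand-tuned system.
relative_speed = FLIPSYRUP_KHZ / HAND_TUNED_KHZ

# Assumed interpretation: FPGA clock cycles spent per simulated target cycle.
fpga_cycles_per_target_cycle = FPGA_CLOCK_KHZ / FLIPSYRUP_KHZ

print(f"relative speed: {relative_speed:.1%}")
print(f"FPGA cycles per target cycle: {fpga_cycles_per_target_cycle:.1f}")
```

Under these assumptions the abstraction costs under 3% of the hand-tuned simulation speed.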
Conclusion
■ flipSyrup: A framework for FPGA-based rapid prototyping with abstract memory blocks and inter-FPGA interfaces
  ● Available at PyPI (https://pypi.python.org/pypi/flipsyrup)
    • Install it with the command "pip install flipsyrup"