A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs
1
A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs
† MIT Computer Science and AI Lab, Computation Structures Group
{pellauer, vmurali, arvind}@csail.mit.edu
‡ Intel VSSAD Group
{michael.adler, joel.emer}@intel.com
Michael Pellauer†
Muralidaran Vijayaraghavan†
Michael Adler‡
Arvind†
Joel Emer†‡
2
Introduction
The modern circuit design flow:
Specify System Requirements -> Explore Architecture Alternatives -> Write Circuit RTL -> Verify -> Physical Manufacture
FPGAs used here: Alternative to ASIC tapeout
FPGAs used here: Create prototype for verification
Interest in using FPGAs for Performance Models
  HAsim: Re-implement Intel Asim simulator on FPGA (this talk)
  Other Projects: Liberty, UT-FAST
  Performance modeling efforts in RAMP
3
Performance Models
Created early in the design process
  Drive architectural exploration, feasibility analysis
  Parameterization, ease of change
Show how many clock cycles an operation takes
  Do not show what the clock cycle time is
Homebrewed C or SystemC: synchronous simulation
The software performance modeling crisis:
  Development Time
  Simulation speed
  Accuracy
Frequency (order of magnitude):
  Low-detail models      100 KHz
  Medium-detail models   10 KHz
  High-detail models     1 KHz
FPGAs can help!
  Performance models have high degree of parallelism within a model clock cycle
  It’s all modeling gates
But many interesting circuits have no good implementations on LUT-based FPGAs
  CAMs, many-ported register files, nested MUXes, etc.
The solution:
  Configure FPGA into circuit simulator
  Virtualize the FPGA clock
4
Performance Model on FPGA
Implemented using Bluespec SystemVerilog
Xilinx Virtex-II Pro 70 using Xilinx ISE 8.1i
Clock speed != simulation speed
  Uses 1 execution unit to simulate 4 parallel execution units
  Sequentially searches BlockRAM to simulate parallel CAM
  Ends up taking about 15.6 FPGA cycles per model cycle
  Result: 95 / 15.6 ≈ 6 MHz simulation rate
4-way Superscalar, Out of Order:
  FPGA Slices                                  22,873 (69%)
  Block RAMs                                   25 (7%)
  Clock Speed                                  95 MHz
  Simulation Rate                              6 MHz
  Avg. FPGA cycle to Model cycle Ratio (FMR)   15.6
  Average Simulator IPS                        4.7 MIPS
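The arithmetic behind the simulation rate can be checked in a few lines of Python. This is only a sketch: the function name is hypothetical, and the instructions-per-model-cycle figure at the end is derived here from the reported numbers, not stated on the slide.

```python
def simulation_rate_mhz(fpga_clock_mhz: float, fmr: float) -> float:
    """Model cycles simulated per microsecond: FPGA clock divided by the
    FPGA-cycle-to-model-cycle ratio (FMR)."""
    return fpga_clock_mhz / fmr

# Numbers reported on the slide.
rate = simulation_rate_mhz(95.0, 15.6)
print(f"simulation rate = {rate:.2f} MHz")

# Derived, not reported: simulated instructions per model cycle implied by 4.7 MIPS.
implied_ipc = 4.7 / rate
print(f"implied instructions per model cycle = {implied_ipc:.2f}")
```

The same formula applies to any module or whole model: a lower FMR at the same FPGA clock gives a proportionally higher simulation rate.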
5
Example Target
Register File with 2 Read Ports, 2 Write Ports
Reads take zero clock cycles
Direct configuration onto FPGA: 9242 slices, 104 MHz
[Figure: 2R/2W Register File with ports rd_addr1, rd_val1, rd_addr2, rd_val2, wr_addr1, wr_val1, wr_addr2, wr_val2]
            CC 1   CC 2
rd_addr1    A      C
rd_val1     V(A)   V(C)
rd_addr2    B      D
rd_val2     V(B)   V(D)
6
Example as Performance Model
Simulate the circuit using synchronous BlockRAM
First do reads, then serialize writes
Only update model time when all requests are serviced
Results: 94 slices, 1 BlockRAM, 224 MHz
Simulation rate is 224 / 3 ≈ 75 MHz (FPGA-to-Model Ratio of 3)
Model CC 1
FPGA CC: 1 2 3
rd_addr1 A A A
rd_val1 V(A) V(A)
rd_addr2 B B B
rd_val2 V(B)
Separated model clock from FPGA clock
How do we compose these modules into a correct, efficient system?
Let’s examine how a software performance model does it
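The idea of spending several FPGA cycles on one model cycle can be sketched in software. The Python below is an illustration, not the Bluespec implementation: the class name is hypothetical, and the exact assignment of work to the three FPGA cycles (reads, then one write per cycle) is an assumption, since the slide only gives the total of three.

```python
class RegFileModel:
    """2R/2W register file modeled on one memory, 3 FPGA cycles per model cycle."""

    # Assumed schedule: reads, first write, second write.
    FPGA_CYCLES_PER_MODEL_CYCLE = 3

    def __init__(self, num_regs: int = 32):
        self.mem = [0] * num_regs   # stands in for the BlockRAM
        self.model_cycle = 0
        self.fpga_cycle = 0

    def step(self, rd_addr1, rd_addr2, writes):
        """Simulate one model cycle; `writes` holds up to two (addr, val) pairs."""
        # Reads go first, so both see the state from before this cycle's writes.
        rd_val1 = self.mem[rd_addr1]
        rd_val2 = self.mem[rd_addr2]
        # Writes are then serialized one after another.
        for addr, val in writes[:2]:
            self.mem[addr] = val
        # Model time advances only once every request has been serviced.
        self.fpga_cycle += self.FPGA_CYCLES_PER_MODEL_CYCLE
        self.model_cycle += 1
        return rd_val1, rd_val2

rf = RegFileModel()
rf.step(0, 1, [(2, 10), (3, 20)])   # model cycle 1: write two registers
vals = rf.step(2, 3, [])            # model cycle 2: read them back
print(vals, rf.model_cycle, rf.fpga_cycle)
```

From the outside, the model still behaves like a register file whose reads take zero model clock cycles; only the FPGA cycle count reveals the serialization.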
7
Time in Software Asim Model
[Figure: modules FET, DEC, EXE, MEM, WB connected by Ports labeled 1 or 2]
Software has no inherent clock
Model time is tracked via Asim “Ports”
  All communication goes through Ports
  Ports have a model time latency for messages
Execution model: for each module in system
  Simulate a model cycle for that module
  Reads all input Ports, writes all output Ports
  Can write special “NoMessage” value to indicate no activity
FPGA: Can simulate in parallel instead of sequentially
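The execution model above can be sketched as a short Python program. This is not Asim's actual API: `Port`, `Stage`, and the testbench loop are hypothetical names, and pre-filling each port with `latency` NoMessage entries is one assumed way to realize the latency.

```python
from collections import deque

NO_MESSAGE = None  # stands in for the special "NoMessage" value

class Port:
    """A channel whose messages arrive `latency` model cycles after being sent."""
    def __init__(self, latency: int):
        # Pre-filling gives the first `latency` reads something to return.
        self.queue = deque([NO_MESSAGE] * latency)

    def write(self, msg):
        self.queue.append(msg)

    def read(self):
        return self.queue.popleft()

class Stage:
    """Hypothetical module: adds 1 to whatever arrives and forwards it."""
    def __init__(self, inp: Port, out: Port):
        self.inp, self.out = inp, out

    def simulate_cycle(self):
        # Each model cycle reads every input Port and writes every output Port.
        msg = self.inp.read()
        self.out.write(NO_MESSAGE if msg is NO_MESSAGE else msg + 1)

p_in, p_mid, p_out = Port(1), Port(1), Port(1)
modules = [Stage(p_in, p_mid), Stage(p_mid, p_out)]
seen = []
for model_cycle in range(5):
    p_in.write(model_cycle * 10)   # testbench drives one input per model cycle
    for m in modules:              # software visits the modules sequentially
        m.simulate_cycle()
    seen.append(p_out.read())
print(seen)
```

With three latency-1 ports in series, the value sent in model cycle 0 emerges in model cycle 3, and the earlier reads return NoMessage. Because every port here has latency of at least 1, the order in which the modules are visited within a model cycle does not change the result.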
8
Barrier Synchronization
Controller tracks current Model CC
  Tells all modules “begin”
  Modules copy input, compute, write output, say “done”
  When all are done, increment cycle count and repeat
More fine-grained parallelism than parallel software models
FPGA-to-Model Ratio: Dynamic worst case
But what about clock rate?
[Figure: modules FET, DEC, EXE, MEM, WB all connected to a central Controller holding curCC]
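The "dynamic worst case" behavior can be made concrete with a small cost model. The sketch below is an assumption-laden illustration: the function name and the per-module work figures are hypothetical, and controller overhead is ignored.

```python
def barrier_sync_fpga_cycles(work):
    """Total FPGA cycles when every model cycle waits for the slowest module.

    `work[m][c]` is the number of FPGA cycles module m needs for model cycle c.
    """
    num_model_cycles = len(work[0])
    total = 0
    for c in range(num_model_cycles):
        # Controller says "begin", then waits until every module says "done".
        total += max(module[c] for module in work)
    return total

work = [
    [1, 4, 1, 1],   # hypothetical module A
    [3, 1, 1, 2],   # hypothetical module B
]
total = barrier_sync_fpga_cycles(work)
print(total, total / len(work[0]))
```

Each module needs only 7 FPGA cycles in total, yet the barrier spends 10, because every model cycle costs as much as its slowest module.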
9
…
The problem with Barrier Sync
Becomes critical path with large number of modules
Massive fan-in/fan-out from controller
  Quickly becomes critical path
Experiment: Linear topology of modules
Could pipeline or cluster, but we can do better
[Figure: Clock Frequency (MHz), 0 to 140, versus Number of Modules, 25 to 100]
10
A-Ports: Asim Ports on an FPGA
[Figure: modules FET, DEC, EXE, MEM, WB connected by Ports labeled 1 or 2]
Scalable: Distributed control, no combinational paths, no counters
Port with n latency starts with n NoMessage in it
Each module may proceed in a “dataflow” manner
  Start a cycle whenever all inputs are available
  Compute for any number of FPGA cycles
  Stall if output ports are full
A module can derive the current model cycle by counting
Adjacent modules may not be on the same model cycle…
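The firing rule described above can be sketched in Python. This is a software illustration under stated assumptions, not the hardware design: `APort`, `Module`, `try_step`, and the `capacity` parameter are hypothetical names, and the per-module `model_cycle` counter exists only so the demo can observe where each module is.

```python
from collections import deque

NO_MESSAGE = None  # stands in for the "NoMessage" value

class APort:
    """Bounded FIFO; a port of latency n starts holding n NoMessage entries."""
    def __init__(self, latency, capacity):
        assert capacity >= latency
        self.capacity = capacity
        self.fifo = deque([NO_MESSAGE] * latency)

    def can_read(self):
        return len(self.fifo) > 0

    def can_write(self):
        return len(self.fifo) < self.capacity

class Module:
    def __init__(self, inputs, outputs):
        self.inputs, self.outputs = inputs, outputs
        self.model_cycle = 0   # local count of completed model cycles

    def try_step(self):
        """Run one model cycle if every input has data and no output is full."""
        if not all(p.can_read() for p in self.inputs):
            return False       # wait for inputs
        if not all(p.can_write() for p in self.outputs):
            return False       # stall on a full output port
        msgs = [p.fifo.popleft() for p in self.inputs]  # a real module computes on these
        for p in self.outputs:
            p.fifo.append(self.model_cycle)  # hypothetical payload: sender's cycle number
        self.model_cycle += 1
        return True

port = APort(latency=1, capacity=3)
producer = Module(inputs=[], outputs=[port])
consumer = Module(inputs=[port], outputs=[])

while producer.try_step():     # producer runs until the port is full
    pass
ahead = (producer.model_cycle, consumer.model_cycle)
while consumer.try_step():     # consumer then drains the port
    pass
behind = (producer.model_cycle, consumer.model_cycle)
print(ahead, behind)
```

No central controller appears anywhere: each module decides locally whether it can proceed, and the two modules are on different model cycles both before and after the consumer drains the port.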
11
Modules can “slip” in model time
Observation: when a port of latency n has n messages in it
  Producer and Consumer are on same model cycle
Producers can run ahead, prebuffering data
Consumers can run ahead, draining data
Still works with backwards paths
With proper buffering we can get average number of FPGA cycles per model cycle
  Much better than worst case a la Barrier
[Figure: modules FET, DEC, EXE, MEM, WB connected by Ports labeled 1 or 2]
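The benefit of slip can be illustrated with a simple timing model of one producer feeding one consumer. Everything here is an assumption made for illustration: the function name, the work figures, and the rule that a producer reserves its output slot when it starts a model cycle. It covers only a two-module feed-forward chain, not the general case with backwards paths.

```python
def aports_fpga_cycles(wp, wc, latency=1, capacity=2):
    """FPGA cycles for a producer/consumer pair joined by one A-Port.

    wp[c] and wc[c] are the FPGA cycles each module needs for model cycle c.
    """
    assert latency >= 1 and capacity >= latency
    p_fin, c_start, c_fin = [], [], []
    for c in range(len(wp)):
        # Consumer: previous cycle must be done and message c must be in the port.
        ready = c_fin[c - 1] if c > 0 else 0
        if c >= latency:                   # earlier messages are the initial NoMessages
            ready = max(ready, p_fin[c - latency])
        c_start.append(ready)
        c_fin.append(ready + wc[c])
        # Producer: previous cycle must be done and a slot must be free.
        ready = p_fin[c - 1] if c > 0 else 0
        freed_by = c + latency - capacity  # consumer cycle whose dequeue frees the slot
        if freed_by >= 0:
            ready = max(ready, c_start[freed_by])
        p_fin.append(ready + wp[c])
    return max(p_fin[-1], c_fin[-1])

wp = [1, 4, 1, 1]   # hypothetical producer work per model cycle
wc = [3, 1, 1, 2]   # hypothetical consumer work per model cycle
barrier = sum(max(a, b) for a, b in zip(wp, wc))
tight = aports_fpga_cycles(wp, wc, latency=1, capacity=1)
buffered = aports_fpga_cycles(wp, wc, latency=1, capacity=2)
print(barrier, tight, buffered)
```

In this toy case, a port with no spare buffering matches the barrier's 10 FPGA cycles, while one extra slot lets the modules slip past each other's slow cycles and finish in 8, closer to the 7 cycles each module needs on its own.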
12
Example: MIPS R10K-like Processor
4-Way Superscalar, Out-of-order Issue
13
Results: OOO Simulator
[Figure: Out-Of-Order Simulator Speedup (y-axis 0 to 1.4) for median, multiply, qsort, towers, vvadd, and average; series: Barrier Sync, A-Ports Default Buffers, A-Ports Optimal Buffers]
14
Takeaways
Performance Modeling on FPGAs shows great potential
  Cycle-accurate simulation in MHz vs KHz
A-Ports:
  Distributed, efficient tracking of time that scales
  Manages dynamic “slip” in model time
  Dynamic average case instead of worst case
In paper: a technique to resynchronize modules to the same model clock cycle
Underway: Effort to model realistic multicore systems
Future Work: Combine with Chung-style virtualization [FPGA 2008] to eliminate pipeline stalls