a-ports: a distributed, efficient technique for performance models on fpgas

1

A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs

†MIT Computer Science and AI Lab, Computation Structures Group

{pellauer, vmurali, arvind}@csail.mit.edu

‡Intel, VSSAD Group

{michael.adler, joel.emer}@intel.com

Michael Pellauer†

Muralidaran Vijayaraghavan†

Michael Adler‡

Arvind†

Joel Emer†‡

2

Introduction

The modern circuit design flow:

Specify System Requirements -> Explore Architecture Alternatives -> Write Circuit RTL -> Verify -> Physical Manufacture

FPGAs are used as an alternative to ASIC tapeout

FPGAs are used to create prototypes for verification

Interest in using FPGAs for Performance Models

HAsim: Re-implement Intel Asim simulator on FPGA (this talk)

Other Projects: Liberty, UT-FAST

Performance modeling efforts in RAMP

3

Model detail           Frequency (order of magnitude)
Low-detail models      100 kHz
Medium-detail models   10 kHz
High-detail models     1 kHz

Performance Models

Created early in the design process

Drive architectural exploration, feasibility analysis

parameterization, ease of change

Shows how many clock cycles an operation takes

Does not show what clock cycle time is

Homebrewed C or SystemC: synchronous simulation

The software performance modeling crisis:

Development Time

Simulation speed

Accuracy

FPGAs can help!

Performance models have high degree of parallelism within a model clock cycle

It’s all modeling gates

But many interesting circuits have no good implementations on LUT-based FPGAs

CAMs, many-ported register files, nested MUXes, etc.

The solution:

Configure FPGA into circuit simulator

Virtualize the FPGA clock

4

Performance Model on FPGA

Implemented using Bluespec SystemVerilog

Xilinx Virtex IIPro 70 using Xilinx ISE 8.1i

Clock speed != simulation speed

Uses 1 execution unit to simulate 4 parallel execution units

Sequentially searches BlockRAM to simulate parallel CAM

Ends up taking about 15.6 FPGA cycles per model cycle

Result: 95 / 15.6 ≈ 6 MHz simulation rate

4-way Superscalar, Out of Order

FPGA Slices                                   22,873 (69%)
Block RAMs                                    25 (7%)
Clock Speed                                   95 MHz
Simulation Rate                               6 MHz
Avg. FPGA cycle to Model cycle Ratio (FMR)    15.6
Average Simulator IPS                         4.7 MIPS
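The relation between clock speed and simulation speed on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `simulation_rate_mhz` is made up for illustration and is not part of HAsim.

```python
def simulation_rate_mhz(fpga_clock_mhz, fmr):
    """Model cycles simulated per microsecond, given the FPGA clock and the
    average FPGA-cycle-to-Model-cycle Ratio (FMR)."""
    if fmr <= 0:
        raise ValueError("FMR must be positive")
    return fpga_clock_mhz / fmr


# Numbers from the slide: 95 MHz FPGA clock, 15.6 FPGA cycles per model cycle.
rate = simulation_rate_mhz(95, 15.6)
print(f"{rate:.2f} MHz")  # about 6 MHz, as reported in the table
```

The same formula explains why a lower FPGA clock can still be a win: what matters is the clock divided by the FMR, not the clock alone.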

5

Example Target

Register File with 2 Read Ports, 2 Write Ports

Reads take zero clock cycles

Direct configuration onto FPGA: 9242 slices, 104 MHz

[Figure: 2R/2W Register File with ports rd_addr1, rd_val1, rd_addr2, rd_val2, wr_addr1, wr_val1, wr_addr2, wr_val2]

            CC 1    CC 2
rd_addr1    A       C
rd_val1     V(A)    V(C)
rd_addr2    B       D
rd_val2     V(B)    V(D)

6

Example as Performance Model

Simulate the circuit using synchronous BlockRAM

First do reads, then serialize writes

Only update model time when all requests are serviced

Results: 94 slices, 1 BlockRAM, 224 MHz

Simulation rate is 224 / 3 ≈ 75 MHz (FPGA-to-Model Ratio of 3)

Model CC:   1
FPGA CC:    1       2       3
rd_addr1    A       A       A
rd_val1             V(A)    V(A)
rd_addr2    B       B       B
rd_val2                     V(B)

Separated model clock from FPGA clock

How do we compose these modules into a correct, efficient system?

Let’s examine how a software performance model does it

7

Time in Software Asim Model

[Figure: modules FET, DEC, EXE, MEM, WB connected by Ports labeled with latencies 1 and 2]

Software has no inherent clock

Model time is tracked via Asim “Ports”

All communication goes through Ports

Ports have a model time latency for messages

Execution model: for each module in system

Simulate a model cycle for that module

Reads all input Ports, writes all output Ports

Can write special “NoMessage” value to indicate no activity

FPGA: Can simulate in parallel instead of sequentially
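The software execution model above can be sketched in Python. This is a minimal illustration, not Asim's actual API: the `Port` class, the `fetch` and `decode` functions, and the port name `fet_to_dec` are all hypothetical, and the sketch assumes every port has a latency of at least 1 so that the order in which modules are visited within a model cycle does not matter.

```python
NO_MESSAGE = None  # stand-in for the special "NoMessage" value


class Port:
    """A message written at model cycle t becomes readable at cycle t + latency."""

    def __init__(self, latency):
        self.latency = latency
        self.in_flight = {}  # delivery cycle -> message

    def write(self, now, message):
        self.in_flight[now + self.latency] = message

    def read(self, now):
        # Nothing scheduled for this cycle means no activity.
        return self.in_flight.pop(now, NO_MESSAGE)


def fetch(now, ports):
    # Hypothetical module: emits one instruction token per model cycle.
    ports["fet_to_dec"].write(now, f"inst{now}")


def decode(now, ports, log):
    # Hypothetical module: records what arrives on its input port.
    log.append((now, ports["fet_to_dec"].read(now)))


def simulate(num_cycles):
    ports = {"fet_to_dec": Port(latency=1)}
    log = []
    for now in range(num_cycles):  # one model cycle at a time
        fetch(now, ports)          # modules are visited sequentially
        decode(now, ports, log)
    return log


print(simulate(3))  # [(0, None), (1, 'inst0'), (2, 'inst1')]
```

The sequential loop over modules is the part an FPGA can replace with parallel hardware, provided something still keeps the modules agreed on model time.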

8

Barrier Synchronization

Controller tracks current Model CC

Tells all modules “begin”

Modules copy input, compute, write output, say “done”

When all are done, increment cycle count and repeat

More fine-grained parallelism than parallel software models

FPGA-to-Model Ratio: Dynamic worst case

But what about clock rate?

[Figure: Controller holding curCC, connected to modules FET, DEC, EXE, MEM, WB]
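Why barrier synchronization gives a "dynamic worst case" FPGA-to-Model Ratio can be seen in a small step simulation. This is a sketch under assumptions: `run_barrier` and the per-module cost table are invented for illustration, and each entry `costs[m][c]` is taken to be the number of FPGA cycles module `m` needs to simulate model cycle `c`.

```python
def run_barrier(costs):
    """Count FPGA cycles used when every module waits for the slowest one.

    costs[m][c] is the (assumed) number of FPGA cycles module m needs
    for model cycle c.
    """
    num_model_cycles = len(costs[0])
    fpga_cycles = 0
    for cur_cc in range(num_model_cycles):
        # Controller says "begin": each module starts its work for cur_cc.
        remaining = [module[cur_cc] for module in costs]
        # Controller waits until every module has said "done".
        while any(r > 0 for r in remaining):
            remaining = [max(r - 1, 0) for r in remaining]
            fpga_cycles += 1
    return fpga_cycles


# Two hypothetical modules whose slow cycles alternate.
costs = [[1, 3, 1, 3],
         [3, 1, 3, 1]]
total = run_barrier(costs)
print(total, total / 4)  # 12 FPGA cycles, FMR of 3.0
```

Each module averages 2 FPGA cycles per model cycle in this example, yet the barrier charges 3 for every model cycle because the slowest module of each cycle sets the pace.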

9

The problem with Barrier Sync

The controller becomes the critical path with a large number of modules

Massive fan-in/fan-out from the controller

Experiment: Linear topology of modules

Could pipeline or cluster, but we can do better

[Plot: Clock Frequency (MHz), 0 to 140, versus Number of Modules, 25 to 100]

10

A-Ports: Asim Ports on an FPGA



Scalable: Distributed control, no combinational paths, no counters

A Port with latency n starts with n NoMessages in it

Each module may proceed in a “dataflow” manner

Start a cycle whenever all inputs are available

Compute for any number of FPGA cycles

Stall if output ports are full

A module can derive the current model cycle by counting

Adjacent modules may not be on the same model cycle…
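The rules on this slide can be sketched in Python as a bounded FIFO plus a firing condition. This is an illustrative model only, not the Bluespec implementation: the class names `APort` and `Module`, the `capacity` parameter, and the requirement that capacity exceed latency are assumptions of this sketch, and the "compute for any number of FPGA cycles" step is collapsed into a single function call.

```python
from collections import deque

NO_MESSAGE = None  # stand-in for the special NoMessage value


class APort:
    """Bounded FIFO that starts with `latency` NoMessage entries."""

    def __init__(self, latency, capacity):
        if capacity <= latency:
            raise ValueError("capacity must exceed latency so the producer can enqueue")
        self.latency = latency
        self.capacity = capacity
        self.fifo = deque([NO_MESSAGE] * latency)

    def can_deq(self):
        return len(self.fifo) > 0

    def can_enq(self):
        return len(self.fifo) < self.capacity


class Module:
    def __init__(self, inputs, outputs, compute):
        self.inputs, self.outputs, self.compute = inputs, outputs, compute
        self.model_cycle = 0  # derived purely by counting completed cycles

    def try_cycle(self):
        """Simulate one model cycle if possible; return False when stalled."""
        if not all(p.can_deq() for p in self.inputs):
            return False  # an input has not arrived yet
        if not all(p.can_enq() for p in self.outputs):
            return False  # an output port is full
        messages = [p.fifo.popleft() for p in self.inputs]
        results = self.compute(self.model_cycle, messages)
        for port, result in zip(self.outputs, results):
            port.fifo.append(result)
        self.model_cycle += 1
        return True


port = APort(latency=1, capacity=4)
received = []


def sink_compute(cycle, messages):
    received.append(messages[0])
    return []


producer = Module([], [port], lambda cycle, messages: [cycle])
consumer = Module([port], [], sink_compute)

while producer.try_cycle():  # producer runs ahead until the port is full
    pass
while consumer.try_cycle():  # consumer drains everything that is buffered
    pass

print(producer.model_cycle, consumer.model_cycle, received)
# prints: 3 4 [None, 0, 1, 2]
```

No controller appears anywhere: each module decides locally whether it may proceed, and the two modules end up on different model cycles (3 and 4) without losing or reordering any message.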

11

Modules can “slip” in model time

Observation: when a Port of latency n has n messages in it, Producer and Consumer are on the same model cycle

Producers can run ahead, prebuffering data

Consumers can run ahead, draining data

Still works with backwards paths

With proper buffering we can get the average number of FPGA cycles per model cycle

Much better than worst case a la Barrier
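Both claims on this slide reduce to simple arithmetic, sketched below in Python. The observation follows from counting: a port of latency n starts with n NoMessages, gains one message per producer cycle and loses one per consumer cycle, so its occupancy is n plus the producer's lead. The function names and the cost table are hypothetical, and `decoupled_lower_bound` is an idealized bound that assumes buffers never limit progress; it is not a measurement.

```python
def slip(occupancy, latency):
    """Model cycles by which the producer leads the consumer.

    Zero means both are on the same model cycle; negative means the
    consumer has run ahead.
    """
    return occupancy - latency


def barrier_fpga_cycles(costs):
    """Barrier sync: every model cycle costs as much as its slowest module."""
    return sum(max(step) for step in zip(*costs))


def decoupled_lower_bound(costs):
    """Idealized A-Ports bound: no module can finish faster than its own total work."""
    return max(sum(module) for module in costs)


# costs[m][c]: assumed FPGA cycles module m needs for model cycle c.
costs = [[1, 3, 1, 3],
         [3, 1, 3, 1]]

print(slip(occupancy=2, latency=2))   # 0: same model cycle
print(slip(occupancy=4, latency=2))   # 2: producer has prebuffered two cycles
print(barrier_fpga_cycles(costs))     # 12
print(decoupled_lower_bound(costs))   # 8
```

In this made-up example the barrier pays the per-cycle maximum (3 FPGA cycles per model cycle), while sufficiently buffered ports could approach each module's average of 2.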



12

Example: MIPS R10K-like Processor

4-Way Superscalar, Out-of-order Issue

13

Results: OOO Simulator

[Bar chart: Out-Of-Order Simulator Speedup (y-axis 0 to 1.4) for benchmarks median, multiply, qsort, towers, vvadd, and average; series: Barrier Sync, A-Ports Default Buffers, A-Ports Optimal Buffers]

14

Takeaways

Performance Modeling on FPGAs shows great potential

Cycle-accurate simulation in MHz vs kHz

A-Ports:

Distributed, efficient tracking of time that scales

Manages dynamic “slip” in model time

Dynamic average case instead of worst case

In paper: a technique to resynchronize modules to the same model clock cycle

Underway: Effort to model realistic multicore systems

Future Work: Combine with Chung-style virtualization [FPGA 2008] to eliminate pipeline stalls
