A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs



Post on 02-Feb-2016


Page 1: A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs

Michael Pellauer†, Muralidaran Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡

†MIT Computer Science and AI Lab, Computation Structures Group
{pellauer, vmurali, arvind}@csail.mit.edu

‡Intel VSSAD Group
{michael.adler, joel.emer}@intel.com

Page 2: Introduction

The modern circuit design flow:

Specify System Requirements -> Explore Architecture Alternatives -> Write Circuit RTL -> Verify -> Physical Manufacture

FPGAs used here: Alternative to ASIC tapeout
FPGAs used here: Create prototype for verification

Interest in using FPGAs for Performance Models
  HAsim: Re-implement Intel Asim simulator on FPGA (this talk)
  Other Projects: Liberty, UT-FAST
  Performance modeling efforts in RAMP

Page 3: Performance Models

Created early in the design process
  Drive architectural exploration, feasibility analysis
  Parameterization, ease of change

Shows how many clock cycles an operation takes
  Does not show what clock cycle time is

Homebrewed C or SystemC: synchronous simulation

The software performance modeling crisis:
  Development Time
  Simulation speed
  Accuracy

Model detail            Frequency (order of magnitude)
Low-detail models       100 kHz
Medium-detail models    10 kHz
High-detail models      1 kHz

FPGAs can help!
  Performance models have a high degree of parallelism within a model clock cycle
  It’s all modeling gates
  But many interesting circuits have no good implementations on LUT-based FPGAs
    CAMs, many-ported register files, nested MUXes, etc.

The solution:
  Configure FPGA into circuit simulator
  Virtualize the FPGA clock

Page 4: Performance Model on FPGA

Implemented using Bluespec SystemVerilog
Xilinx Virtex-II Pro 70 using Xilinx ISE 8.1i
Clock speed != simulation speed
  Uses 1 execution unit to simulate 4 parallel execution units
  Sequentially searches BlockRAM to simulate parallel CAM
  Ends up taking about 15.6 FPGA cycles per model cycle
  Result: 95 / 15.6 ≈ 6 MHz simulation rate

4-way Superscalar, Out of Order
FPGA Slices                                   22,873 (69%)
Block RAMs                                    25 (7%)
Clock Speed                                   95 MHz
Simulation Rate                               6 MHz
Avg. FPGA cycle to Model cycle Ratio (FMR)    15.6
Average Simulator IPS                         4.7 MIPS
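The relationship between FPGA clock speed, FMR, and simulation rate on this slide is a single division. The short Python sketch below works through it with the slide's numbers; the function name simulation_rate_mhz is illustrative and not part of HAsim.

```python
def simulation_rate_mhz(fpga_clock_mhz: float, fmr: float) -> float:
    """Model cycles simulated per microsecond, given the FPGA clock and the FMR."""
    if fmr <= 0:
        raise ValueError("FMR must be positive")
    return fpga_clock_mhz / fmr


# Figures from the slide: 95 MHz FPGA clock, average FMR of 15.6.
ooo_rate = simulation_rate_mhz(95.0, 15.6)
print(f"{ooo_rate:.2f} MHz")  # 6.09 MHz, which the slide rounds to 6 MHz
```

The same division applies to any module or model: a faster FPGA clock only helps if the FMR does not grow with it.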

Page 5: Example Target

Register File with 2 Read Ports, 2 Write Ports
Reads take zero clock cycles
Direct configuration onto FPGA: 9242 slices, 104 MHz

[Figure: 2R/2W Register File with ports rd_addr1, rd_val1, rd_addr2, rd_val2, wr_addr1, wr_val1, wr_addr2, wr_val2]

            CC 1    CC 2
rd_addr1    A       C
rd_val1     V(A)    V(C)
rd_addr2    B       D
rd_val2     V(B)    V(D)

Page 6: Example as Performance Model

Simulate the circuit using synchronous BlockRAM
First do reads, then serialize writes
Only update model time when all requests are serviced
Results: 94 slices, 1 BlockRAM, 224 MHz
Simulation rate is 224 / 3 ≈ 75 MHz (FPGA-to-Model Ratio of 3)

Model CC:   1
FPGA CC:    1       2       3
rd_addr1    A       A       A
rd_val1     -       V(A)    V(A)
rd_addr2    B       B       B
rd_val2     -       -       V(B)

Separated model clock from FPGA clock
How do we compose these modules into a correct, efficient system?
Let’s examine how a software performance model does it
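The time-multiplexed register file can be pictured with the behavioral Python sketch below. It is an illustration only, not the Bluespec implementation: the class name RegFileModel, the register count, and the schedule (both reads in one FPGA cycle, then one write per FPGA cycle) are assumptions, and the real design's schedule may differ.

```python
class RegFileModel:
    """2-read/2-write register file simulated over several FPGA cycles.

    Assumed schedule (hypothetical, for illustration):
      1 FPGA cycle for the reads, then 1 FPGA cycle per write.
    """

    def __init__(self, num_regs=32):
        self.mem = [0] * num_regs
        self.model_cycle = 0
        self.fpga_cycle = 0

    def run_model_cycle(self, rd_addrs, writes):
        """Simulate one model cycle and return the values read."""
        # Reads go first, so they see the state before this cycle's writes.
        self.fpga_cycle += 1
        values = [self.mem[addr] for addr in rd_addrs]
        # Writes are serialized, one per FPGA cycle.
        for addr, value in writes:
            self.fpga_cycle += 1
            self.mem[addr] = value
        # Model time advances only once every request has been serviced.
        self.model_cycle += 1
        return values


rf = RegFileModel()
print(rf.run_model_cycle([1, 2], [(1, 10), (2, 20)]))  # [0, 0]
print(rf.model_cycle, rf.fpga_cycle)                   # 1 3
```

Under this assumed schedule a model cycle with two writes costs three FPGA cycles, while a cycle with no writes costs one, so the FPGA-to-Model Ratio varies with the work requested.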

Page 7: Time in Software Asim Model

[Figure: pipeline of modules FET, DEC, EXE, MEM, WB connected by Ports labeled 1 or 2]

Software has no inherent clock
Model time is tracked via Asim “Ports”
  All communication goes through Ports
  Ports have a model time latency for messages

Execution model: for each module in system
  Simulate a model cycle for that module
  Reads all input Ports, writes all output Ports
  Can write special “NoMessage” value to indicate no activity

FPGA: Can simulate in parallel instead of sequentially
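The execution model above can be sketched in a few lines of Python. This is a minimal illustration, not Asim's actual code: the Port class, the simulate function, and the choice to model latency by pre-filling a queue with NoMessage values are all assumptions made for the example.

```python
from collections import deque

NO_MESSAGE = None  # stand-in for the "NoMessage" value


class Port:
    """A channel whose messages arrive `latency` model cycles after being written."""

    def __init__(self, latency):
        # Assumption: pre-filling with NoMessages is one simple way to model latency.
        self.queue = deque([NO_MESSAGE] * latency)

    def write(self, message):
        self.queue.append(message)

    def read(self):
        return self.queue.popleft()


def simulate(modules, num_cycles):
    """Sequential loop: each module simulates one model cycle per pass."""
    for cycle in range(num_cycles):
        for module in modules:
            module(cycle)


# Hypothetical two-module system joined by a latency-2 Port.
port = Port(latency=2)
seen = []


def producer(cycle):
    port.write(cycle)  # writes its output Port every model cycle


def consumer(cycle):
    seen.append((cycle, port.read()))  # reads its input Port every model cycle


simulate([producer, consumer], num_cycles=4)
print(seen)  # [(0, None), (1, None), (2, 0), (3, 1)]
```

The message written in model cycle 0 is read in model cycle 2, which is the Port's latency. The inner loop visits modules one after another; on an FPGA those module steps can run concurrently.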

Page 8: Barrier Synchronization

Controller tracks current Model CC
  Tells all modules “begin”
  Modules copy input, compute, write output, say “done”
  When all are done, increment cycle count and repeat

More fine-grained parallelism than parallel software models
FPGA-to-Model Ratio: Dynamic worst case

But what about clock rate?

[Figure: modules FET, DEC, EXE, MEM, WB, each connected to a central Controller holding curCC]
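The controller loop can be sketched as follows to show why barrier synchronization gives a dynamic worst-case FPGA-to-Model Ratio. The function name run_barrier and the input format (a table of per-module compute times) are assumptions for illustration; handshake overhead is ignored.

```python
def run_barrier(latencies_per_cycle):
    """Simulate barrier synchronization.

    latencies_per_cycle[c][m] is the number of FPGA cycles module m needs
    to simulate model cycle c. Returns (model cycles, FPGA cycles).
    """
    cur_cc = 0
    fpga_cycles = 0
    for cycle_latencies in latencies_per_cycle:
        # Controller says "begin": every module starts the same model cycle.
        remaining = list(cycle_latencies)
        # Controller waits until every module has said "done".
        while any(r > 0 for r in remaining):
            remaining = [max(r - 1, 0) for r in remaining]
            fpga_cycles += 1
        cur_cc += 1
    return cur_cc, fpga_cycles


# Two modules that alternate between fast (1) and slow (3) model cycles.
model_cycles, fpga_cycles = run_barrier([[1, 3], [3, 1]])
print(model_cycles, fpga_cycles)   # 2 6
print(fpga_cycles / model_cycles)  # 3.0
```

Each module averages 2 FPGA cycles per model cycle in this example, yet the system pays 3, because every model cycle waits for whichever module is slowest at that moment.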

Page 9: The problem with Barrier Sync

Massive fan-in/fan-out from controller
  Quickly becomes critical path with a large number of modules

Experiment: Linear topology of modules

[Figure: Clock Frequency (MHz), axis 0 to 140, versus Number of Modules, axis 25 to 100]

Could pipeline or cluster, but we can do better

Page 10: A-Ports: Asim Ports on an FPGA

[Figure: pipeline of modules FET, DEC, EXE, MEM, WB connected by A-Ports labeled 1 or 2]

Scalable: Distributed control, no combinational paths, no counters
Port with latency n starts with n NoMessages in it
Each module may proceed in a “dataflow” manner
  Start a cycle whenever all inputs are available
  Compute for any number of FPGA cycles
  Stall if output ports are full

A module can derive the current model cycle by counting
Adjacent modules may not be on the same model cycle…
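The dataflow firing rule can be illustrated with a small software simulation, stepped one FPGA cycle at a time. This is a sketch under stated assumptions, not the hardware design: the APort and Module classes, the capacity parameter, and the two-module example are hypothetical, and modules are stepped in list order, so a message written early in an FPGA cycle is visible to modules stepped later in that same cycle.

```python
from collections import deque

NO_MESSAGE = None  # stand-in for the "NoMessage" value


class APort:
    """Bounded FIFO; a port of latency n starts holding n NoMessages."""

    def __init__(self, latency, capacity):
        if capacity < max(latency, 1):
            raise ValueError("capacity must cover the initial NoMessages")
        self.capacity = capacity
        self.fifo = deque([NO_MESSAGE] * latency)


class Module:
    def __init__(self, inputs, outputs, work, fpga_cycles):
        self.inputs, self.outputs = inputs, outputs
        self.work = work                # (model_cycle, input messages) -> output messages
        self.fpga_cycles = fpga_cycles  # model_cycle -> FPGA cycles of compute
        self.model_cycle = 0            # local count of completed model cycles
        self.busy = 0                   # FPGA cycles left in the current model cycle
        self.pending = None             # computed outputs not yet written

    def step(self):
        """Advance this module by one FPGA cycle."""
        if self.pending is None:
            # Start a model cycle only when every input port holds a message.
            if not all(port.fifo for port in self.inputs):
                return
            messages = [port.fifo.popleft() for port in self.inputs]
            self.pending = self.work(self.model_cycle, messages)
            self.busy = self.fpga_cycles(self.model_cycle)
        if self.busy > 0:
            self.busy -= 1
        if self.busy == 0:
            # Stall (retry next FPGA cycle) while any output port is full.
            if all(len(port.fifo) < port.capacity for port in self.outputs):
                for port, message in zip(self.outputs, self.pending):
                    port.fifo.append(message)
                self.pending = None
                self.model_cycle += 1


# Hypothetical system: a 1-cycle producer feeding a 2-cycle consumer.
link = APort(latency=1, capacity=2)
received = []
producer = Module([], [link], lambda c, msgs: [c], lambda c: 1)
consumer = Module([link], [], lambda c, msgs: received.append((c, msgs[0])) or [], lambda c: 2)

for _ in range(6):  # six FPGA cycles
    for module in (producer, consumer):
        module.step()

print(producer.model_cycle, consumer.model_cycle)  # 4 3
print(received)  # [(0, None), (1, 0), (2, 1)]
```

No controller appears anywhere: each module decides locally when to start, and the only cycle count is the one each module keeps for itself. After six FPGA cycles the producer is one model cycle ahead of the consumer, and the port holds one message more than its latency.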

Page 11: Modules can “slip” in model time

Observation: when a port with latency n has n messages in it
  Producer and Consumer are on the same model cycle

Producers can run ahead, prebuffering data
Consumers can run ahead, draining data
Still works with backwards paths
With proper buffering we can get the average number of FPGA cycles per model cycle
  Much better than worst case a la Barrier

[Figure: pipeline of modules FET, DEC, EXE, MEM, WB connected by A-Ports labeled 1 or 2]
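The average-case versus worst-case claim can be made concrete with a little arithmetic. The sketch below compares the barrier cost with an idealized bound for slipping modules; it assumes buffers deep enough that no module ever stalls and ignores waiting on data from neighbors, so the second figure is a lower bound rather than a measured result. Both function names and the latency table are hypothetical.

```python
def barrier_cost(latencies):
    """FPGA cycles with a barrier: each model cycle costs its slowest module."""
    return sum(max(row) for row in latencies)


def ideal_slip_cost(latencies):
    """Idealized lower bound with slip: the busiest module's total work."""
    num_modules = len(latencies[0])
    return max(sum(row[m] for row in latencies) for m in range(num_modules))


# latencies[c][m]: FPGA cycles module m needs for model cycle c (made-up values).
latencies = [[1, 3], [3, 1], [1, 3], [3, 1]]
model_cycles = len(latencies)

print(barrier_cost(latencies) / model_cycles)     # 3.0
print(ideal_slip_cost(latencies) / model_cycles)  # 2.0
```

With these made-up latencies each module averages 2 FPGA cycles per model cycle. Slip lets a module that finishes early start its next model cycle instead of idling at a barrier, which is how the system can approach that average.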

Page 12: Example: MIPS R10K-like Processor

4-Way Superscalar, Out-of-order Issue

Page 13: Results: OOO Simulator

[Figure: Out-Of-Order Simulator Speedup, axis 0 to 1.4, for benchmarks median, multiply, qsort, towers, vvadd, and average; series: Barrier Sync, A-Ports Default Buffers, A-Ports Optimal Buffers]

Page 14: Takeaways

Performance Modeling on FPGAs shows great potential
  Cycle-accurate simulation in MHz vs. kHz

A-Ports:
  Distributed, efficient tracking of time that scales
  Manages dynamic “slip” in model time
  Dynamic average case instead of worst case

In paper: a technique to resynchronize modules to the same model clock cycle
Underway: Effort to model realistic multicore systems
Future Work: Combine with Chung-style virtualization [FPGA 2008] to eliminate pipeline stalls