Transcript
Page 1: The Open Source ProtoFlex Simulator

Computer Architecture Lab at

RAMP Retreat, June 2009

The Open SourceProtoFlex Simulator

Eric S. Chung, Michael K. Papamichael,James C. Hoe, Babak Falsafi, Ken Mai

Page 2: The Open Source ProtoFlex Simulator

The ProtoFlex Simulator

• History

– Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs

• Key Features

– Functional simulator for N-way UltraSPARC III server (~50-90 MIPS)

– Using hybrid simulation, runs real server apps + Solaris OS

– Employs multithreading to virtualize # CPUs per FPGA core

2Hybrid Simulation Virtualization

Page 3: The Open Source ProtoFlex Simulator

Open Sourcing ProtoFlex

• Why open source?

– Demonstration of FPGAs as viable architecture research vehicle

– Facilitate adoption of hybrid simulation & host multithreading

– Encourage building on top of our work

• What are we releasing?

– Bluespec source HDL, Verilog and pre-generated netlistsfor SPARCV9 CPU model + interfaces

– XUPV5 Reference Design for EDK 10.1

– Virtutech Simics plug-ins for hybrid simulation

– Top-level SW controller, user command-line interface

– Documentation through online wiki 3

Page 4: The Open Source ProtoFlex Simulator

Outline

• Motivation

• The ProtoFlex Simulator

– High level components

– UltraSPARC core model

• Using ProtoFlex

• XUPV5 Reference Design

• Distribution Details4

Page 5: The Open Source ProtoFlex Simulator

The ProtoFlex Simulator

• User perceives familiar SW-like UltraSPARC III simulator

– Software user interface similar to Simics

– Applications load directly from Simics checkpoints

– Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring

5

Page 6: The Open Source ProtoFlex Simulator

The ProtoFlex Simulator

• User perceives familiar SW-like UltraSPARC III simulator

– Software user interface similar to Simics

– Applications load directly from Simics checkpoints

– Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring

6

FPGA Linux PC

PFMON

SIMICS

(I/O)

FPGA

Core

Main Memory

PowerPC

(or uBlaze)

Ethernet User

Interfac

e

Page 7: The Open Source ProtoFlex Simulator

Our UltraSPARC III Core Model

• ISA Specifications

– 64-bit SPARCV9 ISA + US III extensions

– 8 register windows, 4 global register files

– 512-entry D-TLB, 128 I-TLB

• Implementation

– 14-stage, multi-threaded pipeline, switch context on each cycle

– On Virtex-5, XST~148MHz,Placed & routed @ 100MHz

– Parameterized non-blocking caches

– FP + rare MMU instructions areSW-emulated by nearby uBlaze

– 100% mirrors Virtutech Simics model 7

I-TLB Stage 1

I-TLB Stage 2

64-bit ALU Stage 1

Context Scheduler

I-Fetch Address Generate

I-Fetch Tag Check

US III Decoder

64-bit ALU Stage 2

D-TLB Stage 1

D-TLB Stage 2

D-TLB Stage 3

D-Cache Address Generate

D-Cache Tag Check

Multi-Cycle Instruction Unit

NonblockingI-cache(BRAM)

Writeback

Arbiter to DDR Memory

Integer RF(BRAM)

NonblockingD-cache(BRAM)

Page 8: The Open Source ProtoFlex Simulator

Core Design Statistics

• Runs 100MHz on V5

– Synthesizes up to 148MHz using standard tools (ISE XST)

• Logic usage

– 23.5 KLUTs (11.3% LX330T)

• BRAM usage

– 120 BRAMs for 16-context configuration (37% LX330T)

• Future optimizations

– Paging structures to SRAM or DRAM can reduce BRAM by significant amount

– Will release in future updates 8

Page 9: The Open Source ProtoFlex Simulator

Outline

• Motivation

• The ProtoFlex Simulator

• Using ProtoFlex

• XUPv5 Reference Design

• Distribution

9

Page 10: The Open Source ProtoFlex Simulator

Using ProtoFlex

• Add passive monitors

– Counters, histograms

– Roll your own

10

I-TLB Stage 1

I-TLB Stage 2

64-bit ALU Stage 1

Context Scheduler

I-Fetch Address Generate

I-Fetch Tag Check

US III Decoder

64-bit ALU Stage 2

D-TLB Stage 1

D-TLB Stage 2

D-TLB Stage 3

D-Cache Address Generate

D-Cache Tag Check

Multi-Cycle Instruction Unit

NonblockingI-cache(BRAM)

Writeback

Arbiter to DDR

Memory

Integer RF(BRAM)

NonblockingD-cache(BRAM)

• Trace-based simulation

– Collect dynamic traces

– Feed traces to functional-first timing model

• Sampled Program Monitoring

– Use micro-blaze (or PPC) to monitor core/memory state

– Unintrusive profiling w/o changes to target SW

CountersCountersCounters

Histogram

TrackerHistogram

TrackerHistogram

Trackers

Timing

Model

FPGA

Hard/Soft Core

(PowerPC or

MicroBlaze)

Page 11: The Open Source ProtoFlex Simulator

Applications of ProtoFlex

• Examples

– Functional-first CMP cache coherency model for first-order timing models and functional warming *TRETS’09+

– Real-time stack trace profiling

– CMP interconnect model (in progress)

– Realistic CPU traffic generators (in progress)

• … running real 16-CPU server workloads

– Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K

11

Piranha CMP Cache

(First-Order Timing

Model)

Statistics + Warmed

Coherency &

Tag States

Page 12: The Open Source ProtoFlex Simulator

What does the RTL look like?

• We use Bluespec System Verilog (high-level, synthesizable HDL)

– 4-8 weeks learning curve for normal HDL users

– Once learned, easier to read/modify than conventional RTL

– Requires BSV compiler (free for academics)

– Paper in MEMOCODE’09 describes BSV coding/validation of core

• Sample code:

12

rule split_ALU_pipeline (True);…p1 = piperegs[DECODE];piperegs[ALU1] <= doALUStage1(p1, alu_ifc);p2 = piperegs[ALU1];piperegs[ALU2] <= doALUStage2(p2, alu_ifc);…

endrule

rule merged_ALU_pipeline (True);…p1 = piperegs[DECODE];p_tmp1 = doALUStage1(p1, alu_ifc);p_tmp2 = doALUStage2(p_tmp1, alu_ifc);piperegs[ALU] <= p_tmp2;…

endrule

2-stage ALU 1-stage ALU

Page 13: The Open Source ProtoFlex Simulator

Other Simulator Features

• Changing Core Parameters

– Number of CPU contexts

– Cache sizes

– Merge/split pipeline stages

– Enable/disable modules for profiling & debugging

– Clock frequency (tested @ 10 MHz – 100 MHz)

– Set optimal LUTRAM size (16 = V2P, 64 = V5)

– Choose LUTRAMs or BRAMs for any CPU state

• System Parameters

– UDP or TCP/IP (for PFMON-to-FPGA communication)

– XUPv5, BEE2

13

Page 14: The Open Source ProtoFlex Simulator

Outline

• Motivation

• The ProtoFlex Simulator

• Using ProtoFlex

• XUPv5 Reference Design

• Distribution

14

Page 15: The Open Source ProtoFlex Simulator

Platform Release: XUPV5

• Why XUPv5?

– Inexpensive (~$750), easily accessible

– Standard tool flows (EDK, ISE)

– Reference design portable to other platforms – just drop in our ‘pcores’

• Supporting other platforms

– Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)

– Plan to release with future updates

15

Page 16: The Open Source ProtoFlex Simulator

Required Equipment

16

FPGA BoardLinux PC

Ethernet

BlueSPARCPFMON+ Simics

Page 17: The Open Source ProtoFlex Simulator

Required Equipment

17

FPGA BoardLinux PC

Ethernet

BlueSPARCPFMON+ Simics

Page 18: The Open Source ProtoFlex Simulator

18

XUPv5 Overview

• Virtex-5 LX110T

• DDR2 Memory

– up to 2GB

• 1ΜΒ SRAM

• 1Gbps Ethernet

• 3Gbps SATA

• Serial Port

Page 19: The Open Source ProtoFlex Simulator

SRAMController

Multi-Port Memory Controller

Reference Design Block Diagram

19

PLB

BlueSPARC MicroBlaze

Ethernet BRAM Serial PortBRAMBRAM

XUPv5

LX110T

DRAM SRAM

Page 20: The Open Source ProtoFlex Simulator

SRAMController

Multi-Port Memory Controller

BlueSPARC

20

PLB

MicroBlaze

Ethernet BRAM Serial PortBRAMBRAM

XUPv5

LX110T

DRAM SRAM

• EDK IP core

– connects to PLB & NPI

• Runs @ 100MHz

• 4 CPU contexts

• 64KB I&D L1 caches

BlueSPARC

Page 21: The Open Source ProtoFlex Simulator

SRAMController

Multi-Port Memory Controller

Reference Design Block Diagram

21

PLB

BlueSPARC MicroBlaze

Ethernet BRAM Serial PortBRAMBRAM

XUPv5

LX110T

DRAM SRAM

Page 22: The Open Source ProtoFlex Simulator

SRAMController

Multi-Port Memory Controller

BlueSPARC

22

PLB

BlueSPARC MicroBlaze

Ethernet Serial Port

XUPv5

LX110T

DRAM SRAM

• 81% utilization

– Core 51% (76 out of 148)

– Rest 30% (45 out of 148)

BRAMBRAMBRAM

Page 23: The Open Source ProtoFlex Simulator

SRAMController

Multi-Port Memory Controller

Reference Design Block Diagram

23

PLB

BlueSPARC MicroBlaze

Ethernet BRAM Serial PortBRAMBRAM

XUPv5

LX110T

DRAM SRAM

Page 24: The Open Source ProtoFlex Simulator

SRAMController

Multi-Port Memory Controller

Ethernet

24

PLB

BlueSPARC MicroBlaze

BRAM Serial PortBRAMBRAM

XUPv5

LX110T

DRAM SRAM

• 4 MB/sec bandwidth

• 350 usec RTT latency

• Socket Abstraction

– using LWIP RAW interface

Ethernet

Page 25: The Open Source ProtoFlex Simulator

SRAMController

Multi-Port Memory Controller

Reference Design Block Diagram

25

PLB

BlueSPARC MicroBlaze

Ethernet BRAM Serial PortBRAMBRAM

XUPv5

LX110T

DRAM SRAM

Page 26: The Open Source ProtoFlex Simulator

SRAMController

DDR2 Memory Controller

26

PLB

BlueSPARC MicroBlaze

Ethernet BRAM Serial PortBRAMBRAM

XUPv5

LX110T

DRAM SRAM

Multi-Port Memory Controller

• 1.5GB/s peak BW

• 115ns latency

• Multiple ports/interfaces

Page 27: The Open Source ProtoFlex Simulator

Required Equipment

27

FPGA BoardLinux PC

Ethernet

BlueSPARCPFMON+ Simics

Page 28: The Open Source ProtoFlex Simulator

Required Equipment

28

FPGA BoardLinux PC

Ethernet

BlueSPARCPFMON+ Simics

Page 29: The Open Source ProtoFlex Simulator

29

Linux PC

• Software requirements

– SuSE Linux 10.1

– CAD tools + licenses (Bluespec compiler, Xilinx ISE/EDK)

– Simics 3.0.22

– Hybrid simulation plug-in modules

– ProtoFlex MONitor tool (PFMON)

Page 30: The Open Source ProtoFlex Simulator

30

Linux PC

• Runs PFMON (ProtoFlex MONitor)

– Orchestrates communication between Simics & BlueSPARC

– Provides CLI interface to simulator (like Simics Console)

• Runs Simics

– Handles I/O, FPGA Core and memory initialization

BlueSPARCPFMON

Linux PC

Simics

Page 31: The Open Source ProtoFlex Simulator

31

From RTL to Running System

• Bluespec Verilog

– Bluespec compiler

– ~30 minutes

• Verilog Bitstream

– Xilinx EDK

– ~ 3 hours

• BitstreamWorking System

– Stream mem. image over ethernet

– ~ 5 minutes (for 512MB image)

Bluespec code

Verilog code

Bitstream

1

2

3

1

2

WorkingSystem

3

Page 32: The Open Source ProtoFlex Simulator

XUPv5 Caveats

• Limitations

– Due to BRAM limits, only 4 CPU contexts (compared to 16 on BEE2’s V2P70)

– Slow PC-to-FPGA latency via LwIP/Ethernet (1 key/sec in terminal)

– Limited DDR2 RAM capacity (up to 2GB)

• However…

– Our XUPv5 design still has much room for improvement

– Useful for familiarizing with ProtoFlex tools

– Can still run multithreaded workloads + perform monitoring

– Easily portable to more powerful platforms

32

Page 33: The Open Source ProtoFlex Simulator

Outline

• Motivation

• The ProtoFlex Simulator

• Using ProtoFlex

• XUPv5 Reference Design

• Distribution

33

Page 34: The Open Source ProtoFlex Simulator

Distribution Details

• Release dates

– All source code will be available in 1 week (June)

– Send email to [email protected] for more info

• Licensing

– GNU GPL

• Support

– Documentation: www.ece.cmu.edu/~protoflex

– Support mailing list/forum will be available soon

34

Page 35: The Open Source ProtoFlex Simulator

Thank you!

• Acknowledgments

– Funding for this work provided by: NSF CCF-0811702, NSF CNS-0509356, C2S2 Marco Center, and Sun Microsystems

– Paul Hartke & Xilinx for FPGA donations & email support

• Topics & References

– Hybrid Simulation and FPGA Core MultithreadingA Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA’08]

– Functional-First, First-Order Timing Model for a CMP cacheProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS’09]

– FPGA Core Design & Validation Using Flight Data RecorderImplementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation [MEMOCODE’09]

35


Top Related