Multithreaded SPARC v8 Functional Model for RAMP Gold

Zhangxi Tan, UC Berkeley

RAMP Retreat, Jan 17, 2008

Motivation

• Traditional RISC optimizations are far less appealing on soft-core processors on FPGAs
  – Mapped to expensive wide bus muxes; becomes area/frequency bottleneck on fabric
    • Bypassing network
    • Delayed branch
  – Less efficient when dealing with memory access latency
    • Small cache size & shared memory controller make things even worse
  – Poor core count on single FPGA! (e.g. V5 LX110T)
    • <16 32-bit Sparc V8 Leon integer pipelines

Approach

• Need a new functional model, which is able to
  – Support a large number of emulated cores (~1k) per BEE3 board
  – Accelerate aggregate emulation performance (MIPS/chip)
    • Including optimizations to tolerate memory & I/O latency
  – Run full OS and support OS development
    • TLB/exception support
    • Memory mapped I/O + IRQ support
  – Interface with timing model
• Virtualizing Sparc V8 RTL with fine-grain multithreading
  – High density design (256/512 emulated CPUs per chip)
    • 8 cores in 2 clusters per FPGA (V5 LX110T); each core has 32 or 64 threads (configurable)
    • 4 cores in one cluster share one BEE3 mem controller
  – Start from 32-bit ISA, eventually support 64-bit ISA (v9)
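The density figures on this slide follow from simple multiplication. The sketch below reproduces them; the core and thread counts come from the slide, while `FPGAS_PER_BOARD = 4` is an assumption about the BEE3 board that the slide does not state.

```python
# Core and thread counts are taken from the slide.
CORES_PER_FPGA = 8          # 8 cores in 2 clusters per V5 LX110T
THREAD_CONFIGS = (32, 64)   # configurable threads per core

# Assumption (not on the slide): number of FPGAs on one BEE3 board.
FPGAS_PER_BOARD = 4


def emulated_cpus_per_chip(threads_per_core):
    """Each hardware thread emulates one target CPU."""
    return CORES_PER_FPGA * threads_per_core


def emulated_cpus_per_board(threads_per_core):
    """Scale the per-chip count by the assumed number of FPGAs per board."""
    return FPGAS_PER_BOARD * emulated_cpus_per_chip(threads_per_core)


if __name__ == "__main__":
    for threads in THREAD_CONFIGS:
        print(threads,
              emulated_cpus_per_chip(threads),
              emulated_cpus_per_board(threads))
```

Under that assumption the 32-thread configuration already reaches the ~1k emulated cores per board named as the goal.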

Design philosophy 1

• Keep everything simple!
  – Build processor w/o bypassing network
    • Greatly simplifies pipeline design
    • Preliminary result shows ~28% LUT reduction + ~18% frequency improvement on Leon3 processor
  – Direct map cache/TLB
  – Simple fine-grain multithreading to fill pipeline bubbles
    • Static RR issue: T1->T2->T3->T4->T1->T2…
    • Never stall the pipeline
      – Long latency operations?
      – Tell the pipeline to REPLAY the instruction in the next rotation
  – “Microcode” for complex instructions/trap handling
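The static round-robin issue with replay can be illustrated with a small cycle-level model. This is a minimal sketch, not the RAMP Gold RTL: the `simulate` function, the instruction names, and the latency table are hypothetical, and the model assumes one issue slot per cycle with the slot owner fixed by the cycle number.

```python
def simulate(programs, latency, max_cycles=1000):
    """Model static round-robin issue with replay instead of stalls.

    programs: one list of instruction names per thread (hypothetical names).
    latency:  dict mapping an instruction name to the number of cycles until
              its result is available; missing names take 1 cycle.
    Returns a trace of (cycle, thread, instruction, outcome) tuples.
    """
    n = len(programs)
    pc = [0] * n              # next instruction index per thread
    ready_at = [None] * n     # cycle at which a pending long operation completes
    trace = []
    for cycle in range(max_cycles):
        if all(pc[t] >= len(programs[t]) for t in range(n)):
            break
        t = cycle % n         # static round-robin: the slot never moves
        if pc[t] >= len(programs[t]):
            trace.append((cycle, t, None, "idle"))
            continue
        inst = programs[t][pc[t]]
        lat = latency.get(inst, 1)
        if lat > 1:
            if ready_at[t] is None:
                ready_at[t] = cycle + lat      # first issue starts the operation
            if cycle < ready_at[t]:
                # Not ready: the pipeline keeps moving and the same
                # instruction is issued again in the thread's next slot.
                trace.append((cycle, t, inst, "replay"))
                continue
            ready_at[t] = None
        trace.append((cycle, t, inst, "retire"))
        pc[t] += 1
    return trace


if __name__ == "__main__":
    for entry in simulate([["add", "ld"], ["add"]], {"ld": 3}):
        print(entry)
```

In this model a long-latency instruction only costs slots of the thread that issued it; the other threads keep their slots, which is the point of never stalling the pipeline.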

Design philosophy 2

• Design for fabric (Targeting Virtex 5)
  – High working frequency (expect ~150 MHz)
    • Deep pipeline: 10~11 physical stages
  – Manually controlled FPGA resource mapping
    • BRAMs, LUTRAM
    • Use V5 DSPs as ALU
    • Pipelining all BRAMs and DSPs (maximize Fmax)
  – Error detection/correction for all BRAMs
    • Cache tags and register file use parity bit to detect soft errors
    • TLB entry and cache data are protected by built-in V5 ECC-BRAM

Challenges

• Thread state storage & per-thread L1 cache
  – Will BRAM/LUTRAM fit?
  – How large?
  – Where to map? LUTRAM or BRAM
• Bandwidth and RW ports requirement
  – Multithreading amplifies the requirement!
• How to make use of FPGA primitives to control total LUT usage
  – 6-input LUTs: LUT5_2, RAM64B
  – DSPs

State storage

• Main thread state (integer pipeline)
  – 3 register windows per thread (2 minimum by specification, 3 for performance)
    • 8 global + 16*3 window registers
    • Stored in BRAM in chunks of 64 registers
  – PC/nPC – LUTRAM
  – PSR (processor state register) – LUTRAM
  – WIM (window invalid mask) – LUTRAM
  – TBR (trap base register) – BRAM, packed w. 3 reg windows
  – Y (high 32-bit for mul/div) – LUTRAM

Regfile layout

Thread  BRAM Address  BRAM Content
0       0-7           Global registers g0-g7
        8             TBR
        9-15          Scratch registers for microcode mode
        16-63         3 register windows
1       64-71         Global registers g0-g7
        72            TBR
        73-79         Scratch registers for microcode mode
        80-127        3 register windows
2       …             …

• 64 threads per pipeline, 8 pipelines per chip (V5 LX110T)
• Eight 18kb blocks
• Double clocked BRAM (virtually 4 ports)
• Indexed with {thread_id, reg_addr}
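The `{thread_id, reg_addr}` indexing can be sketched as an address computation. The per-thread chunk layout (globals at 0-7, TBR at 8, scratch at 9-15, windows at 16-63) comes from the table; how the 48 window registers are arranged inside 16-63 is not given on the slide, so the arrangement below (each window owns its outs and locals, and its ins alias the outs of the next window) is an assumption, as are the function names.

```python
NWINDOWS = 3            # from the slides: 3 register windows per thread
REGS_PER_THREAD = 64    # from the slides: chunks of 64 registers
GLOBAL_BASE = 0         # g0-g7 at offsets 0-7
TBR_OFFSET = 8          # TBR shares the chunk
SCRATCH_BASE = 9        # offsets 9-15: microcode scratch registers
WINDOW_BASE = 16        # offsets 16-63: windowed registers


def window_offset(reg, cwp):
    """Offset of architectural register r0..r31 inside one thread's chunk.

    Assumed arrangement: window w owns 16 slots holding its outs (r8-r15)
    and locals (r16-r23); its ins (r24-r31) alias the outs of window w+1.
    """
    if not 0 <= reg < 32:
        raise ValueError("SPARC V8 has 32 visible integer registers")
    if not 0 <= cwp < NWINDOWS:
        raise ValueError("CWP out of range")
    if reg < 8:
        return GLOBAL_BASE + reg
    if reg < 24:
        return WINDOW_BASE + cwp * 16 + (reg - 8)
    return WINDOW_BASE + ((cwp + 1) % NWINDOWS) * 16 + (reg - 24)


def bram_address(thread_id, reg, cwp):
    """Concatenate {thread_id, reg_addr}: 6 low bits select within the chunk."""
    return thread_id * REGS_PER_THREAD + window_offset(reg, cwp)


if __name__ == "__main__":
    print(bram_address(0, 0, 0))   # g0 of thread 0
    print(bram_address(1, 8, 0))   # first window register of thread 1
```

Because the chunk size is a power of two, the multiplication is just a bit concatenation in hardware, and 8 + 1 + 7 + 48 registers fill the 64-entry chunk exactly.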

Cache & TLB

• Per thread Cache
  – Split I/D direct-map write-allocate write-back cache
    • Block size: 32 bytes (BEE3 DDR2 controller heart beat)
    • 512B total in 64-thread configuration: 256B I$, 256B D$
      – Size doubled (1KB) for 32-thread configuration
    • Non-blocking to a different thread, but blocking to the same thread
    • CPU and memory controller access cache at the same time through different ports
  – Physical tag
• Per thread TLB
  – Split I/D direct-map TLB
    • 16 entries in total: 8 for ITLB and 8 for DTLB
• Total BRAM usage per pipeline (regfile + cache/TLB + tag + misc): 30~32 blocks (18kb)
• BRAM is still the critical resource
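With 32-byte blocks, a 256B direct-mapped cache has 8 lines, so a physical address splits into a 5-bit offset, a 3-bit index, and a tag. A minimal sketch of that split follows; `split_address` is a hypothetical helper, and the cache sizes are the per-thread figures from the slide.

```python
BLOCK_BYTES = 32  # from the slide: 32-byte cache block


def split_address(paddr, cache_bytes=256):
    """Split a physical address into (tag, index, offset) for a
    direct-mapped cache of cache_bytes with 32-byte blocks.

    cache_bytes=256 matches one I$ or D$ in the 64-thread configuration;
    cache_bytes=512 matches the 32-thread configuration.
    """
    lines = cache_bytes // BLOCK_BYTES
    offset_bits = BLOCK_BYTES.bit_length() - 1   # 5 bits for 32 bytes
    index_bits = lines.bit_length() - 1          # 3 bits for 8 lines
    offset = paddr & (BLOCK_BYTES - 1)
    index = (paddr >> offset_bits) & (lines - 1)
    tag = paddr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    print(split_address(0x1234))        # 64-thread configuration
    print(split_address(0x1234, 512))   # 32-thread configuration
```

The thread id would select which per-thread cache is addressed; that selection is outside this sketch.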

DSP48E are perfect for ALU

• DSP48E is a MAC.
• Two 48-bit inputs, one 48-bit output
  – Add/subtract/logic/bypass/address calculation
  – Pattern detector (generate Z flag)
    • <10 LUTs for C, O, nothing for N
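The flag costs listed above match how the SPARC integer condition codes are defined: N is just bit 31 of the result, Z is an all-zero compare (the job the slide gives to the pattern detector), and C and V (the slide's "O") need a little extra logic on the operand and result bits. The sketch below computes the four flags for a 32-bit add. It models the flag definitions only, not the DSP48E wiring, and `add_cc` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF


def add_cc(a, b):
    """Return (result, n, z, v, c) for a 32-bit add with condition codes."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = (result >> 31) & 1                        # sign bit, no extra logic
    z = int(result == 0)                          # all-zero pattern match
    c = (full >> 32) & 1                          # carry out of bit 31
    v = (((a ^ result) & (b ^ result)) >> 31) & 1 # signed overflow
    return result, n, z, v, c


if __name__ == "__main__":
    print(add_cc(0x7FFFFFFF, 1))
    print(add_cc(0xFFFFFFFF, 1))
```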

Mapping SPARC instructions to DSP48E

• Most SPARC v8 instructions can be covered by DSP48E
  – 1 cycle ALU (1 DSP)
    • LD/ST (address calculation)
    • Bit-wise logic (and, or, …)
    • SETHI
    • JMPL, RETT, Call
    • Write special register (WRPSR)
    • SAVE/RESTORE
  – Long latency ALU
    • Pipelined shift/Mul (4 DSPs)
    • Divide (1 DSP)
  – Misc
    • RDPSR, RDWIM (XOR ops.)
• Only one 32-bit adder is not in DSP (nPC+4)
• DSP48E is not a silver bullet
  – Barrel shifter/shifter support is weak
    • Altera does better on shifters
  – 48-bit is odd!
    • Expecting 64-bit input DSPs w. 32x32 multipliers (DSP64E?)
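One common way to share a multiplier between shift and multiply is to treat a shift as a multiplication by a power of two. The slide does not say that RAMP Gold does this, so the sketch below is an assumption about how a "shift/Mul" unit could work, not a description of the actual design. It also builds the 32x32 product from four 16x16 partial products as a stand-in for the four DSP blocks; the real DSP48E multiplier has different operand widths.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32(a, b):
    """Unsigned 32x32 -> 64-bit product from four 16x16 partial products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    return ((a_lo * b_lo)
            + ((a_lo * b_hi + a_hi * b_lo) << 16)
            + ((a_hi * b_hi) << 32))


def sll(x, count):
    """Shift left logical: low 32 bits of x * 2**n, n = count mod 32."""
    n = count & 31               # SPARC uses the low 5 bits of the count
    return mul32(x & MASK32, 1 << n) & MASK32


def srl(x, count):
    """Shift right logical: high 32 bits of x * 2**(32 - n)."""
    n = count & 31
    if n == 0:
        return x & MASK32        # 2**32 does not fit a 32-bit operand
    return mul32(x & MASK32, 1 << (32 - n)) >> 32


if __name__ == "__main__":
    print(hex(sll(1, 31)))
    print(hex(srl(0x80000000, 31)))
```

An arithmetic right shift would need a signed multiply or explicit sign fill, which is left out here.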

Pipeline Arch

[Figure: pipeline block diagram. Stages: Thread Selection, Instruction Fetch, Decode, Register File Access, Execution, Memory, Write Back. Labeled blocks: static thread selection (round robin); special registers (pc/npc, wim, psr, thread control registers); I-Cache (nine 18kb BRAMs); microcode ROM; Instruction Fetch 1 (issue address request) and Instruction Fetch 2 (compare tag); Decode (resolve branch, decode register file address); Regfile Access (1 or 2 cycles) on a 32-bit register file (four 36kb BRAMs); decode ALU control / exception detection; MUL/DIV/SHF (4 DSPs); simple ALU (1 DSP) / LDST decoding; special register handling (RDPSR/RDWIM); unaligned address detection / store preparation; Load (issue address); D-Cache (nine 18kb BRAMs); trap/IRQ handling; read & modify; generate microcode request; load align / write back; mem request under cache miss; two 256-bit memory interfaces. Labeled signals: 32-bit instruction, synthesized instruction, micro inst., tag compare result, tag/data read request, imm, pc, OP1, OP2, tag / 128-bit data, 128-bit read & modify data. Legend: LUT RAM (clk x 2), LUT ROM, BRAM (clk x 2), DSP (clk x 2).]

• 7-stage pipeline
  – MMU support soon

[Figure: chip-level organization on Virtex 5 LX110T. Eight cores (Core 1 to Core 8), each a SPARC V8 pipeline (64 threads) with 256 B I$ and 256 B D$. Cores 1-4 share BEE3 DDR2 Memory controller 1 and Cores 5-8 share BEE3 DDR2 Memory controller 2, each over a 144-bit interface. The two groups of four cores are labeled Cluster 1 and Cluster 2.]

Status

• Coded in SystemVerilog
  – ~4000 lines of code implemented
• Push to synthesis tools in Feb 08
  – Synthesize with Precision or Synplify
  – Full V8 instruction (integer) support (no MMU)
  – Aiming for ~150 MHz, estimate <4000 LUTs per core
• Verification Goal
  – Pass microsparc verification suite / sparc.org certification test

Backup Slides

SPARC vs MIPS

• Similar ISA
  – Similar ALU/Jump and Link/Jump instructions
  – Similar LD/ST inst. (LDB, LDH, LDW)
  – Delayed branch
• Except
  – Branch on 4 condition codes (N, C, O, Z)
    • E.g. Addcc r1, r2, r3 followed by Bicc address
  – Trap on condition code for SW traps (e.g. system call)
  – Register window (2-32 windows)
    • Only 1 window (32 registers) is active at a time, selected by the CWP field in the Processor State Register (PSR)
    • SAVE/RESTORE, RETT, and traps change the window
    • SAVE/RESTORE are commonly used in function calls
  – No FPU <-> Integer register file transfer instructions
  – Difference in atomic instructions:
    • MIPS: LL/SC, SPARC: LDSTUB, SWAP
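The window rotation performed by SAVE and RESTORE can be sketched as follows. In SPARC V8, SAVE decrements CWP modulo the number of windows and RESTORE increments it, and either one traps if the target window's bit is set in WIM. The code is a minimal model with hypothetical names; `NWINDOWS = 3` follows the state storage slide, and the register-file side effects of SAVE/RESTORE are left out.

```python
NWINDOWS = 3  # from the state storage slide: 3 windows per thread


class WindowTrap(Exception):
    """Raised where hardware would take a window overflow/underflow trap."""


def save(cwp, wim, nwindows=NWINDOWS):
    """Return the new CWP after SAVE, or trap if that window is invalid."""
    new_cwp = (cwp - 1) % nwindows
    if (wim >> new_cwp) & 1:
        raise WindowTrap("window_overflow")
    return new_cwp


def restore(cwp, wim, nwindows=NWINDOWS):
    """Return the new CWP after RESTORE, or trap if that window is invalid."""
    new_cwp = (cwp + 1) % nwindows
    if (wim >> new_cwp) & 1:
        raise WindowTrap("window_underflow")
    return new_cwp


if __name__ == "__main__":
    wim = 0b010                 # window 1 marked invalid
    cwp = save(0, wim)          # a call moves from window 0 to window 2
    print(cwp)
    try:
        save(cwp, wim)          # a nested call would enter window 1
    except WindowTrap as trap:
        print(trap)
```

With only 3 windows and one of them marked invalid, nested calls trap often, which is the performance trade-off behind choosing 3 windows instead of the minimum of 2.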
