Multithreaded SPARC v8 Functional Model for RAMP Gold
Zhangxi Tan, UC Berkeley
RAMP Retreat, Jan 17, 2008
Motivation
• Traditional RISC optimizations are far less appealing on soft-core processors on FPGAs
  – Mapped to expensive wide bus muxes; becomes area/frequency bottleneck on fabric
    • Bypassing network
    • Delayed branch
  – Less efficient when dealing with memory access latency
    • Small cache size & shared memory controller make things even worse
  – Poor core count on single FPGA! (e.g. V5 LX110T)
    • <16 32-bit SPARC V8 Leon integer pipelines
Approach
• Need a new functional model, which is able to
  – Support a large number of emulated cores (~1k) per BEE3 board
  – Accelerate aggregate emulation performance (MIPS/chip)
    • Including optimizations to tolerate memory & I/O latency
  – Run full OS and support OS development
    • TLB/exception support
    • Memory mapped I/O + IRQ support
  – Interface with the timing model
• Virtualizing SPARC V8 RTL with fine-grain multithreading
  – High density design (256/512 emulated CPUs per chip)
    • 8 cores in 2 clusters per FPGA (V5 LX110T); each core has 32 or 64 threads (configurable)
    • 4 cores in one cluster share one BEE3 mem controller
  – Start from 32-bit ISA, eventually support 64-bit ISA (v9)
Design philosophy 1
• Keep everything simple!
  – Build processor w/o bypassing network
    • Greatly simplify pipeline design
    • Preliminary result shows ~28% LUT reduction + ~18% frequency improvement on Leon3 processor
  – Direct map cache/TLB
  – Simple fine-grain multithreading to fill pipeline bubbles
    • Static RR issue: T1->T2->T3->T4->T1->T2...
    • Never stall the pipeline
      – Long latency operations?
      – Tell the pipeline to REPLAY the instruction in the next rotation
  – "Microcode" for complex instructions/trap handling
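The static round-robin issue with replay described on this slide can be sketched as a small software model. This is only an illustration under stated assumptions, not the RTL: the `Thread` class, the `(name, latency)` program format and the `run` function are hypothetical names, and a "long latency operation" is modeled simply as an instruction whose latency is greater than one cycle.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    tid: int
    program: list           # hypothetical format: list of (name, latency_in_cycles)
    pc: int = 0             # index of the next instruction to issue
    pending: bool = False   # True once a long-latency request is in flight
    ready_at: int = 0       # cycle at which the in-flight request completes


def run(threads, max_cycles=1000):
    """One issue slot per cycle; the slot owner is fixed by cycle % thread count."""
    trace = []
    cycle = 0
    while cycle < max_cycles and any(t.pc < len(t.program) for t in threads):
        t = threads[cycle % len(threads)]          # static round robin
        if t.pc < len(t.program):
            name, latency = t.program[t.pc]
            if latency > 1 and not t.pending:
                # First attempt: start the request, do not stall the pipeline.
                t.pending = True
                t.ready_at = cycle + latency
                trace.append((cycle, t.tid, name, "replay"))
            elif t.pending and cycle < t.ready_at:
                # Still waiting: replay again in the next rotation.
                trace.append((cycle, t.tid, name, "replay"))
            else:
                # Short instruction, or the long-latency result has arrived.
                t.pending = False
                t.pc += 1
                trace.append((cycle, t.tid, name, "retire"))
        cycle += 1
    return trace


threads = [Thread(0, [("ld", 6), ("add", 1)])]
threads += [Thread(i, [("add", 1), ("add", 1)]) for i in range(1, 4)]
for entry in run(threads):
    print(entry)
```

In this model a thread that misses only loses its own slots; the other threads keep retiring one instruction per rotation, which is the point of never stalling the pipeline.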
Design philosophy 2
• Design for fabric (targeting Virtex 5)
  – High working frequency (expect ~150 MHz)
    • Deep pipeline: 10~11 physical stages
  – Manually controlled FPGA resource mapping
    • BRAMs, LUTRAM
    • Use V5 DSPs as ALU
    • Pipelining all BRAMs and DSPs (maximize Fmax)
  – Error detection/correction for all BRAMs
    • Cache tags and register file use parity bit to detect soft errors
    • TLB entry and cache data are protected by built-in V5 ECC-BRAM
Challenges
• Thread state storage & per-thread L1 cache
  – Will BRAM/LUTRAM fit?
  – How large?
  – Where to map? LUTRAM or BRAM
• Bandwidth and RW ports requirement
  – Multithreading amplifies the requirement!
• How to make use of FPGA primitives to control total LUT usage
  – 6-input LUTs: LUT5_2, RAM64B
  – DSPs
State storage
• Main thread state (integer pipeline)
  – 3 register windows per thread (2 minimum by specification, 3 for performance)
    • 8 global + 16*3 window registers
    • Stored in BRAM in chunks of 64 registers
  – PC/nPC – LUTRAM
  – PSR (processor state register) – LUTRAM
  – WIM (window invalid mask) – LUTRAM
  – TBR (trap base register) – BRAM, packed with the 3 register windows
  – Y (high 32 bits for mul/div) – LUTRAM
Regfile layout

Thread  BRAM Address  BRAM Content
0       0-7           Global registers g0-g7
        8             TBR
        9-15          Scratch registers for microcode mode
        16-63         3 register windows
1       64-71         Global registers g0-g7
        72            TBR
        73-79         Scratch registers for microcode mode
        80-127        3 register windows
2       …             …

• 64 threads per pipeline, 8 pipelines per chip (V5 LX110T)
• Eight 18kb blocks
• Double clocked BRAM (virtually 4 ports)
• Indexed with {thread_id, reg_addr}
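The {thread_id, reg_addr} indexing in the layout table can be sketched as below. The 64-entry chunk, the TBR slot at offset 8 and the window region at offsets 16-63 come from the table; the ordering of outs and locals inside each 16-register window is an assumption made for illustration, as are the function names. The aliasing of a window's ins onto the outs of window CWP+1 follows the SPARC V8 register window definition. The sizing check assumes each 18kb block holds 16 Kib of data plus parity.

```python
REGS_PER_THREAD = 64   # one BRAM chunk per thread (from the layout table)
NWINDOWS = 3           # register windows per thread
WINDOW_BASE = 16       # offsets 16-63 hold the windowed registers
TBR_OFFSET = 8         # TBR is packed into the same chunk


def reg_offset(reg, cwp):
    """Map SPARC register r0-r31 and the current window pointer to offset 0-63."""
    if not 0 <= reg < 32 or not 0 <= cwp < NWINDOWS:
        raise ValueError("register or window out of range")
    if reg < 8:                                   # globals g0-g7
        return reg
    if reg < 16:                                  # outs o0-o7 (assumed first in window)
        return WINDOW_BASE + cwp * 16 + (reg - 8)
    if reg < 24:                                  # locals l0-l7 (assumed second)
        return WINDOW_BASE + cwp * 16 + 8 + (reg - 16)
    # ins i0-i7 are the outs of the adjacent window (cwp + 1), per SPARC V8
    return WINDOW_BASE + ((cwp + 1) % NWINDOWS) * 16 + (reg - 24)


def bram_address(thread_id, reg, cwp):
    """Concatenate thread id and per-thread offset: {thread_id, reg_addr}."""
    return thread_id * REGS_PER_THREAD + reg_offset(reg, cwp)


# Sizing check: 64 threads x 64 registers x 32 bits fills eight 16 Kib data arrays.
assert 64 * REGS_PER_THREAD * 32 == 8 * 16 * 1024

print(bram_address(1, 0, 0))    # g0 of thread 1
print(bram_address(1, 8, 0))    # first windowed register of thread 1
```

Because the chunk size is a power of two, the multiplication is just a bit concatenation in hardware, so no adder is needed to form the BRAM address.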
Cache & TLB
• Per thread cache
  – Split I/D direct-map write-allocate write-back cache
    • Block size: 32 bytes (BEE3 DDR2 controller heart beat)
    • 512B total in 64-thread configuration: 256B I$, 256B D$
      – Size doubled (1KB) for 32-thread configuration
    • Non-blocking to a different thread, but blocking to the same thread
    • CPU and memory controller access cache at the same time through different ports
  – Physical tag
• Per thread TLB
  – Split I/D direct-map TLB
    • 16 entries in total: 8 for ITLB and 8 for DTLB
• Total BRAM usage per thread (regfile + cache/TLB + tag + misc): 30~32 blocks (18kb)
• BRAM is still the critical resource
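A worked example of the cache geometry on this slide: with 32-byte blocks and 256 B per thread per cache, each thread owns 8 direct-mapped lines, so a physical address splits into a 5-bit offset, a 3-bit index and a tag. The sketch below assumes the per-thread lines are laid out back to back and selected by {thread_id, index}; that layout and the function names are assumptions, not taken from the slides.

```python
BLOCK_BYTES = 32                        # block size from the slide
CACHE_BYTES = 256                       # per thread, per cache (64-thread config)
LINES = CACHE_BYTES // BLOCK_BYTES      # 8 lines per thread
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 5
INDEX_BITS = LINES.bit_length() - 1          # 3


def split_address(paddr):
    """Split a physical address into (tag, index, offset) for a direct-mapped cache."""
    offset = paddr & (BLOCK_BYTES - 1)
    index = (paddr >> OFFSET_BITS) & (LINES - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def line_slot(thread_id, paddr):
    """Assumed layout: each thread's lines are contiguous, i.e. {thread_id, index}."""
    _, index, _ = split_address(paddr)
    return thread_id * LINES + index


def is_hit(tags, thread_id, paddr):
    """Compare the stored physical tag of the selected line with the address tag."""
    tag, _, _ = split_address(paddr)
    return tags[line_slot(thread_id, paddr)] == tag


tags = [None] * (64 * LINES)            # 64 threads, all lines initially invalid
tags[line_slot(3, 0x1234)] = split_address(0x1234)[0]   # fill one line for thread 3
print(split_address(0x1234), is_hit(tags, 3, 0x1234), is_hit(tags, 4, 0x1234))
```

Since every thread has its own lines, a miss by one thread never evicts another thread's data, which matches the "non-blocking to a different thread" bullet.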
DSP48E are perfect for ALU
• DSP48E is a MAC.
• Two 48-bit inputs, one 48-bit output
  – Add/subtract/logic/bypass/address calculation
  – Pattern detector (generate Z flag)
    • <10 LUTs for C, O, nothing for N
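As a software reference for the flag bullet above, the sketch below computes a 32-bit add together with the four condition codes. It is a behavioral model only and does not describe how the DSP48E is configured; the function name `addcc` and the dictionary keys are chosen for illustration. It shows why N costs nothing (it is the result's sign bit), why Z is a compare of the result against zero (the job of the pattern detector), and that C and O are each a small Boolean function of a few bits.

```python
MASK32 = 0xFFFFFFFF


def addcc(a, b):
    """32-bit add returning (result, flags) with N, Z, C and O (overflow)."""
    a &= MASK32
    b &= MASK32
    wide = a + b                      # computed in a wider datapath, like a 48-bit adder
    result = wide & MASK32
    n = result >> 31                  # sign bit of the result
    z = int(result == 0)              # all-zero pattern match
    c = wide >> 32                    # carry out of bit 31
    # Signed overflow: operands have the same sign and the result's sign differs.
    o = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, {"N": n, "Z": z, "C": c, "O": o}


print(addcc(0x7FFFFFFF, 1))
print(addcc(0xFFFFFFFF, 1))
```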
Mapping SPARC instructions to DSP48E
• Most SPARC v8 instructions can be covered by DSP48E
  – 1 cycle ALU (1 DSP)
    • LD/ST (address calculation)
    • Bit-wise logic (and, or, …)
    • SETHI
    • JMPL, RETT, CALL
    • Write special register (WRPSR)
    • SAVE/RESTORE
  – Long latency ALU
    • Pipelined shift/Mul (4 DSPs)
    • Divide (1 DSP)
  – Misc
    • RDPSR, RDWIM (XOR ops.)
• Only one 32-bit adder is not in DSP (nPC+4)
• DSP48E is not a silver bullet
  – Barrel shifter/shifter support is weak
    • Altera does better on shifters
  – 48-bit is odd!
    • Expecting 64-bit input DSPs w. 32x32 multipliers (DSP64E?)
Pipeline Arch
[Figure: pipeline block diagram. Stages: Thread Selection (static round robin), Instruction Fetch 1 (issue address request) and Instruction Fetch 2 (compare tag), Decode (resolve branch, decode register file address), Register File Access (1 or 2 cycles), Execution, Memory, Write Back (load align). Blocks: special registers (pc/npc, wim, psr, thread control registers), I-Cache (nine 18 kb BRAMs), microcode ROM, 32-bit register file (four 36kb BRAMs), ALU control decode/exception detection, MUL/DIV/SHF (4 DSPs), simple ALU (1 DSP)/LDST decoding, special register handling (RDPSR/RDWIM), unaligned address detection/store preparation, D-Cache (nine 18 kb BRAMs), trap/IRQ handling, microcode request generation, and a 256-bit memory interface on each cache. Legend: LUT RAM (clk x 2), LUT ROM, BRAM (clk x 2), DSP (clk x 2).]
• 7-stage pipeline
  – MMU support soon
[Figure: Virtex 5 LX110T block diagram. Cluster 1: Cores 1-4, each a SPARC V8 pipeline (64 threads) with 256 B I$ and 256 B D$, sharing BEE3 DDR2 memory controller 1 (144 bits). Cluster 2: Cores 5-8, same configuration, sharing BEE3 DDR2 memory controller 2 (144 bits).]
Status
• Coded in SystemVerilog
  – ~4000 lines of code implemented
• Push to synthesis tools in Feb 08
  – Synthesize with Precision or Synplify
  – Full V8 instruction (integer) support (no MMU)
  – Aiming ~150 MHz, estimate <4000 LUTs per core
• Verification goal
  – Pass microSPARC verification suite / sparc.org certification test
Backup Slides
SPARC vs MIPS
• Similar ISA
  – Similar ALU/Jump and Link/Jump instructions
  – Similar LD/ST inst. (LDB, LDH, LDW)
  – Delayed branch
• Except
  – Branch on 4 condition codes (N, C, O, Z)
    • E.g. ADDcc r1, r2, r3 followed by Bicc address
  – Trap on condition code for SW traps (e.g. system call)
  – Register window (2-32 windows)
    • Only 1 window (32 registers) is active at a time, controlled by the CWP field in the Processor State Register (PSR)
    • SAVE/RESTORE, RETT and traps affect the window
    • SAVE/RESTORE are commonly used in function calls
  – No FPU <-> integer register file transfer instructions
  – Difference in atomic instructions:
    • MIPS: LL/SC, SPARC: LDSTUB, SWAP
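To make the "branch on 4 condition codes" point concrete, the sketch below evaluates the SPARC V8 Bicc conditions from the integer condition codes. The function name `bicc_taken` and the use of lowercase mnemonics as dictionary keys are illustrative choices; the slide's "O" flag is the overflow bit, called V in the SPARC V8 manual and in the code.

```python
def bicc_taken(cond, n, z, v, c):
    """Return True if the SPARC V8 Bicc condition holds for flags N, Z, V, C."""
    conditions = {
        "ba": True,                       # always
        "bn": False,                      # never
        "be": z,                          # equal
        "bne": not z,                     # not equal
        "bl": n != v,                     # signed less
        "bge": n == v,                    # signed greater or equal
        "ble": z or (n != v),             # signed less or equal
        "bg": not (z or (n != v)),        # signed greater
        "bcs": c,                         # carry set (unsigned less)
        "bcc": not c,                     # carry clear (unsigned greater or equal)
        "bleu": c or z,                   # unsigned less or equal
        "bgu": not (c or z),              # unsigned greater
        "bneg": n,                        # negative
        "bpos": not n,                    # positive
        "bvs": v,                         # overflow set
        "bvc": not v,                     # overflow clear
    }
    return bool(conditions[cond])


# Flags after comparing 3 with 5 (subcc 3, 5): negative result, borrow set.
flags = dict(n=1, z=0, v=0, c=1)
print(bicc_taken("bl", **flags), bicc_taken("bgu", **flags), bicc_taken("be", **flags))
```

A MIPS-style branch compares registers directly, whereas here the compare and the branch are two instructions linked through the PSR flags, which is why the flags are part of the per-thread state.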