
Processor and Integrated System Design in FPGAs

Jan Gray
President, Gray Research LLC
([email protected])

lw r12,4(r3)

Copyright © 1999,2000 Gray Research LLC. All Rights Reserved.

Introduction
◆ Recent dense programmable logic enables practical “desktop processor development”
◆ Examples
  ◆ RISC4005 (Freidin 1990)
  ◆ j32 (1995)
  ◆ XSOC/xr16 (1998)
    ◆ xr16: simple pipelined 16-bit RISC, 33 MHz, 1.4 CPI
    ◆ XSOC: on-chip bus and peripherals
◆ Towards a reusable cores community and cost-effective FPGA-based systems-on-a-chip

Outline
◆ Introduction to FPGAs
  ◆ Xilinx XC4000XL architecture
  ◆ Development process
◆ XSOC/xr16 project
  ◆ xr16 tools, architecture, datapath, control unit
  ◆ XSOC design and implementation
◆ Ongoing work
◆ References, Resources

Field Programmable Gate Arrays
◆ Evolved from PALs, PLDs, Complex PLDs
◆ A niche hardware implementation technology
  ◆ Strengths: quick prototyping and time-to-market, reprogrammability, relatively easy to use
  ◆ Weaknesses: cost, density, speed
◆ Vendors: Xilinx, Altera, Actel, Atmel, Lucent, Cypress, Lattice, QuickLogic

Technology    Gates   Speed      Cost         Part Cost   Spin time
Custom VLSI   <100M   <1 GHz     $50K-up      $1-up       weeks
Gate array    <10M    <400 MHz   $10K-$1M     $1-up       days/weeks
FPGA          <1M     <200 MHz   $100-$100K   $3-$1K      minutes/hours

FPGA Technologies
◆ Programmable Elements
  ◆ Logic
    ◆ combinatorial: lookup tables (LUTs) or gates + muxes
    ◆ sequential: flip-flops
  ◆ Interconnect: metal wire segments connected by...
    ◆ pass transistors driven by SRAM or EEPROM bit cells
    ◆ multiplexers
    ◆ fuses or antifuses (one-time programmable)
◆ FPGA architecture variations
  ◆ CLB arrays, rows, macrocells; fine/coarse grain
  ◆ RAM, fast wide adders, TBUFs (3-state buffers)

FPGAs: Easy to Use
◆ Idealized digital design model
  ◆ No analog EE: ignore gate and wire capacitance, transistor W/L ratios, etc.
  ◆ Low skew global nets → No clock skew worries
  ◆ Buffered line drivers → No gate fan-out worries
◆ Synchronous design “just works”
  ◆ Avoid async. flip-flop state changes, gated clocks
  ◆ Try to keep everything on one device
  ◆ Design, simulate, it should just work
◆ ...even if you’re just a software type
◆ But it takes effort to master effective use of resources

FPGA Applications
◆ Conventional: cheap, fast turnaround, field upgradable ASICs
  ◆ Telecom, networking, XSOC/xr16
◆ (Re-) configurable computing
  ◆ Quickturn: fast design simulation/verification
  ◆ Signal, image processing: radar, radio
  ◆ Graphics: compositing, octree rendering
  ◆ Military: target dependent correlation/recognition
  ◆ Cryptography: DES search
  ◆ “Hardware” genetic algorithms

Xilinx XC4000 FPGAs
◆ Introduced in 1992, Xilinx’s third generation
◆ SRAM based
◆ Device family spans 3,000 to 250,000 “gates”, 1 to 3 ns LUT delays, $10 to $1000 Q1
◆ XSOC uses 3.3V XC4005XLPC84-3
  ◆ 14x14 Configurable Logic Blocks, 1.6 ns LUT delay
  ◆ 60 bonded-out I/O Blocks
  ◆ $20 ($3 Q100K)
◆ Eclipsed by newer Virtex, Virtex-E families
  ◆ 256x16 SRAM blocks, up to 2M “gates”

XC4000 Configurable Logic Block
◆ Two 4-input LUTs
◆ Two D flip-flops w/ clock enable
◆ Special features:
  ◆ Writeable LUTs provide 32x1 or 16x2 SRAM
    ◆ 8 CLBs make a 16x16 register file
  ◆ Dedicated fast ripple carry chain hardware
    ◆ 8 CLBs implement 16-bit adder, ~10 ns delay
◆ Fig. 1
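
The "8 CLBs make a 16x16 register file" figure follows from the 16x2 LUT RAM mode: each CLB stores two bits of every word. A small sketch of that arithmetic, assuming only the two RAM modes named on the slide; the function name `clbs_for_lut_ram` is hypothetical and is not part of any Xilinx tool.

```c
/* Number of XC4000 CLBs needed for a small LUT RAM, using the two
 * modes named on the slide: 16x2 (depth <= 16) or 32x1 (depth <= 32).
 * Returns 0 when the request does not fit either mode. */
unsigned clbs_for_lut_ram(unsigned depth, unsigned width_bits)
{
    if (depth == 0 || width_bits == 0)
        return 0;
    if (depth <= 16)
        return (width_bits + 1) / 2;  /* 16x2 mode: two bits per CLB */
    if (depth <= 32)
        return width_bits;            /* 32x1 mode: one bit per CLB */
    return 0;                         /* deeper RAMs are out of scope here */
}
```

For example, `clbs_for_lut_ram(16, 16)` gives 8, matching the slide.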

XC4000 I/O Block
◆ Input: direct, registered
◆ Output: direct, registered, 3-stated, pull-ups/pull-downs, slew rate options
◆ Fig. 11

XC4000 Interconnect
◆ Programmable Interconnect Points
◆ Switch matrices
◆ Input multiplexers
◆ Hierarchical resources for hierarchical circuits
◆ Special features
  ◆ Long lines: buses and control lines
    ◆ 3-state buffer per output
  ◆ Low skew global clock lines
◆ Fig. 12-16

FPGA Development Process
◆ Design capture: schematics, custom tools, hardware design languages; floorplan!
◆ Simulate/verify
◆ Compile: “technology mapping”, place, route
◆ Make config bitstream: XC4005XL: 152K bits
◆ Download:
  ◆ “master load” from byte-wide or bit-serial ROM
  ◆ “slave load” by microprocessor or XChecker pod
◆ Test/Debug; pod can readback internal state

Talk Outline
◆ Introduction to FPGAs
◆ XSOC/xr16
  ◆ xr16 tools, architecture, datapath, control unit
  ◆ XSOC design and implementation
◆ Demo
◆ Ongoing work
◆ References, Resources

XSOC/xr16 Project Goals
◆ Make integrated system dev more accessible
  ◆ Subject of Circuit Cellar article series
  ◆ Explain FPGA processor and system design
  ◆ Build a fast, simple, thrifty, real CPU
◆ Help launch a free cores development community
  ◆ Provide on-chip bus architecture for easy cores reuse
◆ Show FPGA CPUs can be practical (=cheap)
  ◆ Not just an ASIC prototyping tool
◆ Show C compiler support
◆ Show efficient FPGA design “best practices”

Development Tools
◆ Need C compiler for our nascent 16-bit RISC: port LCC [Fraser and Hanson]
  ◆ mips.md → xr16.md
  ◆ 32 registers → 16 registers
  ◆ sizeof(int)==sizeof(void*)==4 → 2
  ◆ Misc: branches; long ints; software mul/div
◆ Assembler
  ◆ Lex, parse, fix far branches, apply fixups, emit
◆ Instruction set simulator

LCC Generated Code

◆ C source

    typedef struct TN {
        int k;
        struct TN *left, *right;
    } *T;

    T search(int k, T p) {
        while (p && p->k != k)
            if (p->k < k)
                p = p->right;
            else
                p = p->left;
        return p;
    }

◆ Generated xr16 assembly

    _search:                ; r3=k r4=p
        br L3
    L2: lw r9,(r4)
        cmp r9,r3           ; p->k < k?
        bge L5
        lw r4,4(r4)         ; p=p->right
        br L6
    L5: lw r4,2(r4)         ; p=p->left
    L6:
    L3: mov r9,r4
        cmp r9,r0           ; p==0?
        beq L7
        lw r9,(r4)
        cmp r9,r3           ; p->k != k?
        bne L2
    L7: mov r2,r4           ; retval=p
    L1: ret
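
The load offsets in the listing (0, 2, and 4) follow from 16-bit ints and 16-bit pointers on xr16. A minimal sketch of that layout, modelled on a host compiler with fixed-width types; `struct tn16` and the `TN16_OFF_*` names are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side model of struct TN as laid out for xr16: int and pointers
 * are both 16 bits, so pointers are represented as uint16_t addresses. */
struct tn16 {
    int16_t  k;      /* offset 0: lw r9,(r4)  */
    uint16_t left;   /* offset 2: lw r4,2(r4) */
    uint16_t right;  /* offset 4: lw r4,4(r4) */
};

/* Field offsets as compile-time constants. */
enum {
    TN16_OFF_K     = offsetof(struct tn16, k),
    TN16_OFF_LEFT  = offsetof(struct tn16, left),
    TN16_OFF_RIGHT = offsetof(struct tn16, right)
};
```

All three offsets fit in the 4-bit immediate field, so no immediate prefix is needed for these loads.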

XR16 Instruction Set Design
◆ Goals and constraints
  ◆ Compact 16-bit instructions
  ◆ 16 16-bit registers
  ◆ Cover integer C operations
  ◆ KISS
  ◆ Fast → pipelined → regular
  ◆ Byte addressable → load/store bytes/words
  ◆ disp(reg) addressing
  ◆ Long ints → add with carry, shift extended
  ◆ Scale to 32-bit registers
◆ Common instructions
  ◆ lw sw mov lea call br cmp
  ◆ mov lea cmp → add/sub
  ◆ disp(reg) addressing (69%)
◆ Resolved:
  ◆ 3 operand: add sub addi
  ◆ 4-bit immediates, 12-bit immediate prefix
  ◆ no condition codes; interlocked add/sub+bcond
  ◆ fast call/return → call/jal
  ◆ mul/div/shifts in software
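
The 12-bit immediate prefix supplies bits 15:4 of the next instruction's immediate, and that instruction's own 4-bit field supplies bits 3:0. A short sketch of how an assembler or simulator might combine and split the two parts; the function names are hypothetical, and the handling of unprefixed 4-bit immediates (sign or zero extension) is deliberately left out.

```c
#include <stdint.h>

/* Combine a 12-bit "imm" prefix with the 4-bit immediate field of the
 * following instruction: bits 15:4 from the prefix, bits 3:0 from the
 * instruction itself. */
uint16_t xr16_prefixed_imm(uint16_t imm12, uint16_t imm4)
{
    return (uint16_t)(((imm12 & 0x0FFFu) << 4) | (imm4 & 0x000Fu));
}

/* Split a full 16-bit constant into its prefix and 4-bit parts. */
void xr16_split_imm(uint16_t value, uint16_t *imm12, uint16_t *imm4)
{
    *imm12 = (uint16_t)(value >> 4);
    *imm4  = (uint16_t)(value & 0x000Fu);
}
```

For instance, a prefix of 0x123 followed by a 4-bit field of 0x4 yields the 16-bit immediate 0x1234.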

XR16 Instruction Set

Hex    Fmt  Assembler                            Semantics                    N (cycles)
0dab   rrr  add rd,ra,rb                         rd = ra + rb;                1
1dab   rrr  sub rd,ra,rb                         rd = ra - rb;                1
2dai   rri  addi rd,ra,imm                       rd = ra + imm;               1
3d*b   rr   {and or xor andn adc sbc} rd,rb      rd = rd op rb;               1
4d*i   ri   {andi ori xori andni adci sbci       rd = rd op imm;              1
             slli slxi srai srli srxi} rd,imm
5dai   rri  lw rd,imm(ra)                        rd = *(int*)(ra+imm);        2
6dai   rri  lb rd,imm(ra)                        rd = *(byte*)(ra+imm);       2
8dai   rri  sw rd,imm(ra)                        *(int*)(ra+imm) = rd;        2
9dai   rri  sb rd,imm(ra)                        *(byte*)(ra+imm) = rd;       2
Adai   rri  jal rd,imm(ra)                       rd = pc, pc = ra + imm;      3
B*dd   br   {br brn beq bne bc bnc bv bnv        if (cond) pc += 2*disp8;     *
             blt bge ble bgt bltu bgeu
             bleu bgtu} label
Ciii   i12  call func                            r15 = pc, pc = imm12<<4;     3
Diii   i12  imm imm12                            imm'next[15:4] = imm12;      1
7xxx, Exxx, Fxxx                                 reserved
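
The hex patterns in the table imply that every instruction is four 4-bit nibbles, read differently per format. A minimal field-extraction sketch under that reading; `struct xr16_fields` and `xr16_decode` are hypothetical names, and the sketch does not interpret the fields (for example, it does not sign-extend the branch displacement).

```c
#include <stdint.h>

/* Positional fields of a 16-bit xr16 instruction word. Which fields
 * are meaningful depends on the format (rrr, rri, rr, ri, br, i12). */
struct xr16_fields {
    unsigned op;      /* bits 15:12, major opcode                    */
    unsigned rd;      /* bits 11:8,  rd (or condition for branches)  */
    unsigned ra;      /* bits 7:4,   ra (or function for rr/ri)      */
    unsigned rb_imm;  /* bits 3:0,   rb or 4-bit immediate           */
    unsigned imm12;   /* bits 11:0,  call target or imm prefix       */
    unsigned disp8;   /* bits 7:0,   raw branch displacement         */
};

struct xr16_fields xr16_decode(uint16_t insn)
{
    struct xr16_fields f;
    f.op     = (insn >> 12) & 0xFu;
    f.rd     = (insn >> 8)  & 0xFu;
    f.ra     = (insn >> 4)  & 0xFu;
    f.rb_imm = insn & 0xFu;
    f.imm12  = insn & 0xFFFu;
    f.disp8  = insn & 0xFFu;
    return f;
}
```

Under this reading, the `lw r12,4(r3)` shown on the slides encodes as 0x5C34: opcode 5, rd 12, ra 3, immediate 4.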

Synthesized Instructions

Instruction                     Maps to
nop                             and r0,r0
mov rd,ra                       add rd,ra,r0
cmp ra,rb                       sub r0,ra,rb
subi rd,ra,im                   addi rd,ra,-im
cmpi ra,im                      addi r0,ra,-im
com rd                          xori rd,-1
lea rd,imm(ra)                  addi rd,ra,imm
lbs rd,imm(ra)                  lb rd,imm(ra)
  (load-byte, sign-extending)   xori rd,0x80
                                subi rd,0x80
j addr                          jal r0,addr
ret                             jal r0,0(r15)
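
The `lbs` sequence sign-extends a byte with an XOR followed by a subtract. A small sketch of why that works, assuming `lb` zero-extends the loaded byte into the 16-bit register (the slide does not say so explicitly); `xr16_lbs` is a hypothetical name.

```c
#include <stdint.h>

/* Model of the three-instruction lbs sequence on a 16-bit register. */
uint16_t xr16_lbs(uint8_t byte)
{
    uint16_t rd = byte;             /* lb:   assumed zero-extending load */
    rd ^= 0x0080u;                  /* xori rd,0x80: flip the sign bit   */
    rd = (uint16_t)(rd - 0x0080u);  /* subi rd,0x80: borrow fills 15:8   */
    return rd;
}
```

Flipping bit 7 and then subtracting 0x80 leaves non-negative bytes unchanged and makes the subtraction borrow through bits 15:8 for bytes with bit 7 set.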

Compromised Instructions
◆ No multiply/divide
◆ No barrel shifter: 1-bit shifts only
◆ Jumps and taken branches take 3 cycles
  ◆ Annul the two branch shadow instructions
◆ Loads/stores take at least 2 cycles
  ◆ One shared memory port
◆ Simple interrupt/return
◆ No floating point

Design Hierarchy: xr16
◆ XSOC
  ◆ xr16
    ◆ datapath
    ◆ control unit
  ◆ memctrl
  ◆ vga
  ◆ parin, parout, iram

[Figure S2: XR16 CPU Schematic (macro XR16). Two blocks, the control unit C (CTRL16) and the datapath D (DP16), connected by control signals such as RNA[3:0], RNB[3:0], IMM[11:0], IMMOP[5:0], LOGICOP[1:0], BRDISP[7:0], RFWE, FWD, and PCE, and by the condition outputs Z, N, CO, V, and A15. External signals include CLK, RDY, INSN[15:0], AN[15:0], D[15:0], READN, WORDN, ACE, DMA, IREQ, and DMAREQ.]

Copyright (C) 2000, Gray Research LLC. This work and its use subject to XSOC License Agreement. See LICENSE file.

Datapath Resources
◆ 2 read, 1 write per cycle register file
◆ Immediate operand multiplexer
◆ A and B operand registers
◆ Adder, logic unit, shifter
◆ Result multiplexer
◆ Zero, negative, carry, overflow detection
◆ PC, branch displacement multiplexer, PC adder
◆ Address multiplexer
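
The adder and its zero, negative, carry, and overflow outputs can be modelled in a few lines. This is a behavioural sketch only, not the schematic's logic: it assumes subtraction is done as a + ~b + carry-in (a common adder/subtractor arrangement that the slides do not confirm), and the names `addsub16` and `struct alu_flags` are hypothetical.

```c
#include <stdint.h>

struct alu_flags {
    unsigned z;   /* result is zero            */
    unsigned n;   /* result bit 15 (negative)  */
    unsigned co;  /* carry out of bit 15       */
    unsigned v;   /* signed overflow           */
};

/* 16-bit add (add != 0) or subtract (add == 0).
 * Assumption: subtract is a + ~b + ci, so the caller passes ci = 1
 * for a plain subtract and ci = 0 for a plain add. */
uint16_t addsub16(uint16_t a, uint16_t b, int add, unsigned ci,
                  struct alu_flags *flags)
{
    uint16_t bx   = add ? b : (uint16_t)~b;
    uint32_t wide = (uint32_t)a + bx + (ci & 1u);
    uint16_t sum  = (uint16_t)wide;

    flags->z  = (sum == 0);
    flags->n  = (sum >> 15) & 1u;
    flags->co = (unsigned)(wide >> 16) & 1u;
    /* Overflow: both adder inputs differ in sign from the result. */
    flags->v  = (unsigned)(((a ^ sum) & (bx ^ sum)) >> 15) & 1u;
    return sum;
}
```

With this convention, a long-int add can chain the carry out of the low word into the carry-in of the high word, which is the role the slides give to add with carry.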

[Figure S3: XR16 CPU Datapath Schematic (macro DP16). The sheet is divided into an execution unit, a result mux, and an address/PC unit. Named components: AREGS and BREGS (REGFILE), FWD (M2_16), A (FD16E), B (FD12E4), IMMED (IMM16), ADDSUB (ADSU16), LOGIC (LOGIC16), Z (ZERODET), DOUT (FD16E), PCDISP (PCDISP16), PCINCR (ADD16), PC (RAM16X16S), RET (FD16E), ADDRMUX (M2_16Z), and the 3-state result buffers SUMBUF, LOGICBUF, SLBUF, SRBUF, RETBUF, ZHBUF, UDLDBUF, UDBUF, and LDBUF.]

Datapath Floorplan
[Figure: datapath floorplan. Columns, left to right:]
◆ AREGS, AREG, SLBUF
◆ BREGS, BREG, SRBUF
◆ FWD, A, UDLDBUF, ZHBUF
◆ IMMED, B, LDBUF, UDBUF
◆ LOGIC, DOUT, LOGICBUF
◆ ADDSUB, SUMBUF
◆ PCDISP, Z
◆ ADDRMUX
◆ PCINCR
◆ PC, RET, RETBUF

Control Unit: Pipeline Stages
◆ IF: Instruction Fetch
  ◆ Read instruction at addr; prepare addr ← PC += 2; capture instruction in Instruction Register (IR)
◆ DC: Decode and register file access
  ◆ Decode instruction; read register operands; prepare immediate field; select operands
◆ EX: Execute
  ◆ Add, logic, shift; select result; write to register file
  ◆ call/jal: PC := sum
  ◆ load/store: addr ← sum; stop pipe for access

Control Unit: Pipeline Hazards
◆ Memory wait states
◆ Result use data dependence
◆ Load/store memory access
◆ Jumps and taken branches
◆ Interrupts

Control Unit Elements
◆ Control finite state machine
  ◆ Memory interface controls
◆ Instruction registers (next, DC stage, EX stage)
◆ DC: Instruction decoder
◆ DC: Register file read controls
◆ DC: Immediate operand and forwarding control
◆ EX: ALU and result multiplexer control
◆ EX: Register file write control
◆ DC+EX: Conditional branch control
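
Conditional branch control reduces to a truth function of the condition field and the Z, N, C, V flags left by the preceding add or subtract. The sketch below is one plausible formulation, not the schematic's logic. Two assumptions are made: the enum order simply follows the order in which the branch mnemonics are listed in the instruction set table (the actual 4-bit encoding is not given there), and C is the adder carry out of a + ~b + 1, so C = 1 means no borrow. All names are hypothetical.

```c
/* Branch conditions in the order the slides list the mnemonics.
 * Assumption: this order is illustrative, not a verified encoding. */
enum xr16_cond {
    BR, BRN, BEQ, BNE, BC, BNC, BV, BNV,
    BLT, BGE, BLE, BGT, BLTU, BGEU, BLEU, BGTU
};

/* Returns 1 if the branch is taken for the given flags, else 0.
 * Assumption: flags come from "sub r0,ra,rb" computed as ra + ~rb + 1,
 * so c == 1 means ra >= rb when both are treated as unsigned. */
int xr16_cond_true(enum xr16_cond cond,
                   unsigned z, unsigned n, unsigned c, unsigned v)
{
    unsigned lt = (n ^ v) & 1u;   /* signed less-than */
    z &= 1u;
    c &= 1u;
    v &= 1u;

    switch (cond) {
    case BR:   return 1;
    case BRN:  return 0;
    case BEQ:  return (int)z;
    case BNE:  return (int)!z;
    case BC:   return (int)c;
    case BNC:  return (int)!c;
    case BV:   return (int)v;
    case BNV:  return (int)!v;
    case BLT:  return (int)lt;
    case BGE:  return (int)!lt;
    case BLE:  return (int)(z | lt);
    case BGT:  return (int)(!z & !lt);
    case BLTU: return (int)!c;
    case BGEU: return (int)c;
    case BLEU: return (int)(!c | z);
    case BGTU: return (int)(c & !z);
    }
    return 0;
}
```

For example, after comparing 0xFFFF with 1 the flags are Z=0, N=1, C=1, V=0 under these assumptions, so `blt` is taken (signed -1 < 1) while `bltu` is not.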

[Figure S4: XR16 CPU Control Unit Schematic (macro CTRL16). The sheet is divided into the control state machine (CTRLFSM), the instruction registers (NEXTIR, IR, and EXIR, fed through IRMUX), the instruction decoder (DECODE), DC: operand selection (the RNA and RNB register-number multiplexers), DC: conditional branches (TRUTH), and the execute stage.]


XSOC Project
◆ Make integrated system dev more accessible
  ◆ Build with Xilinx Student Edition ($100)
  ◆ Target XESS XS40-005XL student lab board
  ◆ Textbook, exercises, and tools ($100)
◆ Integrated system:
  ◆ Processor, memory, memory controller, on-chip bus, parallel port, ram, video controller
  ◆ Double-cycle bytewide external SRAM
  ◆ 32 KB SRAM → 576x455 monochrome display

“System-on-a-Chip” in Context
[Figure: board-level block diagram. The XC4005XL FPGA connects to:]
◆ 32 KB SRAM: XA[14:1], XA_0, XD[7:0], RAMNCE, RAMNOE, RAMNWE
◆ VGA port: R1, R0, G1, G0, B1, B0 (through resistors labelled 470 and 1K), NHSYNC, NVSYNC
◆ Parallel port: PAR_D[5:0], PAR_S[6:3]
◆ 12 MHz oscillator: CLK
◆ The FPGA pin list also includes RESET8031

[Figure: XSOC top-level schematic (project XSOC, sheet XSOC, dated 06/07/99). Blocks: P (XR16), MEMCTRL, IRAM (XRAM16X16), PARIN (XIN8), PAROUT (XOUT4), and VGA, plus the external bus and SRAM interface (XA address outputs, XD data pads, RAMNCE, RAMNOE, RAMNWE). The cores share the D[15:0] data bus and the CTRL[15:0] control bus, and each receives its own SEL line.]

On-Chip Bus
◆ 16-bit pipelined on-chip data bus
◆ Bus/memory controller
  ◆ Address decoding, bus controls (output enables, clock enables), external SRAM control
◆ Make core reuse easy: no glue logic required
  ◆ Abstract control signal bus
  ◆ Locally decoded within each core
  ◆ Just add core, attach data, control, and select lines
  ◆ Can evolve bus with new features without invalidating existing designs and cores

Video Controller
◆ 32 KB → 576x455 monochrome bitmap
◆ VGA compatible sync timings
◆ Video signal generation
  ◆ A series of frames, each a series of lines, each a series of pixels
  ◆ Fetch 16-pixel words, shift them out serially
    ◆ DMA read a new word every 8 clocks
  ◆ Drive horizontal and vertical sync to “frame” pixels
    ◆ 2 10-bit counters and 4 10-bit comparators
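
The 576x455 resolution is what fits a one-bit-per-pixel bitmap into the 32 KB SRAM. A short sketch of that arithmetic, plus a word-index helper that assumes simple row-major packing of 16-pixel words starting at word 0 (the slides do not specify the frame buffer layout). All names are hypothetical.

```c
enum {
    XSOC_WIDTH       = 576,  /* pixels per line      */
    XSOC_HEIGHT      = 455,  /* lines per frame      */
    XSOC_WORD_PIXELS = 16    /* pixels per DMA word  */
};

/* Bytes needed for one monochrome frame at 1 bit per pixel. */
unsigned long xsoc_frame_bytes(void)
{
    return (unsigned long)XSOC_WIDTH * XSOC_HEIGHT / 8;
}

/* 16-bit words fetched per displayed line. */
unsigned xsoc_words_per_line(void)
{
    return XSOC_WIDTH / XSOC_WORD_PIXELS;
}

/* Word index holding pixel (x, y).
 * Assumption: row-major packing from word 0, no padding between lines. */
unsigned xsoc_pixel_word(unsigned x, unsigned y)
{
    return y * xsoc_words_per_line() + x / XSOC_WORD_PIXELS;
}
```

The frame needs 32,760 bytes, just under the 32,768 bytes available, and each line is 36 words.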

System Bring Up
◆ 11/98: Proposal, compiler, instruction set, datapath
◆ 12/98: Control unit, simulate CPU, target XS40 board, 1 Hz, 20 MHz
  ◆ Debugging with LEDs and stop: goto stop;
◆ Later added on-chip bus, external SRAM interface, DMA, video, and interrupts
◆ Testing challenge

Recap
◆ Final floorplan
◆ FPGA editor
◆ Demo
  ◆ Source
  ◆ Listing
  ◆ Compile and go

Ongoing Work
◆ Near term
  ◆ Package XSOC/xr16 distribution for beta test ✓
  ◆ Retarget to Verilog ✓
  ◆ Port to Virtex and Altera architectures
◆ Longer term
  ◆ Help build a community
  ◆ xr32 ½ ✓
  ◆ Port GCC/binutils; build eCOS, Linux?
  ◆ Chip multiprocessors

References
◆ FPGAs
  ◆ S. Trimberger, Field-Programmable Gate Array Technology, Kluwer.
◆ LCC
  ◆ C. Fraser and D. Hanson, A Retargetable C Compiler: Design and Implementation, Benjamin/Cummings.
◆ RISC architecture/implementation
  ◆ D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, also Computer Architecture: A Quantitative Approach, Morgan Kaufmann.

Resources
◆ Usenet: comp.arch.fpga, comp.arch
◆ Web
  ◆ www.xilinx.com/products/xc4000XLA.htm
  ◆ www.io.com/~guccione/HW_list.html
  ◆ www.fpgacpu.org
◆ Conferences
  ◆ SIGDA Int’l Symp. on FPGAs, Monterey, Feb.
  ◆ IEEE Symp. on FPGAs for Custom Computing Machines, Napa, April
