@fpgacpu.org) president, gray research llc jan gray in ......xilinx xc4000 fpgas introduced in 1992,...
TRANSCRIPT
Processor and Integrated System DesignProcessor and Integrated System Designin FPGAsin FPGAs
Jan GrayJan GrayPresident, Gray Research LLCPresident, Gray Research LLC
(([email protected])@fpgacpu.org)
lw r12,4(r3)lw r12,4(r3)
Copyright © 1999,2000 Gray Research LLC. All Rights Reserved.
lw r12,4(r3)lw r12,4(r3)
IntroductionIntroduction◆◆ Recent dense programmable logic enablesRecent dense programmable logic enables
practical “desktop processor development”practical “desktop processor development”◆◆ ExamplesExamples
◆◆ RISC4005 (Freidin 1990)RISC4005 (Freidin 1990)◆◆ j32 (1995)j32 (1995)◆◆ XSOC/xr16 (1998)XSOC/xr16 (1998)
◆◆ xr16: simple pipelined 16-bit RISC, 33 MHz, 1.4 CPIxr16: simple pipelined 16-bit RISC, 33 MHz, 1.4 CPI◆◆ XSOC: on-chip bus and peripheralsXSOC: on-chip bus and peripherals
◆◆ Towards a reusable cores community andTowards a reusable cores community andcost-effective FPGA-based systems-on-a-chipcost-effective FPGA-based systems-on-a-chip
lw r12,4(r3)lw r12,4(r3)
OutlineOutline◆◆ Introduction to FPGAsIntroduction to FPGAs
◆◆ Xilinx XC4000XL architectureXilinx XC4000XL architecture◆◆ Development processDevelopment process
◆◆ XSOC/xr16 projectXSOC/xr16 project◆◆ xr16 tools, architecture, datapath, control unitxr16 tools, architecture, datapath, control unit◆◆ XSOC design and implementationXSOC design and implementation
◆◆ Ongoing workOngoing work◆◆ References, ResourcesReferences, Resources
lw r12,4(r3)lw r12,4(r3)
Field ProgrammableField ProgrammableGate ArraysGate Arrays
◆◆ Evolved fromEvolved from PALs PALs,, PLDs PLDs, Complex, Complex PLDs PLDs◆◆ A niche hardware implementation technologyA niche hardware implementation technology
◆◆ Strengths: quick prototyping and time-to-market,Strengths: quick prototyping and time-to-market,reprogrammabilityreprogrammability, relatively easy to use, relatively easy to use
◆◆ Weaknesses: cost, density, speedWeaknesses: cost, density, speed
◆◆ Vendors: Xilinx, Altera,Vendors: Xilinx, Altera, Actel Actel, Atmel, Lucent,, Atmel, Lucent,Cypress, Lattice, Cypress, Lattice, QuickLogicQuickLogic
Technology Gates Speed Cost Part Cost Spin timeCustom VLSI <100M <1 GHz $50K-up $1-up weeksGate array <10M <400 MHz $10K-$1M $1-up days/weeksFPGA <1M <200 MHz $100-$100K $3-$1K minutes/hours
lw r12,4(r3)lw r12,4(r3)
FPGA TechnologiesFPGA Technologies◆◆ Programmable ElementsProgrammable Elements
◆◆ LogicLogic◆◆ combinatorial: lookup tables (LUTs) or gates + muxescombinatorial: lookup tables (LUTs) or gates + muxes◆◆ sequential: flip-flopssequential: flip-flops
◆◆ Interconnect: metal wire segments connected by...Interconnect: metal wire segments connected by...◆◆ pass transistors driven by SRAM or EEPROM bit cellspass transistors driven by SRAM or EEPROM bit cells◆◆ multiplexersmultiplexers◆◆ fuses orfuses or antifuses antifuses (one-time programmable) (one-time programmable)
◆◆ FPGA architecture variationsFPGA architecture variations◆◆ CLB arrays, rows,CLB arrays, rows, macrocells macrocells; fine/coarse grain; fine/coarse grain◆◆ RAM, fast wide adders, TBUFs (3-state buffers)RAM, fast wide adders, TBUFs (3-state buffers)
lw r12,4(r3)lw r12,4(r3)
FPGAs: Easy to UseFPGAs: Easy to Use◆◆ Idealized digital design modelIdealized digital design model
◆◆ No analog EE: ignore gate and wire capacitance,No analog EE: ignore gate and wire capacitance,transistor W/L ratios, etc.transistor W/L ratios, etc.
◆◆ Low skew global nets Low skew global nets →→=No clock skew worriesNo clock skew worries◆◆ Buffered line drivers Buffered line drivers →→=No gate fan-out worriesNo gate fan-out worries
◆◆ Synchronous design “just works”Synchronous design “just works”◆◆ Avoid async. flip-flop state changes, gated clocksAvoid async. flip-flop state changes, gated clocks◆◆ Try to keep everything on one deviceTry to keep everything on one device◆◆ Design, simulate, it Design, simulate, it shouldshould just work just work
◆◆ ...even if you’re just a software type...even if you’re just a software type
◆◆ But, effort to master effective use of resourcesBut, effort to master effective use of resources
lw r12,4(r3)lw r12,4(r3)
FPGA ApplicationsFPGA Applications◆◆ Conventional: cheap, fast turnaround, fieldConventional: cheap, fast turnaround, field
upgradable ASICsupgradable ASICs◆◆ Telecom, networking, XSOC/xr16Telecom, networking, XSOC/xr16
◆◆ (Re-) configurable computing(Re-) configurable computing◆◆ QuickturnQuickturn: fast design simulation/verification: fast design simulation/verification◆◆ Signal, image processing: radar, radioSignal, image processing: radar, radio◆◆ Graphics: Graphics: compositingcompositing, , octtreeocttree rendering rendering◆◆ Military: target dependent correlation/recognitionMilitary: target dependent correlation/recognition◆◆ Cryptography: DES searchCryptography: DES search◆◆ “Hardware” genetic algorithms“Hardware” genetic algorithms
lw r12,4(r3)lw r12,4(r3)
Xilinx XC4000 FPGAsXilinx XC4000 FPGAs◆◆ Introduced in 1992, Xilinx’s third generationIntroduced in 1992, Xilinx’s third generation◆◆ SRAM basedSRAM based◆◆ Device family spans 3,000 to 250,000 “gates”,Device family spans 3,000 to 250,000 “gates”,
1 to 3 ns LUT delays, $10 to $1000 Q1 1 to 3 ns LUT delays, $10 to $1000 Q1◆◆ XSOC uses 3.3V XC4005XLPC84-3XSOC uses 3.3V XC4005XLPC84-3
◆◆ 14x14 Configurable Logic Blocks, 1.6 ns LUT delay14x14 Configurable Logic Blocks, 1.6 ns LUT delay◆◆ 60 bonded-out I/O Blocks60 bonded-out I/O Blocks◆◆ $20 ($3 Q100K)$20 ($3 Q100K)
◆◆ Eclipsed by newer Virtex, Virtex-E familiesEclipsed by newer Virtex, Virtex-E families◆◆ 256x16 SRAM blocks, up to 2M “gates”256x16 SRAM blocks, up to 2M “gates”
lw r12,4(r3)lw r12,4(r3)
XC4000XC4000Configurable Logic BlockConfigurable Logic Block
◆◆ Two 4-input LUTsTwo 4-input LUTs◆◆ Two D flip-flops w/ clock enableTwo D flip-flops w/ clock enable◆◆ Special features:Special features:
◆◆ Writeable LUTs provide 32x1 or 16x2 SRAMWriteable LUTs provide 32x1 or 16x2 SRAM◆◆ 8 CLBs make a 16x16 register file8 CLBs make a 16x16 register file
◆◆ Dedicated fast ripple carry chain hardwareDedicated fast ripple carry chain hardware◆◆ 8 CLBs implement 16-bit adder, ~10 ns delay8 CLBs implement 16-bit adder, ~10 ns delay
◆◆ Fig. 1Fig. 1
lw r12,4(r3)lw r12,4(r3)
XC4000 I/O BlockXC4000 I/O Block◆◆ Input: direct, registeredInput: direct, registered◆◆ Output: direct, registered, 3-stated,Output: direct, registered, 3-stated,
pull-ups/pull-downs, slew rate optionspull-ups/pull-downs, slew rate options◆◆ Fig. 11Fig. 11
lw r12,4(r3)lw r12,4(r3)
XC4000 InterconnectXC4000 Interconnect◆◆ Programmable Interconnect PointsProgrammable Interconnect Points◆◆ Switch matricesSwitch matrices◆◆ Input multiplexersInput multiplexers◆◆ Hierarchical resources for hierarchical circuitsHierarchical resources for hierarchical circuits◆◆ Special featuresSpecial features
◆◆ Long lines - buses and control linesLong lines - buses and control lines◆◆ 3-state buffer per output3-state buffer per output
◆◆ Low skew global clock linesLow skew global clock lines
◆◆ Fig. 12-16Fig. 12-16
lw r12,4(r3)lw r12,4(r3)
FPGA Development ProcessFPGA Development Process◆◆ Design capture: schematics, custom tools,Design capture: schematics, custom tools,
hardware design languages; hardware design languages; floorplan!floorplan!◆◆ Simulate/verifySimulate/verify◆◆ Compile: “technology mapping”, place, routeCompile: “technology mapping”, place, route◆◆ Make config bitstream: XC4005XL: 152K bitsMake config bitstream: XC4005XL: 152K bits◆◆ Download:Download:
◆◆ “master load” from byte-wide or bit-serial ROM“master load” from byte-wide or bit-serial ROM◆◆ “slave load” by microprocessor or“slave load” by microprocessor or XChecker XChecker pod pod
◆◆ Test/Debug; pod can readback internal stateTest/Debug; pod can readback internal state
lw r12,4(r3)lw r12,4(r3)
Talk OutlineTalk Outline◆◆ Introduction to FPGAsIntroduction to FPGAs
◆◆ Xilinx XC4000XL architectureXilinx XC4000XL architecture◆◆ Development processDevelopment process
◆◆ XSOC/xr16XSOC/xr16◆◆ xr16 tools, architecture, datapath, control unitxr16 tools, architecture, datapath, control unit◆◆ XSOC design and implementationXSOC design and implementation
◆◆ DemoDemo◆◆ Ongoing workOngoing work◆◆ References, ResourcesReferences, Resources
lw r12,4(r3)lw r12,4(r3)
XSOC/xr16 Project GoalsXSOC/xr16 Project Goals◆◆ Make integrated system dev more accessibleMake integrated system dev more accessible◆◆ Subject of Subject of Circuit CellarCircuit Cellar article series article series
◆◆ Explain FPGA processor and system designExplain FPGA processor and system design◆◆ Build a fast, simple, thrifty, real CPUBuild a fast, simple, thrifty, real CPU
◆◆ Help launch a free cores development communityHelp launch a free cores development community◆◆ Provide on-chip bus architecture for easy cores reuseProvide on-chip bus architecture for easy cores reuse
◆◆ Show FPGA CPUs can be practical (=cheap)Show FPGA CPUs can be practical (=cheap)◆◆ Not just an ASIC prototyping toolNot just an ASIC prototyping tool
◆◆ Show C compiler supportShow C compiler support◆◆ Show efficient FPGA design “best practices”Show efficient FPGA design “best practices”
lw r12,4(r3)lw r12,4(r3)
Development ToolsDevelopment Tools◆◆ Need C compiler for our nascent 16-bit RISC:Need C compiler for our nascent 16-bit RISC:
port LCC [Fraser and Hanson]port LCC [Fraser and Hanson]◆◆ mipsmips.md .md →→ xr16.md xr16.md◆◆ 32 registers 32 registers →→ 16 registers 16 registers◆◆ sizeof(int)==sizeof(void*)==4 sizeof(int)==sizeof(void*)==4 →→ 2 2◆◆ Misc: branches; long ints; softwareMisc: branches; long ints; software mul mul/div/div
◆◆ AssemblerAssembler◆◆ LexLex, parse, fix far branches, apply fixups, emit, parse, fix far branches, apply fixups, emit
◆◆ Instruction set simulatorInstruction set simulator
lw r12,4(r3)lw r12,4(r3)
LCC Generated CodeLCC Generated Code
◆ typedef struct TN {int k;struct TN *left, *right;
} *T;
T search(int k, T p) {while (p && p->k != k)if (p->k < k)p = p->right;
elsep = p->left;
return p;}
◆ _search: ; r3=k r4=pbr L3
L2: lw r9,(r4)cmp r9,r3 ; p->k < k?bge L5lw r4,4(r4) ; p=p->rightbr L6
L5: lw r4,2(r4) ; p=p->leftL6:L3: mov r9,r4
cmp r9,r0 ; p==0?beq L7lw r9,(r4)cmp r9,r3 ; p->k != k?bne L2
L7: mov r2,r4 ; retval=pL1: ret
lw r12,4(r3)lw r12,4(r3)
XR16 Instruction Set DesignXR16 Instruction Set Design◆◆ Goals and constraintsGoals and constraints
◆◆ Compact 16-bit instructionsCompact 16-bit instructions◆◆ 16 16-bit registers16 16-bit registers◆◆ Cover integer C operationsCover integer C operations◆◆ KISSKISS◆◆ Fast Fast →→ pipelined pipelined →→ regular regular◆◆ Byte addressable Byte addressable →→
load/store bytes/wordsload/store bytes/words◆◆ dispdisp(reg) addressing(reg) addressing◆◆ Long ints Long ints →→ add with carry, add with carry,
shift extendedshift extended◆◆ Scale to 32-bit registersScale to 32-bit registers
◆◆ Common instructionsCommon instructions◆◆ lw sw lw sw mov mov lea call br cmplea call br cmp◆◆ movmov lea cmp lea cmp →→ add/subadd/sub◆◆ dispdisp(reg) addressing (69%)(reg) addressing (69%)
◆◆ Resolved:Resolved:◆◆ 3 operand: add sub addi3 operand: add sub addi◆◆ 4-bit immediates,4-bit immediates,
12-bit immediate prefix12-bit immediate prefix◆◆ no condition codes;no condition codes;
interlocked add/sub+binterlocked add/sub+bcondcond◆◆ fast call/return fast call/return →→ call/jal call/jal◆◆ mulmul/div/shifts in software/div/shifts in software
lw r12,4(r3)lw r12,4(r3)
XR16 Instruction SetXR16 Instruction SetHex Fmt Assembler Semantics N0dab rrr add rd,ra,rb rd = ra + rb; 11dab rrr sub rd,ra,rb rd = ra – rb; 12dai rri addi rd,ra,imm rd = ra + imm; 13d*b rr {and or xor andn adc
sbc} rd,rbrd = rd op rb; 1
4d*i ri {andi ori xori andniadci sbci slli slxisrai srli srxi} rd,imm
rd = rd op imm; 1
5dai rri lw rd,imm(ra) rd = *(int*)(ra+imm); 26dai rri lb rd,imm(ra) rd = *(byte*)(ra+imm); 28dai rri sw rd,imm(ra) *(int*)(ra+imm) = rd; 29dai rri sb rd,imm(ra) *(byte*)(ra+imm) = rd; 2Adai rri jal rd,imm(ra) rd = pc, pc = ra + imm; 3B*dd br {br brn beq bne bc
bnc bv bnv blt bgeble bgt bltu bgeubleu bgtu} label
if (cond) pc += 2*disp8; *
Ciii i12 call func r15 = pc, pc = imm12<<4; 3Diii i12 imm imm12 imm'next15:4 = imm12; 17xxxExxxFxxx
− reserved reserved
lw r12,4(r3)lw r12,4(r3)
Synthesized InstructionsSynthesized InstructionsInstruction Maps tonop and r0,r0mov rd,ra add rd,ra,r0cmp ra,rb sub r0,ra,rbsubi rd,ra,im addi rd,ra,-imcmpi ra,im addi r0,ra,-imcom rd xori rd,-1lea rd,imm(ra) addi rd,ra,immlbs rd,imm(ra)(load-byte,sign-extending)
lb rd,imm(ra)xori rd,0x80subi rd,0x80
j addr jal r0,addrret jal r0,0(r15)
lw r12,4(r3)lw r12,4(r3)
Compromised InstructionsCompromised Instructions
◆◆ No multiply/divideNo multiply/divide◆◆ No barrel shifter: 1-bit shifts onlyNo barrel shifter: 1-bit shifts only◆◆ Jumps and taken branches take 3 cyclesJumps and taken branches take 3 cycles
◆◆ Annul the two branch shadow instructionsAnnul the two branch shadow instructions
◆◆ Loads/stores take at least 2 cyclesLoads/stores take at least 2 cycles◆◆ One shared memory portOne shared memory port
◆◆ Simple interrupt/returnSimple interrupt/return◆◆ No floating pointNo floating point
lw r12,4(r3)lw r12,4(r3)
Design Hierarchy: xr16Design Hierarchy: xr16◆◆ XSOCXSOC
◆◆ xr16xr16◆◆ datapathdatapath◆◆ control unitcontrol unit
◆◆ memctrlmemctrl◆◆ vgavga◆◆ parinparin, , paroutparout, , iramiram
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: [None]Macro: XR16Date: 02/23/100
Figure S2: XR16 CPU Schematic
D
DP16RLOC=R5C0
BRDISP[7:0]
LOGICOP[1:0]
RNA[3:0]
RNB[3:0]
PCE
ADD
BCE15_4
BRANCH
CI
IMMOP[5:0]
A15
LOGICT
RETADT
RFWE
SELPC
SLT
SRT
SRI
SUMT
ZXT
ZEROPC
CLK
Z
ADDR[15:0]
RESULT[15:0]
CO
V
N
FWD
IMM[11:0]
LDT
UDT
UDLDT
DMAPC
PCCE
RETCE
C
CTRL16RLOC=R0C0
A15
Z
N
V
RNA[3:0]
RNB[3:0]
RFWE
IMMOP[5:0]
PCE
BCE15_4
ADD
CI
LOGICOP[1:0]
SUMT
LOGICT
ZXT
SRT
SLT
RETADT
CLK
SELPC
ZEROPC
ACE
SRI
FWD
CO
RDY
IMM[11:0]
READN
WORDN
INSN[15:0]
BRDISP[7:0]
BRANCH
DMA
DBUSN
DMAPC
PCCE
RETCE
IREQ
DMAREQ
ZERODMARNA[3:0]
RNB[3:0]
IMM[11:0]
LOGICOP[1:0]
Z,N,CO,V,A_15
AN[15:0]
BRDISP[7:0]
IMMOP[5:0]
D[15:0]INSN[15:0]
CLK
BRANCH
RFWE
FWD
PCE
BCE15_4
SELPC
ZEROPC
ADD
CI
SUMT
LOGICT
DMA
N
CO
ZXT
SRI
SRT
SLT
RETADT
RDY WORDN
UDLDT
V
A_15A_15
Z
N
CO
V
READN
Z
UDT
LDT
ACE
ZERODMA
DMAPC
PCCE
RETCE
IREQ
DMAREQ
DBUSN
lw r12,4(r3)lw r12,4(r3)
Datapath ResourcesDatapath Resources◆◆ 2 read, 1 write per cycle register file2 read, 1 write per cycle register file◆◆ Immediate operand multiplexerImmediate operand multiplexer◆◆ A and B operand registersA and B operand registers◆◆ Adder, logic unit, shifterAdder, logic unit, shifter◆◆ Result multiplexerResult multiplexer◆◆ Zero, negative, carry, overflow detectionZero, negative, carry, overflow detection◆◆ PC, branch displacement multiplexer, PC adderPC, branch displacement multiplexer, PC adder◆◆ Address multiplexerAddress multiplexer
lw r12,4(r3)lw r12,4(r3)
Design HierarchyDesign Hierarchy◆◆ XSOCXSOC
◆◆ xr16xr16◆◆ datapathdatapath◆◆ control unitcontrol unit
◆◆ memctrlmemctrl◆◆ vgavga◆◆ parinparin, , paroutparout, , iramiram
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: [None]Macro: DP16Date: 02/23/100
EXECUTION UNIT
ADDRESS/PC UNIT
RESULT MUX
Figure S3: XR16 CPU Datapath Schematic
RETBUF
RLOC=R1C9
BUFT16XT
BREGS
REGFILERLOC=R1C1
D[15:0]
A[3:0]
WE
CLK
Q[15:0]
LOGICBUF
RLOC=R1C4
BUFT16XT
SRBUF
RLOC=R1C1
BUFT16XT
IMMED
IMM16RLOC=R1C3
B[15:0]
IR[11:0]
O[15:0]
OP[5:0]
B
FD12E4ERLOC=R1C3
D[15:0] Q[15:0]
CE15_4
CE3_0
CLK
AREGS
REGFILERLOC=R1C0
D[15:0]
A[3:0]
WE
CLK
Q[15:0]
A
FD16ERLOC=R1C2
D[15:0]
CE
CLK
Q[15:0]
ADDSUB
RLOC=R-1C5
ADSU16
A[15:0]
ADD
B[15:0]
CI
COOFL
S[15:0]
LOGIC
LOGIC16RLOC=R1C4
A[15:0]
B[15:0]
OP[1:0]
O[15:0]
SUMBUF
RLOC=R1C5
BUFT16XTBUF
SLBUF
RLOC=R1C0
BUFT16XT
PCINCR
RLOC=R-1C8
ADD16
A[15:0]
B[15:0]
CI
COOFL
S[15:0]
ADDRMUX
M2_16ZRLOC=R1C7
A[15:0]
B[15:0]
SEL
O[15:0]
ZERO
PC
RAM16X16SRLOC=R1C9
A[3:0]
D[15:0]
WE
WCLK
O[15:0]
RET
FD16ERLOC=R1C9
D[15:0]
CE
CLK
Q[15:0]PCDISP
PCDISP16RLOC=R1C6
BRDISP[7:0]
BRANCH
PCDISP[15:0]
ZHBUF
RLOC=R1C2
BUFT8XT
DOUT
FD16ERLOC=R1C4
D[15:0]
CE
CLK
Q[15:0]
FWD
M2_16RLOC=R1C2
A[15:0]
B[15:0]
SEL
O[15:0]
Z
ZERODETRLOC=R1C6
I[15:0]Z
UDLDBUF
RLOC=R5C2
BUFT8XT
UDBUF
RLOC=R1C3
BUFT8XT
LDBUF
RLOC=R5C3
BUFT8XT
RNA[3:0]
RNB[3:0]
AREG[15:0]
BREG[15:0] B[15:0]
AMUX[15:0]
BMUX[15:0]
RETAD[15:0]
A[15:0]
SUM[15:0]
LOGIC[15:0]
PCNEXT[15:0]
IMMOP[5:0]
PCDISP[15:0]
IMM[11:0]
DOUT[7:0]
BRDISP[7:0]
G,G,G,G,G,G,G,G
RESULT[15:8]
RESULT[15:8]
SRI,A[15:1]
RESULT[7:0]
RESULT[15:0]
PC[15:0]
DOUT[15:8] RESULT[7:0]
DOUT[15:8]
LOGICOP[1:0]
DOUT[15:0]
A[14:0],G
ADDR[15:0]
G,G,G,DMAPC
CLK
RFWE
ZXT
CLK
CLK
RFWE
CLK
PCE
SLT
Z
RETADT
PCCE
LDT
GND
SRT
SUMT
LOGICT
CI
ADD
FWD
N
A15
CLK
CO
SELPC
ZEROPC
CLK
CLK
BCE15_4
V
G
SUM15
PCE
BRANCH
PCESRI
UDLDT
UDT
RETCE
DMAPC
lw r12,4(r3)lw r12,4(r3)
Datapath FloorplanDatapath Floorplan
AREG
S, A
REG
, SLB
UF
BREG
S, B
REG
, SR
BUF
FWD
, A, U
DLD
BUF,
ZH
BUF
IMM
ED, B
, LD
BUF,
UD
BUF
LOG
IC, D
OU
T, L
OG
ICBU
F
ADD
SUB,
SU
MBU
F
PCD
ISP,
Z
ADD
RM
UX
PCIN
CR
PC, R
ET, R
ETBU
F
lw r12,4(r3)lw r12,4(r3)
Control Unit: Pipeline StagesControl Unit: Pipeline Stages◆◆ IF: Instruction FetchIF: Instruction Fetch
◆◆ Read instruction at addr; prepare addr Read instruction at addr; prepare addr ←← PC += 2; PC += 2;capture instruction in Instruction Register (IR)capture instruction in Instruction Register (IR)
◆◆ DC: Decode and register file accessDC: Decode and register file access◆◆ Decode instruction; read register operands;Decode instruction; read register operands;
prepare immediate field; select operandsprepare immediate field; select operands
◆◆ EX: ExecuteEX: Execute◆◆ Add, logic, shift; select result; write to register fileAdd, logic, shift; select result; write to register file◆◆ call/jal: PC := sumcall/jal: PC := sum◆◆ load/store: addr load/store: addr ←← sum; stop pipe for access sum; stop pipe for access
lw r12,4(r3)lw r12,4(r3)
Control Unit: Pipeline HazardsControl Unit: Pipeline Hazards◆◆ Memory wait statesMemory wait states◆◆ Result use data dependenceResult use data dependence◆◆ Load/store memory accessLoad/store memory access◆◆ Jumps and taken branchesJumps and taken branches◆◆ InterruptsInterrupts
lw r12,4(r3)lw r12,4(r3)
Control Unit ElementsControl Unit Elements◆◆ Control finite state machineControl finite state machine
◆◆ Memory interface controlsMemory interface controls
◆◆ Instruction registers (next, DC stage, EX stage)Instruction registers (next, DC stage, EX stage)◆◆ DC: Instruction decoderDC: Instruction decoder◆◆ DC: Register file read controlsDC: Register file read controls◆◆ DC: Immediate operand and forwarding controlDC: Immediate operand and forwarding control◆◆ EX: ALU and result multiplexer controlEX: ALU and result multiplexer control◆◆ EX: Register file write controlEX: Register file write control◆◆ DC+EX: Conditional branch controlDC+EX: Conditional branch control
lw r12,4(r3)lw r12,4(r3)
Design HierarchyDesign Hierarchy◆◆ XSOCXSOC
◆◆ xr16xr16◆◆ datapathdatapath◆◆ control unitcontrol unit
◆◆ memctrlmemctrl◆◆ vgavga◆◆ parinparin, , paroutparout, , iramiram
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: [None]Macro: CTRL16Date: 02/23/100
"N.C."
"N.C.""N.C."
"N.C."
INSTRUCTION REGISTERS
Figure S4: XR16 CPU Control Unit Schematic
EXECUTE STAGEDC: OPERAND SELECTION
CONTROL STATE MACHINE
DC: CONDITIONAL BRANCHES
INSTRUCTION DECODER
T2
FD4PEINIT=S
D0
D1
D2
D3
Q0
Q1
Q2
Q3
CE
CLK
AND2B1TRUE
TRUTHRLOC=R2C6
COND[3:0]
Z
N
C
V
TRUE
OR2
OR2
EXIRB
BUF16
I[15:0] O[15:0]
IRB
BUF16
I[15:0]O[15:0]
IRMUX
IRMUXUSE_RLOC=FALSE
A[15:0]
B[15:0]
SEL
INT
O[15:0]
NEXTIRFD16CE
C
CE
CLR
D[15:0] Q[15:0]
RNB
RNMUX4RLOC=R2C1
RA[3:0]
RD[3:0]
EXRD[3:0]
SELR15
SELRD
SELSRC
RN[3:0]
FWD
SELR0
RZERO
AND2
CIFDCE
D
C
CE
CLR
Q
AND2
BUF
DECODE
DECODE
OP[3:0]
PCE
CLK
ADDSUB
SUB
NLW
NLD
NLB
IMM12
IMM4
WORDIMM4
NJAL
EXRESULTS
EXLDST
EXLBSB
EXJALI
SEXTIMM4
BR
EXST
FN[3:0]
NSUM
NLOGIC
NSR
NSL
EXFNSRA
ADCSBC
EXNSUB
EXIMM
ST
JALI
RRRI
EXOP[3:0]
EXJAL
NSUB
DCINTINH
AND2B1
BUF
AND2B1
RNA
RNMUX4RLOC=R2C0
RA[3:0]
RD[3:0]
EXRD[3:0]
SELR15
SELRD
SELSRC
RN[3:0]
FWD
SELR0
RZERO
BUF
AND2B1
BRANCHFDCE
D
C
CE
CLR
Q
IRFD16CE
C
CE
CLR
D[15:0] Q[15:0]
AND3B1
AND3B1
BUF
AND4B2
SRI
AND2T1
FD4PEINIT=S
D0
D1
D2
D3
Q0
Q1
Q2
Q3
CE
CLK
FSM
CTRLFSM
RDY
IREQ
DMAREQ
BRANCH
JUMP
EXLBSB
EXST
ACE
PCE
WORDN
READN
DBUSN
PCCE
RETCE
IF
DMAPC
CLK
ZERODMA
EXLDST
ZEROPC
IFINT
SELPC
DCINTINH
EXANNUL
EXAN
DMA
AND2XNOR2
EXIRFD16CE
C
CE
CLR
D[15:0] Q[15:0]
BUFBUF
AND2B1
IMMB
BUF16
I[15:0]O[15:0]
IR[15:0]IRMUX[15:0]
EXOP[3:0]
IMMOP[5:0]
LOGICOP[1:0]
NIR[15:0] EXOP[3:0],EXRD[3:0],BRDISP[7:0]
IMM[11:0]
IR[11:8]
RNB[3:0]
OP[3:0],RD[3:0],RA[3:0],RB[3:0]
EXIR[15:0]
INSN[15:0]
BRDISP[7:0]
IR[7:4]
RA[3:0]
RD[3:0]
EXRD[3:0]
RNA[3:0]
RB[3:0]
RD[3:0]
EXRD[3:0]
OP[3:0]
IR[11:0]
CI
CALL
NSUB
ADCSBC
FWD
ST
RFWEEXANNUL
BRANCH
RETCE
PCE
ADD
NSUB
CLK
NJAL
NLB
PCE
ZXT
SRT
SLT
RETADT
EXJAL
EXLBSB
BCE15_4
ACE
PCE
EXANNUL
CLK
IR3
EXFNSRA
WORDIMM4
ADDSUB
SUB
A15
IMM_12
EXCALL
SEXTIMM4
EXIMM
DCINTINH
EXJAL
NLW
NLD
IMM_12
IMM_4
PCE
EXRESULTS
SRI
NLB
EXIMM
EXANNUL
IMMOP0
WORDIMM4
NJAL
SEXTIMM4
EXRESULTS
CLK
EXCALL
RRRI
CLK
EXCALL
IR0IMMOP5
RZERO
IF
EXNSUB
RZERO
IREQ
NSL
NLW
NLD
NSR
JUMP
EXIR4
EXIR5
CO
LOGICOP1
PCE
LOGICOP0
JUMP
RRRI
LOGICT
EXST
BR
EXLDST TRUE
CLK
EXRESULTS
ADCSBC
NSR
NSL
NLOGIC
NSUM
CLK
NSUM
NLOGIC
CALL
PCE
IMMOP2
IMM_4
N
Z
BRANCHBR
CO
V
CLK
CLK
CLK
CLK
PCE
CLK
WORDN GND
DMAPC
EXANNUL
ST
EXNSUB
PCE
WORDIMM4
SUMT
EXANNUL
IMM_12
EXAN
IF
BRN
PCE
IMMOP1
IMMOP3
EXFNSRA
IR0
IMM_4
IMMOP4
READN
EXLDST
EXLBSB
EXST
EXAN
DCINTINH
IF
IFINT
DBUSN
SELPC
ZEROPC
DMAPC
PCCERDY
DMA
DMAREQ
ZERODMA
PCE
IFINT
GND
lw r12,4(r3)lw r12,4(r3)
OutlineOutline◆◆ Introduction to FPGAsIntroduction to FPGAs
◆◆ Xilinx XC4000XL architectureXilinx XC4000XL architecture◆◆ Development processDevelopment process
◆◆ XSOC/xr16 projectXSOC/xr16 project◆◆ xr16 tools, architecture, datapath, control unitxr16 tools, architecture, datapath, control unit◆◆ XSOC design and implementationXSOC design and implementation
◆◆ Ongoing workOngoing work◆◆ References, ResourcesReferences, Resources
lw r12,4(r3)lw r12,4(r3)
XSOC ProjectXSOC Project◆◆ Make integrated system dev more accessibleMake integrated system dev more accessible◆◆ Build with Xilinx Student Edition ($100)Build with Xilinx Student Edition ($100)◆◆ Target XESS XS40-005XL student lab boardTarget XESS XS40-005XL student lab board
◆◆ Textbook, exercises, and tools ($100)Textbook, exercises, and tools ($100)
◆◆ Integrated system:Integrated system:◆◆ Processor, memory, memory controller, on-chipProcessor, memory, memory controller, on-chip
bus, parallel port, ram, video controllerbus, parallel port, ram, video controller◆◆ Double-cycle bytewide external SRAMDouble-cycle bytewide external SRAM◆◆ 32 KB SRAM 32 KB SRAM →→ 576x455 monochrome display576x455 monochrome display
lw r12,4(r3)lw r12,4(r3)
“System-on-a-Chip” in Context“System-on-a-Chip” in Context
4701K4701K4701K
R
G
B
VGA PORT
/HSYNC/VSYNC
XC4005XL FPGA
XA[14:1]XA_0
XD[7:0]RAMNCERAMNOERAMNWE
RESET8031
R1R0G1G0B1B0
NHSYNCNVSYNC
CLK
PAR_D[5:0]
PAR_S[6:3]
32 KB SRAM
A[14:1]A0D[7:0]/CS/OE/WE
D[5:0]
S[6:3]
PARALLELPORT
12 MHzOSC.
lw r12,4(r3)lw r12,4(r3)
Design HierarchyDesign Hierarchy◆◆ XSOCXSOC
◆◆ xr16xr16◆◆ datapathdatapath◆◆ control unitcontrol unit
◆◆ memctrlmemctrl◆◆ vgavga◆◆ parinparin, , paroutparout, , iramiram
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: XSOCSheet: XSOCDate: 06/07/99
EXTERNAL BUS AND SRAM INTERFACE
P
XR16RLOC_ORIGIN=R1C3
CLK
ACE
AN[15:0]
RDY
READN
WORDN
D[15:0]
UDT
LDT
INSN[15:0]
DMA
UDLDT
DBUSN
IREQ
DMAREQ
ZERODMA
OPAD
BUFGP
STARTUPSTARTUP
GSR
GTS
CLK
Q3
Q2
Q1Q4
DONEIN
IBUF
OPAD
INV
OBUF
MEMCTRL
MEMCTRL
AN[15:0]
ACE
WORDN
READN
RDY
UDT
LDT
RAMNOE
RAMNWE
XA_0
XDOUTT
UXDT
LXDT
SEL[7:0]
CLK
CTRL[15:0]UDLDT
REQ
ACK
DMA
DBUSN
IREQ
DMAREQ
ZERODMA RESET
OPAD
IRAM
XRAM16X16RLOC_ORIGIN=R7C14
CTRL[15:0]
D[15:0]
SEL
CLK
OPAD
OBUFOPAD
OBUF OPAD
OBUF
OPAD
XAOFDX16
C
CED[15:0] Q[15:0]
OBUF
OBUF
OPAD
IPADIBUF
XDINIFD8
C
D[7:0] Q[7:0]
IBUF8
UXDBUFT8
T
LXDBUFT8
T
OPAD16O[15:0]
XDOUT
OBUFT8T
OBUFOBUF
BUF
OPAD
PARIN
XIN8RLOC_ORIGIN=R11C13
I[7:0]
CTRL[15:0]
D[7:0]
SEL
CLK
BUF
IPAD
IPAD
IPAD
IPAD
IPAD
IPAD
OPAD
OPAD
IBUFIBUFIBUFIBUFIBUF
OPAD
VGA
VGARLOC_ORIGIN=R7C1
CLK
NHSYNC
NVSYNC
R1
R0
G1
G0
B1
B0
PIX[15:0]
REQ
ACK
RESET
OBUFOBUF
OPAD
OPAD
OBUFOBUF
OPAD
OPAD
OBUF
IOPAD8IO[7:0]
PAROUT
XOUT4
CTRL[15:0]
D[7:0]
SEL
CLK
Q0
Q1
Q2
Q3
OPADOBUF4
AN[15:0]
PAR[7:0]
D[7:0]
XDIN[15:0]
XDIN[15:8]
XDIN[7:0]
XD[7:0]
D[7:0]
D[15:8]
CTRL[15:0]
D[15:0]
D[7:0]
XA[15:0]
SEL[7:0]
XDIN[15:0]
D[7:0]
NCLK
WORDN
CLK
READN
CLK
VCC
RAMNOE
CLK
CLK
RESET8031
RAMNWE
RAMNCE
ACE
GND
LXDT
CLKIN
VRESET
VREQ
VRESET
XA_0
XDOUTT
UXDT
NCLK
LXDT
UXDT
CLK
B0
XDOUTTG0
VREQ
SEL1
B1
CLK
G1
SEL0
DMA
PAR_D0
PAR_D1
PAR_D2
PAR_D3
PAR_D4
PAR_D5
PAR_S5
PAR_S4
PAR7
PAR6
PAR5
PAR4
PAR3
PAR2
PAR1
PAR0
CLK
NHSYNC
NVSYNC
R1
R0
CLK
UDLDT
LDT
UDT
RDY
ZERODMA
DMAREQ
IREQ
VACK VACK
DBUSN
SEL2PAR_S6
PAR_S3
GND
lw r12,4(r3)lw r12,4(r3)
On-Chip BusOn-Chip Bus◆◆ 16-bit pipelined on-chip data bus16-bit pipelined on-chip data bus◆◆ Bus/memory controllerBus/memory controller
◆◆ Address decoding, bus controls (output enables,Address decoding, bus controls (output enables,clock enables), external SRAM controlclock enables), external SRAM control
◆◆ Make core reuse easy: Make core reuse easy: no glue logic requiredno glue logic required◆◆ Abstract control signal busAbstract control signal bus◆◆ Locally decoded within each coreLocally decoded within each core◆◆ Just add core, attach data, control, and select linesJust add core, attach data, control, and select lines◆◆ Can evolve bus with new features withoutCan evolve bus with new features without
invalidating existing designs and coresinvalidating existing designs and cores
lw r12,4(r3)lw r12,4(r3)
Video ControllerVideo Controller◆◆ 32 KB 32 KB →→ 576x455 monochrome bitmap576x455 monochrome bitmap◆◆ VGA compatible sync timingsVGA compatible sync timings◆◆ Video signal generationVideo signal generation
◆◆ A series of frames, each a series of lines, each aA series of frames, each a series of lines, each aseries of pixelsseries of pixels
◆◆ Fetch 16-pixel words, shifts them out seriallyFetch 16-pixel words, shifts them out serially◆◆ DMA read a new word every 8 clocksDMA read a new word every 8 clocks
◆◆ Drive horizontal and vertical sync to “frame” pixelsDrive horizontal and vertical sync to “frame” pixels◆◆ 2 10-bit counters and 4 10-bit comparators2 10-bit counters and 4 10-bit comparators
lw r12,4(r3)lw r12,4(r3)
System Bring UpSystem Bring Up◆◆ 11/98: Proposal, compiler, instruction set,11/98: Proposal, compiler, instruction set,
datapathdatapath◆◆ 12/98: Control unit, simulate CPU, target12/98: Control unit, simulate CPU, target
XS40 board, 1 Hz, 20 MHzXS40 board, 1 Hz, 20 MHz◆◆ Debugging with LEDs and stop: Debugging with LEDs and stop: goto goto stop;stop;◆◆ Later added on-chip bus, external SRAMLater added on-chip bus, external SRAM
interface, DMA, video, and interruptsinterface, DMA, video, and interrupts◆◆ Testing challengeTesting challenge
lw r12,4(r3)lw r12,4(r3)
RecapRecap◆◆ Final floorplanFinal floorplan◆◆ FPGA editorFPGA editor◆◆ DemoDemo
◆◆ SourceSource◆◆ ListingListing◆◆ Compile and goCompile and go
lw r12,4(r3)lw r12,4(r3)
Ongoing WorkOngoing Work◆◆ Near termNear term
◆◆ Package XSOC/xr16 distribution for beta test Package XSOC/xr16 distribution for beta test ✓✓◆◆ Retarget to Verilog Retarget to Verilog ✓✓◆◆ Port to Virtex and Altera architecturesPort to Virtex and Altera architectures
◆◆ Longer termLonger term◆◆ Help build a communityHelp build a community◆◆ xr32 xr32 ½ ½ ✓✓◆◆ Port GCC/binutils; build eCOS, Linux?Port GCC/binutils; build eCOS, Linux?◆◆ Chip multiprocessorsChip multiprocessors
lw r12,4(r3)lw r12,4(r3)
ReferencesReferences◆◆ FPGAsFPGAs
◆◆ TrimbergerTrimberger, S., , S., Field-Programmable Gate ArrayField-Programmable Gate ArrayTechnologyTechnology,, Kluwer Kluwer..
◆◆ LCCLCC◆◆ C. Fraser and D. Hanson, C. Fraser and D. Hanson, A Retargetable C Compiler:A Retargetable C Compiler:
Design and ImplementationDesign and Implementation, Benjamin/Cummings., Benjamin/Cummings.
◆◆ RISC architecture/implementationRISC architecture/implementation◆◆ D. Patterson and J. Hennessy, D. Patterson and J. Hennessy, Computer Organization andComputer Organization and
Design: The Hardware/Software Interface, Design: The Hardware/Software Interface, also also ComputerComputerAchitectureAchitecture: A Quantitative Approach: A Quantitative Approach, Morgan , Morgan KaufmannKaufmann
lw r12,4(r3)lw r12,4(r3)
ResourcesResources◆◆ Usenet: comp.arch.Usenet: comp.arch.fpgafpga, comp.arch, comp.arch◆◆ WebWeb
◆◆ www.www.xilinxxilinx.com/products/xc4000XLA.htm.com/products/xc4000XLA.htm◆◆ www.www.ioio.com/~.com/~guccioneguccione/HW_list.html/HW_list.html◆◆ www.www.fpgacpufpgacpu.org.org
◆◆ ConferencesConferences◆◆ SIGDA Int’lSIGDA Int’l Symp Symp. on FPGAs, Monterey, Feb.. on FPGAs, Monterey, Feb.◆◆ IEEE IEEE SympSymp. on FPGAs for Custom Computing. on FPGAs for Custom Computing
Machines, Napa, AprilMachines, Napa, April
Processor and Integrated System DesignProcessor and Integrated System Designin FPGAsin FPGAs
Jan GrayJan GrayPresident, Gray Research LLCPresident, Gray Research LLC
(([email protected])@fpgacpu.org)
lw r12,4(r3)lw r12,4(r3)
Copyright © 1999,2000 Gray Research LLC. All Rights Reserved.