Day 2: VLSI Microprocessor Design Flow
Session A: Circuit design styles
Break
Session B: Design paths
Lunch
Session C: Verification
Break
Session D: Manufacture, fabrication testing, packaging
Today Organized Bottom-Up
• Circuit design style
• Full-custom design path
• Standard cell design path
• RTL design
• Verification strategy
• Packaging
• Manufacture & testing
Important: real designs proceed at all levels simultaneously
T0 Circuit Design Style
Typical design style for modern microprocessor
Datapaths and memories | Control logic
Full-custom layout | Standard cells
Regular structures | Irregular structures
Most of the die area | Most of the complexity
Few design bugs | Most of the design bugs
Mostly hand-specified procedural layout and routing (some hand layout and routing) | Placed and routed automatically
Sometimes exotic circuit designs (dynamic, self-timed) | Conservative static CMOS circuits
T0 Die Breakdown
[Figure: T0 die breakdown, with regions labeled Std. Cell and Full-Custom]
Global Design Style Decisions
Extremely important:
Clock methodology and latch design
Power, ground, and clock distribution
Must be settled early since these affect every circuit on the chip.
T0 Clock and Latch Style
Input clock signal at 2x on-chip frequency (e.g., 80MHz crystal for 40MHz Spert-II board), divided by 2 on-chip to guarantee 50% duty cycle.
Clock buffered up; last stage drives a single clock grid across the entire chip, with <1ns skew across chip and <500ps rise/fall time.
Clock output pad to phase lock external circuitry to T0 clock.
TSPC dynamic latches (T0 has minimum operating frequency).
Also, some special pseudo-static load-enabled latches.
Very similar to Alpha 21064 clocking strategy.
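The divide-by-2 trick can be illustrated with a small sketch. It assumes a sampled 0/1 waveform and a toggle on each rising edge of the 2x clock; the function names and the 30% input duty cycle are made up for illustration and do not describe the actual T0 divider circuit.

```python
def divide_by_two(clk_2x):
    """Toggle the output on every rising edge of the 2x input clock."""
    out, level, prev = [], 0, 0
    for sample in clk_2x:
        if prev == 0 and sample == 1:  # rising edge of the 2x clock
            level ^= 1
        out.append(level)
        prev = sample
    return out


def duty_cycle(wave):
    """Fraction of samples for which the waveform is high."""
    return sum(wave) / len(wave)


# Hypothetical 2x input with a poor 30% duty cycle: 3 samples high, 7 low.
clk_2x = ([1] * 3 + [0] * 7) * 8
clk = divide_by_two(clk_2x)
```

Because the output only changes on rising edges of the input, its high and low times each span one full input period, so the divided clock has a 50% duty cycle whatever the duty cycle of the crystal.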
T0 Clock Distribution
[Figure: 2x clock input, clock buffer, clock grid (in reality hundreds of wires), and clock output]
T0 Latch Style
Standard-cell controller designed with edge-triggered flip-flops
• Only negative edge-triggered flip-flops
• Simpler for state machines
• Simplifies synthesis timing specification
• State stall handled with mux around flip-flop - no clock gating
Full-custom datapaths and memories used transparent latches
• p- and n-type latches transparent on clock low or high respectively
• Can steal time across clock cycle boundaries
• Can place latches in convenient place in signal flow to save area
• Simplifies double-cycling (used in vector register file, some buses)
• Special stallable n-latch (small area without clock gating)
Designed library of latches verified to operate across all process corners with clock skew/rise/fall spec, and when placed in series with other latches.
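The "mux around flip-flop" stall scheme can be sketched behaviorally as follows. This is a minimal model under assumptions: the class name and interface are hypothetical, and only the logical behavior (negative-edge capture, hold-by-recirculation, ungated clock) comes from the slide.

```python
class StallableFlop:
    """Negative-edge-triggered flip-flop with a recirculating mux on D."""

    def __init__(self, init=0):
        self.q = init
        self._prev_clk = 0

    def tick(self, clk, d, stall):
        # The flop sees every falling clock edge; the clock is never gated.
        if self._prev_clk == 1 and clk == 0:
            # Mux in front of D: recirculate Q on a stall, else load d.
            self.q = self.q if stall else d
        self._prev_clk = clk
        return self.q


flop = StallableFlop()
trace = [flop.tick(clk, d, stall) for clk, d, stall in [
    (1, 5, False), (0, 5, False),   # falling edge loads 5
    (1, 7, True), (0, 7, True),     # stalled: 5 is held
    (1, 7, False), (0, 7, False),   # stall released: 7 is loaded
]]
```

Holding state through the data path rather than by gating the clock keeps every flip-flop on the same ungated clock rail, which fits the controlled-skew clock grid described earlier.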
T0 Power/Ground Distribution
Half of all pins were power and ground (204/408).
Chip-on-board packaging gave low-inductance path to board (~1nH per wire).
Grid across whole chip in wide M1 and M2, strapped wherever possible.
Required IR drop less than 5% of Vdd in middle of chip.
On-chip gate oxide decoupling capacitors placed everywhere possible, especially under power rails.
Enough bypass capacitance for <5% power bounce, even if power/ground wires open circuit for one cycle.
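The last requirement can be turned into a rough sizing estimate. If the external supply is open circuit for one cycle, the on-chip capacitance must supply the whole cycle's charge Q = I × T while drooping by no more than 5% of Vdd, so C ≥ Q / (0.05 × Vdd). The sketch below is a back-of-envelope calculation, not the T0 team's method; the 1 A average current is a hypothetical placeholder, and the 5 V supply and 40 MHz clock are assumptions taken from figures mentioned elsewhere in these slides.

```python
def min_bypass_capacitance(i_avg_amps, f_clk_hz, vdd_volts, max_droop_fraction=0.05):
    """Capacitance needed to supply one clock cycle of charge within the droop limit."""
    charge_per_cycle = i_avg_amps / f_clk_hz          # Q = I * T, in coulombs
    allowed_droop = max_droop_fraction * vdd_volts    # volts
    return charge_per_cycle / allowed_droop           # farads


# Hypothetical operating point: 1 A average current, 40 MHz clock, 5 V supply.
c_farads = min_bypass_capacitance(i_avg_amps=1.0, f_clk_hz=40e6, vdd_volts=5.0)
c_nanofarads = c_farads * 1e9
```

With these assumed numbers the estimate comes to about 100 nF, which suggests why decoupling capacitors were placed in every available space.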
T0 Power/Ground Distribution
[Figure: power grid in M1 and M2; bypass capacitors under power rails; additional bypass capacitors in empty space; every other pad is power or ground]
T0 Custom Memories
Instruction cache
• 1KB storage + tags + valid
• Classic 6T SRAM design
• One port: differential write (128b) or differential read (32b)
• 1 word line and 2 bit lines per bit cell
• Special wire to clear all valid bits in one cycle for cache flush
• Fast dynamic tag comparator built into tag sense amps - critical path
Scalar Register File
• 128B storage (32x4B registers)
• Three ports: one differential write plus two single-ended reads
• 3 word lines and 4 bit lines per bit cell
Vector Register File (trickiest piece of circuit design in T0)
• 2KB storage (16x32x4B registers)
• Eight ports: three differential writes on clock low, five single-ended reads on clock high
• Self-timed to generate all timing edges in one cycle
• 5 word lines and 6 bit lines per bit cell
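The word-line and bit-line counts above follow from the port mix, and a short calculation reproduces them. The sharing rule used here (a double-pumped cell reuses the same wires for writes on one clock phase and reads on the other) is an interpretation that happens to match the slide's numbers; it is not taken from the T0 circuit documentation, and the function name is hypothetical.

```python
def cell_wires(diff_ports, single_ended_read_ports, double_pumped=False):
    """Return (word_lines, bit_lines) per bit cell for a given port mix.

    Assumption: each differential port needs 1 word line and 2 bit lines,
    each single-ended read port needs 1 word line and 1 bit line.
    """
    diff_wl, diff_bl = diff_ports, 2 * diff_ports
    read_wl, read_bl = single_ended_read_ports, single_ended_read_ports
    if double_pumped:
        # Writes and reads use opposite clock phases, so wires are shared.
        return max(diff_wl, read_wl), max(diff_bl, read_bl)
    return diff_wl + read_wl, diff_bl + read_bl


icache = cell_wires(diff_ports=1, single_ended_read_ports=0)
scalar_rf = cell_wires(diff_ports=1, single_ended_read_ports=2)
vector_rf = cell_wires(diff_ports=3, single_ended_read_ports=5, double_pumped=True)

scalar_bytes = 32 * 4        # 32 registers of 4 bytes
vector_bytes = 16 * 32 * 4   # 16 vector registers of 32 4-byte elements
```

Without double-pumping, the same eight vector ports would need 8 word lines and 11 bit lines per cell under this model, which illustrates the area saving.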
T0 Datapath Design Style
Select datapath pitch, tradeoff between:
• wasted space for simple cells
• crunched inefficient design for complex cells
Vector unit has 72λ bit pitch (late change from 80λ to fit reticle). Scalar unit has 80λ bit pitch.
Decide on metal layer assignments. Data busses in Metal 1, control/clock/Vdd/GND in Metal 2. Roughly half of datapath bit pitch is used for busses passing by cell.
Design library of datapath cells (mostly latches and muxes). Special cells created where needed (maybe 5% are special).
Mostly static CMOS logic and static pass-transistor logic; some critical places use dynamic logic:
• Adder carry-chains
• Branch zero comparator
• Saturation overflow comparators
T0 Datapath Latch Designs
Latches mostly dynamic TSPC plus holders (a la 21064)
[Figure: n-latch and p-latch schematics with transistor sizes; signals D, Q, PHI, X]
Special Pseudo-Static n-Latch
Restrictive enable control line timing caused problems later
[Figure: pseudo-static n-latch schematic with transistor sizes; signals D, Q, PHI, LEN, LENB, X]
T0 Datapath Mux Designs
Muxes n-pass-transistor with level restoring p-transistor:
[Figure: 3-input mux schematic with transistor sizes; inputs A, B, C; selects ASEL, BSEL, CSEL; output OUT]
Example Datapath Layout
T0 Standard Cell Designs
Started with public domain library, but hand-inspected each cell and threw away/redesigned bad cells
• Some cells had too many series transistors or bad output driver
Changed every cell to have much wider power/ground rails
• To avoid IR drop in middle of long standard cell row
Added separate clock rail into every cell
• Fits into overall clock gridding scheme
• Ensures controlled skew on clock (don’t want clock auto-routed!)
Designed our own standard cell flip-flops and latches
• Connects to special clock rail - uses our clocking methodology
• Latches used to synchronize with datapath signals
Added greater variety of inverters and buffers
• Existing buffers not big enough to drive loads on our chip
• More flexibility for synthesis to trade area and delay
T0 Pads
Pad design is especially tricky
Many esoteric device structures used to provide protection against latch up and ESD damage
Obtained HP’s design guidelines under NDA
Designed custom pads using most of HP’s recommendations for pad protection
Pad output drivers used n-type pullup to reduce power consumption - output only swings to ~4V not 5V
Separate power supply rings for output drivers and core logic
Summary
T0 circuit design mostly conservative, low risk
Robustness engineered into all cells and overall design
Only a few tricks where big wins possible
• Fast dynamic datapath logic to shorten critical paths
• Double-pumped vector register file to save area
• Novel output drivers to reduce power
Day 2, Session B: Design Paths
Full-custom
Standard cell
Final global checks
Full-Custom Tools
Pre-existing tools used:
• Viewlogic schematic editor (commercial)
• Magic layout editor and extraction (university)
• HSpice circuit simulator (commercial)
• CAzM table-driven circuit simulator (university, now commercial)
• irsim switch-level simulator (university)
• gemini layout versus schematic compare (university)
• Dracula design rule checker (commercial)
In-house tools:
• flat SPICE netlist flattener/processor
• tilem procedural layout generator
Full-Custom Design Process
Initial specification with high-level schematic plus verbal communication (most full-custom work done before RTL finished)
Design loop:
• Viewlogic schematic design (functionality and transistor sizing)
• Timing simulations with HSpice
• Functionality simulations with irsim
• magic layout
• Extractions with magic (get real parasitics - feed back into schematic)
Iterate until design goals met.
Clock cycle initially fixed at <50MHz to prevent overoptimization.
Example Viewlogic Schematic
(I-Cache SRAM bit)
[Figure: SRAM bit-cell schematic with transistor sizes; signals IBIT, IBITB, BIT, BITB, RSEL]
Example magic Layout
(Two halves of SRAM cache bits)
Standard Cell Design Path
Initial RTL (Register Transfer Level) in C++
Each RTL control block manually translated into BDS
• BDS, a limited, combinational-circuit-only hardware description language
bdsyn compiles BDS into blif (Berkeley Logic Interchange Format)
blif optimized and synthesized into gates using sis
Gate netlist input to TimberWolf place and route. Also, generate Viewlogic schematic from gate netlist.
RTL Model
RTL (Register Transfer Level) design in C++.
RTL model is “golden reference” for whole T0 design.
Models state in every latch on every clock phase.
Ran at 1,500 cycles/second on Sparcstation-20/61.
100-1000 times faster than Verilog or VHDL RTL model.
(More on RTL in next session)
BDS Blocks
C++ RTL control logic was manually split into about 20 blocks that the synthesis tool could handle (by trial and error).
Each control block manually translated into equivalent BDS.
Example BDS code (piece of JTAG block):
    routine run_tdo;
      state tdo<7:0>;
      if tapcin<3> then tdo = regioin
      else if iregin<3> then tdo = regioin
      else tdo = memioin;
      tdob = not tdo;
    endroutine;
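For readers unfamiliar with BDS, the same selection logic can be written as a small function. This is a reading of the fragment under assumptions: `<3>` is taken to select bit 3 of a signal, `tdo` is taken to be 8 bits wide as declared, and `not` is taken to be a bitwise complement. The Python function is illustrative only and is not part of the T0 flow.

```python
def run_tdo(tapcin, iregin, regioin, memioin):
    """Combinational model of the run_tdo BDS routine (assumed semantics)."""
    if (tapcin >> 3) & 1:        # tapcin<3>
        tdo = regioin
    elif (iregin >> 3) & 1:      # iregin<3>
        tdo = regioin
    else:
        tdo = memioin
    tdo &= 0xFF                  # tdo<7:0> is 8 bits wide
    tdob = ~tdo & 0xFF           # tdob = not tdo
    return tdo, tdob


selected_reg = run_tdo(tapcin=0b1000, iregin=0, regioin=0xA5, memioin=0x3C)
selected_mem = run_tdo(tapcin=0, iregin=0, regioin=0xA5, memioin=0x3C)
```

Because the routine has no state of its own, it maps directly onto the combinational-only subset that BDS supports.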
Synthesis with sis
Each BDS block was translated into logic equations in blif. Also, had to create timing specs for each block.
Optimized and synthesized by sis (Berkeley synthesis package)
Two basic synthesis scripts created:
• target minimal area
• target minimal delay
Some critical blocks were tuned with own custom synthesis scripts.
Synthesis could sometimes take infinite time or infinite memory. => had to split blocks further or rewrite script.
Place and Route
Synthesized blocks connected by schematic. Entire control unit then extracted into single gate netlist.
Place and route using TimberWolf (simulated annealing).
Had to fix TimberWolf to fit control into non-rectangular space.
Placed outer loop around entire place and route run to iterate parameters.
Last piece of T0 design: 3 months of CAD hacking after everything else finished!
Final place and route took 1 week on Sparc-20/61.
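To give a feel for simulated-annealing placement with an outer parameter loop, here is a toy sketch. It places cells in a single row and minimizes total net span. It is not TimberWolf's algorithm or cost function; every name, the cooling schedule, and the swept parameter are assumptions made for illustration.

```python
import math
import random


def wirelength(order, nets):
    """Sum over nets of the distance between the leftmost and rightmost cell."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net) for net in nets)


def anneal(cells, nets, seed=0, t_start=5.0, t_end=0.01, alpha=0.95, moves_per_t=50):
    """Swap-based simulated annealing; returns the best (order, cost) seen."""
    rng = random.Random(seed)
    order = list(cells)
    cost = wirelength(order, nets)
    best_order, best_cost = order[:], cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            i, j = rng.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]
            delta = wirelength(order, nets) - cost
            # Always accept improvements; accept uphill moves with probability exp(-delta/t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost += delta
                if cost < best_cost:
                    best_order, best_cost = order[:], cost
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
        t *= alpha
    return best_order, best_cost


def sweep(cells, nets, alphas=(0.8, 0.9, 0.95)):
    """Outer loop: rerun the annealer with different cooling rates, keep the best."""
    return min((anneal(cells, nets, alpha=a) for a in alphas), key=lambda r: r[1])


cells = list("abcdefgh")
nets = [("a", "h"), ("b", "g"), ("c", "f"), ("d", "e"), ("a", "d")]
initial_cost = wirelength(cells, nets)
best_order, best_cost = sweep(cells, nets)
```

Each annealing run is a black box with tunable parameters, so wrapping it in an outer loop is a simple way to search those parameters, at the price of run time multiplied by the number of settings tried.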
Example Stdcell Layout
Static Timing Analysis
Find critical paths in control logic and datapath interface.
Manual database of signal timing specs. Scripts extracted RC delays of long wires from layout.
Timing script considered:
• synthesis predicted timing
• output drive capability
• wire capacitive load
• wire RC delay
• input timing specs
Fixed any timing violations found by:
• changing control logic
• changing datapaths
• changing wires (fatter wires for lower RC)
Gave up with 33MHz predicted cycle time (very conservative)
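The factors the timing script considered can be combined in a simple arrival-time estimate. The sketch below uses a lumped driver resistance plus an Elmore-style delay for a distributed wire; this particular formula and all the numbers are assumptions for illustration, not the contents of the T0 scripts.

```python
def path_arrival_ns(input_arrival_ns, gate_delay_ns,
                    r_drive_ohm, c_load_pf, r_wire_ohm, c_wire_pf):
    """Estimate signal arrival time at the end of a long wire.

    1 ohm * 1 pF = 1e-3 ns, hence the scale factor.
    """
    drive_delay = r_drive_ohm * (c_wire_pf + c_load_pf) * 1e-3      # driver charging wire + load
    wire_delay = r_wire_ohm * (c_wire_pf / 2 + c_load_pf) * 1e-3    # Elmore delay of distributed wire
    return input_arrival_ns + gate_delay_ns + drive_delay + wire_delay


def slack_ns(arrival_ns, cycle_ns, setup_ns):
    """Positive slack means the path meets timing."""
    return cycle_ns - setup_ns - arrival_ns


# Hypothetical path: input valid at 5 ns, 8 ns of synthesized logic, then a long wire.
thin = path_arrival_ns(5.0, 8.0, r_drive_ohm=1000, c_load_pf=0.5, r_wire_ohm=2000, c_wire_pf=2.0)
# A fatter wire: half the resistance, somewhat more capacitance (assumed values).
fat = path_arrival_ns(5.0, 8.0, r_drive_ohm=1000, c_load_pf=0.5, r_wire_ohm=1000, c_wire_pf=2.5)
```

In this made-up case the fatter wire arrives earlier (17.75 ns versus 18.5 ns) even though its capacitance is higher, which is the trade behind "fatter wires for lower RC".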
Critical Paths
1) Host performing DMA at same time as indexed load/store
• Have to drive long stall wire with bad RC delay into static latches with difficult timing constraints. No time to change latches.
2) Branches/I-cache
• BEQ/BNE instructions need XOR plus zero comparator fed to instruction cache fetch in same cycle. Could solve with branch prediction.
3) Address generator/new-old instruction
• Many possible ways to load address generator input latches depending on current/next instruction vector/scalar. Could fix with more pipelining.
Design Rule Checks (DRC)
Magic performs dynamic DRC during layout entry
Also, at each level of the design hierarchy after procedural layout.
Final layout also DRC checked using Dracula, found a few minor bugs.
Problems with CAD Tools
Bugs
• many features don’t work - fixed a lot ourselves
Limitations
• size --- had to recompile with bigger constants (if source available)
• signal naming --- had to stick to a-z (one case) and 0-9 (no underscores)
• don’t handle hierarchy --- had to “flatten” circuits on the fly (used Unix pipe)
Bad Design
• often obvious that author never built real chips (useless features esp. GUIs)
• too automatic, can’t control what happens (“take it or leave it” tool)
• requires bulky, constrictive framework
• wouldn’t work in script or Makefile
• awkward binary data formats
(Commercial tools no better than university tools, sometimes worse)
Q: What Was Best CAD Tool?
A: Unix Development Environment!
Sed/Awk/Perl used extensively for format conversion
RCS used for revision control
Shell scripts/Makefiles to automate processes
Pipes used for on-the-fly netlist flattening, test vector generation
Design Path Summary
Many tools (>50), many data formats
Over half of the total T0 design effort was spent on CAD tools! (We began project intent on not developing any new tools)
Built design flow over several projects in group.
Continually added new tools/methodologies. (Many candidate tools tried and abandoned)
Biggest gaps:
• good HDL --- would use Verilog now
• good static timing analysis --- would use TimeMill now
Day 2, Session C: Verification
Levels of Design Representation
ISA: semantics of instruction set (ISA interpreter)
RTL: state on each cycle (RTL simulator)
Schematic: transistors (irsim switch-level simulator)
Layout: mask layers
Real chip: fabricated silicon
Verification Framework
Defined set of “virtual machines”, each defining, for valid test programs, the allowable:
• registers
• instructions
• exceptions
• memory regions
• whether cycle accurate
• and form of test result communication
Virtual machines:
• mips: MIPS-II scalar instruction set
• t0u: T0 user level instruction set
• t0raw: T0 user+kernel instruction set
• t0cyc: T0 user+kernel instruction set + cycle accurate
• t0die: T0 raw with no SRAM (wafer test)
• t0diecyc: T0 raw with no SRAM + cycle accurate (wafer test)
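One way to picture the virtual machine definitions is as a small table of records. The sketch below encodes only the properties stated on the slide; the class and field names are hypothetical, and `has_sram=True` for the non-die machines is an assumption (the slide only says that the two wafer-test machines have no SRAM).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualMachine:
    name: str
    instruction_set: str
    cycle_accurate: bool
    has_sram: bool  # assumed True unless the slide says "no SRAM"


VIRTUAL_MACHINES = {vm.name: vm for vm in [
    VirtualMachine("mips", "MIPS-II scalar", cycle_accurate=False, has_sram=True),
    VirtualMachine("t0u", "T0 user", cycle_accurate=False, has_sram=True),
    VirtualMachine("t0raw", "T0 user+kernel", cycle_accurate=False, has_sram=True),
    VirtualMachine("t0cyc", "T0 user+kernel", cycle_accurate=True, has_sram=True),
    VirtualMachine("t0die", "T0 user+kernel", cycle_accurate=False, has_sram=False),
    VirtualMachine("t0diecyc", "T0 user+kernel", cycle_accurate=True, has_sram=False),
]}


def wafer_test_machines():
    """Virtual machines usable on a bare die, which has no external SRAM."""
    return sorted(name for name, vm in VIRTUAL_MACHINES.items() if not vm.has_sram)


def cycle_accurate_machines():
    return sorted(name for name, vm in VIRTUAL_MACHINES.items() if vm.cycle_accurate)
```

Tagging each test program with one of these machines lets a test harness decide mechanically which platforms can run it.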
Virtual Machine Execution Platforms
Platforms: SGI R4K Indigo, T0 ISA Interpreter, T0 RTL Simulator, Bare T0 Die, Spert-II Board
mips X X X X
t0u X X X
t0raw X X X
t0cyc X X
t0die X X X X
t0diecyc X X X
Example Test Program

    /* Simple test of cpu and memory system life. */

    #include <t0test.h>

    TEST_MIPS               # Type of virtual machine
    TEST_CODEBEGIN          # Begin test program

    life_test:
            lw      $2, life_test_dat
            addi    $2, 1
            sw      $2, life_test_res

    exit:
    TEST_CODEEND            # End test program

            .data
    TEST_DATABEGIN          # Begin data region for test input and result data.
    life_test_dat:  .word 41
    life_test_res:  .word 0xffffffff
    TEST_DATAEND            # End data region for test input and result data.
mips Test Compilation and Execution
[Figure: mips test flow, by stage]
Test Program: test.S
Test Compilation: testbuild-spert, testbuild-iris
Test Executables: test.spert, a.out
Test Run: t0isatest, t0rtltest, SGI Indigo R4000 (Irix)
Test Results: isa.mem, rtl.mem, iris.mem
Producing Switch-Level Test Vectors
[Figure: flow for producing switch-level test vectors]
Test program (test.S) -> testbuild-spert -> test executable (test.spert) -> t0cpudptr (test rig based on RTL model) -> test vectors (test.irsim)
Workview schematic (sch/cell.1 files) -> wspice -> flatspice -> spice2sim -> schematic netlist (cpudp.sim)
irsim runs the test vectors on the schematic netlist: assertion failures?
Simulation Speeds (Cycles/Second)
ISA | 500,000 (inst/second)
RTL | 1,100
Schematic/Layout (IRSIM) | 0.05
Fabbed chip | 45,000,000
R4000 (MIPS-II only) | 100,000,000
(Simulation speeds measured on Sparcstation-10/51)
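The gap between these speeds is easier to appreciate as wall-clock time. The short calculation below asks how long each simulator would need to cover one second of real chip operation, using the cycles/second figures from this slide; the ISA interpreter is left out because its figure is in instructions rather than cycles. The variable names are illustrative.

```python
# Cycles per second, taken from the slide.
SPEEDS = {
    "RTL simulator": 1_100,
    "IRSIM switch-level": 0.05,
    "Fabricated chip": 45_000_000,
}


def seconds_to_simulate(cycles, level):
    """Wall-clock seconds needed to run the given number of cycles at a level."""
    return cycles / SPEEDS[level]


one_chip_second = SPEEDS["Fabricated chip"]  # cycles executed by the chip in 1 s

rtl_hours = seconds_to_simulate(one_chip_second, "RTL simulator") / 3600
irsim_years = seconds_to_simulate(one_chip_second, "IRSIM switch-level") / (3600 * 24 * 365)
```

On these figures one second of chip time costs roughly 11 hours of RTL simulation and roughly 28 years of switch-level simulation, which helps explain why most test programs were run at the RTL level and only selected vectors at the switch level.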
Test Programs
Two classes:
• Design verification (does the RTL implement a vector micro?)
• Fabrication testing (does the fabricated chip meet specs?)
These classes do not necessarily overlap:
• Verification tests don’t necessarily exercise all paths in circuits (e.g., all adder propagations or SRAM data retention)
• Fab tests won’t tell if RTL has design bug, only that chip matches buggy RTL.
Directed + Random Tests
Hand-written directed tests for specific functions.
• Make sure all single events covered
• Some events very hard to generate randomly
• Very time-consuming
Randomly-generated tests for greater coverage
• Good at finding bugs in combinations of events
• Fast way of generating lots of test code
• Difficult to randomly generate valid virtual machine test code
Hand-Written Tests
Nearly 100,000 lines of hand-written assembly test code!(Includes both design verification and fab test code)
Virtual machine | Programs | Lines
mips | 107 | 14,293
t0u | 199 | 57,992
t0raw | 63 | 17,512
t0cyc | 3 | 537
t0die | 44 | 6,490
t0diecyc | 1 | 173
Total | 417 | 96,997
Random Program Generation
First attempt: rantor
rantor incrementally generates random test program, one instruction at a time
Problems:
• Only instruction-by-instruction random - can’t generate instruction sequences
• Difficult to guarantee that random code obeys virtual machine limitations
rantor found quite a few RTL bugs initially, but eventually most bugs were found to be in rantor-generated test programs.
Second Attempt: torture
1) User builds library of random test code sequence generators.
2) torture core randomly selects sequence generator.
3) Sequence generator builds random instruction sequence with virtual registers (both visible and invisible in final test state).
4) torture interleaves multiple sequences, randomly allocating virtual registers to random physical registers.
Test programs guaranteed to obey virtual machine constraints.
All written as C++ class library.
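The four steps can be sketched in a few dozen lines. The original was a C++ class library; the Python below is only an illustration of the idea, and the generator functions, instruction tuples, and register names are all hypothetical. Each sequence gets its own disjoint set of physical registers, so interleaving cannot change what any one sequence computes.

```python
import random


def add_chain(rng):
    """Hypothetical generator: dependent adds on virtual registers v0, v1."""
    return [("addiu", "v0", "v0", str(rng.randrange(1, 100))),
            ("addu", "v1", "v0", "v0")]


def logic_pair(rng):
    """Hypothetical generator: dependent logic ops on virtual registers v0, v1."""
    return [("ori", "v0", "v0", str(rng.randrange(1, 256))),
            ("xor", "v1", "v0", "v0")]


GENERATORS = [add_chain, logic_pair]


def torture(n_sequences, seed, physical_regs):
    rng = random.Random(seed)
    free = list(physical_regs)
    rng.shuffle(free)                      # random virtual-to-physical allocation
    sequences = []
    for _ in range(n_sequences):
        seq = rng.choice(GENERATORS)(rng)  # core randomly selects a generator
        virtuals = sorted({op for ins in seq for op in ins[1:] if op.startswith("v")})
        mapping = {v: free.pop() for v in virtuals}  # raises IndexError if registers run out
        sequences.append([(ins[0],) + tuple(mapping.get(op, op) for op in ins[1:])
                          for ins in seq])
    program = []
    while any(sequences):                  # interleave, keeping order within each sequence
        seq = rng.choice([s for s in sequences if s])
        program.append(seq.pop(0))
    return program


regs = [f"${i}" for i in range(2, 10)]
program = torture(n_sequences=3, seed=1, physical_regs=regs)
```

Since each generator only emits instructions that are legal for the target virtual machine, and the core only renames registers and reorders independent sequences, the legality of the whole program follows from the legality of its parts.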
Random Environment Events
Run test programs on simulated machine both in quiet environment and also:
with random host and timer interrupts
with random host DMA I/O and scan-chain activity
Random Testing Results
Billions of RTL cycles run on network of workstations at ICSI (continuously running over several months on 4-20 workstations).
Highly successful: 26 bugs found through random tests.
Every bug found by random testing was added to the regression tests.
LVS: Static Netlist Comparison
[Figure: at the schematic and layout levels, gemini LVS checks whether the schematic netlist equals the layout netlist produced by magic extract]
Verification Summary
Intensive effort
Highly successful
No known logic bugs in first-pass silicon!
Day 2, Session D: Manufacture, Testing, Packaging
Manufacturing Path
T0 fabbed by Hewlett-Packard via MOSIS
Wafers delivered to test house
Wafer sort to select good die for bonding
Good die bonded to Spert-II boards (Chip-On-Board packaging)
Bare die on Spert-II board tested at ICSI
Assembly house surface-mounted components to good boards
Final whole board assembly and test at ICSI