Download - Day 2hpc.ac.upc.edu/Talks/dir15/T000015/slides.pdfRoughly half of datapath bit pitch is used for busses passing by cell. Design library of datapath cells (mostly latches and muxes)

Day 2: VLSI Microprocessor Design Flow

Session A: Circuit design styles

Break

Session B: Design paths

Lunch

Session C: Verification

Break

Session D: Manufacture, fabrication testing, packaging

Today Organized Bottom-Up
• Circuit design style
• Full-custom design path
• Standard cell design path
• RTL design
• Verification strategy
• Packaging
• Manufacture & testing

Important: real designs proceed at all levels simultaneously

T0 Circuit Design Style

Typical design style for modern microprocessor

Datapaths and memories | Control logic
Full-custom layout | Standard cells
Regular structures | Irregular structures
Most of the die area | Most of the complexity
Few design bugs | Most of the design bugs
Mostly hand-specified procedural layout and routing (some hand layout and routing) | Placed and routed automatically
Sometimes exotic circuit designs (dynamic, self-timed) | Conservative static CMOS circuits

T0 Die Breakdown

[Figure: T0 die, with Std. Cell and Full-Custom regions labeled]

Global Design Style Decisions

Extremely important:

Clock methodology and latch design

Power, ground, and clock distribution

Must be settled early since these affect every circuit on the chip.

T0 Clock and Latch Style
Input clock signal at 2x on-chip frequency (e.g., 80MHz crystal for 40MHz Spert-II board) divided by 2 on-chip to guarantee 50% duty cycle.

Clock buffered up, last stage drives single clock grid across entire chip, <1ns skew across chip, <500ps rise/fall time.

Clock output pad to phase lock external circuitry to T0 clock.

TSPC dynamic latches (T0 has minimum operating frequency).

Also, some special pseudo-static load-enabled latches.

Very similar to Alpha 21064 clocking strategy.
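The divide-by-2 point can be illustrated with a small behavioural model. This is only a sketch: the function names and the sample-based waveform representation are assumptions made for illustration, not part of the T0 design. It shows that toggling on each rising edge gives a 50% duty cycle whatever the duty cycle of the input clock.

```python
def make_clock(period, high_time, n_periods):
    """Sampled input clock: 1 for the first `high_time` samples of each period."""
    return [1 if (t % period) < high_time else 0
            for t in range(period * n_periods)]

def divide_by_two(clk):
    """Toggle the output on every rising edge of the input clock."""
    out, q, prev = [], 0, 0
    for level in clk:
        if prev == 0 and level == 1:   # rising edge of the 2x clock
            q ^= 1
        out.append(q)
        prev = level
    return out

def duty_cycle(wave):
    """Fraction of samples during which the waveform is high."""
    return sum(wave) / len(wave)

# Hypothetical input clock with a lopsided 30% duty cycle.
skewed = make_clock(period=10, high_time=3, n_periods=8)
divided = divide_by_two(skewed)
print(duty_cycle(skewed), duty_cycle(divided))
```

The output period is two input periods, so the on-chip clock runs at half the crystal frequency, as in the 80MHz to 40MHz example.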

T0 Clock Distribution

[Figure: 2x clock input, clock buffer, clock grid (in reality hundreds of wires), clock output]

T0 Latch Style
Standard-cell controller designed with edge-triggered flip-flops
• Only negative edge-triggered flip-flops
• Simpler for state machines
• Simplifies synthesis timing specification
• State stall handled with mux around flip-flop - no clock gating

Full-custom datapaths and memories used transparent latches
• p- and n- type latches transparent on clock low or high respectively
• Can steal time across clock cycle boundaries
• Can place latches in convenient place in signal flow to save area
• Simplifies double-cycling (used in vector register file, some buses)
• Special stallable n-latch (small area without clock gating)

Designed library of latches verified to operate across all process corners with clock skew/rise/fall spec, and when placed in series with other latches.

T0 Power/Ground Distribution
Half of all pins were power and ground (204/408)

Chip-on-board packaging gave low-inductance path to board (~1nH per wire)

Grid across whole chip in wide M1 and M2 strapped wherever possible.

Required IR drop less than 5% of Vdd in middle of chip.

On-chip gate oxide decoupling capacitors placed everywhere possible, especially under power rails.

Enough bypass capacitance for <5% power bounce, even if power/ground wires open circuit for one cycle.
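The bypass capacitance requirement can be turned into a rough sizing rule. The sketch below assumes the chip draws a constant average current for one cycle from the on-chip capacitance alone, so the rail drops by dV = I * T / C. The function name and all numeric values are hypothetical; the slides do not give T0's supply current.

```python
def min_bypass_capacitance(avg_current_a, cycle_time_s, vdd_v, max_bounce_frac=0.05):
    """Smallest capacitance (farads) that keeps the rail within
    max_bounce_frac * Vdd when it alone supplies avg_current_a for one cycle.
    Derived from dV = I * T / C with a constant-current load (an assumption)."""
    return avg_current_a * cycle_time_s / (max_bounce_frac * vdd_v)

# Hypothetical numbers, chosen only to illustrate the arithmetic:
# 1 A average current, 25 ns cycle (40 MHz), 5 V supply, 5% allowed bounce.
c_farads = min_bypass_capacitance(avg_current_a=1.0, cycle_time_s=25e-9, vdd_v=5.0)
print(f"{c_farads * 1e9:.0f} nF")
```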

T0 Power/Ground Distribution

[Figure: power grid in M1 and M2; bypass capacitors under power rails; additional bypass capacitors in empty space; every other pad is power or ground]

T0 Custom Memories
Instruction cache
• 1KB storage + tags + valid
• Classic 6T SRAM design
• One port: differential write (128b) or differential read (32b)
• 1 word line and 2 bit lines per bit cell
• Special wire to clear all valid bits in one cycle for cache flush
• Fast dynamic tag comparator built into tag sense amps - critical path

Scalar Register File
• 128B storage (32x4B registers)
• Three ports: one differential write plus two single-ended reads
• 3 word lines and 4 bit lines per bit cell

Vector Register File (trickiest piece of circuit design in T0)
• 2KB storage (16x32x4B registers)
• Eight ports: three differential writes on clock low, five single-ended reads on clock high
• Self-timed to generate all timing edges in one cycle
• 5 word lines and 6 bit lines per bit cell
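The word line and bit line counts for the two register files follow from the port counts. The sketch below is one way to reproduce them, assuming a differential port needs two bit lines, a single-ended port needs one, and (for the vector register file) reads and writes reuse the same wires on opposite clock phases. The class and field names are hypothetical, and the wire-sharing rule is an interpretation of the slide rather than something it states outright.

```python
from dataclasses import dataclass

@dataclass
class RegFile:
    name: str
    storage_bytes: int
    diff_write_ports: int
    single_ended_read_ports: int
    shared_wires: bool  # True if reads and writes reuse wires on opposite clock phases

    def bit_lines(self):
        write = 2 * self.diff_write_ports        # differential: two wires per port
        read = self.single_ended_read_ports      # single-ended: one wire per port
        return max(write, read) if self.shared_wires else write + read

    def word_lines(self):
        w, r = self.diff_write_ports, self.single_ended_read_ports
        return max(w, r) if self.shared_wires else w + r

scalar = RegFile("scalar", 32 * 4, 1, 2, shared_wires=False)
vector = RegFile("vector", 16 * 32 * 4, 3, 5, shared_wires=True)
for rf in (scalar, vector):
    print(rf.name, rf.storage_bytes, "bytes,",
          rf.word_lines(), "word lines,", rf.bit_lines(), "bit lines")
```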

T0 Datapath Design Style
Select datapath pitch, tradeoff between:
• wasted space for simple cells
• crunched inefficient design for complex cells

Vector unit has 72λ bit pitch (late change from 80λ to fit reticle).
Scalar unit has 80λ bit pitch.

Decide on metal layer assignments.
Data busses in Metal 1, control/clock/Vdd/GND in Metal 2.
Roughly half of datapath bit pitch is used for busses passing by cell.

Design library of datapath cells (mostly latches and muxes).
Special cells created where needed (maybe 5% are special)

Mostly static CMOS logic and static pass-transistor logic, some critical places use dynamic logic:
• Adder carry-chains
• Branch zero comparator
• Saturation overflow comparators

T0 Datapath Latch Designs
Latches mostly dynamic TSPC plus holders (a la 21064)

[Figure: transistor-level schematics of the n-latch and p-latch, each with input D, clock PHI, internal node X, and output Q; transistor widths annotated]

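Special Pseudo-Static n-Latch

Restrictive enable control line timing caused problems later

[Figure: transistor-level schematic of the pseudo-static n-latch with input D, clock PHI, enable signals LEN and LENB, internal node X, and output Q; transistor widths annotated]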

T0 Datapath Mux Designs
Muxes n-pass-transistor with level restoring p-transistor:

[Figure: transistor-level schematic of a 3-input mux with data inputs A, B, C, select inputs ASEL, BSEL, CSEL, and output OUT; transistor widths annotated]

Example Datapath Layout

T0 Standard Cell Designs
Started with public domain library, but hand-inspected each cell and threw away/redesigned bad cells
• Some cells had too many series transistors or bad output driver

Changed every cell to have much wider power/ground rails
• To avoid IR drop in middle of long standard cell row

Added separate clock rail into every cell
• Fits into overall clock gridding scheme
• Ensures controlled skew on clock (don’t want clock auto-routed!)

Designed our own standard cell flip-flops and latches
• Connects to special clock rail - uses our clocking methodology
• Latches used to synchronize with datapath signals

Added greater variety of inverters and buffers
• Existing buffers not big enough to drive loads on our chip
• More flexibility for synthesis to trade area and delay

T0 Pads
Pad design is especially tricky

Many esoteric device structures used to provide protection against latch up and ESD damage

Obtained HP’s design guidelines under NDA

Designed custom pads using most of HP’s recommendations for pad protection

Pad output drivers used n-type pullup to reduce power consumption - output only swings to ~4V not 5V

Separate power supply rings for output drivers and core logic

Summary
T0 circuit design mostly conservative, low risk

Robustness engineered into all cells and overall design

Only a few tricks where big wins possible
• Fast dynamic datapath logic to shorten critical paths
• Double-pumped vector register file to save area
• Novel output drivers to reduce power

Day 2, Session B: Design Paths

Full-custom

Standard cell

Final global checks

Full-Custom Tools

Pre-existing tools used:
• Viewlogic schematic editor (commercial)
• Magic layout editor and extraction (university)
• HSpice circuit simulator (commercial)
• CAzM table-driven circuit simulator (university, now commercial)
• irsim switch-level simulator (university)
• gemini layout versus schematic compare (university)
• Dracula design rule checker (commercial)

In-house tools:
• flat SPICE netlist flattener/processor
• tilem procedural layout generator

Full-Custom Design Process

Initial specification with high-level schematic plus verbal communication (most full-custom work done before RTL finished)

Design loop:
• Viewlogic schematic design (functionality and transistor sizing)
• Timing simulations with HSpice
• Functionality simulations with irsim
• magic layout
• Extractions with magic (get real parasitics - feed back into schematic)

Iterate until design goals met.

Clock cycle initially fixed at <50MHz to prevent overoptimization.

Example Viewlogic Schematic

(I-Cache SRAM bit)

[Figure: schematic of the SRAM bit cell with signals IBIT, IBITB, BIT, BITB, and RSEL; transistor widths annotated]

Example magic Layout

(Two halves of SRAM cache bits)

Standard Cell Design Path
Initial RTL (Register Transfer Level) in C++

Each RTL control block manually translated into BDS
• BDS, a limited, combinational-circuit-only hardware description language

bdsyn compiles BDS into blif (Berkeley Logic Interchange Format)

blif optimized and synthesized into gates using sis

Gate netlist input to TimberWolf place and route.
Also, generate Viewlogic schematic from gate netlist.

RTL Model
RTL (Register Transfer Level) design in C++.

RTL model is “golden reference” for whole T0 design.

Models state in every latch on every clock phase.

Ran at 1,500 cycles/second on Sparcstation-20/61.

100-1000 times faster than Verilog or VHDL RTL model.

(More on RTL in next session)

BDS Blocks
C++ RTL control logic was manually split into about 20 blocks that the synthesis tool could handle (by trial and error).

Each control block manually translated into equivalent BDS.

Example BDS code (piece of JTAG block):

routine run_tdo;
  state tdo<7:0>;
  if tapcin<3> then tdo = regioin
  else if iregin<3> then tdo = regioin
  else tdo = memioin;
  tdob = not tdo;
endroutine;
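For readers unfamiliar with BDS, the routine above can be read as a small piece of combinational logic. The Python sketch below is an interpretation, not the original C++ RTL: it assumes `<3>` selects bit 3 of a signal, that `tdo<7:0>` is 8 bits wide, and that `not` is a bitwise complement. Since both of the first two branches assign `regioin`, they collapse into a single OR of the two select bits.

```python
def run_tdo(tapcin, iregin, regioin, memioin, width=8):
    """Combinational model of the run_tdo routine; returns (tdo, tdob).
    Bit selection and the 8-bit width are assumptions read off the BDS text."""
    mask = (1 << width) - 1

    def bit3(value):
        return (value >> 3) & 1

    if bit3(tapcin) or bit3(iregin):
        tdo = regioin & mask      # either select bit routes the register I/O input
    else:
        tdo = memioin & mask      # otherwise route the memory I/O input
    tdob = ~tdo & mask            # complemented copy of tdo
    return tdo, tdob

print(run_tdo(tapcin=0b1000, iregin=0, regioin=0xAB, memioin=0x12))
print(run_tdo(tapcin=0, iregin=0, regioin=0xAB, memioin=0x12))
```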

Synthesis with sis
Each BDS block was translated into logic equations in blif.
Also, had to create timing specs for each block.

Optimized and synthesized by sis (Berkeley synthesis package)

Two basic synthesis scripts created:
• target minimal area
• target minimal delay

Some critical blocks were tuned with own custom synthesis scripts.

Synthesis could sometimes take infinite time or infinite memory.
=> had to split blocks further or rewrite script.

Place and Route
Synthesized blocks connected by schematic.
Entire control unit then extracted into single gate netlist.

Place and route using TimberWolf (simulated annealing).

Had to fix TimberWolf to fit control into non-rectangular space.

Placed outer loop around entire place and route run to iterate parameters.

Last piece of T0 design:
3 months of CAD hacking after everything else finished!

Final place and route took 1 week on Sparc-20/61.

Example Stdcell Layout

Static Timing Analysis
Find critical paths in control logic and datapath interface.

Manual database of signal timing specs.
Scripts extracted RC delays of long wires from layout.

Timing script considered:
• synthesis predicted timing
• output drive capability
• wire capacitive load
• wire RC delay
• input timing specs

Fixed any timing violations found by:
• changing control logic
• changing datapaths
• changing wires (fatter wires for lower RC)

Gave up with 33MHz predicted cycle time (very conservative)
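The kind of calculation such a timing script performs can be sketched as a longest-path search over a delay graph. This is a minimal illustration and not the T0 scripts: the node names and delay values are hypothetical, and the wire delay uses the common approximation of 0.5 * R * C for a uniformly distributed RC line.

```python
from collections import defaultdict, deque

def wire_rc_delay(r_ohms, c_farads):
    """Approximate delay of a uniformly distributed RC wire (0.5 * R * C)."""
    return 0.5 * r_ohms * c_farads

def arrival_times(edges, input_arrival):
    """edges: {(src, dst): delay}; the graph must be acyclic.
    Returns the latest arrival time at every node."""
    succ, indeg = defaultdict(list), defaultdict(int)
    nodes = set(input_arrival)
    for (src, dst), delay in edges.items():
        succ[src].append((dst, delay))
        indeg[dst] += 1
        nodes.update((src, dst))
    arrival = {n: input_arrival.get(n, 0.0) for n in nodes}
    ready = deque(n for n in nodes if indeg[n] == 0)
    while ready:                       # process nodes in topological order
        n = ready.popleft()
        for dst, delay in succ[n]:
            arrival[dst] = max(arrival[dst], arrival[n] + delay)
            indeg[dst] -= 1
            if indeg[dst] == 0:
                ready.append(dst)
    return arrival

# Hypothetical example, all times in nanoseconds.
wire_ns = wire_rc_delay(r_ohms=2000.0, c_farads=2e-12) * 1e9
edges = {("a", "g1"): 4.0, ("b", "g1"): 6.0, ("g1", "latch"): 5.0 + wire_ns}
arrival = arrival_times(edges, input_arrival={"a": 1.0, "b": 0.0})
cycle_ns = 1e3 / 33.0                  # 33 MHz cycle time
print("slack at latch (ns):", cycle_ns - arrival["latch"])
```

A negative slack at any latch input would correspond to a timing violation of the sort fixed by changing logic, datapaths, or wire widths.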

Critical Paths

1) Host performing DMA at same time as indexed load/store
• Have to drive long stall wire with bad RC delay into static latches with difficult timing constraints. No time to change latches.

2) Branches/I-cache
• BEQ/BNE instructions need XOR plus zero comparator fed to instruction cache fetch in same cycle. Could solve with branch prediction.

3) Address generator/new-old instruction
• Many possible ways to load address generator input latches depending on current/next instruction vector/scalar. Could fix with more pipelining.

Design Rule Checks (DRC)
Magic performs dynamic DRC during layout entry

Also, at each level of the design hierarchy after procedural layout.

Final layout also DRC checked using Dracula, found a few minor bugs.

Problems with CAD Tools
Bugs
• many features don’t work - fixed a lot ourselves

Limitations
• size --- had to recompile with bigger constants (if source available)
• signal naming --- had to stick to a-z (one case) and 0-9 (no underscores)
• don’t handle hierarchy --- had to “flatten” circuits on the fly (used Unix pipe)

Bad Design
• often obvious that author never built real chips (useless features esp. GUIs)
• too automatic, can’t control what happens (“take it or leave it” tool)
• requires bulky, constrictive framework
• wouldn’t work in script or Makefile
• awkward binary data formats

(Commercial tools no better than university tools, sometimes worse)

Q: What Was Best CAD Tool?

A: Unix Development Environment!

Sed/Awk/Perl used extensively for format conversion

RCS used for revision control

Shell scripts/Makefiles to automate processes

Pipes used for on-the-fly netlist flattening, test vector generation
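As an illustration of the small format-conversion filters this approach relies on, the sketch below rewrites signal names to satisfy a tool that accepts only a-z and 0-9. It is a hypothetical example written in Python rather than the Sed/Awk/Perl the project used, and the collision-handling scheme (appending `x1`, `x2`, ...) is an assumption.

```python
import re

def mangle(name, table):
    """Map a signal name to lower-case a-z/0-9 only, keeping the mapping
    one-to-one. `table` remembers earlier assignments so repeated names
    always map to the same result."""
    if name in table:
        return table[name]
    base = re.sub(r"[^a-z0-9]", "", name.lower()) or "n"
    taken = set(table.values())
    candidate, n = base, 0
    while candidate in taken:          # resolve clashes such as Stall_N vs stalln
        n += 1
        candidate = f"{base}x{n}"
    table[name] = candidate
    return candidate

table = {}
for signal in ["Stall_N", "stalln", "VRF_WrEn<3>", "Stall_N"]:
    print(signal, "->", mangle(signal, table))
```

Wrapped in a loop over standard input, a function like this can sit in the middle of a Unix pipe between a netlist generator and the tool with the naming restriction.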

Design Path Summary

Many tools (>50), many data formats

Over half of the total T0 design effort was spent on CAD tools!
(We began project intent on not developing any new tools)

Built design flow over several projects in group.

Continually added new tools/methodologies.
(Many candidate tools tried and abandoned)

Biggest gaps:
• good HDL --- would use Verilog now
• good static timing analysis --- would use TimeMill now

Day 2, Session C: Verification

Levels of Design Representation

ISA | Semantics of instruction set | (ISA interpreter)
RTL | State on each cycle | (RTL simulator)
Schematic | Transistors | (Irsim switch level simulator)
Layout | Mask layers |
Real Chip | Fabricated silicon |

Verification Framework
Defined set of “virtual machines”, each defining allowable:
• registers
• instructions
• exceptions
• memory regions
• whether cycle accurate
• and form of test result communication
for valid test programs

Virtual machines:
mips: MIPS-II scalar instruction set
t0u: T0 user level instruction set
t0raw: T0 user+kernel instruction set
t0cyc: T0 user+kernel instruction set + cycle accurate
t0die: T0 raw with no SRAM (wafer test)
t0diecyc: T0 raw with no SRAM + cycle accurate (wafer test)

Virtual Machine Execution Platforms

Platforms: SGI R4K Indigo | T0 ISA Interpreter | T0 RTL Simulator | Bare T0 Die | Spert-II Board

mips X X X X

t0u X X X

t0raw X X X

t0cyc X X

t0die X X X X

t0diecyc X X X

Example Test Program

/* Simple test of cpu and memory system life. */

#include <t0test.h>

TEST_MIPS               # Type of virtual machine
TEST_CODEBEGIN          # Begin test program

life_test:
        lw $2, life_test_dat
        addi $2, 1
        sw $2, life_test_res

exit:
TEST_CODEEND            # End test program

        .data
TEST_DATABEGIN          # Begin data region for test input and result data.
life_test_dat: .word 41

life_test_res: .word 0xffffffff
TEST_DATAEND            # End data region for test input and result data.

mips Test Compilation and Execution

[Figure: flow diagram in five stages.
Test Program: test.S
Test Compilation: testbuild-iris, testbuild-spert
Test Executables: a.out, test.spert
Test Run: SGI Indigo R4000 Irix, t0isatest, t0rtltest
Test Results: iris.mem, isa.mem, rtl.mem]

Producing Switch-Level Test Vectors

[Figure: flow diagram.
Files and tools: test.S, testbuild-spert, test.spert, t0cpudptr, test.irsim, Workview schematic (sch/cell.1 files), wspice, flatspice, spice2sim, cpudp.sim, irsim.
Labels: Test program, Test executable, Test rig based on RTL model, Test vectors, Schematic netlist, Assertion failures?]

Simulation Speeds (Cycles/Second)

ISA | 500,000 (inst/second)
RTL | 1,100
Schematic/Layout IRSIM | 0.05
Fabbed Chip | 45,000,000
R4000 (MIPS-II only) | 100,000,000

(Simulation speeds measured on Sparcstation-10/51)
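These rates differ by many orders of magnitude, which is easiest to appreciate as wall-clock time for a single test. The sketch below uses the speeds from the slide; the test length of 10,000 cycles is a hypothetical figure, and the ISA interpreter's instructions/second is treated as if it were cycles/second for the sake of comparison.

```python
# Cycles per second from the slide (the ISA figure is instructions/second).
SPEEDS = {
    "ISA interpreter": 500_000,
    "RTL simulator": 1_100,
    "IRSIM (schematic/layout)": 0.05,
    "Fabricated chip": 45_000_000,
}

def wall_clock_seconds(cycles, platform):
    """Time to run `cycles` cycles on the named platform."""
    return cycles / SPEEDS[platform]

test_cycles = 10_000  # hypothetical test length
for platform in SPEEDS:
    secs = wall_clock_seconds(test_cycles, platform)
    print(f"{platform:26s} {secs:14.4f} s")
```

At 0.05 cycles/second even this short test takes more than two days of switch-level simulation, which is why the bulk of random testing was run at the RTL level.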

Test Programs
Two classes:

Design verification (does the RTL implement a vector micro?)

Fabrication testing (does the fabricated chip meet specs?)

These classes do not necessarily overlap:

Verification tests don’t necessarily exercise all paths in circuits (e.g., all adder propagations or SRAM data retention).

Fab tests won’t tell if RTL has design bug, only that chip matches buggy RTL.

Directed + Random Tests
Hand-written directed tests for specific functions.
• Make sure all single events covered
• Some events very hard to generate randomly
• Very time-consuming

Randomly-generated tests for greater coverage
• Good at finding bugs in combinations of events
• Fast way of generating lots of test code
• Difficult to randomly generate valid virtual machine test code

Hand-Written Tests

Nearly 100,000 lines of hand-written assembly test code!(Includes both design verification and fab test code)

Virtual machine | Programs | Lines
mips | 107 | 14,293
t0u | 199 | 57,992
t0raw | 63 | 17,512
t0cyc | 3 | 537
t0die | 44 | 6,490
t0diecyc | 1 | 173
Total | 417 | 96,997

Random Program Generation
First attempt: rantor

rantor incrementally generates random test program, one instruction at a time

Problems:
• Only instruction-by-instruction random - can’t generate instruction sequences
• Difficult to guarantee that random code obeys virtual machine limitations

rantor found quite a few RTL bugs initially, but eventually most bugs were found to be in rantor-generated test programs.

Second Attempt: torture

1) User builds library of random test code sequence generators.

2) torture core randomly selects sequence generator.

3) Sequence generator builds random instruction sequence with virtual registers (both visible and invisible in final test state).

4) torture interleaves multiple sequences randomly, allocating virtual registers to random physical registers.

Test programs guaranteed to obey virtual machine constraints.

All written as C++ class library.
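The four steps can be sketched in a few lines. This is a toy illustration in Python, not the torture C++ class library: the two sequence generators, the tuple instruction format, and the register names are all hypothetical, and the visible/invisible register distinction is not modelled. The point it shows is that giving each sequence its own physical registers lets sequences be interleaved freely without breaking their internal data dependences.

```python
import random

def gen_add_chain(rng):
    """Hypothetical generator: load a constant, then add an immediate to it."""
    return [("li", "v0", rng.randrange(100)),
            ("addi", "v0", "v0", rng.randrange(100))]

def gen_copy(rng):
    """Hypothetical generator: load one register, then copy it to another."""
    return [("li", "v0", rng.randrange(100)),
            ("move", "v1", "v0")]

GENERATORS = [gen_add_chain, gen_copy]      # step 1: library of generators

def build_test(rng, n_sequences, allowed_regs):
    free = list(allowed_regs)               # registers the virtual machine permits
    rng.shuffle(free)
    sequences = []
    for _ in range(n_sequences):
        seq = rng.choice(GENERATORS)(rng)   # steps 2 and 3
        virt = sorted({op for ins in seq for op in ins[1:] if isinstance(op, str)})
        mapping = {v: free.pop() for v in virt}   # fails if the pool runs out
        sequences.append([(ins[0],) + tuple(mapping.get(op, op) for op in ins[1:])
                          for ins in seq])
    program, pending = [], [s for s in sequences if s]
    while pending:                          # step 4: random interleaving
        seq = rng.choice(pending)
        program.append(seq.pop(0))          # each sequence keeps its own order
        if not seq:
            pending.remove(seq)
    return program

regs = [f"${i}" for i in range(8, 16)]
for ins in build_test(random.Random(0), n_sequences=3, allowed_regs=regs):
    print(ins)
```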

Random Environment Events

Run test programs on simulated machine both in quiet environment and also:

with random host and timer interrupts

with random host DMA I/O and scan-chain activity

Random Testing Results

Billions of RTL cycles run on network of workstations at ICSI (continuously running over several months on 4-20 workstations).

Highly successful: 26 bugs found through random tests.

Any random bug found, added to regression tests.

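LVS: Static Netlist Comparison

[Figure: the design levels ISA, RTL, Schematic, Layout, Real Chip. The schematic yields a schematic netlist; magic extract yields a layout netlist from the layout; gemini LVS checks whether schematic netlist = layout netlist.]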

Verification Summary

Intensive effort

Highly successful

No known logic bugs in first-pass silicon!

Day 2, Session D: Manufacture, Testing, Packaging

Manufacturing Path

T0 fabbed by Hewlett-Packard via MOSIS

Wafers delivered to test house

Wafer sort to select good die for bonding

Good die bonded to Spert-II boards (Chip-On-Board packaging)

Bare die on Spert-II board tested at ICSI

Assembly house surface-mounted components to good boards

Final whole board assembly and test at ICSI
