Regular Silicon Structures a.k.a. VLSI Building Blocks · cs250/fa09/lectures/lec... · 2009-10-02 ·...
CS250, UC Berkeley, Fall ‘09. Lecture 11, Regular Structures
CS250 VLSI Systems Design
Regular Silicon Structures a.k.a.
VLSI Building Blocks
Fall 2009
John Wawrzynek, Krste Asanović, with John Lazzaro
Introduction
‣ We've experienced synthesis and standard cell place and route.
‣ Is that all there is? We can implement any digital system with only primitive logic gates and flip-flops.
‣ If so, chip implementations would be pretty inefficient (and boring to do!)
‣ Key questions: Where can special circuit- and layout-generators provide advantage and how much?
‣ Example with a clear advantage: RAM blocks
‣ Examples where it is not so clear: crossbar switches, datapaths, ROMs, multipliers
‣ We’ll start with on-chip RAM
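Verilog RAM Specification
    //
    // Single-Port RAM with Synchronous Read
    //
    module v_rams_07 (clk, we, a, di, do);
        input         clk;
        input         we;           // write enable
        input  [5:0]  a;            // address: 64 words
        input  [15:0] di;           // data in
        output [15:0] do;           // data out
        reg    [15:0] ram [63:0];   // 64 x 16 storage array
        reg    [5:0]  read_a;       // registered read address

        always @(posedge clk) begin
            if (we)
                ram[a] <= di;
            read_a <= a;            // read address is captured on every clock edge
        end

        assign do = ram[read_a];    // output follows the registered address
    endmodule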
What do the synthesis tools do with this?
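To pin down the cycle-level behavior of the module above before asking what synthesis makes of it, here is a minimal behavioral sketch in Python (a model for illustration, not a replacement for the Verilog). The class name `SinglePortRam`, the method `posedge`, and the use of `None` for never-written words are assumptions of this sketch.

```python
class SinglePortRam:
    """Behavioral sketch of a 64 x 16 single-port RAM with synchronous read."""

    def __init__(self, depth=64, width=16):
        self.depth = depth
        self.mask = (1 << width) - 1
        self.ram = [None] * depth  # None stands in for Verilog's unknown value
        self.read_a = None         # registered read address

    def posedge(self, we, a, di):
        """Model one rising clock edge, like the always @(posedge clk) block."""
        if not 0 <= a < self.depth:
            raise ValueError("address out of range")
        if we:
            self.ram[a] = di & self.mask
        self.read_a = a

    @property
    def do(self):
        """Model `assign do = ram[read_a]`: the word at the registered address."""
        if self.read_a is None:
            return None
        return self.ram[self.read_a]


ram = SinglePortRam()
ram.posedge(we=1, a=5, di=0xBEEF)  # write; address 5 is registered on the same edge
print(hex(ram.do))                 # 0xbeef: the newly written word is visible
ram.posedge(we=0, a=6, di=0)
print(ram.do)                      # None: address 6 was never written
```

Because the address is registered and the array is read combinationally, a write followed by a read of the same address returns the new data in this model.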
Memory-Block Basics
M X N memory:
Depth = M, Width = N.
M words of memory, each word N bits wide.
Address width: log2(M) bits.
VLSI tool flows include parameterized RAM generators. The user specifies width, depth, and (sometimes) aspect ratio, and gets simulation & timing models and layout.
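As a small worked example of the M X N bookkeeping, the sketch below computes the address width and the number of bit cells for a given depth and width. The function names are hypothetical, and rounding up for depths that are not a power of two is an assumption of this sketch.

```python
def address_bits(depth):
    """Address lines needed to select one of `depth` words: ceil(log2(depth))."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return (depth - 1).bit_length()


def storage_bits(depth, width):
    """Total bit cells in a depth x width array."""
    return depth * width


for depth, width in [(32, 8), (64, 16), (1 << 20, 1)]:
    print(depth, "x", width, "->", address_bits(depth), "address bits,",
          storage_bits(depth, width), "bit cells")
```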
Internal Memory Organization
‣ RAM/ROM naming convention:
  ‣ examples: 32 X 8, "32 by 8" => 32 8-bit words
  ‣ 1M X 1, "1 meg by 1" => 1M 1-bit words
2-D array of bit cells. Each cell stores one bit of data.
Special circuit tricks are used for the cell array to improve storage density.
Address Decoding
• The function of the address decoder is to generate a one-hot code word from the address.
• The output is used for row selection.
• Many different circuits exist for this function. A simple one is shown to the right.
[Figure: decoder circuit with input "address" and outputs sel_row0, sel_row1.]
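The decoder function can be sketched in a gate-like style: each row select is the AND of the address bits, taken true or complemented. This is one possible formulation, not necessarily the circuit in the figure; the name `decode_one_hot` and its parameters are assumptions.

```python
def decode_one_hot(address, n_bits):
    """Return a list of 2**n_bits row selects with exactly one set to 1."""
    if not 0 <= address < (1 << n_bits):
        raise ValueError("address does not fit in n_bits")
    bits = [(address >> k) & 1 for k in range(n_bits)]
    selects = []
    for row in range(1 << n_bits):
        sel = 1
        for k, bit in enumerate(bits):
            want = (row >> k) & 1
            # AND in the true literal or its complement, as a gate-level decoder would
            sel &= bit if want else 1 - bit
        selects.append(sel)
    return selects


print(decode_one_hot(1, 1))  # [0, 1]: sel_row1 is active
print(decode_one_hot(5, 3))  # [0, 0, 0, 0, 0, 1, 0, 0]
```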
Memory Block Internals
These circuits are just functional abstractions of the actual circuits used.
[Figure: functional memory array with row-select lines sel_row0 and sel_row1.]
For read operation, functionally the memory is equivalent to a 2-D array of flip-flops with tristate outputs on each:
For write operation, the functional equivalent includes a means to change the state value:
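The read abstraction (flip-flops with tristate outputs sharing a column line) can be sketched as follows. The helper names and the use of "Z" for an undriven line and "X" for contention are assumptions of this sketch; it shows why the row selects need to be one-hot.

```python
def resolve_column(cell_values, row_selects):
    """Value on one shared column line, given each cell's bit and its row select."""
    drivers = [bit for bit, sel in zip(cell_values, row_selects) if sel]
    if not drivers:
        return "Z"  # no tristate output enabled: line floats
    if len(set(drivers)) > 1:
        return "X"  # two enabled outputs disagree: contention
    return drivers[0]


def read_word(array, row_selects):
    """Read every column of a 2-D array (list of rows) under the given selects."""
    columns = zip(*array)
    return [resolve_column(column, row_selects) for column in columns]


array = [[1, 0, 1],
         [0, 0, 1]]
print(read_word(array, [0, 1]))  # [0, 0, 1]: row 1 drives the columns
print(read_word(array, [1, 1]))  # ['X', 0, 1]: both rows selected, column 0 conflicts
```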
Storing computational state as charge
State is coded as the amount of energy stored by a capacitor.
State is read by sensing the amount of energy.
[Figure: two charged capacitors; one is labeled 1.5V.]
Problems: noise changes Q (up or down), parasitics leak or source Q. Fortunately, Q cannot change instantaneously, but that only gets us in the ballpark.
Static Memory Circuits
Dynamic Memory: Circuit remembers for a fraction of a second.
Non-volatile Memory: Circuit remembers for many years, even if power is off.
Static Memory: Circuit remembers as long as the power is on.
Idea: Store each bit with its complement
We can use the redundant representation to compensate for noise and leakage.
Why?
[Figure: a “Row” holding a value and its complement as (Gnd, Vdd) or (Vdd, Gnd); nodes labeled x, y, y’.]
Case #1: y = Gnd, y’ = Vdd ...
[Figure: “Row” with y at Gnd and y’ at Vdd; currents labeled I_ds and I_sd; nodes x, y, y’.]
Case #2: y = Vdd, y’ = Gnd ...
[Figure: “Row” with y at Vdd and y’ at Gnd; currents labeled I_sd and I_ds; nodes x, y, y’.]
Combine both cases to complete circuit
“Cross-coupled inverters”
[Figure: complete cell with nodes x, y, y’; each side is labeled with Gnd, Vdd, the threshold Vth, and “noise”.]
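Why the cross-coupled pair compensates for noise can be illustrated numerically: each inverter pushes a disturbed level back toward a rail, as long as the disturbance stays on the correct side of Vth. The sketch below uses a made-up smooth transfer curve, not a transistor model; `VDD`, `VTH`, `GAIN`, and the function names are all assumptions.

```python
import math

VDD = 1.0    # assumed supply voltage
VTH = 0.5    # assumed inverter switching threshold
GAIN = 20.0  # assumed steepness of the transfer curve


def inverter(v_in):
    """Idealized inverter: output near VDD for low input, near 0 for high input."""
    return VDD / (1.0 + math.exp(GAIN * (v_in - VTH)))


def settle(y, y_bar, steps=10):
    """Let the two inverters drive each other for a few iterations."""
    for _ in range(steps):
        y, y_bar = inverter(y_bar), inverter(y)
    return y, y_bar


# Start from a noisy "y = Gnd, y' = Vdd" state and watch it return to the rails.
y, y_bar = settle(0.2, 0.9)
print(round(y, 3), round(y_bar, 3))
```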
SRAM Challenge #1: It’s so big!
Capacitors are usually “parasitic” capacitance of wires and transistors.
Cell has both transistor types.
Vdd AND Gnd.
Lots of contacts, transistors, two bit lines ...
SRAM area is 6X-10X DRAM area, same generation ...
Recall: Positive edge-triggered flip-flop
A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: flip-flop symbol (D, Q) and its schematic in two stages, labeled “Sampling circuit” and “Holds value”.]
16 Transistors: Makes an SRAM look compact!
What do we get for the 10 extra transistors? Clocked logic semantics.
Sensing: When clock is low
A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: flip-flop symbol (D, Q) and two-stage schematic (“Sampling circuit”, “Holds value”), shown with clk = 0, clk’ = 1.]
Will capture new value on posedge.
Outputs last value captured.
Capture: When clock goes high
A flip-flop “samples” right before the edge, and then “holds” value.
[Figure: flip-flop symbol (D, Q) and two-stage schematic (“Sampling circuit”, “Holds value”), shown with clk = 1, clk’ = 0.]
Remembers value just captured.
Outputs value just captured.
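The two clock phases described above can be sketched as a pair of level-sensitive latches: a sampling stage that is transparent while the clock is low, and a holding stage that is transparent while the clock is high. This is a functional sketch only; the class names and the initial value of 0 are assumptions.

```python
class Latch:
    """Level-sensitive latch: follows d while enabled, otherwise holds."""

    def __init__(self):
        self.q = 0  # assumed initial state

    def update(self, enable, d):
        if enable:
            self.q = d
        return self.q


class PosEdgeFlipFlop:
    """Positive edge-triggered flip-flop built from two latches."""

    def __init__(self):
        self.sampling = Latch()  # transparent while clk = 0
        self.holding = Latch()   # transparent while clk = 1

    def step(self, clk, d):
        """Apply one clock level and data value; return the output Q."""
        sampled = self.sampling.update(enable=(clk == 0), d=d)
        return self.holding.update(enable=(clk == 1), d=sampled)


ff = PosEdgeFlipFlop()
print(ff.step(clk=0, d=1))  # 0: still outputs the last value captured
print(ff.step(clk=1, d=1))  # 1: the value sampled before the edge appears
print(ff.step(clk=1, d=0))  # 1: changes on D while the clock is high are ignored
```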
Challenge #2: Writing is a “fight”
When word line goes high, bitlines “fight” with cell inverters to “flip the bit” -- must win quickly!
Solution: tune W/L of cell & driver transistors
[Figure: cell nodes with initial state Vdd and initial state Gnd; one bitline drives Gnd, the other drives Vdd.]
Challenge #3: Preserving state on read
When word line goes high on read, cell inverters must drive large bitline capacitance quickly, to preserve state on its small cell capacitances.
[Figure: cell nodes with cell state Vdd and cell state Gnd; each bitline is a big capacitor.]
SRAM Operation Summary
Most common is 6-transistor (6T) cell array.
[Figure: array of cells; each row shares a word line, each column shares a pair of bit lines (bit and its complement).]
Word selects this cell, and all others in a row.
Write operation: column bit lines are driven differentially (0 on one, 1 on the other). Values overwrite cell state.
Read operation: column bit lines are “precharged”, then released. Cell pulls down one bit line or the other. “Sense Amplifier” circuit quickly amplifies difference between bit lines (saves time & energy).
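The write and read sequences summarized above can be sketched for a single column. The model is purely functional: the class name, the `swing` parameter (how far the cell pulls a bit line down before sensing), and the voltage values are assumptions.

```python
class SramColumn:
    """One column of cells sharing a differential pair of bit lines."""

    def __init__(self, rows):
        self.cells = [0] * rows

    def write(self, row, value):
        # Drive the pair differentially: 1 on one line, 0 on the other.
        bit, bit_b = (1, 0) if value else (0, 1)
        # The drivers overpower the selected cell, which takes the value on `bit`.
        self.cells[row] = bit

    def read(self, row, vdd=1.0, swing=0.1):
        # Precharge both bit lines, then release them.
        bit = bit_b = vdd
        # The selected cell pulls down one line or the other by a small amount.
        if self.cells[row]:
            bit_b -= swing
        else:
            bit -= swing
        # The sense amplifier resolves the small difference to a full logic value.
        return 1 if bit > bit_b else 0


column = SramColumn(rows=4)
column.write(2, 1)
print(column.read(2), column.read(0))  # 1 0
```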
Multi-ported Memory
‣ Motivation:
  ‣ Consider CPU core register file:
    ‣ 1 read or write per cycle limits processor performance.
    ‣ Complicates pipelining. Difficult for different instructions to simultaneously read or write regfile.
    ‣ Common arrangement in pipelined CPUs is 2 read ports and 1 write port.
  ‣ I/O data buffering:
    [Figure: CPU and disk or network interface connected through a data buffer.]
    • dual-porting allows both sides to simultaneously access memory at full bandwidth.
[Figure: Dual-port Memory with ports Aa, Dina, WEa, Douta and Ab, Dinb, WEb, Doutb.]
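A behavioral sketch of the dual-port block, using the port names from the figure (Aa, Dina, WEa, Douta and the b-side equivalents). Two policies here are assumptions of the sketch, since the slide does not specify them: reads return the contents from before this cycle's writes, and two writes to the same address in one cycle are rejected.

```python
class DualPortMemory:
    """Two independent ports onto one storage array."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, a_addr, a_din, a_we, b_addr, b_din, b_we):
        """Perform one access on each port; return (dout_a, dout_b)."""
        if a_we and b_we and a_addr == b_addr:
            raise ValueError("both ports write the same address")
        # Both ports read the old contents (assumed read-before-write).
        dout_a = self.mem[a_addr]
        dout_b = self.mem[b_addr]
        if a_we:
            self.mem[a_addr] = a_din
        if b_we:
            self.mem[b_addr] = b_din
        return dout_a, dout_b


buf = DualPortMemory(depth=8)
buf.cycle(1, 11, 1, 2, 22, 1)         # both sides write in the same cycle
print(buf.cycle(2, 0, 0, 1, 0, 0))    # (22, 11): each side reads the other's data
```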
Dual-ported Memory Internals
‣ Add decoder, another set of read/write logic, bit lines, word lines:
[Figure: cell array with two decoders (deca, decb) on the address ports and two sets of r/w logic on the data ports.]
• Example cell: SRAM
[Figure: cell with word lines WL1, WL2 and bit-line pairs b1, b1’ and b2, b2’.]
• Repeat everything but cross-coupled inverters.
• This scheme extends up to a couple more ports, then need to add additional transistors.
Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the width. Example: given 1Kx8, want 1Kx16
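One common way to widen a memory is to place two blocks side by side, feed both the same address, and let each supply half of the data bits. The sketch below models that arrangement; the class names and the choice of which block holds the low byte are assumptions, since the slide's figure is not recoverable.

```python
class Ram1Kx8:
    """Building block: 1024 words of 8 bits."""

    def __init__(self):
        self.words = [0] * 1024

    def write(self, addr, data):
        self.words[addr] = data & 0xFF

    def read(self, addr):
        return self.words[addr]


class Ram1Kx16:
    """1K x 16 built from two 1K x 8 blocks sharing one address."""

    def __init__(self):
        self.low = Ram1Kx8()   # data bits 7..0
        self.high = Ram1Kx8()  # data bits 15..8

    def write(self, addr, data):
        self.low.write(addr, data & 0xFF)
        self.high.write(addr, (data >> 8) & 0xFF)

    def read(self, addr):
        return (self.high.read(addr) << 8) | self.low.read(addr)


wide = Ram1Kx16()
wide.write(1023, 0xABCD)
print(hex(wide.read(1023)))  # 0xabcd
```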
Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the depth. Example: given 1Kx8, want 2Kx8
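One common way to deepen a memory is to use an extra address bit to choose between two blocks, with the remaining bits addressing a word inside the chosen block. The sketch below models that; using the most significant address bit as the block select, and the class names, are assumptions.

```python
class Ram1Kx8:
    """Building block: 1024 words of 8 bits."""

    def __init__(self):
        self.words = [0] * 1024

    def write(self, addr, data):
        self.words[addr] = data & 0xFF

    def read(self, addr):
        return self.words[addr]


class Ram2Kx8:
    """2K x 8 built from two 1K x 8 blocks selected by address bit 10."""

    def __init__(self):
        self.banks = [Ram1Kx8(), Ram1Kx8()]

    def _select(self, addr):
        if not 0 <= addr < 2048:
            raise ValueError("address out of range")
        return self.banks[addr >> 10], addr & 0x3FF

    def write(self, addr, data):
        bank, offset = self._select(addr)
        bank.write(offset, data)  # only the selected block is written

    def read(self, addr):
        bank, offset = self._select(addr)
        return bank.read(offset)  # the selected block's output is passed through


deep = Ram2Kx8()
deep.write(5, 0x11)
deep.write(1024 + 5, 0x22)
print(hex(deep.read(5)), hex(deep.read(1029)))  # 0x11 0x22
```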
Other Regular Structures
‣ In Transparencies
• Custom Performance with ASIC Effort
• 3X Faster than ASIC
• 40% Smaller than ASIC
• 10X Less Effort than Full Custom
The Tool for High Performance Designs
DPC is the tool used by designers needing high performance chips. They want the performance of full custom design, but with a much shorter design cycle. Datapaths designed with DPC are 3X (three times) faster and 40% smaller than synthesis and place and route. At the same time, it takes 10X (ten times) less effort than full custom design.
In deep sub-micron design, wire length is the dominant factor affecting critical path timing. Cell placement becomes a critical step in chip performance as well as power consumption. With traditional tools, designers are at the mercy of automatic placement tools. The DataPath Compiler (DPC) lets the designer control placement with immediate timing feedback. Multiple what-if experiments can be performed. Using a graphical display that back annotates timing to the schematic, you can easily identify timing problems and rapidly iterate through potential solutions, yielding faster results. DPC is so fast it can place, and then time, a 50K gate datapath in 2-3 minutes.
Useful Identification of Critical Paths
DPC predicts wire lengths early in the design cycle. The resulting timing iterations are both fast and accurate, allowing the designer to quickly iterate to their performance goal. The critical paths are displayed directly on the schematic at all levels of the design hierarchy. In addition, the actual delays of the paths are annotated onto the wires in both the schematic and placement view.
DPC for Critical Path Optimization
In datapath designs, some simple directives by the designer can produce speed-optimized layouts. These directives are easily given and modified in DPC. The placement of components on the schematic directs relative placement in the placement file.
Cells can also be hard placed at specific row or column locations and empty space can be indicated. DPC automatically generates the row and column placement and predicted wire lengths. Wire predictions can be used to drive the DPC timing analyzer as well as external timing analyzers including Pearl, PrimeTime and PathMill.
DPC
DataPath Compiler
Figure 1-a. The example above shows the placement generated for our sample 8-bit ALU. A critical path is highlighted in red and yellow on both the schematic and placement views. Timing for other nets is indicated in the menu and new nets can be selected and highlighted.
Micro Magic, Inc.
Sunnyvale, CA USA
Phone: 408.414.7647
www.micromagic.com
Copyright 1995-2006, Micro Magic, Inc. All rights reserved.
DPC reads the output from static timing analysis and displays the critical paths directly on the schematic. The placement can be modified to optimize critical paths, or extra drivers can be added to the critical path, all in the schematic. You then run through the placement and timing iteration again. This iteration continues until the timing criteria are satisfied. The iteration loop is fast and visual. When you are satisfied with the design performance, the placement file (DEF file) is passed to a routing tool. The routed result can then be read into the MAX Layout Editor to view, and edit if necessary.
DPC Design Flow
With DPC, you first enter the schematics into the SUE design manager. DPC then uses the schematic as a seed for placement. Once DPC has the placement, it is able to estimate the wiring delays and send this info to a static timing analyzer. The results of static timing analysis are then read back into SUE. The critical path is highlighted in both the schematic and placement view. Additionally, the delay and slope at each node are displayed.
DPC Features:
• Automatically route, generate parasitics, run timing analysis, and display critical-path timing directly on schematics.
• DPC includes its own timing analyzer, or you can use integrated static timing analysis tools such as Pearl, PathMill and PrimeTime.
• Fast - can do a 50K gate data path in a few minutes.
• Use standard cells or custom datapath cells.
• Write out DEF placement information and Verilog netlist for integration with routing tools.
• Available on LINUX platforms.
[Figure: DPC design-flow diagram. Blocks: SUE, DPC Placement & Parasitic Estimation, Timing Analysis, Router, Parasitic Extraction, GDSII; the DPC loop is marked “FAST”.]
Figure 2-a. DPC reduces the time required for placement and timing analysis from days to minutes.
Figure 2-b. The DEF placement file is sent to a router. The resulting GDSII file can then be read into MAX (Micro Magic’s layout tool).
Regular Structures
‣ In principle, standard cell libraries are sufficient for implementation of any logic circuit. In practice, great for “random logic”, but what about other functions?
‣ With logic synthesis and standard cell place and route as good as it is, is there still a place for special regular structure layout generators?
‣ Exploiting regularity allows us to build special “generators”,
  ‣ Which often leads to improved area, energy, and performance.
‣ We looked at RAM, ROM, PLA, shifters
  ‣ Are there others?
Random Notes
‣ Multiplication another regular structure example
‣ How do we (or should we) exploit “regular structures” in our design flow?
  ‣ Special predesigned blocks
    ‣ ex: “large” SRAM block in library for instantiation
  ‣ Special layout generators with special leaf cells
    ‣ ex: PLA generators. SRAM/ROM generators.
  ‣ Special layout generators using standard cells
    ‣ Datapath compilers
‣ Is there always a clear win?
  ‣ ex: ROM table might be smaller and faster implemented as logic equations in standard cells (with place and route)
‣ Clear advantage for SRAM, others?
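As a toy illustration of the ROM-versus-logic point, the sketch below shows an 8 x 1 table whose contents happen to equal a short logic equation (the parity of the address bits). The table contents and function names are made up for this example; the sketch only shows functional equivalence and says nothing about the actual area or speed of either implementation.

```python
ROM = [0, 1, 1, 0, 1, 0, 0, 1]  # assumed contents: parity of the 3-bit address


def rom_lookup(address):
    """Table implementation: one stored bit per address."""
    return ROM[address]


def parity_logic(address):
    """Logic-equation implementation: XOR of the three address bits."""
    a0 = address & 1
    a1 = (address >> 1) & 1
    a2 = (address >> 2) & 1
    return a0 ^ a1 ^ a2


print(all(rom_lookup(a) == parity_logic(a) for a in range(8)))  # True
```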