How to make a true 3D-TSV IC application
3D-TSV IC technologies are spreading, but major applications have not followed
Kanji Otsuka
Meisei University, Collaborative Research Center
SEMICON Taiwan 2011 Kanji Otsuka, Meisei University
We have still not found a major application for the TSV interconnection structure.
• As we understand it, the main figure of merit of the TSV structure is that 3-D interconnections remove the restrictions of 2-D wiring.
• Is this figure of merit correct?
• We should re-examine this main figure of merit with a view to creating major applications.
1. TSV diameter: still very large for an interconnection.
Active area and 2D wiring area are wasted, even if we choose a TSV diameter of 2 µm.
[Figure labels: 2D interconnection, TSV, Si substrate]
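To put a rough number on this area penalty, the sketch below estimates the fraction of a die blocked by TSVs. Only the 2 µm diameter comes from the slide; the keep-out margin, the TSV count, the die area, and the function name `tsv_area_fraction` are illustrative assumptions.

```python
def tsv_area_fraction(n_tsv, diameter_um, keepout_um, die_area_mm2):
    """Fraction of die area blocked by TSVs.

    Assumption: each TSV blocks a square of side (diameter + 2 * keep-out)
    in the active layer and in the 2D wiring layers above it.
    """
    side_um = diameter_um + 2.0 * keepout_um
    blocked_um2 = n_tsv * side_um ** 2
    die_area_um2 = die_area_mm2 * 1.0e6  # 1 mm^2 = 1e6 um^2
    return blocked_um2 / die_area_um2


# Hypothetical case: 5000 TSVs of 2 um diameter, 1 um keep-out, 1 mm^2 die
fraction = tsv_area_fraction(5000, 2.0, 1.0, 1.0)
print(f"blocked area: {fraction:.1%}")
```

Under these assumed numbers the blocked area scales linearly with the TSV count and quadratically with the effective TSV footprint, which is why the diameter matters so much.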
Current technology: 6 or 10 metal layers
TSVs can provide the equivalent of approximately 2 more wiring layers while preventing wiring lengths from growing.
[Figure labels: Si substrate, TSV]
TSVs do not remove the wiring limitation. Their advantage lies rather in the 3D structure itself.
2. Trade-off between TSV aspect ratio and the intrinsic gettering (IG) layer
The intrinsic gettering layer is lost once the wafer thickness is 50 µm or less (in the via-last case).
[Figure labels: Si substrate, TSV, thinning edge, IG layer]
3. The known-good-die (KGD) issue is difficult to solve in wafer-to-wafer (W2W) stacking, so redundancy must be implemented.
[Figure label: failed die]
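The known-good-die problem can be illustrated with a simple yield estimate. This is a minimal sketch that assumes every layer has the same die yield, that failures are independent, and that there is no redundancy; the 90% die yield and the function name `w2w_stack_yield` are hypothetical.

```python
def w2w_stack_yield(die_yield, n_layers):
    """Yield of a wafer-to-wafer stack when dies cannot be pre-selected.

    Assumption: a stack works only if every die in it works, and die
    failures on different wafers are independent.
    """
    return die_yield ** n_layers


# Hypothetical 90% die yield on every wafer
for layers in (2, 4, 8):
    print(layers, "layers:", round(w2w_stack_yield(0.9, layers), 3))
```

Because the stack yield falls geometrically with the number of layers, either the dies must be tested before stacking or the circuit must tolerate failed dies through redundancy.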
4. Heat is difficult to handle in a structure with many stacked layers, so power saving is required.
[Figure: three stacked Si substrates with TSVs; the thermal energy is integrated through the stack]
5. The cost issue must be overcome by sufficiently effective functions.
6. Many other restrictions in process and design technology: complexity increases.
Summary of 3D-TSV restrictions
Each restriction or problem is followed by its task (red characters on the slide mark the current focus).
1. Lower area efficiency because active and 2D wiring area is wasted
   Task: find functions and performance that outweigh the TSV area penalty
2. Trade-off between TSV aspect ratio and loss of the IG layer
   Task: process improvements are now coming into view
3. Difficulty with known-good die
   Task: introduce W2C or C2C for production, or build in redundancy
4. Thermal limitation; the most important issue for 3D
   Task: choose power-saving circuits and systems; a fundamental approach is needed
5. Cost limitation
   Task: effective functions and performance that outweigh the cost
6. Complicated process and design methodology
   Task: simple process and easy design algorithm
Several solutions have been announced, but the trend still seems insufficient.
(1) Tile or small-block arrays connected through TSVs suit memory or image sensor systems, giving wide-band interconnection through several thousand TSVs.
(2) Cache DRAM placed facing the CPU provides a large cache while saving area.
(3) Stacked self-contained function blocks, including FPGA and core, make a scalable system with redundancy.
(4) A silicon interposer with TSVs gives higher-performance 2D wiring.
(5) A stacked memory module and a stacked module of many small cores are connected by a wiring module with diagnosis-restoration and dynamic reconfiguration. This is something of an ideal system; however, nothing has been specified yet.
[Figure labels: memory, diagnostic-restoration, many core; redundant memory; core CPU or bus controller; memory, FPGA, core; FPGA, Si interposer]
A small number of TSVs in each tile or small block would make the most effective structure.
However, tiles with different functions have different sizes and different connection requirements, so they cannot be stacked and interconnected efficiently.
A natural idea is therefore a unified circuit throughout the whole system, which lets us build the tile structure efficiently.
A neuron in our brain is a unified function that combines logical processing and memory. Can we make such a circuit from CMOS unit gates?
[Figure: neuron and axon network]
Dynamic reconfiguration algorithm using unified function blocks: efficient communication between neighboring blocks with high bandwidth and high processing rate.
[Figure: array of mats. Logic is surrounded by cache; the cache grows and shrinks depending on the cache hit ratio; when the job load increases, the logic expands and cache is added around the newly generated logic; multiple tasks share a cache.]
Unified circuit! It is easy to build with the following configuration. An SRAM can be changed to any function, even a wiring connection.
[Figure: the same SRAM block used for memory or for logic, changed by a mode selector]
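The idea of one SRAM block serving as memory, logic, or wiring can be sketched in a few lines. This is a behavioral toy model under stated assumptions, not the circuit from the slide: the class name `UnifiedCell`, the 256-word size, and the method names are all hypothetical.

```python
class UnifiedCell:
    """Toy model of one SRAM block whose role is set by a mode selector."""

    def __init__(self, words=256):
        self.sram = [0] * words
        self.mode = "memory"  # mode selector: "memory" or "logic"

    def write(self, addr, value):
        self.sram[addr] = value & 0xFF  # 8-bit words

    def read(self, addr):
        return self.sram[addr]

    def load_function(self, fn):
        """Fill the SRAM with a truth table and switch to logic mode."""
        for addr in range(len(self.sram)):
            self.write(addr, fn(addr))
        self.mode = "logic"

    def evaluate(self, inputs):
        """In logic mode the inputs drive the address lines (LUT lookup)."""
        if self.mode != "logic":
            raise RuntimeError("cell is not in logic mode")
        return self.sram[inputs]


cell = UnifiedCell()
cell.write(3, 42)                        # used as plain memory
cell.load_function(lambda x: x ^ 0xFF)   # same SRAM now acts as an inverter
inverted = cell.evaluate(0x0F)
cell.load_function(lambda x: x)          # identity table acts as a wire
passed = cell.evaluate(0x5A)
print(hex(inverted), hex(passed))
```

The same storage array covers all three roles; only its contents and the mode flag change, which is the property the unified circuit relies on.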
FPGA
○ Logic block: LUT (SRAM) and simple logic with a relatively small driver
○ Switching block: FF + switch
○ Connecting block: wiring
The above is not a true unified block, because it is composed of primitive logic plus additional memory (both are hard structures).
Toward a unified circuit (previous slide)
○ Logic block: SRAM with mode selector
○ Memory block: SRAM with mode selector
○ Switching block: SRAM
○ Basic cell connection (wiring): SRAM
Unified! However, implementing the switching block and wiring in SRAM is inefficient. We therefore choose the optimum basic cell size and cluster size:
○ Logic block: SRAM with mode selector and a relatively small driver
○ Memory block: SRAM with mode selector and a relatively small driver
○ Cluster connection: bus with driver (through TSVs)
[Figure: FPGA's basic cell, with logic block, connecting block, switching block (FF; switch states 0: off, 1: on), and I/O]
[Figure: LUT architecture of Xilinx Virtex-5; labels: CIN, BX, B1 to B6, 6-LUT, MUX, 5-LUT, 5-LUT, FF, BQ, B, BMUX, COUT]
A unified-like algorithm is already in use in FPGAs.
Now I introduce our memory-logic conjugate system.
SRAM-based 8-bit processor: an application of the Memory-Logic Conjugate System (MLCS) in its smallest model
Meisei University: Yoichi Sato, Kanji Otsuka
Hitachi ULSI Systems: Masahiro Yoshida
Outlook of the Memory-Logic Conjugate System (MLCS)
1. Solves the problem of low bandwidth between memory and logic (because the memory is the logic itself).
2. Effective architecture: dynamic reconfiguration can be done simply by rewriting registers (because the memory is the logic itself).
3. High-speed operation: the miscellaneous registers in a basic cell can be used for dynamic reconfiguration (the basic cell itself is programmable).
4. Suitable for 3D-TSV assembly, and scalable because it is built from small blocks.
5. Low power: no I/O circuits are needed between logic circuits and SRAMs, and access paths are saved.
Structure of Basic Cell
Simple operations can be programmed using the rich internal registers. Bus wiring can be routed over the memory area (about 70%), which saves area.
[Figure: basic cell block diagram. SRAM (LUT) of 256 words × 8 bits with R/W, CK, CE, DIN, D, and ADD (write) pins; input control circuit (mode change control and channel control); output control circuit (register, switch, etc. control; 4-bit REG × 8); mode set register; channel set register; control bus (CY etc.); sub control bus (8 bit). Legend: control signals (1 bit each); address and data buses (4 bit each, grouped as 4 bit × 4 or 4 bit × 2); outputs of the Route Configuration register or Mode register form the reconfiguration bus (4 bit each); write command bus.]
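As a concrete case of logic held in the 256-word × 8-bit SRAM, the sketch below fills such a table so that it behaves as a 4-bit adder. The address layout (operand A on the upper 4 address bits, operand B on the lower 4) and the function names are assumptions made for illustration; the slide does not specify them.

```python
def build_adder_lut():
    """Fill a 256-word x 8-bit table so that it acts as a 4-bit adder.

    Assumed address layout: upper 4 bits = operand A, lower 4 bits = operand B.
    Stored word: bits 0-3 = sum, bit 4 = carry out.
    """
    lut = [0] * 256
    for a in range(16):
        for b in range(16):
            lut[(a << 4) | b] = a + b  # at most 30, fits in 8 bits
    return lut


def lut_add(lut, a, b):
    """Read the adder result from the table: returns (sum, carry)."""
    word = lut[(a << 4) | b]
    return word & 0xF, (word >> 4) & 1


lut = build_adder_lut()
print(lut_add(lut, 3, 4))   # small operands, no carry
print(lut_add(lut, 9, 8))   # result exceeds 4 bits, carry is set
```

Rewriting the table contents turns the same cell into a different 8-input function, which is how a memory write can stand in for a logic change.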
Operation mode of basic cell (memory-logic conjugate cell)
Rich operation modes can construct flexible and variable systems.
[Figure: operation mode tree. S/R = "L" (reset mode): Through Access mode (= initial mode). S/R = "H": System mode; Memory mode (internal memory mode, external memory mode); Logic mode (arithmetic operation mode, combinational circuit mode, logic library mode (macro-cell)). For dynamic reconfiguration: Route Configuration Register mode (making the LUT), Information Update mode for the Route Configuration Register, and Route Configuration mode by Mode Register.]
Outlook of MLCS structure
Memory-Logic Conjugate System (MLCS): the total system, including some cluster memories. The cluster size is allocated to match the operation and the logic density.
[Figure: basic cell array of m rows × n columns with decoders, a control circuit plus bus I/F, and cluster memory on CX and CY; multiple buses carry data (8 bit × n), addresses, and clock plus control signals to other systems (including cluster memory). An address consists of the 8-bit memory address of a basic cell plus a q-bit extension address (the address space of the cluster memory).]
Actual design of a four-basic-cell configuration
[Figure: chip layout showing the four basic cells, the area for TSVs, and the memory (SRAM) for testing: 256 words × 8 bits × 4 cells]
Memory space of LSI and memory space of MLCS
Memory space is adjustable through the dynamic reconfiguration function.
[Figure: basic cells marked as memory mode or logic mode; each memory-mode cell contributes 256 words to the MLCS memory space; a bus switch and the channel set register connect the cells to cluster memories 1, 2, 3, ..., n]
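The adjustable memory space can be sketched as follows. The 256-word cell size and the 8-bit in-cell address come from the slides; treating the extension address as a simple cell index, and the function names used here, are assumptions for illustration only.

```python
WORDS_PER_CELL = 256  # one basic cell holds 256 words x 8 bits


def memory_space_words(cell_modes):
    """Total memory words offered by the cells currently in memory mode."""
    return WORDS_PER_CELL * sum(1 for mode in cell_modes if mode == "memory")


def split_address(address):
    """Split a flat address into (extension address, 8-bit in-cell address).

    Assumption: the extension address simply selects one memory-mode cell.
    """
    return address >> 8, address & 0xFF


# Four cells, one of them reconfigured as logic
modes = ["memory", "logic", "memory", "memory"]
print(memory_space_words(modes))
print(split_address(0x2A7))
```

Switching a cell between logic and memory mode changes the first result by 256 words, which is the sense in which the memory space follows the reconfiguration.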
Cluster memory layout example with a single 8-bit ALU
● Area is about 330 × 330 µm² in a 90 nm process (one cluster)
[Figure: basic cell array addressed by X and Y (00, 01, 10, 11), containing program memory (512 words × 8 bits), logical judgment circuit, instruction decoder, reserve part (decoder control), shifter (8 bit), decoder, and PC adder & 8-bit ALUs (one resource shared)]
(Note)
(1) Program counter: 16 bit. An address operation takes 1 cycle without overflow (using the 8-bit ALU) and 2 cycles in case of overflow.
(2) Structure of the 8-bit ALU: to enable 2-cycle 16-bit addition, a new type of adder with carry code input is introduced (it uses 4 basic cells).
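The two-cycle program counter update described in the note can be modeled behaviorally as below. This is only a sketch of the cycle counting; the function name `pc_add` and the restriction to an 8-bit offset are assumptions, and the carry-code adder itself is not modeled.

```python
def pc_add(pc, offset):
    """Add an 8-bit offset to a 16-bit program counter with an 8-bit ALU.

    Returns (new_pc, cycles). The low byte is added in the first cycle;
    a second cycle is spent on the high byte only when the low byte
    overflows.
    """
    low = (pc & 0xFF) + (offset & 0xFF)
    high = (pc >> 8) & 0xFF
    cycles = 1
    if low > 0xFF:               # carry out of the low byte
        high = (high + 1) & 0xFF
        cycles = 2
    return (high << 8) | (low & 0xFF), cycles


print(pc_add(0x1234, 1))   # no overflow in the low byte
print(pc_add(0x12FF, 1))   # low byte overflows, second cycle needed
```

Since most sequential fetches do not cross a 256-word boundary, the common case stays at one cycle.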
Performance comparison between pure logic and MLCS
Pure logic would be the best for processing; however, MLCS can also operate in dynamic reconfiguration mode and provide a memory function.
Area consumption for the same logic with different peripheral circuits
Area  | Pure logic | MLCS  | FPGA
Ratio | α + 1      | β + 7 | γ + 30
α, γ: constant size, designed with some allowance; β: dynamic size with minimum design
Power consumption for the same logic with one thread
Power          | Pure logic | MLCS | FPGA
Relative ratio | 1          | 2    | 20
Operation speed in processor mode
Band frequency | Pure logic** (8/32 bit) | MLCS (8 bit), non-parallel | MLCS (8 bit), four parallel* | MLCS (32 bit), non-parallel | MLCS (32 bit), four parallel*
Maximum        | 4 GHz                   | 1 GHz                      | 4 GHz                        | 1 GHz                       | 4 GHz
Mean rate      | ?                       | (1 GHz)                    | (3 GHz)                      | (1 GHz)                     | (4 GHz)
Note: * In the case of 50% independence between the four threads. ** One thread in pure logic, which is superior to the SRAM-based MLCS.
[Figure: four-way multi-thread processing; program command + data; rearrangement]
Configuring from cluster to mat structure controlled by synchronous clock
[Figure: a mat (unit processor element) made of four clusters; each cluster is a basic cell array with decoders, a control circuit, and cluster memory; the space between the clusters holds the wiring and TSVs that connect the clusters within a mat; the position of the clock supply is marked]
A clock-synchronous cube, which we call a mat
[Figure labels: cluster, sub-processor]
Master clock: asynchronous from mat to mat
Dynamic access from mat to mat uses an asynchronous clock, with dynamic reconfiguration.
A hit signal from the neighboring mat is carried by the header of a packet.
[Figure: clock timing image for synchronous and asynchronous operation]
Dynamic reconfiguration algorithm
Of course, a mat itself can dynamically set its number of registers depending on the requirement. A mat can also include penetrated caches inside.
Adjacent addressing keeps the latency within 1 clock inside the synchronous cube.
[Figure: array of mats. Logic is surrounded by cache; the cache grows and shrinks depending on the cache hit ratio; when the job load increases, the logic expands and cache is added around the newly generated logic; multiple tasks share a cache.]
Other approaches in technical papers
Memory-structured LUT presented by Masayuki Sato, RECONF Symposium, September 2006.
One idea introduced is a memory-based logic circuit in a random array with half-quadrate interconnection; however, memory is still consumed for interconnection and switching. Rearrangement of the unit tile is now being developed by Mr. Sato and Prof. Hironaka of Hiroshima City University.
The next significant issue is power saving. Is there a drastic power-saving method?
Yes, we have one idea.
$I = mv, \quad K = \tfrac{1}{2}mv^{2}$
[Figure: a moving body between start and stop; its energy is radiated as heat when it stops]
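Physics of power consumption
Power consumption of a unit circuit:
$$Q\,[\mathrm{C}] = (C_I + C_L + C_T)\cdot V_{dd}$$
$$P\,[\mathrm{W}] = \tfrac{1}{2}(C_I + C_L + C_T)\cdot V_{dd}^{2}$$
Voltage and current waveforms:
$$i_{max} = \frac{V_{dd}}{R_{on}}$$
$$v_r = V_{dd}\left(1 - \exp\left(-\frac{t}{R_{on}C_{sum}}\right)\right), \quad v_f = V_{dd}\exp\left(-\frac{t}{R_{on}C_{sum}}\right)$$
$$i_{ch} = i_{max}\exp\left(-\frac{t}{R_{on}C_{sum}}\right), \quad i_{dis} = i_{max}\left(1 - \exp\left(-\frac{t}{R_{on}C_{sum}}\right)\right)$$
[Figure: RC delay circuit with C_I, R_I, R_on, C_L, and C_T; voltage and current waveforms showing the on current and the off current. The off current is current that goes to waste.]
We should recover it.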
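The capacitive charge, the ½·C·Vdd² term, and the RC rise waveform on this slide can be evaluated numerically. The sketch below uses purely hypothetical component values (they are not taken from the presentation), and the function names are illustrative.

```python
import math


def charge_and_half_cv2(c_i, c_l, c_t, vdd):
    """Return (Q, 0.5*C*Vdd^2) for the total capacitance C_I + C_L + C_T."""
    c_total = c_i + c_l + c_t
    return c_total * vdd, 0.5 * c_total * vdd ** 2


def rise_voltage(vdd, r_on, c_sum, t):
    """Rising node voltage of an RC stage: Vdd * (1 - exp(-t / (R_on * C_sum)))."""
    return vdd * (1.0 - math.exp(-t / (r_on * c_sum)))


# Hypothetical values: 0.5 pF + 0.3 pF + 0.2 pF at Vdd = 1.8 V, R_on = 100 ohm
q, half_cv2 = charge_and_half_cv2(0.5e-12, 0.3e-12, 0.2e-12, 1.8)
tau = 100.0 * 1.0e-12
print(q, half_cv2)
print(rise_voltage(1.8, 100.0, 1.0e-12, tau))  # about 63% of Vdd after one time constant
```

Every switching event moves this charge through the on-resistance, so the stored ½·C·Vdd² is what a recovery scheme would try to reuse instead of dissipating.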
One solution can be found in the operation of an electric car.
$I = mv, \quad K = \tfrac{1}{2}mv^{2}$
[Figure: sports EV; the battery discharges to drive the car and is charged by braking]
[Figure: two cross-sections of a MOS transistor (N-type source S, gate G, N-type drain D in a P-type body). On: active carriers are in the conduction band. At 0 V: the carriers diffuse and shift to the valence band, recombining and generating heat; a depletion layer is shown.]
However, we would all think that a transistor cannot recover the energy of its active carriers. Is that true?
K computer, performance: 10 PFLOPS, currently the largest computer in the world
[Figure: power supply building]
Huge power!!
Method for recovering signal energy: active carriers reused in a differential CMOS circuit
The key structure is that the differential MOS transistors are positioned in the same well.
[Figure: layout of a differential pair of MOS transistors (drain, source, gate) in the same well, with input characteristic impedance Z0 = 100 Ω and output characteristic impedance Z0 = 100 Ω; labeled dimensions: 11.5 µm, 7.2 µm, 5 µm, 4.3 µm, 2 µm, and space 1 µm]
Method for recovering signal energy: active carriers reused in a differential CMOS I/O driver
The differential transistors are arranged in the same well.
[Figure: circuit with input ESD, inverter with current control, and output ESD; inputs INP and INN, outputs OUTP and OUTN, reference VRF, and VDD supplies; cross-section showing the IN-positive and IN-negative transistors sharing an N-well and a P-well on P_SUB]
Unit cell layout configuration
[Figure: unit cell layout with ESD, inverter, and ESD]
[Figure: paired switch in the same well, shown in the initial state, in the transient, and after inversion. Free carriers are forcibly released by the capacitance change and move to the other capacitance through the voltage sink; inductance limits the discharge when carriers are rejected through the source or drain.]
Assume a hole mobility of 4×10² cm²/Vs at 300 K for a carrier density of 10¹⁴ to 10¹⁵ cm⁻³, and Vdd = 1.8 V. This gives a drift coefficient D = 7.2×10² cm²/s. For a carrier travel length of 10 µm, 0.001 cm = √(Dt) = √(7.2×10²·t), so t = 1.3×10⁻⁹ s = 1.3 ns. This is long compared with our target pulse rise time of 100 ps (3 GHz equivalent). The electron travel time, however, is 130 ps, which is of the order of our rise time.
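The travel-time estimate above can be reproduced numerically. The sketch follows the slide's own relation, length = √(D·t) with D = mobility × Vdd; the function name `travel_time` is hypothetical, and only the hole case is computed because the slide does not state the electron mobility it used.

```python
def travel_time(length_cm, mobility_cm2_per_vs, vdd):
    """Carrier travel time from length = sqrt(D * t), with D = mobility * Vdd."""
    d = mobility_cm2_per_vs * vdd      # cm^2/s, as defined on the slide
    return length_cm ** 2 / d          # seconds


# Values from the slide: 10 um path, hole mobility 4e2 cm^2/Vs, Vdd = 1.8 V
t_hole = travel_time(1.0e-3, 4.0e2, 1.8)
print(f"hole travel time: {t_hole * 1e9:.2f} ns")
```

The result is about 1.4 ns, in line with the roughly 1.3 ns quoted on the slide, and more than ten times the 100 ps target rise time.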
Carrier reuse driver chip
Differential inverter current depending on frequency
The supply current is measured from the voltage drop across a 4.7 Ω series resistance.
[Figure: measurement setup. IC chip in a conventional 0.18 µm node CMOS process, flip-chip bonded; differential input line with Z0 = 100 Ω, 0.25 mm long; substrate wiring for the differential output, 8 mm long with Z0 = 100 Ω; 100 Ω terminator; differential probing; resistor R for current measurement on Vdd; Cip = 0.47 pF, Cin = 0.45 pF, Cwel = 1.56 pF]
[Figure: two plots of current [mA] versus frequency [GHz] on a logarithmic axis from 0.001 to 10 GHz, with vertical scales of 0 to 8 mA and 0 to 14 mA; curves: current calculated from capacitance, ohmic current, and current at Vdd = 1.8 V; a region of depressed swing height is marked, and the second plot is annotated "Reduction!!"]
We can save power with the carrier reuse circuit.
The DC current comes from the current control transistors and the clamping drivers, among others.
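The two current figures in this measurement can be related with a short calculation. The sketch assumes that the "calculated current by capacitance" follows the usual dynamic-current relation I = C·Vdd·f and that the three listed capacitances simply add; neither assumption is confirmed by the slide, and the function names and the example voltage drop are hypothetical.

```python
def current_from_drop(v_drop, r_series=4.7):
    """Supply current inferred from the voltage drop across the series resistor."""
    return v_drop / r_series


def capacitive_current(c_total, vdd, freq_hz):
    """Average current to charge c_total to vdd once per cycle (I = C * V * f)."""
    return c_total * vdd * freq_hz


# Assumed total: Cip + Cin + Cwel = 0.47 + 0.45 + 1.56 pF
c_total = (0.47 + 0.45 + 1.56) * 1e-12
for f_ghz in (0.1, 1.0, 3.0):
    i_ma = capacitive_current(c_total, 1.8, f_ghz * 1e9) * 1e3
    print(f"{f_ghz} GHz: {i_ma:.2f} mA")

print(current_from_drop(0.047))  # hypothetical 47 mV drop across 4.7 ohm
```

Comparing the capacitive estimate with the measured current at each frequency is the kind of check the plots make: a measured current below the capacitive line would indicate that part of the charge is being reused.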
Random-pulse eye patterns show high-speed operation even in the 0.18 µm process node.
Conditions: FR-4 substrate, transmission line = 100 Ω, ESD Z = 50 Ω, VCC = 1.8 V, termination = 100 Ω, input swing 1.8 V
[Figure: eye patterns at 8, 9, 10, 11, and 12 Gbps; setup labels: termination, probe point, 4 mm]
A more effective carrier reuse circuit structure is the double-gate fin type.
[Figure: two fin transistors (drain 1, gate 1, source 1; drain 2, gate 2, source 2) on an insulating layer; inset labels: gate, drain, source]
Power saving image for each device using the carrier reuse transistor circuit
Device                   | Function                    | Power saving ratio
(1) Pure logic           | ALU, peripheral I/O         | 15 to 30 %
(2) DRAM                 | Memory mat, addressing, I/O | 10 to 30 %
(3) SRAM                 | Memory mat, addressing, I/O | 25 to 45 %
(4) MLCS with small cell | M/L mat, addressing, I/O    | 30 to 50 %
[Figure column: relative power consumption level, initial versus carrier reuse]
Notes: applicable to all differential circuits. The MLCS level is lower than SRAM because of the small cell.
Summary for a solution
Each previously listed task is followed by its solution.
1. Find functions and performance that outweigh the TSV area penalty
   Solution: tile or small-block array structure through TSV interconnection
3. Build in redundancy
   Solution: unified circuit such as the memory-logic conjugate system
4. Choose power-saving circuits and systems
   Solution: carrier reuse transistor circuit
5. Effective functions and performance that outweigh the cost
   Solution: unified circuit such as the memory-logic conjugate system
6. Easy design algorithm
   Solution: unified circuit such as the memory-logic conjugate system
As in my presentation example, more fundamental physics and algorithm concepts should be developed for 3D structures with TSVs.