
How to make true 3D-TSV IC application
Spreading 3D-TSV IC technologies, but not followed by major applications

Kanji Otsuka
Collaborative Research Center, Meisei University

SEMICON Taiwan 2011

2

We have still not found a major application for the TSV interconnection structure.

• In our view, the main figure of merit of the TSV structure is that 3-D interconnection frees the design from 2-D restrictions.

• Is this figure of merit correct?

• We should re-examine this main figure of merit with a view to creating major applications.

3

2D interconnection

TSV

Waste of active and 2D wiring area

Even if we choose a TSV diameter of 2 um

Si substrate

1. TSV diameter: still very large for an interconnection.



4

Current technology: 6 or 10 metal layers

TSVs can provide the equivalent of approximately 2 more wiring layers while keeping the wiring length from growing

Si substrate

TSV

TSVs do little to relieve the wiring limitation. Their advantage lies rather in the 3D structure.

5

Si substrate

2. Trade-off issue between TSV aspect ratio and intrinsic gettering layer

The intrinsic gettering (IG) layer is lost when the wafer is thinned to 50 um or less.

TSV

Thinning edge IG Layer

In case of Via-last



6

Failure die

3. The Known-Good-Die issue is difficult to solve in W2W stacking, so redundancy must be implemented
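The Known-Good-Die problem can be illustrated with a short calculation. In wafer-to-wafer (W2W) stacking the dies cannot be sorted before bonding, so a stack works only if every layer is good. The sketch below assumes independent defects and the same yield on every layer; the function name and the 90 % yield are illustrative assumptions, not figures from the presentation.

```python
def stacked_yield(die_yield: float, layers: int) -> float:
    """Probability that all dies in a W2W stack are good.

    Assumes defects on different layers are independent and that every
    layer has the same die yield (both are simplifying assumptions).
    """
    if not 0.0 <= die_yield <= 1.0:
        raise ValueError("die_yield must be between 0 and 1")
    if layers < 1:
        raise ValueError("layers must be at least 1")
    return die_yield ** layers


if __name__ == "__main__":
    # Hypothetical 90 % die yield, stacks of 1 to 8 layers.
    for n in (1, 2, 4, 8):
        print(f"{n} layers: stack yield = {stacked_yield(0.9, n):.3f}")
```

The compound yield falls quickly with the number of layers, which is why redundancy or die-level sorting (W2C, C2C) becomes necessary.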


7

4. Thermal issues are difficult in structures with many stacked layers, so power saving is required

Si substrate

TSV

Si substrate

TSV

Si substrate

TSV

integrated thermal energy


8

5. An effective function must outweigh the cost issue

6. Many other restrictions in process and design technology: increasing complexity



9

Summary of 3D-TSV restrictions

Restriction and problem, with the corresponding task (red characters on the slide mark the current focus)

1. Lower area efficiency because active area and 2D wiring area are wasted
   Task: Find function and performance that outweigh the TSV area penalty

2. Trade-off between TSV aspect ratio and loss of the IG layer
   Task: Process improvements are now coming into view

3. Difficulty with Known-Good-Die
   Task: Introduce W2C or C2C for production, or build in redundancy

4. Thermal limitation; the most important issue for 3D
   Task: Choose power-saving circuits and systems; a fundamental approach is needed

5. Cost limitation
   Task: Effective function and performance that outweigh the cost

6. Complicated process and design methodology
   Task: Simple process and easy design algorithm


10

Several solutions have been announced, but the trend still seems insufficient.

(1) A tile or small-block array with TSV interconnection suits memory or image sensor systems that need wide-band interconnection through several thousand TSVs.

(2) A cache DRAM placed face to face with the CPU provides a large cache while saving area.

(3) Stacked closed function blocks, including FPGA and core, make a scalable system with redundancy.

(4) A silicon interposer with TSVs gives higher performance from 2D wiring.

(5) A stacked memory module and a stacked module of many small cores are connected through a diagnosis-restoration and dynamic-reconfiguration wiring module. This is close to an ideal system, but none has been specified so far.

[Figure: example stack configurations. Labels: memory, diagnostic-restoration, many core; redundant memory; core CPU or bus controller; memory, FPGA, core; FPGA, Si interposer]


11

A small number of TSVs in each tile or small block would make the most effective structure.

However, tiles with different functions have different sizes and different connection requirements, so they cannot be stacked and interconnected efficiently.

A natural idea is a circuit that is unified across the whole system. Then the tile structure can be made efficient.

A neuron in our brain is a unified function that combines logical processing and memory. Can we make such a circuit from CMOS unit gates?

[Figure: neuron and axon network]


12

Array of mat

Logic

Cache surrounding the logic

Increases and decreases depending on the cache hit ratio

Adding cache with newly generated logic

When job capacity increases

Expanding logic

Multi-task with shared cache

Dynamic reconfiguration algorithm by unified function block

Efficient communication between neighboring blocks with high bandwidth and high processing rate


13

For memory For logic

Unified circuit! It is easy to build with the following configuration. SRAM can be changed to any function, even a wiring connection.

Changed by mode selector


14

FPGA
○ Logic block: LUT (SRAM) and simple logic with a relatively small driver
○ Switching block: FF + switch
○ Connecting block: wiring
The above is not a true unified block: it is composed of primitive logic and additional memory (both are hard structures).

Toward a unified circuit (previous slide)
○ Logic block: SRAM with mode selector
○ Memory block: SRAM with mode selector
○ Switching block: SRAM
○ Basic cell connection (wiring): SRAM

Unified! However, building the switching block and the wiring from SRAM is inefficient. We therefore choose an optimum basic cell size and cluster size:
○ Logic block: SRAM with mode selector and a relatively small driver
○ Memory block: SRAM with mode selector and a relatively small driver
○ Cluster connection: bus with driver (through TSV)

[Figure: FPGA's basic cell. Labels: logic block, connecting block, switching block (FF; 0: off, 1: on), I/O]

[Figure: LUT architecture of Xilinx Virtex-5. Labels: CIN, BX, B1 to B6, 6-LUT, MUX, 5-LUT, 5-LUT, FF, BQ, B, BMUX, COUT]

A unified-like algorithm is already in use in current FPGAs.


15

Now I introduce our memory-logic conjugate system

SRAM based 8bit Processor: An application of Memory-Logic Conjugate System (MLCS) in Smallest model

Meisei University: Yoichi Sato, Kanji Otsuka
Hitachi ULSI Systems: Masahiro Yoshida


16

The Outlook of the Memory-Logic Conjugate System (MLCS)

1. Solves the problem of low bandwidth between memory and logic (because the memory is the logic itself).

2. Effective architecture: dynamic reconfiguration is done just by rewriting registers (because the memory is the logic itself).

3. High-speed operation: the various registers in a basic cell can be used through dynamic reconfiguration (the basic cell itself is programmable).

4. Suitable for 3D-TSV assembly, and scalable because it is built from small blocks.

5. Low power: no I/O circuits are needed between the logic circuits and the SRAMs, and access paths are saved.


17

Structure of Basic Cell

[Figure: block diagram of the basic cell. Blocks: SRAM (LUT) 256W×8bit with R/W, CK, CE, DIN, D and ADD (write) terminals; input control circuit (mode change control and channel control); output control circuit (register, switch, etc. control; 4bit REG ×8); mode set register; channel set register. Buses: control signals (1 bit each); address and data (4 bit each, drawn in 4bit×4 and 4bit×2 groups); control bus (CY etc.); sub control bus (8 bit); write command bus; reconfiguration bus carrying the outputs of the Route Configuration register or Mode register (4 bit each)]

Simple operations can be programmed by using the rich internal registers. Bus wiring can be routed over the memory area (about 70%), which saves area.
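The idea that one 256-word × 8-bit SRAM can act either as memory or as logic can be sketched in a few lines of Python. The model below is only an illustration: the class name `BasicCell`, the two mode names and the choice of a 4-bit adder as the look-up table contents are assumptions made for this example, not the actual MLCS circuit.

```python
class BasicCell:
    """Toy model of a 256-word x 8-bit SRAM used as memory or as a LUT."""

    WORDS = 256

    def __init__(self) -> None:
        self.sram = [0] * self.WORDS
        self.mode = "memory"  # hypothetical mode names: "memory" or "logic"

    def write(self, addr: int, value: int) -> None:
        """Store an 8-bit value at an 8-bit address."""
        self.sram[addr & 0xFF] = value & 0xFF

    def read(self, addr: int) -> int:
        """Return the 8-bit word at an 8-bit address."""
        return self.sram[addr & 0xFF]

    def load_adder_lut(self) -> None:
        """Rewrite the SRAM so that it behaves as a 4-bit + 4-bit adder.

        The address is formed from two 4-bit operands; the stored word is
        their sum (bit 4 of the word acts as the carry out).
        """
        for a in range(16):
            for b in range(16):
                self.sram[(a << 4) | b] = a + b
        self.mode = "logic"

    def evaluate(self, a: int, b: int) -> int:
        """In logic mode, a memory read is the logic operation."""
        if self.mode != "logic":
            raise RuntimeError("cell is not in logic mode")
        return self.read(((a & 0xF) << 4) | (b & 0xF))


if __name__ == "__main__":
    cell = BasicCell()
    cell.write(0x10, 0xAB)           # used as plain memory
    print(hex(cell.read(0x10)))
    cell.load_adder_lut()            # reconfigured by rewriting the contents
    print(cell.evaluate(9, 8))
```

Reconfiguring the cell is just a matter of rewriting its contents, which is the point made on the slide: the same array serves as storage in one mode and as a combinational function in another.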


18

Operation mode

Through Access mode (= initial mode)

System mode

Arithmetic operation mode

Combinational Circuit mode

Internal memory mode

External memory mode

S/R=“L”（reset mode）

S/R=“H”

Memory mode

Logic mode

External memory mode

Logic library mode (Macro-cell)

Operation mode of basic cell (Memory-logic conjugate cell)

Route Configuration Register mode (making LUT)

Information Update mode for Route Configuration Register

Route Configuration mode by Mode Register

Route Configuration Register mode (making LUT for dynamic reconfiguration)

The rich set of operation modes makes it possible to construct flexible and varied systems.

For dynamic reconfiguration


19

Outlook of MLCS structure

Memory-Logic Conjugate System (MLCS): total system including some cluster memories

[Figure: a cluster memory consists of a basic cell array (m rows × n columns of basic cells) with CX and CY decoders and a control circuit + bus I/F. Multiple buses connect it to other systems (including other cluster memories): addresses, clock + control signals, and data (8 bit × n). An address is made of the 8-bit memory address of a basic cell (B.C.) plus a q-bit extension address (the address space of the cluster memory)]

The cluster size is allocated to match the operation and the logic density.
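The addressing scheme in the figure (an 8-bit memory address inside a basic cell, extended by q bits that select within the cluster memory) can be sketched as a simple bit split. The function below is a hedged illustration: it assumes the q extension bits are the upper bits of the address, which the slide does not state, and the name `split_address` is hypothetical.

```python
from typing import Tuple


def split_address(address: int, q: int) -> Tuple[int, int]:
    """Split a (q + 8)-bit address into (extension, cell_address).

    Assumption for this sketch: the q-bit extension address occupies the
    upper bits and the 8-bit basic-cell memory address the lower bits.
    """
    if q < 0:
        raise ValueError("q must be non-negative")
    if not 0 <= address < (1 << (8 + q)):
        raise ValueError("address does not fit in q + 8 bits")
    extension = address >> 8      # selects a basic cell in the cluster memory
    cell_address = address & 0xFF  # selects one of 256 words in that cell
    return extension, cell_address


if __name__ == "__main__":
    ext, word = split_address(0x3A7, q=2)
    print(ext, hex(word))
```

With q extension bits, a cluster memory built this way spans 2^q basic cells of 256 words each.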


20

Actual design of four basic cell configuration

Four basic cell Area for TSVs

Memory (SRAM) for testing

256W x 8bit x 4cell


21

Memory space of LSI / Memory space of MLCS

[Figure: the MLCS memory space is built from 256-word (256w) basic cells, each in either memory mode or logic mode, grouped into cluster memory 1, cluster memory 2, cluster memory 3, ... cluster memory n. A bus switch for the memory space is set by the channel set register]

The memory space is adjustable through the dynamic reconfiguration function.


22

● Area is about 330 × 330 um² in a 90 nm process (one cluster)

[Figure: basic cell array addressed by X and Y (00, 01, 10, 11). Labels: program memory (512w×8b), logical judgment circuit, instruction decoder, shifter (8 bit), decoder, decoder control, basic cell, reserve part]

Note
(1) Program counter: 16 bit. An address operation takes 2 cycles when it overflows and 1 cycle otherwise (using the 8-bit ALU).
(2) Structure of the 8-bit ALU: to enable 2-cycle 16-bit addition, a new type of adder with carry code input is introduced (it uses 4 basic cells).

Cluster memory layout example in single 8 bit ALU

PC Adder & 8bit ALUs (one resource shared)
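The note about the 16-bit program counter sharing an 8-bit ALU can be illustrated with a small model: the low byte is added in the first cycle, and a second cycle is spent on the high byte only when the low byte overflows. This is a sketch of the general technique under that reading of the note; the function names and the cycle accounting are assumptions, not the actual MLCS adder with carry code input.

```python
from typing import Tuple


def add8(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """8-bit add with carry in; returns (8-bit sum, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + (carry_in & 1)
    return total & 0xFF, total >> 8


def pc_add(pc: int, offset: int) -> Tuple[int, int]:
    """Add an 8-bit offset to a 16-bit program counter with an 8-bit ALU.

    Returns (new_pc, cycles). One cycle if the low byte does not overflow,
    two cycles if the carry has to be propagated into the high byte.
    """
    low, carry = add8(pc & 0xFF, offset)
    high = (pc >> 8) & 0xFF
    cycles = 1
    if carry:
        high, _ = add8(high, 0, carry)  # second pass through the same ALU
        cycles = 2
    return (high << 8) | low, cycles


if __name__ == "__main__":
    print(pc_add(0x1234, 1))
    print(pc_add(0x12FF, 1))
```

Most increments stay within the low byte, so the shared 8-bit resource costs an extra cycle only on the comparatively rare overflow.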


23

Performance comparison between pure logic and MLCS

Operation speed of processor mode

Band frequency   Pure logic**   MLCS (8bit)                     MLCS (32bit)
                 (8/32bit)      Non-parallel   Four parallel*   Non-parallel   Four parallel*
Maximum          4GHz           1GHz           4GHz             1GHz           4GHz
Mean rate        ?              (1GHz)         (3GHz)           (1GHz)         (4GHz)

Note: * In the case of 50% independence between the four threads. ** One thread in pure logic, which is superior to the SRAM-based MLCS.

Power consumption on the same logic with one thread

Power            Pure logic     MLCS           FPGA
Relative ratio   1              2              20

Area consumption on the same logic with different peripheral circuit

Area             Pure logic     MLCS           FPGA
Ratio            α+1            β+7            γ+30

α, γ: constant size with some allowance design
β: dynamic size with minimum design

Pure logic would be the best for processing; however, MLCS can also operate in dynamic reconfiguration mode and as memory.

[Figure: four multi-thread processing; program command + data; rearrangement]


24

Configuring from cluster to mat structure controlled by synchronous clock

[Figure: a mat (unit processor element) composed of several clusters. Each cluster (basic cell array = cluster, i.e. cluster memory) has its own decoders and control circuit. The space between the clusters is used for wiring and TSVs that connect the clusters in a mat. The position of the clock supply is marked]





25

Clock-synchronous cube, which we call a mat

Cluster

Sub-processor

Master clock: asynchronous from mat to mat

Dynamic mat-to-mat access by asynchronous clock, with dynamic reconfiguration

Hit signal from a neighboring mat, carried in the header of a packet

Clock timing image for synchronous and asynchronous operation


26

Array of mat

Logic


Increases and decreases depending on the cache hit ratio

Adding cache with newly generated logic

When job capacity increases

Expanding logic

Multi-task with shared cache

Of course, the mat itself can dynamically set the number of registers as required. A mat can also include penetrated caches inside.

Dynamic reconfiguration algorithm

Adjacent addressing keeps the latency within 1 clock inside the synchronous cube


27


Memory structured LUT presented by Masayuki Sato, RECONF Symposium 2006.9

Other approach in technical papers.

One idea that has been introduced is a memory-based logic circuit with half quadrate interconnection in a random array; however, memory is still consumed for interconnection and switching. Rearrangement of the unit tile is now being developed by Mr. Sato and Prof. Hironaka of Hiroshima City University.

28


29

Next significant issue is power saving. Is there a drastic power saving method?

Yes, we have one idea.

$$I = mv, \qquad K = \tfrac{1}{2} m v^{2}$$

[Figure: start, stop, radiation of heat]


30

Physics of power consumption

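Power consumption on unit circuit

$$Q\,[\mathrm{C}] = (C_T + C_L + C_I)\cdot V_{dd}$$

$$P\,[\mathrm{W}] = \tfrac{1}{2}\,(C_T + C_L + C_I)\cdot V_{dd}^{2}$$

[Figure: voltage and current waveforms. The current that flows at each transition is marked "Current to waste". We should recover it.]

$$i_{\max} = \frac{V_{dd}}{R_{on}}$$

$$v_r = V_{dd}\left(1 - \exp\left(-\frac{t}{R_{on} C_{sum}}\right)\right), \qquad v_f = V_{dd}\exp\left(-\frac{t}{R_{on} C_{sum}}\right)$$

$$i_{ch} = i_{\max}\exp\left(-\frac{t}{R_{on} C_{sum}}\right), \qquad i_{dis} = i_{\max}\left(1 - \exp\left(-\frac{t}{R_{on} C_{sum}}\right)\right)$$

[Figure: RC delay circuit with R_I, R_on, C_I, C_L and C_T; on current and off current]

$$I = mv, \qquad K = \tfrac{1}{2} m v^{2}$$

[Figure: start, stop, radiation of heat]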
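The charge and energy involved in switching a capacitive load can be checked numerically. The sketch below uses the standard RC charging relations: total capacitance as the sum of three components, charge Q = C·Vdd, energy ½·C·Vdd² per transition, and an exponential rise with time constant R_on·C. All function names and component values are hypothetical and chosen only for illustration.

```python
import math


def total_capacitance(c_t: float, c_l: float, c_i: float) -> float:
    """Sum of the three capacitances seen by the driver (farads)."""
    return c_t + c_l + c_i


def switched_charge(c_sum: float, vdd: float) -> float:
    """Charge moved in one full-swing transition (coulombs)."""
    return c_sum * vdd


def switched_energy(c_sum: float, vdd: float) -> float:
    """Energy dissipated in the switch resistance per transition (joules)."""
    return 0.5 * c_sum * vdd ** 2


def rising_voltage(t: float, vdd: float, r_on: float, c_sum: float) -> float:
    """Output voltage while charging through R_on (volts)."""
    return vdd * (1.0 - math.exp(-t / (r_on * c_sum)))


def charging_current(t: float, vdd: float, r_on: float, c_sum: float) -> float:
    """Charging current, starting at Vdd / R_on and decaying (amperes)."""
    return (vdd / r_on) * math.exp(-t / (r_on * c_sum))


if __name__ == "__main__":
    # Hypothetical values: 1 fF, 2 fF and 3 fF, 1 kOhm on-resistance, 1.8 V.
    c = total_capacitance(1e-15, 2e-15, 3e-15)
    print(f"Q = {switched_charge(c, 1.8):.3e} C")
    print(f"E = {switched_energy(c, 1.8):.3e} J per transition")
    tau = 1e3 * c
    print(f"v(tau) = {rising_voltage(tau, 1.8, 1e3, c):.3f} V")
```

The energy per transition does not depend on R_on; it is all turned into heat in a conventional circuit, which is the loss the presentation proposes to recover.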


31

$$I = mv, \qquad K = \tfrac{1}{2} m v^{2}$$

One solution can be found in the operation of an electric car.

[Figure: sports EV with battery; discharge when driving, charge by brake]

[Figure: N-type / P-type / N-type MOS transistor with source (S), gate (G) and drain (D). First state: active carriers in the conduction band. At 0 V: carriers diffuse and shift to the valence band by association (recombination), generating heat; vacancy (depletion) layer]

However, we would all think that a transistor cannot recover the energy of the active carriers. Is that true?


32

K computer: performance 10 PFLOPS, currently the largest computer in the world

Power supply building

Huge power!!


33

Recovering signal energy method: active carriers reused in a differential CMOS circuit

The key structure is that the differential MOS transistors are positioned in the same well.

Input characteristic impedance Z0 = 100 Ω

Output characteristic impedance Z0 = 100 Ω

[Figure: layout of the differential MOS pair in the same well (drain, source, gate). Dimensions marked: 11.5 um, 7.2 um, 5 um, 4.3 um, 2 um; space 1 um]


34

Recovering signal energy method: active carriers reused in a differential CMOS I/O driver

Arrangement of differential transistors in the same well

[Figure: cross-section (P_SUB, N-well, P-well, p+ and n+ regions) and schematic of the driver: input ESD, inverter, output ESD and current control. Signals: IN-Positive (INP), IN-Negative (INN), OUTP, OUTN, VRF, VDD]


36

Unit cell layout configuration

[Figure: unit cell layout: ESD, inverter, ESD]


37


38

Forced release of carriers by capacitance change

Moving free carriers to the other capacitance by voltage sink

Discharge-limiting inductance at carrier rejection through source or drain

Paired switch in the same well

[Figure: three states: initial, transient, after inversion]

The conditions are a hole mobility of 4×10² cm²/Vs at 300 K for a carrier density of 10¹⁴ to 10¹⁵ cm⁻³, and Vdd = 1.8 V. The product of mobility and Vdd gives D = 7.2×10² cm²/s. For a carrier travel length of 10 μm, 0.001 cm = √(D·t) = √(7.2×10²·t), so t = 1.3×10⁻⁹ s = 1.3 ns, which is much longer than our target pulse rise time of 100 ps (3 GHz equivalent). The electron travel time, however, is 130 ps, which is of the order of our rise time.
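The travel-time estimate can be reproduced with a few lines of Python. The sketch follows the slide's own relation, length = √(D·t) with D taken as mobility × Vdd, so t = length² / (mobility × Vdd). The function names are hypothetical. The slide does not state the electron mobility it used, so the second function only reports the value that the quoted 130 ps would imply under the same relation.

```python
def transit_time(length_cm: float, mobility_cm2_per_vs: float, vdd: float) -> float:
    """Carrier travel time in seconds from length = sqrt(D * t), D = mobility * Vdd."""
    d = mobility_cm2_per_vs * vdd  # cm^2/s, the quantity called D on the slide
    return length_cm ** 2 / d


def implied_mobility(length_cm: float, time_s: float, vdd: float) -> float:
    """Mobility (cm^2/Vs) that would give the stated travel time."""
    return length_cm ** 2 / (time_s * vdd)


if __name__ == "__main__":
    length = 10e-4  # 10 um expressed in cm
    t_hole = transit_time(length, 4e2, 1.8)
    print(f"hole travel time: {t_hole * 1e9:.2f} ns")
    mu_e = implied_mobility(length, 130e-12, 1.8)
    print(f"mobility implied by 130 ps: {mu_e:.0f} cm^2/Vs")
```

The hole result is more than ten times the 100 ps target rise time, while the electron figure is comparable to it, which is the contrast the slide draws.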


39

Carrier reuse driver chip


40

Differential inverter current depending on frequency

Measurement setup: "0.18um node" conventional CMOS process; IC chip with flip chip bonding; differential input; Z0 = 100 ohm lines, 0.25 mm length; substrate wiring length for differential output 8 mm, Z0 = 100 Ω; terminator 100 ohm with differential probing. Power current is measured from the voltage drop across a 4.7 ohm series resistance (R for current measurement) on Vdd. Capacitances marked: Cip = 0.47 pF (twice), Cin = 0.45 pF (twice), Cwel = 1.56 pF.

[Figure: two plots of current [mA] against frequency [GHz], frequency axis from 0.001 to 10 GHz, current axes from 0 to 8 mA and from 0 to 14 mA. Curves: calculation current by capacitance, ohmic current, current at Vdd 1.8 V. A depressed swing height region is marked, and the measured current is annotated "Reduction!!". Annotation: DC current by current control transistors and clamping drivers on others]

We can save power with the carrier reuse circuit.


41

Random pulse eye patterns show high speed even in the 0.18 um process node.

Conditions: FR-4 substrate; transmission line = 100 Ω; ESD Z = 50 Ω; VCC = 1.8 V; termination = 100 Ω; input swing 1.8 V.

[Figure: eye patterns at 8 Gbps, 9 Gbps, 10 Gbps, 11 Gbps and 12 Gbps; termination and probe point marked, 4 mm]


42

A more effective carrier reuse circuit structure is the double-gate fin type.

[Figure: double-gate fin structure. Labels: Drain 1, Gate 1, Source 1, Drain 2, Gate 2, Source 2, insulating layer; gate, drain, source]


43

Power saving image in each device used by carrier reuse transistor circuit

Device                     Function                        Power saving ratio
(1) Pure logic             ALU, peripheral I/O             15 to 30 %
(2) DRAM                   Memory mat, addressing, I/O     10 to 30 %
(3) SRAM                   Memory mat, addressing, I/O     25 to 45 %
(4) MLCS with small cell   M/L mat, addressing, I/O        30 to 50 %

[Figure column: relative power consumption level, initial versus carrier reuse]

Annotations: applicable to all differential circuits; less than SRAM due to small cell.


44

Summary for a solution

Previously listed task, with its solution

1. Find function and performance beyond the TSV area penalty
   Solution: Tile or small block array structure through TSV interconnection

3. Build in redundancy
   Solution: Unified circuit such as the memory-logic conjugate system

4. Choose a power saving circuit and system
   Solution: Carrier reuse transistor circuit

5. Effective function and performance that outweigh the cost
   Solution: Unified circuit such as the memory-logic conjugate system

6. Easy design algorithm
   Solution: Unified circuit such as the memory-logic conjugate system

As in the example in my presentation, more fundamental physics and algorithm concepts should be developed for 3D structures with TSVs.
