A Mixed Mode Self-Programming Neural System-on-Chip for Real-Time Applications · 2001-10-17
[Figure: Hierarchical Structure of the Chip (analog neural processing layer; digital storage, processing and control layer; digital supervisory & multiplexing layer; per-cell RAM; chip-level decoder and MUX/DEMUX; analog/digital inputs and outputs)]
A Mixed Mode Self-Programming Neural System-on-Chip for Real-Time Applications
Khurram Waheed and Fathi M. Salam
Department of Electrical and Computer Engineering, Michigan State University
East Lansing, MI 48824-1226
Abstract
The paper provides an overview of the development of a self-learning computing chip in the new 0.18 micron copper technology. The chip realizes an architecture that achieves self-learning with execution times in the micro- to milliseconds. The core consists of basic building blocks of 4-quadrant multipliers, trans-conductance amplifiers, and active load resistances for analog (forward-)network processing and learning modules. Superimposed on the processing network are digital memory and control modules composed of D-flip-flops, an ADC, a Multiplying D/A Converter (MDAC), and comparators for parameter (weight) storage, logical control, and analog/digital conversions.
The single System-on-Chip design impacts several domains of critical applications, including nano-scale biotechnology; automotive sensing, control, and actuation; wireless communications; and image feature extraction and pattern matching.
1. Introduction
The core of the chip is a neurally inspired scalable (re-configurable) array network designed for compatibility with VLSI. The chip is endowed with tested auto-learning capability, realized in hardware, to achieve global task auto-learning execution times in the micro- to milliseconds.
The architectural forward network (and learning modules) process in analog continuous-time mode, while the (converged, steady-state) weights/parameters can be stored on chip in digital form. The overall architectural design adopts engineering methods from adaptive networks and optimization principles [1,2].
The designed chip can handle 16 inputs and 16 outputs. In addition, there are inputs for control interface, synchronization, and stand-alone programmability of the chip, resulting in an approximate die area of 4000 × 6000 µm² in a QFP-208L package.
The 6-layer Copper (Cu) interconnect, single-poly, 0.18 micron process enables the dense connectivity and dense die area of this highly interconnected network, resulting in a compact, powerful engine. Moreover, the special low-resistance and low-capacitance electrical properties of copper permit the design to achieve high connectivity while still managing precise distributions of resistive and capacitive loads. These properties enable one to predict performance and limit signal time-delays along the interconnect. The small feature size and the electrical interconnect properties of copper are enablers for the realization of such a powerful chip with dense interconnectivity.
Figure 1: Architectural Overview of the Chip
The chip operates in four different modes: (i) learn, (ii) (on-chip) store, (iii) program read/write, and (iv) process, which selectively combine its intrinsic analog and digital building blocks in a novel manner (see fig. 1).
Initially, the system-level chip design was simulated and verified using SIMULINK/MATLAB. All the building blocks were custom designed and extensively simulated using HSPICE (incorporating the UMC Level 49 models). The design was implemented in the 6-level Copper (Cu) interconnect, single-poly, 0.18 micron process, which is an enabler for the dense connectivity and dense die area of this highly interconnected network. The design was laid out and verified using Cadence Tools. More details of the high-level design, circuit design of major blocks, and chip layout are provided in the subsequent sections [8,9].
The resulting chip design requires no traditional programming or coding [2,3]. In addition to the novel architectural design, the hardware also carries the heavy computational burden by selectively realizing programmability as on-chip auto-learning modules. The resulting System-on-Chip operates on a 1.5V power source and consumes approximately 1 mW of power.
2. Architectural Design
The design process comprised consecutive stages, based on a top-down definition of the chip. A general definition of the functionality and intended applications was created, and the development of the chip design was carried out at three different levels:
• A high-level design, specifying the characteristics of the neural network to be implemented and the definition of its basic building blocks.
• A circuit-level design, describing each of these blocks based on the copper technology, with their corresponding simulations.
• Finally, a layout-level design, where the actual chip layout is created and verified.
3. High Level Design: Modified BP Algorithm
We present our stepwise approach for tailoring the BP algorithm so that it becomes suitable for VLSI implementation. For illustration, in each case we present the simulated results of the XOR problem, highlighting the effect of each modification on the performance of the algorithm.
[Figure 2: Simulink Model for Chip Simulations (BP neural network model with nonlinear multiplier and removal of the derivative function in all hidden layers: signal feedforward process for training and testing, error feedback process, weight update process and weight memory, layer weight blocks w12, w23, w34, train/test data streams, and manual/automatic stop-training conditions)]
For our simulations, we constructed a four-layer neural network: an input layer, two hidden layers, and an output layer. The nonlinear mapping function is used only by the neurons in the two hidden layers. For the XOR simulation we are illustrating, there are two input neurons and one output neuron.
The update law in each case is derived from the mathematical model for a multi-layer feed-forward neural network. In the equations below,
$\dot{W}^{(3)}$ represents the time derivative of the weights for the output layer,
$\dot{W}^{(2)}$ represents the time derivative of the weights for the 2nd hidden layer, and
$\dot{W}^{(1)}$ represents the time derivative of the weights for the 1st hidden layer.
3.1 Modified BP algorithm with linear multiplier.
Using the weight update rule
$$\dot{W} = -\eta \cdot \frac{\partial E}{\partial W} - \alpha \cdot W, \quad \text{where } NET^{(i)} = W^{(i)} \cdot X^{(i-1)}$$
the update laws are
$$\dot{W}^{(3)} = \eta \cdot (D - Y) \cdot X^{(3)} - \alpha \cdot W^{(3)}$$
$$\dot{W}^{(2)} = \eta \cdot \psi'(NET^{(2)}) \cdot \big((D - Y) \cdot W^{(3)}\big) \cdot X^{(2)} - \alpha \cdot W^{(2)}$$
$$\dot{W}^{(1)} = \eta \cdot \psi'(NET^{(1)}) \cdot \Big(\psi'(NET^{(2)}) \cdot \big((D - Y) \cdot W^{(3)}\big) \cdot W^{(2)}\Big) \cdot X^{(1)} - \alpha \cdot W^{(1)}$$
Figure 3: Training Error and convergence for the modified BP Algorithm
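The continuous-time update laws of Section 3.1 can be exercised numerically. The following is a minimal Euler-discretized sketch in Python/NumPy applied to the XOR problem; the hidden-layer sizes, gains η and α, step size, and bipolar data encoding are illustrative assumptions, not the chip's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = np.tanh
def dpsi(v):
    return 1.0 - np.tanh(v) ** 2

# XOR training set in bipolar form (inputs/targets in {-1, +1}) - assumed encoding
X0 = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]).T  # 2 x 4
D = np.array([[-1.0, 1.0, 1.0, -1.0]])                                  # 1 x 4

# 2 inputs -> two tanh hidden layers (3 neurons each, assumed) -> 1 linear output
W1 = 0.5 * rng.standard_normal((3, 2))
W2 = 0.5 * rng.standard_normal((3, 3))
W3 = 0.5 * rng.standard_normal((1, 3))

eta, alpha, dt = 2.0, 1e-3, 0.05  # learning gain, weight leak, Euler step (assumed)

def forward(W1, W2, W3, X):
    n1 = W1 @ X;  x1 = psi(n1)    # NET(1), X(1)
    n2 = W2 @ x1; x2 = psi(n2)    # NET(2), X(2)
    return n1, x1, n2, x2, W3 @ x2  # linear output layer

err0 = float(np.mean((D - forward(W1, W2, W3, X0)[4]) ** 2))

for _ in range(4000):
    n1, x1, n2, x2, y = forward(W1, W2, W3, X0)
    d3 = D - y                      # (D - Y)
    d2 = dpsi(n2) * (W3.T @ d3)     # psi'(NET(2)) . ((D - Y) W(3))
    d1 = dpsi(n1) * (W2.T @ d2)     # psi'(NET(1)) . (... W(2))
    # Euler step of  dW/dt = eta * (error term) * X - alpha * W
    W3 += dt * (eta * d3 @ x2.T - alpha * W3)
    W2 += dt * (eta * d2 @ x1.T - alpha * W2)
    W1 += dt * (eta * d1 @ X0.T - alpha * W1)

errT = float(np.mean((D - forward(W1, W2, W3, X0)[4]) ** 2))
print(round(err0, 4), round(errT, 4))
```

The leak term −αW corresponds to the weight-decay component of the update rule; with α small it mainly bounds the weights without dominating the gradient term.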
3.2 Modified BP algorithm with nonlinear multiplier.
This update law and the subsequent law consider the effect of the presence of resistive elements in hardware circuits and of multiplication using a Gilbert multiplier. Each case uses the following update law:
$$\dot{W} = -\eta \cdot \frac{\partial E}{\partial W} - \alpha \cdot W, \quad \text{where } NET^{(i)} = \tanh(W^{(i)}) \cdot \tanh(X^{(i-1)})$$
Figure 4: Training Error and convergence for the modified BP Algorithm with non-linear multiplier
The update laws are
$$\dot{W}^{(3)} = \eta \cdot \tanh(D - Y) \cdot \tanh(X^{(3)}) - \alpha \cdot W^{(3)}$$
$$\dot{W}^{(2)} = \eta \cdot \tanh\!\big(\tanh(\psi'(NET^{(2)})) \cdot \tanh(\tanh(D - Y) \cdot \tanh(W^{(3)}))\big) \cdot \tanh(X^{(2)}) - \alpha \cdot W^{(2)}$$
$$\dot{W}^{(1)} = \eta \cdot \tanh\!\Big(\tanh(\psi'(NET^{(1)})) \cdot \tanh\!\big(\tanh(\tanh(\psi'(NET^{(2)})) \cdot \tanh(\tanh(D - Y) \cdot \tanh(W^{(3)}))) \cdot \tanh(W^{(2)})\big)\Big) \cdot \tanh(X^{(1)}) - \alpha \cdot W^{(1)}$$
Here each analog product is realized by the nonlinear multiplier, which compresses both operands through tanh.
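The qualitative effect of the nonlinear multiplier can be seen with a tiny behavioral model: treat each analog product a·b as tanh(a)·tanh(b), which is near-linear for small operands and saturates (boundedly) for large ones. This is a plain Python illustration of that behavior, not a circuit-level model of the fabricated cell.

```python
import math

def nonlinear_mult(a, b):
    """Behavioral model of the multiplier: both operands are
    compressed through tanh before being multiplied."""
    return math.tanh(a) * math.tanh(b)

small = nonlinear_mult(0.1, 0.1)   # close to the ideal product 0.01
large = nonlinear_mult(5.0, 5.0)   # saturates: magnitude bounded by 1.0
print(small, large)
```

The bounded output is why the non-linearity "does not cause any instability" in the large-signal range, while small-signal operation stays close to ideal multiplication.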
3.3 The Modified BP algorithm with nonlinear multipliers and the removal of the derivative function in all the hidden layers.
The update laws are
$$\dot{W}^{(3)} = \eta \cdot \tanh(D - Y) \cdot \tanh(X^{(3)}) - \alpha \cdot W^{(3)}$$
$$\dot{W}^{(2)} = \eta \cdot \tanh\!\big(\tanh(D - Y) \cdot \tanh(W^{(3)})\big) \cdot \tanh(X^{(2)}) - \alpha \cdot W^{(2)}$$
$$\dot{W}^{(1)} = \eta \cdot \tanh\!\Big(\tanh\!\big(\tanh(D - Y) \cdot \tanh(W^{(3)})\big) \cdot \tanh(W^{(2)})\Big) \cdot \tanh(X^{(1)}) - \alpha \cdot W^{(1)}$$
Figure 5: Training Error and convergence for the modified BP Algorithm with removal of the derivative function in all hidden layers
Presented in fig. 6 are the input/output waveforms for the network trained using the above final update rule.
[Figure 6: Input/Output waveforms for the final network (Test Input 1, Test Input 2, Test Output)]
3.4 Summary of the High-level Simulation Results
From the high-level Matlab/Simulink simulations, one can draw the following conclusions:
1) In replacing the ideal linear multiplier model with the realistic nonlinear multiplier model, the neural network still converges.
2) Removing the derivative function of the second hidden layer, the neural network could still converge. In fact, this can easily be verified mathematically.
3) When the derivative functions in all hidden layers are removed, the neural network could still converge, but in this case the training error is not zero: it attains a small constant mean value. The update law derived in this case is still a gradient-type law [7].
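Conclusions 2 and 3 rest in part on the fact that the omitted derivative ψ′(·) = 1 − tanh²(·) is strictly positive, so dropping it rescales each error term but never flips its sign, which is what lets the modified law retain a gradient-like descent character. A quick numerical check (illustrative only):

```python
import numpy as np

# derivative of the tanh nonlinearity over a wide operating range
v = np.linspace(-5.0, 5.0, 1001)
dpsi = 1.0 - np.tanh(v) ** 2

# strictly positive everywhere, with maximum 1.0 at v = 0
print(float(dpsi.min()), float(dpsi.max()))
```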
For all presented simulation results, we used the same set of initial conditions for all the models, i.e., all initial conditions at zero. In general, for different nonlinear systems, the same initial conditions may produce different training results. An initial weight set which yields a good training result for one nonlinear system will not necessarily yield a good training result for another nonlinear system.
4. Conceptual Chip Design
Presented below is a higher block-level design of the chip. For more details, see [1,9].
4.1 The Synaptic Cell:
The initial and highest-level design of each neural cell shows the central idea of the processing network and the learning network (Fig. 7).
The processing stage is composed of 16 neurons built using vector multipliers and a sigmoid function. The multipliers use as operands an input vector and a weight vector. The input is common to all processing units, while the weights belong to each neuron. The scalar product is then applied to the non-linear function, resulting in the output of a neuron.
The learning stage works in a similar manner to the processing stage, but uses different sources for the product. It receives the signals from the next stage, creates a new signal to be sent to any previous cell, and updates the weights according to the update law. The stages of the learning and processing networks are merged locally. To accomplish this, it was necessary to decompose the 17-D multipliers that constitute each network node into a set of 1-D multipliers.
On-chip memory is designed as local digital memory. It is therefore necessary to add a stage where the current analog value of the weight is converted into a digital value using an ADC, and then converted back using a DAC. The memory is built using 5 data flip-flops. The update law, however, uses a capacitor (see Fig. 7) and 1-dimensional (1-D) multipliers. These multipliers are also used in each neuron, to form the 17-dimensional (17-D) multipliers.

[Figure 7: Architectural Block Diagram - Synapse (17-D vector multipliers decomposed into 1-D weight multipliers; ADC/MDAC weight conversion with flip-flop memory; learn/process switching; MSB/LSB buses; transconductance-amplifier output stages)]
To optimize the number of ADCs required for the conversion of the weights and still achieve good performance, an array of ADCs was designed away from the neural network. With this new configuration, one ADC can be shared by a whole row of weights, reducing the number of ADCs to n. This design uses multiplexers, decoders, and control logic for the store mode, and requires a clocked input to drive this logic. This clock also regulates the ADC operation, as the ADC is designed to be of the successive-approximation type. Note that having a clock in this section does not imply that the neural network stops being asynchronous. For a more detailed review of the chip architecture, refer to [1,8,9].
5. Design and Layout of Components
There are a number of custom-designed components for this chip. All the component circuits were designed in Star-Hspice using BSIM Level-49 models supplied by SRC/UMC. Avant!'s software was used for schematic entry and waveform viewing. Initial layout of the sub-circuits was carried out in Tanner Tools, but the verification and LVS were performed using Cadence Tools. In this paper we restrict ourselves to presenting results on the more vital components of the chip: a Gilbert multiplier, a wide transconductance amplifier, and a comparator. In addition, demonstrative simulations for the ADC and the vector multiplier are also provided.
5.1 Gilbert Multiplier
To implement the multiplication in the analog domain, a Gilbert multiplier cell has been employed. The circuit diagram of the modified Gilbert multiplier is shown in fig. 8. Assume that all transistors in fig. 8 are in the saturation region and are matched, so that the trans-conductance parameters satisfy
$$\beta_N = \beta_{M1} = \beta_{M2} \quad \text{and} \quad \beta_P = \beta_8 = \beta_9 = \beta_{10} = \beta_{11}$$
Figure 8: Gilbert Multiplier - Schematics
The output current is then the difference between $I_{D(M13)}$ and $I_{D(M14)}$, since the currents $I_{S(M16)}$ and $I_{S(M17)}$ are reflected by the current mirrors.
Defining the output currents
$$I_+ = I_{S(M8)} + I_{S(M10)}, \qquad I_- = I_{S(M9)} + I_{S(M11)}$$
it can readily be shown that the ideal characteristic of the differential output current $I_{DIFF} = I_+ - I_-$ is given by
$$I_{DIFF} = \sqrt{\beta_P \beta_N}\,(V_3 - V_4)(V_1 - V_2)$$
The modified Gilbert multiplier takes the difference between two voltages (V3 − V4) and multiplies that difference by the difference of two other voltages (V1 − V2). In the small-signal range, the characteristic curve is approximately linear, with all four inputs carrying multiplication information. In the large-signal range, the multiplier is non-linear but does not cause any instability.
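As a sanity check of the ideal characteristic, the sign of I_DIFF follows the product of the two differential inputs, which is exactly four-quadrant operation. A small numeric illustration (the β values are arbitrary placeholders, not the fabricated device parameters):

```python
import math

beta_P = beta_N = 50e-6  # assumed A/V^2, for illustration only

def i_diff(v1, v2, v3, v4):
    # I_DIFF = sqrt(beta_P * beta_N) * (V3 - V4) * (V1 - V2)
    return math.sqrt(beta_P * beta_N) * (v3 - v4) * (v1 - v2)

quadrants = [
    i_diff(0.1, 0.0, 0.1, 0.0),  # (+, +) -> positive output
    i_diff(0.0, 0.1, 0.1, 0.0),  # (-, +) -> negative output
    i_diff(0.1, 0.0, 0.0, 0.1),  # (+, -) -> negative output
    i_diff(0.0, 0.1, 0.0, 0.1),  # (-, -) -> positive output
]
print([q > 0 for q in quadrants])
```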
Figure 9: Gilbert Multiplier - Layout
The multiplier layout is shown in fig. 9, while the HSPICE-simulated DC characteristics are shown in fig. 10, demonstrating four-quadrant multiplication.
Figure 10: Gilbert Multiplier - DC Characteristics
5.2 Wide Transconductance Amplifier
The wide-transconductance amplifier is used in multiple modes in this chip. It serves as the sigmoid non-linearity at the end of each row, and as a buffer for in-chip signal buffering.
For the transconductance amplifier, the differential-in, differential-out transconductance is given by
$$g_{md} = \frac{\partial I_{out}}{\partial V_{ID}} = \sqrt{K_1 \frac{W}{L} I_{D3}} = \sqrt{\beta_1 I_{D3}} \quad (\text{at } V_{ID} = 0)$$
The wide transconductance amplifier was preferred over the simple transconductance amplifier for its better characteristics with respect to:
• transistor size/current mismatch, and hence common-mode gain
• input common-mode voltage range
• input/output voltage swing
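To make the transconductance expression concrete, a back-of-the-envelope evaluation; the process constant, aspect ratio, and bias current below are assumed placeholder values, not the chip's actual design parameters.

```python
import math

K1 = 50e-6        # assumed trans-conductance process constant, A/V^2
W_over_L = 2.0    # assumed device aspect ratio W/L
I_D3 = 10e-6      # assumed bias current I_D3, A

beta1 = K1 * W_over_L
g_md = math.sqrt(beta1 * I_D3)   # differential transconductance at V_ID = 0
print(round(g_md * 1e6, 2), "uS")
```

Because g_md grows only as the square root of the bias current, doubling I_D3 buys roughly a 1.4x increase in transconductance.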
The designed amplifier was obtained by adding two extra current mirrors to the simple transconductance amplifier. By reflecting the currents of M1 and M2 to the upper current mirrors, the output current is simply the difference between I1 and I2, with the advantage that both input and output voltages can swing almost up to Vdd and almost down to Vss without affecting the operation of the circuit.
The output current, Iout, in the schematic (fig. 11) is converted to a voltage using a 2-CMOS active load (not shown in the schematic). The complete layout is shown in fig. 12, with the simulated transfer characteristics shown in fig. 13.
Figure 11: Wide Transamp - Schematics
Figure 12: Wide Transamp - Layout
Figure 13: Wide Transamp - DC Characteristics
5.3 Comparator
The comparator is used in the ADC to compare the Multiplying DAC output voltage with the actual input voltage to be converted. The input stage of the comparator is a differential amplifier, and the next stage is a decision circuit. The last stage is an inverter used as a thresholding/polarity-correction circuit. The schematic is shown in fig. 14, the layout in fig. 15, and the characteristic waveforms in fig. 16.
Figure 14: Threshold Comparator - Schematics
Figure 15: Threshold Comparator - Layout
Figure 16: Threshold Comparator - Characteristics
5.4 ADC Operation
The simulation results below illustrate the conversion of an analog voltage of 0.85V to its digital equivalent by the successive-approximation ADC internal to the chip. In this simulation, the conversion takes approximately 3µs, implying that all the converged analog weights will be converted to their digital equivalents in approximately 3µs × 17 = 51µs.
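The successive-approximation loop can be summarized behaviorally: one bit is decided per clock, MSB first, by comparing the MDAC output against the input. This Python sketch assumes a 5-bit code (matching the 5 flip-flops per weight mentioned in Section 4) and a 1.5 V full-scale reference (the supply voltage); both are illustrative assumptions, not verified chip specifications.

```python
VREF, BITS = 1.5, 5   # assumed full-scale reference and resolution

def sar_convert(vin, vref=VREF, bits=BITS):
    """Successive approximation, MSB first: the comparator keeps a
    trial bit whenever the MDAC output does not exceed the input."""
    code = 0
    for b in range(bits - 1, -1, -1):
        trial = code | (1 << b)
        if trial * vref / (1 << bits) <= vin:   # comparator decision
            code = trial
    return code

code = sar_convert(0.85)
print(code, code * VREF / (1 << BITS))   # digital code and its analog value
```

Five comparator decisions per conversion is what makes the conversion time scale linearly with the resolution rather than exponentially, consistent with the microsecond-scale conversion reported above.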
Figure 17: ADC Convergence Characteristics
5.5 Vector Multiplications
Presented below are the transient characteristics of the multiplication of two time-domain sinusoids. The third waveform presents the result: the cascaded output currents of seventeen multipliers collected in current busbars and converted to the voltage domain at the end using a 4-CMOS active load.
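Functionally, the seventeen 1-D multiplier outputs summed on the current busbar realize a 17-element dot product. A small behavioral check, reusing the tanh-compressed multiplier model from Section 3.2 with random small-signal operands (the signal range is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-0.1, 0.1, 17)   # 16 weights plus a threshold term (assumed range)
x = rng.uniform(-0.1, 0.1, 17)   # 16 inputs plus a bias input (assumed range)

# each 1-D multiplier contributes tanh(w_i) * tanh(x_i);
# the current busbar sums all seventeen contributions
i_bus = float(np.sum(np.tanh(w) * np.tanh(x)))
ideal = float(w @ x)
print(abs(i_bus - ideal))   # small in the near-linear small-signal range
```

Summing currents on a shared busbar is what lets the 17-D multiplication finish in a single analog settling time instead of seventeen sequential operations.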
Figure 18: Vector Signal Multiplication - Transient
6. Overall Chip Layout & Interconnects
The figure below shows the array structure of the implemented chip. The best possible fit of the building sub-cells was sought in order to achieve a highly dense building-block structure. A hierarchical routing assignment was made for the available 6 metal layers to achieve the required dense connectivity. For more details see [1].
Figure 19: Layout for the Building Block
7. References
[1] Khurram Waheed and Fathi M. Salam, "A Mixed-Mode Design for a Self-Programming Chip for Real-Time Estimation, Prediction, and Control," Proc. of the 43rd IEEE Midwest Symposium on Circuits and Systems, Aug. 8-11, 2000, pp. 810-813.
[2] Gert Cauwenberghs and M. Bayoumi (editors), Learning on Silicon: Adaptive VLSI Neural Systems, Kluwer Academic Publishers, July 1999.
[3] F. M. Salam and H-J. Oh, "Design of a Temporal Learning Chip for Signal Generation and Classification," Analog Integrated Circuits and Signal Processing, Kluwer Academic Publishers, Vol. 18, No. 2/3, February 1999, pp. 229-242.
[4] M. Ahmadi and F. Salam, Special Issue on Digital and Analog Arrays, International Journal on Circuits, Systems, and Computers, October/December 1998 (issue published in December 1999).
[5] F. M. Salam and M. R. Choi, "An All-MOS Analog Feedforward Neural Circuit with Learning," IEEE Int'l Symp. on Circuits and Systems (ISCAS), May 1990, pp. 2508-2511.
[6] MSU Team, Copper IC Design Challenge, Phase I Report, January 2000.
[7] MSU Team, Copper IC Design Challenge, Phase II Report, August 2000.
[8] Website: http://www.egr.msu.edu/annweb/cu_contest/