A Mixed Mode Self-Programming Neural System-on-Chip for Real-Time Applications · 2001-10-17
[Figure: Hierarchical Structure of the Chip (analog neural processing layer; digital storage, processing and control layer; digital supervisory & multiplexing layer; per-cell RAM; chip-level decoder and MUX/DEMUX; analog/digital inputs and outputs)]
A Mixed Mode Self-Programming Neural System-on-Chip for Real-Time Applications
Khurram Waheed and Fathi M. Salam
Department of Electrical and Computer Engineering, Michigan State University
East Lansing, MI 48824-1226
Abstract
The paper provides an overview of the development of a self-learning computing chip in the new 0.18 micron copper technology. The chip realizes an architecture that achieves self-learning with execution times in the micro- to milliseconds. The core consists of basic building blocks of 4-quadrant multipliers, trans-conductance amplifiers, and active load resistances for analog (forward-)network processing and learning modules. Superimposed on the processing network are digital memory and control modules composed of D-flip-flops, an ADC, a Multiplying D/A Converter (MDAC), and comparators for parameter (weight) storage, logical control, and analog/digital conversions.
The single System-on-Chip design impacts several domains of critical applications, including nano-scale biotechnology; automotive sensing, control, and actuation; wireless communications; and image feature extraction and pattern matching.
1. Introduction
The core of the chip is a neurally inspired scalable (re-configurable) array network designed for compatibility with VLSI. The chip is endowed with tested auto-learning capability, realized in hardware, to achieve global task auto-learning execution times in the micro- to milliseconds.
The architectural forward network (and learning modules) process in analog continuous-time mode, while the (converged, steady-state) weights/parameters can be stored on chip in digital form. The overall architectural design adopts engineering methods from adaptive networks and optimization principles [1,2].
The designed chip can handle 16 inputs and 16 outputs. In addition, there are inputs for control interface, synchronization, and stand-alone programmability of the chip, resulting in an approximate die area of 4000 × 6000 µm² in a QFP-208L package.
The 6-layer Copper (Cu) interconnect, single-poly, 0.18 micron process enables the dense connectivity and dense die area of this highly interconnected network, resulting in a compact, powerful engine. Moreover, the special low-resistance and low-capacitance electrical properties of copper permit the design to achieve high connectivity while still managing precise distributions of resistive and capacitive loads. These properties enable one to predict performance and limit signal time-delays along the interconnect. The small feature size and the electrical interconnect properties of copper are enablers for the realization of such a powerful chip with dense interconnectivity.
Figure 1: Architectural Overview of the Chip
The chip operates in four different modes: (i) learn, (ii) (on-chip) store, (iii) program read/write, and (iv) process, which selectively combine its intrinsic analog and digital building blocks in a novel manner (see fig. 1).
Initially, the system-level chip design was simulated and verified using SIMULINK/MATLAB. All the building blocks were custom designed and extensively simulated using HSPICE (incorporating the UMC Level 49 models). The design was implemented in the 6-level Copper (Cu) interconnect, single-poly, 0.18 micron process, which is an enabler for the dense connectivity and dense die area of this highly interconnected network. The design was laid out and verified using Cadence Tools. More details of the high-level design, circuit design of major blocks, and chip layout are provided in the subsequent sections [8,9].
The resulting chip design requires no traditional programming or coding [2,3]. In addition to the novel architectural design, the hardware also carries the heavy computational burden by selectively realizing programmability as on-chip auto-learning modules. The resulting System-on-Chip operates on a 1.5V power source and consumes approximately 1 mW of power.
2. Architectural Design
The design process comprised consecutive stages, based on a top-down definition of the chip. A general definition of the functionality and intended applications was created, and the development of the chip design was carried out at three different levels:
• A high-level design, specifying the characteristics of the neural network to be implemented and the definition of its basic building blocks.
• A circuit-level design, describing each of these blocks based on the copper technology, with their corresponding simulations.
• Finally, a layout-level design, where the actual chip layout is created and verified.
3. High Level Design: Modified BP Algorithm
We present our stepwise approach for tailoring the BP algorithm so that it becomes suitable for VLSI implementation. For illustration, in each case we present the simulated results of the XOR problem, highlighting the effect of each modification on the performance of the algorithm.
[Figure 2: Simulink Model for Chip Simulations (BP neural network model with nonlinear multiplier and removal of the derivative function in all hidden layers: signal feedforward process for training and testing, error feedback process, weight update process and weight memory, layer weight blocks w12, w23, w34, train/test data streams, and manual/automatic stop-training conditions)]
For our simulations, we constructed a four-layer neural network: an input layer, two hidden layers, and an output layer. The nonlinear mapping function is used only by the neurons in the two hidden layers. For the XOR simulation we are illustrating, there are two input neurons and one output neuron.
The update law in each case is derived from the mathematical model for a multi-layer feed-forward neural network. In the equations below,
$\dot{W}^{(3)}$ represents the time derivative of the weights for the output layer,
$\dot{W}^{(2)}$ represents the time derivative of the weights for the 2nd hidden layer, and
$\dot{W}^{(1)}$ represents the time derivative of the weights for the 1st hidden layer.
3.1 Modified BP algorithm with linear multiplier.
Using the weight update rule
$$\dot{W} = -\eta \cdot \frac{\partial E}{\partial W} - \alpha \cdot W, \quad \text{where } NET^{(i)} = W^{(i)} \cdot X^{(i-1)}$$
the update laws are
$$\dot{W}^{(3)} = \eta \cdot (D - Y) \cdot X^{(3)} - \alpha \cdot W^{(3)}$$
$$\dot{W}^{(2)} = \eta \cdot \psi'(NET^{(2)}) \cdot \big((D - Y) \cdot W^{(3)}\big) \cdot X^{(2)} - \alpha \cdot W^{(2)}$$
$$\dot{W}^{(1)} = \eta \cdot \psi'(NET^{(1)}) \cdot \Big(\psi'(NET^{(2)}) \cdot \big((D - Y) \cdot W^{(3)}\big) \cdot W^{(2)}\Big) \cdot X^{(1)} - \alpha \cdot W^{(1)}$$
Figure 3: Training Error and convergence for the modified BP Algorithm
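The continuous-time update laws of Section 3.1 can be exercised numerically. The following is a minimal Euler-discretized sketch in Python/NumPy applied to the XOR problem; the hidden-layer sizes, gains η and α, step size, and bipolar data encoding are illustrative assumptions, not the chip's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = np.tanh
def dpsi(v):
    return 1.0 - np.tanh(v) ** 2

# XOR training set in bipolar form (inputs/targets in {-1, +1}) - assumed encoding
X0 = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]).T  # 2 x 4
D = np.array([[-1.0, 1.0, 1.0, -1.0]])                                  # 1 x 4

# 2 inputs -> two tanh hidden layers (3 neurons each, assumed) -> 1 linear output
W1 = 0.5 * rng.standard_normal((3, 2))
W2 = 0.5 * rng.standard_normal((3, 3))
W3 = 0.5 * rng.standard_normal((1, 3))

eta, alpha, dt = 2.0, 1e-3, 0.05  # learning gain, weight leak, Euler step (assumed)

def forward(W1, W2, W3, X):
    n1 = W1 @ X;  x1 = psi(n1)    # NET(1), X(1)
    n2 = W2 @ x1; x2 = psi(n2)    # NET(2), X(2)
    return n1, x1, n2, x2, W3 @ x2  # linear output layer

err0 = float(np.mean((D - forward(W1, W2, W3, X0)[4]) ** 2))

for _ in range(4000):
    n1, x1, n2, x2, y = forward(W1, W2, W3, X0)
    d3 = D - y                      # (D - Y)
    d2 = dpsi(n2) * (W3.T @ d3)     # psi'(NET(2)) . ((D - Y) W(3))
    d1 = dpsi(n1) * (W2.T @ d2)     # psi'(NET(1)) . (... W(2))
    # Euler step of  dW/dt = eta * (error term) * X - alpha * W
    W3 += dt * (eta * d3 @ x2.T - alpha * W3)
    W2 += dt * (eta * d2 @ x1.T - alpha * W2)
    W1 += dt * (eta * d1 @ X0.T - alpha * W1)

errT = float(np.mean((D - forward(W1, W2, W3, X0)[4]) ** 2))
print(round(err0, 4), round(errT, 4))
```

The leak term −αW corresponds to the weight-decay component of the update rule; with α small it mainly bounds the weights without dominating the gradient term.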
3.2 Modified BP algorithm with nonlinear multiplier.
This update law and the subsequent law consider the effect of the presence of resistive elements in hardware circuits and of multiplication using a Gilbert multiplier. Each case uses the following update law:
$$\dot{W} = -\eta \cdot \frac{\partial E}{\partial W} - \alpha \cdot W, \quad \text{where } NET^{(i)} = \tanh(W^{(i)}) \cdot \tanh(X^{(i-1)})$$
Figure 4: Training Error and convergence for the modified BP Algorithm with non-linear multiplier
The update laws are
$$\dot{W}^{(3)} = \eta \cdot \tanh(D - Y) \cdot \tanh(X^{(3)}) - \alpha \cdot W^{(3)}$$
$$\dot{W}^{(2)} = \eta \cdot \tanh\!\big(\tanh(\psi'(NET^{(2)})) \cdot \tanh(\tanh(D - Y) \cdot \tanh(W^{(3)}))\big) \cdot \tanh(X^{(2)}) - \alpha \cdot W^{(2)}$$
$$\dot{W}^{(1)} = \eta \cdot \tanh\!\Big(\tanh(\psi'(NET^{(1)})) \cdot \tanh\!\big(\tanh(\tanh(\psi'(NET^{(2)})) \cdot \tanh(\tanh(D - Y) \cdot \tanh(W^{(3)}))) \cdot \tanh(W^{(2)})\big)\Big) \cdot \tanh(X^{(1)}) - \alpha \cdot W^{(1)}$$
Here each analog product is realized by the nonlinear multiplier, which compresses both operands through tanh.
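The qualitative effect of the nonlinear multiplier can be seen with a tiny behavioral model: treat each analog product a·b as tanh(a)·tanh(b), which is near-linear for small operands and saturates (boundedly) for large ones. This is a plain Python illustration of that behavior, not a circuit-level model of the fabricated cell.

```python
import math

def nonlinear_mult(a, b):
    """Behavioral model of the multiplier: both operands are
    compressed through tanh before being multiplied."""
    return math.tanh(a) * math.tanh(b)

small = nonlinear_mult(0.1, 0.1)   # close to the ideal product 0.01
large = nonlinear_mult(5.0, 5.0)   # saturates: magnitude bounded by 1.0
print(small, large)
```

The bounded output is why the non-linearity "does not cause any instability" in the large-signal range, while small-signal operation stays close to ideal multiplication.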
3.3 The Modified BP algorithm with nonlinear multipliers and the removal of the derivative function in all the hidden layers.
The update laws are
$$\dot{W}^{(3)} = \eta \cdot \tanh(D - Y) \cdot \tanh(X^{(3)}) - \alpha \cdot W^{(3)}$$
$$\dot{W}^{(2)} = \eta \cdot \tanh\!\big(\tanh(D - Y) \cdot \tanh(W^{(3)})\big) \cdot \tanh(X^{(2)}) - \alpha \cdot W^{(2)}$$
$$\dot{W}^{(1)} = \eta \cdot \tanh\!\Big(\tanh\!\big(\tanh(D - Y) \cdot \tanh(W^{(3)})\big) \cdot \tanh(W^{(2)})\Big) \cdot \tanh(X^{(1)}) - \alpha \cdot W^{(1)}$$
Figure 5: Training Error and convergence for the modified BP Algorithm with removal of the derivative function in all hidden layers
Presented in fig. 6 are the input/output waveforms for the network trained using the above final update rule.
[Figure 6: Input/Output waveforms for the final network (Test Input 1, Test Input 2, Test Output)]
3.4 Summary of the High-level Simulation Results
From the high-level Matlab/Simulink simulations, one can draw the following conclusions:
1) In replacing the ideal linear multiplier model with the realistic nonlinear multiplier model, the neural network still converges.
2) Removing the derivative function of the second hidden layer, the neural network could still converge. In fact, this can easily be verified mathematically.
3) When the derivative functions in all hidden layers are removed, the neural network could still converge, but in this case the training error is not zero: it attains a small constant mean value. The update law derived in this case is still a gradient-type law [7].
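Conclusions 2 and 3 rest in part on the fact that the omitted derivative ψ′(·) = 1 − tanh²(·) is strictly positive, so dropping it rescales each error term but never flips its sign, which is what lets the modified law retain a gradient-like descent character. A quick numerical check (illustrative only):

```python
import numpy as np

# derivative of the tanh nonlinearity over a wide operating range
v = np.linspace(-5.0, 5.0, 1001)
dpsi = 1.0 - np.tanh(v) ** 2

# strictly positive everywhere, with maximum 1.0 at v = 0
print(float(dpsi.min()), float(dpsi.max()))
```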
For all presented simulation results, we used the same set of initial conditions for all the models, i.e., all initial conditions at zero. In general, for different nonlinear systems, the same initial conditions may produce different training results. An initial weight set which yields a good training result for one nonlinear system will not necessarily yield a good training result for another nonlinear system.
4. Conceptual Chip Design
Presented below is a higher block-level design of the chip. For more details, see [1,9].
4.1 The Synaptic Cell:
The initial and highest-level design of each neural cell shows the central idea of the processing network and the learning network (Fig. 7).
The processing stage is composed of 16 neurons built using vector multipliers and a sigmoid function. The multipliers use as operands an input vector and a weight vector. The input is common to all processing units, while the weights belong to each neuron. The scalar product is then applied to the non-linear function, resulting in the output of a neuron.
The learning stage works in a similar manner to the processing stage, but uses different sources for the product. It receives the signals from the next stage, creates a new signal to be sent to any previous cell, and updates the weights according to the update law. The stages of the learning and processing networks are merged locally. To accomplish this, it was necessary to decompose the 17-D multipliers that constitute each network node into a set of 1-D multipliers.
On-chip memory is designed as local digital memory. It is therefore necessary to add a stage where the current analog value of the weight is converted into a digital value using an ADC, and then converted back using a DAC. The memory is built using 5 data flip-flops. The update law, however, uses a capacitor (see Fig. 7) and 1-dimensional (1-D) multipliers. These multipliers are also used in each neuron, to form the 17-dimensional (17-D) multipliers.

[Figure 7: Architectural Block Diagram - Synapse (17-D vector multipliers decomposed into 1-D weight multipliers; ADC/MDAC weight conversion with flip-flop memory; learn/process switching; MSB/LSB buses; transconductance-amplifier output stages)]
To optimize the number of ADCs required for the conversion of the weights and still achieve good performance, an array of ADCs was designed away from the neural network. With this new configuration, one ADC can be shared by a whole row of weights, reducing the number of ADCs to n. This design uses multiplexers, decoders, and control logic for the store mode, and requires a clocked input to drive this logic. This clock also regulates the ADC operation, as the ADC is designed to be of the successive-approximation type. Note that having a clock in this section does not imply that the neural network stops being asynchronous. For a more detailed review of the chip architecture, refer to [1,8,9].
5. Design and Layout of Components
There are a number of custom-designed components for this chip. All the component circuits were designed in Star-Hspice using BSIM Level-49 models supplied by SRC/UMC. Avant!'s software was used for schematic entry and waveform viewing. Initial layout of the sub-circuits was carried out in Tanner Tools, but the verification and LVS were performed using Cadence Tools. In this paper we restrict ourselves to presenting results on the more vital components of the chip: a Gilbert multiplier, a wide transconductance amplifier, and a comparator. In addition, demonstrative simulations for the ADC and the vector multiplier are also provided.
5.1 Gilbert Multiplier
To implement the multiplication in the analog domain, a Gilbert multiplier cell has been employed. The circuit diagram of the modified Gilbert multiplier is shown in fig. 8. Assume that all transistors in fig. 8 are in the saturation region and are matched, so that the trans-conductance parameters satisfy
$$\beta_N = \beta_{M1} = \beta_{M2} \quad \text{and} \quad \beta_P = \beta_8 = \beta_9 = \beta_{10} = \beta_{11}$$
Figure 8: Gilbert Multiplier - Schematics
The output current is then the difference between $I_{D(M13)}$ and $I_{D(M14)}$, since the currents $I_{S(M16)}$ and $I_{S(M17)}$ are reflected by the current mirrors.
Defining the output currents
$$I_+ = I_{S(M8)} + I_{S(M10)}, \qquad I_- = I_{S(M9)} + I_{S(M11)}$$
it can readily be shown that the ideal characteristic of the differential output current $I_{DIFF} = I_+ - I_-$ is given by
$$I_{DIFF} = \sqrt{\beta_P \beta_N}\,(V_3 - V_4)(V_1 - V_2)$$
The modified Gilbert multiplier takes the difference between two voltages (V3 − V4) and multiplies that difference by the difference of two other voltages (V1 − V2). In the small-signal range, the characteristic curve is approximately linear, with all four inputs carrying multiplication information. In the large-signal range, the multiplier is non-linear but does not cause any instability.
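As a sanity check of the ideal characteristic, the sign of I_DIFF follows the product of the two differential inputs, which is exactly four-quadrant operation. A small numeric illustration (the β values are arbitrary placeholders, not the fabricated device parameters):

```python
import math

beta_P = beta_N = 50e-6  # assumed A/V^2, for illustration only

def i_diff(v1, v2, v3, v4):
    # I_DIFF = sqrt(beta_P * beta_N) * (V3 - V4) * (V1 - V2)
    return math.sqrt(beta_P * beta_N) * (v3 - v4) * (v1 - v2)

quadrants = [
    i_diff(0.1, 0.0, 0.1, 0.0),  # (+, +) -> positive output
    i_diff(0.0, 0.1, 0.1, 0.0),  # (-, +) -> negative output
    i_diff(0.1, 0.0, 0.0, 0.1),  # (+, -) -> negative output
    i_diff(0.0, 0.1, 0.0, 0.1),  # (-, -) -> positive output
]
print([q > 0 for q in quadrants])
```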
Figure 9: Gilbert Multiplier - Layout
The multiplier layout is shown in fig. 9, while the HSPICE-simulated DC characteristics are shown in fig. 10, demonstrating four-quadrant multiplication.
Figure 10: Gilbert Multiplier - DC Characteristics
5.2 Wide Transconductance Amplifier
The wide-transconductance amplifier is used in multiple modes in this chip. It serves as the sigmoid non-linearity at the end of each row, and as a buffer for in-chip signal buffering.
For the transconductance amplifier, the differential-in, differential-out transconductance is given by
$$g_{md} = \frac{\partial I_{out}}{\partial V_{ID}} = \sqrt{K_1 \frac{W}{L} I_{D3}} = \sqrt{\beta_1 I_{D3}} \quad (\text{at } V_{ID} = 0)$$
The wide transconductance amplifier was preferred over the simple transconductance amplifier for its better characteristics with respect to:
• transistor size/current mismatch, and hence common-mode gain
• input common-mode voltage range
• input/output voltage swing
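To make the transconductance expression concrete, a back-of-the-envelope evaluation; the process constant, aspect ratio, and bias current below are assumed placeholder values, not the chip's actual design parameters.

```python
import math

K1 = 50e-6        # assumed trans-conductance process constant, A/V^2
W_over_L = 2.0    # assumed device aspect ratio W/L
I_D3 = 10e-6      # assumed bias current I_D3, A

beta1 = K1 * W_over_L
g_md = math.sqrt(beta1 * I_D3)   # differential transconductance at V_ID = 0
print(round(g_md * 1e6, 2), "uS")
```

Because g_md grows only as the square root of the bias current, doubling I_D3 buys roughly a 1.4x increase in transconductance.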
The designed amplifier was obtained by adding two extra current mirrors to the simple transconductance amplifier. By reflecting the currents of M1 and M2 to the upper current mirrors, the output current is simply the difference between I1 and I2, with the advantage that both input and output voltages can swing almost up to Vdd and almost down to Vss without affecting the operation of the circuit.
The output current, Iout, in the schematic (fig. 11) is converted to a voltage using a 2-CMOS active load (not shown in the schematic). The complete layout is shown in fig. 12, with the simulated transfer characteristics shown in fig. 13.
Figure 11: Wide Transamp - Schematics
Figure 12: Wide Transamp - Layout
Figure 13: Wide Transamp - DC Characteristics
5.3 Comparator
The comparator is used in the ADC to compare the Multiplying DAC output voltage with the actual input voltage to be converted. The input stage of the comparator is a differential amplifier, and the next stage is a decision circuit. The last stage is an inverter used as a thresholding/polarity-correction circuit. The schematic is shown in fig. 14, the layout in fig. 15, and the characteristic waveforms in fig. 16.
Figure 14: Threshold Comparator - Schematics
Figure 15: Threshold Comparator - Layout
Figure 16: Threshold Comparator - Characteristics
5.4 ADC Operation
The simulation results below illustrate the conversion of an analog voltage of 0.85V to its digital equivalent by the successive-approximation ADC internal to the chip. In this simulation, the conversion takes approximately 3µs, implying that all the converged analog weights will be converted to their digital equivalents in approximately 3µs × 17 = 51µs.
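The successive-approximation loop can be summarized behaviorally: one bit is decided per clock, MSB first, by comparing the MDAC output against the input. This Python sketch assumes a 5-bit code (matching the 5 flip-flops per weight mentioned in Section 4) and a 1.5 V full-scale reference (the supply voltage); both are illustrative assumptions, not verified chip specifications.

```python
VREF, BITS = 1.5, 5   # assumed full-scale reference and resolution

def sar_convert(vin, vref=VREF, bits=BITS):
    """Successive approximation, MSB first: the comparator keeps a
    trial bit whenever the MDAC output does not exceed the input."""
    code = 0
    for b in range(bits - 1, -1, -1):
        trial = code | (1 << b)
        if trial * vref / (1 << bits) <= vin:   # comparator decision
            code = trial
    return code

code = sar_convert(0.85)
print(code, code * VREF / (1 << BITS))   # digital code and its analog value
```

Five comparator decisions per conversion is what makes the conversion time scale linearly with the resolution rather than exponentially, consistent with the microsecond-scale conversion reported above.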
Figure 17: ADC Convergence Characteristics
5.5 Vector Multiplications
Presented below are the transient characteristics of the multiplication of two time-domain sinusoids. The third waveform presents the result: the cascaded output currents of seventeen multipliers collected in current busbars and converted to the voltage domain at the end using a 4-CMOS active load.
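Functionally, the seventeen 1-D multiplier outputs summed on the current busbar realize a 17-element dot product. A small behavioral check, reusing the tanh-compressed multiplier model from Section 3.2 with random small-signal operands (the signal range is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-0.1, 0.1, 17)   # 16 weights plus a threshold term (assumed range)
x = rng.uniform(-0.1, 0.1, 17)   # 16 inputs plus a bias input (assumed range)

# each 1-D multiplier contributes tanh(w_i) * tanh(x_i);
# the current busbar sums all seventeen contributions
i_bus = float(np.sum(np.tanh(w) * np.tanh(x)))
ideal = float(w @ x)
print(abs(i_bus - ideal))   # small in the near-linear small-signal range
```

Summing currents on a shared busbar is what lets the 17-D multiplication finish in a single analog settling time instead of seventeen sequential operations.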
Figure 18: Vector Signal Multiplication - Transient
6. Overall Chip Layout & Interconnects
The figure below shows the array structure of the implemented chip. The best possible fit of the building sub-cells was sought in order to achieve a highly dense building-block structure. A hierarchical routing assignment was made for the available 6 metal layers to achieve the required dense connectivity. For more details see [1].
Figure 19: Layout for the Building Block
7. References
[1] Khurram Waheed and Fathi M. Salam, "A Mixed-Mode Design for a Self-Programming Chip for Real-Time Estimation, Prediction, and Control," Proc. of the 43rd IEEE Midwest Symposium on Circuits and Systems, Aug. 8-11, 2000, pp. 810-813.
[2] Gert Cauwenberghs and M. Bayoumi (editors), Learning on Silicon: Adaptive VLSI Neural Systems, Kluwer Academic Publishers, July 1999.
[3] F. M. Salam and H-J. Oh, "Design of a Temporal Learning Chip for Signal Generation and Classification," Analog Integrated Circuits and Signal Processing, Kluwer Academic Publishers, Vol. 18, No. 2/3, February 1999, pp. 229-242.
[4] M. Ahmadi and F. Salam, Special Issue on Digital and Analog Arrays, International Journal on Circuits, Systems, and Computers, October/December 1998 (issue published in December 1999).
[5] F. M. Salam and M. R. Choi, "An All-MOS Analog Feedforward Neural Circuit with Learning," IEEE Int'l Symp. on Circuits and Systems (ISCAS), May 1990, pp. 2508-2511.
[6] MSU Team, Copper IC Design Challenge, Phase I Report, January 2000.
[7] MSU Team, Copper IC Design Challenge, Phase II Report, August 2000.
[8] Website: http://www.egr.msu.edu/annweb/cu_contest/