DESCRIPTION

Technion - Israel Institute of Technology. Department of Electrical Engineering. High Speed Digital System Laboratory (HS-DSL). Neural Network For Handwritten Digits Recognition. Roi Ben Haim & Omer Zimmerman. Supervisor: Guy Revach.

TRANSCRIPT

1 Department of Electrical Engineering
Technion - Israel Institute of Technology
Neural Network for Handwritten Digits Recognition
High Speed Digital System Laboratory (HS-DSL)
Winter 2013/2014
Supervisor: Guy Revach
Roi Ben Haim & Omer Zimmerman

2 Background

A neural network is a machine learning system designed for supervised learning from examples. Such a network can be used for handwritten digit recognition, but a software implementation is inefficient in both time and resources. This project is the third part of a three-part project; our goal is to implement an efficient hardware solution to the handwritten digit recognition problem. Implementing dedicated HW for this task is part of a new trend in VLSI architecture called heterogeneous computing: designing a system on chip with many accelerators for different tasks, each achieving a better performance/power ratio for its intended task.

3 The output of a network can be multiple neurons. Each neuron can be represented mathematically as a function of multiple variables, if we treat the weights as parameters.

So, for each input X, we would like to minimize the average error between the output Y and the desired vector D.
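A hedged rendering of this setup in formulas (the squared-error criterion is assumed; tanh is the activation named later in the deck):

$$y_k = \tanh\Big(\sum_i w_{ki}\,x_i + b_k\Big), \qquad E = \frac{1}{N}\sum_{n=1}^{N} \tfrac{1}{2}\,\big\lVert Y(X_n;W) - D_n \big\rVert^2$$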

4 Learning algorithm
The method we use to reach the minimum error is a gradient-based algorithm. For each example input we compute:
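The update rule on the slide was not captured in the transcript; the standard gradient-descent step it presumably showed is (η is the learning rate; this rendering is an assumption):

$$w \;\leftarrow\; w - \eta\,\frac{\partial E}{\partial w}$$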

This step is done for each weight in each layer; what the calculation actually does is walk us one small step towards the minimum of the error.

An error function has many local minima, so the algorithm does not guarantee that we will reach the global minimum.

Learning algorithm
[Figure: error curve with two local minima, min a and min b.]

[Figure: network overview. Input: 29x29 grayscale image. Layers 0-4: convolutional layers followed by fully-connected layers. Output: 10 neurons (+1 for the answer, -1 for the other 9).]

Network Description: Structure & Functionality

[Figure: Layer #0 (input image, 29x29, 841 neurons) feeds Layer #1 (1014 neurons): feature maps #0 to #5, each 13x13.]

NN Structure & Functionality
[Figure: Layer #1 feature maps #0-#5 (13x13 each) feed Layer #2 (1250 neurons: maps #0-#49, 5x5 each), followed by Layer #3 (100 neurons) and Layer #4 (10 neurons).]
Output Layer
[Figure: the 100 neurons n#0-n#99 of Layer #3 connect to the 10 output neurons d#0-d#9.]

NN Structure & Functionality
Layer #0: The first layer. The input to this layer is the 29x29-pixel image. The pixels can be seen as 841 neurons without weights, and they are the input to the next layer.

Layer #1: The first convolution layer, which produces 6 feature maps, each of 13x13 pixels/neurons. Each output neuron is the result of a masking operation between a 5x5 weight kernel (+1 for the bias, 26 weights in total), different for each of the 6 maps, and 25 pixels from the input image. The 25 products are summed with the bias and passed through the activation function (tanh). Each feature map is thus the result of a non-standard 2D masking between a 5x5 weight kernel (each weight kernel yields a different feature map) and the 29x29 input neurons, summed with an added bias. The masking is non-standard because the 5x5 neuron sub-matrices are derived by shifts of 2 (instead of 1), both vertically and horizontally, starting with the 5x5 sub-matrix at the upper-left corner of the 29x29 input neurons.

Layer #2: This is the second convolution layer. Its output is 50 feature maps of 5x5 neurons each (a total of 1250 neurons). Each neuron is the result of a masking calculation similar to the previous layer's, only now each of the 50 feature maps is the sum of six 2D mask operations; each masking has its own 5x5 (+1 bias weight) weight kernel and is applied between that kernel and its matching feature map of layer 1 (horizontal and vertical shifts are 2, as in the previous layer).

Layer #3: This is a fully connected layer that contains 100 neurons. Each has 1250 inputs (the output neurons of the previous layer), which are multiplied by 1250 corresponding weights. There are 125,100 weights in this layer (1251x100).

Layer #4: The last fully connected layer. It contains 10 output neurons, each connected to the previous layer's 100 neurons by a different weight vector. The 10 outputs represent the 10 possible recognition options; the neuron with the highest value corresponds to the recognized digit. There are 1010 weights in this layer (101x10).
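To make the stride-2 masking of Layers #1 and #2 concrete, here is a minimal NumPy sketch of one Layer #1 feature map, assuming the 5x5 kernel, shifts of 2, added bias, and tanh activation described above (function and variable names are illustrative only, not the project's code):

```python
import numpy as np

def conv_feature_map(image_29x29, kernel_5x5, bias):
    """One Layer #1 feature map: 5x5 masking with vertical/horizontal shifts of 2.

    image_29x29 : (29, 29) input neurons
    kernel_5x5  : (5, 5) weight kernel for this feature map
    bias        : scalar bias weight
    Returns a (13, 13) feature map, since (29 - 5) // 2 + 1 = 13.
    """
    out = np.empty((13, 13))
    for r in range(13):
        for c in range(13):
            patch = image_29x29[2 * r:2 * r + 5, 2 * c:2 * c + 5]
            out[r, c] = np.tanh(np.sum(patch * kernel_5x5) + bias)
    return out

# Example usage with random data (illustrative only):
img = np.random.rand(29, 29)
fmap = conv_feature_map(img, np.random.randn(5, 5) * 0.1, 0.0)
print(fmap.shape)  # (13, 13)
```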

Summary Table:

Type | Num of neurons | Inputs per neuron | Num of weights
#0 | 29x29 = 841 | 1 | 0
#1 | 13x13x6 = 1014 | 5x5 = 25 | 26*6 = 156
#2 | 5x5x50 = 1250 | 5x5x6 = 150 | 50*6*26 = 7800
#3 | 100 | 1250 | 1251*100 = 125100
#4 | 10 | 100 | 101*10 = 1010

11 Project A summary
In project A, we implemented the neural network described in the previous slides using MATLAB:

The MATLAB implementation achieved a 98.5% correct digit recognition rate.

12 Project A usage in Project B
In the current project, we used the results of the previous software implementation as a reference point for the hardware implementation, both for the success rate and for the performance of the implementation. We tried to achieve the same success rate; in order to do so, we simulated several fixed-point implementations of the network (as opposed to the previous MATLAB floating-point arithmetic) and chose the minimal format that achieved ~98.5%:

Another use of project A is for the weight parameters of the network, which were produced by its implementation of the learning process.

Recognition rate by fixed-point format (rows: number of bits for the integer part; columns: number of bits for the fractional part):

Integer \ Fractional | 7 | 6 | 5 | 4 | 3
1 | 50.54% | 49.93% | 53.08% | 50.73% | 10.26%
2 | 98.41% | 98.33% | 98.32% | 98.02% | 69.07%
3 | 98.46% | 98.43% | 98.42% | 98.08% | 78.97%
4 | 98.48% | 98.43% | 98.42% | 98.08% | 78.97%

13 Project Goals

Devise efficient & scalable HW architecture for the algorithm

Implement dedicated HW for handwritten digit recognition.

Achieving the SW model's recognition rate (~98.5%).

Major performance improvement compared to SW simulator.

Low cell count, low power consumption.

A fully functional system: NN HW implementation on an FPGA with a PC I/F that runs a digit recognition application.

14 Project Top Block Diagram

15 Architecture aspects
This architecture tries to optimize the resources/throughput tradeoff. Neural networks have a strongly parallel nature, and our implementation tries to exploit it (which is expressed as high throughput at the output of the system). A fully parallel implementation would require 338,850 multipliers (one for each of the 338,850 multiplications needed for a single digit recognition), which is obviously not feasible. In our architecture, we decided to use 150 multipliers. This number was chosen with careful attention to current FPGA technology: on the one hand we did not want to utilize all the multipliers of the FPGA, but on the other hand we did want to utilize a substantial number of multipliers, in order to support the parallel nature of the algorithm.
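For reference, this count follows from the summary table above (multiplications per neuron times neurons per layer; bias terms are added rather than multiplied):

$$1014 \cdot 25 + 1250 \cdot 150 + 100 \cdot 1250 + 10 \cdot 100 = 25{,}350 + 187{,}500 + 125{,}000 + 1{,}000 = 338{,}850$$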

16 Architecture aspects
Our destination technology is the VIRTEX 6 XC6VLX240T, which offers 768 DSP slices, each containing (among other things) a 25x18-bit multiplier, meaning we are utilizing 150 of the 768 DSP blocks. In theory, we can add future functionality to the FPGA, as we are far from the resource limit. This was done intentionally, as modern FPGA DSP designs are usually systems that integrate many DSP modules.

17 Memory aspects
Another important guideline for the architecture is memory capacity. The algorithm requires ~135,000 weights and ~3250 neurons, each represented by 8 bits (fixed-point 3.5 format, which is the minimum number of bits required to achieve the same success rate as the MATLAB double-precision model). This means that a minimum of 1.1 Mb (megabit) of memory is required. The VIRTEX 6 XC6VLX240T offers 416 RAM blocks of 36 kb (kilobit) each, totaling 14.625 Mb. This means that we utilize only 7.5% of the internal FPGA RAM memory.
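A quick check of these figures (using the document's own units, with 1 Mb = 1024 kb for the FPGA total):

$$(135{,}000 + 3{,}250)\times 8 \approx 1.1\times 10^{6}\ \text{bits}, \qquad 416 \times 36\ \text{kb} = 14{,}976\ \text{kb} = 14.625\ \text{Mb}, \qquad \frac{1.1}{14.625} \approx 7.5\%$$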

18 Micro-architecture implementation

Memories:

All RAM memories were generated using Coregen, specifically targeting the destination technology (VIRTEX 6). Small memories (~10 kb) were implemented as distributed RAM, and large memories were implemented using block RAM. Overall, 4 memory blocks were generated:

Layer 0 neuron memory: single-port distributed RAM block of depth 32 and width 29*8 = 232. Total memory size: ~9 kb.

Layer 1 neuron memory: single-port distributed RAM block of depth 16 and width 13*6*8 = 624. Total memory size: ~10 kb.

Weights bias memory: single-port ROM block of depth 261 and width 6*8 = 48. Total memory size: ~12 kb.

Weights and layer 2 memory: dual-port block RAM. One port has a read & write width of 1200 (depth 970 each), and the second port has a write width of 600 (depth 1940) and a read width of 1200 (depth 970). Total memory size: ~1.15 Mb.

Layer 2 neuron memory and the weights memory were combined into one big RAM block for better utilization of the memory architecture provided by VIRTEX 6.

Layer 3 & Layer 4 neuron memory: implemented in registers (110 bytes).

20 Micro-architecture implementation

Mult_add_top: This unit receives 150 neurons & 150 weights and returns the following output:
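The equation on this slide did not survive extraction. Assuming the 150 products are grouped to match the six adder trees mentioned below, the output is presumably six partial sums of the form:

$$y_j = \sum_{i=0}^{24} n_{25j+i} \, w_{25j+i}, \qquad j = 0, \dots, 5$$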

This arithmetic operation is implemented using 150 DSP blocks and 6 adder trees, each adder tree containing 15 signed adders (all adders were generated using Coregen and implemented in fabric rather than DSP blocks), totaling 90 adders.


22 Micro-architecture implementation

Tanh: Implemented as a simple LUT; 8 input bits are mapped to 8 output bits (total LUT size is therefore 256 x 8).
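As an illustration only (not the project's Coregen LUT), a short Python sketch of how such a 256 x 8 tanh table could be generated, assuming the 3.5 signed fixed-point format described below for both input and output:

```python
import numpy as np

FRAC_BITS = 5           # 3.5 fixed point: 3 integer bits (incl. sign), 5 fractional bits
SCALE = 1 << FRAC_BITS  # 32

def to_fixed(x):
    """Quantize a real value to signed 8-bit 3.5 fixed point (raw two's-complement code)."""
    q = int(round(x * SCALE))
    return max(-128, min(127, q))

# Build the 256 x 8 LUT: index = raw 8-bit input code, value = tanh of that code.
lut = np.zeros(256, dtype=np.int8)
for code in range(256):
    signed = code - 256 if code >= 128 else code  # interpret the code as two's complement
    x = signed / SCALE                            # back to a real value
    lut[code] = to_fixed(np.tanh(x))

# Example: tanh of +1.0 (code 32 in 3.5 format)
print(lut[32] / SCALE)  # ~0.75, close to tanh(1.0) = 0.7616
```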

23 Micro-architecture implementation
Rounding & saturation unit: Logic to cope with the bit expansion caused by the multiplication operation (which doubles the number of bits of the multiplicands) and the addition operation (which adds 1 bit relative to the representation of the added numbers). Neurons are represented using 8 bits in 3.5 fixed-point format. This format was decided upon after simulating several fixed-point formats and finding the minimal number of bits needed to achieve 98.5% accurate digit recognition (equal to the success rate of MATLAB's floating point).

The rounding & saturation logic operates according to the following rules:

if input < -4, then output = -4 (binary '100.00000');
else if input > 3.96875, then output = 3.96875 (binary '011.11111');
else output = round(input * 2^5) * 2^-5,
where round() is the hardware implementation of MATLAB's round() function.
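A bit-accurate software sketch of these rules (MATLAB's round() rounds halves away from zero, which is what is assumed here; this is an illustrative model, not the project's Verilog or MATLAB code):

```python
import math

def round_half_away(x):
    """Mimic MATLAB round(): halves are rounded away from zero."""
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

def round_and_saturate(value):
    """Quantize an intermediate result to 3.5 fixed point per the rules above."""
    if value < -4:
        return -4.0          # binary '100.00000'
    if value > 3.96875:
        return 3.96875       # binary '011.11111'
    return round_half_away(value * 2**5) * 2**-5

print(round_and_saturate(1.23456))  # -> 1.25, the nearest multiple of 1/32 = 0.03125
print(round_and_saturate(5.0))      # -> 3.96875 (saturated)
```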

24 Resource utilization summary

Resource | Our implementation | Naive implementation
Memory | ~1.2 Mb | ~1.1 Mb
Multipliers | 150 | ~350,000
Adders | ~100 | ~175,000
Activation function (tanh) units | 6 | ~2400

As can be seen, the naive implementation (a brute-force, fully parallel implementation of the network) is not feasible in hardware because of its impractical resource demands. Our architecture offers reasonable resource utilization while still improving performance substantially in comparison to the software implementation.

25 Development Environment

SW development platform: MATLAB.

HW development platforms:
Editor: Eclipse.
Simulation: ModelSim.
Synthesis: XST (ISE).
FPGA: Virtex 6 XC6VLX240T.

26 HDL implementation & verification
All of the modules described in the previous slides were successfully implemented in Verilog HDL. A testbench was created for each module, input vectors were created and injected, and the simulation results were compared to a bit-accurate MATLAB model. Once the simulation results for all stimulus vectors were consistent with the bit-accurate model, the module was considered verified.
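As an illustration of this flow (not the project's actual scripts), a module's simulation output might be checked against the golden MATLAB reference like this, assuming both are dumped to plain-text files with one word per line:

```python
def compare_to_golden(dut_file, golden_file):
    """Compare simulation output words against the bit-accurate reference, line by line."""
    with open(dut_file) as f_dut, open(golden_file) as f_gold:
        dut = [line.strip() for line in f_dut if line.strip()]
        gold = [line.strip() for line in f_gold if line.strip()]
    mismatches = [(i, d, g) for i, (d, g) in enumerate(zip(dut, gold)) if d != g]
    if len(dut) != len(gold):
        print(f"length mismatch: {len(dut)} vs {len(gold)} words")
    for i, d, g in mismatches[:10]:  # show only the first few mismatches
        print(f"word {i}: dut={d} golden={g}")
    return not mismatches and len(dut) == len(gold)
```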

After each module was individually verified, we connected the different modules together and implemented a controller over the entire logic. A testbench and bit-accurate MATLAB models for all stages of the controller were created for the entire project.

27 HDL implementation & verification
Current status: the simulation results of all stimulus vectors were consistent with the bit-accurate models, so we have a hardware implementation of the network, ready to be implemented on the FPGA.
[Figure: ModelSim simulation results for recognition of the digit 9.]

28 Project challenges
The main goal of the project, implementing a highly functional handwritten digit recognition system, proved very challenging. We can divide the challenges into three main categories: architecture, implementation, and verification.

Architecture-oriented problems: Devising an efficient hardware architecture for the algorithm proved to be one of the biggest challenges of the project. Neural networks can theoretically be implemented completely in parallel, but this solution is not practical resource-wise. A lot of thought was put into the tradeoff between parallelism and resource usage.

In order to allow a degree of scalability, we had to identify what the different layers have in common and devise an architecture that allows all layers to use the same logic modules, instead of implementing each layer in a straightforward fashion.

Resource estimation: our target device is a Virtex 6 FPGA, and our architecture had to take that into consideration. We had a well-defined limit on the amount of memory & multipliers available to us, and therefore needed to devise an architecture that would not exceed these limits.

29 Project challenges
Implementation-oriented problems

Our target device (Xilinx Virtex 6 FPGA) was unfamiliar to us, so we had to learn how to operate Xilinx's tools to implement logic modules that are compatible with the target technology.

We used fixed-point arithmetic for the first time and gained much experience in this area, including implementing hardware rounding & saturation logic. At first, we implemented simple truncation rounding, but found that it was not satisfactory and lowered the success rate. Therefore, we needed to implement a more complicated rounding method, which imitates MATLAB's round() function.

In order to implement such a modular & scalable architecture, a smart controller had to be implemented. Composing a control algorithm and afterwards coding this controller proved very challenging.

30 Project challenges
Verification-oriented problems

Verification of the system was probably the most challenging aspect.

As stated in the previous section (Implementation-oriented problems), we were unfamiliar with Xilinx's tools, and therefore after creating the desired logic (such as memories, multipliers, etc.) using these tools, we had to verify that it worked. Nothing worked at first, so this proved to be a long process, until we learned to properly use Xilinx's IPs.

Most challenging of all was achieving a successful verification of the entire system. Our system contains an extremely large amount of data (neurons, weights, partial results), so every small mistake in the controller leads to a lot of wrong output data, and it is very difficult to pinpoint the origin of the mistake. For example, if we accidentally coded the controller such that a data_valid strobe arrives at a certain module one clock earlier than it should, then the entire data flow continues with data that is essentially garbage, and it is hard to find the origin of the mistake. To overcome this, we had to produce a bit-accurate MATLAB model for each step of the design, not only for the final results.

31 Future work
Currently, we have a Verilog hardware design that in simulation achieves the same recognition success rate as the earlier software implementation of the algorithm. Future work includes:

Successfully implementing the design on the Virtex 6 FPGA (on the ML-605 development board)

Constructing a UART I/F between the PC & the board

Designing a MATLAB GUI to connect the user to the FPGA, thus achieving a fully functional product.

After successfully implementing our hardware on the FPGA, we need to measure the exact performance improvement in comparison to the software implementation (single-digit recognition time is ~5 ms in software and ~0.03 ms in hardware, assuming a system clock frequency of 100 MHz).
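A rough consistency check using only the figures already given: at 100 MHz, 0.03 ms corresponds to about 3,000 clock cycles, which is enough for the 338,850 multiplications on 150 multipliers, and the projected speedup is roughly 167x:

$$0.03\ \text{ms} \times 100\ \text{MHz} = 3{,}000\ \text{cycles}, \qquad \frac{338{,}850}{150} \approx 2{,}259\ \text{multiply cycles}, \qquad \frac{5\ \text{ms}}{0.03\ \text{ms}} \approx 167\times$$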

32 Project Gantt

[Gantt chart: HDL Implementation, Functional simulation, Synthesis, Layout Design, User I/F App development, System Verification, Report, plotted over weeks 1-15.]

Thank you


[Flowchart: the training flow]
Memory allocation
Preprocessing: read images, normalize images, produce random initial weights
Choose next input image
Forward propagation: Layer 1, Layer 2, Layer 3, Layer 4 forward prop
Back propagation: Layer 4, Layer 3, Layer 2, Layer 1 back prop (update each layer's weights)
Statistics gathering
Did not finish training? Choose the next input image and repeat.
Finished training? Plot the learning-process curve and save the final weights.
