Neural computation platforms: to the Blue Brain and beyond



TALLINNA TEHNIKAÜLIKOOL (Tallinn University of Technology)
Faculty of Information Technology
Department of Computer Engineering, Chair of Digital Technology

Parallel Architectures, IAY0060

Neural computation platforms: to the Blue Brain and beyond

Report (Referaat)

Lecturer: K. Tammemäe
Student: Valentin Tihhomirov, 971081 LASM

Tallinn 2005


    Contents

1 Prologue: The Blue Brain project

2 Introduction
  2.1 Brain research
  2.2 Artificial NNs
  2.3 Demand for the Neural HW

3 Traditional Approach
  3.1 Simulating Artificial Neural Networks on Parallel Architectures
  3.2 Mapping neural networks on parallel machines
  3.3 Benchmarking
  3.4 Simulation on General-Purpose Parallel Machines
  3.5 Neurocomputers
  3.6 FPGAs
      3.6.1 ANNs on RAPTOR2000
  3.7 Conclusions

4 Spiking NNs
  4.1 Theoretical background
  4.2 Sample HW
      4.2.1 Learning at the Edge of Chaos
      4.2.2 MASPINN on NeuroPipe-Chip: A Digital Neuro-Processor
      4.2.3 Analog VLSI for SNN
  4.3 Maas-Markram theory: WetWare in Liquid Computer
      4.3.1 The Hard Liquid
  4.4 The Blue Brain
      4.4.1 Blue Gene
      4.4.2 Brain simulation on BG
  4.5 Conclusions

5 Conclusions

6 Epilogue


Chapter 1

Prologue: The Blue Brain project

What really motivated us to study this field was the announcement [1] of the Blue Brain project (1): the assent of IBM Corp. to grant its TOP15-listed BG/L computer to the Brain Mind Institute at Switzerland's Ecole Polytechnique Federale de Lausanne (EPFL) for replicating in silico one of the brain's building blocks, a neocortical column (NCC). Henry Markram, the project leader, its initiator and the founder of the Brain Mind Institute, explains:

The neocortical column is the beginning of intelligence and adaptability, marking the jump from reptiles to mammals. When it evolved, it was like Mother Nature had discovered the Pentium chip. The circuitry was so successful that it is just duplicated, with very little variation, from mouse to man. In the human cortex, there are just more cortical columns, about 1 million.

Over the past 10 years, Markram's laboratory has developed new techniques for multi-neuron patch-clamp recordings, producing highly quantitative data on the electrophysiology and anatomy of the different types of neurons and the connections they form. The obtained data give an almost complete digital description of the microstructure, operation and learning function, making it possible to begin the reconstruction of the NCC in SW.

The BB goal is merely to build a simulacrum of a biological brain. It is achieved when the outputs produced by the simulation in response to particular inputs are identical to those of the wet experiments. If that works, two directions are planned. Once the cellular model of the NCC is debugged and optimized, the BG/L will be replaced by a HW chip, which is easy to replicate in millions for simulation of the whole brain.

(1) http://bluebrainproject.epfl.ch/


The second track will be to work at a more elementary level: to simulate the brain at the molecular level and to look at the role of genes in brain function. Replacing the in vivo experiments by in silico simulation would turn years of brain research into days and save huge funds and lab animals. What is much more interesting is the hope that the project will shed some light on the emergence of consciousness. Scientists have no purchase on this evasive phenomenon at all, but "you have to start somewhere," quips Markram.

It is not the first attempt to build a computer model of the brain. This time, however, it is launched by neuro- and computer-science world leaders, so we take it seriously as the most ambitious project ever conducted in neuroscience.

Figure 1.1: What kind of HW can effectively relay the pure connectivity of these parallel computing threads, which are 3D and morphing at that? Any ideas on decomposition? [A video frame from the BB site]


    Chapter 2

Introduction

In this potpourri we'll rush through the milestones of neuro-hardware. We review the classical approach to HW engineering for neural models, discovering the hardware-significant peculiarities of ANNs, estimate the applicability of Blue Gene for neurosimulation and, finally, freeze mesmerized by the computational paradigm behind the BB project. Along the way, fundamental aspects of artificial intelligence will be considered.

    2.1 Brain research

Some computers think that they are intelligent. Tell them that they are wrong.

    Anecdote

All truths are easy to see once they have been discovered; the point is to discover them.

    Galileo

In this year, 2006, the world celebrates a century since the Spanish histologist Santiago Ramon Cajal was rewarded with the Nobel Prize for pioneering the field of cellular neuroscience through his research into the microscopic properties of the brain. He is credited as the founder of modern neuroscience after discovering the structure of the brain's cortical layers, composed of millions of individual cells (neurons), which communicate via specialized junctions (synapses).

The neocortex constitutes about 85% of the human brain's total mass and is thought to be responsible for the cognitive functions of language,


learning, memory and complex thought. It is also responsible for the miracles of thought that make people creative, inventive and philosophical enough to ask such questions. During the century of neuroscience research, it has been discovered that the neocortex's neurons are organized into columns. These cylindrical structures are about 0.5 mm in diameter and 2 to 4 mm high but pack inside up to 60 000 neurons, each with 10 000 connections to others, producing 5 km of cabling (1). In fact, what we call the gray matter is just a thin surface of neuron bodies that covers the white matter, the insulated cabling.

Any biological research contributes to technology and science. Looking at the finesse of living creatures, we marvel at their beauty so much that even 200 years after Darwin's publication some cannot believe that an unintelligent random process under the pressure of natural selection can generate such perfection [2]. Throughout history, people have drawn resources from Nature and inspiration from its evolution-optimized appliances, like wings and silk. The computers were invented as a byproduct of the attempt to formalize consciousness and computability, started by Hilbert at the beginning of the XX century. Von Neumann derived the computer from the theoretical Turing machine. This machine executes a prescribed algorithm at speeds as high as 10^9 op/sec. It is astonishing how much can be done by simple (in terms of complexity theory) algorithms running automatically. Throughout the 20th century, mankind has developed communication and information processing technology, entering the information society. The very idea of evolution was adopted by computer technology in the forms of OOP and genetic algorithms. Looking for more advanced computation techniques, mankind has finally resorted to the secrets of the brain.

Although almighty, biological evolution is useless during the life of its creatures, since genes act slowly. But the pressure to react immediately in a rapidly changing environment forced them to create the nervous system, which prompts

    1. adequate solutions

    2. in real-time

3. by analyzing incomplete and contradictory sensory information.

Neural networks (NNs) turn out to be good where it is difficult or impossible to solve a problem by mathematics. Unlike traditional computer methods, they learn by example rather than solve by algorithm. Exactly these adaptation capabilities outwitted the large-toothed enemies, discovered and subordinated the forces of nature and, finally, try to understand themselves.

It is time to take over the most powerful tool in nature, created by millions of years of evolution.

(1) http://bluebrainproject.epfl.ch/TheNeocorticalColumn.htm


Just as computer science studies how to combine logic gates into a computer, neuroscience studies how the brain is made of neurons. Its neuroinformatics branch uses mathematical and computational techniques, such as simulation, to understand the function of the nervous system. It is still not clear, however, that the goal is reachable.

The issue is that, despite giving us the computer, Hilbert's program on the foundations of mathematics has failed: it was shown that some truths, which can be seen by a human, cannot be deduced by a formal algorithm (a computer). Examples of such true sentences are Turing's halting problem and the equivalent Gödel sentence P = (P cannot be proved). Some argue that the brain must necessarily possess a degree of randomness (2) in the struggle for survival, because its purpose is to deceive the enemy, whereas predictability means defeat. Random generators, being capable of transcending the deduction of formal logic, are also credited as tools of creativity. In [3], this is presented as an entertaining story, which points to the mind as that very God's stone, which might be created but cannot be understood. Penrose agrees that algorithmic computation exists to weed out suboptimal solutions generated at random. He locates the source of randomness from the standpoint of theoretical physics [4]. Penrose sees a gap between the micro- and macro-worlds, considering the mind as a bridge, which performs the irreversible quantum wave-function reduction, thus joining the material world with the ideal world of Plato. Anybody who starts learning neural networks and quantum theory notes this similarity in their magic to traverse the huge problem space quickly (3). As one argument, Penrose points out that ideas, no matter how big they are, are comprehended as holistic pictures, like the huge ensembles of atoms that consolidate in pseudo-crystals. Promoting his own TOE (4), the prominent scientist attacks the formalists' strong AI, not saying a word about analogue computation or connectionism. Whoever is right, the tremendous success of the computer in the XX century suggests that neuroinformatics research will be the main challenge of the 21st century.

Let us start with the parallel architecture of the brain, which is capable of recognizing a mother's image in 1/100 sec while operating at frequencies as low as 10^3 Hz; that is, in less than 100 steps [5].

    2.2 Artificial NNs

    In order to explain its incredible features, mathematical models of the brain were proposed. They are known as Artificial Neural Networks (ANNs).

An ANN is a computational structure consisting of a multitude of elements (artificial neurons) interconnected by weighted synapses.

(2) According to the theory of algorithmic complexity, a random sequence is unpredictable: it is infinitely complex and corresponds to an infinitely long program.

(3) This is my opinion.

(4) Theory of everything.


[Figure: (a) artificial neuron; (b) examples of activation functions; (c) 3-layer perceptron; (d) Hopfield model]

The neurons are organized into layers: input, output (result) and hidden (intermediate) ones. Loop-less graphs are called direct-propagation networks, or perceptrons; otherwise the network is recurrent.

Many ANN tasks boil down to classification: any input is mapped to one of a given set of classes. Geometrically interpreted, this corresponds to breaking the space of solutions (or attributes) into domains by hyperplanes. Rosenblatt's perceptron consisted of one layer. The hyperplane for a two-input single-layer perceptron (a neuron) with a steep threshold function is the line x1·w1 + x2·w2 = T. Bisecting the plane, it allows one to implement the AND and OR binary functions, but not XOR. The XOR problem, revealing the limited capabilities of perceptrons, was pointed out by Minsky and Papert [6], suggesting that a hidden layer would resolve the linear separability problem.

The graph output depends on its topology and synaptic weights. The procedure of weight adjustment is called learning (or training). In one-layer perceptron learning, one of the input vectors X_t is submitted to the network and the output vector Y_t is analyzed. At iteration t, the weights w_ij of every input i of every neuron j are adjusted according to the error δ = R_t − Y_t between Y_t and the supervisor-provided reference R_t: w_ij(t+1) = w_ij(t) + k·δ_j·x_ij, where k ∈ [0, 1] is a learning-speed factor. The procedure is repeated until the network answers have converged to the references. Obviously, this technique is not applicable to the multilayer perceptron, because its hidden-layer outputs are unknown.
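To make the delta rule above concrete, here is a minimal sketch in Python/NumPy (an assumption; the report itself contains no code). The function names and the AND example are illustrative; the update implements w_ij(t+1) = w_ij(t) + k·δ_j·x_i with the threshold T folded in as a bias.

```python
import numpy as np

def step(z):
    # Steep threshold activation of the single-layer perceptron.
    return (z >= 0.0).astype(float)

def train_perceptron(X, R, k=0.1, epochs=100):
    """One-layer perceptron learning: X holds input vectors, R the references."""
    w = np.zeros((X.shape[1], R.shape[1]))      # weights w_ij
    b = np.zeros(R.shape[1])                    # threshold T folded in as a bias
    for _ in range(epochs):
        for x, r in zip(X, R):
            y = step(x @ w + b)                 # recall: Y_t = F(W * X_t)
            delta = r - y                       # error against the reference R_t
            w += k * np.outer(x, delta)         # w_ij += k * delta_j * x_i
            b += k * delta
    return w, b

# AND is linearly separable, so the rule converges; XOR would not.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
R = np.array([[0], [0], [0], [1]], dtype=float)
w, b = train_perceptron(X, R)
print(step(X @ w + b).ravel())                  # [0. 0. 0. 1.]
```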


In their work, Minsky and Papert conjectured that such learning is infeasible (5), effectively abridging the enthusiasm and funding of ANN research for the next 20 years.

With the invention of backward-propagation learning, multilayer perceptrons gained popularity. Its essence is gradient-descent minimization of the network's mean squared error E = Σ_j (y_j − d_j)². The weights w_ij connecting the i-th neuron of layer n with the j-th neuron of layer n+1 are adjusted as Δw_ij = −k·∂E/∂w_ij. The derivative exploits the smoothness of the activation function. The algorithm is not free of deficiencies. High weights may shift the working point of the sigmoids into the saturation area. Additionally, a high learning speed causes instability; it is therefore slowed down, and the network stops learning: the gradient descent is likely trapped in a local minimum. Another algorithm, simulated annealing, performs better.
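A minimal sketch of the gradient-descent update Δw_ij = −k·∂E/∂w_ij for a two-layer perceptron with sigmoid activations, trained on XOR (both the task and the network shape are illustrative assumptions, not from the report):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)           # desired outputs (XOR)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)             # input -> hidden weights
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)             # hidden -> output weights
k = 0.5                                                   # learning speed

for _ in range(10000):
    H = sigmoid(X @ W1 + b1)                              # forward pass
    Y = sigmoid(H @ W2 + b2)
    err_out = (Y - D) * Y * (1 - Y)                       # error signal at the output
    err_hid = (err_out @ W2.T) * H * (1 - H)              # error propagated backwards
    W2 -= k * H.T @ err_out; b2 -= k * err_out.sum(0)     # delta_w = -k * dE/dw
    W1 -= k * X.T @ err_hid; b1 -= k * err_hid.sum(0)

# Typically approaches [0, 1, 1, 0] after training (gradient descent can
# occasionally stall in a local minimum, as the text warns).
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```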

    Self-Organizing Models

It is also possible to learn without the training information. The self-organization capability is a cornerstone feature of all living systems, including nerve cells. As far as ANNs are concerned, self-organization means adjustment of the weighting factors, the number of neurons and the network topology. For simplicity, only the weights are adjusted.

One such approach is Hebbian learning. It reflects a known neurobiological fact: if two connected neurons are excited simultaneously and regularly, their connection becomes stronger. Mathematically, this is expressed as Δw_ij = k·x_ij(t)·y_j(t), where x_ij = y_i and y_j are the outputs of the i-th and j-th neurons.

Building a one-layer, fully recurrent NN whose size matches the length of the input vector (object) and whose weights are programmed by the Hebb algorithm, we get the epochal Hopfield model. With the state initialized by the input, it is iterated as X_{t+1} = F(W·X_t) until convergence. The state is effectively attracted to one of the synapse-predefined states, accomplishing the classification. Basically, the system energy E = −0.5·Σ w_ij·x_i·x_j is minimized. The neural associative memory (NAM) operates similarly: the input vectors are predefined in pairs with their outputs.
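A minimal sketch of the Hopfield recall just described, with the weights programmed by the Hebb rule; the bipolar patterns and sizes are illustrative assumptions.

```python
import numpy as np

def hebb_store(patterns):
    # Hebbian storage: accumulate outer products of the bipolar (+1/-1) patterns.
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)                # no self-connections
    return W / len(patterns)

def energy(W, x):
    return -0.5 * x @ W @ x                 # E = -0.5 * sum_ij w_ij x_i x_j

def recall(W, x, steps=20):
    for _ in range(steps):
        x_new = np.sign(W @ x)              # X_{t+1} = F(W * X_t)
        x_new[x_new == 0] = 1
        if np.array_equal(x_new, x):        # converged to an attractor
            break
        x = x_new
    return x

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1,  1, 1,  1, -1, -1, -1, -1]])
W = hebb_store(patterns)
noisy = patterns[0].copy(); noisy[0] *= -1               # flip one bit
print(recall(W, noisy))                                  # attracted back to pattern 0
print(energy(W, patterns[0]) < energy(W, noisy))         # stored state has lower energy
```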

Kohonen self-organizing maps (SOMs) minimize the difference between a neuron's input and its synaptic weight: Δw_ij = k·(x_i − w_ij). In contrast to the Hebbian algorithm, the weights are adjusted not for all neurons, but rather for the group around the neuron responding most strongly to the input. Such a principle is known as learning by competition. At the beginning, the neighborhood is set as large as 2/3 of the network and is shrunk down to a single neuron during the course of training. This shapes the network so that close input signals correspond to close neurons, effectively implementing the categorization task.
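A minimal sketch of this learning-by-competition on a one-dimensional map, with the shrinking neighborhood described above; the map size, data and decay schedules are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, dim = 20, 2
W = rng.random((n_neurons, dim))                    # one weight vector per neuron
data = rng.random((500, dim))                       # input signals
k, radius = 0.5, n_neurons * 2 // 3                 # wide initial neighbourhood

for x in data:
    winner = np.argmin(np.linalg.norm(W - x, axis=1))        # competition
    lo, hi = max(0, winner - radius), min(n_neurons, winner + radius + 1)
    W[lo:hi] += k * (x - W[lo:hi])                           # delta_w = k * (x - w)
    k = max(0.01, k * 0.99)                                  # decay learning speed
    radius = max(0, int(radius * 0.99))                      # shrink to the winner only

# Nearby neurons now respond to nearby inputs (a topology-preserving map).
print(np.round(W[:5], 2))
```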

(5) http://ece-www.colorado.edu/~ecen4831/lectures/NNet3.html


    2.3 Demand for the Neural HW

The maturity of some NN algorithms and the importance of their intrinsic non-linearity, contrasted with classical linear approaches, have long been proven [7]:

Quite to our surprise, connectionist networks consequently outperformed all other methods received, ranging from visual predictions to sophisticated noise reduction techniques.

The problem is to find a proper substrate for their execution. Real applications tend to require large networks and to process many vectors quickly. Ordinary HW is extremely inefficient here. Furthermore, as the field of theoretical neuroscience develops and the electrophysiological evidence accumulates, researchers need ever more efficient computational tools for the study of neural systems. In cognitive neuroscience and neuroinformatics, neurosimulation is the essential approach for understanding the complex processing in the brain. The computational resources required far exceed those available to researchers.

There is always demand for faster computing. The two approaches to speeding up are the speed demon, which means waiting for faster processors, and the brainiac, running many processors simultaneously. The latter is supported by CMOS technology, packing billions of gates into micro-areas. All that is left to do is learn how to connect them efficiently. Those who are familiar with parallel architectures know that the subject is all about scalability. You cannot just take a handful of fast processors and obtain a linear speedup, since the computation overhead, along with the synchronization and communication inevitably involved, limits the performance growth.

That is not the case for the brain, which demonstrates tremendous scalability, from a few neurons in primitive species up to 100 billion in humans, at an unprecedented connectivity of 10 000 connections per neuron. As [8] points out in his review, the terms "parallel architectures" and "neural networks" are so close that they are often used as synonyms.

Summarizing the classical ANN models presented above, neurosimulation consists of two phases: 1) in the learning phase, the weights are adjusted in accordance with the input examples; 2) inputs are mapped to outputs during recall. The neuron processing is simple: a real-valued activation function is applied to the weighted sum of the inputs (a scalar multiplication) and the resulting activation value is broadcast over to the other nodes. Basically, two simple operations are needed: multiply-and-accumulate and the activation function, F(W·X). Nevertheless, owing to the large number of neurons, which are massively interconnected at that, the workload in the recall phase ends up quite involved. The learning phase is even more burdensome. However, all the synapses and neurons operate independently.
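In code, the recall workload of one layer is literally those two operations; a short sketch (the sigmoid choice and sizes are assumptions):

```python
import numpy as np

def recall_layer(W, x):
    z = W @ x                            # multiply-and-accumulate: weighted sums
    return 1.0 / (1.0 + np.exp(-z))      # activation function F(W * X)

W = np.random.default_rng(2).normal(size=(3, 5))    # 3 neurons with 5 inputs each
print(recall_layer(W, np.ones(5)))
```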

The classical ANN models suggest that the information and computation are uniformly distributed over the network in such a way that the stored


objects are memorized in the synapses so that every synapse bears information on all the memorized objects. This utmost diffusion of information is opposed to the unambiguity pursued in the classical (mechanical, symbolic) algorithms and data structures. This revolutionary concept, the highest degree of distribution, is defined as connectionism. Confining all the computation right into the connections would result in truly wired logic.

The inherently parallel neural models can make the most of highly parallel machines. However, the redundancies and the simulation of neural operations, instead of their implementation in HW, make general-purpose supercomputers expensive and slow. The HW is fast and efficient when it is in line with the model it executes. The degree of parallelism inherent in neuroprocessing begs for parallel execution. Neuroscience inspires computer scientists to look for the optimal substrate for the neural models: fast and efficient parallel HW architectures. Massively parallel VLSI implementations of NNs would combine the high performance of the former with the good scalability of the latter.


    Chapter 3

    Traditional Approach

3.1 Simulating Artificial Neural Networks on Parallel Architectures

The classical NN models were presented in the previous chapter. This chapter will overview the machines built for simulating them. The novel, spiking-based models will be discussed in the following chapter.

The field of neurosimulation is still young and is used primarily for model development in research labs. This means that besides efficiency (cost, power, space), the machine must support different models and, particularly, different activation functions: threshold, sigmoid, hyperbolic tangent. This degree of freedom is known as flexibility; the ability to support existing and novel paradigms, and programmability, are required. Another feature of a good architecture, modularity, is defined in [7] in two different ways: 1) the possibility to offer a machine matching the user's problem size and available funding and to fully exploit its resources; 2) the possibility to replace elements of the framework without redesigning the machine, in order to support new elements. Whereas speed and efficiency favor each other, flexibility conflicts with them. It is natural, therefore, that the lifetime of ANN models starts on general platforms and migrates to special ones, as is planned in BB.

Figure 3.1 summarizes the taxonomies of the neurosimulation platforms. I have added FPGA because this class of universal computing devices is missing in the taxonomies developed earlier.

    3.2 Mapping neural networks on parallel machines

The ANN is called a guest graph G(N, W): a set of neurons N interconnected by weighted synapses W. Its possible topologies are fully recurrent, random, layered, toroidal, modular, and hierarchical. It is mapped onto the target HW, called a host graph H(P, C): a set of processing elements (PE) interconnected by connecting elements (CE) into an architecture.


    Figure 3.1: Taxonomy of neural platforms

In the biological implementation we have G = H, an isomorphic one-to-one mapping, where the NN is organized into a hierarchy of modules. A computer PE supports one or more neurons, and each CE supports several virtual node-to-node connections.

The parts of the network are processed by the processing elements (PE). One chip may incorporate more than one PE. The key concerns for an efficient mapping of the network onto the available PEs are load balancing, minimizing inter-PE communication, and synchronization. Furthermore, the mapping should be scalable both for different network sizes and for different numbers of processing elements. The amount of parallelism achieved depends on the granularity of the problem decomposition. The following levels of parallelism are exploited (ordered from coarsest to finest granularity):

training-session parallelism: processors emulate a network independently (their results may be exchanged after a number of cycles);

pattern parallelism: every processor simulates a different input vector on its own copy of the network, reducing the communication to zero (a sketch follows below);

layer parallelism: concurrent execution of layers within a network. A popular variation of layer parallelism is to divide a layer vertically. This makes more sense because layers are computed in sequence;

neuron parallelism: a whole layer of neurons or the full network is simulated in parallel; and


Figure 3.2: Mapping an ANN (guest graph) onto a parallel machine (host) [9]

    synapse parallelism: simultaneous weighting.

Despite this categorization being made with feed-forward networks and backpropagation learning in mind, it can be applied to many other models. Packing more neurons onto a single processor has proved to be advantageous in coarse-grained processing.
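A software sketch of the pattern parallelism referenced in the list above: each worker holds its own copy of a toy one-layer network and recalls its own slice of the input vectors, so no communication is needed until the results are gathered. The network, sizes and worker count are illustrative assumptions.

```python
import numpy as np
from multiprocessing import Pool

# Shared, read-only weights of a toy one-layer network (same in every worker).
W = np.random.default_rng(0).normal(size=(10, 8))

def recall_batch(X_slice):
    # Each worker computes F(W @ x) for its own patterns independently.
    return 1.0 / (1.0 + np.exp(-(X_slice @ W.T)))

if __name__ == "__main__":
    X = np.random.default_rng(1).random((1000, 8))       # input vectors (patterns)
    with Pool(processes=4) as pool:
        # Split the patterns across processors; merge the results only at the end.
        outputs = np.vstack(pool.map(recall_batch, np.array_split(X, 4)))
    print(outputs.shape)                                 # (1000, 10)
```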

    3.3 Benchmarking

Performance measurements play a key role in deciding about the applicability of a neuro-implementation. Yet, because of the field's immaturity, there is no standard benchmarking application like the TeX and Spice we have in ordinary computing. Only a few exceptions exist, like NETtalk, and they are sometimes used for comparisons. The most commonly accepted measures are CPS (Connections Per Second), which measures how fast a network performs the recall phase, and CUPS (Connection Updates Per Second). Implementations are compared against an Alpha workstation and alternative designs. This measure is blamed [7] (EPFL) to be even more deceptive than FLOPS, for two reasons:

it defines neither the network model nor its size and precision. A more complex neuron can replace many simpler ones and prolong data lifetime on a PE, considerably reducing the communication, which is crucial for I/O-bound NNs. The missing benchmark application allows the developers to choose a best-case network and misses any information on the capability of the structure to adapt to different problems;

the definition of CPS/CUPS is vague, if not unexciting, to levels exceeding MIPS and FLOPS. This allows the Adaptive Solutions designers to obtain it as the product of the number of connections in the network and the number of input vectors processed per second. The Philips Lneuro reports MCUPS > MCPS because the time to perform one operation is left out.
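A toy illustration, with made-up numbers, of why the criticized CPS figure is so easy to inflate when it is obtained simply as connections times input vectors per second:

```python
# Made-up figures; the point is only that CPS says nothing about the model,
# its precision, or the learning rule.
connections = 50_000            # synapses in a chosen best-case network
vectors_per_second = 4_000      # input vectors pushed through per second

cps = connections * vectors_per_second
print(f"{cps / 1e6:.0f} MCPS")  # 200 MCPS, regardless of what a 'connection' does

# A "faster" network in CPS terms may simply be a smaller or lower-precision one:
print(f"{(connections // 10) * (vectors_per_second * 20) / 1e6:.0f} MCPS")
```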


[Figure: (a) a ring architecture; (b) a bus architecture]

The critics propose a rational evaluation, as used by an Army/Navy CFA Committee in selecting a computer architecture for future military implementations. They mention that not all organizations proposing new architectures can use the latest technologies: for instance, large companies may own advanced CMOS processes or invest heavily in consolidated technologies as a key factor of performance, whereas academic projects are funds-limited. The idea is, therefore, to select the architecture by evaluating it and undergoing a novel implementation, regarding the original implementation as irrelevant.

A detailed theoretical analysis of neural HW architectures, based on the measurements adopted in parallel architectures, is done in [7] (EPFL). [10] shows how to analyze the mappings. The basic assumption taken is that in massively parallel neural network implementations the major bottleneck is formed by the communication process rather than by the calculation of the neural activation and learning rules. The efficiency of a neurocomputer implementation can therefore best be defined in terms of the time taken by a single iteration of the total NN. In [11], the same authors give a broad overview of the parallel machines used for neural simulations.

    3.4 Simulation on General-Purpose Parallel Machines

The general parallel architectures as possible hosts for neurosimulations are characterized by a large number of processors organized according to some topology. Depending on the presence or absence of central control, parallel computers may be divided into two broad categories: data-parallel and control-parallel. The two categories require quite different styles of programming.

Data-parallel architectures simultaneously process large distributed data sets using a centralized control flow. A large amount of data is processed by a large number of processors in a synchronous (typically SIMD) or regular (e.g. pipelined) fashion. Pipelining usually provides layer parallelization.

Pipeline structuring is often exploited on systolic arrays: specific hardware architectures designed to map high-level computation directly onto HW. Numerous simple processors are arranged in one- or multi-dimensional arrays, performing simple operations in a pipelined fashion. Circular communication ensures that data arrive at regular time intervals from (possibly) different directions.

The ring can be 100% efficient on fully recurrent ANNs [10]: every node computes one MAC operation and advances the partial result and its own output to the next node for accumulation.


Once all the sums are computed, the nodes apply the activation function and start a new round.
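A software sketch of one such ring round for a fully recurrent network: each node owns one neuron (one weight row and its current output); in every step a node performs a single MAC on the output that has just arrived and passes it on, and after n steps every node holds its complete weighted sum and applies the activation function. The sequential inner loop stands in for what the hardware nodes do simultaneously; this is an illustrative sketch, not a description of any particular machine.

```python
import numpy as np

n = 6
rng = np.random.default_rng(3)
W = rng.normal(size=(n, n))                    # node i stores weight row W[i]
y = rng.random(n)                              # current outputs, one per node

def ring_round(W, y):
    n = len(y)
    partial = np.zeros(n)                      # one accumulator per node
    circulating = y.copy()                     # outputs travelling around the ring
    owner = np.arange(n)                       # whose output each node currently holds
    for _ in range(n):
        for node in range(n):                  # in HW, all nodes do this in parallel
            partial[node] += W[node, owner[node]] * circulating[node]
        circulating = np.roll(circulating, 1)  # advance outputs to the next node
        owner = np.roll(owner, 1)
    return 1.0 / (1.0 + np.exp(-partial))      # activation starts the next round

# The ring reproduces the plain matrix-vector recall F(W @ y).
print(np.allclose(ring_round(W, y), 1.0 / (1.0 + np.exp(-(W @ y)))))   # True
```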

Control-parallel architectures perform processing in a decentralized manner, allowing different programs to be executed on different processors (MIMD). A parallel program is explicitly divided into several different tasks, which are placed on different processors. The communication scheme is usually general routing, i.e. the processors are message-passing computers. Transputer systems are the most popular control-parallel neural simulators.

Writing this paper for a course on parallel architectures, it is curious to note that [8] identifies the data-parallel decomposition with the SIMD architecture and the multiprocessor. This equivalence class is opposed to another one consisting of parallelized control, the MIMD architecture and the message-passing multicomputer. The author lists some neural simulations on general-purpose computers; the summary is presented in Table 3.1. The author concludes that data-parallel techniques significantly outperform their control-parallel counterparts. In my opinion, it is not fair to compare the power of a group of six transputers against kilo-processor armies. Admitting linear scalability of the transputers, the data shows that they would be an order of magnitude faster. Yet, from a theoretical point of view, it is reasonable to think that the data-parallel architectures are a natural mapping of the neuroparadigm, since the neural computations are most often interpreted in terms of synchronous matrix-vector operations. For this reason, it is not surprising that control-parallel architectures are programmed in the data-parallel style. [9] explains that MIMD is used out of necessity, because production of SIMD machines stopped a long time ago.

Curiously, [12] finds the Beowulf cluster the most attractive in his review of simulating NNs on parallel general-purpose computers. Equipped with a high-speed interconnection network such as Myrinet, Beowulf offers excellent performance at a very competitive price. This cost advantage can often be as high as an order of magnitude over multiprocessor machines of comparable capabilities.

Programming neural networks on parallel machines requires high-level techniques reflecting both the inherent features of the neuromodels and the characteristics of the underlying computers. To simplify the task of the neuroscientist, a number of parallel neurosimulators were proposed on general-purpose machines. Some institutions develop libraries for MIMD supercomputers that enable NN developers to use the supercomputer efficiently without specific knowledge [9]. Others develop professional portable neurosimulators, like NEURON and NCS. The compromise between portability and efficiency is usually achieved by parallel programming environments, e.g. the Message-Passing Interface (MPI), Parallel Virtual Machine (PVM), Pthreads and OpenMP, on heterogeneous and homogeneous clusters and multiprocessors.

Simulations on general-purpose parallel computers were mostly done in the late eighties.


A large number of parallel neural network implementation studies have been carried out on the existing massively parallel machines listed below, simply because neural hardware was not available. Although these machines were not specially designed for neural implementations, in many cases very high performance rates have been obtained. Universal computers still remain popular in neurocomputing because they are more flexible and easier to program.

Zhang et al. [13] have used node-per-layer and training parallelism to implement backpropagation networks on the Connection Machine. Each processor is used to store a node from each of the layers, so that a slice of nodes lies on a single processor. The number of processors needed to store a network is equal to the number of nodes in the largest layer of the network. The weights are stored in a memory structure shared by a group of 32 processors, reflecting the CM's specific architecture. With 64K processors, the CM is a perfect candidate for training-example parallelism. The authors use network replication to fully utilize the machine. The NETtalk implementation achieves a peak performance of 180 MCPS and 38 MCUPS.

The MasPar implementation [14], similarly to Zhang's, exploits both layer and training-session parallelism. Each processor stores the weights of the corresponding neurons in its local memory. In the forward phase, the weighted sums are evaluated, with intermediate results rotated from right to left across the processor array using MasPar's local interconnect. Once the input values have been evaluated, the sigmoid activation functions are applied and the same procedure is repeated for the next layer. In the backward phase, a similar procedure is performed, with errors propagated from the output down to the input layer. After performing a number of training examples on multiple copies of the same network, the weights are synchronously updated. The maximal NETtalk performance obtained is 176 MCPS and 42 MCUPS.

Rosenberg and Belloch [15] have used node and weight parallelism to implement backpropagation networks on a one-dimensional array, with one processor allocated to a node and two processors to each side of a connection: input and output. The connection processors multiply the values by their respective weights. The nodes accumulate the products and compute the sigmoids. The backpropagation is done in a similar way. The NETtalk maximum speed achieves 13 MCUPS.

Pomerleau et al. [16] have used training and layer parallelism to implement backpropagation networks on a Warp computer with processors organized in a systolic ring. In the forward phase, the activation values are shifted circularly along the ring and multiplied by the corresponding weights. Each processor accumulates the partial weighted sum. When the sum has been evaluated, the activation function is performed. The backward processing is similar, but instead of activation values, accumulated errors are shifted circularly.


Structuring    Parallelism        Num. of    Computer architecture               Performance
technique                         procs.                                         CPS      CUPS
---------------------------------------------------------------------------------------------
COARSE         training, layer    64K        Connection Machine (Zhang 90)       180M     38M
COARSE         training, layer    16K        MasPar (Zell 90)                    176M     42M
FINE           node, weight       64K        Connection Machine (Rosenberg 87)            13M
PIPELINED      training, layer    10         Warp (Pomerleau 88)                          17M
PIPELINED      layer, node        13K        Systolic Array (Chung 92)                    148M
COARSE         partitions         6          Transputers (Straub 91)                      207K

Table 3.1: NETtalk implementations on general supercomputers. The results are from the late 80s and early 90s.

Performance measurements for the NETtalk application on a Warp computer showed a speed of 17 MCUPS.

Chung et al. [17] have applied classical systolic algorithms for matrix-by-vector multiplication when simulating backpropagation networks. They exploited layer and node parallelization by partitioning the neurons of each layer into groups and by partitioning the operation of each neuron into the sum or product operation and the non-linear function. The execution of the forward and backward phases was also done in parallel by pipelining multiple input sets. The NETtalk application, run on a 2-D systolic array with 13K processing elements, achieved a maximum speed of 248 MCUPS.

In [18], the authors describe a backpropagation implementation on the T8000 system, consisting of a central transputer and six slaves. A multilayer feedforward network is vertically divided so that each slave contains a fragment of nodes from each layer. Computation is synchronized by the master so that the layers are computed in sequence. This is similar to layer decomposition, but the execution flow is closer to the SPMD model. The authors give 58 KCUPS for a small network and 207 KCUPS for a larger network, which utilizes the processors better.

MindShape [10] is a fractal-architecture universal computer that was designed for simulating the brain; inspired by the fractal brain organization, its authors propose a fractal node-parallel architecture. Analyzing the node-parallel mappings, they conclude that the scalability is bound by the communication overhead, O(1) < t_i < O(n): the network iteration time grows with the number of nodes n (node parallelism). For instance, t_i is O(n) for the fully recurrent network; there is no good architecture for it. Using this measure, it is shown how the fractal architecture manages to host


Figure 3.3: The fractal topology and the MindShape architecture: the same module interconnect pattern at all levels of the hierarchy

different guest topologies: fully and randomly connected, layered feedforward, torus, modular, and hierarchical modular (fractal). The hierarchical ones perform the best (O(1)).

The CEs store the transfer tables: the information on what data has to go where. The brain capacity is estimated as 9·10^10 neurons x 10^4 synapses/neuron x 7 bits/synapse at 500 Hz = 53 PCUPS. With 256 neurons per chip, 32 chips on a board, 32 boards per rack, 32 racks per floor and 32 floors at 100 MHz, we could deliver 8.7 PCUPS, the cortex capacity. The volume of such a system is equivalent to the first IBM computer. 2-byte weights are proposed, totaling 8.3 TByte. The issue of fault tolerance at this scale is discussed.

    3.5 Neurocomputers

Despite the advances, the speed and efficiency requirements cannot be successfully met by general-purpose parallel computers. The general-purpose neuroarchitectures offer generic neural features aiming at a wide range of ANN models. The neurocomputers can be further specialized for simulating concrete models and networks.

Architecturally, neurocomputers are large processor arrays: complex regular VLSI architectures organized in a data-parallel manner. A typical processing unit of a neurocomputer has local memory for storing weights and state information. The whole system is interconnected by a parallel broadcast bus and usually has a central control unit. The data-parallel programming techniques and HW architectures are the most efficient for neural processing. The dominating approaches are systolic arrays, SIMD and SPMD processor arrays.

Important for the design of highly scalable hardware, finding an interconnection strategy for large numbers of processors has turned out to be a non-trivial problem. Much knowledge about the architectures of these massively parallel computers can be directly applied in the design of neural architectures. Most architectures are, however, regular, for instance grid-based, ring-based, etc. Only a few are hierarchical. As was argued in [11], the latter form the most brain-like architecture.


Analog architectures tend towards full connectivity. Digital chips use a localized communication plan. Three architectural classes of system interconnect can be distinguished: systolic, broadcast bus, and hierarchical architectures. Systolic arrays are considered non-scalable. According to many designers, broadcasting is the most efficient multiplexed interconnection architecture for large fan-in and fan-out. It seems that broadcast communication is often the key to success in getting communication and processing balanced, since it is a way to time-share communication paths efficiently.

2D architectures are less modular and less reconfigurable, as their data flow is quite rigid. At the same time, they allow throughputs much higher than 1D architectures.

Implementing neural functions on special-purpose chips speeds the neural iteration time up by about two orders of magnitude compared to a general-purpose µP. The common goal of the neurochip designers is to pack as many processing elements as possible into a single silicon chip, thus providing faster connectivity. To achieve this, developers limit the computation precision. [11] remarks that overfocusing on this shoves the inter-chip connectivity issue into the background, even though it is also important for integrating the chips into a large-scale architecture.

Digital technology has produced the most mature neurochips, providing flexibility (programmability) and reliability (stable precision compared to analog) at relatively low cost. Furthermore, due to mass production, a lot of powerful tools for custom design are available. Numerous programs for digital neurochip design are offered; all major microchip companies and research centers world-wide have announced their neuroproducts. Digital implementations use thousands of transistors to implement a single neuron or synapse.

On the other hand, these computationally intensive calculations can be performed automatically by analog physical processes such as the summing of currents or charges. Operational amplifiers, for instance, are easily built from single transistors and automatically perform synapse- and neuron-like functions, such as integration and the sigmoid transfer. Being natural, analog chips are very compact and offer high speed at low energy dissipation. Simple neural (non-learning) associative memory chips with more than 1000 neurons and 1000 inputs each can be integrated on a single chip, performing about 100 GCPS [11]. Another advantage is the ease of integration with the real world, while digital counterparts need AD/DA converters.

The first reason why analog did not replace digital chips is a lack of flexibility: analog technology is usually dedicated to one model and results in a scarcely-reusable neurocomputer. Another problem, that of representing adaptable weights, limits the applicability of analog circuits. Weights can, for instance, be represented by resistors, but such fixing of the weights during production of the chips makes them non-adaptable: they can only be used in


the recall phase. Capacitors suffer from limited storage time and troublesome learning. Off-chip training is sometimes used, with refreshing of the analog memory. For on-chip training, statistical methods, like random weight changing, are proposed in place of back-propagation, because its complex computation and non-local information make it prohibitive. Other memory techniques are incompatible with the standard VLSI technology.

In addition to the weight storage problem, analog electronics is susceptible to temperature changes, (interference) noise, and VLSI process variations, which make analog chips less accurate, make it harder to understand what exactly is computed, and complicate design and debugging. At the same time, practice shows that realistic neuroapplications often require accurate calculations, especially for back-propagation.

Taking these drawbacks into account, the optimal solution would appear to be a combination of both analog and digital techniques. Hybrid technology exploits the advantages of the two approaches. The optimal combination applies digital techniques to perform accurate and flexible training and uses the potential density of analog chips to obtain finer parallelism on a smaller area in the recall phase.

Here, we do not consider optical technology, which introduces photons as the basic information carriers. They are much faster than electrons and have fewer interference problems. In addition to the greater potential communication bandwidth, the processing of light beams also offers massive parallelism. These features put optical computing first among the possible candidates for the neurocomputer of the future. Optics ideally suits the realization of dense networks of weighted interconnections. Spatial optics offers 3-D interconnection networks with enormous bandwidth and very low power consumption. Besides optoelectronics, electro-chemical and molecular technologies are also very promising. Despite the enormous parallel processing and 3D connection prospects of optical technology, silicon technology continues to dominate, with more and more neurons being packed on a chip.

Neurocomputers are popular in the form of accelerator boards added to personal computers.

    The CNAPS System (Connected Network of Adaptive Processors, 1991)

developed by Adaptive Solutions became one of the most well-known commercially available neurocomputers. It is built of N6400 neurochips, each of which consists of 64 processing nodes (PN) connected by a broadcast bus in SIMD mode. Two 8-bit buses allow broadcasting of the input and output data to all PNs and easily adding more chips. Additionally, the buses connect the PNs to a common instruction sequencer. The PNs are designed like DSPs, including a fixed-point MAC, and are equipped with 4 KB of local SRAM for holding the weights: one matrix for learning and one for back-propagation learning. This limits the system size: the performance drops dramatically when the 64 PNs try to communicate over the two


buses, which becomes necessary when the network and the weight matrix grow. The complete CNAPS system may have 512 nodes connected to a host workstation and includes SW support. It uses layer decomposition and offers a maximum performance of 5.7 GCPS and 1.46 GCUPS, tested on a backpropagation network. The machine can be used as a general-purpose accelerator.

The SYNAPSE System (Synthesis of Neural Algorithms on a Parallel Systolic Engine) was built by Siemens in 1993 of MA-16 neurochips designed for fast 4x4 matrix operations with 16-bit fixed-point precision. The chips are cascaded to form a systolic array: one MA-16 chip outputs to another in a pipelined manner, ensuring optimal throughput. The two parallel rings of SYNAPSE-1 are controlled by Motorola processors. The weights are stored off-chip in 128 MB of SDRAM. Similarly to CNAPS, a wide range of NN models is supported but, in contrast to SIMD, programming is difficult because of the complex PEs and the 2D systolic structure. The system is packaged with SW to make neuroprogramming easier. Each chip's throughput is 500 MCPS; the full system performs at 5.12 GCPS and 33 MCUPS.

The RAP System, developed at Berkeley in 1993, is a ring array of DSP chips specialized for fast dot-product arithmetic. Each DSP has local memory (256 KB of static RAM and 4 MB of dynamic RAM) and a ring interface. Four DSPs can be packed on a board, with a maximum of ten boards. Each board has a VME bus interface to the host workstation. The processing is performed in a SPMD manner. Several neurons are mapped onto a single DSP in layer-decomposition style. The maximum speed of a 10-board system is estimated at 574 MCPS and 106 MCUPS.

The SAIC SIGMA-1 neurocomputer is a PC with a DELTA floating-point processor board and two software packages: an object-oriented language and a neural net library. The coprocessor can hold 3 M virtual processing elements and connections, performing 2 MCUPS and 11 MCPS.

The Balboa 860 co-processor board for PCs and Sun workstations is intended to enhance the neurosoftware package ExploreNet. It uses an Intel i860 as the central processor and reaches a maximum speed of 25 MCPS for a backpropagation network in the recall phase, and 9 MCUPS in the learning phase.

The Lneuro (Learning Neurochip, 1990) by Philips implements 32 input and 16 output neurons. By updating the whole set of synaptic weights related to a given output neuron in parallel, a sort of weight parallelism is reached. The chip provides on-chip learning with an adjustable learning rule. A number of chips can be cascaded within a reconfigurable, transputer-controlled network. The experiments with 16 LNeuro 1.0 chips report an 8x speed-up compared to an implementation on a transputer. Measured performance: 16 LNeuros on 4 dedicated boards show 19 MCPS and 4.2 MCUPS. The authors guarantee a linear speed-up with the size of the machine.


Figure 3.4: Representative neurochip-based architectures: (a) CNAPS made of N6400 chips; (b) SYNAPSE; (c) MANTRA: systolic array of GENES IV chips; (d) MANTRA: GENES IV processing element; (e) MANTRA-I system architecture.


Mantra I (1993, Swiss Federal Institute of Technology) is aimed at a multi-model neural computer which supports several types of networks and paradigms. It consists of a 2-D array of up to 40x40 GENES IV systolic processors and a linear array of auxiliary processors called GACD1. The GENES chips (Generic Element for Neuro-Emulator Systolic arrays) are bit-serial processing elements that perform vector/matrix multiplications. The Mantra architecture is in principle very well scalable. It is one of the rare examples of synaptic parallelism. It shares the difficult reconfigurability and programming with SYNAPSE. The slow controller and serial communication limit the performance. Performance: 400 MCPS, 133 MCUPS (backpropagation) with 1600 PEs.

BACHUS III (1994, Darmstadt University of Technology and Univ. of Dusseldorf, Germany) is a chip containing the functionality of 32 neurons with 1-bit connections. The chips are mounted together, resulting in 256 simple processors. The total system was called PAN IV. The chips are only used in the feed-forward phase; learning or programming is not supported and thus has to be done off-chip. The system only supports neural networks with binary weights. Applications are to be found in fast associative databases in a multi-user environment, speech processing, etc.

The analog Mod2 neurocomputer (Naval Air Warfare Center Weapons Division, CA, 1992) incorporates neural networks as subsystems in a layered hierarchical structure. The Mod2 is designed to support parallel processing of image data at sensor (real-time) rates. The architecture was inspired by the structures of the biological olfactory, auditory, and visual systems. The basic structure is a hierarchy of locally densely connected, globally sparsely connected networks. The locally densely interconnected network is implemented in a modular/block structure based upon the ETANN chip. Mod2 is said to implement several neural network paradigms, and is in theory infinitely extensible. An initial implementation consists of 12 ETANN chips, each able to perform 1.2 GCPS.

Epsilon (Edinburgh Pulse Stream Implementation of a Learning Oriented Network, 1992), developed at Edinburgh University, is a hybrid large-scale generic building-block device. It consists of 30 nodes and 3600 synaptic weights, and can be used both as an accelerator to a conventional computer and as an autonomous processor. The chip has a single layer of weights but can be cascaded to form larger networks. The synapses are formed by transconductance multiplier circuits which generate output currents proportional to the product of two input voltages. A weight is represented by fixing one of these voltages. The neurons have two modes: the synchronous mode uses pulse-width modulation and is specially designed with vision applications in mind; the asynchronous mode is provided by pulse-frequency modulation, which is advantageous for feedback and recurrent networks, where temporal characteristics are important. The synchronous implementation was successfully applied to a vowel recognition task. An


MLP network consisting of 38 neurons (hidden and output) was trained by the chip-in-the-loop method and showed performance comparable to a software simulation on a SPARC station. With this chip it has been shown that it is possible to implement robust and reliable networks using the pulse-stream technique. Performance: 360 MCPS.

    3.6 FPGAs

The massively parallel and reconfigurable FPGAs suit very well the implementation of highly parallel and dynamically adaptable ANNs. In addition, being general-purpose computing devices, FPGAs offer a level of flexibility sufficient for many neuromodels and are also useful for pre- and post-processing the interface around the network in the conventional way.

However, despite the custom-chip fine-grain parallelism offered, FPGAs are not true digital VLSI; they are one order of magnitude slower. Yet, the newest FPGAs incorporate ASIC multipliers and MAC units that have a considerable effect in the multiplication-rich ANNs. Floating-point operations are impractical; in particular, the non-linear activation (sigmoid) function, which is too expensive in direct implementation, is usually approximated piece-wise linearly.
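A minimal sketch of such a piece-wise linear sigmoid approximation; the three segments and their breakpoints are illustrative, real FPGA designs pick them to match the available fixed-point width.

```python
import numpy as np

def sigmoid_pwl(z):
    # Saturate outside [-4, 4]; use a single linear segment 0.5 + z/8 in between.
    z = np.asarray(z, dtype=float)
    return np.where(z >= 4.0, 1.0, np.where(z <= -4.0, 0.0, 0.5 + z / 8.0))

z = np.linspace(-6, 6, 7)
print(np.round(sigmoid_pwl(z), 3))                # cheap approximation
print(np.round(1.0 / (1.0 + np.exp(-z)), 3))      # exact sigmoid, for comparison
```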

The reconfigurability permits neural morphing. During training, the topology and the required computational precision of an ANN can be adjusted according to some learning criteria. The review [19] refers to two works that used genetic algorithms to dynamically grow and evolve ANN-based cellular automata and implemented an algorithm which supports on-line pruning and construction of network models.

The reviewers mention the need for friendlier learning algorithms and software tools.

Below are some examples of implementing different models on the RAPTOR2000 board. Because of its flexibility, many other neural and conventional algorithms can be mapped onto the system and reconfigured at runtime.

    3.6.1 ANNs on RAPTOR2000

RAPTOR2000 is an extensible PCI board with dual-port SRAM. The (expansion) FPGAs are connected in a linear array with their neighbors (a 128-bit bus) as well as by two buses that are 75 and 85 bits wide. It was tested on three sample applications.

For the Kohonen SOM, four Virtex FPGAs were connected in a 2D array. A fifth FPGA implements the host-PC interfacing controller, communicating the NN input vectors and results, and is equipped with 128 MB of SDRAM for storing the training vectors. The architecture of the processing elements (PE) is similar to the ones proposed for ASICs. The FPGA BlockRAM is used for storing the weights.


Figure 3.5: RAPTOR2000: (a) the prototyping board; (b) SOM architecture; (c) BiNAM architecture; (d) radial basis functions.


Manhattan distances are used instead of Euclidean ones to avoid multiplications and square roots. An interesting trick is demonstrated: the PEs start learning in a fast, 8-bit precision configuration for the rough ordering of the map and are reconfigured to the slower 16 bits for fine-tuning. The number of cycles per input vector depends on the input-vector length l and the number of neurons per PE, n: c_recall = n · (l + 2·ld(l·255) + 4), and is almost twice that for learning. At the achieved 65 MHz clocking, the XCV812E-6 outperforms an 800 MHz AMD Athlon more than 30 times.

Another application is a Binary Neural Associative Memory (BiNAM). Using sparse encoding, i.e. when almost all bits of the input and output vectors are 0, the best storage efficiency and almost linear scalability are achievable for both recall and learning. Every processor works on its own part of the neurons, but large storage is required. More than a million associations can be stored on six Virtex modules (512 neurons per FPGA) using external SDRAM. Every FPGA has a 512-bit connection to the SDRAM bus and every neuron processes one column of the memory matrix. This 50 MHz implementation is limited by the SDRAM access time and results in a recall time of 5.4 µs.
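As a reminder of how such a binary associative memory operates, here is a minimal Willshaw-style software sketch (the sizes and the number of stored pairs are made up for the example; this is not the BiNAM hardware design): learning ORs the outer product of each sparse binary pair into a weight matrix, and recall thresholds a matrix-vector product.

    import numpy as np

    N_IN, N_OUT, K = 256, 256, 8              # vector sizes and number of active bits
    rng = np.random.default_rng(1)

    def sparse_vec(n, k):
        v = np.zeros(n, dtype=np.uint8)
        v[rng.choice(n, size=k, replace=False)] = 1
        return v

    W = np.zeros((N_IN, N_OUT), dtype=np.uint8)     # the binary memory matrix

    pairs = [(sparse_vec(N_IN, K), sparse_vec(N_OUT, K)) for _ in range(200)]
    for x, y in pairs:                        # learning: clipped Hebbian rule (logical OR)
        W |= np.outer(x, y)

    x0, y0 = pairs[0]
    s = x0 @ W                                # recall: column sums over the active input bits
    y_hat = (s >= x0.sum()).astype(np.uint8)  # threshold at the number of active inputs
    print("recall correct:", bool((y_hat == y0).all()))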

The last sample application is Radial Basis Function approximation. A network with a flexible number of hidden neurons is trained incrementally: if a good approximation is not achieved, a neuron is added and the learning restarts. This also minimizes the risk of getting stuck in a local minimum. A number of identical PEs compute their neurons in parallel. Data selectors assign the inputs and select the correct outputs, which are summed up in a global accumulator. Simultaneously, an error calculation unit analyzes the error, which is submitted to the controller and the PEs for the weight update. Such an implementation can run at 50 MHz with the number of cycles per recall given by: c = l + N_PE + (N_neur / N_PE) · ((4·l + 5) + 2).
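The incremental growth strategy itself is easy to illustrate in software. The sketch below is a hypothetical re-creation with Gaussian basis functions, an arbitrary target function and arbitrary thresholds, not the hardware algorithm: a new centre is placed where the current error is largest, the output weights are refit by least squares, and growth stops once the approximation is good enough.

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(-3, 3, 200)
    y = np.sinc(x) + 0.01 * rng.standard_normal(x.size)     # target function plus noise

    def design(x, centres, width=0.5):
        # Gaussian radial basis functions plus a bias column.
        phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))
        return np.hstack([phi, np.ones((x.size, 1))])

    centres = np.array([x[np.argmax(np.abs(y))]])            # first centre at the largest target
    for _ in range(30):                                      # grow at most 30 hidden neurons
        Phi = design(x, centres)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # refit the output weights
        err = y - Phi @ w
        if np.max(np.abs(err)) < 0.05:                       # good enough: stop growing
            break
        centres = np.append(centres, x[np.argmax(np.abs(err))])   # add a neuron at the worst point

    print(f"hidden neurons: {centres.size}, max error: {np.max(np.abs(err)):.3f}")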

    3.7 Conclusions

During the past five decades, the most frequently used types of artificial neural networks have been the perceptron-based models. Implementation projects such as those reported above are giving rise to new insights, insights that most likely would never have emerged from simulation studies. However, the neurocomputers have not unlocked the full potential for fast, scalable and user-friendly neurosimulation.

In the late 1980s and early 1990s, neurocomputers based on digital neurochips reached the peak of their popularity; some even came out of the research laboratories and entered the market. However, the progress stalled and the initial enthusiasm decreased because 1) user experience in solving real-world problems was not very satisfactory and 2) they had to compete with the general-purpose microprocessors that kept growing according to Moore's Law.


    Figure 3.6: Neurocomputer Performances [11]

Development of custom chips¹ is very expensive (and especially hard for the connectionists, who are not familiar with such things as VHDL), the chips are less programmable, and it turns out better to rely on the massively produced and exponentially improving general-purpose microprocessors. DSPs are especially popular because of their highly parallel SIMD-style MACs for the synapse computations and the tightly integrated FPU for computing the sigmoid. The relatively new FPGAs, which are also general-purpose computation devices, surpass their performance by one order of magnitude. These implementations will, of course, never be as efficient and fast as dedicated chips.

If you look at figure 3.6, neurocomputers are 2 orders of magnitude faster than the general-purpose multiprocessors. The neurocomputers made of neurochips are an additional 2 orders of magnitude faster than the microprocessor-based counterparts. The analog technology brings two further orders of magnitude.

Later works show that the analog inaccuracy can even be an advantage in the inherently fault-tolerant neurocomputing, remarking that the wetware of real brains keeps working surprisingly well over a wide range of temperatures and a wide variety of neurons. But recall that CPS is a vague figure. It was realized that the wetware which constitutes the animal brains uses more powerful spike-based models. The relatively recent trend of the last decade has been to move to the spiking neural networks [principles of designs for large-scale], which are much more powerful and yet allow simulating more neurons with less HW.

¹ [22] justifies it only 1) for large system solutions and 2) when topological and computational model flexibility through a user-simple description is provided.


    Chapter 4

    Spiking NNs

    4.1 Theoretical background

As opposed to the 2nd generation neural models presented above, real neurons do not encode their activation values as binary words in a computer. Rather, the axons are ion channels that propagate wave packets of charge. These pulses are called action potentials. Here the activation function acts like a leaky capacitor, which integrates the charge and fires a pulse once the sum of the pulses, the membrane potential, overcomes its threshold. The duration of the spikes is not taken into account.
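A minimal leaky integrate-and-fire neuron written as a plain Python sketch (the time step, time constant and threshold are illustrative values, not parameters from any model in this report) captures exactly this "leaky capacitor plus threshold" behaviour.

    import numpy as np

    DT, TAU, V_TH, V_RESET = 1.0, 20.0, 1.0, 0.0   # ms, ms, and arbitrary voltage units

    def lif_run(input_current, v0=0.0):
        """Simulate one leaky integrate-and-fire neuron; return its spike times (ms)."""
        v, spikes = v0, []
        for step, i_in in enumerate(input_current):
            # Leak toward zero and integrate the input charge.
            v += DT * (-v / TAU + i_in)
            if v >= V_TH:                           # threshold crossed: fire and reset
                spikes.append(step * DT)
                v = V_RESET
        return spikes

    rng = np.random.default_rng(3)
    current = 0.06 + 0.02 * rng.standard_normal(500)   # noisy constant drive, 500 ms
    print("spike times (ms):", lif_run(current))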

The integrate-and-fire model can operate with both rate coding and pulse coding (where the timing of each pulse is taken into account). Both encodings are computationally powerful and easy to implement in computer simulation as well as in hardware VLSI systems. Yet, since every biological neuron fires no more than 3 pulses during the estimated brain reaction time of 150 ms, the timing of pulses must be taken into account for realism. In fact, pulse coding allows the NN to respond even faster than the spiking period of a single neuron. As [20] remarks, pulse coding is very promising for tasks in which temporal information needs to be processed, which is the case for virtually all real-world tasks. Additionally, rate coding makes learning difficult.

Back-propagation is not suitable for the SNNs. Spike-timing dependent synaptic plasticity (STDP), which is a form of competitive Hebbian learning that uses the exact spike timing information, is used instead. Synapse strengthening, named long-term potentiation (LTP), occurs if the post-synaptic neuron fires a spike within 50 ms after the pre-synaptic one. More formally: Δw_ij = η · h(Δt), where the correlation function h(Δt) grows from 0 to some peak and then decays back to 0, so that late spikes do not affect the weights.
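A small sketch of such a timing-dependent weight update is given below. The exponential window with both a potentiation and a depression branch is the common textbook form of STDP; the amplitudes and time constants are assumptions for the example, not values from the cited works.

    import math

    A_PLUS, A_MINUS = 0.01, 0.012        # LTP / LTD amplitudes (assumed)
    TAU_PLUS, TAU_MINUS = 20.0, 20.0     # ms, width of the learning window (assumed)

    def stdp_dw(delta_t, eta=1.0):
        """Weight change for delta_t = t_post - t_pre (ms)."""
        if delta_t > 0:      # post fires after pre: potentiation (LTP)
            return eta * A_PLUS * math.exp(-delta_t / TAU_PLUS)
        if delta_t < 0:      # post fires before pre: depression (LTD)
            return -eta * A_MINUS * math.exp(delta_t / TAU_MINUS)
        return 0.0

    for dt in (-40, -10, 5, 20, 60):
        print(f"dt={dt:+4d} ms  dw={stdp_dw(dt):+.5f}")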

All the HW implementations examined make use of the sparse SNN connectivity and the low activity: only about 1 % of the neurons are firing in any time slot.


    Figure 4.1: The Action Potential and Spiking NN

    4.2 Sample HW

    4.2.1 Learning at the Edge of Chaos

Any internal dynamics emerges from the collective behavior of the interacting neurons; it is a product of the neuron coupling. As the authors of [21] mention, it is therefore necessary to study the coupling factor: the average influence of one neuron upon another.

They experiment on STDP by building a video-processing robot which learns to avoid obstacles (walls and moving objects), with the purpose of investigating the internal dynamics of the network. A Khepera robot with a linear (1-D horizon vision) camera and collision sensors is used for the experiment. The video image is averaged down to 16 pixels, which are fed to 16 input neurons, processed by 40 fully recurrent hidden neurons and two output neurons that control the two motors. The weights of the fully recurrent network are initialized at random from a normal distribution with some variance around its center.

Full-black color is encoded by 10 Hz spikes, full-white corresponds to 100 Hz. On average, an input spike arrives every 100 steps; in the meantime the network keeps firing and learning. The STDP learning factor is η = ±const, the sign depending on whether the robot moves or hits a wall, and 0 otherwise. Starting from chaotic firing, the neurons synchronize with each other and with the external world.

However, fast synchronization is not always good. The network must exhibit two contradictory dynamic features: plasticity, to remain responsive, and "autism", to maintain the stability of the internal dynamics, especially in the case of a noisy environment. The authors look at the average membrane potential developing in time: m(t) = ⟨V_i(t)⟩. The coupling must be high enough to avoid neural death; then the dynamics evolves from the initial chaotic firing to a synchronous mode. At weak coupling this measure is chaotic, the neurons fire asynchronously and aperiodically, and the robot behaves almost randomly. Increased variance favors increased periodicity and


    Figure 4.2: NeuroPipe-Chip on MASPINN board

synchrony among the neurons: the average membrane potential flattens into a straight line. In my opinion, there is some confusion here between the coupling variance and the coupling strength.

The task is achieved by a simple controller: a single 23 MHz Motorola CPU with 512 KB of RAM and 512 KB of ROM. Having got the idea, let us move on to more powerful designs.

    4.2.2 MASPINN on NeuroPipe-Chip: A Digital Neuro-Processor

MASPINN is a pretty typical accelerator¹ of that time. It implements many concepts proposed in previous works. Based on the custom NeuroPipe chip, it simulates 10^6 neurons with up to 50 to 100 connections each, provided the network activity stays below 0.5 %, to enable video processing in real time. The common concepts (a small event-driven sketch in software follows the list) are:

• a spike event list that models the axon delays: it stores the spike source neurons along with the spike time². The fact that the next time slot is computed from the data of the previous one allows the network to be processed in parallel;

• a sender-oriented connectivity list (a map) that keeps the destination neurons for every source neuron along with the connection weight;

• tagging of dendrite potentials: the dendrite potentials that have decayed to zero and thus have no impact on the membrane are tagged with an ignore bit.
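The sketch below (a toy Python illustration with made-up sizes and weights, not the NeuroPipe microarchitecture) shows how these three concepts fit together: the spikes of the current slot are taken from an event list, fanned out through a sender-oriented connectivity map, and only the non-decayed (untagged) potentials are updated and thresholded.

    import numpy as np

    N, DECAY, V_TH = 1000, 0.9, 1.0
    rng = np.random.default_rng(4)

    # Sender-oriented connectivity: for each source neuron, its targets and weights.
    fanout = {i: (rng.choice(N, size=50, replace=False),
                  rng.uniform(0.1, 0.4, size=50)) for i in range(N)}

    potential = np.zeros(N)                  # dendrite/membrane potentials
    active = np.zeros(N, dtype=bool)         # inverted "ignore bit": worth updating
    events = list(rng.choice(N, size=10, replace=False))   # spike event list for slot 0

    for t in range(20):
        # Deliver all spikes of this slot through the connectivity list.
        for src in events:
            targets, weights = fanout[src]
            potential[targets] += weights
            active[targets] = True
        # Decay and threshold only the tagged (non-zero) potentials.
        idx = np.flatnonzero(active)
        potential[idx] *= DECAY
        fired = idx[potential[idx] >= V_TH]
        potential[fired] = 0.0
        active[potential < 1e-3] = False     # decayed away: set the ignore bit
        print(f"slot {t:2d}: {len(events):3d} spikes in, {len(fired):3d} fired")
        events = fired.tolist()              # event list for the next time slot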

¹ For instance, [22] is a very similar design.
² This is like VHDL simulation, but simpler.


    Number of    Alpha       SPIKE 128k      ParSpike        MASPINN
    neurons      500 MHz     FPGA, 10 MHz    64 DSPs,        NeuroPipe-Chip,
                                             100 MHz         100 MHz

    1K           0.56 ms     1 ms            1 ms            6.5 µs
    128K         67 ms       10 ms           1 ms            0.83 ms
    1M           650 ms      --              8 ms            6.5 ms

Table 4.1: Comparison of MASPINN

The chip supports programmable NN models by user code specifying the connections and how they contribute to the membrane potential. Implemented in 0.35 µm digital CMOS, running at 100 MHz and consuming 2 W, it shows a two-orders-of-magnitude improvement over a 500 MHz Alpha workstation and approaches the real-time requirements for spiking ANNs. It is also curious to note the 10-fold improvement over the FPGA and the competitiveness of the DSP-based designs with the custom-made chips.

    4.2.3 Analog VLSI for SNN

Similarities between biological neurons and continuous analogue technology are drawn in [23], highlighting that digital technology is incapable of simulating large parts of the brain in the foreseeable future. The authors propose a natural (analogue) computing VLSI chip architecture instead.

Yet, existing digital protocols are exploited for conducting the naturally discrete action potentials. Additionally, SRAM is used for weight storage instead of capacitors, to span the operation beyond a few milliseconds and to simplify weight adjustment. Moreover, a digital controller is proposed for developmental changes of the NN connectivity.

The prototype is a 256-neuron × 128-synapse chip implemented in a 0.18 µm CMOS process. The core of the architecture is a synaptic matrix. A neuron's axon drives a triangular voltage pulse along a horizontal line. A column of synapse nodes translates it into a current driven down the column, where a neuron membrane is located and accumulates the charge. Once the threshold voltage is exceeded, the neuron's comparator fires a pulse onto its respective axon. The architecture results in very small synapses, 15 identical NMOS gates each. This is important, since the synapses dominate the NN computation.

The weights are stored in 4-bit SRAM and adjusted by a digital STDP controller. The synapses are equipped with two capacitances on which the controller stores the pre- and post-synaptic events (pulse timestamps?). The correlation computation and the weight updates can be done sequentially because synaptic plasticity is a slow process: it takes minutes in biology, which corresponds to tens of milliseconds in the timescale of this chip, while the entire


Figure 4.3: SNN on analog VLSI. An axon drives a row; a column of synapses feeds a neuron at the column bottom. A neuron's spikes are converted into an axon current, which drives a row.

matrix can be updated in microseconds. The low network activity allows the matrix to be traversed even faster.

Analyzing the external interface needed to transmit the numbers of the spiking neurons, the authors estimate 2.6 · 10^7 spikes/s. The timing precision must be 150 ps (some 10 µs of biological time). Eight bits for the neuron number plus another eight bits for the time-synchronization sub-periods would require 52 MB/s. The 1.6 GB/s HyperTransport link proposed should sustain bursts of higher spiking rates and still leaves headroom for more chips. The spikes travel off-chip from the axon. They are priority-encoded to handle simultaneous spiking of multiple neurons; a conflict inflicts a delay error of at most 2.5 ns (250 µs of biological time). The reverse direction is similar: an off-chip source transmits the number of the row to pulse. Besides the spike transport, monitor amplifiers allow external monitoring of four neuron membranes at a time.
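As a quick check of that bandwidth figure (assuming the 16 bits, i.e. 2 bytes, per transmitted event just described):

    2.6 · 10^7 events/s × 2 B/event ≈ 5.2 · 10^7 B/s ≈ 52 MB/s,

which indeed leaves ample headroom on a 1.6 GB/s link.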

Please note that this implementation offers true synapse-parallelism. It closely mimics the biological system while being as simple as possible. By operating several chips in parallel, it should be feasible to build a system of 10 000 neurons. Running 10^5 times faster than real time, such a system allows testing many hypotheses for which the simulation time on a digital computer is too long. Besides physical modeling on a serious time and size scale, it corresponds to the continuous, non-Turing computations of the real neurons and supports the amazing model behind the BB: liquid computing.

4.3 Maass-Markram theory: WetWare in a Liquid Computer

To give an insight into the strength of spiking networks: one spiking neuron is more powerful than a sigmoidal NN with 412 hidden levels. This was recently proven by Wolfgang Maass, a mathematician and collaborator of Markram in the BB project. From his bibliography in theoretical computer


science, we see that Maass started by studying the hard, symbolic automata and Turing machines, then moved to analog and neural networks, approaching Markram's brain research. Together they have developed the liquid state machine (LSM), a kind of SNN, to be used in real-world applications. This is needed for the following reasons.

The authors point out that computer science lacks a universal model of the organization of computations in cortical microcircuits that are capable of carrying out potentially universal information processing tasks [24]. The universal computers adopted so far, the Turing machines and the attractor ANNs, are inapplicable because they process static discrete inputs, whereas the neural microcircuits carry out computations on continuous streams of inputs. The conventional computers keep the state in a number of bits and are intended for static analysis: you record the input data i and revise this history in order to compute the output at the current time t: o(t) = O(i_1, i_2, ..., i_t). This costs HW and time. However, in the real world there is no time to wait until a computation has converged: results are needed instantly (anytime computing) or within a short time window (real-time computing). The computations in common computational models are partitioned into discrete steps, each of which requires convergence to some stable internal state, whereas the dynamics of cortical microcircuits appears to be continuously changing (the only stable state is the dead one). The biological data suggest that cortical microcircuits may process many tasks in parallel, while most NN models are incompatible with this pattern-parallelism. Finally, the components of biological neural microcircuits, neurons and synapses, are highly diverse and exhibit complex dynamical responses on several temporal scales, which makes them completely unsuitable as building blocks of computational models that require simple uniform components, such as virtually all models inspired by computer science or ANNs. These observations motivated the authors to look for an alternative organization of computations, calling this a key challenge of neural modeling. The proposed framework is not just compatible with the aforementioned constraints; it requires them.

Every new neuron adds a degree of freedom to the network, making its dynamics very complicated. The conventional approaches are, therefore, to keep the (chaotic) high-dimensional dynamics under control or to work only with the stable (attractor) states of the system. This eliminates the inherent property of NNs to continuously absorb information about the inputs as a function of time, and it is this property that gives an idea for explaining how a continuous stream of multi-modal input from a rapidly changing environment can be processed by stereotypical recurrent integrate-and-fire neuron circuits in real time.

The authors look at the NN as a liquid whose dynamics (state) is perturbed by the inputs. All the temporal aspects of the input data are digested into the high-dimensional liquid state. The desired function output is read


Figure 4.4: Liquid state computing

out from the (literally) current state of the liquid by another NN. That is all: the high-dimensional dynamical system formed by the neural liquid serves as a universal source of information about past stimuli for the readout neurons, which extract the particular aspects needed for diverse tasks in real time.

    Owing to the fact that

1. The liquid is fixed: its connections and synaptic weights are randomly predefined; and

2. The only part that learns is the readout, which is memory-less (it relies only on the current state of the liquid, ignoring any previous states) and can thus be as simple as a 1-layer perceptron,

this approach dramatically simplifies the computation, resolving the complexity problem. Furthermore, one liquid reservoir of information may serve many readouts in parallel.

Like the Turing machine, the LSM is based on a rigorous mathematical framework that proves its universal computational power. However, unlike the inherently sequential Turing machines, which process static discrete inputs off-line, LSMs are not based on stable states and algorithms, presenting the biologically more relevant case of real-time computing on continuous input streams. The analysis shows that NNs are ideal liquids, as opposed to a coffee cup, for instance. Although [25], who introduces the term super-Turing, is skeptical about this, Dr. H. Jaeger has independently discovered the echo state networks, which share the same reservoir computing concept [26] while implementing nonlinear filters in a simple and computationally efficient way.

The liquid carries all the complexity. The readouts are made single-layer for trivial training. Such readouts are unable to solve linearly non-separable problems and thus 1) are sometimes called linear classifiers and 2) linearly non-separable problems are good benchmarks for checking the liquid quality. As with the balance between order and chaos above, the liquid can be more or less useful. It can be too stable (ordered), disregarding all the inputs, or, at the other extreme, chaotic, so that the current input overwrites all the memory. The optimum lies in between the under- and over-sensitive responses to the


Figure 4.5: Hard Liquid. (a) A structural block. (b) The interpolated Memory Capacity for different weight distributions (the points); the largest 3 distributions are highlighted.

    inputs.

    4.3.1 The Hard Liquid

The authors of the hybrid VLSI above elaborate their chip for LSM applicability [27]: the network size and the technology (the analog integration of current in the synapses plus digital signaling) are retained, but McCulloch-Pitts neurons (step activation function) are used instead of spiking ones, and the 11-bit nominal weight storage is made capacitive. Implemented in a 0.35 µm CMOS process, the full network can be refreshed in 200 µs. The speed is I/O-limited, while the core allows for 20 times faster operation.

The network operates in a discrete-time update scheme, i.e. all the neuron outputs are calculated once per network cycle. The 256-neuron network is partitioned into four blocks: the 128 synapses of every neuron are driven by axons incoming from all four blocks and from the network inputs. The block-internal connections can be arbitrary, whereas the inter-block connections are hardwired.

Following the Maass terminology, the ASIC chip represents the liquid

acting as a non-linear filter upon the input. The ASIC response at a certain time step is called the liquid state x(t). The reconfigurability of the ANN ASIC used allows exploring the qualities of physically different liquids. The liquids are generated at random by drawing the weights from a zero-centered Gaussian distribution, governed by the number of neurons N, the number of incoming connections k per neuron and the variance σ². The readouts (linear classifiers) are implemented in SW: v(t) = Σ_i w_i · x_i(t), where the weights w_i are determined by a least-squares linear regression against the desired values y(t). The resulting machine performance is evaluated on the linearly non-separable problem of 3-bit parity in time, using two theoretical-informatics measures: 1) the mutual information MI between y and v at a given time step t, and 2) the sum of MI over the preceding time steps, which is the memory capacity MC, assessing the capability to account for the preceding inputs.
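The same experiment is easy to mimic in software. The sketch below is only an analogy under assumed parameters: a random, fixed recurrent network (tanh units stand in for the chip's neurons), a binary input stream, and a memory-less linear readout fitted by least squares to the delayed 3-bit parity of the input. With a reasonably tuned weight variance it typically scores well above the 50 % chance level.

    import numpy as np

    rng = np.random.default_rng(5)
    N, K, SIGMA, T = 256, 16, 0.2, 4000       # neurons, fan-in, weight std, time steps

    # Random fixed "liquid": sparse recurrent weights from a zero-centered Gaussian.
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)
        W[i, idx] = rng.normal(0.0, SIGMA, size=K)
    w_in = rng.normal(0.0, 1.0, size=N)

    u = rng.integers(0, 2, size=T)            # random binary input stream
    x = np.zeros(N)
    states = np.empty((T, N))
    for t in range(T):                        # drive the liquid
        x = np.tanh(W @ x + w_in * u[t])
        states[t] = x

    # Target: 3-bit parity over the last three inputs (linearly non-separable in u).
    y = np.array([u[t] ^ u[t - 1] ^ u[t - 2] for t in range(T)])

    # Memory-less linear readout trained by least squares on the first half.
    half = T // 2
    Phi = np.hstack([states, np.ones((T, 1))])
    w_out, *_ = np.linalg.lstsq(Phi[:half], y[:half], rcond=None)
    pred = (Phi[half:] @ w_out) > 0.5
    print("parity accuracy on held-out steps:", (pred == y[half:].astype(bool)).mean())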


Notably, the liquid's major quality, its ability to serve as memory storage, is measured in bits. At every iteration of the generation parameter sweep (a dot in fig. 4.5(b)), a number of liquids were generated and readouts were trained for the same function. The average MC distinctly peaks along a hyperbolic band. This band marks a sharp transition from the ordered dynamics (the area below) to the chaotic behavior (above). To estimate the ability to support multiple functions, multiple linear classifiers were tried on the same liquid. The mean MI shows that the critical dynamics yields a generic (readout-independent) liquid.

The experiments reproduce the earlier published theoretical and simulation results showing that the linear classifiers can be successful when the liquid exhibits the critical dynamics between order and chaos. The experiments with this general-purpose ANN ASIC allow exploring the necessary connectivity and accuracy of future hardware implementations. The next step planned is to use part of the ASIC area to realize the readout. Such an LSM will be able to operate in real time on continuous data streams.

    4.4 The Blue Brain

    The first stage is to try the novel biologically-realistic simulation on theBG, a typical general-purpose supercomputer.

4.4.1 Blue Gene

Application-driven design approach

Financed by taxpayers under the pretext of the American nuclear program, the IBM Corporation's designers of this machine claim to bridge the gap between the cost/performance of existing supercomputers and application-specific machines [28], making it as cheap as a cluster solution. This objective meshes nicely with the additional goals of achieving exceptional performance/power and performance/space ratios. The key enabler of the BG/L family is low-power design: the machine is made of low-frequency, low-power IBM PowerPC chips.

To overcome the challenge of achieving good performance with many processors of moderate frequency, the innovations were restricted to scalability enhancements at little cost, and the options were evaluated on selected classes of representative applications. The machine is announced as the first computer dedicated to science, with DNA and protein folding simulation as the primary goal.


    Usability

The networks were designed with extreme scaling in mind. They support short messages (as small as 32 bytes) and HW collective operations (broadcast, reduction, barriers, interrupts, etc.).

Developing the machine at the ASIC level allowed integrating the reliability, availability and serviceability (RAS) functions into the single-chip nodes, so that the machine stays reliable and usable even at extreme scales. This feature is crucial, since without it the probability of the machine even starting approaches zero as the number of nodes grows. In contrast, clusters typically do not possess this goodness at all.

The full potential cannot be unlocked without system SW, standard libraries and performance monitoring tools. Though BG/L was designed to support both the distributed-memory and the message-passing programming models efficiently, the architecture is tuned for the dominant MPI interface. From the user perspective, BG/L appears as a network of up to 2^16 compute nodes, but this is not an architectural limit. Every 1024 nodes are assembled into a rack consuming 0.9² · 1.9 m³ of space and 27.5 kW of power.

    Nodes and Networks

Every compute node is a 130 nm SoC ASIC containing two PPC440 cores. The cores share a 4 MB L3 DRAM cache and 512 MB of main memory. It is interesting that (L2 = 2 kB) < (L1 = 32 kB). For our purposes it is also worth pointing out that every core has two double-precision (64-bit) FPUs. Running at 700 MHz, the two cores of a node jointly deliver 5.6 GFlops at peak and 77 % of that in benchmarks.
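That peak figure follows directly from the stated resources (assuming, as such peak numbers usually do, one fused multiply-add, i.e. 2 flops, per FPU per cycle):

    2 cores × 2 FPUs × 2 flops/cycle × 0.7 GHz = 5.6 GFlops per node.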

The compute nodes are interconnected through five networks, the major of which is a symmetrical 64 × 32 × 32 3D torus. Each node therefore has six independent bidirectional neighbor links. The signaling rate and the latency of a link are 1.4 Gb/s and 100 ns, respectively. Symmetry means that the links have the same bandwidth and almost the same latency regardless of the physical distance, whether the nodes are located close by on the same board or on a neighboring rack (a rack accommodates 85 % of the interconnections). The maximal network distance is, therefore, 32 + 16 + 16 = 64 hops, and the aggregate bandwidth is 2^16 × 2.1 GB/s ≈ 138 TB/s.

Aggregated into a Gigabit Ethernet network, the I/O nodes supply the external parallel file system interface. The number of I/O nodes is configurable, with a maximum I/O-to-compute-node ratio of 1:8.

Two other interconnects are the collective and the barrier networks. Combining the nodes into trees, they are useful for arithmetic reductions and for broadcasting the global result from the root back to the nodes.

Finally, the control system networks are the various networks, such as I2C and JTAG, used to initialize, monitor and control all the registers of the nodes, the


temperature sensors, power supplies, clock trees, etc.: more than 250 000 endpoints for a 64k-node machine. A 100 Mb Ethernet connects them to the host. A partition mechanism based on link nodes enables each user to have a dedicated set of nodes. The same mechanism also isolates any faulty nodes (once a fault is isolated, the program restarts from the last RAS checkpoint).

    4.4.2 Brain simulation on BG

EPFL has its own neurosimulation computer lab, which produces special HW, the MANTRA presented above. The BB team does not explain why they chose an inefficient general-purpose computer. Though they claim that the computer was developed with their project in mind, BG does not confirm this in its list of design reference applications. Indeed, the broadcast networks and the reduced power of the nodes seem to be in line with neuroprocessing and with the neural models teaching us to compute by myriads of elementary processors. However, an orientation towards Flops and a MIMD architecture are considered improper in ANN. Yet, the diversity of biologically realistic neurons may well require the irregularity of MIMD nodes. We have to conclude that the existing neurocomputers are not flexible enough to meet the complexity of the BB model³.

There are no links on the official site to the simulation laboratory. Yet, I have encountered a certain Goodman Lab⁴ that develops an MPI-based neocortical simulator similar to the PCSIM launched by Maass, has H. Markram on its list and reports progress with Blue Gene! Here is a report issued in 2005⁵.

It describes the tests of their NeoCortical Simulator (NCS) ported to a 1024-CPU BG. The neurons are connected at random (the connection probability drops with the distance) and the spike activity is observed. The network size is synapse-memory-limited: 512 MByte per CPU allows networks of 5 billion synapses. This means 100 bytes per synapse, which strikingly contrasts with all the neurocomputers presented above, which limit the weights to below 2 bytes. Trying up to 2500 neurons with 676 Msynapse networks, the spikes-per-CPU measure shows near-linear scalability, which significantly drops in the 1024-processor mode, probably due to the extra communication

workload.

Previously, NCS was running on a Beowulf cluster, each computer of which is treated as a 2-CPU, 4 GB node (the memory stores the synapses and thus bounds the network size to 10^9 synapses⁶). Surprisingly, they call this to

³ http://brain.cs.unr.edu/publications/gwm.largescalecortex 01.pdf and [29] explain that it is infeasible to characterize the fine-grain connectionism without large-scale modeling on coarse-grain supercomputers.
⁴ http://brain.cs.unr.edu/
⁵ http://brain.cs.unr.edu/publications/NCS BlugGene report 07Nov04.pdf
⁶ In the previous work http://brain.cs.unr.edu/publications/hbfkbgk hardware 02.pdf


utilize a very fine grain parallelism. It is also surprising that this cluster outperforms the BG! The staff guess that, besides the 3× slower CPUs, 1) the Beowulf's Myrinet is better than the supercomputer's 3D torus; and 2) the NN distribution is optimized for the cluster. The BG profiling tools show that an order-of-magnitude SW performance improvement is possible.

Now, I see that Maass builds a similar simulator. All this suggests that BB is just a branch of this project. Indeed, [30] confirms that BB runs the Goodman simulator. As of 2008, BB reports that the 8k-CPU BG has fulfilled the goal: a rat's cortical column has been recreated.

4.5 Conclusions

In the last decade, attention has switched to the more realistic spiking NNs, which are theoretically much more powerful than the conventional ones. The topology is as important to the capacity of the network as its size: the optimal quality was found at the edge between order and chaos. Exploiting the local connectivity along with the low network activity, in the form of event lists and the disabling of decayed dendrites, event-driven neurocomputers made of custom digital chips may deliver almost any simulation performance.

Yet, looking at the semiconductor roadmap, there is an enormous gap between the digital performance and the requirements of simulating large parts of the brain, a gap that cannot be bridged by digital VLSI. The digital computer is Turing-paradigm-based: it is i