computing beyond a million processors - bio-inspired massively-parallel architectures


ACACES 12 July 2009 1

Computing beyond a Million Processors

- bio-inspired massively-parallel architectures

Steve Furber, The University of Manchester, [email protected]

SBF is supported by a Royal Society-Wolfson Research Merit Award

Andrew Brown, The University of Southampton, [email protected]


Outline

• Computer Architecture Perspective
• Building Brains
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions


Multi-core CPUs
• High-end uniprocessors
  – diminishing returns from complexity
  – wire vs transistor delays
• Multi-core processors
  – cut-and-paste
  – simple way to deliver more MIPS
• Moore's Law
  – more transistors
  – more cores
… but what about the software?


Multi-core CPUs
• General-purpose parallelization
  – an unsolved problem
  – the 'Holy Grail' of computer science for half a century?
  – but imperative in the many-core world
• Once solved
  – few complex cores, or many simple cores?
  – simple cores win hands-down on power-efficiency!


Back to the future

• Imagine…
  – a limitless supply of (free) processors
  – load-balancing is irrelevant
  – all that matters is:
    • the energy used to perform a computation
    • formulating the problem to avoid synchronisation
    • abandoning determinism
• How might such systems work?


Bio-inspiration

• How can massively parallel computing resources accelerate our understanding of brain function?

• How can our growing understanding of brain function point the way to more efficient parallel, fault-tolerant computation?


Outline

• Computer Architecture Perspective
• Building Brains
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions


Building brains

• Brains demonstrate
  – massive parallelism (10¹² neurons)
  – massive connectivity (10¹⁵ synapses)
  – excellent power-efficiency
    • much better than today's microchips
  – low-performance components (~ 100 Hz)
  – low-speed communication (~ metres/sec)
  – adaptivity – tolerant of component failure
  – autonomous learning


Neurons
• Multiple inputs (dendrites)
• Single output (axon)
  – digital "spike"
  – fires at 10s to 100s of Hz
  – output connects to many targets
• Synapse at input/output connection
(www.ship.edu/~cgboeree/theneuron.html)

[Figure: membrane-potential spike traces, −80 to 40 mV over 0–200 ms]


Neurons

• A flexible biological control component
  – very simple animals have a handful
  – bees: 850,000
  – humans: 10¹²
(photo courtesy of the Brain Mind Institute, EPFL)


Neurons
• Regular high-level structure
  – e.g. 6-level cortical microarchitecture
    • low-level vision, to
    • language, etc.
• Random low-level structure
  – adapts over time
(faculty.washington.edu/rhevner/Miscellany.html)


Neural Computation

• To compute we need:
  – Processing
  – Communication
  – Storage
• Processing: abstract model
  – linear sum of weighted inputs
    • ignores non-linear processes in dendrites
  – non-linear output function
  – learn by adjusting synaptic weights
[Figure: neuron model — inputs x1…x4 with weights w1…w4, output y = f(Σ_i w_i x_i)]
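The abstract model above — a linear sum of weighted inputs passed through a non-linear output function — can be sketched in a few lines of Python. The sigmoid chosen for f and the example weight values are illustrative assumptions; the slide only requires that f be non-linear.

```python
import math

def neuron_output(weights, inputs):
    """Abstract neuron: y = f(sum_i w_i * x_i), with a sigmoid f."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-activation))  # non-linear output function

# Four weighted inputs, as in the diagram (values are made up)
y = neuron_output([0.5, -0.25, 1.0, 0.1], [1.0, 0.0, 1.0, 1.0])
```

Learning by adjusting synaptic weights then amounts to changing the entries of `weights`.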


Processing
• Leaky integrate-and-fire model
  – inputs are a series of spikes: x_k(t) = Σ_j δ(t − t_j)
  – total input is a weighted sum of the spikes: I_i = Σ_k w_ik x_k
  – neuron activation is the input with a "leaky" decay: dA_i/dt = I_i − A_i/τ
  – when activation exceeds threshold, output fires: if A_i ≥ θ, fire and set A_i = 0
  – habituation, refractory period, …?
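A minimal discrete-time sketch of the leaky integrate-and-fire rule above. The time constant, threshold, and constant drive current are illustrative assumptions, not SpiNNaker parameters.

```python
def simulate_lif(input_current, dt=1.0, tau=20.0, threshold=1.0):
    """Leaky integrate-and-fire: return the steps at which the neuron fires."""
    A = 0.0                            # neuron activation
    spikes = []
    for t, I in enumerate(input_current):
        A += dt * (I - A / tau)        # integrate input with "leaky" decay
        if A >= threshold:             # threshold crossed:
            spikes.append(t)           #   output fires...
            A = 0.0                    #   ...and activation is reset
    return spikes

spike_times = simulate_lif([0.1] * 100)   # constant drive -> regular firing
```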


Processing
• Izhikevich model
  – two variables, one fast, one slow:
    v' = 0.04v² + 5v + 140 − u + I
    u' = a(bv − u)
  – neuron fires when v > 30; then:
    v ← c, u ← u + d
  – a, b, c & d select behaviour
(www.izhikevich.com)

[Figure: Izhikevich model traces — membrane potential v (−80 to 40) and recovery variable u (−14 to 2) over 0–200 ms]
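The two-variable Izhikevich update is easy to simulate directly. This sketch uses Euler integration with two 0.5 ms substeps (as in Izhikevich's published reference code) and the "regular spiking" parameter set a=0.02, b=0.2, c=−65, d=8; the constant input current is an illustrative assumption.

```python
def simulate_izhikevich(steps=200, I=10.0, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Return (v trace, spike times) for one Izhikevich neuron, dt = 1 ms."""
    v = -65.0          # fast variable: membrane potential
    u = b * v          # slow variable: membrane recovery
    trace, spikes = [], []
    for t in range(steps):
        for _ in range(2):                                # two 0.5 ms substeps
            v += 0.5 * (0.04 * v * v + 5 * v + 140 - u + I)
            u += 0.5 * a * (b * v - u)
        if v >= 30.0:                                     # spike: reset
            spikes.append(t)
            v, u = c, u + d
        trace.append(v)
    return trace, spikes

trace, spikes = simulate_izhikevich()
```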


Communication

• Spikes
  – biological neurons communicate principally via 'spike' events
  – asynchronous
  – information is only:
    • which neuron fires, and
    • when it fires

[Figure: spike trains — membrane potential, −80 to 40 mV over 0–200 ms]


Storage

• Synaptic weights
  – stable over long periods of time
    • with diverse decay properties?
  – adaptive, with diverse rules
    • Hebbian, anti-Hebbian, LTP, LTD, ...
• Axon 'delay lines'
• Neuron dynamics
  – multiple time constants
• Dynamic network states
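Of the adaptive rules listed above, the simplest is the Hebbian update, where a weight grows with correlated pre- and post-synaptic activity (and an anti-Hebbian rule flips the sign). A minimal sketch, with an assumed learning rate:

```python
def hebbian_update(w, pre, post, rate=0.01, anti=False):
    """One Hebbian (or anti-Hebbian) weight update for a pre/post pairing."""
    dw = rate * pre * post             # grows with correlated activity
    return w - dw if anti else w + dw

w = hebbian_update(0.5, pre=1.0, post=1.0)   # correlated firing strengthens w
```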


Outline

• Building Brains
• Computer Architecture Perspective
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions


The Good News...
[Figure: transistors per Intel chip (millions, log scale) vs year, 1970–2000 — 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, Pentium II, Pentium III, Pentium 4]


...and the Bad News
• Device variability
• Component failure
[Figure: inverter voltage-transfer curves, Vout2 (V) vs Vout1 (V), 0–1 V]


Atomic-scale devices
• The simulation paradigm now
• A 22 nm MOSFET: in production 2008
• A 4.2 nm MOSFET: in production 2023


A view from Intel

• The Good News:
  – we will have 100-billion-transistor ICs
• The Bad News:
  – billions will fail in manufacture
    • unusable due to parameter variations
  – billions more will fail over the first year of operation
    • intermittent and permanent faults
(Shekhar Borkar, Intel Fellow)


A view from Intel

• Conclusions:
  – one-time production test will be out
  – burn-in to catch infant mortality will be impractical
  – test hardware will be an integral part of the design
  – dynamically self-test, detect errors, reconfigure, adapt, ...
(Shekhar Borkar, Intel Fellow)


Outline

• Building Brains
• Computer Architecture Perspective
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions


Design principles

• Virtualised topology
  – physical and logical connectivity are decoupled
• Bounded asynchrony
  – time models itself
• Energy frugality
  – processors are free
  – the real cost of computation is energy


Outline

• Building Brains
• Computer Architecture Perspective
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions


SpiNNaker project

• Multi-core CPU node
  – 20 ARM968 processors
  – to model large-scale systems of spiking neurons
• Scalable up to systems with 10,000s of nodes
  – over a million processors
  – >10⁸ MIPS total
• Power ~ 25 mW/neuron


SpiNNaker project


SpiNNaker project
• Fault-tolerant architecture for large-scale neural modelling
• A billion neurons in real time
• A step-function increase in the scale of neural computation
• Cost- and energy-efficient


SpiNNaker system


CMP node


ARM968 subsystem


GALS organization

• clocked IP blocks
• self-timed interconnect
• self-timed inter-chip links


Outline

• Building Brains
• Computer Architecture Perspective
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions


Circuit-level concurrency

• Delay-insensitive comms
  – 3-of-6 RTZ on chip
  – 2-of-7 NRZ off chip
• Deadlock resistance
  – Tx & Rx circuits have high deadlock immunity
  – Tx & Rx can be reset independently
    • each injects a token at reset
    • true transition detector filters surplus token
[Figure: Tx→Rx link — data and ack wires, din (2-phase), dout (4-phase), ¬reset, ¬ack]
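In a k-of-n delay-insensitive code a symbol is signalled by asserting exactly k of n wires, so the receiver detects a complete symbol without any timing assumption. Counting the codewords shows why the two codes above fit: C(6,3) = 20 and C(7,2) = 21 both cover 16 data symbols (4 bits) with codewords to spare for control. How SpiNNaker allocates the spare codewords is not covered here; this sketch just enumerates the code space.

```python
from itertools import combinations

def k_of_n_codewords(k, n):
    """All n-wire patterns with exactly k wires asserted."""
    return [frozenset(c) for c in combinations(range(n), k)]

on_chip  = k_of_n_codewords(3, 6)   # 3-of-6 RTZ: 20 codewords
off_chip = k_of_n_codewords(2, 7)   # 2-of-7 NRZ: 21 codewords
```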


System-level concurrency

• Breaking symmetry
  – any processor can be Monitor Processor
    • local 'election' on each chip, after self-test
  – all nodes are identical at start-up
    • addresses are computed relative to node with host connection (0,0)
  – system initialised using flood-fill
    • nearest-neighbour packet type
    • boot time (almost) independent of system scale


Application-level concurrency

• Event-driven real-time software
  – spike packet arrived
    • initiate DMA: fetch_Synaptic_Data()
  – DMA of synaptic data completed
    • process inputs
    • insert axonal delay
  – 1 ms Timer interrupt
    • update_Neurons(); update_Stimulus();
  – sleeping until next event: goto_Sleep()
[Figure: interrupt priorities — Packet Received, DMA Completion, and Timer Millisecond interrupts at Priorities 1–3]
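The event-driven structure above can be caricatured as a priority dispatcher: the core sleeps until events are pending, then services the highest-priority one first. The handler names mirror the slide; the priority numbering and queue mechanics are illustrative assumptions.

```python
import heapq

log = []   # records the order in which handlers run

def fetch_synaptic_data(): log.append("dma_started")       # spike packet arrived
def process_inputs():      log.append("inputs_processed")  # DMA completed
def update_neurons():      log.append("neurons_updated")   # 1 ms timer tick

# Lower number = higher priority (assumed mapping of the slide's three levels)
HANDLERS = {1: fetch_synaptic_data, 2: process_inputs, 3: update_neurons}

def run(pending):
    """Service all pending events in priority order, then 'sleep'."""
    heapq.heapify(pending)
    while pending:
        HANDLERS[heapq.heappop(pending)]()

run([3, 1, 2])   # events arrive out of order; handlers run by priority
```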


Application-level concurrency

• Cross-system delay << 1 ms
  – hardware routing
  – 'emergency' routing
    • failed links
    • congestion
  – if all else fails
    • drop packet


Biological concurrency
• Firing rate population codes
  – N neurons
  – diverse tuning
  – collective coding of a physical parameter
  – accuracy
  – robust to neuron failure
(Neural Engineering, Eliasmith & Anderson 2003)
[Figure: tuning curves for a population of N neurons — firing rate (0–500) vs parameter (−1 to 1)]


Biological concurrency
• Single spike/neuron codes
  – choose N to fire from a population of M
  – order of firing may or may not matter

                       Unordered N-of-M   Ordered N-of-M   M-bit binary
Number of codes        C(M,N)             M!/(M−N)!        2^M
e.g. M=100, N=20       ~10²¹              ~10³⁹            ~10³⁰
e.g. M=1000, N=200     ~10²¹⁶             ~10⁵⁹¹           ~10³⁰¹
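The table's entries can be reproduced with exact combinatorics (log₁₀ sizes shown, to match the ~10ⁿ entries):

```python
from math import comb, perm, log10

def code_space(M, N):
    """log10 of the number of codes under each scheme in the table."""
    return {
        "unordered_N_of_M": log10(comb(M, N)),   # C(M, N)
        "ordered_N_of_M":   log10(perm(M, N)),   # M! / (M - N)!
        "M_bit_binary":     M * log10(2),        # 2^M
    }

sizes = code_space(100, 20)   # roughly 10^21, 10^39, 10^30, as in the table
```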


Outline

• Building Brains
• Computer Architecture Perspective
• Living with Failure
• Design Principles
• The SpiNNaker system
• Concurrency
• Conclusions


Software progress

• ARM SoC Designer SystemC model
• 4-chip × 2-CPU top-level Verilog model
• Running:
  – boot code
  – Izhikevich model
  – PDP2 codes
• Basic configuration flow
• …it all works!


Where might this lead?

• Robots
  – iCub EU project
  – open humanoid robot platform
  – mechanics, but no brain!


Conclusions

• Many-core processing is coming
  – soon we will have far more processors than we can program
• When (if?) we crack parallelism…
  – more small processors are better than fewer large processors
  – synchronization, coherent global memory, and determinism are all impediments
• Biology suggests a way forward!
  – but we need new theories of biological concurrency


UoM SpiNNaker team