
Page 1: Lecture 1: Feature extraction

Part 8: Neural networks

EEEN60178: Digital Image Engineering Hujun Yin

Cross-section of retina, drawn by Santiago Ramón y Cajal (1853-1934).

Page 2: Lecture 1: Feature extraction

Part 8: Neural networks

• Perception & cognition
– Retina & visual pathway
– Artificial intelligence

• Neurons & models

• Learning paradigms & network architectures

• Multilayer perceptrons

• Self-organising networks

• Learning & generalisation

• Genetic algorithm

Page 3: Lecture 1: Feature extraction

8.1 Perception & cognition

• Perception & cognition

Tree

Page 4: Lecture 1: Feature extraction

8.1 Perception & cognition

• Evolution

“Evolution is fascinating to watch. To me it is the most interesting when one can observe the evolution of a single man.” - Shana Alexander

Page 5: Lecture 1: Feature extraction

8.1 Perception & cognition

• Evolution of eyes

Photosensory cell Receptor pigment Cup eye

Pinhole eye Single chambered (lens) eyes

Compound eyes have good time-resolving power. Human eyes need 0.05 second to identify objects, while compound eyes need only 0.01 second.

Page 6: Lecture 1: Feature extraction

8.1 Perception & cognition

• Evolution of eyes

[Figure: the Limulus (horseshoe crab): head, abdomen, tail, and compound eyes composed of ommatidia.]

Hartline et al. (1960s) verified receptive fields and lateral inhibition in Limulus eyes.

Page 7: Lecture 1: Feature extraction

8.1 Perception & cognition

• Eyes

Lateral inhibition: to extract object boundaries!

Mach band effect

Page 8: Lecture 1: Feature extraction

8.1 Perception & cognition

• A system point of view

[Figure: the human visual system (HVS) viewed as a system: input intensity and input spectrum (magnitude vs spatial frequency) pass through the HVS, characterised by its modulation transfer function (MTF, relative sensitivity vs spatial frequency), giving the output intensity and output spectrum over spatial position.]

Page 9: Lecture 1: Feature extraction

8.1 Perception & cognition

• Human eye

Cones: 6-7 million

Red (64%)

Green (32%)

Blue (2%)

Rods: 120 million

Dynamic range (brightest to darkest): 20 billion times (i.e. 206 dB)

Cameras: 8-16 bits

Page 10: Lecture 1: Feature extraction

8.1 Perception & cognition

• Retina

Page 11: Lecture 1: Feature extraction

8.1 Perception & cognition

• Visual pathway

There are about 100 million photosensitive cells in the human retina, but only about 1 million optic nerve fibres connecting the retina to the cortex.

Page 12: Lecture 1: Feature extraction

8.1 Perception & cognition

• Visual cortex

• Many regions have predetermined responses to visual stimuli.

• Sensory experience from the external world can influence how the brain wires itself up after birth.

V1, V2; V3: orientation; V4: colour; V5: motion; V6: depth

Page 13: Lecture 1: Feature extraction

8.1 Perception & cognition

• Cortex

Page 14: Lecture 1: Feature extraction

8.1 Perception & cognition

• Perception

Einstein and ??

- from Kelso (1997)

Page 15: Lecture 1: Feature extraction

8.1 Perception & cognition

• Perception

Rubin's vase

Kanizsa triangle

Page 16: Lecture 1: Feature extraction

8.1 Perception & cognition

• Cognition

Cognition is a result of a dynamic pattern-forming process in the brain, in which brain activities and behaviours appear to self-organise into coherent states or patterns; self-organisation provides a paradigm for behaviour and cognition, as well as for the structure and function of the nervous system. - from Kelso (1997)

Page 17: Lecture 1: Feature extraction

8.1 Perception & cognition

• Cognition

The human brain has trillions of cells, large enough [as claimed] for holding all the events we have ever experienced, but why can't we remember all of them in detail? Pinker believes that our brains are compelled to organise: without categories, mental life would be chaos. Categorisations allow the inference that objects in the same category have the same or similar properties. - from Pinker (1997)

Page 18: Lecture 1: Feature extraction

8.1 Perception & cognition

• Cognition

Psychology

Behaviourism

Neurobiology

Cognitive Science: Computationism, Connectionism

Page 19: Lecture 1: Feature extraction

8.1 Perception & cognition

• Artificial system

Cognition is a complex information processing process (Marr 1982)

Page 20: Lecture 1: Feature extraction

8.1 Perception & cognition

• Computing history

Ancient China: the abacus

Blaise Pascal 1642 Pascaline or Adding Machine

Joseph Jacquard 1804 Punch card controlled loom

Charles Babbage 1823 Difference Engine

Augusta Ada 1836 Programming

Herman Hollerith 1880 Data Processing Devices (IBM)

UK team 1942 Colossus

USA team 1946 ENIAC

UK teams 1949 EDSAC and Manchester Mark I

Page 21: Lecture 1: Feature extraction

8.1 Perception & cognition

• Artificial intelligence

Alan Turing: Turing Machine,

Universal Turing Machine

Turing Test (Can Machines Think?)

Top-down: Representation/Symbols

Reasoning/Rules/Grammars

Learning/Logical Operations

Page 22: Lecture 1: Feature extraction

Turing test

"Can machines think?" A seminal paper, Computing Machinery and Intelligence, Alan Turing published in 1950 in Mind.

Page 23: Lecture 1: Feature extraction

Singularity?

Page 24: Lecture 1: Feature extraction

8.2 Neurons

• Models of neural networks

(Artificial) Neural Networks Approach (bottom-up approach)

° NNs started from findings and discoveries in neurobiology and neuroscience. It has gradually been realised that human brains bear little resemblance to von Neumann type computers.

° The human brain has about 100 billion interconnected neurons. Each is a cell that uses biochemical reactions to receive, process and transmit stimuli. Each neuron is a simple processing unit, which receives (through dendrites and synapses) stimuli (input) from roughly 1000 neighbouring neurons (via their axons), accumulates these stimuli (in the soma) and fires through its axon to its output neurons if the aggregated input is over a threshold (at the axon hillock). Networks of these cells form the basis of information processing.

Page 25: Lecture 1: Feature extraction

8.2 Neurons

• Biological neurons

~100 billion neurons

The brain has ~100 billion interconnected neurons, each connecting to ~1000 neighbouring neurons.

Page 26: Lecture 1: Feature extraction

8.2 Neurons

• Biological neurons

[Figure: a synapse (the brain has 60-100 trillion synapses): an action potential arrives at the axon terminal; protein (receptor) molecules and ion channels for potassium, chloride and sodium are shown.]

The axon's action potential opens ion channels, and the resulting Ca2+ influx leads to the release of neurotransmitter, causing ion-conducting channels on the postsynaptic membrane to open. Synapses can have either an excitatory (depolarising) or an inhibitory (hyperpolarising) effect on postsynaptic neurons.

Page 27: Lecture 1: Feature extraction

8.2 Neurons

• Biological neurons

Page 28: Lecture 1: Feature extraction

8.2 Neurons

• Models of neurons

Model of a neuron: input signals x1, x2, …, xn are multiplied by synaptic weights w1, w2, …, wn, summed together with a bias b (adder), and passed through an activation function φ(·) to give the output:

$$y=\varphi(\mathbf{x}^T\mathbf{w}+b)$$

Typical activation functions φ(v) are the hard limiter (threshold) and the sigmoid. With a hard-limiting activation this model is the perceptron.
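A minimal Python sketch of this neuron model (the function name and the particular choice of activations are illustrative, not from the slides):

```python
import numpy as np

def neuron_output(x, w, b, activation="step"):
    """Single neuron: weighted sum of inputs plus bias, passed through an activation."""
    v = np.dot(x, w) + b                 # adder: v = x^T w + b
    if activation == "step":             # hard limiter (threshold)
        return 1.0 if v >= 0 else 0.0
    return 1.0 / (1.0 + np.exp(-v))      # logistic sigmoid

# Example: two inputs with fixed weights and bias
print(neuron_output(np.array([1.0, 0.5]), np.array([0.4, -0.2]), b=-0.1))
```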

Page 29: Lecture 1: Feature extraction

8.2 Neurons

• Models of neural networks

Multilayer Perceptron (MLP)

[Figure: multilayer perceptron with input layer x1, …, xn, hidden layer v1, …, v_nh and output layer y1, y2, y3, with weights between successive layers.]

Page 30: Lecture 1: Feature extraction

8.2 Neurons

• Artificial neural networks: a history

McCulloch and Pitts 1943: Linear Threshold Gate (LTG)

Wiener 1948: “Cybernetics”

Hebb 1949: “Organisation of Behaviour”

Rosenblatt 1958: Perceptron

Minsky and Papert 1969: “Perceptrons”

Werbos 1974: Back-propagation algorithm

Hopfield 1982: Hopfield Recurrent Network

Kohonen 1982: Self-Organising Map

Rumelhart et al. 1986: Multilayer Perceptron

Page 31: Lecture 1: Feature extraction

Hopfield net

Step 1: Assign Connection Weights

tij is the connection weight from node i to node j, and xi^s (which can be +1 or -1) is element i of the exemplar pattern for class s.

Step 2: Initialize with Unknown Input Pattern

µi(0) = xi, 0 ≤ i ≤ N-1,

µi(t) is the output of node i at time t and xi is the element i of

the input pattern.

Step 3: Iterate Until Convergence

f is an activation function such as a hard limiter. The process is repeated until the node outputs remain unchanged with further iterations. The node outputs then represent the exemplar pattern that best matches the input.

Step 4: Repeat by Going to Step 2

Weight assignment (Step 1):

$$t_{ij}=\begin{cases}\sum_{s=1}^{M} x_i^{s}\,x_j^{s}, & i\neq j\\[2pt] 0, & i=j\end{cases}\qquad 0\le i,j\le N-1$$

Iteration (Step 3):

$$\mu_j(t+1)=f\!\left[\sum_{i=0}^{N-1} t_{ij}\,\mu_i(t)\right],\qquad 0\le j\le N-1$$
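A small Python sketch of these steps, assuming bipolar (+1/-1) patterns and synchronous updates (the helper names are illustrative):

```python
import numpy as np

def hopfield_weights(patterns):
    """Step 1: Hebbian weight assignment t_ij = sum_s x_i^s x_j^s, zero diagonal."""
    X = np.asarray(patterns, dtype=float)        # rows are +/-1 exemplar patterns
    T = X.T @ X
    np.fill_diagonal(T, 0.0)
    return T

def hopfield_recall(T, x, max_iters=100):
    """Steps 2-3: initialise with the unknown pattern, iterate until outputs stop changing."""
    mu = np.asarray(x, dtype=float)
    for _ in range(max_iters):
        new_mu = np.where(T @ mu >= 0, 1.0, -1.0)   # hard-limit activation f
        if np.array_equal(new_mu, mu):
            break
        mu = new_mu
    return mu

# Store two 8-bit exemplars and recall from a corrupted version of the first
patterns = [[1, -1, 1, -1, 1, -1, 1, -1],
            [1, 1, 1, 1, -1, -1, -1, -1]]
T = hopfield_weights(patterns)
noisy = [1, -1, 1, -1, 1, -1, -1, -1]               # one bit flipped
print(hopfield_recall(T, noisy))
```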

Page 32: Lecture 1: Feature extraction

8.3 Learning paradigms

• Learning

The process which leads to the modification of behaviour - Oxford Dictionary

“Some growth process or metabolic change takes place” - Donald Hebb

Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place. - Simon Haykin

Page 33: Lecture 1: Feature extraction

8.3 Learning paradigms

• Learning

Learning, in neural network terms, is the process of modification of connection weights.

• Supervised learning

– Error-correction learning: minimise $\|d-y(\mathbf{x},\mathbf{w})\|^2$, with update $\Delta\mathbf{w}=\eta\,e(d,y,\mathbf{x})$

• Reinforcement learning: $a=\arg\max_a\{\text{reward}\}$, $\Delta\mathbf{w}\propto\text{reward}$

• Unsupervised learning

– Hebbian learning: $\Delta\mathbf{w}=\eta\,y\,\mathbf{x}$

– Competitive learning: $\Delta\mathbf{w}_i=\eta\,f(\mathbf{x},y,i)$ if neuron $i$ wins, 0 otherwise

– Self-organisation: $\Delta\mathbf{w}_i=\eta\,y\,\mathbf{x}$ if neuron $i$ is close to the winner, 0 otherwise

Global order can arise from local interactions (Turing 1952)

Page 34: Lecture 1: Feature extraction

8.4 Network architectures

• Feed-forward networks
– Perceptron and multilayer perceptron
– Radial basis function
– Support vector machine

[Figure: feed-forward network with input layer x1, …, xn, hidden layer v1, …, v_nh and output layer y1, y2, y3, with weights between successive layers.]

• Recurrent networks
– Hopfield networks
– Boltzmann machine

[Figure: recurrent network in which the units x1, …, xn and outputs y1, …, yn are connected through weights with feedback.]

Page 35: Lecture 1: Feature extraction

8.5 Feedforward networks

• Perceptron learning

Least mean square method:

$$y(t)=\sum_{k=0}^{n} w_k(t)\,x_k(t)=\mathbf{x}^T(t)\,\mathbf{w}(t)$$

$$J=\tfrac{1}{2}E[e^2(t)],\qquad e(t)=d(t)-y(t)$$

Using the instantaneous cost $J(t)=\tfrac{1}{2}e^2(t)$,

$$\frac{\partial J(t)}{\partial\mathbf{w}}=e(t)\,\frac{\partial e(t)}{\partial\mathbf{w}}=-e(t)\,\mathbf{x}(t)$$

$$\mathbf{w}(t+1)=\mathbf{w}(t)-\eta\,\frac{\partial J(t)}{\partial\mathbf{w}}=\mathbf{w}(t)+\eta\,e(t)\,\mathbf{x}(t)=\mathbf{w}(t)+\eta\,[d(t)-\mathbf{x}^T(t)\,\mathbf{w}(t)]\,\mathbf{x}(t)$$

(without thresholding or activation function)

[Figure: perceptron with inputs x1, …, xn, weights w1, …, wn, bias b and hard-limiting output y = sign(xᵀw + b) ∈ {+1, -1}.]
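A short Python sketch of LMS training of a single linear unit, following the update w(t+1) = w(t) + η e(t) x(t) above (the data set and learning rate are illustrative):

```python
import numpy as np

def lms_train(X, d, eta=0.05, epochs=100):
    """Least-mean-square training of a linear unit y = x^T w (bias absorbed as a weight)."""
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant input for the bias b
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, d):
            e = target - x @ w                      # error e(t) = d(t) - y(t)
            w += eta * e * x                        # gradient step on J(t) = 0.5 e(t)^2
    return w

# Example: a linearly separable (OR-like) problem with bipolar targets
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
d = [-1, 1, 1, 1]
w = lms_train(X, d)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))  # thresholded outputs
```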

Page 36: Lecture 1: Feature extraction

8.5 Feedforward networks

• Perceptron limitations

[Figure: perceptron with inputs x1, …, xn, weights w1, …, wn, bias b and output y = sign(xᵀw + b).]

Linear separation only

Sensitive to initial states

Local minima

[Figure: cost function over a weight x, showing a starting point, a local minimum and the global minimum.]

Page 37: Lecture 1: Feature extraction

8.5 Feedforward networks

• Multilayer perceptron: back-propagation algorithm

Step 1: Initialisation. Set all weights and nodes’ biases to small random numbers.

Step 2: Presentation of training examples: input x = [x1, x2, …, xn]T and desired output d = [d1, d2, …, dm]T.

Step 3: Forward computation. Calculate the outputs of the hidden and output layer,

Step 4: Backward computation and updating weights. Compute the error terms,

Step 5: Iteration. Repeat steps 3 and 4 and stop when the total error reaches the required level.

Forward computation (Step 3):

$$v_j=\varphi\!\left(\sum_{i=1}^{n} w_{ij}^{h}\,x_i+b_j^{h}\right),\qquad y_k=\varphi\!\left(\sum_{j=1}^{n_h} w_{jk}^{o}\,v_j+b_k^{o}\right)$$

Backward computation (Step 4), for sigmoid activations:

$$\delta_k^{o}=e_k\,\varphi'(\cdot)=y_k(1-y_k)(d_k-y_k)$$

$$\delta_j^{h}=\varphi'(\cdot)\sum_{k=1}^{m}\delta_k^{o}\,w_{jk}^{o}=v_j(1-v_j)\sum_{k=1}^{m}\delta_k^{o}\,w_{jk}^{o}$$

Weight updates:

$$\Delta w_{jk}^{o}=\eta\,\delta_k^{o}\,v_j,\qquad \Delta w_{ij}^{h}=\eta\,\delta_j^{h}\,x_i$$

[Figure: multilayer perceptron with input layer x1, …, xn, hidden layer v1, …, v_nh and output layer y1, …, ym, with weights between successive layers.]
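A compact Python/NumPy sketch of these five steps for one hidden layer of sigmoid units (the layer sizes, learning rate and XOR example are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_mlp(X, D, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    """One-hidden-layer MLP trained with the back-propagation steps above."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], D.shape[1]
    Wh = rng.normal(0, 0.5, (n_in, n_hidden));  bh = np.zeros(n_hidden)   # Step 1
    Wo = rng.normal(0, 0.5, (n_hidden, n_out)); bo = np.zeros(n_out)
    for _ in range(epochs):
        for x, d in zip(X, D):                                            # Step 2
            v = sigmoid(x @ Wh + bh)                                      # Step 3: hidden outputs
            y = sigmoid(v @ Wo + bo)                                      #         network outputs
            delta_o = y * (1 - y) * (d - y)                               # Step 4: output error terms
            delta_h = v * (1 - v) * (Wo @ delta_o)                        #         hidden error terms
            Wo += eta * np.outer(v, delta_o); bo += eta * delta_o
            Wh += eta * np.outer(x, delta_h); bh += eta * delta_h
    return Wh, bh, Wo, bo

# Example: learn XOR, which a single-layer perceptron cannot separate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
D = np.array([[0], [1], [1], [0]], float)
Wh, bh, Wo, bo = train_mlp(X, D)
print(np.round(sigmoid(sigmoid(X @ Wh + bh) @ Wo + bo), 2))
```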

Page 38: Lecture 1: Feature extraction

8.5 Feedforward networks

• Radial basis functions

Properties:

• Hidden neurons are Gaussian.

• Output neurons are linear, similar to single-layer perceptrons.

• Easy to train: hidden nodes can be chosen first by clustering; the output layer is then trained as a single-layer perceptron.

• Fast to converge.

[Figure: RBF network: input layer x1, …, xn connected with unit weights to Gaussian hidden units v1, …, v_nh, whose outputs are combined linearly into the outputs y1, y2, y3.]

Gaussian hidden unit:

$$\varphi_k(\mathbf{x})=\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{m}_k\|^2}{2\sigma_k^2}\right)$$

Page 39: Lecture 1: Feature extraction

8.5 Feedforward networks

• Radial basis functions

The RBF model came from two fundamental theories

The Universal Approximation Theory: any function can be approximated with arbitrary precision by a weighted sum of a set of non-constant, bounded and monotone-increasing continuous functions,

Cover’s Separability Theory: a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

$$\hat{f}(\mathbf{x})=\sum_{i=1}^{M} w_i\,\varphi(\mathbf{x},\mathbf{m}_i)$$
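A minimal Python sketch of an RBF network in this spirit: Gaussian hidden units and a linear output layer solved by least squares (here the centres are picked by random sampling rather than clustering, purely for brevity):

```python
import numpy as np

def train_rbf(X, d, n_centres=10, sigma=1.0, seed=0):
    """RBF network: Gaussian hidden layer, linear output weights by least squares."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), n_centres, replace=False)]
    # Hidden-layer design matrix: phi_k(x) = exp(-||x - m_k||^2 / (2 sigma^2))
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    Phi = np.exp(-dists**2 / (2 * sigma**2))
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)          # linear output weights
    return centres, w

def rbf_predict(X, centres, w, sigma=1.0):
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-dists**2 / (2 * sigma**2)) @ w

# Example: approximate a 1-D function
X = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
d = np.sin(X).ravel()
centres, w = train_rbf(X, d)
print(np.max(np.abs(rbf_predict(X, centres, w) - d)))    # worst-case approximation error
```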

Page 40: Lecture 1: Feature extraction

8.6 Self-organising networks

• Hebbian learning

When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A’s efficiency as one of the cells firing B, is increased.

In mathematical terms (Hebb's rule):

$$\Delta\mathbf{w}=\eta\,y\,\mathbf{x}$$

Oja's rule (Hebbian learning with built-in weight normalisation):

$$w_i(t+1)=\frac{w_i(t)+\eta\,y(t)\,x_i(t)}{\left\{\sum_{j=1}^{n}\left[w_j(t)+\eta\,y(t)\,x_j(t)\right]^2\right\}^{1/2}}\approx w_i(t)+\eta\,y(t)\left[x_i(t)-y(t)\,w_i(t)\right]$$
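A small Python sketch of Oja's rule using the approximate update above; for zero-mean data the weight vector tends towards the first principal direction (the data set is illustrative):

```python
import numpy as np

def oja_first_component(X, eta=0.01, epochs=100, seed=0):
    """Oja's rule: w <- w + eta * y * (x - y * w), which keeps ||w|| close to 1."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = x @ w                        # neuron output y = x^T w
            w += eta * y * (x - y * w)       # Hebbian term eta*y*x plus a decay that normalises w
    return w

# Example: correlated 2-D data, centred to zero mean
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
X -= X.mean(axis=0)
print(oja_first_component(X))
```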

Page 41: Lecture 1: Feature extraction

8.6 Self-organising networks

• Hebbian learning

Train a perceptron for an AND function, y = b + Σi xi wi, using Hebbian updates Δwi = xi·y, Δb = y (bipolar inputs and targets):

Input (x1 x2 1)   Output y   Weight changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
(initial)                                                    0  0  0
 1  1  1             1          1  1  1                      1  1  1
 1 -1  1            -1         -1  1 -1                      0  2  0
-1  1  1            -1          1 -1 -1                      1  1 -1
-1 -1  1            -1          1  1 -1                      2  2 -2

[Figure: the trained perceptron (inputs X1, X2, bias, output Y) and its decision boundary x1 + x2 = 1, separating the single positive pattern from the three negative ones.]
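The table can be reproduced with a few lines of Python (bipolar inputs and targets as above):

```python
import numpy as np

# Bipolar inputs and AND targets, as in the table above
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], float)
d = np.array([1, -1, -1, -1], float)

w = np.zeros(2)
b = 0.0
for x, y in zip(X, d):            # one Hebbian update per pattern: dw_i = x_i*y, db = y
    w += x * y
    b += y
    print(x, y, "weights:", w, "bias:", b)

# Resulting decision: y = sign(b + x1*w1 + x2*w2) reproduces the AND function
print(np.sign(X @ w + b))
```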

Page 42: Lecture 1: Feature extraction

8.6 Self-organising networks

• Von der Malsburg and Willshaw model (1973,1976)

$$\frac{\partial y_i(t)}{\partial t}=-c\,y_i(t)+\sum_j w_{ij}(t)\,x_j(t)+\sum_{k'} e_{ik'}\,y^*_{k'}(t)-\sum_{k''} b_{ik''}\,y^*_{k''}(t)$$

$$y^*_j(t)=\begin{cases}y_j(t), & \text{if } y_j(t)\ge\theta\\[2pt] 0, & \text{otherwise}\end{cases}$$

$$\frac{\partial w_{ij}(t)}{\partial t}=\eta\,x_j(t)\,y^*_i(t),\qquad \text{subject to } \sum_j w_{ij}=\text{constant}$$

Page 43: Lecture 1: Feature extraction

8.6 Self-organising networks

• Kohonen's SOM

Input vector $\mathbf{x}=[x_1, x_2, \ldots, x_n]^T$ is presented to all neurons of the map.

Output with lateral feedback:

$$y_j(t+1)=\varphi\!\left[\mathbf{w}_j^T\mathbf{x}(t)+\sum_i h_{ij}\,y_i(t)\right]$$

approximated by a “bubble” of activity around the winner:

$$y_j(t+1)=\begin{cases}1, & \text{if neuron } j \text{ is inside the bubble}\\[2pt] 0, & \text{otherwise}\end{cases}$$

Winner selection:

$$\|\mathbf{w}_{j^*}(t)-\mathbf{x}(t)\|=\min_j\|\mathbf{w}_j(t)-\mathbf{x}(t)\|$$

Weight adaptation:

$$\frac{\partial w_{ij}(t)}{\partial t}=\alpha\,y_j(t)\,x_i(t)-\beta\,y_j(t)\,w_{ij}(t)=\begin{cases}\alpha\,[x_i(t)-w_{ij}(t)], & \text{if } j\in\Lambda(t)\\[2pt] 0, & \text{if } j\notin\Lambda(t)\end{cases}$$

Page 44: Lecture 1: Feature extraction

8.6 Self-organising networks

• SOM algorithm

• At each time t, present an input x(t) and select the winner:

$$v=\arg\min_{c}\|\mathbf{x}(t)-\mathbf{w}_c\|$$

• Update the weights of the winner and its neighbours:

$$\mathbf{w}_k(t+1)=\mathbf{w}_k(t)+\alpha(t)\,\eta(v,k,t)\,[\mathbf{x}(t)-\mathbf{w}_k(t)]$$

• Repeat until the map converges.

Typical neighbourhood function:

$$\eta(v,k,t)=\exp\!\left[-\frac{\|v-k\|^2}{2\sigma^2(t)}\right]$$
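A minimal Python sketch of this algorithm for a 1-D map (the grid size, decay schedules and ring-shaped data are illustrative assumptions):

```python
import numpy as np

def train_som(X, grid=10, epochs=30, eta0=0.5, sigma0=3.0, seed=0):
    """1-D SOM: winner selection, Gaussian neighbourhood and weight update as above."""
    rng = np.random.default_rng(seed)
    W = rng.random((grid, X.shape[1]))                    # one weight vector per map node
    n_steps = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(X):
            eta = eta0 * (1 - t / n_steps)                # decaying learning rate alpha(t)
            sigma = max(sigma0 * (1 - t / n_steps), 0.5)  # shrinking neighbourhood width
            v = np.argmin(np.linalg.norm(x - W, axis=1))  # winner: closest weight vector
            k = np.arange(grid)
            h = np.exp(-(k - v) ** 2 / (2 * sigma ** 2))  # Gaussian neighbourhood function
            W += eta * h[:, None] * (x - W)               # update winner and its neighbours
            t += 1
    return W

# Example: map 2-D points on a ring onto a 1-D chain of 10 nodes
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(angles), np.sin(angles)]
print(np.round(train_som(X), 2))
```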

Page 45: Lecture 1: Feature extraction

8.6 Self-organising networks

• SOM clustering

[Figure: SOM clustering of animals (dove, hen, duck, goose, owl, hawk, eagle, fox, dog, wolf, cat, tiger, lion, horse, zebra, cow) on a 2-D map; similar animals are mapped to nearby nodes.]

Page 46: Lecture 1: Feature extraction

8.6 Self-organising networks

• SOM Data visualisation/management

Page 47: Lecture 1: Feature extraction

8.6 Self-organising networks

• SOM

Quantisation & topology

Topologically “ordered” map

“Error tolerant” coding - HVQ (Luttrell, NC 1994)

“Minimum wiring” (Mitchison, NC 1995; Durbin & Mitchison, Nature 1990)

Page 48: Lecture 1: Feature extraction

8.6 Self-organising networks

• SOM extensions

° HSOM (Miikkulainen 1990), DISLEX (1990, 1997)

° PSOM (Ritter 1993), Hyperbolic SOM (1999), H2SOM

° Temporal Kohonen Map (Chappell & Taylor 1993)

° Neural Gas (Martinetz et al. 1991), Growing Grid (Fritzke 1995)

° ASSOM (Kohonen 1997)

° Recurrent SOM (Koskela, 1997), Recursive SOM (Voegtlin 2001)

° SOAR (Lampinen & Oja 1989), SOMAR (Ni & Yin, 2007)

° Bayesian SOM & SOMN (Yin & Allinson 1995, 1997; Utsugi 1997)

° GTM (Bishop et al. 1998)

° GHSOM (Merkl et al. 2000), TOC (TreeSOM) (Freeman & Yin 2004)

° PicSOM (Laaksonen, Oja, et al., 2000)

° ViSOM (Yin 2001, 2002), gViSOM (Yin 2007)

Page 49: Lecture 1: Feature extraction

8.7 Learning & generalisation

• Learning (fitting) & generalisation

[Figure: two different curve fits (Y against X) to the same training data.]

On the one hand, we want as high a classification rate (or as low an error rate) as possible on the training data set.

On the other hand, we would also like the trained network to have good performance on unseen (testing) data.

Which is better?

Page 50: Lecture 1: Feature extraction

8.7 Learning & generalisation

• Bias & Variance dilemma (finite data set D)

Average (expected) squared error:

$$\frac{1}{2}\int\left[y(\mathbf{x})-d(\mathbf{x})\right]^2 p(\mathbf{x})\,d\mathbf{x}$$

$$\left[y(\mathbf{x})-d(\mathbf{x})\right]^2=\left[y(\mathbf{x})-E\{y(\mathbf{x})\}+E\{y(\mathbf{x})\}-d(\mathbf{x})\right]^2$$
$$=\left[y(\mathbf{x})-E\{y(\mathbf{x})\}\right]^2+\left[E\{y(\mathbf{x})\}-d(\mathbf{x})\right]^2+2\left[y(\mathbf{x})-E\{y(\mathbf{x})\}\right]\left[E\{y(\mathbf{x})\}-d(\mathbf{x})\right]$$

Taking the expectation over the data set D (the cross term vanishes):

$$E_D\left\{\left[y(\mathbf{x})-d(\mathbf{x})\right]^2\right\}=\underbrace{\left[E_D\{y(\mathbf{x})\}-d(\mathbf{x})\right]^2}_{(\text{bias})^2}+\underbrace{E_D\left\{\left[y(\mathbf{x})-E_D\{y(\mathbf{x})\}\right]^2\right\}}_{\text{variance}}$$

A close fitter/model yields small bias but large variance, while a loose fitter (or a wild guess) gives small variance but large bias.

Page 51: Lecture 1: Feature extraction

8.7 Learning & generalisation

• Bias & Variance dilemma (finite data set D)

g(x): generating curve; dots: samples.

* Courtesy of http://www.aiaccess.net/

For h1(x) (a rigid, simple fit):

$$E_D\left\{\left[y(\mathbf{x})-E_D\{y(\mathbf{x})\}\right]^2\right\}\approx 0,\qquad \text{as } E\{y(\mathbf{x})\}=h_1(\mathbf{x})=y(\mathbf{x})$$

Low variance, but large bias (error)!

For h2(x) (a flexible fit that follows the sample closely):

$$\left[E_D\{y(\mathbf{x})\}-d(\mathbf{x})\right]^2\approx 0,\qquad \text{as } E_D\{y(\mathbf{x})\}=E\{h_2(\mathbf{x})\}\approx g(\mathbf{x})=d(\mathbf{x})$$

Low bias, but large variance! Poor generalisation!

That is, different samples give different h2(x), hence large variance. Or, if one still uses the h2(x) obtained from Sample 1 on Sample 2, the errors are large.

Page 52: Lecture 1: Feature extraction

8.7 Learning & generalisation

• Bias & Variance dilemma (finite data set D)

sin(2x): generating curve

dots: samples

[Figure: polynomial fits of order 0, 1, 3 and 9 to the samples; the 3rd-order fit gives the best representation, while the 9th-order fit over-fits.]
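A short Python sketch of this experiment, assuming sin(2πx) as the generating curve and Gaussian noise on the samples (both are assumptions for illustration): the low-order fits show large bias, while the 9th-order fit shows near-zero training error but a much larger test error.

```python
import numpy as np

# Fit polynomials of different order to noisy samples of a generating curve
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_true = np.sin(2 * np.pi * x_test)

for order in (0, 1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, order)          # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_true) ** 2)
    print(f"order {order}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```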

Page 53: Lecture 1: Feature extraction

8.8 Genetic algorithms

• Historical point of view

- Evolutionary Computation

A computational and optimisation methodology based on genetic and evolutionary mechanisms, proposed by John Holland in the 1960s.

Page 54: Lecture 1: Feature extraction

8.8 Genetic algorithms

• Genetic view

The human body consists of thousands of millions of cells. Each cell's nucleus contains 46 chromosomes: 23 from each parent, making 23 chromosome pairs in each cell of the offspring.

The chromosomes consist of DNA (deoxyribonucleic acid), which is a double helix made of sugar (deoxyribose), phosphate and four nitrogen bases called A (adenine), T (thymine), C (cytosine) and G (guanine). The bases A and T, and the bases C and G, respectively, always form pairs.

Our DNA consists of 70-100 thousand genes. A gene is a segment of DNA that codes for the production of a specific protein by spelling out the sequence of amino acids composing that protein. It is also the genes that give every person their own particular looks. The genes that we inherit from our parents decide which sex we will have, the colour of our skin and eyes, and whether we will be a tall or short person, and if a gene in our body is defective we can become ill.

Page 55: Lecture 1: Feature extraction

8.8 Genetic algorithms

• Genetic view

23 paired chromosomes of the human (male)

Trait (allele), locus

22 pairs of autosomes

Sex chromosomes: 1 X and 1 Y (male), or 1 pair of X (female)

Page 56: Lecture 1: Feature extraction

8.8 Genetic algorithms

• Genetic algorithm (GA)

Population: a set of possible solutions, or chromosomes.

Initial population: can be randomly chosen

Fitness: the fitness of a solution (chromosome) needs to be defined.

Encoding: a solution (chromosome) is usually encoded as a binary sequence (a number of binary bits).

Page 57: Lecture 1: Feature extraction

8.8 Genetic algorithms

• GA operations

Selection: select a pair of chromosomes from the population, with probabilities in proportion to their fitness.

Crossover: randomly choose a locus and exchange the bits after that locus between the two chosen chromosomes to create two child chromosomes. For example, if the chosen parent strings are 11000100 and 01111110 and the locus is chosen at the third bit, then the child strings are 11011110 and 01100100.

Mutation: randomly (with a small probability) flip some of the bits in the chromosomes. For example, the string 00011000 might be mutated in its fourth position to yield 00010000.
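The two operators can be sketched in a few lines of Python (the function names are illustrative):

```python
import random

def crossover(p1, p2, locus):
    """Single-point crossover: swap the tails of two bit strings after the given locus."""
    return p1[:locus] + p2[locus:], p2[:locus] + p1[locus:]

def mutate(chrom, rate=0.01):
    """Flip each bit independently with the given mutation probability."""
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in chrom)

print(crossover("11000100", "01111110", 3))   # -> ('11011110', '01100100')
print(mutate("00011000", rate=0.1))
```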

Page 58: Lecture 1: Feature extraction

8.8 Genetic algorithms

• Selection: Roulette wheel

Page 59: Lecture 1: Feature extraction

8.8 Genetic algorithms

• Crossover

• Mutation

11001011 + 11011111 = 11001111

11001111 => 10001011

Page 60: Lecture 1: Feature extraction

8.8 Genetic algorithms

• Example

Find the maximum point of

f(y) = y + |sin(32y)|, 0 ≤ y < π

Using 16-bit encoding for solutions over 0 ≤ y < π

Population size: 200,

Initial population is randomly chosen from the search space,

Fitness: f(y),

Crossover rate: 0.5

Mutation rate: 0.005

Run MATLAB code: GA1.m
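The MATLAB script GA1.m is not reproduced here, so the following Python sketch implements the same example under the stated settings (population 200, 16-bit encoding, crossover rate 0.5, mutation rate 0.005); the decoding range 0 ≤ y < π is an assumption:

```python
import numpy as np

def fitness(y):
    return y + np.abs(np.sin(32 * y))

def decode(bits, lo=0.0, hi=np.pi):
    """Map a 16-bit string to a real value in [lo, hi)."""
    return lo + int(bits, 2) / 2**16 * (hi - lo)

def ga(pop_size=200, n_bits=16, p_cross=0.5, p_mut=0.005, generations=100, seed=0):
    rng = np.random.default_rng(seed)
    pop = ["".join(rng.choice(["0", "1"], n_bits)) for _ in range(pop_size)]
    for _ in range(generations):
        fit = np.array([fitness(decode(c)) for c in pop])
        probs = fit / fit.sum()                          # roulette-wheel selection
        parents = rng.choice(pop, size=pop_size, p=probs)
        children = []
        for p1, p2 in zip(parents[::2], parents[1::2]):
            if rng.random() < p_cross:                   # single-point crossover
                locus = rng.integers(1, n_bits)
                p1, p2 = p1[:locus] + p2[locus:], p2[:locus] + p1[locus:]
            children += [p1, p2]
        # mutation: flip each bit with probability p_mut
        pop = ["".join(b if rng.random() > p_mut else str(1 - int(b)) for b in c)
               for c in children]
    best = max(pop, key=lambda c: fitness(decode(c)))
    return decode(best), fitness(decode(best))

print(ga())   # (best y found, its fitness)
```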

Page 61: Lecture 1: Feature extraction

Part 8: Neural networks (Summary)

• Introduction & cognition

• Artificial intelligence

• Retina & visual pathway

• Neurons & models

• Multilayer perceptron

• Self-organising networks

• Genetic algorithm