Chapter 7 (part 3) Neural Networks

Page 1:

Chapter 7 (part 3) Neural Networks

Page 2:

What are Neural Networks?
• An extremely simplified version of the brain
• Essentially a function approximator
  • Transforms inputs into outputs to the best of its ability

[Figure: inputs → NN → outputs]

Page 3:

What are Neural Networks?

Page 4:

What are Neural Networks?
• For the most part, neural networks are useful in situations where multiple regression is used.
• In a neural network:
  • the independent variables are called input cells;
  • the dependent variable is called an output cell (depending on the network model there might be more than one output).
• As in regression, there is a certain number of observations (say N).
• Each observation contains a value for each input cell (or independent variable) and for the output cell (or dependent variable).

Page 5:

In an illustration of a neural network, each circle is a cell of the network. There are three different types of cells:
• Input cells are the cells in the first column of the network; this first column is called the Input Layer.
• The Output cell is the cell in the last column of the network, and this last layer is the Output Layer.
• All other layers of the network are called Hidden Layers.

Page 6:

• Cell 0 may be viewed as a special input cell, analogous to the constant term in multiple regression.

• Each other Input cell in the input layer corresponds to an independent variable.

• Each cell in the Output layer represents a dependent variable we want to predict.

• The arc in the network joining cells i and j has an associated weight wij.

Page 7:

[Figure: a network with bias cell 0, Input Layer cells 1-4, Hidden Layer cells 5-7, and Output Layer cell 8.]

The figure above has 4 input variables and one hidden layer. With the exception of cell 0, each cell in the network is connected by an arc to each cell in the next network layer. In addition, there is an arc connecting cell 0 with each cell in a Hidden or Output Layer.

Page 8:

INPUT AND OUTPUT VALUES

For any observation, each cell in the network has an associated input value and output value.

For any observation, each input cell i (other than cell 0):
• has an input value equal to the value of input i for that observation;
• has an output value equal to the value of input i for that observation.

For example, if the first input equals 3, then the input and output from node 1 are both 3.

Page 9:

For cell 0, the input and output values always equal 1.

Let INP(j) = input to cell j and OUT(j) = output from cell j.

For now we suppress the dependence of INP(j) and OUT(j) on the particular observation.

For any cell j not in the Input Layer,

INP(j) = Σ (over all cells i connected to cell j) wij · (output from cell i)

Page 10:

The following equations illustrate the use of the previous formula by computing the inputs to cells 5-7 of the previous neural network figure. Suppose that for a given observation input cell j takes the value Ij. Then:

INP(5) = w05(1) + w15I1 + w25I2 + w35I3 + w45I4
INP(6) = w06(1) + w16I1 + w26I2 + w36I3 + w46I4
INP(7) = w07(1) + w17I1 + w27I2 + w37I3 + w47I4

[Figure: the same network as before, with Input Layer cells 1-4, bias cell 0, Hidden Layer cells 5-7, and Output Layer cell 8.]

Page 11:

To determine the output from any cell not in the input layer we need to use a transfer function. We will later discuss the most commonly used transfer functions, but for now the transfer function f may stand for any function. Then for any cell j not in the Input Layer

OUT(j) = f(INP(j))

For our figure, this formula yields
OUT(5) = f(INP(5))
OUT(6) = f(INP(6))
OUT(7) = f(INP(7))

Page 12:

By using the formulas and the previous results, we can now determine the input and output for our output cell, cell 8.

INP(8) = w08(1) + w58OUT(5) + w68OUT(6) + w78OUT(7)

OUT(8) = f(INP(8))

For any observation, OUT(8) is our prediction for the output cell.
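To make the forward-pass arithmetic concrete, here is a minimal Python sketch of these computations for the network in the figure (input cells 1-4, hidden cells 5-7, output cell 8, bias cell 0). The weight values and the identity transfer function below are placeholders for illustration, not values from the text.

def forward(inputs, w_hidden, w_out, f):
    # inputs: [I1, I2, I3, I4]
    # w_hidden[k]: weights [w0j, w1j, w2j, w3j, w4j] into hidden cell j
    # w_out: weights [w08, w58, w68, w78] into output cell 8
    hidden_out = []
    for w in w_hidden:
        # INP(j) = w0j(1) + sum over i of wij * OUT(i); OUT(i) = Ii for input cells
        inp_j = w[0] * 1 + sum(wij * ii for wij, ii in zip(w[1:], inputs))
        hidden_out.append(f(inp_j))  # OUT(j) = f(INP(j))
    # INP(8) = w08(1) + w58*OUT(5) + w68*OUT(6) + w78*OUT(7)
    inp_8 = w_out[0] * 1 + sum(wj8 * oj for wj8, oj in zip(w_out[1:], hidden_out))
    return f(inp_8)  # OUT(8): the network's prediction

# Example with made-up weights and the identity transfer function:
w_hidden = [[0.1, 0.2, -0.3, 0.4, 0.5],   # weights into hidden cell 5
            [0.0, 0.1, 0.1, -0.2, 0.3],   # weights into hidden cell 6
            [-0.1, 0.4, 0.2, 0.1, 0.0]]   # weights into hidden cell 7
w_out = [0.2, 0.5, -0.4, 0.3]             # weights into output cell 8
print(forward([3, 1, 2, 0], w_hidden, w_out, f=lambda x: x))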

Page 13:

The complex math involved in neural networks is used to compute weights that produce "good" predictions.

More formally, let

OUTj(8) = output of cell 8 for observation j

Oj = actual value of cell 8 for observation j.

The real work in the neural network analysis is to determine network weights wij that minimize

Σ (j = 1 to N) (OUTj(8) - Oj)²
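In code, this objective is a one-liner. A sketch, where predict is a hypothetical function that maps a weight collection and one observation's inputs to OUT(8):

def sse(predict, weights, observations):
    # observations: list of (inputs, O_j) pairs, O_j being the actual output.
    # Returns the sum over j of (OUTj(8) - Oj)^2, the quantity training minimizes.
    return sum((predict(weights, inputs) - o_j) ** 2
               for inputs, o_j in observations)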

Page 14:

SCALING DATA

For the computations in a neural net to be tractable, it is desirable to scale all data (inputs and outputs) so that every value lies in one of the following two intervals:

Interval 1: [0, 1] Interval 2: [-1, +1]

Page 15:

If you want to scale your data to lie on [0, 1], then any value x for Input i should be transformed into

(x - smallest value for Input i) / (range for Input i)

Note: Range = Largest value for Input i - Smallest value for Input i

and any value x for the output should be transformed into

(x - smallest value of output) / (range for output)

Page 16:

If you want to scale your data to lie on [-1, +1], then any value x for Input i should be transformed into

2[x - (largest value for Input i)] / (range for Input i) + 1

and any value x for the output should be transformed into

2[x - (largest value of output)] / (range for output) + 1
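Both transformations are straightforward to code. A minimal sketch of the two scalings above (the function names are mine):

def scale_01(x, smallest, largest):
    # Map x from [smallest, largest] onto [0, 1].
    return (x - smallest) / (largest - smallest)

def scale_pm1(x, smallest, largest):
    # Map x from [smallest, largest] onto [-1, +1].
    return 2 * (x - largest) / (largest - smallest) + 1

# Quick check for an input whose values range over [10, 50]:
print(scale_01(10, 10, 50), scale_01(50, 10, 50))    # 0.0 1.0
print(scale_pm1(10, 10, 50), scale_pm1(50, 10, 50))  # -1.0 1.0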

Page 17:

THE SIGMOID TRANSFER FUNCTION

Vast experience with neural networks indicates that the sigmoid transfer function usually yields the best predictions. If your data has been scaled to lie on [-1, +1], the relevant sigmoid function is given by

f(input) = 2 / (1 + e^(-input)) - 1

Note that for this function an input near -∞ yields an output near -1 and an input near +∞ yields an output near +1.

Page 18:

If your data has been scaled to lie on [0, 1], the relevant sigmoid function is given by

f(input) = 1 / (1 + e^(-input))

Note that for this function an input near -∞ yields an output near 0 and an input near +∞ yields an output near 1.
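Both functions are easy to check numerically; a minimal sketch:

import math

def sigmoid_01(inp):
    # Sigmoid for data scaled to [0, 1]: f(input) = 1 / (1 + e^(-input))
    return 1 / (1 + math.exp(-inp))

def sigmoid_pm1(inp):
    # Sigmoid for data scaled to [-1, +1]: f(input) = 2 / (1 + e^(-input)) - 1
    return 2 / (1 + math.exp(-inp)) - 1

# Large-magnitude inputs approach the interval endpoints:
print(sigmoid_pm1(-10), sigmoid_pm1(10))  # close to -1 and +1
print(sigmoid_01(-10), sigmoid_01(10))    # close to 0 and 1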

Page 19:

REMARKS

1. The sigmoid function is often called the squashing function because it "squashes" values on the interval (-∞, +∞) into the unit interval [0, 1].
2. The slope of the sigmoid function for the [0, 1] interval is given by

f'(Input) = f(Input)(1-f(Input))

This implies that the sigmoid function is very steep for intermediate values of the input and very flat for extreme input values.
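The slope identity, and the steep-middle/flat-extremes behavior, can be verified with a finite-difference check; a sketch:

import math

def f(inp):
    # Sigmoid for the [0, 1] interval.
    return 1 / (1 + math.exp(-inp))

h = 1e-6
for inp in (-5.0, 0.0, 2.0):
    numeric = (f(inp + h) - f(inp - h)) / (2 * h)  # finite-difference slope
    identity = f(inp) * (1 - f(inp))               # f'(input) = f(input)(1 - f(input))
    print(inp, round(numeric, 6), round(identity, 6))
# The slope peaks at input = 0 (value 0.25) and is nearly flat at extreme inputs.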

Page 20:

TESTING AND VALIDATION

When we fit a regression to data, we often use 80%-90% of the data to fit a regression equation, and use the remaining data to "validate" the equation.

The same technique is used in neural networks. We begin by designating 80%-90% of our data as the training or learning data set. Then we "fit" a neural network to this data.

Page 21:

Suppose cell 8 is the output cell.

Let Oj(8) = the actual output for observation j

And

OUTj(8) be the value of the output cell for observation j.

Let AVGOT(8) = average value of the output for the training data.

Page 22:

Analogous to regression, define:

SST(Train) = Sum of Squares Total for the Training data
SST(Train) = Σj (Oj(8) - AVGOT(8))²

and

SSR(Train) = Σj (OUTj(8) - AVGOT(8))²

Define:

R2(Train) = SSR(Train)/SST(Train)
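Computed directly from these definitions, with made-up numbers standing in for a real training set; a sketch:

def r2_train(actual, predicted):
    # R2(Train) = SSR(Train) / SST(Train), per the definitions above.
    avg = sum(actual) / len(actual)                # AVGOT(8)
    sst = sum((o - avg) ** 2 for o in actual)      # SST(Train)
    ssr = sum((p - avg) ** 2 for p in predicted)   # SSR(Train)
    return ssr / sst

# Illustrative (made-up) actual outputs and network predictions:
actual = [0.2, 0.5, 0.9, 0.4]
predicted = [0.25, 0.45, 0.85, 0.45]
print(r2_train(actual, predicted))   # about 0.73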

Page 23:

If the network is to be useful for forecasting, the R2 computed from the Test portion of the data should be close to R2(Train).

Page 24:

CONTINUOUS AND BINARY DATA

If your dependent variable assumes only two values (say 0 and 1) we say we have binary data.

In this case the usual procedure is to train the network until as many predicted outputs as possible are either below 0.1 or above 0.9. Then observations with predictions less than 0.1 are classified as 0, and observations with predictions larger than 0.9 are classified as 1.

If we do not have binary data we say that the data is continuous.
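The classification rule is simple to state in code. A sketch using the 0.1/0.9 cutoffs above; leaving in-between predictions unclassified is one reasonable convention, not the only one:

def classify(prediction, low=0.1, high=0.9):
    # Classify a network output for binary data.
    if prediction < low:
        return 0
    if prediction > high:
        return 1
    return None  # ambiguous: train further or flag the observation

print([classify(p) for p in (0.03, 0.95, 0.5)])  # [0, 1, None]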

Page 25:

Examples of the Use of Neural Networks

Example 1: The efficient market hypothesis of financial markets states that the "past history" of a stock's returns yields no information about the future return of the stock. White (1988) examines returns on IBM to see if the market is efficient. He begins by running a multiple regression where the dependent variable is the next day's return on IBM stock and the five independent variables are the return on IBM during each of the last five days.

Page 26:

This regression yielded R2 = .0079, which is consistent with the efficient market hypothesis.

White then ran a neural network (containing one hidden layer) with the output cell corresponding to the next day's return on IBM and 5 input cells corresponding to the last five days' returns on IBM. This neural network yielded R2 = .179. This implies that the past five days of IBM returns do contain information that can be used to make predictions about tomorrow's return on IBM!

According to the October 9, 1993 Economist, Fidelity manages 2.6 billion dollars in assets using neural nets. One of the neural net funds has beaten the S&P 500 by 2-7% a quarter for more than three years.

Page 27:

Example 2: Researchers at Carnegie-Mellon University have developed ALVINN (Autonomous Land Vehicle In a Neural Network), a neural network that can drive a car! It can tell if cars are nearby and then slow down the car. Within ten years a neural network may be driving your car!

Page 28:

Example 3: In finance and accounting it is important to accurately predict whether or not a company will go bankrupt during the next year. Altman (1968) developed a method (Altman's Z-statistic) to predict whether or not a firm will go bankrupt during the next year based on the firm's financial ratios. This method used a version of regression called discriminant analysis. Neural networks using financial ratios as input cells have outperformed Altman's Z.

Page 29:

Example 4: The September 22, 1993 New York Times reported that Otis Elevator uses neural networks to direct elevators. For example, if elevator 1 is on floor 10 and going up, elevator 2 is on floor 6 going down, and elevator 3 is on floor 2 and going up, which elevator should answer a call to go down from floor 7?

Example 5: Many banks (Mellon and Chase are two examples) and credit card companies use neural networks to predict (on the basis of past usage patterns) whether or not a credit card transaction should be disallowed. AVCO Financial used a neural net to determine whether or not to lend people money. They increased their loan volume by 25% and decreased their default rate by 20%!

Page 30:

Example 6: "Pen" computers and personal digital assistants often use neural nets to read the user's handwriting. The "inputs" to the network are a binary representation of what the user has written. For example, let 1 = a place where we have written something and 0 = a place where we have not written something. An input to the network might look like

00011110001000000111000000100001110

Page 31:

The neural net must decide whether to classify this input as an "s", a "5", or something else.

LeCun tried to have a neural net "read" handwritten zip code digits. 7291 digits were used for training and 2007 for testing. Running the neural net took three days on a Sun workstation. The net correctly classified 99.86% of the Training data and 95.0% of the Test data.

Page 32:

Why Neural Networks Can Beat Regression: The XOR Example

The classical XOR data set can be used to obtain a better understanding of how neural networks work, and why they can pick up patterns that regression often misses. The XOR data set also illustrates the usefulness of a hidden layer. The XOR data set contains two inputs, one output, and four observations.

Page 33:

The data set is given below:

Observation Input 1 Input 2 Output

1 0 0 0

2 1 0 1

3 0 1 1

4 1 1 0

We see that the output equals 1 if either input (but not both) is equal to 1.

If we try to use regression to predict the output from the two inputs, we obtain the fitted equation ŷ = .5 (a constant). This equation yields an R2 = 0, which means that linear multiple regression yields poor predictions indeed.
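This can be verified with ordinary least squares on the four observations; a sketch using numpy:

import numpy as np

# XOR data: columns are [constant, Input 1, Input 2].
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                  # approximately [0.5, 0, 0], i.e. y_hat = .5
pred = X @ beta
ss_res = ((y - pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)   # R2 = 0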

Page 34:

Now let's use the neural network of the following figure to predict the output.

[Figure: a network with bias cell 0, input cells 1 and 2, and output cell 3; there is no hidden layer.]

We assume that for some θ the transfer function for the output cell is defined by f(x) = 1 if x ≥ θ and f(x) = 0 if x < θ. Given this transfer function, it is natural to ask whether there are any values for θ and the wij's that will enable the above network to make the correct predictions for the data set in the table on the previous slide.

Page 35:

From the figure on the previous slide we find

INP(3) = w03 + w13(Input 1) + w23(Input 2)

and

OUT(3) = 1 if INP(3) ≥ θ and OUT(3) = 0 if INP(3) < θ

This implies that the network will yield correct predictions for each observation if and only if the following four inequalities hold:

Observation 1: w03 < θ
Observation 2: w03 + w13 ≥ θ
Observation 3: w03 + w23 ≥ θ
Observation 4: w03 + w13 + w23 < θ

Page 36:

There are no values of w03, w13, w23 and θ that satisfy all four inequalities.

To see this, note that together

Observation 3: w03 + w23 ≥ θ
Observation 4: w03 + w13 + w23 < θ

imply w13 < 0. Combining w13 < 0 with Observation 1: w03 < θ implies that w03 + w13 < θ, which contradicts

Observation 2: w03 + w13 ≥ θ

Thus we have seen that there is no way the neural network of this figure can correctly predict the output for each observation.
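The algebra above is a proof; a brute-force search over a grid of candidate weights and thresholds illustrates the same conclusion empirically (a finite grid cannot prove impossibility, but it never finds a solution). A sketch:

import itertools

xor_data = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]

def predicts_xor(w03, w13, w23, theta):
    # OUT(3) = 1 if w03 + w13*(Input 1) + w23*(Input 2) >= theta, else 0.
    for (i1, i2), output in xor_data:
        inp3 = w03 + w13 * i1 + w23 * i2
        if (1 if inp3 >= theta else 0) != output:
            return False
    return True

grid = [k / 4 for k in range(-8, 9)]   # -2.0 to 2.0 in steps of 0.25
found = any(predicts_xor(*combo)
            for combo in itertools.product(grid, repeat=4))
print(found)   # False: no combination fits all four observations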

Page 37:

Suppose we add a hidden layer with two nodes to the previous figure's neural net. This yields the neural network in the following figure.

[Figure: a network with bias cell 0, input cells 1 and 2, hidden cells 3 and 4, and output cell 5.]

We define the transfer function fi at cell i by fi(x) = 1 if x ≥ θi and fi(x) = 0 for x < θi. If we choose θ3 = .4, θ4 = 1.2, θ5 = .5, w03 = w04 = w05 = 0, w13 = w14 = w23 = w24 = 1, w35 = .6, w45 = -.2, then the neural network of the figure above will yield correct predictions for each observation. We now verify that this is the case.

Page 38:

Observation 1:

Input 1 = Input 2 = 0, Output = 0
Cell 3 Input = 1(0) + 1(0) = 0
Cell 3 Output = 0 (0 < .4)
Cell 4 Input = 1(0) + 1(0) = 0
Cell 4 Output = 0 (0 < 1.2)
Cell 5 Input = .6(0) - .2(0) = 0
Cell 5 Output = 0 (0 < .5)

Page 39:

Observation 2:

Input 1 = 1, Input 2 = 0, Output = 1
Node 3 Input = 1(1) + 1(0) = 1
Node 3 Output = 1 (1 ≥ .4)
Node 4 Input = 1(1) + 1(0) = 1
Node 4 Output = 0 (1 < 1.2)
Node 5 Input = .6(1) - .2(0) = .6
Node 5 Output = 1 (.6 ≥ .5)

Page 40:

Observation 3:

Input 1 = 0, Input 2 = 1, Output = 1
Node 3 Input = 1(0) + 1(1) = 1
Node 3 Output = 1 (1 ≥ .4)
Node 4 Input = 1(0) + 1(1) = 1
Node 4 Output = 0 (1 < 1.2)
Node 5 Input = .6(1) - .2(0) = .6
Node 5 Output = 1 (.6 ≥ .5)

Page 41:

Observation 4:

Input 1 = Input 2 = 1, Output = 0
Node 3 Input = 1(1) + 1(1) = 2
Node 3 Output = 1 (2 ≥ .4)
Node 4 Input = 1(1) + 1(1) = 2
Node 4 Output = 1 (2 ≥ 1.2)
Node 5 Input = .6(1) - .2(1) = .4
Node 5 Output = 0 (.4 < .5)

We have found that the hidden layer enables us to perfectly fit the XOR data set!
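The four verifications above can be reproduced in a few lines of Python, using exactly the weights and thresholds given on the earlier slide; a sketch:

def step(x, theta):
    # Threshold transfer function: 1 if x >= theta, else 0.
    return 1 if x >= theta else 0

def xor_net(i1, i2):
    # Weights from the text: theta3 = .4, theta4 = 1.2, theta5 = .5,
    # w03 = w04 = w05 = 0, w13 = w14 = w23 = w24 = 1, w35 = .6, w45 = -.2
    out3 = step(0 + 1 * i1 + 1 * i2, 0.4)          # hidden cell 3
    out4 = step(0 + 1 * i1 + 1 * i2, 1.2)          # hidden cell 4
    return step(0 + 0.6 * out3 - 0.2 * out4, 0.5)  # output cell 5

for i1, i2 in ((0, 0), (1, 0), (0, 1), (1, 1)):
    print(i1, i2, xor_net(i1, i2))   # prints the XOR outputs 0, 1, 1, 0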

Page 42:

REMARK

One might think that having more hidden layers would lead to much more accurate predictions. Vast experience shows, however, that using more than one hidden layer rarely yields a significant improvement over a single hidden layer.