# Introduction

Backpropagation is a popular method for training multi-layer neural networks, and is a classic topic in neural network courses. It is accurate and versatile, though it can be time-consuming to run and complex to implement. Personally, I think if you can figure out backpropagation, you can handle any neural network design.

In order to overcome the XOR problem encountered with earlier single-layer designs (i.e., perceptrons), researchers needed to create multi-layer networks. However, the extra layers introduced the problem of how to train them. Backpropagation solved the problem by propagating error values backward from the output layer through the hidden layers toward the input layer. Why go to all the trouble to make the XOR network? Well, two reasons: (1) a lot of problems in circuit design were solved with the advent of the XOR gate, and (2) the XOR network opened the door to far more interesting neural network and machine learning designs.
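To make the backward pass concrete, here is a minimal sketch of one training step for a 2-2-1 network (two inputs, two hidden nodes, one output). The layout, the sigmoid activation, the squared-error measure, and all names (`forward`, `gradients`, `train_step`, `XOR_SAMPLES`) are assumptions chosen for illustration, not a prescribed design.

```python
import math

# The four XOR input patterns and their targets.
XOR_SAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
               ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, x):
    """Return (hidden activations, output) for a 2-2-1 network."""
    # Each row of w_hidden is [w_x1, w_x2, bias] for one hidden node.
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    # w_out is [w_h1, w_h2, bias] for the single output node.
    out = sigmoid(w_out[0] * hidden[0] + w_out[1] * hidden[1] + w_out[2])
    return hidden, out

def gradients(w_hidden, w_out, x, target):
    """Gradients of the error 0.5 * (out - target)^2 for one sample."""
    hidden, out = forward(w_hidden, w_out, x)
    # Output delta: the error times the sigmoid derivative out * (1 - out).
    delta_out = (out - target) * out * (1.0 - out)
    # Hidden deltas: the output delta is sent backward through the
    # output weights and scaled by each hidden node's own derivative.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(2)]
    grad_out = [delta_out * hidden[0], delta_out * hidden[1], delta_out]
    grad_hidden = [[d * x[0], d * x[1], d] for d in delta_hidden]
    return grad_hidden, grad_out

def train_step(w_hidden, w_out, x, target, rate=0.5):
    """Move every weight a small step against its gradient (in place)."""
    grad_hidden, grad_out = gradients(w_hidden, w_out, x, target)
    for k in range(3):
        w_out[k] -= rate * grad_out[k]
        for j in range(2):
            w_hidden[j][k] -= rate * grad_hidden[j][k]
```

Calling `train_step` repeatedly over `XOR_SAMPLES` is the usual training loop. Whether and how quickly it settles on XOR depends on the starting weights and the learning rate.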

# Number Of Training Samples

The number of output nodes must somehow represent the number of possible outcomes the network should have, so it makes sense to provide at least as many input samples as there are outcomes. If you want the network to have two possible outputs, you'd need only one output node (i.e., binary), so you'd need at least two input samples. If you want six possible outputs, you might choose six output nodes (i.e., only one fires for its category), so you'd need at least six input samples.

Ultimately, we're talking about statistical functions, though, and statistics can only represent reality as well as the sample population does. The "richness" and accuracy of the network depend on having enough samples to represent as many of the patterns the network might encounter as possible.

In some cases it may be impossible to provide every single last input pattern the network can accept. Take visual patterns, for example: the total number of possible patterns is virtually incalculable. You may have to make do with a subset of as many diverse patterns as you can find.

If your input layer has three binary nodes, that implies 8 different possible input patterns (2^3). Using all 8 would be best. The number of possible input patterns doubles with each added input node. For example, six nodes give 2^6 = 64 possibilities. Using all 64 will train your network to recognize the most possible patterns.
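As a small sketch of that counting, assuming each input node takes only the values 0 or 1, the full set of patterns can be generated directly. The helper name `all_binary_patterns` is made up for this example.

```python
from itertools import product

def all_binary_patterns(n_inputs):
    """Return every 0/1 input pattern for n_inputs binary input nodes."""
    return [list(bits) for bits in product([0, 1], repeat=n_inputs)]

patterns = all_binary_patterns(3)
print(len(patterns))                 # 8, i.e. 2^3
print(len(all_binary_patterns(6)))   # 64, i.e. 2^6
```

For inputs that are not binary (real-valued sensor readings, pixel intensities) this enumeration no longer applies, and a representative subset is the practical option.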

The short answer is this: it's better to use as many different input patterns as your network can receive and categorize.

# Hidden Layers

The number of nodes in a hidden layer determines the 'expressive power' of the network. The more hidden layer nodes there are, the more readily a neural net can fit the noise in its training input.

For 'smooth', easy functions with stable, softly changing variables, fewer hidden layer nodes are needed.

But for wildly fluctuating functions, more nodes will be needed.

This really is an important example of moderation. If you have TOO FEW hidden units, the quality of predictions will drop because the net doesn't have enough "brains". And if you have TOO MANY, it will have a tendency to "remember" the right answers rather than predict them. Then your neural net will work very well on familiar data, but will fail on data it has never been presented with before. Too many hidden layer nodes cause the network to "specialize" when it really should "generalize". Neural networks are most often applied to real-world problems, which are fraught with unknowns, so networks are designed to make "educated guesses", not exact answers.

# Activation Functions

## Sigmoid

Range: between 0.0 and 1.0

Format:

y(x) = 1 / (1 + e^-x)

## Bipolar Sigmoid

Range: between -1.0 and 1.0

Format:

y(x) = 2 / (1 + e^-x) - 1
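The two formulas translate almost directly into code. This is a minimal sketch using only the standard library. The function names are chosen for this example.

```python
import math

def sigmoid(x):
    """Standard sigmoid: output stays between 0.0 and 1.0."""
    # Note: math.exp(-x) overflows for very negative x (roughly x < -709),
    # so this direct form is only meant for moderate inputs.
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    """Bipolar sigmoid: output stays between -1.0 and 1.0."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

print(sigmoid(0.0))          # 0.5, the midpoint of the 0..1 range
print(bipolar_sigmoid(0.0))  # 0.0, the midpoint of the -1..1 range
```

The bipolar version is the standard sigmoid stretched by two and shifted down by one, which is why both share the same S shape.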

## Sigmoid Derivative

Range: between 0.0 and 0.25 if the Sigmoid function is used.

between 0.0 and 0.5 if the Bipolar Sigmoid function is used.

Format: note that the derivative is written in terms of either the Sigmoid or the Bipolar Sigmoid function from above.

y'(x) = Sigmoid(x) * (1 - Sigmoid(x))

y'(x) = (1 / 2) * (1 + BipolarSigmoid(x)) * (1 - BipolarSigmoid(x))
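Here is a hedged sketch of both derivatives, each written in terms of its own function's output. For the standard sigmoid the derivative is s * (1 - s). For the bipolar sigmoid it is (1 / 2) * (1 + f) * (1 - f), where f is the bipolar output. The function names are assumptions for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def sigmoid_derivative(x):
    """Slope of the standard sigmoid; peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def bipolar_sigmoid_derivative(x):
    """Slope of the bipolar sigmoid; peaks at 0.5 when x = 0."""
    f = bipolar_sigmoid(x)
    return 0.5 * (1.0 + f) * (1.0 - f)

print(sigmoid_derivative(0.0))          # 0.25
print(bipolar_sigmoid_derivative(0.0))  # 0.5
```

Writing the derivative in terms of the function's output is convenient in backpropagation, because the output has already been computed during the forward pass.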

## Hyperbolic Tangent (TANH)

Range: between -1.0 and 1.0

Format:

y = (e^x - e^(-x)) / (e^x + e^(-x))
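The formula can be checked against the standard library's `math.tanh`. The sketch below also illustrates a known identity: tanh(x) equals the bipolar sigmoid evaluated at 2x. The names `tanh_formula` and `bipolar_sigmoid` are chosen for this example.

```python
import math

def tanh_formula(x):
    """Hyperbolic tangent written out exactly as in the formula above."""
    ex, enx = math.exp(x), math.exp(-x)
    return (ex - enx) / (ex + enx)

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

x = 0.75
print(tanh_formula(x), math.tanh(x))   # the two values agree closely
print(bipolar_sigmoid(2.0 * x))        # and so does the bipolar sigmoid at 2x
```

In practice `math.tanh` is the safer choice, since the written-out form overflows for large |x|.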

# Learning Rate and Network Paralysis

As the network receives training, the weights can be adjusted to very large values. This can force all or most of the neurons to operate with saturated outputs (i.e., in a region where F'(Net) approaches 0). Since the error sent back for training is proportional to F'(Net), the training process comes to a virtual standstill. One way to avoid the problem is to reduce the learning rate, which, however, increases the training time.
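To see how quickly the error signal fades, here is a small sketch that assumes F is the standard sigmoid and evaluates F'(Net) at growing values of Net. The helper name `slope_at` is made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def slope_at(net):
    """F'(Net) for the standard sigmoid."""
    s = sigmoid(net)
    return s * (1.0 - s)

# Larger weights produce larger |Net|; the slope shrinks toward zero.
slopes = [slope_at(net) for net in (0.0, 2.0, 5.0, 10.0)]
for net, slope in zip((0.0, 2.0, 5.0, 10.0), slopes):
    print(net, slope)
```

The slope starts at 0.25 for Net = 0 and is already below 0.0001 at Net = 10, so a weight update scaled by that slope becomes negligible.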

# Momentum

The term "momentum" comes from the analogy of a rolling ball with high momentum passing over a narrow hole; if it moves fast enough, it will avoid getting stuck in the hole.

In neural network design, it describes a term used to reduce the problem of getting trapped in local minima, with the goal of eventually arriving at the global minimum.

Figure 1. Local and global minima are terms used to describe solution states with various levels of error. Learning algorithms seek a resting state where error values no longer decrease; these states appear as wells in the wavy error curve. The global minimum is the ideal solution state, where the algorithm rests at the lowest possible error (ideally zero). Local minima are false solutions, places where the algorithm rests while the error level is still above that minimum.

A network with high momentum responds slowly to new training examples that want to reverse the weights. If momentum is low, then the weights are allowed to oscillate more freely.

Backpropagation adjusts the weights to reach a minimum of the error function. However, the network can become trapped in a local minimum. This can be addressed by adding momentum to the training rule or by applying statistical training methods on top of the backpropagation algorithm.

The addition of the momentum term encourages the system to keep moving in the same direction on the error surface, making it less likely to get trapped in shallow local minima.

Momentum is not something that we NEED, but it can speed up calculations significantly.
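A common way to write the momentum rule is to add a fraction of the previous weight change to the current one. The sketch below assumes that form. The function name, the parameter names, and the default values (`rate=0.1`, `momentum=0.9`) are illustrative choices, not values from the text.

```python
def momentum_update(weight, gradient, previous_delta, rate=0.1, momentum=0.9):
    """Return (new_weight, delta) for one weight using a momentum term.

    delta = -rate * gradient + momentum * previous_delta
    """
    delta = -rate * gradient + momentum * previous_delta
    return weight + delta, delta

# With a steady gradient, each step is larger than the last because
# part of the previous step is carried forward.
weight, delta = 1.0, 0.0
steps = []
for _ in range(3):
    weight, delta = momentum_update(weight, gradient=1.0, previous_delta=delta)
    steps.append(delta)
print(steps)
```

Setting `momentum=0.0` reduces this to the plain gradient step, which matches the point that momentum is optional. When the gradient keeps reversing sign, the carried-over term partly cancels the new step, which damps oscillation.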
