XOR Neural Network: A Case Study in Non-Linearity
Brain Dump
Sep 6, 2024
What we're doing:
The simple diagram shows:
What is an activation function?
In simple terms,
A neuron in the brain receives signals and emits a signal of its own. When a neuron receives an input, it combines that signal with its bias, processes the result, and passes the output along to neighboring neurons. In an artificial neural network, this processing step is performed by the activation function.
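To make that concrete, here is a minimal sketch (Python/NumPy; the `neuron` helper and all of its numbers are purely illustrative) of a single artificial neuron: it combines its inputs with weights and a bias, then passes the result through an activation function such as the sigmoid.

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    # 1. Combine the incoming signals: weighted sum plus bias
    z = np.dot(weights, inputs) + bias
    # 2. "Process" the combined signal with the activation function
    return sigmoid(z)

# Example: a neuron with two inputs (all values made up)
print(neuron(np.array([0.5, 1.0]), np.array([0.8, -0.3]), bias=0.1))
# -> a value between 0 and 1
```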
Role of the activation function:
Meaning of activation:
The process:
The importance:
Linearity & Non-linearity?
Understanding linearity and non-linearity is crucial in grasping why neural networks are so powerful. Let's break this down step by step.
Linearity:
In simpler terms, a linear function can be represented by a straight line in a 2D plane, or a flat plane in higher dimensions.
A linear function has two key properties:
1. Additivity: f(x + y) = f(x) + f(y)
2. Homogeneity: f(a * x) = a * f(x)
Limitations of Linear Models:
Because a composition of linear functions is itself linear, stacking linear layers with no activation in between is no more expressive than a single linear layer. A linear model can only carve the input space with straight-line (or flat-plane) decision boundaries, which is exactly why it cannot solve XOR. A quick numerical check of this collapse is sketched below.
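Here is a minimal NumPy sketch of that "stacking collapses" point; the layer sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

def two_linear_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# The same mapping rewritten as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=2)
print(two_linear_layers(x))  # identical to...
print(W @ x + b)             # ...the single-layer version
```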
Non-linearity:
A non-linear function doesn't follow the properties of linearity. It can have curves, bends, or more complex shapes. Examples of non-linear functions:
- Sigmoid: f(x) = 1 / (1 + e^(-x))
- Tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- ReLU: f(x) = max(0, x)
- Quadratic: f(x) = x^2
Power of Non-linearity:
This problem was defined earlier in this post.
In this XOR diagram, the four input points and their target labels are (0, 0) → 0, (0, 1) → 1, (1, 0) → 1, and (1, 1) → 0, so the points labeled 1 lie on one diagonal of the unit square and the points labeled 0 on the other.
A linear model cannot draw a single straight line that separates these points correctly, but a non-linear model can create a curved boundary that solves the problem.
By using non-linear activation functions like sigmoid, our neural network can learn to create these complex decision boundaries, enabling it to solve problems like XOR and many other real-world tasks that require non-linear solutions.
Sigmoid function:
f(x) = 1 / (1 + e^(-x))
The sigmoid function is used as an activation function for several important reasons:
1. It is non-linear, so stacked layers can learn non-linear decision boundaries.
2. It is smooth and differentiable everywhere, which gradient-based training requires.
3. Its output is bounded between 0 and 1, which can be read as a probability or a firing rate.
As we discussed earlier, the non-linear nature of this function is crucial. The sigmoid is also often considered biologically plausible as a model of neural activation, but it's important to understand its context and limitations.
Why sigmoid is not always preferred:
1. Vanishing gradient problem:
The derivative of the sigmoid, f'(x) = f(x) * (1 - f(x)), is at most 0.25 and shrinks toward 0 for large positive or negative inputs. When these small factors are multiplied together layer after layer during backpropagation, the gradient reaching the earlier layers can become vanishingly small, making deep sigmoid networks slow or impossible to train. A quick numeric check is sketched below.
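A small sketch evaluating that derivative at a few points (the inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # f'(x) = f(x) * (1 - f(x))

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))
# 0.0  -> 0.25      (the maximum possible value)
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~0.000045 (the gradient has essentially vanished)
```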
2. Not zero-centered:
The sigmoid output is always positive, which can cause zig-zagging dynamics in gradient descent.
This image shows the trajectory of weight updates in a 2D weight space (w₁ and w₂).
The zig-zag pattern occurs because every input to a neuron (the previous layer's sigmoid outputs) is positive. During backpropagation, the gradient for each of that neuron's weights is the same upstream error signal multiplied by a positive input, so all of the weight gradients share one sign. Each update can therefore only move the weights in an all-positive or all-negative direction, and reaching an optimum that requires increasing one weight while decreasing another forces the zig-zag path shown above. A small sketch of this shared-sign effect follows.
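A tiny sketch of the shared-sign effect; the hidden activations and error values here are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden activations from a sigmoid layer: always strictly positive
h = sigmoid(np.array([-2.0, 0.5, 3.0]))

# For a downstream neuron with pre-activation z = w . h,
# the gradient is dL/dw = (dL/dz) * h. Whatever the sign of dL/dz,
# every component of dL/dw ends up with that same sign.
for upstream_error in (+1.3, -0.7):
    grad_w = upstream_error * h
    print(np.sign(grad_w))  # [1. 1. 1.] or [-1. -1. -1.]
```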
3. Computationally expensive:
Exponential calculations in sigmoid are more costly than ReLU's simple max operation.
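A rough micro-benchmark sketch of that cost difference (the array size and repeat count are arbitrary, and exact numbers depend on hardware):

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000)

sigmoid_time = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)
relu_time = timeit.timeit(lambda: np.maximum(x, 0.0), number=100)

print(f"sigmoid: {sigmoid_time:.3f}s  relu: {relu_time:.3f}s")
# ReLU is typically several times faster than sigmoid on the same input.
```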
Anyway, back to the XOR problem: we want the network to output 0 for the inputs (0, 0) and (1, 1), and 1 for the inputs (0, 1) and (1, 0).
Let's see how we can implement a neural network that figures this XOR operation out.
The following code material is taken from the CSCE-598 deep learning class taught by Dr. Maida.
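That course code isn't reproduced here; as a stand-in, here is a minimal from-scratch sketch of a 2-2-1 sigmoid network trained on XOR with plain gradient descent. The architecture, learning rate, and seed are illustrative choices of mine, not Dr. Maida's code.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2 inputs -> 2 hidden sigmoid units -> 1 sigmoid output
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

lr = 0.5
for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)      # hidden activations, shape (4, 2)
    out = sigmoid(h @ W2 + b2)    # network output, shape (4, 1)

    # Backward pass (squared-error loss)
    err = out - y
    d_out = err * out * (1 - out)          # gradient at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)     # gradient at the hidden pre-activation

    # Gradient descent updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 3))  # should be close to [[0], [1], [1], [0]]
```

With only two hidden units (the smallest network that can solve XOR), an unlucky initialization can occasionally get stuck in a local minimum; re-running with a different seed, or adding a couple more hidden units, usually fixes that.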
#neuralNetwork #XOR #Linearity #NonLinearity