Neural Network 101. Chapter 2.

We are going to look at a real example: property valuation.

The network takes some parameters and values the property.

The independent variables are one row of a database that describes the property.

In this example the network has an input layer and an output layer, but no hidden layer.

Our input layer:
X1 => Area in square feet
X2 => Number of bedrooms
X3 => Distance from the city
X4 => Age of the property
The output layer is the price of the property.
The input variables are weighted by the synapses and summed to give the output layer:
price = w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4
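
As a quick illustration, the weighted sum can be written in a few lines of Python. The input values and weights here are made up for the example; in a real network the weights are learned from data.

```python
# One property (made-up values): area in sq ft, bedrooms, distance to city, age
x = [1500.0, 3.0, 10.0, 20.0]

# Hypothetical weights, one per input; a trained network would learn these
w = [200.0, 10000.0, -3000.0, -500.0]

# price = w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4
price = sum(wi * xi for wi, xi in zip(w, x))
print(price)
```

The negative weights on distance and age are only an assumption that a property far from the city, or an older one, is worth less.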

Here we can use:

Any of the activation functions
Logistic regression
Even without a hidden layer, this representation already covers most other machine learning algorithms.
With neural networks, the hidden layer adds flexibility and power, and that is where the increase in accuracy comes from.

The Power is the HIDDEN LAYER

How does the hidden layer give us this extra power?

Every neuron in the input layer has synapses (the connections between neurons) to each neuron in the hidden layer.
These synapses have weights.
Some weights have a non-zero value.
Some weights have a zero value.
Not all inputs matter equally to a given neuron.
Some inputs are important.
Some are just noise.
For example, one neuron in the hidden layer might work like this:
Area in sq. ft and distance to the city are important features for determining the price.
Big properties near the city are expensive.
Number of bedrooms and age of the property are not crucial for this neuron.
A 10-bedroom farmhouse could be cheaper than a 5-bedroom property in the city.

Another neuron in the hidden layer could pick up 3 attributes: area, number of bedrooms and age.

Maybe the dataset has a pattern like this:
In some cities, bigger families are looking for larger, newer properties with more bedrooms.
There is a preference for newer properties over old ones.
This neuron has picked up that pattern.
This neuron combines these 3 parameters into a brand new parameter that helps with valuing the property.

Some other neuron could pick up just the age. Why?

New properties are expensive.
Properties more than 100 years old could be deemed historic properties that tell a story.
The price drops with age until about 99 years and then shoots up.

Moreover, neural networks can pick up combinations of the 4 features, for example:

Just bedrooms and distance to city
Area, bedrooms and age
Area, bedrooms, distance to city and age
and so on.

The hidden layer lets the neural network look for very specific things, and the power comes from combining them.
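
Here is a minimal sketch of that idea in plain Python. Everything in it is an assumption made for illustration: the weights are invented, there are only three hidden neurons, and the activation is a rectifier (ReLU). A zero weight means the neuron ignores that input.

```python
def relu(z):
    # Rectifier activation: keeps positive values, turns negatives into 0
    return max(0.0, z)

# One property (made-up values): area in sq ft, bedrooms, distance to city, age
x = [1500.0, 3.0, 10.0, 20.0]

# Hypothetical weights, one row per hidden neuron
hidden_weights = [
    [0.002, 0.0, -0.1, 0.0],    # looks at area and distance only
    [0.001, 0.5, 0.0, -0.05],   # looks at area, bedrooms and age
    [0.0, 0.0, 0.0, 0.01],      # looks at age only
]
# Hypothetical weights from each hidden neuron to the output (price)
output_weights = [100000.0, 50000.0, -20000.0]

# Each hidden neuron builds its own new feature from the inputs it cares about
hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]

# The output neuron combines those new features into a price
price = sum(w * h for w, h in zip(output_weights, hidden))
print(hidden, price)
```

No single hidden value is a price, but the weighted combination of all three is.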

It's like ants: a single ant can't build an anthill, but hundreds or thousands of them together can.

No single neuron can predict the price by itself, but together they can.

They can do quite an accurate job if trained and set up properly.

How do Neural Networks learn?

There are two fundamentally different ways of getting a program to do what you want.

Hard-coded rules, where you tell the program the specific rules and the outcomes you expect, and you guide it the whole way.
A neural network, where you give the program the means to work out what to do on its own.
Create the neural network
Provide inputs
Tell it what you want as output
Let it figure out the rest itself

How do you create a network that learns on its own?

How do you distinguish between a dog and a cat?

Hard-coded rules:

def identify_animal(has_barks, has_meows, likes_bones, likes_fish, wags_tail):
  if has_barks and likes_bones and wags_tail:
    return "Dog"
  elif has_meows and likes_fish and not wags_tail:
    return "Cat"
  else:
    return "Unknown"

Neural Network:

Code the architecture of the neural network
Point the neural network at folders of categorized images:
folder of dogs
folder of cats
You tell it: here are images of cats and dogs.
Go and learn what a cat is.
Go and learn what a dog is.
The neural network works out on its own everything it needs to understand.
Once trained, give it a new image of a cat or a dog and it will be able to tell which one it is.

To keep the code short, the example below uses five hand-made features instead of images.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Create a simple neural network
model = Sequential([
  Dense(64, activation='relu', input_shape=(5,)),
  Dense(32, activation='relu'),
  Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Example data (you'd normally have much more data)
# Format: [barks, meows, likes_bones, likes_fish, wags_tail]
X_train = [
  [1, 0, 1, 0, 1], # Dog
  [0, 1, 0, 1, 0], # Cat
  [1, 0, 1, 0, 0], # Dog
  [0, 1, 0, 1, 1], # Cat (unusual cat that wags tail)
]
y_train = [1, 0, 1, 0] # 1 for Dog, 0 for Cat

# Convert the Python lists to float32 tensors
X_train = tf.convert_to_tensor(X_train, dtype=tf.float32)
y_train = tf.convert_to_tensor(y_train, dtype=tf.float32)

# Train the model
model.fit(X_train, y_train, epochs=100, verbose=0)

# Test the model
test_animal = tf.convert_to_tensor([[1, 0, 1, 0, 1]], dtype=tf.float32) # A dog
prediction = model.predict(test_animal)
print(f"Probability of being a dog: {prediction[0][0]:.2f}")

How does the neural network work?

This is called a single-layer feedforward neural network, or perceptron.

Why y^ instead of y?

y stands for the actual value, the one observed in reality.

y^ is the predicted value, the output of the model.

The perceptron was invented by Frank Rosenblatt in 1957.

The whole idea was to invent something that can learn by itself.

Let's see how a perceptron learns.

Input values are supplied to the perceptron
The activation function is applied
Then we have an output.

In order to learn, we need to compare the output value with the actual value that we want the network to produce.

Comparing y and y^, we get a difference.

We then calculate the:

Cost Function

The cost function tells us the error in our prediction, and our goal is to minimize it, because the lower the cost function, the closer y^ is to y.

Cost function = 1/2 (y^ - y)^2
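
A small worked example of this formula in Python. The numbers are made up; `cost` is just a helper name chosen for this sketch.

```python
def cost(y_hat, y):
    # Squared-error cost for one row: 1/2 * (y^ - y)^2
    return 0.5 * (y_hat - y) ** 2

# Prediction is off by 10,000: the cost is large
print(cost(290000.0, 300000.0))

# Perfect prediction: the cost is zero
print(cost(300000.0, 300000.0))
```

Because the difference is squared, it does not matter whether the prediction is too high or too low; only the size of the error counts.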

Once we have the value of the cost function, we feed this information back into the neural network.

It goes back and the weights get updated.

The only thing we have control over in this very simple neural network is the weights.

Our goal is to minimize the cost function, so we tweak the weights a little at a time.

So far, we are dealing with only 1 row of data.

The same row of data goes through multiple iterations. Each time the weights are updated, until the cost function is as small as we can make it.

One row of data is fed into our neural network
Input values get multiplied by the weights
The activation function is applied
We get y^
y^ is compared to y
We calculate our cost function
We feed the information back to the neural network
We adjust the weights
We repeat the same process again and again with the same row of data
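
The loop above can be sketched in plain Python. Several things here are assumptions for the sake of a small runnable example: the inputs are made-up scaled values, the activation is the identity (the output is just the weighted sum), and the weight update is a plain gradient step with an arbitrary learning rate of 0.5.

```python
x = [0.5, 0.3, 0.2, 0.1]   # one row of inputs (made-up, already scaled)
y = 1.0                    # actual value for that row
w = [0.0, 0.0, 0.0, 0.0]   # starting weights
learning_rate = 0.5        # assumed step size

costs = []
for step in range(100):
    # Inputs are multiplied by the weights (identity activation assumed)
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    # y^ is compared to y and the cost is calculated
    error = y_hat - y
    costs.append(0.5 * error ** 2)
    # The error is fed back: the slope of the cost for weight i is error * x_i,
    # so each weight is nudged in the opposite direction
    w = [wi - learning_rate * error * xi for wi, xi in zip(w, x)]

print(f"first cost: {costs[0]}, last cost: {costs[-1]:.3e}")
```

Each pass uses the same row, and the cost shrinks on every pass until y^ is practically equal to y.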

What happens with multiple rows of data?

1 epoch is one pass through the whole dataset.

We calculate y^ for:

First row of data
2nd row of data
3rd row of data
... up to the nth row of data

For every single row we also have the actual value y.

Based on all the differences between y^ and y, we can calculate the cost function,

which is the sum of 1/2 (y^ - y)^2 over all rows:

diff of y1^ - y1
diff of y2^ - y2
diff of y3^ - y3
...
diff of yn^ - yn
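
For a concrete feel, here is the summed cost for four made-up rows. The predictions and actual values are invented for this sketch.

```python
y_hat = [2.5, 0.0, 2.0, 8.0]    # predicted values, one per row (made-up)
y = [3.0, -0.5, 2.0, 7.0]       # actual values, one per row (made-up)

# Sum of 1/2 * (y^ - y)^2 over all rows
total_cost = sum(0.5 * (p - a) ** 2 for p, a in zip(y_hat, y))
print(total_cost)
```

The third row is predicted perfectly and adds nothing; the fourth row is off by 1 and contributes the most.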

Now that we have the full cost function, we go back and update the weights w1, w2, w3, ..., wn.

We iterate this process until the cost function is minimized, no matter how many rows we have.

The goal is to minimize the cost function.

This whole process of feeding the error back and adjusting the weights is called backpropagation.

The question is: how?

How can we minimize the cost function?

Brute force approach:
Take lots of different possible weights
Try 1000s of them
Using the formula 1/2 (y^ - y)^2, plot the cost for each candidate weight and pick the weight where the curve is lowest
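
With a single weight, brute force is easy to try out. This sketch assumes a one-input model y^ = w * x, a made-up row of data, and 1000 evenly spaced candidate weights.

```python
x, y = 2.0, 6.0   # one made-up row: input and actual value

# 1000 candidate weights: 0.00, 0.01, ..., 9.99
candidates = [i / 100 for i in range(1000)]

# Cost of every candidate, using 1/2 * (y^ - y)^2 with y^ = w * x
costs = [0.5 * (w * x - y) ** 2 for w in candidates]

# The best weight is the one with the lowest cost
best_w = candidates[costs.index(min(costs))]
print(best_w, min(costs))
```

For one weight this is only 1000 evaluations, which is why brute force looks harmless at first.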

For this simple neural network:

The neural network in the example has 30 weights in total:

4 input neurons × 6 hidden neurons = 24 weights
6 hidden neurons × 1 output neuron = 6 weights
Total: 24 + 6 = 30 weights
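
The same count in code, as a quick check. This assumes a fully connected network and counts only the synapse weights (no bias terms, which the text does not mention).

```python
n_inputs, n_hidden, n_outputs = 4, 6, 1

# Every input neuron connects to every hidden neuron
input_to_hidden = n_inputs * n_hidden

# Every hidden neuron connects to the output neuron
hidden_to_output = n_hidden * n_outputs

total_weights = input_to_hidden + hidden_to_output
print(input_to_hidden, hidden_to_output, total_weights)
```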

Say we want to try 1000 different values for each of these 30 weights.

With 30 weights and 1000 candidate values each, brute force has 1000^30 combinations to test.
Total combinations to try: 1000^30 = 10^90 (that's 1 followed by 90 zeros!)

To give an idea of how large 1000^30 is, let's use the world's fastest supercomputer:

It can do about 1 quintillion (1,000,000,000,000,000,000) calculations per second.
Even if this supercomputer could test one combination in one calculation:
Time to test all combinations: 10^72 seconds, which is about 3 × 10^64 years (a 3 followed by 64 zeros)
To put this in perspective:
The universe is only about 13.8 billion years old.
This is like waiting for the entire lifetime of our universe to pass, then repeating that wait about 2 × 10^54 more times!
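
The arithmetic can be checked with Python's big integers. The sketch assumes one combination per calculation and a year of 365.25 days; the result is on the order of 10^64 years.

```python
combinations = 1000 ** 30                 # 1000 values for each of 30 weights
ops_per_second = 10 ** 18                 # about 1 quintillion calculations per second
seconds_per_year = 365.25 * 24 * 3600     # assumed length of a year in seconds

seconds = combinations // ops_per_second  # total seconds needed
years = seconds / seconds_per_year
universe_ages = years / 13.8e9            # universe is about 13.8 billion years old

print(f"{years:.2e} years, or {universe_ages:.2e} lifetimes of the universe")
```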

This mind-boggling amount of time shows why we can't just try every possible combination of weights in a neural network. It's why we need smarter methods, like gradient descent, to train neural networks efficiently.

In essence, as we add more dimensions (weights in this case), the problem becomes exponentially more complex - that's the curse of dimensionality in action!

Gradient Descent

It's a method for finding the minimum of a cost function.
The gradient is the slope of the cost function at a given point. It points in the direction of steepest ascent, so moving against it takes us downhill.

How it works:

Start at a random point on the cost function.
Calculate the slope (gradient) at that point.
Move in the direction of the negative gradient (downhill).
Repeat this process, each time recalculating the gradient and moving downhill.
Analogy:
Imagine you're on a hill and want to reach the bottom.
You look around to see which way is downhill and take a step in that direction.
Keep doing this until you reach the bottom (minimum).
Characteristics:
The path is often zig-zaggy rather than a straight line to the minimum.
The step size (learning rate) can be adjusted - bigger steps at first, smaller as you get closer to the minimum.
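
A minimal sketch of gradient descent on a one-weight cost function. The assumptions: the model is y^ = w * x with one made-up row of data, the starting point is fixed at 0 instead of random so the run is repeatable, and the learning rate and number of steps are arbitrary. `gradient_descent` and `cost_gradient` are names made up for this example.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=50):
    w = start
    for _ in range(steps):
        # Move against the gradient, i.e. downhill
        w = w - learning_rate * gradient(w)
    return w

x, y = 2.0, 6.0   # one made-up row: input and actual value

def cost_gradient(w):
    # Slope of the cost 1/2 * (w * x - y)^2 with respect to w
    return (w * x - y) * x

w_min = gradient_descent(cost_gradient, start=0.0)
print(w_min)
```

This takes 50 gradient evaluations to land very close to the best weight, compared with 1000 cost evaluations for a brute-force sweep of a single weight. The gap grows enormously as more weights are added.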

Advantages of Gradient Descent:
Much more efficient than trying every possible combination of weights (brute force).
Works well in high-dimensional spaces where brute force is impossible.
Variations:

Stochastic gradient descent, a variation of this method, is the topic of the next discussion.

This method is crucial in machine learning as it provides an efficient way to optimize complex models with many parameters, like neural networks, without having to exhaustively search all possible combinations.