The Battle of Activation Functions. Neural Network 101. Chapter 3
Prashant Basnet
Sep 23, 2024
This chapter focuses on the mathematical foundations of neural network activation functions. We'll start with simple transformation rules on functions, then explore even, odd, sigmoidal & hyperbolic tangent functions.
Simple transformations on functions
These are basic rules for shifting, stretching, & compressing the graph of a function.
First, let's address the transformation rules:
y = f(x) + c shifts the graph vertically by c
y = f(x + c) shifts the graph horizontally by c
y = a · f(x) stretches the graph vertically when a > 1
y = f(ax) compresses the graph horizontally when a > 1
Let's see how things change when we apply one of these rules:
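As a rough sketch of the last rule (the one we'll use below), assuming NumPy and taking f(x) = e^(-x) as an example of my own choosing, here is the horizontal compression y = f(ax) with a = 2 in action:

```python
import numpy as np

# Example function of our own choosing: f(x) = e^(-x)
def f(x):
    return np.exp(-x)

x = np.array([0.0, 0.5, 1.0, 2.0])

# f(2x) is f(x) compressed horizontally by a factor of 2 (a = 2 > 1):
# every value of the original graph is reached at half the original x.
print("f(x) :", f(x))       # ≈ [1.000 0.607 0.368 0.135]
print("f(2x):", f(2 * x))   # ≈ [1.000 0.368 0.135 0.018]
```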
Now, let's look at the exponential probability density function:
p(x; λ) = 1_{x ≥ 0} · λe^(-λx), where 1_{x ≥ 0} is the indicator function: it equals 1 when x ≥ 0 and 0 otherwise.
Here, λ (lambda) is used in two places: as the factor multiplying the whole expression, & inside the exponent, where it controls how fast e^(-λx) shrinks as x grows.
What does it mean to "decay"?
From the transformation rule y = f(ax), where a > 1:
In our case, f(x) = e^(-x), and we're looking at f(λx) = e^(-λx).
When λ > 1, it acts as a horizontal compression factor: the graph of e^(-x) gets squeezed toward the y-axis.
The effect on the decay rate: the compressed curve reaches small values at smaller x, so the larger λ is, the faster e^(-λx) decays toward zero.
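To make this concrete, here is a minimal numeric sketch (λ values and sample points chosen by me, assuming NumPy) comparing λ = 1 with λ = 2:

```python
import numpy as np

# p(x; λ) = 1_{x >= 0} · λ e^(-λx)
def exp_pdf(x, lam):
    return np.where(x >= 0, lam * np.exp(-lam * x), 0.0)

x = np.array([0.0, 1.0, 2.0, 3.0])

print(exp_pdf(x, lam=1.0))  # ≈ [1.000 0.368 0.135 0.050]
print(exp_pdf(x, lam=2.0))  # ≈ [2.000 0.271 0.037 0.005]  <- drops toward 0 much faster
```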
Even Functions:
A function is even if & only if f(x) = f(-x) for all x in the domain of f. Geometrically, an even function is symmetric about the y-axis. For example, cos(x) is an even function.
Another example of an even function: if we take the same function and flip it about the y-axis, the graph is unchanged, precisely because f(x) = f(-x).
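A quick numerical sanity check of the even property for cos(x) (my own snippet, assuming NumPy):

```python
import numpy as np

x = np.linspace(-5, 5, 101)

# Even: f(x) == f(-x) for every x, so the graph is symmetric about the y-axis.
print(np.allclose(np.cos(x), np.cos(-x)))  # True
```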
Odd Functions:
A function f(x) is odd if & only if f(-x) = -f(x) for all x in the domain of f.
Original Function:
Now, let's flip it about the x-axis:
Finally, let's flip it about the y-axis:
Meaning we mirror the right part to the left & the left to the right. For an odd function, flipping about the x-axis and then about the y-axis brings back the original graph, since -f(-x) = f(x).
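Here is a small sketch of both properties, using f(x) = x^3 as an example of my own (not a function from the chapter's figures), assuming NumPy:

```python
import numpy as np

def f(x):
    return x ** 3  # an odd function

x = np.linspace(-5, 5, 101)

# Odd: f(-x) == -f(x) for all x.
print(np.allclose(f(-x), -f(x)))  # True

# Flip about the x-axis (negate the output), then about the y-axis (negate the
# input): for an odd function this two-step flip recovers the original graph.
print(np.allclose(-f(-x), f(x)))  # True
```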
Sigmoidal functions:
These are S-shaped curves used as activation functions in neural networks. They help introduce non-linearity, allowing networks to learn complex patterns.
1. Logistic Sigmoid Function:
The logistic sigmoid is defined as σ(x) = 1 / (1 + e^(-x)); its graph is an S-shaped curve that rises from 0 toward 1.
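A minimal implementation (my own sketch, assuming NumPy), showing that the outputs stay strictly between 0 & 1:

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))  # ≈ [0.000 0.269 0.500 0.731 1.000]  -> always in (0, 1)
```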
2. Hyperbolic Tangent (tanh) Function:
This function is very similar to the logistic sigmoid, as its graph is also S-shaped, but it ranges from -1 to 1 instead of 0 to 1.
These graphs highlight the key differences between the two functions, particularly their ranges & centers. The sigmoid function is useful when you need outputs between 0 & 1, like probabilities, while tanh is often preferred in neural networks due to its zero-centered nature, which can help with faster convergence during training.
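The range difference is easy to check numerically; a small sketch of my own (assuming NumPy) comparing the two on the same symmetric inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)

# Sigmoid outputs live in (0, 1) and are centered around 0.5;
# tanh outputs live in (-1, 1) and are centered around 0 ("zero-centered").
print(sigmoid(0.0), np.tanh(0.0))            # 0.5  0.0
print(sigmoid(x).mean(), np.tanh(x).mean())  # ≈ 0.5 vs ≈ 0.0
```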
What is convergence anyways?
When we talk about convergence in neural network training, we are referring to how quickly & efficiently the network learns to minimise its error or loss function. Faster convergence means the network reaches its optimal performance in fewer training iterations.
Also, why does this zero-centering matter?
In a neural network, the output of one layer becomes the input to the next layer.
When these values are centered around zero, the inputs to the next layer are a mix of positive & negative numbers, so the gradients on that layer's weights are not all forced to share the same sign (which is what happens when every input is positive, as with sigmoid outputs). This keeps the weight updates better balanced and tends to speed up convergence.
Relationship between logistic sigmoid & tanh:
If we plot the two functions along with their derivatives, notice how the tanh derivative is steeper near x = 0 compared to the sigmoid derivative.
This steeper gradient often translates to faster learning in the critical region around zero. While tanh can often lead to faster convergence, modern neural networks frequently use other activation functions like ReLU (Rectified Linear Unit) or its variants, which can offer even better performance in many scenarios.
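To put numbers on "steeper near zero", here is a small sketch of my own using the standard derivative identities σ'(x) = σ(x)(1 - σ(x)) and tanh'(x) = 1 - tanh(x)^2 (not stated in the chapter), assuming NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # σ'(x) = σ(x)(1 - σ(x)), peaks at 0.25

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # tanh'(x) = 1 - tanh(x)^2, peaks at 1.0

for x in [0.0, 0.5, 1.0]:
    print(x, round(sigmoid_grad(x), 3), round(tanh_grad(x), 3))
# At x = 0 the tanh gradient is 1.0 versus 0.25 for the sigmoid: four times steeper.
```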
We can say that tanh is a rescaled & shifted version of the logistic sigmoid.
Let's break it down mathematically:
Logistic sigmoid function: σ(x) = 1 / (1 + e^(-x))
Hyperbolic tangent function: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
tanh(x/2) = 2 * σ(x) - 1
To see why: 2σ(x) - 1 = (1 - e^(-x)) / (1 + e^(-x)), and multiplying the top & bottom by e^(x/2) gives (e^(x/2) - e^(-x/2)) / (e^(x/2) + e^(-x/2)) = tanh(x/2). Equivalently, tanh(x) = 2σ(2x) - 1, which is exactly the rescaling & shifting mentioned above.
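A quick numerical check of this identity (my own sketch, assuming NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 1001)

# tanh(x/2) = 2·σ(x) - 1: tanh is the sigmoid rescaled to (-1, 1) and re-centered at 0.
print(np.allclose(np.tanh(x / 2), 2 * sigmoid(x) - 1))  # True
```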
In the next chapter, we will cover other activation functions in more depth.
#NeuralNetworks101 #ActivationFunctions #MachineLearning #DeepLearning #SigmoidFunction #TanhFunction #MathForAI #DataScience #ArtificialIntelligence #ComputerScience