The Building Blocks of Computer Vision: CNN Architecture 101
Prashant Basnet
Nov 5, 2024
How We See Things: The Brain's Pattern Recognition Game
Have you ever looked at clouds and seen shapes in them? Or perhaps you've encountered those fascinating optical illusions where an image can be interpreted in two completely different ways?
These everyday experiences reveal something remarkable about how our brains process visual information - and surprisingly, it's not so different from how modern artificial intelligence works. By breaking down complex concepts into digestible pieces, we'll discover how computers have learned to see in ways that mirror our own visual processing system.
What do you see when you look into this image?
Are you seeing a person looking to the right?
Or a person looking at you?
This illustrates what our brain looks for when we see things.
The answer is features. Depending on the features it detects and processes, the brain categorizes things in certain ways.
A young lady with a feather in her hair looking away? Or an old lady with a scarf looking down?
If you look at the nose, mouth, eyes, and hair in the image above, the brain identifies these features and concludes it's a face.
But if you see a lady - one hand held to her mouth, another hand on the ground, legs crossed beneath her - then it's a lady looking away.
All these examples illustrate a simple idea of how the brain works:
it processes certain features of whatever you see in real life and tries to classify it.
How does a convolutional neural network work?
Input image -> goes through Convolutional Neural Network -> Output (classifies label)
Let's consider a scenario where a neural network has been trained to categorize facial expressions and emotions.
How is a neural network able to recognize these features?
A neural network leverages the fact that an image's pixels can be converted into a multi-dimensional array.
For a colored image:
If we boil it down to its most basic form, this illustrates how an image is converted into a format that a Convolutional Neural Network can process.
The Pixel Chessboard
Imagine a chessboard where each square represents a pixel:
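As a minimal sketch of this idea (the chessboard values here are made up for illustration), a grayscale image is just a 2D grid of pixel intensities, and a colored image adds a third axis for the RGB channels:

```python
import numpy as np

# A hypothetical 8x8 grayscale "chessboard" image: each entry is one pixel,
# 0 = black, 255 = white. A real photo is just a much larger such grid.
board = np.zeros((8, 8), dtype=np.uint8)
board[1::2, ::2] = 255   # white squares on odd rows
board[::2, 1::2] = 255   # white squares on even rows

print(board.shape)  # (8, 8) -> height x width
# A colored image would add a channel axis, e.g. shape (8, 8, 3) for RGB.
```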
Convolution:
A convolution is the combined integration of two functions; it shows how one function modifies the shape of the other.
What is convolution in intuitive terms? It means sliding a small filter over the image and, at each position, measuring how strongly the underlying patch matches the filter's pattern.
What is feature detector?
It is usually a 3 × 3 matrix, but it could also be 5 × 5 or 7 × 7.
There are many different terms for a feature detector: kernel and filter are used interchangeably.
We perform element-wise multiplication of the input matrix with the feature detector matrix and sum the results to get the feature map, also called the convolved feature or activation map. The step size by which we move the filter is called the stride.
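The element-wise multiply-and-sum step can be sketched in plain NumPy (the input values and the cross-shaped detector below are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, taking the element-wise product and
    summing at each position. The resulting grid is the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            fmap[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return fmap

# A 7x7 input and a 3x3 feature detector, as in the example above.
image = np.arange(49).reshape(7, 7)
detector = np.array([[0, 1, 0],
                     [1, 1, 1],
                     [0, 1, 0]])
print(convolve2d(image, detector, stride=1).shape)  # (5, 5)
print(convolve2d(image, detector, stride=2).shape)  # (3, 3)
```

Notice how a larger stride shrinks the feature map more aggressively, which is why stride 2 is common for big images.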
What have we done here?
An important effect of the feature detector is that it makes the representation smaller, which results in easier and faster processing.
Here, in this example, we have a 7 × 7 image. But imagine a matrix of 256 × 256 or 300 × 300 pixels - that's a huge image. Therefore, a stride of 2 is very common.
Do we lose information when applying a feature detector?
Yes. When a big matrix is reduced to a small matrix, we are definitely going to lose some information. But at the same time, the purpose of the feature detector is to detect certain features that are integral. The highest number in the feature map occurs where the pattern matches up.
The way we recognize things is that we don't look at every single pixel; we look at features - for example, eyes and a scarf to detect a face, or hands to detect a body. That is what we preserve in the feature map: it lets us keep the important features and get rid of unnecessary information.
We create multiple feature maps because we use multiple different filters. This is also a way to preserve information about that particular image.
Feature map: a human analogy:
Even as humans, we don't process everything entering our senses at any given time - every pixel and dot fed into the brain from our eyes, every sound fed through our ears, every smell fed through our nose. We filter most of the information out, mostly the unnecessary parts, and keep only the important features.
An example of a feature map: Emboss filter
When we apply a feature detector to a real image:
The original image after applying an emboss feature detector array:
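As a sketch of what the emboss result shows, here is one commonly used 3 × 3 emboss kernel applied to a tiny made-up image with a vertical edge (both the kernel choice and the image values are illustrative assumptions):

```python
import numpy as np

# A common 3x3 emboss kernel: it subtracts pixels on one side of each point
# and adds pixels on the other, giving edges a raised, shadowed look.
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])

# Tiny stand-in "image" with a vertical edge (dark left half, bright right half).
image = np.array([[0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9]], dtype=float)

# Apply the kernel at every valid position (no padding, stride 1).
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * emboss)
print(out)  # large responses where the kernel straddles the edge
```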
Rectifier:
Once we have our convolutional layer of feature maps, we apply a rectifier.
Why do we want to apply a rectifier?
We want to increase non-linearity in our network; the rectifier acts as a function that breaks linearity. The reason we want to break linearity is:
Images themselves are highly non-linear, but when we apply a feature detector we risk creating something linear, so we apply a rectifier afterwards to make the output non-linear again.
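The standard rectifier is the ReLU function, which simply zeroes out every negative value in the feature map - a minimal sketch with made-up values:

```python
import numpy as np

# The rectifier (ReLU) replaces every negative value in the feature map
# with 0, which makes the layer's output non-linear in its input.
def relu(feature_map):
    return np.maximum(0, feature_map)

fmap = np.array([[-3.0, 1.5],
                 [ 0.0, -0.5]])
print(relu(fmap))  # negatives become 0; positives pass through unchanged
```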
Pooling (Down sampling):
Pooling is a technique used in convolutional neural networks (CNNs) that introduces spatial invariance into the network, enabling it to detect features regardless of their exact position or slight distortions in the image.
It involves sliding a small window (often 2 × 2 pixels) across a feature map and selecting a single value to represent that area - for example, picking the maximum value:
Pooling benefits: it shrinks the feature maps (fewer parameters, faster computation), helps prevent overfitting, and preserves the dominant features.
Choosing pooling parameters: the window size and stride (commonly a 2 × 2 window with a stride of 2) control how aggressively the map is down-sampled.
Different types of pooling: max pooling, average (mean) pooling, and sum pooling.
Pooling Layer in CNN Architecture: Typically applied after the convolution layer.
The result after applying pooling is called the pooled feature map.
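The window-sliding step above can be sketched as a small max-pooling routine (the feature-map values are made up for illustration):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Slide a size x size window over the feature map and keep only the
    largest value in each window (max pooling)."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 0, 1, 3]])
print(max_pool(fmap))
# [[4. 2.]
#  [2. 5.]]
```

Note that shifting the 5 by one pixel inside its window would not change the output at all - that is the spatial invariance pooling buys us.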
Next, we need to pass our pooled feature maps to a neural network as input.
How are we going to do that?
We are going to take our feature map and flatten it into a column.
What is Flattening?
Flattening means taking the values row by row and putting them into a single column sequentially, so the data can be fed into a neural network for further processing.
When you have many pooling layers with many pooled feature maps, you put them all in one long column sequentially, one after another. The result is one huge vector of inputs for the neural network.
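In code, flattening is just reading each pooled map out row by row and concatenating the results (the two tiny maps below are made-up examples):

```python
import numpy as np

# Two hypothetical 2x2 pooled feature maps from two different filters.
pooled_maps = [np.array([[4, 2],
                         [2, 5]]),
               np.array([[1, 0],
                         [3, 7]])]

# Read each map row by row and stack all values into one long input vector.
input_vector = np.concatenate([m.flatten() for m in pooled_maps])
print(input_vector)  # [4 2 2 5 1 0 3 7]
```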
This is how the whole architecture connects:
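Putting the stages together, here is a minimal end-to-end sketch in plain NumPy - a toy 7 × 7 image and a single random 3 × 3 filter stand in for a real photo and learned weights:

```python
import numpy as np

def convolve(img, k):
    """Convolution: element-wise multiply-and-sum at every position."""
    h, w = img.shape[0] - k.shape[0] + 1, img.shape[1] - k.shape[1] + 1
    return np.array([[np.sum(img[i:i+k.shape[0], j:j+k.shape[1]] * k)
                      for j in range(w)] for i in range(h)])

def relu(x):
    """Rectifier: zero out negative values to break linearity."""
    return np.maximum(0, x)

def max_pool(x, s=2):
    """Pooling: keep the max of each s x s window."""
    h, w = x.shape[0] // s, x.shape[1] // s
    return np.array([[x[i*s:(i+1)*s, j*s:(j+1)*s].max()
                      for j in range(w)] for i in range(h)])

rng = np.random.default_rng(0)
image = rng.random((7, 7))       # toy 7x7 grayscale input
kernel = rng.random((3, 3))      # one hypothetical 3x3 feature detector

fmap = convolve(image, kernel)   # convolution    -> 5x5 feature map
fmap = relu(fmap)                # rectification  -> non-linearity
fmap = max_pool(fmap)            # pooling        -> 2x2 pooled map
vector = fmap.flatten()          # flattening     -> input for dense layers
print(vector.shape)              # (4,)
```

In a real CNN the flattened vector then feeds a fully connected network that produces the class label.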
The conclusion:
Just as our brains don't process every detail of what we see, but rather focus on key features to make sense of the world, Convolutional Neural Networks follow a similar path. Through the elegant dance of convolution, rectification, and pooling, these networks can transform raw pixels into meaningful interpretations - whether it's recognizing a smile, detecting a stop sign, or identifying a cat in a photo.
The next time you instantly recognize a friend's face in a crowd or spot shapes in the clouds, remember that your brain is performing an incredibly sophisticated version of the processes we've explored. And while our artificial neural networks may not yet match the incredible complexity of human vision, they're helping us better understand both how we see and how we can teach machines to see the world around us.
#ArtificialIntelligence #MachineLearning #DeepLearning #ComputerVision #CNN #ImageProcessing #DataScience #Python #NeuralNetworks #AIexplained #TensorFlow #ML #AI