How Do Restricted Boltzmann Machines Work? | Deep Learning for Recommendation Systems

Here's a standard Boltzmann Machine:

image credit:

In theory it's a great model: every node is connected to every other node.

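In practice, as we increase the number of nodes, the number of connections grows quadratically and the number of possible states grows exponentially, so computation quickly becomes impractical.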

A Restricted Boltzmann Machine (RBM):

A simple restriction: hidden nodes cannot connect to each other.
Input (visible) nodes cannot connect to each other either, so connections exist only between input nodes and hidden nodes.
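
To get a feel for what the restriction removes, here is a minimal sketch that counts connections in a fully connected Boltzmann Machine versus an RBM. The function names and the choice of 6 input nodes and 3 hidden nodes are assumptions made for illustration only.

```python
def full_connections(n_nodes):
    # In a standard Boltzmann Machine every pair of nodes shares one connection.
    return n_nodes * (n_nodes - 1) // 2


def restricted_connections(n_visible, n_hidden):
    # In an RBM only input-hidden pairs are connected.
    return n_visible * n_hidden


n_visible, n_hidden = 6, 3  # illustrative sizes, not taken from a real model
print(full_connections(n_visible + n_hidden))        # 36
print(restricted_connections(n_visible, n_hidden))   # 18
```

Because no two hidden nodes are connected, all hidden nodes can be computed from the input nodes in one step, and the same holds in the other direction.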

image credit:

Let's look at an example of a movie recommendation system.

Let's say we are going to be using 6 movies.
An RBM is a generative type of model: it generates different states of our system.
For example, through the training process:
The RBM might identify that genres are important features.
Actors, awards, and directors may also be important for movie recommendation.
What does it mean when an identified feature is important?
As we feed each row of data into our model during training, it understands our system better.
If a user liked movie 2 and movie 3 and didn't like movie 4, that is just one user's preference.
Basically, the data describes people's preferences: which movies they like to watch and which movies they are biased toward or against. This is what our model is trying to capture.
Suppose that people who liked movie 3 and movie 4 also liked movie 6,
and that people who disliked movie 3 or movie 4 also didn't like movie 6.
Our RBM would identify this pattern during training and assign a hidden node to look out for that feature.
Our model only ever receives 0s and 1s. From this alone it can establish that these movies probably have some feature in common that makes people like them: people like not just the movie but that feature.
Thus any movie with that feature is highly likely to be enjoyed by those people.
In simple terms, this feature could be a genre or a certain actor X.

image credit:

Which of these hidden nodes are going to be activated for this user?
Certain features are going to light up based on the user's preferences.
We don't have data for Fight Club and The Departed.
Forrest Gump activates the hidden node for drama.
Since this user did not like The Matrix and Pulp Fiction, the action and Tarantino nodes will be red (not activated).
Now what happens?
Our Boltzmann machine is going to reconstruct our input.
What about the null nodes,
Fight Club and The Departed?
For each of them we ask, based on the user's preferences: is this movie an action movie, which the user did not like, or is it directed by Tarantino?
If it matches the user's preferences, we light it up green;
otherwise we color it red.
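
The walkthrough above can be sketched in a few lines of Python. Everything numeric here is an assumption: the weights and biases are hand-picked rather than learned, the assignment of movies to the drama, action, and Tarantino nodes is purely illustrative, only the five movies named above are included, and a 0.5 threshold stands in for the random sampling a real RBM would do.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


movies = ["The Matrix", "Fight Club", "Forrest Gump", "Pulp Fiction", "The Departed"]
features = ["drama", "action", "tarantino"]

# Hypothetical hand-picked weights (rows: movies, columns: hidden features).
W = [
    [0.0, 4.0, 0.0],  # The Matrix
    [0.0, 4.0, 0.0],  # Fight Club
    [4.0, 0.0, 0.0],  # Forrest Gump
    [0.0, 0.0, 4.0],  # Pulp Fiction
    [4.0, 0.0, 0.0],  # The Departed
]
hidden_bias = [-2.0, -2.0, -2.0]   # hypothetical
visible_bias = [-2.0] * len(movies)  # hypothetical

# 1 = liked, 0 = disliked, None = no data for this user.
ratings = [0, None, 1, 0, None]


def hidden_probs(ratings):
    # Missing ratings are skipped, so they do not influence the hidden nodes.
    probs = []
    for j in range(len(features)):
        total = hidden_bias[j]
        for i, r in enumerate(ratings):
            if r is not None:
                total += r * W[i][j]
        probs.append(sigmoid(total))
    return probs


def reconstruct(hidden):
    # The same weights are used in the opposite direction.
    return [
        sigmoid(visible_bias[i] + sum(W[i][j] * hidden[j] for j in range(len(hidden))))
        for i in range(len(movies))
    ]


h = [1 if p > 0.5 else 0 for p in hidden_probs(ratings)]
recon = reconstruct(h)
predictions = {
    movies[i]: ("green" if recon[i] > 0.5 else "red")
    for i, r in enumerate(ratings)
    if r is None
}
print(dict(zip(features, h)))  # {'drama': 1, 'action': 0, 'tarantino': 0}
print(predictions)             # {'Fight Club': 'red', 'The Departed': 'green'}
```

With these made-up weights only the drama node lights up, so the unrated movie tied to drama is colored green and the one tied to action is colored red.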

What allows Restricted Boltzmann Machines to learn?

Contrastive Divergence.

How does an RBM adjust its weights?

We know that in other neural networks we had a gradient descent process, which relied on backpropagation of the error.

But this network is not directed; it is undirected. So how do the weights get adjusted?

This is where Contrastive Divergence comes in.

We have input nodes; starting from randomly assigned weights, the network calculates the hidden nodes.
These hidden nodes then use the exact same weights to reconstruct the input nodes.
The key point is that the weights are exactly the same in both directions; they don't change between the two passes.
The reconstructed inputs are not going to equal the original inputs, even though the weights are the same.
Why is that?
Because each hidden node is computed from a combination of all the input nodes, no hidden node corresponds to a single input node. When the input nodes are reconstructed from the hidden nodes with the same weights, they therefore do not match the originals exactly.
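
A minimal sketch of one such up-and-down pass, assuming sigmoid activations, no bias terms, and small random weights. The sizes (6 input nodes, 2 hidden nodes), the input vector, and the variable names are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


random.seed(0)  # fixed seed so the random weights are repeatable
n_visible, n_hidden = 6, 2

# One shared weight matrix: W[i][j] links input node i to hidden node j.
W = [[random.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]

v = [1, 0, 1, 1, 0, 0]  # illustrative input

# Up pass: every hidden node mixes all input nodes.
h = [sigmoid(sum(v[i] * W[i][j] for i in range(n_visible))) for j in range(n_hidden)]

# Down pass: the same W is reused, read in the other direction.
v_recon = [sigmoid(sum(W[i][j] * h[j] for j in range(n_hidden))) for i in range(n_visible)]

print(v)
print([round(p, 3) for p in v_recon])
```

The reconstruction is a list of probabilities strictly between 0 and 1, so it cannot reproduce the original 0/1 input exactly even though W never changed.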

Gibbs Sampling:

image credit

We start from the input nodes on the left side.
We build our first set of hidden nodes,
then we reconstruct our input nodes,
then we build our 2nd set of hidden nodes from that reconstruction,
then we reconstruct the input nodes a 2nd time,
then we build our 3rd set of hidden nodes,
and so on ...
We keep going until the process finally converges.
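
The alternating steps above can be sketched as a short chain. This is a simplified illustration, assuming binary nodes, sigmoid activations, no bias terms, and random untrained weights; the helper names (`sample_hidden`, `sample_visible`, `gibbs_chain`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    # Turn each probability into a 0/1 state by a random draw.
    return [1 if rng.random() < p else 0 for p in probs]


def sample_hidden(v, W, rng):
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v)))) for j in range(len(W[0]))]
    return sample(probs, rng)


def sample_visible(h, W, rng):
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h)))) for i in range(len(W))]
    return sample(probs, rng)


def gibbs_chain(v0, W, steps, rng):
    # Returns the input followed by each successive reconstruction.
    chain = [list(v0)]
    v = list(v0)
    for _ in range(steps):
        h = sample_hidden(v, W, rng)   # input nodes -> hidden nodes
        v = sample_visible(h, W, rng)  # hidden nodes -> reconstructed input nodes
        chain.append(v)
    return chain


rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(6)]
chain = gibbs_chain([1, 0, 1, 1, 0, 0], W, steps=3, rng=rng)
for state in chain:
    print(state)
```

Each pass through the loop is one step of the chain; running it to convergence would mean many more steps than the three shown here.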

Let's see this in a graph:

image credit

In RBM, what does energy mean?

The weights determine the energy: they dictate the shape of the energy curve. Through the contrastive divergence process, we find the values (of the input and hidden layers) for which the system is in the lowest energy state possible.

By the time it converges, the Contrastive Divergence process has brought our system to its minimal energy state.
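
For reference, the energy of a binary RBM is commonly written as below. The symbols are assumptions introduced here, not notation from the text above: v_i are the input nodes, h_j the hidden nodes, w_ij the weights, and a_i, b_j the biases of the input and hidden nodes.

```latex
E(v, h) = -\sum_{i} a_i v_i \;-\; \sum_{j} b_j h_j \;-\; \sum_{i,j} v_i \, w_{ij} \, h_j,
\qquad
P(v, h) = \frac{e^{-E(v, h)}}{Z},
\quad
Z = \sum_{v', h'} e^{-E(v', h')}
```

Lower energy means higher probability, which is why changing the weights reshapes the curve the system settles into.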

Do we have to keep repeating the Gibbs sampling until our RBM converges?

In 1998, Geoffrey Hinton discovered a shortcut: we don't have to wait until the process converges. We can take just the first 2 passes, which are sufficient to understand how to adjust the energy curve in the initial stage.

We can refer to this as a single contrastive divergence pass: a CD1 pass.
From this alone we know which way the ball is rolling.
Here we have control over the curve, since we are adjusting the weights.
The shape of the energy curve is governed by the weights in the system, so adjusting the weights is how we shape the curve.
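
A CD1 weight update can be sketched as follows. This is a simplified illustration under stated assumptions: binary nodes, sigmoid activations, no bias terms, a single training example, and the commonly used update rule where each weight moves by the learning rate times the difference between the input-hidden correlation on the data and on the reconstruction. The function names and the learning rate of 0.1 are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W):
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v)))) for j in range(len(W[0]))]


def visible_probs(h, W):
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h)))) for i in range(len(W))]


def cd1_update(v0, W, learning_rate, rng):
    # First pass: input nodes -> hidden nodes (sampled to 0/1).
    h0_probs = hidden_probs(v0, W)
    h0 = [1 if rng.random() < p else 0 for p in h0_probs]
    # Second pass: reconstruct the input, then recompute the hidden nodes.
    v1 = visible_probs(h0, W)
    h1_probs = hidden_probs(v1, W)
    # Move each weight toward the data statistics and away from the reconstruction.
    return [
        [W[i][j] + learning_rate * (v0[i] * h0_probs[j] - v1[i] * h1_probs[j])
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(6)]
v0 = [1, 0, 1, 1, 0, 0]  # illustrative training example
W_new = cd1_update(v0, W, learning_rate=0.1, rng=rng)
print(W_new[1][0] < W[1][0])  # True: the weight of an input that was 0 is pushed down
```

Repeating this update over all training rows is what gradually reshapes the energy curve around the training data.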

We design the system with the aim of always reaching the minimum energy state possible. As a result, when we input our training values, the system is already at the lowest energy state possible.