How does Long Short-Term Memory work in NLP?
A common issue RNNs face is that when you train the network on a very long sequence, it begins to forget the very first inputs.
As training goes on, the later inputs, the things that come toward the end of the text document, start to override what the weights learned from the very first inputs.
We want to make sure we are not forgetting the information from those first inputs as we work through the recurrent network.
We need some sort of long-term memory for our networks, one that holds on to all of the data, including the very first data it was trained on.
LSTM: Long short term memory cell:
LSTM was created to help address these RNN issues.
Let's go through how an LSTM cell works. This is what is used for text generation.
For a typical recurrent neuron:
input(t-1) -> output(t-1)
and this output is fed back in along with the next input:
output(t-1) + input(t) -> output(t)
These outputs are often called the hidden state. Instead of writing output(t-1) we can write h(t-1).
h(t) => the typical output of a recurrent neuron (see the quick sketch below).
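Here is a minimal NumPy sketch of that recurrence, just to make the feedback loop concrete. The weight names W_x, W_h and the bias b are illustrative, not from the notes.

import numpy as np

# One step of a plain recurrent neuron: h(t) = tanh(W_x x(t) + W_h h(t-1) + b).
# W_x, W_h and b are illustrative weight/bias names for this sketch.
def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)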
LSTM:
Input: we have original inputs from a normal recurrent neuron
h(t-1)
x(t)
But here we also have cell state
c(t-1)
Output:
h(t)
c(t)
This is done step by step.
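To make the inputs and outputs concrete before the walkthrough, here is a small sketch using Keras' LSTMCell (sizes are illustrative): the cell takes x(t) plus the carried state [h(t-1), c(t-1)] and returns h(t) plus the new state [h(t), c(t)].

import tensorflow as tf

# Illustrative sizes: input dimension 8, hidden/cell size 16, batch of 1.
cell = tf.keras.layers.LSTMCell(16)
x_t    = tf.zeros((1, 8))    # current input x(t)
h_prev = tf.zeros((1, 16))   # previous hidden state h(t-1)
c_prev = tf.zeros((1, 16))   # previous cell state c(t-1)

out, new_states = cell(x_t, [h_prev, c_prev])
h_t, c_t = new_states        # new hidden state h(t) and new cell state c(t)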
The first step is called the forget gate layer.
1. Forget Gate Layer:
Decides what information we are going to forget or throw away from the cell state.
f(t) = sigmoid(Wf * [h(t-1), x(t)] + bf)
We pass h(t-1) and x(t) => perform a linear transformation with some weights and a bias term => feed it into a sigmoid function. Since it is a sigmoid layer, it outputs a number between 0 and 1:
1 says keep it
0 says forget about it
For example, in a language model we are trying to predict the very next word based on the previous ones. The cell state might include the gender of the present subject, so the correct pronoun can be picked. But when you see a new subject, you might want to forget the gender of the old subject.
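A minimal NumPy sketch of this forget gate (Wf and bf are hypothetical weights and bias for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# f(t) = sigmoid(Wf * [h(t-1), x(t)] + bf): one value in (0, 1) per entry
# of the cell state; close to 1 means keep it, close to 0 means forget it.
def forget_gate(h_prev, x_t, Wf, bf):
    z = np.concatenate([h_prev, x_t])   # [h(t-1), x(t)]
    return sigmoid(Wf @ z + bf)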
2. What are we going to store in the cell state, C(t)?
Sigmoid layer:
the input gate layer, i(t)
we take h(t-1) and x(t) => linear transformation => sigmoid function
i(t) = sigmoid(Wi * [h(t-1), x(t)] + bi)
Hyperbolic tangent layer:
h(t-1) and x(t) => linear transformation => hyperbolic tangent
C~(t) = tanh(Wc * [h(t-1), x(t)] + bc)
This formula creates a vector of new candidate values, C~(t) (sketched in code below).
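A NumPy sketch of this second step, again with illustrative weight and bias names (Wi, bi, Wc, bc):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 2: decide what to store in the cell state.
# i(t)  = sigmoid(Wi * [h(t-1), x(t)] + bi)   -- input gate layer
# C~(t) = tanh(Wc * [h(t-1), x(t)] + bc)      -- new candidate values
def input_gate_and_candidates(h_prev, x_t, Wi, bi, Wc, bc):
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(Wi @ z + bi)       # how much of each candidate to let in
    c_tilde = np.tanh(Wc @ z + bc)   # vector of new candidate values
    return i_t, c_tilde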
3. Update the cell state by combining steps 1 & 2 (the forget gate and the new candidate values)
Time to update the old cell state C(t-1) to the new cell state C(t).
In the previous steps we decided what to forget and what to store; here we just execute it:
C(t) = f(t) * C(t-1) + i(t) * C~(t)
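Putting the three steps together, here is one complete LSTM step as a minimal NumPy sketch. All weight and bias names are illustrative, and the last two lines use the standard output gate o(t) with h(t) = o(t) * tanh(C(t)), which the notes above do not walk through in detail.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One full LSTM step combining steps 1-3 above, plus the usual output gate.
def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])     # [h(t-1), x(t)]
    f_t = sigmoid(Wf @ z + bf)            # step 1: forget gate
    i_t = sigmoid(Wi @ z + bi)            # step 2: input gate
    c_tilde = np.tanh(Wc @ z + bc)        # step 2: candidate values
    c_t = f_t * c_prev + i_t * c_tilde    # step 3: update the cell state
    o_t = sigmoid(Wo @ z + bo)            # output gate (standard formulation)
    h_t = o_t * np.tanh(c_t)              # new hidden state / output h(t)
    return h_t, c_t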
Some variants of LSTM:
Peephole variant: it adds peephole connections to the gates, which lets the gates look at the previous cell state.
Another variation of the LSTM cell is the Gated Recurrent Unit (GRU), introduced around 2014.
It simplifies things by combining the forget and input gates into a single update gate.
It also merges the cell state and the hidden state.
Another slight variation off of this is the depth-gated recurrent neural network.
The resulting GRU is actually simpler than the standard LSTM model, and because of that it is growing increasingly popular.
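As a rough sketch of how interchangeable they are in practice: in a Keras text-generation model, switching from an LSTM layer to a GRU layer is a one-line change. The model below is only illustrative (vocab_size, embed_dim and units are made-up hyperparameters, not a tuned setup).

import tensorflow as tf

vocab_size, embed_dim, units = 5000, 64, 128

# Build the same simple text-generation model with either recurrent layer.
def build_model(rnn_layer):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        rnn_layer,                          # LSTM or GRU
        tf.keras.layers.Dense(vocab_size),  # logits over the vocabulary
    ])

lstm_model = build_model(tf.keras.layers.LSTM(units, return_sequences=True))
gru_model  = build_model(tf.keras.layers.GRU(units, return_sequences=True))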
The main idea is to understand how the LSTM works; that lets you quickly pick up how these variations work. For text generation, LSTMs tend to work best.