# A numerical example of LSTMs

Many posts have gone into detail discussing the forward pass of LSTMs (for example, the very informative post here). However, relatively few go through backpropagation, and numerical examples are rarer still. I did manage to find some good sources, and Alex Graves’ thesis was a big help, but after I answered this datascience post on LSTMs, I thought it would be worth delving into some more detail.

Other blogs discuss RNNs, so I won’t get into them. Instead, I’ll focus on:

• The forward pass: how information travels through an LSTM
• The backward pass: how gradient information travels backwards through the LSTM

Let’s take a very simple example and work through the forward pass and the backward pass. I will assume that you have seen backpropagation before. If you haven’t, you may want to start here.

## Basics

LSTM stands for Long Short-Term Memory. It was conceived by Hochreiter and Schmidhuber in 1997 and has been improved on since by many others. The purpose of an LSTM is time series modelling: if you have an input sequence, you may want to map it to an output sequence, a scalar value, or a class. LSTMs can help you do that. The entire LSTM model – all of the gates and the memory cell – is referred to as the LSTM cell. The basic components are an input gate, a forget gate (added after the original LSTM), an output gate, and a memory cell. Broadly:

• The input gate allows new information to flow into the network. It has parameters $W_i, b_i$, where $i$ stands for input.
• The memory cell preserves the hidden units information across time steps. It has parameters $W_c, b_c$, where $c$ stands for cell.
• The forget gate allows information which is no longer pertinent to be discarded. It has parameters $W_f, b_f$, where $f$ stands for forget.
• The output gate controls what information will be output (to the screen, say) and what will be propagated forward as part of the new hidden state. It has parameters $W_o, b_o$, where $o$ stands for output.

Interestingly, all of these weight and bias sets have the same dimensions.

This is an LSTM with two cells:

When I was first looking at LSTMs, the presentations that I saw made it difficult to think of them as feed-forward. While this diagram is a bit unconventional, I find it helpful because it illustrates that we are dealing with a special kind of recurrent neural network (which at its core is just a feed-forward neural network replicated over time).

Let’s break that image down a bit. Here is the first cell:

Green circles indicate input. In each cell, we have the current input from the time series, $x_t$, and input from the previous time step, $h_{t-1}$. The last operation in the cell is to calculate the hidden state for the next cell, which is at once part of the output of the current cell and the input of the next cell.

Red circles indicate the memory cell. One of the main differences between vanilla RNNs and LSTMs is the addition of the memory cell. Whereas RNNs have only the hidden state to maintain memory from previous time steps, LSTMs have the hidden state as well as this additional memory cell. This helps with the task of learning long-term dependencies, something that RNNs have struggled with. In the first cell, the memory coming from the previous time step is set to 0 (although there is some recent work on initialization strategies).

Orange circles are gates. In real life, a gate can be partially open, fully open, or closed. The same idea applies here. The gates control the flow of information.

The line down the centre of some of the circles indicates that the circle (the neuron) has some net input and an activation function. We can imagine the net input coming in on the left hemisphere, and undergoing a nonlinear transformation (the activation function) on the right hemisphere. For example, the orange neuron $i_1$ takes a linear combination as input and outputs an activation. We’ll get into that more when we discuss the input gate. Naturally, the initial input doesn’t have a net input, so circles for $x_0, h_0, c_0, x_1$ have no dividing lines.

Here is the second cell:

This is not that much different from the first cell. The hidden state from the previous layer is $h_1$, which was the result of a calculation, so this neuron has a dividing line. All of the connections are the same. The hidden state $h_2$ is not input for the next cell because there is no next cell. We go directly from computing the hidden state to computing the output of the LSTM network. We’ll get to that.

## A detailed walk-through: forward

Let’s start with a simple example with one dimensional input and one dimensional output.

Let’s focus on the first cell for now:

Suppose we have a scalar-valued input sequence $x_0 = 0.1, x_1 = 0.2$. In English, this means the input at the beginning of the sequence is 0.1, and the input at the next time step is 0.2. Yes, this isn’t much of a sequence, but to illustrate the computation it’ll do fine. We’ll assume that we initialized our weights and biases to have the following values:

$W = \begin{bmatrix} w_{i1} & w_{i2} & b_i \\ w_{c1} & w_{c2} & b_c \\ w_{f1} & w_{f2} & b_f \\ w_{o1} & w_{o2} & b_o \\ w_y & 0 & b_y \\ \end{bmatrix} = \begin{bmatrix} 0.5 & 0.25 & 0.01 \\ 0.3 & 0.4 & 0.05 \\ 0.03 & 0.06 & 0.002 \\ 0.02 & 0.04 & 0.001 \\ 0.6 & 0 & 0.025 \\ \end{bmatrix}$

This formulation will come in handy later for backpropagation, but you can see that each row of the matrix $W$ has all of the parameters needed for one of the gates. The last row is the linear transformation associated with the output (we’ll get to that).
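A sketch of this parameter matrix in NumPy (the row and column layout mirrors $W$ above, and the variable names are mine):

```python
import numpy as np

# Rows: input gate, memory cell, forget gate, output gate, output layer.
# Columns: weight on x_t, weight on h_{t-1}, bias.
W = np.array([
    [0.50, 0.25, 0.010],   # w_i1, w_i2, b_i
    [0.30, 0.40, 0.050],   # w_c1, w_c2, b_c
    [0.03, 0.06, 0.002],   # w_f1, w_f2, b_f
    [0.02, 0.04, 0.001],   # w_o1, w_o2, b_o
    [0.60, 0.00, 0.025],   # w_y,  (unused), b_y
])

# All four gate net inputs for the first cell in one matrix-vector product:
x0, h0 = 0.1, 0.0
z = W[:4] @ np.array([x0, h0, 1.0])
print(z)   # approximately [0.06, 0.08, 0.005, 0.003]
```

Each gate’s activation is then the sigmoid (or, for the cell candidate, the tanh) of the corresponding entry of `z`.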

### The input gate

It seems fashionable to start at the forget gate, but anytime I walk through a model I like to start with the input.

Our input is $[x_0, h_0, c_0]$, the initial sequence input, the initial hidden value, and the initial memory value. This translates to $[0.1, 0, 0]$. Above, I haven’t even bothered to include a column for a multiplication by the initial memory state (I would need, for example, a value $w_{i3}$), since this is rarely anything but 0 – people usually don’t bother considering it as part of the input, and I won’t from this point on.

The image associated with the input gate is:

And the associated equations for the first part are:

${\rm{net}}_{i1} = w_{i1}x_0 + w_{i2}h_0 + b_i$

$i_1 = \sigma({\rm{net}}_{i1}) = 1/(1 + exp(-{\rm{net}}_{i1} ))$

I’ve used ‘net’ to mean the net input to the gate. We take a linear transformation of the input values. Another way to present the linear transformation (using $T$ for transpose) is:

${\rm{net}}_{i1} = [w_{i1} \hspace{1em} w_{i2}] [x_0 \hspace{1em} h_0]^T + b_i$, as done on that first blog I linked to.

The full computation is:

${\rm{net}}_{i1} = 0.5(0.1) + 0.25(0) + 0.01 = 0.06$

$i_1 = \sigma({\rm{net}}_{i1}) = 1/(1 + exp(-0.06)) = 0.515$

This value can be interpreted as the probability that we will allow the information from $x_0$ to enter the memory cell.

The usual practice is to keep that value 0.515 – that is, to keep the gate partially open. Alternatively, we could make a decision as to whether the information will go forward. That is, we could generate a value $u_1 \sim {\rm{Uniform}}(0, 1)$ and, if $u_1 < 0.515$, then allow the information through – open the input gate completely. This is referred to as making a stochastic decision. Depending on the decision, the gate would open (value 1) or close (value 0). For the purposes of this example, let's assume the value is 1 to make the arithmetic easy.
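As a sketch in Python (variable names are mine), the input-gate computation and the optional stochastic decision look like this:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w_i1, w_i2, b_i = 0.5, 0.25, 0.01   # input-gate parameters from W
x0, h0 = 0.1, 0.0                    # first input and initial hidden state

net_i1 = w_i1 * x0 + w_i2 * h0 + b_i
i1 = sigmoid(net_i1)
print(round(net_i1, 3), round(i1, 3))   # 0.06 0.515

# The stochastic alternative: open the gate fully with probability i1.
gate = 1.0 if random.random() < i1 else 0.0
```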

The second part of the input gate is related to the memory cell. It creates a proposal for the inclusion of the new information:

${\rm{net}}_{c1} = w_{c1}x_0 + w_{c2}h_0 + b_c$

$\tilde{c}_1 = \tanh({\rm{net}}_{c1}) = (exp({\rm{net}}_{c1}) - exp(-{\rm{net}}_{c1}))/(exp({\rm{net}}_{c1}) + exp(-{\rm{net}}_{c1}))$

The full computation is:

${\rm{net}}_{c1} = 0.3(0.1) + 0.4(0) + 0.05 = 0.08$

$\tilde{c}_1 = \tanh({\rm{net}}_{c1}) = \tanh(0.08) = 0.0798$

Note that no stochastic decision is made here: the tanh produces the candidate value itself, not a gate probability. This is the quantity associated with the input that we'll pass to the memory cell. Why no gating? Because this is the input signal! We need this part as it is.

We’ll use both of these pieces together later when we update the memory cell.
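You can verify that 0.0798 comes from the tanh and not the sigmoid (a quick check; a sigmoid of the same net input would give roughly 0.52):

```python
import math

net_c1 = 0.3 * 0.1 + 0.4 * 0.0 + 0.05          # 0.08
c1_tilde = math.tanh(net_c1)                    # the candidate value
sigmoid_value = 1.0 / (1.0 + math.exp(-net_c1)) # what a sigmoid would give
print(round(c1_tilde, 4), round(sigmoid_value, 2))   # 0.0798 0.52
```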

### The forget gate

The point of this gate is to decide what information needs to be removed from the network. For example, if you’re making a grocery list for this week’s groceries based on last week’s, and you bought 2 weeks’ worth of apples last week, then you don’t need to buy them this week. You can remove the apples from your list.

The forget gate looks like this:

and takes similar input:

${\rm{net}}_{f1} = w_{f1}x_0 + w_{f2}h_0 + b_f$

$f_1 = \sigma({\rm{net}}_{f1}) = 1/(1 + exp(-{\rm{net}}_{f1} ))$

And the computation is also similar:

${\rm{net}}_{f1} = 0.03(0.1) + 0.06(0) + 0.002 = 0.005$

$f_1 = \sigma({\rm{net}}_{f1}) = 1/(1 + exp(-0.005)) = 0.5012$

Again, a stochastic decision could be made here as to whether the previous information should be forgotten (value 0) or allowed through (value 1). For the purposes of this example, let’s assume the value is 1.

### The memory cell

This is the best part! We combine the new information from the input gate and remove the information we’re forgetting according to the forget gate.

The picture is:

and the update looks like this:

$c_1 = i_1 \circ \tilde{c}_1 + f_1 \circ c_0$

That’s a new symbol! We need an aside.

The Hadamard product is an element-wise product. If we have a vector $a = [1, 2, 3]$ and a vector $b = [9, 10, 11]$, then the Hadamard product is $a \circ b = [(1)(9), (2)(10), (3)(11)] = [9, 20, 33]$.
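In code, the Hadamard product is just element-wise multiplication (a quick sketch in plain Python):

```python
# Hadamard (element-wise) product of two vectors.
a = [1, 2, 3]
b = [9, 10, 11]
hadamard = [ai * bi for ai, bi in zip(a, b)]
print(hadamard)  # [9, 20, 33]
```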

#### End Aside

$c_1 = 1 \circ 0.0798 + 1 \circ 0 = 0.0798$ (using the stochastic decisions $i_1 = 1$ and $f_1 = 1$ from above)

Now that we’ve updated the memory state (another name for the memory cell), we have to think about what we want to output.

### The output gate

In a sequence-to-sequence mapping task, like machine translation or image captioning, we might be interested in outputting a value (to the screen or to a file) for each input we see. Here, though, we have a single scalar output – essentially a regression task.

Even though we’re not interested in sending the value to the screen, we still need to compute the output, because it becomes part of the input to the next LSTM cell.

Here’s the image to think of:

By now you should be thoroughly bored with these equations:

${\rm{net}}_{o1} = w_{o1}x_0 + w_{o2}h_0 + b_o$

$o_1 = \sigma({\rm{net}}_{o1}) = 1/(1 + exp(-{\rm{net}}_{o1} ))$

${\rm{net}}_{o1} = 0.02(0.1) + 0.04(0) + 0.001 = 0.003$

$o_1 = \sigma({\rm{net}}_{o1}) = 1/(1 + exp(-0.003)) = 0.5007$

And we’ll make a stochastic decision as to whether we pass this output along. For the purposes of this example, let’s assume the stochastic decision results in a 1.

### The hidden layer (hidden state)

I bet you were wondering when we’d get to this. The hidden layer is separate from the memory cell, but very related. I like to think of it as the part of the memory cell that we want to ensure persists. Here’s how we do it:

$h_1 = o_1 \circ {\rm{tanh}}(c_1)$
$h_1 = 1 \circ 0.0796 = 0.0796$ (yes, I’m rounding. I’ve been doing that a lot.)

See what we did there? The output gate decides whether the signal from the memory cell gets sent forward as part of the input to the next LSTM cell.

### The second LSTM cell

We’ll assume the weights are shared across LSTM cells. The equations are exactly the same, but now we use $x_1$ where before we used $x_0$, $h_1$ where we used $h_0$, $c_1$ where we used $c_0$, and so on. Let’s say we have input $x_1 = 0.2$ and target scalar value $y = 0.08$. Here are all of the familiar computations written out, with final answers given (assuming, as before, that all stochastic gate decisions result in the signal being propagated forward and no information forgotten):

Input gate:

${\rm{net}}_{i2} = w_{i1}x_1 + w_{i2}h_1 + b_i = 0.5(0.2) + 0.25(0.0796) + 0.01 = 0.1299$
$i_2 = \sigma({\rm{net}}_{i2}) = 1/(1 + exp(-{\rm{net}}_{i2} )) = 0.5324$
${\rm{net}}_{c2} = w_{c1}x_1 + w_{c2}h_1 + b_c = 0.3(0.2) + 0.4(0.0796) + 0.05 = 0.1418$
$\tilde{c}_2 = \tanh({\rm{net}}_{c2}) = 0.1409$

Forget gate:

${\rm{net}}_{f2} = w_{f1}x_1 + w_{f2}h_1 + b_f = 0.03(0.2) + 0.06(0.0796) + 0.002 = 0.0128$
$f_2 = \sigma({\rm{net}}_{f2}) = 1/(1 + exp(-{\rm{net}}_{f2} )) = 0.5032$

Memory cell (with the stochastic decisions $i_2 \to 1$ and $f_2 \to 1$):

$c_2 = i_2 \circ \tilde{c}_2 + f_2 \circ c_1 = 1 \circ 0.1409 + 1 \circ 0.0798 = 0.2207$

Output gate:

${\rm{net}}_{o2} = w_{o1}x_1 + w_{o2}h_1 + b_o = 0.02(0.2) + 0.04(0.0796) + 0.001 = 0.0082$
$o_2 = \sigma({\rm{net}}_{o2}) = 1/(1 + exp(-{\rm{net}}_{o2} )) = 0.5020$

Hidden state (with the stochastic decision $o_2 \to 1$):

$h_2 = o_2 \circ {\rm{tanh}}(c_2) = 1 \circ {\rm{tanh}}(0.2207) = 0.2172$

Okay, now we’ve reached the end of our sequence. It’s time to figure out the final output, the one we’re going to use for the error calculation. This depends exclusively on the hidden state (remember, the memory cell is input to the hidden state, so the hidden state already takes into account the entire memory of our LSTM).

The calculation:

$\hat{y} = w_y h_2 + b_y = 0.6(0.2172) + 0.025 = 0.1553$

That value $\hat{y}$ is our final output.

But wait, weren’t we aiming for 0.08? We need to make some changes to our model. To do that, we’ll calculate the error and backpropagate the signal to update our weights.

The error (mean squared error, or MSE, but with only one value the ‘mean’ is irrelevant):

$E = \frac{1}{2}(y - \hat{y})^2 = 0.5(0.1553 - 0.08)^2 = 0.00284$
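As a check on all of this arithmetic, here’s the full two-cell forward pass in Python (a minimal sketch, with my own variable names; following the walk-through’s assumption that every stochastic gate decision comes out as 1):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One (w_x, w_h, b) triple per gate, matching the rows of W.
params = {
    "i": (0.50, 0.25, 0.010),
    "c": (0.30, 0.40, 0.050),
    "f": (0.03, 0.06, 0.002),
    "o": (0.02, 0.04, 0.001),
}
w_y, b_y = 0.6, 0.025

def lstm_step(x, h_prev, c_prev, binary_gates=True):
    """One LSTM cell. With binary_gates=True, every stochastic gate
    decision is assumed to come out 1, as in the walk-through."""
    def net(g):
        w_x, w_h, b = params[g]
        return w_x * x + w_h * h_prev + b
    i = 1.0 if binary_gates else sigmoid(net("i"))
    f = 1.0 if binary_gates else sigmoid(net("f"))
    o = 1.0 if binary_gates else sigmoid(net("o"))
    c_tilde = math.tanh(net("c"))
    c = i * c_tilde + f * c_prev   # memory cell update
    h = o * math.tanh(c)           # new hidden state
    return h, c

h, c = 0.0, 0.0
for x in [0.1, 0.2]:
    h, c = lstm_step(x, h, c)

y_hat = w_y * h + b_y
E = 0.5 * (0.08 - y_hat) ** 2
print(round(h, 4), round(y_hat, 4), round(E, 5))  # 0.2172 0.1553 0.00284
```

Passing `binary_gates=False` instead keeps every gate partially open, which is the more common practice in real implementations.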

## A detailed walkthrough: backpropagation through time

This is going to get messy. There is a whole bunch of chain rule going on here. There are a few paths the derivative can take through the network, which is a great thing, actually, because it means the gradient is less prone to vanishing, but we’ll get to that. First, let’s figure out how to send our error signal back. We need the derivative with respect to our weight matrix, but to get there we have to go through all of our model components. Many thanks to Alex Graves for a beautiful thesis and to the author of this very helpful blog for filling in the gaps in my knowledge here.

The first step is to calculate the gradient of the error with respect to the output:

$\frac{\partial E}{\partial o_t} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial o_t} = (y - \hat{y})(-w_y){\rm{tanh}}(c_t)$

We can see the dependency on the hidden state by expanding $\hat{y}$:

$\frac{\partial E}{\partial o_t} = (y - [w_y(h_t) + b_y])(-w_y){\rm{tanh}}(c_t) = \delta_{o_t}$

I’ll use $\delta_i$ to refer to the partial derivative of the error with respect to $i$, similar to that blog post.

Now we need to differentiate through the hidden state to get to the next part. Alternatively, we could differentiate through $c_2$ directly; that’s the second path the gradient can take. In fact, going directly through the memory cell saves a step, as shown here (pages 12–13).

$\frac{\partial E}{\partial c_t} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial c_t} = (y - [w_y(h_t) + b_y])(-w_y)(o_t)(1 - {\rm{tanh}}^2(c_t)) = \delta_{c_t}$

Now we need to go through the input and forget gates.

The input gate:

$\frac{\partial E}{\partial i_t} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial i_t} = \delta_{c_t} \tilde{c_t} = \delta_{i_t}$

The forget gate:

$\frac{\partial E}{\partial f_t} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial f_t} = \delta_{c_t} c_{t-1} = \delta_{f_t}$

The proposal for the new memory state:

$\frac{\partial E}{\partial \tilde{c_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial \tilde{c_t}} = \delta_{c_t} i_t = \delta_{a_t}$

The previous cell state:

$\frac{\partial E}{\partial c_{t-1}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial c_{t-1}} = \delta_{c_t} f_t = \delta_{c_{t-1}}$

The input to the proposal:

$\frac{\partial E}{\partial {\rm{net}}_{c_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial \tilde{c}_t} \frac{\partial \tilde{c}_t}{\partial {\rm{net}}_{c_t}} = \delta_{a_t} (1-tanh^2({\rm{net}}_{c_t})) = \delta_{\hat{a}_t}$

The net input to the input gate:

$\frac{\partial E}{\partial {\rm{net}}_{i_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial i_t} \frac{\partial i_t}{\partial {\rm{net}}_{i_t}} = \delta_{i_t} i_t(1 - i_t) = \delta_{\hat{i}_t}$

The factor $i_t(1 - i_t)$ comes from the derivative of the sigmoid function, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
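The sigmoid derivative identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ is easy to verify numerically (a quick sketch using a central finite difference):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = 0.06  # the net input to the first input gate, for example
analytic = sigmoid(x) * (1 - sigmoid(x))

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-8)  # True
```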

The net input to the forget gate:

$\frac{\partial E}{\partial {\rm{net}}_{f_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial f_t} \frac{\partial f_t}{\partial {\rm{net}}_{f_t}} = \delta_{f_t} f_t(1 - f_t) = \delta_{\hat{f}_t}$

The net input to the output gate:

$\frac{\partial E}{\partial {\rm{net}}_{o_t}} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial o_t} \frac{\partial o_t}{\partial {\rm{net}}_{o_t}} = \delta_{o_t} o_t(1 - o_t) = \delta_{\hat{o}_t}$

Now we need to recall our definitions from way up top:

$W = \begin{bmatrix} w_{i1} & w_{i2} & b_i \\ w_{c1} & w_{c2} & b_c \\ w_{f1} & w_{f2} & b_f \\ w_{o1} & w_{o2} & b_o \\ w_y & 0 & b_y \\ \end{bmatrix}$

And let $I_t$ be the total input at time $t$: $[x_t, h_{t-1}, 1]^T$.

Then we can define $z_t = W I_t$ (using the first four rows of $W$, which correspond to the four gates), and collect all of our ‘lowest’ derivatives:

$\delta_{z_t} = [\delta_{\hat{i}_t}, \delta_{\hat{a}_t}, \delta_{\hat{f}_t}, \delta_{\hat{o}_t}]$

Then our last derivatives are:

$\frac{\partial E}{\partial I_t} = W^T \delta_{z_t}$

and

$\frac{\partial E}{\partial W_t} = \delta_{z_t} (I_t)^T$ (an outer product).
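To sanity-check the delta formulas, here’s a single-step gradient check (a sketch with my own variable names; the gates are kept at their sigmoid values rather than binarized, since the stochastic decisions aren’t differentiable, and with $h_0 = c_0 = 0$ only the derivatives with respect to the first-column weights are checked):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Parameters (rows of W): input gate, cell candidate, forget gate, output gate.
P = {
    "i": [0.50, 0.25, 0.010],
    "c": [0.30, 0.40, 0.050],
    "f": [0.03, 0.06, 0.002],
    "o": [0.02, 0.04, 0.001],
}
w_y, b_y = 0.6, 0.025
x, y = 0.1, 0.08  # single-step input and target

def forward(P):
    h_prev, c_prev = 0.0, 0.0
    net = {g: P[g][0] * x + P[g][1] * h_prev + P[g][2] for g in P}
    i, f, o = sigmoid(net["i"]), sigmoid(net["f"]), sigmoid(net["o"])
    c_tilde = math.tanh(net["c"])
    c = i * c_tilde + f * c_prev
    h = o * math.tanh(c)
    E = 0.5 * (y - (w_y * h + b_y)) ** 2
    return E, (i, f, o, c_tilde, c, h)

E, (i, f, o, c_tilde, c, h) = forward(P)

# The deltas from the text (c_prev = 0, so the forget-gate delta vanishes).
d_h = (y - (w_y * h + b_y)) * (-w_y)        # dE/dh
d_o = d_h * math.tanh(c)                    # dE/do
d_c = d_h * o * (1 - math.tanh(c) ** 2)     # dE/dc
d_i, d_f, d_a = d_c * c_tilde, d_c * 0.0, d_c * i
dz = {"i": d_i * i * (1 - i),               # sigmoid derivative
      "c": d_a * (1 - c_tilde ** 2),        # tanh derivative
      "f": d_f * f * (1 - f),
      "o": d_o * o * (1 - o)}

# dE/dw_{g1} = dz[g] * x; compare against central finite differences.
eps = 1e-6
for g in P:
    P[g][0] += eps; E_plus = forward(P)[0]
    P[g][0] -= 2 * eps; E_minus = forward(P)[0]
    P[g][0] += eps  # restore
    numeric = (E_plus - E_minus) / (2 * eps)
    assert abs(dz[g] * x - numeric) < 1e-8, g
print("gradient check passed")
```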

And there you have it! Backpropagation through an LSTM.

I used the typical activations here – namely, sigmoids and tanh – but this can also be done with ReLUs. I leave that to a future post.

## 4 thoughts on “A numerical example of LSTMs”

1. I’m not sure I made a mistake myself, but I think there’s a mistake
c1 = i1 * ~c1 + f1 * c0 = 1 * 0.0798 + 1 * 0
That is wrong, because i1 = 0.515 and f1 = 0.5012.

The right value is:
c1 = 0.515 * 0.0798 + 0.5012 * 0 = 0.0411

And:
h1 = o1 * tanh(c1) = 0.5007 * tanh(0.0411) = 0.0206


1. Hi Frank,

Thanks for pointing that out. At the end of the calculation for i1 and f1 and o1, I said that it’s easier arithmetically to assume the stochastic decision resulted in a 1. That’s why I was using the value 1 throughout, instead of 0.515, 0.5012, and 0.5007. If instead you assume the gates stay partially open, as you have, I think you have the right values.


2. masum says:

Nice and detailed. Would you be kind enough to summarise why the LSTM doesn’t see vanishing gradient problems?


1. Hi masum, thanks for your comment. In short, the LSTM has less of a problem with vanishing gradients because the gradient has more than one pathway to follow during backpropagation. There are several resources available on backprop in LSTMs, but I’d recommend Alex Graves’ thesis from 2012. You can access it here: https://www.cs.toronto.edu/~graves/preprint.pdf Best of luck!
