A numerical example of LSTMs

Many posts have gone into detail discussing the forward pass of LSTMs (for example, the very informative post here). However, relatively few go through backpropagation, and numerical examples are even more rare. I did manage to find some good sources, and Alex Graves’ thesis was a big help, but after I answered this datascience post on LSTMs, I thought it would be worth delving into some more details.

Other blogs discuss RNNs, so I won’t get into them. Instead, I’ll focus on:

  • The forward pass: how information travels through an LSTM
  • The backward pass: how gradient information travels backwards through the LSTM

Let’s take a very simple example and work through the forward pass and the backward pass. I will assume that you have seen backpropagation before. If you haven’t, you may want to start here.

Basics

LSTM stands for Long Short-Term Memory. It was conceived by Hochreiter and Schmidhuber in 1997 and has been improved on since by many others. The purpose of an LSTM is time series modelling: if you have an input sequence, you may want to map it to an output sequence, a scalar value, or a class. LSTMs can help you do that. The entire LSTM model – all of the gates and the memory cell – is referred to as the LSTM cell. The basic components are an input gate, a forget gate (added after the original LSTM), an output gate, and a memory cell. Broadly:

  • The input gate allows new information to flow into the network. It has parameters W_i, b_i, where i stands for input.
  • The memory cell preserves the hidden units' information across time steps. It has parameters W_c, b_c, where c stands for cell.
  • The forget gate allows information which is no longer pertinent to be discarded. It has parameters W_f, b_f, where f stands for forget.
  • The output gate controls what information will be output (to the screen, say) and what will be propagated forward as part of the new hidden state. It has parameters W_o, b_o, where o stands for output.

Interestingly, all of the weight matrices have the same dimensions.

This is an LSTM with two cells:
[Figure: the full LSTM with two cells]

When I was first looking at LSTMs, the presentations that I saw made it difficult to think of them as feed-forward. While this diagram is a bit unconventional, I find it helpful because it illustrates that we are dealing with a special kind of recurrent neural network (which at its core is just a feed-forward neural network replicated over time).

Let’s break that image down a bit. Here is the first cell:

[Figure: the first cell]

Green circles indicate input. In each cell, we have the current input from the time series, x_t, and input from the previous time step, h_{t-1}. The last operation in the cell is to calculate the hidden state, which is at once part of the output of the current cell and part of the input to the next cell.

Red circles indicate the memory cell. One of the main differences between vanilla RNNs and LSTMs is the addition of the memory cell. Whereas RNNs have only the hidden state to maintain memory from previous time steps, LSTMs have the hidden state as well as this additional memory cell. This helps with the task of learning long-term dependencies, something that RNNs have struggled with. In the first cell, the memory coming from the previous time step is set to 0 (although there is some recent work on initialization strategies).

Orange circles are gates. In real life, a gate can be partially open, fully open, or closed. The same idea applies here. The gates control the flow of information.

The line down the centre of some of the circles indicates that the circle (the neuron) has some net input and an activation function. We can imagine the net input coming in on the left hemisphere, and undergoing a nonlinear transformation (the activation function) on the right hemisphere. For example, the orange neuron i_1 takes a linear combination as input and outputs an activation. We’ll get into that more when we discuss the input gate. Naturally, the initial input doesn’t have a net input, so circles for x_0, h_0, c_0, x_1 have no dividing lines.

Here is the second cell:

[Figure: the second cell]

This is not that much different from the first cell. The hidden state from the previous time step is h_1, which was the result of a calculation, so this neuron has a dividing line. All of the connections are the same. The hidden state h_2 is not input for the next cell because there is no next cell. We go directly from computing the hidden state to computing the output of the LSTM network. We'll get to that.

A detailed walk-through: forward

Let’s start with a simple example with one dimensional input and one dimensional output.

Let’s focus on the first cell for now:

Suppose we have a scalar-valued input sequence x_0 = 0.1, x_1 = 0.2. In English, this means the input at the beginning of the sequence is 0.1, and the input at the next time step is 0.2. Yes, this isn’t much of a sequence, but to illustrate the computation it’ll do fine. We’ll assume that we initialized our weights and biases to have the following values:

W = \begin{bmatrix} w_{i1} & w_{i2} & b_i \\ w_{c1} & w_{c2} & b_c \\ w_{f1} & w_{f2} & b_f \\ w_{o1} & w_{o2} & b_o \\ w_y & 0 & b_y \\ \end{bmatrix} = \begin{bmatrix} 0.5 & 0.25 & 0.01 \\ 0.3 & 0.4 & 0.05 \\ 0.03 & 0.06 & 0.002 \\ 0.02 & 0.04 & 0.001 \\ 0.6 & 0 & 0.025 \\ \end{bmatrix}

This formulation will come in handy later for backpropagation, but you can see that each row of the matrix W has all of the parameters needed for one of the gates. The last row is the linear transformation associated with the output (we’ll get to that).
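
If you'd like to follow along in code, here is one way to lay out these parameters (a minimal NumPy sketch; the variable names are mine, not from any library):

```python
import numpy as np

# Rows: input gate, cell candidate, forget gate, output gate, output layer.
# Columns: weight on x_t, weight on h_{t-1}, bias.
W = np.array([
    [0.50, 0.25, 0.010],   # w_i1, w_i2, b_i
    [0.30, 0.40, 0.050],   # w_c1, w_c2, b_c
    [0.03, 0.06, 0.002],   # w_f1, w_f2, b_f
    [0.02, 0.04, 0.001],   # w_o1, w_o2, b_o
    [0.60, 0.00, 0.025],   # w_y,  0,    b_y
])

x_seq = [0.1, 0.2]   # the input sequence x_0, x_1
y = 0.08             # the target value (used later for the error)
```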

The input gate

It seems fashionable to start at the forget gate, but anytime I walk through a model I like to start with the input.

Our input is [x_0, h_0, c_0], the initial sequence input, the initial hidden value, and the initial memory value. This translates to [0.1, 0, 0]. Above, I haven’t even bothered to include a column for a multiplication by the initial memory state (I would need, for example, a value w_{i3}), since this is rarely anything but 0 – people usually don’t bother considering it as part of the input, and I won’t from this point on.

The image associated with the input gate is:

[Figure: the input gate]

And the associated equations for the first part are:

{\rm{net}}_{i1} = w_{i1}x_0 + w_{i2}h_0 + b_i

i_1 = \sigma({\rm{net}}_{i1}) = 1/(1 + exp(-{\rm{net}}_{i1} ))

I’ve used ‘net’ to mean the net input to the gate. We take a linear transformation of the input values. Another way to present the linear transformation (using T for transpose) is:

{\rm{net}}_{i1} = [w_{i1} \hspace{1em} w_{i2}] [x_0 \hspace{1em} h_0]^T + b_i, as done on that first blog I linked to.

The full computation is:

{\rm{net}}_{i1} = 0.5(0.1) + 0.25(0) + 0.01 = 0.06

i_1 = \sigma({\rm{net}}_{i1}) = 1/(1 + exp(-0.06)) = 0.515

This value can be interpreted as the probability that we will allow the information from x_0 to enter the memory cell.

The usual practice is to keep that value 0.515 – that is, to keep the gate partially open. Alternatively, we could make a decision as to whether the information will go forward. That is, we could generate a value u_1 \sim {\rm{Uniform}}(0, 1) and, if u_1 < 0.515, then allow the information through – open the input gate completely. This is referred to as making a stochastic decision. Depending on the decision, the gate would open (value 1) or close (value 0). For the purposes of this example, let's assume the value is 1 to make the arithmetic easy.
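
As a quick sanity check, here is that arithmetic in Python (a sketch; `sigmoid` and the variable names are mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w_i1, w_i2, b_i = 0.5, 0.25, 0.01
x_0, h_0 = 0.1, 0.0

net_i1 = w_i1 * x_0 + w_i2 * h_0 + b_i   # 0.06
i_1 = sigmoid(net_i1)                    # ~0.515
```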

The second part of the input gate is related to the memory cell. It creates a proposal for the inclusion of the new information:

{\rm{net}}_{c1} = w_{c1}x_0 + w_{c2}h_0 + b_c

\tilde{c}_1 = \tanh({\rm{net}}_{c1}) = (exp({\rm{net}}_{c1}) - exp(-{\rm{net}}_{c1}))/(exp({\rm{net}}_{c1}) + exp(-{\rm{net}}_{c1}))

The full computation is:

{\rm{net}}_{c1} = 0.3(0.1) + 0.4(0) + 0.05 = 0.08

\tilde{c}_1 = \tanh({\rm{net}}_{c1}) = \tanh(0.08) = 0.0798

Note that no stochastic decision is made here – this is the candidate content that we'll pass to the memory cell, not a gate. Unlike the gates, it uses a tanh activation rather than a sigmoid, and we keep it as it is. Why? Because this is the input signal itself.
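
In code, the candidate is just a tanh of the same kind of linear combination (sketch):

```python
import math

w_c1, w_c2, b_c = 0.3, 0.4, 0.05
x_0, h_0 = 0.1, 0.0

net_c1 = w_c1 * x_0 + w_c2 * h_0 + b_c   # 0.08
c_tilde_1 = math.tanh(net_c1)            # ~0.0798
```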

We’ll use both of these pieces together later when we update the memory cell.

The forget gate

The point of this gate is to decide what information needs to be removed from the network. For example, if you’re making a grocery list for this week’s groceries based on last week’s, and you bought 2 weeks’ worth of apples last week, then you don’t need to buy them this week. You can remove the apples from your list.

The forget gate looks like this:

[Figure: the forget gate]

and takes similar input:

{\rm{net}}_{f1} = w_{f1}x_0 + w_{f2}h_0 + b_f

f_1 = \sigma({\rm{net}}_{f1}) = 1/(1 + exp(-{\rm{net}}_{f1} ))

And the computation is also similar:

{\rm{net}}_{f1} = 0.03(0.1) + 0.06(0) + 0.002 = 0.005

f_1 = \sigma({\rm{net}}_{f1}) = 1/(1 + exp(-0.005)) = 0.5012

Again, a stochastic decision could be made here as to whether the previous information should be forgotten (value 0) or allowed through (value 1). For the purposes of this example, let’s assume the value is 1.
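
The same calculation in Python (sketch):

```python
import math

w_f1, w_f2, b_f = 0.03, 0.06, 0.002
x_0, h_0 = 0.1, 0.0

net_f1 = w_f1 * x_0 + w_f2 * h_0 + b_f   # 0.005
f_1 = 1.0 / (1.0 + math.exp(-net_f1))    # ~0.5012
```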

The memory cell

This is the best part! We combine the new information from the input gate and remove the information we’re forgetting according to the forget gate.

The picture is:

[Figure: the memory cell]

and the update looks like this:

c_1 = i_1 \circ \tilde{c}_1 + f_1 \circ c_0

That’s a new symbol! We need an aside.

Aside: Hadamard product

The Hadamard product is an element-wise product. If we have a vector a_1 = [1, 2, 3] and a vector b_1 = [9, 10, 11], then the Hadamard product would be c_1 = a_1 \circ b_1 = [(1)(9), (2)(10), (3)(11)] = [9, 20, 33].
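
In NumPy, the Hadamard product of two arrays is just the element-wise `*` operator:

```python
import numpy as np

a_1 = np.array([1, 2, 3])
b_1 = np.array([9, 10, 11])
c_1 = a_1 * b_1   # element-wise (Hadamard) product: array([ 9, 20, 33])
```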

End Aside

c_1 = 1 \circ 0.0798 + 1 \circ 0 = 0.0798 (using the stochastic decisions i_1 = 1 and f_1 = 1 from above)
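
In code (sketch, using those assumed stochastic decisions):

```python
i_1, f_1 = 1.0, 1.0            # stochastic gate decisions assumed above
c_tilde_1, c_0 = 0.0798, 0.0   # candidate value and initial memory

c_1 = i_1 * c_tilde_1 + f_1 * c_0   # 0.0798
```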

Now that we’ve updated the memory state (another name for the memory cell), we have to think about what we want to output.

The output gate

In a sequence-to-sequence mapping task, like machine translation or image captioning, we might be interested in outputting a value (to the screen or to a file) for each input we see. Here, though, we have a single scalar output – essentially a regression task.

Even though we’re not interested in sending the value to the screen, we still need to compute the output, because it becomes part of the input to the next LSTM cell.

Here’s the image to think of:

[Figure: the output gate]

By now you should be thoroughly bored with these equations:

{\rm{net}}_{o1} = w_{o1}x_0 + w_{o2}h_0 + b_o

o_1 = \sigma({\rm{net}}_{o1}) = 1/(1 + exp(-{\rm{net}}_{o1} ))

{\rm{net}}_{o1} = 0.02(0.1) + 0.04(0) + 0.001 = 0.003

o_1 = \sigma({\rm{net}}_{o1}) = 1/(1 + exp(-0.003)) = 0.5007

And we’ll make a stochastic decision as to whether we pass this output along. For the purposes of this example, let’s assume the stochastic decision results in a 1.
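
And once more in code (sketch):

```python
import math

w_o1, w_o2, b_o = 0.02, 0.04, 0.001
x_0, h_0 = 0.1, 0.0

net_o1 = w_o1 * x_0 + w_o2 * h_0 + b_o   # 0.003
o_1 = 1.0 / (1.0 + math.exp(-net_o1))    # ~0.5007
```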

The hidden layer (hidden state)

[Figure: the hidden state]

I bet you were wondering when we’d get to this. The hidden layer is separate from the memory cell, but very related. I like to think of it as the part of the memory cell that we want to ensure persists. Here’s how we do it:

h_1 = o_1 \circ {\rm{tanh}}(c_1)
h_1 = 1 \circ {\rm{tanh}}(0.0798) = 0.0796 (yes, I’m rounding. I’ve been doing that a lot.)

See what we did there? The output gate decides whether the signal from the memory cell gets sent forward as part of the input to the next LSTM cell.
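
The hidden-state update in code (sketch, with the stochastic decision o_1 = 1):

```python
import math

o_1, c_1 = 1.0, 0.0798       # stochastic output-gate decision and memory cell value
h_1 = o_1 * math.tanh(c_1)   # ~0.0796
```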

The second LSTM cell

We’ll assume the weights are shared across LSTM cells. The equations are exactly the same, but now we use x_1 = 0.2 where before we used x_0, h_1 = 0.0796 where we used h_0, and c_1 = 0.0798 where we used c_0. The target scalar value is y = 0.08. Here are all of the familiar computations written out, with the final answers (again assuming every stochastic gate decision results in the signal being propagated forward, so nothing is forgotten):

Input gate:

{\rm{net}}_{i2} = w_{i1}x_1 + w_{i2}h_1 + b_i = 0.5(0.2) + 0.25(0.0796) + 0.01 = 0.1299
i_2 = \sigma({\rm{net}}_{i2}) = 1/(1 + exp(-{\rm{net}}_{i2} )) = 0.5324
{\rm{net}}_{c2} = w_{c1}x_1 + w_{c2}h_1 + b_c = 0.3(0.2) + 0.4(0.0796) + 0.05 = 0.1418
\tilde{c}_2 = \tanh({\rm{net}}_{c2}) = 0.1409

Forget gate:

{\rm{net}}_{f2} = w_{f1}x_1 + w_{f2}h_1 + b_f = 0.03(0.2) + 0.06(0.0796) + 0.002 = 0.0128
f_2 = \sigma({\rm{net}}_{f2}) = 1/(1 + exp(-{\rm{net}}_{f2} )) = 0.5032

Memory cell:

c_2 = i_2 \circ \tilde{c}_2 + f_2 \circ c_1 = 1 \circ 0.1409 + 1 \circ 0.0798 = 0.2207

Output gate:

{\rm{net}}_{o2} = w_{o1}x_1 + w_{o2}h_1 + b_o = 0.02(0.2) + 0.04(0.0796) + 0.001 = 0.0082
o_2 = \sigma({\rm{net}}_{o2}) = 1/(1 + exp(-{\rm{net}}_{o2} )) = 0.5020

Hidden state:

h_2 = o_2 \circ {\rm{tanh}}(c_2) = 1 \circ {\rm{tanh}}(0.2207) = 0.2172

Okay, now we’ve reached the end of our sequence. It’s time to figure out the final output that we’re going to use for the error calculation. This depends only on the hidden state (remember, the memory cell feeds into the hidden state, so the hidden state already takes the entire memory of our LSTM into account).

The calculation:

\hat{y} = w_y h_2 + b_y = 0.6(0.2172) + 0.025 = 0.1553

That value \hat{y} is our final output.

But wait, weren’t we aiming for 0.08? We need to make some changes to our model. To do that, we’ll calculate the error and backpropagate the signal to update our weights.

The error (mean squared error, or MSE, but with only one value so ‘mean’ is irrelevant):

E = \frac{1}{2}(y - \hat{y})^2 = 0.5(0.08 - 0.1553)^2 = 0.0028
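
Here is a self-contained sketch of the whole forward pass that reproduces these numbers under the same assumption (every stochastic gate decision comes out as 1). The helper and variable names are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (w_*1, w_*2, b_*) for the input, cell, forget, and output rows of W.
w_i, w_c = (0.5, 0.25, 0.01), (0.3, 0.4, 0.05)
w_f, w_o = (0.03, 0.06, 0.002), (0.02, 0.04, 0.001)
w_y, b_y = 0.6, 0.025
x_seq, y = [0.1, 0.2], 0.08

h, c = 0.0, 0.0
for x in x_seq:
    i = sigmoid(w_i[0] * x + w_i[1] * h + w_i[2])
    f = sigmoid(w_f[0] * x + w_f[1] * h + w_f[2])
    o = sigmoid(w_o[0] * x + w_o[1] * h + w_o[2])
    c_tilde = math.tanh(w_c[0] * x + w_c[1] * h + w_c[2])
    print(i, f, o, c_tilde)   # the partially-open gate values shown in the text
    # Following the text, treat every stochastic gate decision as a 1:
    c = 1.0 * c_tilde + 1.0 * c
    h = 1.0 * math.tanh(c)

y_hat = w_y * h + b_y         # ~0.155
E = 0.5 * (y - y_hat) ** 2    # ~0.0028
print(y_hat, E)
```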

A detailed walkthrough: backpropagation through time

This is going to get messy. There is a whole bunch of chain rule going on here. There are a few paths the derivative can take through the network, which is a great thing, actually, because it means the gradient is less prone to vanishing, but we’ll get to that. First, let’s figure out how to send our error signal back. We need the derivative with respect to our weight matrix, but to get there we have to go through all of our model components. Many thanks to Alex Graves for a beautiful thesis and to the author of this very helpful blog for filling in the gaps in my knowledge here.

The first step is to calculate the gradient of the error with respect to the output gate activation (at the final time step, t = 2):

\frac{\partial E}{\partial o_t} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial o_t} = (y - \hat{y})(-w_y){\rm{tanh}}(c_t)

We can see the dependency on the hidden state by expanding \hat{y}:

\frac{\partial E}{\partial o_t} = (y - [w_y(h_t) + b_y])(-w_y){\rm{tanh}}(c_t) = \delta_{o_t}

I’ll use \delta_i to refer to the partial derivative of the error with respect to i, similar to that blog post.

Now we need to differentiate through the hidden state to get to the next part. Alternatively, we could differentiate through c_2 directly – that’s the second path the gradient can take. Actually, going directly through the memory cell saves a step (is shorter) as shown here (pages 12 – 13).

\frac{\partial E}{\partial c_t} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial c_t} = (y - [w_y h_t + b_y])(-w_y)(o_t)(1 - {\rm{tanh}}^2(c_t)) = \delta_{c_t}

Now we need to go through the input and forget gates.

The input gate:

\frac{\partial E}{\partial i_t} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial i_t} = \delta_{c_t} \tilde{c_t} = \delta_{i_t}

The forget gate:

\frac{\partial E}{\partial f_t} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial f_t} = \delta_{c_t} c_{t-1} = \delta_{f_t}

The proposal for the new memory state (I’ll write a_t for the proposal \tilde{c}_t in the delta subscripts, following that blog):

\frac{\partial E}{\partial \tilde{c_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial \tilde{c_t}} = \delta_{c_t} i_t = \delta_{a_t}

The previous cell state:

\frac{\partial E}{\partial c_{t-1}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial c_{t-1}} = \delta_{c_t} f_t = \delta_{c_{t-1}}

The net input to the proposal:

\frac{\partial E}{\partial {\rm{net}}_{c_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial \tilde{c}_t} \frac{\partial \tilde{c}_t}{\partial {\rm{net}}_{c_t}} = \delta_{a_t} (1-tanh^2({\rm{net}}_{c_t})) = \delta_{\hat{a}_t}

The net input to the input gate:

\frac{\partial E}{\partial {\rm{net}}_{i_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial i_t} \frac{\partial i_t}{\partial {\rm{net}}_{i_t}} = \delta_{i_t} i_t(1 - i_t) = \delta_{\hat{i}_t}

using the derivative of the sigmoid function, \sigma'(z) = \sigma(z)(1 - \sigma(z)).

The net input to the forget gate:

\frac{\partial E}{\partial {\rm{net}}_{f_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial f_t} \frac{\partial f_t}{\partial {\rm{net}}_{f_t}} = \delta_{f_t} f_t(1 - f_t) = \delta_{\hat{f}_t}

The net input to the output gate:

\frac{\partial E}{\partial {\rm{net}}_{o_t}} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial o_t} \frac{\partial o_t}{\partial {\rm{net}}_{o_t}} = \delta_{o_t} o_t(1 - o_t) = \delta_{\hat{o}_t}

Now we need to recall our definitions from way up top:

W = \begin{bmatrix} w_{i1} & w_{i2} & b_i \\ w_{c1} & w_{c2} & b_c \\ w_{f1} & w_{f2} & b_f \\ w_{o1} & w_{o2} & b_o \\ w_y & 0 & b_y \\ \end{bmatrix}

And let I_t be the total input at time t: [x_t, h_{t-1}, 1]^T.

Then we can define z_t = W I_t, restricted to the first four rows of W (the gate rows – the output row is handled separately through \hat{y}), and collect all of our ‘lowest’ derivatives:

\delta_{z_t} = [\delta_{\hat{i}_t}, \delta_{\hat{a}_t}, \delta_{\hat{f}_t}, \delta_{\hat{o}_t}]

Then our last derivatives are:

\frac{\partial E}{\partial I_t} = W^T \delta_{z_t}

and

\frac{\partial E}{\partial W_t} = \delta_{z_t} (I_t)^T,

an outer product that gives one gradient entry per gate weight. Since the weights are shared across time steps, the full gradient is the sum of these contributions over t.
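
To make the backward pass concrete, here is a self-contained NumPy sketch of these single-step delta computations for the final time step. Because the 0/1 stochastic rounding used in the hand arithmetic isn’t differentiable, this version keeps the gates at their sigmoid values, so its numbers won’t exactly match the rounded forward pass above; all function and variable names are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The gate rows of W (input, cell candidate, forget, output); the output-layer
# row [w_y, 0, b_y] is handled separately through y_hat.
W_gates = np.array([
    [0.50, 0.25, 0.010],
    [0.30, 0.40, 0.050],
    [0.03, 0.06, 0.002],
    [0.02, 0.04, 0.001],
])
w_y, b_y = 0.6, 0.025
y = 0.08

def backward_last_step(x_t, h_prev, c_prev):
    """Forward through one cell (gates kept partially open), then the deltas above."""
    I_t = np.array([x_t, h_prev, 1.0])            # total input [x_t, h_{t-1}, 1]
    net_i, net_a, net_f, net_o = W_gates @ I_t
    i_t, f_t, o_t = sigmoid(net_i), sigmoid(net_f), sigmoid(net_o)
    a_t = np.tanh(net_a)                          # the candidate \tilde{c}_t
    c_t = i_t * a_t + f_t * c_prev
    h_t = o_t * np.tanh(c_t)
    y_hat = w_y * h_t + b_y

    dE_dh = (y - y_hat) * (-w_y)                  # dE/dh_t at the final step
    d_o = dE_dh * np.tanh(c_t)                    # delta_{o_t}
    d_c = dE_dh * o_t * (1 - np.tanh(c_t) ** 2)   # delta_{c_t}
    d_i, d_f, d_a = d_c * a_t, d_c * c_prev, d_c * i_t
    d_c_prev = d_c * f_t                          # delta_{c_{t-1}}, sent to the previous cell
    delta_z = np.array([
        d_i * i_t * (1 - i_t),    # net input to the input gate
        d_a * (1 - a_t ** 2),     # net input to the candidate
        d_f * f_t * (1 - f_t),    # net input to the forget gate
        d_o * o_t * (1 - o_t),    # net input to the output gate
    ])
    dE_dI = W_gates.T @ delta_z                   # gradient w.r.t. [x_t, h_{t-1}, 1]
    dE_dW = np.outer(delta_z, I_t)                # gradient w.r.t. the gate rows of W
    return dE_dW, dE_dI, d_c_prev

dE_dW, dE_dI, d_c_prev = backward_last_step(x_t=0.2, h_prev=0.0796, c_prev=0.0798)
print(dE_dW)
```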

And there you have it! Backpropagation through an LSTM.

I used the typical activations here – namely, sigmoids and tanh – but this can also be done with ReLUs. I leave that to a future post.


