Hyperas for LSTMs

One of the many facets of deep learning is the selection of appropriate model hyperparameters. How do we decide, for example, the number of hidden units in each layer? Or even the number of layers in our model? Or, arguably one of the most important hyperparameters, the learning rate? These types of architectural choices are governed by hyperparameter optimization techniques such as grid search, random search, and Bayesian optimization. In this post, I’ll cover how to run a hyperparameter search using Hyperas.

Hyperas is a wrapper of Hyperopt for Keras. It’s constantly evolving, so there are relatively few examples online right now aside from those provided in the readme. What happens if you want to build a more complicated model?

In my example below, the task is multiclass classification of epidemic curves. For example, each row of the training data looks like this:

[4, 5, 10, 15, 6, 2, 0, …, 0]

where the last column is the label of the epidemic curve (the example above belongs to class 0) and the epidemic runs for T days, where T is the number of elements in the vector minus 1 (for the class label). Each value is the count of the number of infectious individuals on each day. Day 1 has 4 infectious individuals; day 2 has 5, etc. The goal is to classify the epidemic curve into one of a set of pre-defined categories based on the shape of the curve. In this example, there are 8 categories.

I’ve provided an example using an LSTM. Keras is incredibly versatile and relatively user-friendly. If you’d rather use a GRU, for example, all you have to do is swap ‘GRU’ in for ‘LSTM’ in this code.

To run the code below, you will need to have installed numpy, scipy, sklearn, keras, and hyperas. You will also need a separate file in the same directory:

The file globalvars.py contains one line: globalVar=0

More on the global variable later.

from __future__ import print_function
from hyperopt import Trials, STATUS_OK, tpe, space_eval
import keras.optimizers, keras.initializers
from keras.regularizers import l2
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import LSTM
from keras.layers.core import Dense, Dropout, Activation
from keras.models import Sequential
from keras.utils import np_utils
from sklearn.preprocessing import MinMaxScaler

from hyperas import optim
from hyperas.distributions import choice, uniform, conditional
import numpy, json
import globalvars

def data():
    train_file='train_pl.csv'
    trainset = numpy.loadtxt(train_file, delimiter=",")

    X = trainset[:, 0:(trainset.shape[1]-2)]
    Y = (trainset[:,trainset.shape[1]-1]).astype(int)

    scaler = MinMaxScaler(feature_range=(0, 1))
    X_scaled = scaler.fit_transform(X)

    y_binary = np_utils.to_categorical(Y)

    num_per_class = int(float(X.shape[0])/(float(y_binary.shape[1])))

    to_take = numpy.random.choice(num_per_class, int(num_per_class*0.2), replace=False)
    class_split = numpy.array_split(X_scaled, y_binary.shape[1])
    val_list = [x[to_take] for x in class_split]
    big_list = [item for sublist in val_list for item in sublist]
    val_X = numpy.asarray(big_list)
    label_set = numpy.arange(0, y_binary.shape[1])
    val_Y = numpy.repeat(label_set, int(num_per_class*0.2))
    val_Y = np_utils.to_categorical(val_Y)

    setdiffval = set(range(num_per_class)) - set(to_take)
    setdiffval = list(setdiffval)

    X_train_vals = [x[setdiffval] for x in class_split]
    X_train = [item for sublist in X_train_vals for item in sublist]
    X_train = numpy.asarray(X_train)
    Y_train = numpy.repeat(label_set, int(num_per_class*0.8))
    Y_train = np_utils.to_categorical(Y_train)

    y_train = Y_train
    y_test = val_Y

    x_train = numpy.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
    x_test = numpy.reshape(val_X, (val_X.shape[0], val_X.shape[1], 1))

    return (x_train, y_train, x_test, y_test)


def model(x_train, y_train, x_test, y_test):
    model = Sequential()
    model.add(LSTM({{choice([10, 20, 50, 100])}}, input_shape=(x_train.shape[1], x_train.shape[2]), return_sequences=False))
    model.add(Activation('relu'))

    if conditional({{choice(['one', 'two'])}}) == 'two':
        model.add(Dense({{choice([10, 20, 30, 100])}}))
        model.add(Activation('relu'))

    model.add(Dense(8))
    model.add(Activation('softmax'))

    adam = keras.optimizers.Adam(lr={{choice([10**-6, 10**-5, 10**-4, 10**-3, 10**-2, 10**-1])}}, clipnorm=1.)
    rmsprop = keras.optimizers.RMSprop(lr={{choice([10**-6, 10**-5, 10**-4, 10**-3, 10**-2, 10**-1])}}, clipnorm=1.)
    sgd = keras.optimizers.SGD(lr={{choice([10**-6, 10**-5, 10**-4, 10**-3, 10**-2, 10**-1])}}, clipnorm=1.)

    choiceval = {{choice(['adam', 'sgd', 'rmsprop'])}}
    if choiceval == 'adam':
        optimizer = adam
    elif choiceval == 'rmsprop':
        optimizer = rmsprop
    else:
        optimizer = sgd
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
                  optimizer=optimizer)

    globalvars.globalVar += 1

    filepath = "weights_mlp_hyperas" + str(globalvars.globalVar) + ".hdf5"
    earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=20, verbose=0, mode='auto')
    checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
    callbacks_list = [earlyStopping, checkpoint]

    hist = model.fit(x_train, y_train,
              batch_size={{choice([64, 128])}}, epochs=100,verbose=2,validation_data=(x_test, y_test), callbacks=callbacks_list)
    h1 = hist.history
    acc_ = numpy.asarray(h1['acc'])
    loss_ = numpy.asarray(h1['loss'])
    val_loss_ = numpy.asarray(h1['val_loss'])
    val_acc_ = numpy.asarray(h1['val_acc'])

    acc_and_loss = numpy.column_stack((acc_, loss_, val_acc_, val_loss_))
    save_file_mlp = 'mlp_run_' + str(globalvars.globalVar) + '.txt'
    numpy.savetxt(save_file_mlp, acc_and_loss, delimiter=" ")
    score, acc = model.evaluate(x_test, y_test, verbose=0)
    print('Final validation accuracy:', acc)

    return {'loss': -acc, 'status': STATUS_OK, 'model': model}

if __name__ == '__main__':
    trials=Trials()
    best_run, best_model = optim.minimize(model=model, data=data, algo=tpe.suggest, max_evals=500, trials=trials)
    X_train, Y_train, X_test, Y_test = data()
    print("Evaluation of best performing model:")
    print(best_model.evaluate(X_test, Y_test))
    print("Best performing model chosen hyper-parameters:")
    print(best_run)
    json.dump(best_run, open("best_run.txt", 'w'))

Let’s do a quick walkthrough.

First, a few tips:

  • Hyperas is a string parser. It doesn’t take kindly to comments, which is unfortunate.
  • When you import a python module, keep it in its default namespace. For example, use `import numpy` not `import numpy as np`
  • There is a data() function and a model() function. You need both of these to use Hyperas.
  • One evolution that’s coming up is the ability to keep track of the evaluation you’re currently on. Since that’s still in the works (although the most recent pull request looks promising), I’m doing so manually with a global variable I define separately.
  • You’ll notice in `hist = model.fit(…)` I don’t use `validation_split`, which appears in several other blog posts on Keras. The problem with `validation_split` is that it selects the last x percent of the training data as the validation data. If you shuffle your training data this isn’t a problem, but since I haven’t in the code above, I’ve used a different method (for illustration – I still advocate shuffling your data! A sketch of the shuffle-first approach follows this list).
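For completeness, here’s a minimal sketch of the shuffle-first alternative (the array shapes are illustrative stand-ins for my data). Shuffle the features and labels together once up front, and `validation_split`’s “last 20%” becomes a random 20%:

import numpy

# illustrative stand-ins for the scaled features and one-hot labels
X_scaled = numpy.random.rand(6400, 60)
y_binary = numpy.eye(8)[numpy.random.randint(0, 8, size=6400)]

# apply the same permutation to features and labels
perm = numpy.random.permutation(X_scaled.shape[0])
X_shuffled = X_scaled[perm]
y_shuffled = y_binary[perm]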
Now for the walkthrough.

    data()

    In the data() function, I’ve read in my training data, which had 6400 rows.

    Since the daily observed counts of infectious individuals are integer-valued, and deep learning models tend to behave better with inputs on a common, real-valued scale, I rescale my input values to the range [0, 1] with MinMaxScaler.

    Then, I format my output (y_train and y_test) each as a matrix of categories. That is, for classes 0, 1, …, 7 I create a one-hot matrix with the same number of rows as my training set and 8 columns, which identifies the class of each epidemic (to_categorical does this).

    Also, I wanted to use a portion of my training data for my validation set. There are a few ways to accomplish this, but I’ve chosen to do the splitting here in the data() function.

    Since this is a time series model, I have to reshape my data to be compatible with a Keras time series model – the last two lines of the data() function ensure this compatibility.

    model()

    Now for the fun part!

    The model definition is fairly straightforward in Keras. It’s all very modular, and modules can be swapped as necessary to achieve your desired model. Hyperas is also very intuitive: you can define a set of candidate values to search over for each desired parameter.

    For example, {{choice([10, 20])}} means select either 10 or 20 with equal probability.

    You can also allow the parameter value to be generated according to a distribution, such as {{uniform(0, 1)}}.
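    As a hypothetical fragment inside model() (the layer sizes here are illustrative, not the ones I search over above), the two forms can be mixed freely:

    model.add(Dense({{choice([32, 64])}}))
    model.add(Dropout({{uniform(0, 1)}}))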

    Now, about the globalVar part: this keeps track of which evaluation we’re on. For example, I define max_evals=500 in the main function, which means I’ll try 500 different hyperparameter sets; since globalVar is incremented at the start of each call to model(), it will run from 1 to 500.

    Helpfully, Keras can be used with either the TensorFlow or the Theano backend. If you’re using the TensorFlow backend, you can probably send your results to TensorBoard. I’m using the Theano backend because I’ve read it’s faster (especially for time series models), so I have a separate way of visualizing my results, which I’ll go through below.

    To run the code below, you will need to have installed matplotlib and h5py in addition to the modules above. This file is called after_hyperas.py (the shell script that ties everything together appears later).

    #!/usr/bin/env python
    
    import numpy as np
    import sys, keras
    from keras import optimizers
    from keras.models import Sequential
    from keras.callbacks import EarlyStopping, ModelCheckpoint
    from keras.layers import Dense, LSTM, Activation
    from keras.wrappers.scikit_learn import KerasClassifier
    from keras.regularizers import l2
    from sklearn.preprocessing import MinMaxScaler, LabelEncoder
    from keras.utils import np_utils
    from keras.utils.np_utils import to_categorical
    import matplotlib
    matplotlib.use('pdf')
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    import argparse, os, re, string
    import json
    
    
    def load_data(train_file, test_file):
            trainset = np.loadtxt(train_file, delimiter=",")
    
            # split into input (X) and output (Y) variables; the class label is in the last column
            X = trainset[:, 0:(trainset.shape[1]-2)]
            Y = (trainset[:,trainset.shape[1]-1]).astype(int)
    
            scaler = MinMaxScaler(feature_range=(0, 1))
            X_scaled = scaler.fit_transform(X)
    
            y_binary = to_categorical(Y)
    
            testset = np.loadtxt(test_file, delimiter=",")
            X_test = testset[:,0:(testset.shape[1]-2)]
            X_test_scaled = scaler.transform(X_test) # use the scaling fit on the training set
    
            Y_test = (testset[:,testset.shape[1]-1]).astype(int)
    
            ytest_binary = to_categorical(Y_test)
            return (X_scaled, y_binary, X_test_scaled, ytest_binary) 
    
    def plot_acc_loss(filename, savefigname):
    
        acc_loss_file = np.loadtxt(filename, delimiter=" ", unpack=True)
    
        acc_fig = plt.figure()
        plt.plot(acc_loss_file[0,:], 'r', acc_loss_file[2,:], 'b')
        plt.title('Training and validation accuracy')
        plt.xlabel('Epoch')
        plotmax = np.array([acc_loss_file[0,:], acc_loss_file[2,:]])
        plotmax = plotmax.flatten()
        plt.axis([0, acc_loss_file.shape[1], 0, max(plotmax)])
        red_line = mpatches.Patch(color='red', label='training')
        blue_line = mpatches.Patch(color='blue', label='validation')
        plt.legend(handles=[red_line, blue_line], loc='lower right')
        savefig_acc = savefigname + '_acc.png'
        acc_fig.savefig(savefig_acc)
    
        loss_fig = plt.figure()
        plt.plot(acc_loss_file[1,:], 'r', acc_loss_file[3,:], 'b')
        plt.title('Training and validation loss')
        plt.xlabel('Epoch')
        plotmax = np.array([acc_loss_file[1,:], acc_loss_file[3,:]])
        plotmax = plotmax.flatten()
        plt.axis([0, acc_loss_file.shape[1], 0, max(plotmax)])
        red_line = mpatches.Patch(color='red', label='training')
        blue_line = mpatches.Patch(color='blue', label='validation')
        plt.legend(handles=[red_line, blue_line], loc='lower right')
        savefig_loss = savefigname + '_loss.png'
        loss_fig.savefig(savefig_loss)
    
        return 0
    
    def get_confusion_matrix_one_hot(model_results, truth):
        '''model_results and truth should be in one-hot format, i.e., have >= 2 columns,
        where truth is 0/1, and the argmax along each row of model_results is the model's prediction
        '''
        assert model_results.shape == truth.shape
        num_outputs = truth.shape[1]
        confusion_matrix = np.zeros((num_outputs, num_outputs), dtype=np.int32)
        predictions = np.argmax(model_results,axis=1)
        assert len(predictions)==truth.shape[0]
    
        for actual_class in range(num_outputs):
            idx_examples_this_class = truth[:,actual_class]==1
            prediction_for_this_class = predictions[idx_examples_this_class]
            for predicted_class in range(num_outputs):
                count = np.sum(prediction_for_this_class==predicted_class)
                confusion_matrix[actual_class, predicted_class] = count
        assert np.sum(confusion_matrix)==len(truth)
        assert np.sum(confusion_matrix)==np.sum(truth)
        return confusion_matrix
    
    def lstm_repeat(X, Y, Xtest, Ytest, params_to_use, num_reps=10):
            #Take the best parameters and run through the entire training set
    
            choiceval = params_to_use['choiceval']
            batch_s = params_to_use['batch_size']
            lr_choice_1 = params_to_use['lr']
            lr_choice_2 = params_to_use['lr_1']
            lr_choice_3 = params_to_use['lr_2']
            condval = params_to_use['conditional']
    
            if batch_s == 0:
                    batch_s = 64
            else:
                    batch_s = 128
    
            possible_lr = [10**-6, 10**-5, 10**-4, 10**-3, 10**-2, 10**-1]
            possible_lstm = [10, 20, 50, 100] #matches the LSTM choice in the hyperas search
            possible_dense = [10, 20, 30, 100] #matches the Dense choice in the hyperas search

            if choiceval == 0:
                    optim = keras.optimizers.Adam(lr=possible_lr[lr_choice_1], clipnorm=1.)
            elif choiceval == 1:
                    optim = keras.optimizers.RMSprop(lr=possible_lr[lr_choice_2], clipnorm=1.)
            else:
                    optim = keras.optimizers.SGD(lr=possible_lr[lr_choice_3], clipnorm=1.)

            h_dim = params_to_use['LSTM']
            h_2_dim = params_to_use['Dense']

            hid_val = possible_lstm[h_dim]
            hid_2_val = possible_dense[h_2_dim]
    
            if condval == 1:
                    condval = True

            #Reshape to (samples, time steps, features) for the Keras LSTM
            X = np.reshape(X, (X.shape[0], X.shape[1], 1))
            Xtest = np.reshape(Xtest, (Xtest.shape[0], Xtest.shape[1], 1))

            def lstm_model(hid_dim=10, hid_2_dim=10, optimizer='sgd', condval=False):
                    model = Sequential()
                    model.add(LSTM(hid_dim, input_shape=(X.shape[1], X.shape[2]), return_sequences=False))
                    model.add(Activation('relu'))

                    if condval:
                            model.add(Dense(hid_2_dim))
                            model.add(Activation('relu'))

                    model.add(Dense(8))
                    model.add(Activation('softmax'))

                    model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
                                  optimizer=optimizer)
                    return model
    
            #Need this for all classification errors
            class_error = []
    
            #X has 6400 examples, 800 of each class (8 classes). Take 20% of each class and make this the validation set.
            to_take = np.random.choice(800, int(800*0.2), replace=False)
            class_split = np.array_split(X, 8) #8 equal slices of 800 examples each
            val_list = [x[to_take] for x in class_split]
            big_list = [item for sublist in val_list for item in sublist]
            val_X = np.asarray(big_list)
            label_set = np.arange(0, 8) #class labels 0 to 7
            val_Y = np.repeat(label_set, int(800*0.2))
            val_Y = to_categorical(val_Y)
    
            setdiffval = set(range(800)) - set(to_take)
            setdiffval = list(setdiffval)
    
            X_train_vals = [x[setdiffval] for x in class_split]
            X_train = [item for sublist in X_train_vals for item in sublist]
            X_train = np.asarray(X_train)
            Y_train = np.repeat(label_set, int(800*0.8))
            Y_train = to_categorical(Y_train)
    
            for j in range(num_reps):

                    print('on rep #', j)
    
                    filepath = "weights_lstm_2" + str(j) + ".hdf5" #for saving the best weights
    
                    model = lstm_model(hid_dim=hid_val, hid_2_dim=hid_2_val, optimizer=optim, condval=condval)
                    earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=100, verbose=0, mode='auto')
                    checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
                    callbacks_list = [earlyStopping, checkpoint]
                    hist = model.fit(X_train, Y_train, validation_data=(val_X, val_Y), epochs=500, callbacks=callbacks_list)
    
                    model.load_weights(filepath) #everything subsequent should use the best weights - this works according to test.py
                    scores = model.evaluate(X, Y)
                    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100)) #training loss at the end of training. https://github.com/fchollet/keras/issues/605
    
                    h1 = hist.history
                    acc_ = np.asarray(h1['acc'])
                    loss_ = np.asarray(h1['loss'])
                    val_loss_ = np.asarray(h1['val_loss'])
                    val_acc_ = np.asarray(h1['val_acc'])
    
                    #Save the accuracy and loss, and plot
                    acc_and_loss = np.column_stack((acc_, loss_, val_acc_, val_loss_))
                    save_file_lstm = 'lstm_' + str(j) + '.txt'
                    np.savetxt(save_file_lstm, acc_and_loss, delimiter=" ")
                    print('saved file', save_file_lstm)
    
                    if j == 1:
                            #Plot the training and validation accuracy and loss for the entire training set.
                            plot_acc_loss(save_file_lstm, 'testing')
    
                    #Run the final, trained model on the test set and return a confusion matrix
    
                    test_scores = model.evaluate(Xtest, Ytest) #evaluate returns the metric 'accuracy' on the test set
                    print('eval_scores', test_scores[1]*100)
    
                    predict = model.predict(Xtest) #I think predict returns the class probabilities
                    con_mat = get_confusion_matrix_one_hot(predict, Ytest)
                    print(con_mat)
    
                    save_con_mat = 'confusionmatrix_lstm' + '_' + str(j) + '.txt'
                    np.savetxt(save_con_mat, con_mat, fmt='%i', delimiter=",")
    
    
                    #Get the classification error per class, and the average classification error
                    class_error_j = []
                    for i in range(0, con_mat.shape[1]): #for the number of classes
                            class_i_correct = float(con_mat[i][i])/float(sum(con_mat[i]))
                            class_error_j.append(1. - class_i_correct)
    
                    print(class_error_j)
                    class_error_j.append(sum(class_error_j)/con_mat.shape[1]) #the average classification error
                    class_error.append(class_error_j) #each class's classification error, and the average classification error for this run
    
            print(class_error)
            print(np.asarray(class_error))
    
            np.savetxt('class_error.txt', class_error, fmt = '%1.3f', delimiter=' ')
    
    
            return 0
    
    if __name__=='__main__':
        best_params = json.load(open('best_run.txt'))
        X, Y, Xtest, Ytest = load_data('train_pl.csv', 'test_pl.csv')
        lstm_repeat(X, Y, Xtest, Ytest, best_params)
    

    The matplotlib code lets me plot the training and validation accuracy and loss curves; the confusion matrices are saved as text files, one for each run. For an explanation of confusion matrices, see the highest voted answer here.

    When I run this entire set, I have a separate script called hyperas_and_after.sh

    #!/bin/bash
    python hyperas_test_mlp.py && python after_hyperas.py

    My full process for running all the code is:
    – screen
    – create a virtual environment: virtualenv name_of_my_env
    – activate it: source name_of_my_env/bin/activate
    – install the necessary packages (may need to preload libgfortran)
    – ./hyperas_and_after.sh

    A disclaimer is important at this point: a lot of the code seen here is from various sources, compiled from answers on StackExchange and readmes. This post is mostly here to serve as a personal reminder of how to use Hyperas. Some of the places I got code from are:

    get_confusion_matrix_one_hot: https://github.com/davidslac/mlearntut/blob/master/ex02_keras_train.py

    LSTMs in Keras: http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

    Hyperas: https://github.com/maxpumperla/hyperas/blob/master/README.md

    A numerical example of LSTMs

    Many posts have gone into detail discussing the forward pass of LSTMs (for example, the very informative post here). However, relatively few go through backpropagation, and numerical examples are even more rare. I did manage to find some good sources, and Alex Graves’ thesis was a big help, but after I answered this datascience post on LSTMs, I thought it would be worth delving into some more details.

    Other blogs discuss RNNs, so I won’t get into them. Instead, I’ll focus on:

    • The forward pass: how information travels through an LSTM
    • The backward pass: how gradient information travels backwards through the LSTM

    Let’s take a very simple example and work through the forward pass and the backward pass. I will assume that you have seen backpropagation before. If you haven’t, you may want to start here.

    Basics

    LSTM stands for Long Short-Term Memory. It was conceived by Hochreiter and Schmidhuber in 1997 and has been improved on since by many others. The purpose of an LSTM is time series modelling: if you have an input sequence, you may want to map it to an output sequence, a scalar value, or a class. LSTMs can help you do that. The entire LSTM model – all of the gates and the memory cell – is referred to as the LSTM cell. The basic components are an input gate, a forget gate (added after the original LSTM), an output gate, and a memory cell. Broadly:

    • The input gate allows new information to flow into the network. It has parameters W_i, b_i, where i stands for input.
    • The memory cell preserves the hidden units information across time steps. It has parameters W_c, b_c, where c stands for cell.
    • The forget gate allows information which is no longer pertinent to be discarded. It has parameters W_f, b_f, where f stands for forget.
    • The output gate controls what information will be output (to the screen, say) and what will be propagated forward as part of the new hidden state. It has parameters W_o, b_o, where o stands for output.

    Interestingly, all of the weights have the same dimension.

    This is an LSTM with two cells:
    [Image: FullLSTM]

    When I was first looking at LSTMs, the presentations that I saw made it difficult to think of them as feed-forward. While this diagram is a bit unconventional, I find it helpful because it illustrates that we are dealing with a special kind of recurrent neural network (which at its core is just a feed-forward neural network replicated over time).

    Let’s break that image down a bit. Here is the first cell:

    [Image: FirstCell]

    Green circles indicate input. In each cell, we have the current input from the time series, x_t, and input from the previous time step, h_{t-1}. The last operation in the cell is to calculate the hidden state for the next cell, which is at once part of the output of the current cell and the input of the next cell.

    Red circles indicate the memory cell. One of the main differences between vanilla RNNs and LSTMs is the addition of the memory cell. Whereas RNNs have only the hidden state to maintain memory from previous time steps, LSTMs have the hidden state as well as this additional memory cell. This helps with the task of learning long-term dependencies, something that RNNs have struggled with. In the first cell, the memory coming from the previous time step is set to 0 (although there is some recent work on initialization strategies).

    Orange circles are gates. In real life, a gate can be partially open, fully open, or closed. The same idea applies here. The gates control the flow of information.

    The line down the centre of some of the circles indicates that the circle (the neuron) has some net input and an activation function. We can imagine the net input coming in on the left hemisphere, and undergoing a nonlinear transformation (the activation function) on the right hemisphere. For example, the orange neuron i_1 takes a linear combination as input and outputs an activation. We’ll get into that more when we discuss the input gate. Naturally, the initial input doesn’t have a net input, so circles for x_0, h_0, c_0, x_1 have no dividing lines.

    Here is the second cell:

    [Image: SecondCell]

    This is not that much different from the first cell. The hidden state from the previous layer is h_1, which was the result of a calculation, so this neuron has a dividing line. All of the connections are the same. The hidden state h_2 is not input for the next cell because there is no next cell. We go directly from computing the hidden state to computing the output of the LSTM network. We’ll get to that.

    A detailed walk-through: forward

    Let’s start with a simple example with one dimensional input and one dimensional output.

    Let’s focus on the first cell for now:

    Suppose we have a scalar-valued input sequence x_1 = 0.1, x_2 = 0.2. In English, this means the input at the first time step is 0.1, and the input at the next time step is 0.2. Yes, this isn’t much of a sequence, but to illustrate the computation it’ll do fine. We’ll assume that we initialized our weights and biases to have the following values:

    W = \begin{bmatrix} w_{i1} & w_{i2} & b_i \\ w_{c1} & w_{c2} & b_c \\ w_{f1} & w_{f2} & b_f \\ w_{o1} & w_{o2} & b_o \\ w_y & 0 & b_y \\ \end{bmatrix} = \begin{bmatrix} 0.5 & 0.25 & 0.01 \\ 0.3 & 0.4 & 0.05 \\ 0.03 & 0.06 & 0.002 \\ 0.02 & 0.04 & 0.001 \\ 0.6 & 0 & 0.025 \\ \end{bmatrix}

    This formulation will come in handy later for backpropagation, but you can see that each row of the matrix W has all of the parameters needed for one of the gates. The last row is the linear transformation associated with the output (we’ll get to that).

    The input gate

    It seems fashionable to start at the forget gate, but anytime I walk through a model I like to start with the input.

    Our input is [x_1, h_0, c_0]: the first sequence input, the initial hidden value, and the initial memory value. This translates to [0.1, 0, 0]. Above, I haven’t even bothered to include a column for a multiplication by the initial memory state (I would need, for example, a value w_{i3}), since this is rarely anything but 0 – people usually don’t bother considering it as part of the input, and I won’t from this point on.

    The image associated with the input gate is:

    [Image: InputGate]

    And the associated equations for the first part are:

    {\rm{net}}_{i1} = w_{i1}x_1 + w_{i2}h_0 + b_i

    i_1 = \sigma({\rm{net}}_{i1}) = 1/(1 + exp(-{\rm{net}}_{i1} ))

    I’ve used ‘net’ to mean the net input to the gate. We take a linear transformation of the input values. Another way to present the linear transformation (using T for transpose) is:

    {\rm{net}}_{i1} = [w_{i1} \hspace{1em} w_{i2}] [x_1 \hspace{1em} h_0]^T + b_i, as done on that first blog I linked to.

    The full computation is:

    {\rm{net}}_{i1} = 0.5(0.1) + 0.25(0) + 0.01 = 0.06

    i_1 = \sigma({\rm{net}}_{i1}) = 1/(1 + exp(-0.06)) = 0.515

    This value can be interpreted as the probability that we will allow the information from x_1 to enter the memory cell.

    The usual practice is to keep that value 0.515 – that is, to keep the gate partially open. Alternatively, we could make a decision as to whether the information will go forward. That is, we could generate a value u_1 \sim {\rm{Uniform}}(0, 1) and, if u_1 < 0.515, then allow the information through – open the input gate completely. This is referred to as making a stochastic decision. Depending on the decision, the gate would open (value 1) or close (value 0). For the purposes of this example, let's assume the value is 1 to make the arithmetic easy.
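    As a small sketch of that stochastic decision in numpy (the variable names are mine):

    import numpy

    i_1 = 0.515                        # the input gate activation computed above
    u_1 = numpy.random.uniform(0, 1)
    gate = 1.0 if u_1 < i_1 else 0.0   # fully open the gate, or close it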

    The second part of the input gate is related to the memory cell. It creates a proposal for the inclusion of the new information:

    {\rm{net}}_{c1} = w_{c1}x_1 + w_{c2}h_0 + b_c

    \tilde{c}_1 = \tanh({\rm{net}}_{c1}) = \frac{exp({\rm{net}}_{c1}) - exp(-{\rm{net}}_{c1})}{exp({\rm{net}}_{c1}) + exp(-{\rm{net}}_{c1})}

    The full computation is:

    {\rm{net}}_{c1} = 0.3(0.1) + 0.4(0) + 0.05 = 0.08

    \tilde{c}_1 = \tanh({\rm{net}}_{c1}) = \tanh(0.08) = 0.0798

    Note no stochastic decision is made here – this is the quantity associated with the input that we'll pass to the memory cell. We could make a stochastic decision using a tanh function, and that often happens, but not here. Why? Because this is the input signal! We need this part as it is.

    We’ll use both of these pieces together later when we update the memory cell.

    The forget gate

    The point of this gate is to decide what information needs to be removed from the network. For example, if you’re making a grocery list for this week’s groceries based on last week’s, and you bought 2 weeks’ worth of apples last week, then you don’t need to buy them this week. You can remove the apples from your list.

    The forget gate looks like this:

    [Image: ForgetGate]

    and takes similar input:

    {\rm{net}}_{f1} = w_{f1}x_1 + w_{f2}h_0 + b_f

    f_1 = \sigma({\rm{net}}_{f1}) = 1/(1 + exp(-{\rm{net}}_{f1} ))

    And the computation is also similar:

    {\rm{net}}_{f1} = 0.03(0.1) + 0.06(0) + 0.002 = 0.005

    f_1 = \sigma({\rm{net}}_{f1}) = 1/(1 + exp(-0.005)) = 0.5012

    Again, a stochastic decision could be made here as to whether the previous information should be forgotten (value 0) or allowed through (value 1). For the purposes of this example, let’s assume the value is 1.

    The memory cell

    This is the best part! We combine the new information from the input gate and remove the information we’re forgetting according to the forget gate.

    The picture is:

    [Image: MemoryCell]

    and the update looks like this:

    c_1 = i_1 \circ \tilde{c}_1 + f_1 \circ c_0

    That’s a new symbol! We need an aside.

    Aside: Hadamard product

    The Hadamard product is an element-wise product. If we have a vector a_1 = [1, 2, 3] and a vector b_1 = [9, 10, 11], then the Hadamard product would be c_1 = a_1 \circ b_1 = [(1)(9), (2)(10), (3)(11)] = [9, 20, 33].
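    In numpy, elementwise multiplication of arrays computes exactly this product:

    import numpy

    a_1 = numpy.array([1, 2, 3])
    b_1 = numpy.array([9, 10, 11])
    c_1 = a_1 * b_1   # array([ 9, 20, 33])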

    End Aside

    c_1 = 1 \circ 0.0798 + 1 \circ 0 = 0.0798

    Now that we’ve updated the memory state (another name for the memory cell), we have to think about what we want to output.

    The output gate

    In a sequence-to-sequence mapping task, like machine translation or image captioning, we might be interested in outputting a value (to the screen or to a file) for each input we see. Here, though, we have a single scalar output – essentially a regression task.

    Even though we’re not interested in sending the value to the screen, we still need to compute the output, because it becomes part of the input to the next LSTM cell.

    Here’s the image to think of:

    [Image: OutputGate]

    By now you should be thoroughly bored with these equations:

    {\rm{net}}_{o1} = w_{o1}x_1 + w_{o2}h_0 + b_o

    o_1 = \sigma({\rm{net}}_{o1}) = 1/(1 + exp(-{\rm{net}}_{o1} ))

    {\rm{net}}_{o1} = 0.02(0.1) + 0.04(0) + 0.001 = 0.003

    o_1 = \sigma({\rm{net}}_{o1}) = 1/(1 + exp(-0.003)) = 0.5007

    And we’ll make a stochastic decision as to whether we pass this output along. For the purposes of this example, let’s assume the stochastic decision results in a 1.

    The hidden layer (hidden state)

    [Image: HiddenState]

    I bet you were wondering when we’d get to this. The hidden layer is separate from the memory cell, but very related. I like to think of it as the part of the memory cell that we want to ensure persists. Here’s how we do it:

    h_1 = o_1 \circ {\rm{tanh}}(c_1)
    h_1 = 1 \circ 0.0796 = 0.0796 (yes, I’m rounding. I’ve been doing that a lot.)

    See what we did there? The output gate decides whether the signal from the memory cell gets sent forward as part of the input to the next LSTM cell.
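    If you want to check the first cell’s arithmetic, here’s a small numpy sketch (with the stochastic input, forget, and output decisions all set to 1, as assumed above):

    import numpy

    def sigmoid(z):
        return 1.0 / (1.0 + numpy.exp(-z))

    # rows of W: [w1, w2, b] for the input gate, proposal, forget gate, and output gate
    w_i = [0.5, 0.25, 0.01]
    w_c = [0.3, 0.4, 0.05]
    w_f = [0.03, 0.06, 0.002]
    w_o = [0.02, 0.04, 0.001]

    x_1, h_0, c_0 = 0.1, 0.0, 0.0

    i_1 = sigmoid(w_i[0]*x_1 + w_i[1]*h_0 + w_i[2])           # 0.515
    c_1_tilde = numpy.tanh(w_c[0]*x_1 + w_c[1]*h_0 + w_c[2])  # 0.0798
    f_1 = sigmoid(w_f[0]*x_1 + w_f[1]*h_0 + w_f[2])           # 0.5012
    o_1 = sigmoid(w_o[0]*x_1 + w_o[1]*h_0 + w_o[2])           # 0.5007

    c_1 = 1 * c_1_tilde + 1 * c_0                             # 0.0798
    h_1 = 1 * numpy.tanh(c_1)                                 # 0.0796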

    The second LSTM cell

    We’ll assume the weights are shared across LSTM cells. The equations are exactly the same, but now we use x_2 where before we used x_1, h_1 where we used h_0, and c_1 where we used c_0, etc. Let’s say we have input x_2 = 0.2, and target scalar value y = 0.08. Here are all of the familiar computations written out, with final answers given (again assuming all stochastic gate decisions result in the signal being propagated forward, and no information forgotten):

    Input gate:

    {\rm{net}}_{i2} = w_{i1}x_2 + w_{i2}h_1 + b_i
    i_2 = \sigma({\rm{net}}_{i2}) = 1/(1 + exp(-{\rm{net}}_{i2} )) = 0.5324
    {\rm{net}}_{c2} = w_{c1}x_2 + w_{c2}h_1 + b_c
    \tilde{c}_2 = \tanh({\rm{net}}_{c2}) = 0.1409

    Forget gate:

    {\rm{net}}_{f2} = w_{f1}x_2 + w_{f2}h_1 + b_f
    f_2 = \sigma({\rm{net}}_{f2}) = 1/(1 + exp(-{\rm{net}}_{f2} )) = 0.5032

    Memory cell:

    c_2 = i_2 \circ \tilde{c}_2 + f_2 \circ c_1 = 1 \circ 0.1409 + 1 \circ 0.0798 = 0.2207

    Output gate:
    {\rm{net}}_{o2} = w_{o1}x_2 + w_{o2}h_1 + b_o
    o_2 = \sigma({\rm{net}}_{o2}) = 1/(1 + exp(-{\rm{net}}_{o2} )) = 0.5021

    Hidden state:
    h_2 = o_2 \circ {\rm{tanh}}(c_2) = 1 \circ {\rm{tanh}}(0.2207) = 0.2172

    Okay, now we’ve reached the end of our sequence. It’s time to figure out the final output that we’re going to use for the error calculation. This depends exclusively on the hidden state (remember the memory cell is input to the hidden state, so we only need to use the hidden state to take into account the entire memory of our LSTM).

    The calculation:

    \hat{y} = w_y h_2 + b_y = 0.6(0.2172) + 0.025 = 0.1553

    That value \hat{y} is our final output.

    But wait, weren’t we aiming for 0.08? We need to make some changes to our model. To do that, we’ll calculate the error and backpropagate the signal to update our weights.

    The error (mean squared error, or MSE, but with only one value so ‘mean’ is irrelevant):

    E = \frac{1}{2}(y - \hat{y})^2 = 0.5(0.1553-0.08)^2 = 0.00284

    A detailed walkthrough: backpropagation through time

    This is going to get messy. There is a whole bunch of chain rule going on here. There are a few paths the derivative can take through the network, which is a great thing, actually, because it means the gradient is less prone to vanishing, but we’ll get to that. First, let’s figure out how to send our error signal back. We need the derivative with respect to our weight matrix, but to get there we have to go through all of our model components. Many thanks to Alex Graves for a beautiful thesis and to the author of this very helpful blog for filling in the gaps in my knowledge here.

    The first step is to calculate the gradient of the error with respect to the output:

    \frac{\partial E}{\partial o_t} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial o_t} = (y - \hat{y})(-w_y){\rm{tanh}}(c_t)

    We can see the dependency on the hidden state by expanding \hat{y}:

    \frac{\partial E}{\partial o_t} = (y - [w_y(h_t) + b_y])(-w_y){\rm{tanh}}(c_t) = \delta_{o_t}

    I’ll use \delta_i to refer to the partial derivative of the error with respect to i, similar to that blog post.

    Now we need to differentiate through the hidden state to get to the next part. Alternatively, we could differentiate through c_2 directly – that’s the second path the gradient can take. Actually, going directly through the memory cell saves a step (is shorter) as shown here (pages 12 – 13).

    \frac{\partial E}{\partial c_t} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial c_t} = (y - [w_y(h_t) + b_y])(-w_y)(o_t)(1 - {\rm{tanh}}^2(c_t)) = \delta_{c_t}

    Now we need to go through the input and forget gates.

    The input gate:

    \frac{\partial E}{\partial i_t} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial i_t} = \delta_{c_t} \tilde{c_t} = \delta_{i_t}

    The forget gate:

    \frac{\partial E}{\partial f_t} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial f_t} = \delta_{c_t} c_{t-1} = \delta_{f_t}

    The proposal for the new memory state:

    \frac{\partial E}{\partial \tilde{c_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial \tilde{c_t}} = \delta_{c_t} i_t = \delta_{a_t}

    The previous cell state:

    \frac{\partial E}{\partial c_{t-1}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial c_{t-1}} = \delta_{c_t} f_t = \delta_{c_{t-1}}

    The net input to the proposal:

    \frac{\partial E}{\partial {\rm{net}}_{c_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial \tilde{c}_t} \frac{\partial \tilde{c}_t}{\partial {\rm{net}}_{c_t}} = \delta_{a_t} (1-{\rm{tanh}}^2({\rm{net}}_{c_t})) = \delta_{\hat{a}_t}

    The net input to the input gate:

    \frac{\partial E}{\partial {\rm{net}}_{i_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial i_t} \frac{\partial i_t}{\partial {\rm{net}}_{i_t}} = \delta_{i_t} i_t(1 - i_t) = \delta_{\hat{i}_t}

    because of the derivative of the sigmoid function, \sigma'(z) = \sigma(z)(1 - \sigma(z))

    The net input to the forget gate:

    \frac{\partial E}{\partial {\rm{net}}_{f_t}} = \frac{\partial E}{\partial c_t} \frac{\partial c_t}{\partial f_t} \frac{\partial f_t}{\partial {\rm{net}}_{f_t}} = \delta_{f_t} f_t(1 - f_t) = \delta_{\hat{f}_t}

    The net input to the output gate:

    \frac{\partial E}{\partial {\rm{net}}_{o_t}} = \frac{\partial E}{\partial h_t} \frac{\partial h_t}{\partial o_t} \frac{\partial o_t}{\partial {\rm{net}}_{o_t}} = \delta_{o_t} o_t(1 - o_t) = \delta_{\hat{o}_t}

    Now we need to recall our definitions from way up top:

    W = \begin{bmatrix} w_{i1} & w_{i2} & b_i \\ w_{c1} & w_{c2} & b_c \\ w_{f1} & w_{f2} & b_f \\ w_{o1} & w_{o2} & b_o \\ w_y & 0 & b_y \\ \end{bmatrix}

    And let I_t be the total input at time t: [x_t, h_{t-1}, 1]^T.

    Then we can define z_t = W I_t (only the four gate rows of W are needed here), and collect all of our ‘lowest’ derivatives:

    \delta_{z_t} = [\delta_{\hat{i}_t}, \delta_{\hat{a}_t}, \delta_{\hat{f}_t}, \delta_{\hat{o}_t}]

    Then our last derivatives are:

    \frac{\partial E}{\partial I_t} = W^T \delta_{z_t}

    and

    \frac{\partial E}{\partial W_t} = \delta_{z_t} (I_t)^T (an outer product).
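    As a numpy sketch of this final assembly (the delta values here are hypothetical placeholders, not computed from the example):

    import numpy

    delta_z = numpy.array([0.01, 0.02, 0.005, 0.03])  # [delta_i, delta_a, delta_f, delta_o]
    I_t = numpy.array([0.2, 0.0796, 1.0])             # [x_t, h_{t-1}, 1]

    # the four gate rows of W from the matrix above
    W_gates = numpy.array([[0.5, 0.25, 0.01],
                           [0.3, 0.4, 0.05],
                           [0.03, 0.06, 0.002],
                           [0.02, 0.04, 0.001]])

    dE_dI = W_gates.T.dot(delta_z)     # gradient with respect to [x_t, h_{t-1}, 1]
    dE_dW = numpy.outer(delta_z, I_t)  # gradient with respect to the gate rows of W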

    And there you have it! Backpropagation through an LSTM.

    I used the typical activations here – namely, sigmoids and tanh – but this can also be done with ReLUs. I leave that to a future post.

    LSTMs in excruciating detail

    I read the blog post by Colah on Understanding LSTMs, and I mostly understood it, but I find that I learn much better when I try to explain a difficult concept to someone else.

    Here’s a more detailed walkthrough of that very informative blog post. I wrote this as an answer on StackExchange, so if you’d like to see the original question please head there. All of the images I’m using in this post come from Colah’s blog. Someday I’ll make my own images.

    Starting from the end:

    [Image: ct]

    The difference between the cell state and the output can be seen from the diagram above:

    The cell state is the bold line travelling west to east across the top. The entire green block is called the ‘cell’.

    The hidden state from the previous time step is treated as part of the input at the current time step.

    However, it’s a little harder to see the dependence between the two without doing a full walkthrough. I’ll do that here, to provide another perspective, but heavily influenced by the blog. My notation will be the same, and I’ll use images from the blog in my explanation.

    I like to think of the order of operations a little differently from the way they were presented in the blog. Personally, I like starting from the input gate. I’ll present that point of view below, but please keep in mind that the blog may very well be the best way to set up an LSTM computationally and this explanation is purely conceptual.

    Here’s what’s happening from the beginning:

    The input gate

    [Image: input]

    The input at time t is x_t and h_{t-1}. These get concatenated and fed into a nonlinear function (in this case a sigmoid). This sigmoid function is called the ‘input gate’, because it acts as a stopgap for the input. It decides stochastically which values we’re going to update at this timestep, based on the current input.

    That is (following the example from the original question), if we have an input vector x_t = [1, 2, 3] and a previous hidden state h_{t-1} = [4, 5, 6], then the input gate does the following:

    a) Concatenate x_t and h_{t-1} to give us [1, 2, 3, 4, 5, 6]

    b) Compute W_i times the concatenated vector and add the bias (in math: W_i \cdot [x_t, h_{t-1}] + b_i, where W_i is the weight matrix from the input vector to the nonlinearity; b_i is the input bias).

    Let’s assume we’re going from a six-dimensional input (the length of the concatenated input vector) to a three-dimensional decision on what states to update. That means we need a 3×6 weight matrix and a 3×1 bias vector. Let’s give those some values:

    W_i = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 & 3 & 3\end{bmatrix}

    b_i = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}

    The computation would be:

    \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 & 3 & 3\end{bmatrix} \cdot \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 22 \\ 43 \\ 64 \end{bmatrix}

    c) Feed that previous computation into a nonlinearity: i_t = \sigma (W_i \cdot [x_t, h_{t-1}] + b_i)

    \sigma(x) = \frac{1}{1 + exp(-x)} (we apply this elementwise to the values in the vector x)

    \sigma(\begin{bmatrix} 22 \\ 43 \\ 64 \end{bmatrix}) = [\frac{1}{1 + exp(-22)}, \frac{1}{1 + exp(-43)}, \frac{1}{1 + exp(-64)}] \approx [1, 1, 1]

    In English, that means we’re going to update all of our states.
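    The same computation in numpy, as a quick check:

    import numpy

    W_i = numpy.array([[1]*6, [2]*6, [3]*6])
    b_i = numpy.array([1, 1, 1])
    xh = numpy.array([1, 2, 3, 4, 5, 6])  # the concatenated [x_t, h_{t-1}]

    net = W_i.dot(xh) + b_i               # array([22, 43, 64])
    i_t = 1 / (1 + numpy.exp(-net))       # approximately [1., 1., 1.]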

    The input gate has a second part:

    d) \tilde{C_t} = tanh(W_C[x_t, h_{t-1}] + b_C)

    The point of this part is to compute how we would update the state, if we were to do so. It’s the contribution from the new input at this time step to the cell state. The computation follows the same procedure illustrated above, but with a tanh unit instead of a sigmoid unit.

    The output \tilde{C_t} is multiplied by that binary vector i_t, but we’ll cover that when we get to the cell update.

    Together, i_t tells us which states we want to update, and \tilde{C_t} tells us how we want to update them: what new information we want to add to our representation so far.

    Then comes the forget gate, which was the crux of the original question.

    The forget gate

    [Image: forget]

    The purpose of the forget gate is to remove previously-learned information that is no longer relevant. The example given in the blog is language-based, but we can also think of a sliding window. If you’re modelling a time series that is naturally represented by integers, like counts of infectious individuals in an area during a disease outbreak, then perhaps once the disease has died out in an area, you no longer want to bother considering that area when thinking about how the disease will travel next.

    Just like the input layer, the forget layer takes the hidden state from the previous time step and the new input from the current time step and concatenates them. The point is to decide stochastically what to forget and what to remember. In the previous computation, I showed a sigmoid layer output of all 1’s, but in reality it was closer to 0.999 and I rounded up.

    The computation looks a lot like what we did in the input layer:

    f_t = \sigma(W_f [x_t, h_{t-1}] + b_f)

    This will give us a vector of size 3 with values between 0 and 1. Let’s pretend it gave us:

    [0.5, 0.8, 0.9]

    Then we decide stochastically, based on these values, which of those three pieces of information to keep. One way of doing this is to generate a number from a uniform(0, 1) distribution and, if that number is less than the probability of the unit ‘turning on’ (0.5, 0.8, and 0.9 for units 1, 2, and 3 respectively), turn that unit on. Since f_t multiplies the previous cell state, a unit that is on (value 1) keeps its piece of the old state, and a unit that is off (value 0) forgets it.

    Quick note: the input layer and the forget layer are independent. If I were a betting person, I’d bet that’s a good place for parallelization.

    Updating the cell state

    [Image: cellstate]

    Now we have all we need to update the cell state. We take a combination of the information from the input and the forget gates:

    C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C_t}

    Now, this is going to be a little odd. Instead of multiplying like we’ve done before, here \circ indicates the Hadamard product, which is an entry-wise product.

    Aside: Hadamard product

    For example, if we had two vectors x_1 = [1, 2, 3] and x_2 = [3, 2, 1] and we wanted to take the Hadamard product, we’d do this:

    x_1 \circ x_2 = [(1 \cdot 3), (2 \cdot 2), (3 \cdot 1)] = [3, 4, 3]

    End Aside.

    In this way, we combine what we want to add to the cell state (input) with what we want to take away from the cell state (forget). The result is the new cell state.

    The output gate

    [Image: output]

    This will give us the new hidden state. Essentially the point of the output gate is to decide what information we want the next part of the model to take into account when updating the subsequent cell state. The example in the blog is again, language: if the noun is plural, the verb conjugation in the next step will change. In a disease model, if the susceptibility of individuals in a particular area is different than in another area, then the probability of acquiring an infection may change.

    The output layer takes the same input again, but then considers the updated cell state:

    o_t = \sigma(W_o [x_t, h_{t-1}] + b_o)

    Again, this gives us a vector of probabilities. Then we compute:

    h_t = o_t \circ tanh(C_t)

    So the current cell state and the output gate must agree on what to output.

    That is, if the result of tanh(C_t) is [0, 1, 1] after the stochastic decision has been made as to whether each unit is on or off, and the result of o_t is [0, 0, 1], then when we take the Hadamard product, we’re going to get [0, 0, 1], and only the units that were turned on by both the output gate and in the cell state will be part of the final output.

    The diagram shows that h_t goes to two places: the next cell, and to the ‘output’ – to the screen. I think that second part is optional.

    There are a lot of variants on LSTMs, but that covers the essentials!

    Still don’t get it? Leave a comment and I’ll expand my explanation.


    Autoencoders and matrix factorization

    In a paper posted to arXiv earlier (and currently under review), authors Florian Strub, Jérémie Mary, and Romaric Gaudel explain the relationship between autoencoders and matrix factorization. I thought it was a neat result. Here are some more details about the connection, in the context of recommender systems.

    Suppose we have M movies and N users.

    We know that each user u_i could have rated a maximum of M movies. Therefore u_i \in \mathcal{R}^M. We know, however, that for a sufficiently large database of movies, it’s impossible for any one user to have seen them all. We’re interested in predicting which movies user u_i would rate highly, based on known ratings from movies they’ve watched.

    The classical way to approach this problem is through matrix factorization. The goal is to retrieve information that can lead us to make good recommendations of movies to users. This information is encoded in the input (the vector of ratings associated with the user), and possibly some side information. The side information could be things like the user’s current mood – not something that’s captured explicitly already in the ratings. More on that later.

    Say we have a matrix R that has users as rows and movies as columns. Each user has only seen (and rated) a few of these movies. We have N total users and M total movies, so R \in \mathcal{R}^{N {\rm{x}} M}. Using matrix factorization, we will decompose this matrix R into two separate matrices: U \in \mathcal{R}^{N {\rm{x}} k}, which encodes user information, and V \in \mathcal{R}^{M {\rm{x}} k}, which encodes movie information. These two matrices will be dense – that is, even though the original matrix of ratings R has missing information, we’re going to find two rank k matrices U and V such that \hat{R} = UV^T.

    The objective function for this technique is:

    \min_{U, V} \sum_{(i,j) \in \mathcal{K}(R)} \left[ (r_{ij} - \bar{u}_i \bar{v}_j^T)^2 + \lambda ({\lVert \bar{u}_i \rVert}_F^2 + {\lVert \bar{v}_j \rVert}_F^2) \right]

    where:

    • \mathcal{K}(R) is the set of known ratings
    • r_{ij} is the true (known) movie rating that user i gave to movie j
    • \bar{u}_i is a dense, k-dimensional row of U that encodes user information
    • \bar{v}_j is a dense, k-dimensional row of V that encodes movie information
    • \lambda is a regularization parameter
    • {\lVert \cdot \rVert}_F is the Frobenius norm.

    The matrices U and V are the optimal solutions to this objective function. This is essentially saying that we want to find a dot product between a user vector \bar{u}_i and an item vector \bar{v}_j that’s as close as possible, in terms of squared distance, to the true rating r_{ij} in the training set. We penalize the Frobenius norm of the vectors (a form of regularization) so that we can keep everything under control, and that penalty term has a `fudge factor’ (a regularization weight) \lambda. The \lambda allows you to play with how much you want to penalize the sizes of \bar{u}_i and \bar{v}_j. One weakness of this objective (I think – please feel free to correct me!) is that you have to control the sizes of these vectors together – the single \lambda controls both norms at the same time. It might be more convenient to let more user information through, for example, and less item information, if you happen to have more user information than item information.
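    To make the objective concrete, here is a toy numpy sketch that fits U and V by gradient steps on the known entries only (the sizes, learning rate, and \lambda are illustrative, not from the paper):

    import numpy

    rng = numpy.random.RandomState(0)
    N, M, k = 4, 5, 2
    R = rng.randint(1, 6, size=(N, M)).astype(float)  # toy ratings
    known = rng.rand(N, M) < 0.6                      # mask of known ratings

    U = 0.1 * rng.randn(N, k)
    V = 0.1 * rng.randn(M, k)
    lam, lr = 0.1, 0.01

    for _ in range(2000):
        E = known * (R - U.dot(V.T))      # residuals on known entries only
        U += lr * (E.dot(V) - lam * U)    # gradient step for the user factors
        V += lr * (E.T.dot(U) - lam * V)  # gradient step for the item factors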

    Now, matrix factorization is all well and good, but we might be able to do better. Recent work has highlighted the use of autoencoders in collaborative filtering. Stacked denoising autoencoders in particular have shown success. One interesting question is: how are matrix factorization and autoencoders related?

    First, let’s quickly review denoising autoencoders. Since I’m a statistics nerd, I’ll link them to something stats-y: PCA.

    Traditionally, autoencoders are unsupervised, deterministic learning tools. They take an input x \in \mathcal{R}^{N} and perform dimensionality reduction to a hidden state h \in \mathcal{R}^{k}, where k \ll N. Then, they use this more compressed latent representation to reconstruct the input \hat{x} \in \mathcal{R}^{N}. The point is data compression: if we can store the useful/informative features of x in a smaller space, then we save memory.

    Statisticians may notice a link to Principal Components Analysis (PCA) here, and there is one: a linear autoencoder accomplishes the same objective as PCA.

    Statisticians often look at PCA as a way of summarizing the data by finding properties (or linear combinations of properties) that vary the most between observations. Another way to think of PCA is as a way to reconstruct your data, if you suddenly lost your dataset but retained the lower-dimensional information. It’s this second point of view that provides the link to autoencoders. Statisticians would say we’re maximizing the variance; machine learners would say we’re minimizing the reconstruction error. The objective is exactly the same (want more details? Look here). PCA and autoencoders have many uses, any time data need to be projected into a lower dimensional space. They’re used in image compression, for example, but today we’re going to focus on the problem of finding the most important features (principal components!) that can reconstruct the user/movie pairs of ratings. We’re also going to be talking about autoencoders with nonlinear transformations, so the connection with PCA is not preserved.

    While the technique of matrix factorization allows us to discover some latent (hidden) features that contribute to a rating, we also want to know what to do when a rating is missing. Denoising autoencoders allow us to infer missing ratings in a more dynamic manner.

    If we denote a neural network (the denoising autoencoder, in this case) by nn, we can write the objective for the collaborative filtering task per Strub et al 2016 as:

    \min \; \alpha \sum_{j \in \mathcal{C}(\tilde{x}) \cap \mathcal{K}(x)} [nn(\tilde{x})_j - x_j]^2 + \beta \sum_{j \in \mathcal{K}(x) \setminus \mathcal{C}(\tilde{x})} [nn(\tilde{x})_j - x_j]^2 + \lambda \cdot {\rm{reg}}

    where:

    • x is some input
    • \tilde{x} is a corrupted version of the input
    • \alpha is a regularizing parameter to indicate to the network how much to focus on denoising the input
    • \beta is a regularizing parameter to indicate to the network how much to focus on predicting missing input
    • K(x) is the set of all known inputs
    • C(\tilde{x}) is the set of all corrupted inputs
    • C(\tilde{x}) \cap K(x) is the set of all corrupted inputs for which we know the true input (the set of all deliberately corrupted inputs)
    • nn(\tilde{x}) is the output of the denoising autoencoder, when corrupted input \tilde{x} is provided

    Notice this is accomplishing a mixture of unsupervised and supervised learning: the first term of the objective takes input that we have deliberately corrupted from the set of known input, and tries to reconstruct a clean version of the input. Since we know the true value of the input, this is the supervised portion of the objective. The second term takes missing items and tries to infer their value (the unsupervised portion). The last term in the objective is a regularizer.

    However, we don’t have just one set of things to learn: we have both user and item information, which means we can use TWO denoising autoencoders – one to learn information about the users, and one for the items. This overcomes the issue I pointed out before about the regularization term applying to both the users and the items at the same time in matrix factorization, as there will be two separate autoencoders and thus two separate objectives. Unfortunately, it also means that we have double the work to do, so training may take longer.

    So how are the objective functions for matrix factorization and this modified denoising autoencoder related? We need to go a little further into the math of what’s happening with the denoising autoencoder to see what’s really going on here.

    If we provide an input u_i \in \mathcal{R}^{M} to a denoising autoencoder, then the model will transform that input to a latent space h \in \mathcal{R}^k via a nonlinear transformation: \sigma(W_1 u_i + b_1), where \sigma could be a sigmoid function, for example, and W_1 \in \mathcal{R}^{k {\rm{x}} M}, b_1 \in \mathcal{R}^{k}. Now, this latent representation preserves only the k most necessary pieces of information. We use these k pieces to reconstruct the input via another transformation. For the sake of argument, let’s suppose it’s a linear transform: W_2 h + b_2, where W_2 \in \mathcal{R}^{M {\rm{x}} k}, b_2 \in \mathcal{R}^{M}. We can write this in another way:

    \hat{u}_i = W_2 \, \sigma(W_1 u_i + b_1) + b_2

    so that we learn a matrix transformation from the hidden space to the output vector: f: \mathcal{R}^{k} \to \mathcal{R}^{M}. We thus learn a matrix V = W_2 \in \mathcal{R}^{M {\rm{x}} k}. Now we have a reconstructed version \hat{u}_i of our input u_i, which has been denoised and filled in as necessary.
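    As a sketch in Keras (the layer sizes and the use of Dropout as masking noise are my own illustrative choices, not the architecture from the paper), the user-side autoencoder could look like this:

    from keras.layers import Dense, Dropout, Input
    from keras.models import Model

    M, k = 100, 20                                         # number of movies, latent dimension
    ratings_in = Input(shape=(M,))
    corrupted = Dropout(0.25)(ratings_in)                  # corrupt the input
    hidden = Dense(k, activation='sigmoid')(corrupted)     # sigma(W_1 u_i + b_1)
    reconstructed = Dense(M, activation='linear')(hidden)  # W_2 h + b_2

    dae = Model(ratings_in, reconstructed)
    dae.compile(optimizer='adam', loss='mse')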

    We can do the same thing, without loss of generality, for the movies, learning a matrix U.

    The objective for the matrix factorization routine has a dot product between the user information and the item information, and is trying to find two matrices U and V to make the result as close as possible to the true rating.

    The objective for Strub et al 2016’s denoising autoencoder looks directly at the input, and constructs two separate autoencoders: one for the users and one for the movies. If you were to separate out what’s going on in the matrix factorization objective, it would look a lot like the autoencoder, especially when you consider the linear transformation that is the output of the neural network.

    So that’s the story! Autoencoders and matrix factorization are really closely linked, and provide a nice transition to deep recommender systems.

    References:

    Strub, F. et al. Hybrid Collaborative Filtering with Autoencoders. arXiv: http://arxiv.org/abs/1603.00806

    Sparse matrices: indptr

    Python is becoming an indispensable tool, but still poses challenges (especially to new programmers). If you’re like me and new to the whole scipy sparse matrix game, you might find yourself stymied by the ‘indptr’ array, which can be used to instantiate a csc_matrix or a csr_matrix object.

    The website gives an example (thanks to Python Conquers The Universe for how to post Python code):

    import numpy as np
    from scipy import sparse

    data = np.array([1, 4, 5, 2, 3, 6])
    indices = np.array([0, 2, 2, 0, 1, 2])
    indptr = np.array([0, 2, 3, 6])
    mtx = sparse.csc_matrix((data, indices, indptr), shape=(3, 3))
    

    It gives you a matrix that looks like this:

    mtx.todense()
    matrix([[1, 0, 2],[0, 0, 3],[4, 5, 6]])
    

    The explanation on the site is: “indptr points to column starts in indices and data”

    But what it’s doing is this:

    indices[indptr[i]:indptr[i+1]] gives the row positions of the nonzero entries in column i, and data[indptr[i]:indptr[i+1]] gives their values.

    For example:

    indptr[0] = 0

    indptr[1] = 2

    indices[0:2] = 0, 2 (we only get indices[0] and indices[1] using this command – indices[0:2] is exclusive of 2)

    So the 0th column has nonzero entries at rows 0 and 2

    indptr[1] = 2

    indptr[2] = 3

    indices[2:3] = 2

    So the 1st column has a nonzero entry at row 2

    indptr[2] = 3

    indptr[3] = 6

    indices[3:6] = 0, 1, 2

    So the 2nd column has nonzero entries at rows 0, 1, and 2
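    Continuing from the example above, you can check this programmatically by walking the columns yourself:

    for col in range(3):
        start, stop = indptr[col], indptr[col + 1]
        print('column', col, '- rows:', indices[start:stop], 'values:', data[start:stop])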