A while ago I wrote a post describing how to use Pylearn2 for training neural networks. By my very modest standards it became quite popular, so I thought I should follow it up with a more advanced example that introduces better termination criteria, momentum and learning rate adjustment.
It should be noted that this post is mostly aimed at people who do not want to use Pylearn2 in the recommended way, which is to use YAML files for setting up the training configuration. If you have the choice, using YAML is much easier, and documentation for it is available on the Pylearn2 website.
This tutorial shows how to use some more advanced techniques available in Pylearn2 to solve the Pima Indians Diabetes problem. This is a binary classification problem, and to run the code you will need to download the dataset, which can be found here.
We create the dataset by reading it from a file. The class also defines a split method for dividing the dataset into two parts:
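class Pima(DenseDesignMatrix):  # base class assumed; the class line is missing from the listing
    def __init__(self, X=None, y=None):
        if X is None:
            X = []
            y = []
            with open(PIMA_DATASET) as f:
                for line in f:
                    features, label = line.rsplit(',', 1)
                    X.append([float(v) for v in features.split(',')])
                    # One-hot labels (inferred from the np.argmax(label) check later on).
                    if int(label) == 0:
                        y.append([1, 0])
                    else:
                        y.append([0, 1])
            X = np.asarray(X)
            X = scaler.fit_transform(X)
            y = np.asarray(y)
        super(Pima, self).__init__(X=X, y=y)

    @property
    def nr_inputs(self):
        # Number of input features (inferred from its use in classify()).
        return len(self.X[0])

    def split(self, prop=.8):
        # Split in order: the first `prop` fraction, then the remainder.
        cutoff = int(len(self.y) * prop)
        X1, X2 = self.X[:cutoff], self.X[cutoff:]
        y1, y2 = self.y[:cutoff], self.y[cutoff:]
        return Pima(X1, y1), Pima(X2, y2)

    def __len__(self):
        return self.X.shape[0]

    def __iter__(self):
        # Yield (features, label) pairs.
        return itertools.izip_longest(self.X, self.y)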
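To make the label encoding and the split concrete, here is a minimal standalone sketch that does not depend on Pylearn2. The helper names `parse_line` and `split_rows` and the sample line are illustrative assumptions, and the one-hot encoding is inferred from the `np.argmax(label)` comparison used later in this post.

```python
def parse_line(line):
    """Turn 'f1,f2,...,label' into (features, one_hot_label)."""
    features, label = line.strip().rsplit(',', 1)
    x = [float(v) for v in features.split(',')]
    # One-hot encode the binary label: class 0 -> [1, 0], class 1 -> [0, 1].
    y = [1, 0] if int(label) == 0 else [0, 1]
    return x, y


def split_rows(rows, prop=0.8):
    """Split rows in order: the first `prop` fraction, then the rest."""
    cutoff = int(len(rows) * prop)
    return rows[:cutoff], rows[cutoff:]


# Hypothetical sample line with eight features and a trailing label.
x, y = parse_line("6,148,72,35,0,33.6,0.627,50,1\n")

# Chaining two 70/30 splits gives train, validation and test parts.
rows = list(range(100))
train, rest = split_rows(rows, 0.7)
valid, test = split_rows(rest, 0.7)
```

Note that the split is purely positional, so the rows are not shuffled first.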
We create three datasets for training, validation and test, and initialise a hidden sigmoid layer with 20 neurons and a 2-neuron softmax output layer with the name “output”.
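ds_train = Pima()
ds_train, ds_valid = ds_train.split(0.7)
ds_valid, ds_test = ds_valid.split(0.7)
hidden_layer = mlp.Sigmoid(layer_name='hidden', dim=20, irange=.05, init_bias=1.)
output_layer = mlp.Softmax(2, 'output', irange=.05)
# Combine the layers into the network used as `ann` below (assumed; this step is not shown in the listing).
ann = mlp.MLP([hidden_layer, output_layer], nvis=ds_train.nr_inputs)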
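For intuition about what these two layers compute, the following pure-Python sketch runs one forward pass through a sigmoid hidden layer and a softmax output layer. It is not Pylearn2 code: the helper names are made up, the output bias of 0 is an assumption, and the weights are drawn uniformly from [-irange, irange], which is how I understand the `irange` argument.

```python
import math
import random


def init_layer(n_in, n_out, irange, bias, rng):
    """Weights uniform in [-irange, irange]; every bias set to `bias`."""
    W = [[rng.uniform(-irange, irange) for _ in range(n_out)]
         for _ in range(n_in)]
    return W, [bias] * n_out


def affine(x, W, b):
    """Compute x.W + b for one input vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


rng = random.Random(0)
# 8 input features, 20 hidden units, 2 output classes.
W1, b1 = init_layer(8, 20, 0.05, 1.0, rng)
W2, b2 = init_layer(20, 2, 0.05, 0.0, rng)

x = [0.0] * 8                            # a dummy, already-scaled input vector
hidden = sigmoid(affine(x, W1, b1))
probs = softmax(affine(hidden, W2, b2))  # two class probabilities
```

The predicted class is simply the index of the larger of the two probabilities.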
In my previous post I described the simplest possible termination criterion, which just stops after a given number of epochs. A more interesting criterion is to halt training once it has stopped improving. This can be done with a monitor-based criterion that listens on a certain channel and measures how some value on that channel changes. In this example our monitor-based criterion measures the classification error on the output layer (the channel name for this is “<layer name>_misclass”, so in our case the name is “output_misclass”) and stops after 50 epochs without any improvement.
termination_criterion = MonitorBased(channel_name='output_misclass',
                                     N=50, prop_decrease=0.0)  # prop_decrease value assumed
(It should be noted at this point that I have not put any work into finding optimal values for the hyperparameters in this tutorial, so they might be quite far from optimal.)
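The idea behind such a criterion can be sketched in a few lines of plain Python. This is a simplified illustration and not Pylearn2's `MonitorBased` implementation; the class name, the error values and the shortened patience of 3 epochs are all made up for the example.

```python
class NoImprovementStopper(object):
    """Signal a stop after `patience` epochs in a row without a new best value."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float('inf')
        self.countdown = patience

    def continue_learning(self, value):
        if value < self.best:
            self.best = value
            self.countdown = self.patience   # improvement: reset the countdown
        else:
            self.countdown -= 1              # no improvement this epoch
        return self.countdown > 0


# Hypothetical validation errors per epoch; patience of 3 instead of 50.
errors = [0.40, 0.35, 0.36, 0.34, 0.34, 0.35, 0.37, 0.33]
stopper = NoImprovementStopper(patience=3)
epochs_run = 0
for err in errors:
    epochs_run += 1
    if not stopper.continue_learning(err):
        break
```

With these values the loop stops before it ever sees the final 0.33, which is the price of a short patience: a larger value such as 50 gives training more time to recover.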
Momentum carries part of the previous weight update over into the next one, which can help the network avoid getting stuck in a local minimum. A momentum of m means that a fraction m of the previous weight update is added to the current one. It is common to adjust the momentum during training, starting with a lower value and increasing it as training settles into a (hopefully) more stable minimum.
In Pylearn2 this is easily done by defining a momentum learning rule and a momentum adjustor:
initial_momentum = .5
final_momentum = .99
start = 1
saturate = 50
momentum_adjustor = learning_rule.MomentumAdjustor(final_momentum, start, saturate)
momentum_rule = learning_rule.Momentum(initial_momentum)
We start with a momentum of 0.5 and gradually adjust it between epochs 1 and 50 so that it reaches 0.99 at epoch 50.
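As a rough illustration of both pieces, here is a standalone sketch of a momentum schedule and of a single momentum update. It assumes the adjustment is a linear interpolation between the start and saturate epochs, which is how I read `MomentumAdjustor`; the function names and the toy objective are hypothetical.

```python
def momentum_at(epoch, initial, final, start, saturate):
    """Linearly interpolate momentum from `initial` at epoch `start`
    to `final` at epoch `saturate`, holding it constant outside that range."""
    if saturate <= start:
        return final if epoch >= start else initial
    alpha = float(epoch - start) / float(saturate - start)
    alpha = min(1.0, max(0.0, alpha))
    return initial * (1.0 - alpha) + final * alpha


def momentum_step(w, velocity, grad, learning_rate, momentum):
    """One update: the plain gradient step plus a fraction `momentum`
    of the previous step."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


# Momentum at a few epochs for the settings used above.
schedule = [momentum_at(e, 0.5, 0.99, 1, 50) for e in (0, 1, 25, 50, 60)]

# Minimise f(w) = w**2 (gradient 2*w) with a fixed momentum of 0.5.
w, velocity = 5.0, 0.0
for _ in range(200):
    w, velocity = momentum_step(w, velocity, 2.0 * w,
                                learning_rate=0.05, momentum=0.5)
```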
Learning rate adjustment
As in the previous post, we will use the Stochastic Gradient Descent algorithm for training the network. This algorithm has a learning rate parameter that determines the size of the weight changes from one iteration to the next. A smaller learning rate makes learning slower but more precise, since it takes smaller steps on each iteration. A larger value gives faster learning but can cause it to overshoot and miss the optimum.
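The effect is easy to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2 with three hypothetical learning rates; it has nothing to do with Pylearn2 and only illustrates slow, fast and diverging behaviour.

```python
def descend(learning_rate, steps=20, x=1.0):
    """Run plain gradient descent on f(x) = x**2 (gradient 2*x)."""
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
    return x


small = descend(0.01)   # slow: still far from the optimum at 0
medium = descend(0.3)   # converges quickly
large = descend(1.1)    # overshoots on every step and diverges
```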
A good strategy can therefore be to start with a larger learning rate and decrease it as learning gets closer to the optimum.
In Pylearn2 this is done by using a learning rate adjustor:
start = 1
saturate = 50
decay_factor = .1
learning_rate_adjustor = sgd.LinearDecayOverEpoch(start, saturate, decay_factor)
This adjustor starts at epoch 1 and decreases the learning rate linearly until epoch 50, by which point it has shrunk to 10% of its initial value (the decay factor of 0.1).
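As far as I can tell, `LinearDecayOverEpoch` shrinks the learning rate linearly until it reaches `decay_factor` times the initial rate at the saturate epoch. The sketch below shows that kind of schedule in plain Python; the function name is hypothetical and the exact per-epoch values inside Pylearn2 may differ slightly from this simple interpolation.

```python
def learning_rate_at(epoch, base_lr, start, saturate, decay_factor):
    """Shrink the rate linearly from base_lr at `start` to
    base_lr * decay_factor at `saturate`; constant outside that range."""
    if epoch <= start:
        return base_lr
    if epoch >= saturate:
        return base_lr * decay_factor
    fraction = float(epoch - start) / float(saturate - start)
    return base_lr * (1.0 - fraction * (1.0 - decay_factor))


# Learning rate at a few epochs for the settings used in this post.
rates = [learning_rate_at(e, 0.05, 1, 50, 0.1) for e in (1, 25, 50, 80)]
```

With an initial rate of 0.05 this ends at roughly 0.005 and stays there.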
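We create the trainer and set it up for our network like this:
trainer = sgd.SGD(learning_rate=.05, batch_size=10, monitoring_dataset=ds_valid,
                  termination_criterion=termination_criterion, learning_rule=momentum_rule)
trainer.setup(ann, ds_train)  # assumed; the listing is cut off after the first line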
Note that we pass the validation dataset as monitoring_dataset, so the monitor-based termination criterion is measured against it. Had we used the training set as monitoring_dataset, we would be training and measuring performance on the same data, which could easily lead to overfitting.
We need a way to keep track of the best model found during training (measured against the validation set). This is done with a monitor that saves the best model to a file, which we can later read to get the globally best model:
monitor_save_best = best_params.MonitorBasedSaveBest('output_misclass',
                                                     '/tmp/best.pkl')
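Conceptually, such a monitor only has to remember the lowest value seen so far and write the model to disk whenever that value improves. Here is a minimal sketch of the idea using `pickle`; the class name, the dict "models" and the error values are hypothetical, and this is not how Pylearn2 implements `MonitorBasedSaveBest` internally.

```python
import os
import pickle
import tempfile


class SaveBest(object):
    """Pickle the model whenever the monitored value reaches a new minimum."""

    def __init__(self, save_path):
        self.save_path = save_path
        self.best = float('inf')

    def on_monitor(self, model, value):
        if value < self.best:
            self.best = value
            with open(self.save_path, 'wb') as f:
                pickle.dump(model, f)


# Hypothetical stand-in "models" (plain dicts) and validation errors.
path = os.path.join(tempfile.mkdtemp(), 'best.pkl')
saver = SaveBest(path)
for epoch, err in enumerate([0.40, 0.31, 0.35, 0.33]):
    saver.on_monitor({'epoch': epoch}, err)

with open(path, 'rb') as f:
    best_model = pickle.load(f)   # the model from the epoch with error 0.31
```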
We are now ready to start training:
while True:
    trainer.train(dataset=ds_train)
    ann.monitor.report_epoch()
    ann.monitor()
    monitor_save_best.on_monitor(ann, ds_valid, trainer)
    if not trainer.continue_learning(ann):
        break
    momentum_adjustor.on_monitor(ann, ds_valid, trainer)
    learning_rate_adjustor.on_monitor(ann, ds_valid, trainer)
After each epoch we:
- Check whether the current model is the best one found so far and, if so, save it.
- Check whether the termination criterion has been met, in which case we are finished.
- Adjust the momentum and learning rate for the next epoch.
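The order of these steps matters, so here is a library-free sketch of the loop with the three steps passed in as callables. Everything in it (the function name, the hooks and the error values) is hypothetical and only meant to show the control flow.

```python
def run_training(train_epoch, save_if_best, should_continue, adjust):
    """Call the hooks in the order listed above; return the epochs run."""
    epoch = 0
    while True:
        epoch += 1
        value = train_epoch(epoch)      # train and measure the monitored value
        save_if_best(epoch, value)      # 1. keep the best model so far
        if not should_continue(value):  # 2. stop when the criterion says so
            return epoch
        adjust(epoch)                   # 3. update momentum and learning rate


# Hypothetical hooks that only record what happens.
log = []
errors = {1: 0.4, 2: 0.3, 3: 0.5}
epochs = run_training(
    train_epoch=lambda e: errors[e],
    save_if_best=lambda e, v: log.append(('save?', e)),
    should_continue=lambda v: v < 0.5,
    adjust=lambda e: log.append(('adjust', e)),
)
```

Because the best-model check runs before the termination check, the model from the final epoch still gets a chance to be saved.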
When the training is done we need to load the globally best model:
ann = serial.load('/tmp/best.pkl')
We can then finally evaluate this model on the datasets. For this we create a helper function that classifies an input vector and returns the prediction, and a function that tests a model's accuracy on a given dataset:
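def classify(inp):
    # Classify a single input vector and return the predicted class index.
    inp = np.asarray(inp)
    inp.shape = (1, ds_train.nr_inputs)
    return np.argmax(ann.fprop(theano.shared(inp, name='inputs')).eval())

def score(dataset):  # function name assumed; its definition line is missing from the listing
    # Count and print how many examples in the dataset are classified correctly.
    nr_correct = 0
    for features, label in dataset:
        if classify(features) == np.argmax(label):
            nr_correct += 1
    print '%s/%s correct' % (nr_correct, len(dataset))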
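The same accuracy calculation can be tried out without Theano or a trained network. In this sketch the classifier is a made-up rule and the dataset is four hand-written (features, one-hot label) pairs, so the names and numbers are purely illustrative.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predict, dataset):
    """Fraction of (features, one_hot_label) pairs classified correctly."""
    correct = sum(1 for features, label in dataset
                  if argmax(predict(features)) == argmax(label))
    return float(correct) / len(dataset)


# Hypothetical classifier: class 1 exactly when the single feature is positive.
def toy_predict(features):
    return [0.2, 0.8] if features[0] > 0 else [0.9, 0.1]


toy_data = [([1.0], [0, 1]), ([-2.0], [1, 0]), ([3.0], [1, 0]), ([-1.0], [1, 0])]
acc = accuracy(toy_predict, toy_data)
```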
Running the complete code (which can be found here) yields the following result, at least for me:
Accuracy of train set:
Accuracy of validation set:
Accuracy of test set: