6  Optimizers and Learning Loops

6.1 Introduction to Optimizers

In previous chapters, we saw how to load data and how to train a linear regression model using mini-batch gradient descent. In practice, we don’t need to write our own implementation of gradient descent, as PyTorch provides a variety of built-in optimization algorithms under torch.optim. Each optimizer has its own set of hyperparameters that can be tuned. Some of the most popular optimizers include the following (a short construction sketch follows the list):

  • SGD (Stochastic Gradient Descent): This is a simple optimizer that updates the model’s parameters using the gradient of the loss with respect to the parameters
  • Adam (Adaptive Moment Estimation): This optimizer is based on the concept of momentum, which can help the optimizer to converge more quickly to a good solution. Adam also includes adaptive learning rates, which means that the optimizer can automatically adjust the learning rates of different parameters based on the historical gradient information
  • RMSprop (Root Mean Square Propagation): This optimizer scales each parameter’s learning rate by an exponentially decaying average of its squared gradients; Adam builds on the same idea by adding momentum
  • Adagrad (Adaptive Gradient Algorithm): This optimizer is designed to handle sparse data, and it adjusts the learning rate for each parameter based on the historical gradient information
  • Adadelta: This optimizer is an extension of Adagrad that seeks to reduce its aggressive, monotonically declining learning rate
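All of these optimizers live in the torch.optim module and share the same interface: you construct them with the parameters they should update plus optimizer-specific hyperparameters. A minimal construction sketch, using a small torch.nn.Linear layer as a stand-in model (the layer and all hyperparameter values here are purely illustrative):

import torch

# A tiny stand-in model so the snippet is self-contained (illustrative only)
tiny_model = torch.nn.Linear(4, 1)

# Each optimizer receives the parameters to update plus its own hyperparameters
sgd      = torch.optim.SGD(tiny_model.parameters(), lr=1e-3, momentum=0.9)
adam     = torch.optim.Adam(tiny_model.parameters(), lr=1e-3, betas=(0.9, 0.999))
rmsprop  = torch.optim.RMSprop(tiny_model.parameters(), lr=1e-3, alpha=0.99)
adagrad  = torch.optim.Adagrad(tiny_model.parameters(), lr=1e-2)
adadelta = torch.optim.Adadelta(tiny_model.parameters(), rho=0.9)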

6.2 Exercise: Linear Regression

Let’s look at how we can start using PyTorch’s optimizers by continuing the linear regression example from the previous chapter. Note that this time we will use four input features instead of the single feature from our earlier examples.

# Importing required functions
import torch
import numpy as np
from sklearn.datasets import make_regression
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# Generate dataset with linear property
X, y, coef = make_regression(
    n_samples=1500,
    n_features=4,  # Using four features
    n_informative=4,
    noise=0.3,
    coef=True,
    random_state=0,
    bias=2
)

print(f'Input feature size: {X.shape}')
Input feature size: (1500, 4)
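Because we passed coef=True, make_regression also returns the coefficients it used, so the targets follow y = X @ coef + bias + noise. A quick sanity check (an optional aside, not part of the exercise itself):

# The residual left after removing the linear part should have roughly
# the standard deviation of the noise we requested (0.3)
residual = y - (X @ coef + 2)
print(f'Residual std: {residual.std():.3f}')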

Now we will create a custom Dataset class.

# Creating our custom TabularDataset
class TabularDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        current_sample = self.data[idx]
        current_target = self.targets[idx]
        return {
            "X": torch.tensor(current_sample, dtype=torch.float),
            "y": torch.tensor(current_target, dtype=torch.float)
        }

We have modified the TabularDataset class to handle additional features. The class now takes two inputs: data, which contains our four features, and targets, which is our target variable.
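Indexing the dataset returns a single sample as a dictionary of tensors. A quick illustration (not part of the original listing):

# Wrap the full dataset just to inspect one sample
sample = TabularDataset(X, y)[0]
print(sample["X"].shape, sample["y"].shape)
# torch.Size([4]) torch.Size([])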

# Making a train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33)

# Creating Tabular Dataset
train_dataset = TabularDataset(X_train, y_train)
test_dataset = TabularDataset(X_test, y_test)

# Creating Dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)

We have divided our sample into a training set and a test set and used the TabularDataset class to create train and test objects. Finally, we created data loaders for the training set and test set using these objects.
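Each iteration over a DataLoader yields one mini-batch, with the samples from TabularDataset collated into batched tensors. A quick way to confirm the shapes (an illustrative check):

batch = next(iter(train_dataloader))
print(batch["X"].shape, batch["y"].shape)
# torch.Size([64, 4]) torch.Size([64])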

Note

In the code, the training data is shuffled by the DataLoader while the test data is not. This is common practice: shuffling the training data prevents the model from seeing samples in the same order every epoch, while keeping the test data unshuffled makes evaluation deterministic.

class Linear:
    def __init__(self, n_in, n_out):
        self.w = torch.randn(n_in, n_out).requires_grad_(True)
        self.b = torch.randn(n_out).requires_grad_(True)
        self.params = [self.w, self.b]

    def forward(self, x):
        return x @ self.w + self.b


# Initializing model
torch.manual_seed(4)
model = Linear(X.shape[1], 1)

print(f"Shape of weights: {model.w.shape}")
print(f"Shape of bias: {model.b.shape}")
Shape of weights: torch.Size([4, 1])
Shape of bias: torch.Size([1])

We are using the same linear model as before, but this time it takes four inputs instead of one.
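For comparison, the same model could be built with PyTorch’s torch.nn.Linear layer, which creates and tracks its own weight and bias. This is only a sketch of the equivalent; the hand-rolled Linear class above is what we keep using in this exercise:

# Built-in equivalent: note that nn.Linear stores its weight with shape
# (out_features, in_features), i.e. (1, 4) here, and computes x @ w.T + b
builtin_model = torch.nn.Linear(in_features=X.shape[1], out_features=1)
print(builtin_model.weight.shape, builtin_model.bias.shape)
# torch.Size([1, 4]) torch.Size([1])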

optimizer = torch.optim.SGD(model.params, lr=1e-3)

Next, we will define our optimizer. We will use PyTorch’s implementation of stochastic gradient descent (SGD) by initializing torch.optim.SGD, passing it the model parameters that should be updated during training and a learning rate (lr) hyperparameter of 1e-3.

For more information about the other available optimizers and their hyperparameters, refer to PyTorch’s torch.optim documentation.
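Switching to a different optimizer only requires changing this one line; the rest of the training loop stays the same. For example, an Adam variant might look like the following (shown commented out because the results later in this chapter were produced with SGD, and the learning rate is just an illustrative value):

# Drop-in alternative, not used for the results below:
# optimizer = torch.optim.Adam(model.params, lr=1e-2)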

def train_one_epoch(model, data_loader, optimizer):
    for batch in data_loader:
        # Taking one mini-batch
        y_pred = model.forward(batch['X']).squeeze()
        y_true = batch['y']

        # Compute the squared-error loss for this mini-batch
        loss = torch.square(y_pred - y_true).sum()

        # Computing gradients per mini-batch
        loss.backward()

        # Update model parameters and zero grad
        optimizer.step()
        optimizer.zero_grad()


def validate_one_epoch(model, data_loader, optimizer):
    loss = 0
    with torch.no_grad():
        for batch in data_loader:
            y_pred = model.forward(batch['X']).squeeze()
            y_true = batch['y']
            loss += torch.square(y_pred - y_true).sum()
    return loss/len(data_loader)

For the training loop (defined in train_one_epoch), we will go through each mini-batch and do the following:

  • Use the model to make a prediction
  • Calculate the Mean Squared Error (MSE) loss and the gradients (an equivalent formulation using PyTorch’s built-in loss function is sketched after this list)
  • Update the model parameters using the optimizer’s step() function
  • Reset the gradients to zero for the next mini-batch using the optimizer’s zero_grad() function
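As noted in the list above, the loss here is the summed squared error over a mini-batch. The same quantity can be computed with PyTorch’s built-in functional loss; here is a small self-contained check of the equivalence (using made-up tensors, not the chapter’s data):

import torch
import torch.nn.functional as F

# Made-up predictions and targets, just to demonstrate the equivalence
y_pred = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor([1.5, 2.0, 2.0])

manual = torch.square(y_pred - y_true).sum()
builtin = F.mse_loss(y_pred, y_true, reduction='sum')
print(torch.allclose(manual, builtin))  # True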

In the validation loop (defined in validate_one_epoch), we will process each mini-batch as follows:

  • Use the trained model to make a prediction
  • Accumulate the Mean Squared Error (MSE) loss and return its average over the mini-batches at the end (a per-sample variant is sketched after this list)
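Note that validate_one_epoch divides the accumulated loss by the number of mini-batches, so the value it reports is the average per-batch sum of squared errors rather than a per-sample MSE. A per-sample variant would divide by the number of samples instead; this is only a sketch of that alternative, not the function used to produce the results below:

def validate_per_sample_mse(model, data_loader):
    total_sq_error, n_samples = 0.0, 0
    with torch.no_grad():
        for batch in data_loader:
            y_pred = model.forward(batch['X']).squeeze()
            # Accumulate squared error and sample count over the epoch
            total_sq_error += torch.square(y_pred - batch['y']).sum().item()
            n_samples += batch['y'].shape[0]
    return total_sq_error / n_samples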

Now let’s run through some epochs and train our model.

for epoch in range(10):
    # run one training loop
    train_one_epoch(model, train_dataloader, optimizer)
    # run validation loop on training to compute training loss
    train_loss = validate_one_epoch(model, train_dataloader, optimizer)
    # run validation loop on testing to compute test loss
    test_loss = validate_one_epoch(model, test_dataloader, optimizer)

    print(f"Epoch {epoch},Train MSE: {train_loss:.4f} Test MSE: {test_loss:.3f}")

print(f"Actual coefficients are: \n{np.round(coef,4)} \nTrained model weights are: \n{np.round(model.w.squeeze().detach().numpy(),4)}")
print(f"Actual Bias term is {2} \nTrained model bias term is \n{model.b.squeeze().detach().numpy().item():.4f}")
Epoch 0,Train MSE: 13657.7461 Test MSE: 16039.912
Epoch 1,Train MSE: 267.4445 Test MSE: 319.128
Epoch 2,Train MSE: 11.0232 Test MSE: 11.422
Epoch 3,Train MSE: 5.9071 Test MSE: 5.284
Epoch 4,Train MSE: 5.8251 Test MSE: 5.184
Epoch 5,Train MSE: 5.8193 Test MSE: 5.183
Epoch 6,Train MSE: 5.8243 Test MSE: 5.176
Epoch 7,Train MSE: 5.8181 Test MSE: 5.243
Epoch 8,Train MSE: 5.8192 Test MSE: 5.192
Epoch 9,Train MSE: 5.8160 Test MSE: 5.230
Actual coefficients are: 
[63.0061 44.1452 84.3648  9.3378] 
Trained model weights are: 
[63.0008 44.1527 84.3725  9.3218]
Actual Bias term is 2 
Trained model bias term is 
1.9968

As shown above, our model has fit the data well. The actual coefficients and bias used to generate the random data roughly match the weights and bias terms of our model.

6.3 Conclusion

In PyTorch, optimizers are used to update the parameters of a model during training. Optimizers adjust the parameters of the model based on the gradients of the loss function with respect to the parameters, in order to minimize the loss.
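Whichever optimizer you choose, one update follows the same backward/step/zero_grad pattern used throughout this chapter. A minimal, self-contained illustration with made-up data:

import torch

# A single parameter tensor and some made-up data
w = torch.randn(3, 1, requires_grad=True)
x, y = torch.randn(8, 3), torch.randn(8, 1)

optimizer = torch.optim.SGD([w], lr=0.1)

loss = torch.square(x @ w - y).sum()  # compute the loss
loss.backward()                       # gradients of the loss w.r.t. w
optimizer.step()                      # update w using the gradients
optimizer.zero_grad()                 # clear gradients before the next step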

There are many different optimizers available in PyTorch, including SGD, Adam, RMSprop, and more. You can choose the optimizer that works best for your specific problem and model architecture.

6.4 References