5  Datasets and Dataloaders

5.1 Introduction to PyTorch Dataset

To train any machine learning model, we need data. Typically, this data needs to be represented as PyTorch tensors in order to be fed into a model. In PyTorch, a Dataset is an abstract class that represents a dataset: it provides a way to access the data and defines how the data should be processed. Because Dataset is an abstract class, you need to create a subclass to use it. If you are not familiar with OOP fundamentals such as abstract base classes and subclasses, I suggest reading this blog.

The main use of a Dataset in PyTorch is to provide access to the data you want to use to train a machine learning model. By creating a subclass of the Dataset class, you define how the data should be loaded and processed. Once you have created a Dataset subclass, you can wrap it in a PyTorch DataLoader, an iterable that yields batches of data from your dataset, and then use the DataLoader to train a model in PyTorch.
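Before we build a real one, here is a minimal sketch of that workflow. The class name MyDataset and the random tensors are made up purely for illustration; the full, runnable version of this pattern appears in the exercise later in this chapter.

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """A minimal map-style dataset wrapping two in-memory tensors."""
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        ## Number of samples in the dataset
        return len(self.inputs)

    def __getitem__(self, idx):
        ## Return one (input, target) pair for the given index
        return self.inputs[idx], self.targets[idx]

dataset = MyDataset(torch.randn(100, 3), torch.randn(100))
loader = DataLoader(dataset, batch_size=16, shuffle=True)
for inputs, targets in loader:
    pass  ## each iteration yields a mini-batch ready for training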

Let’s look at the Dataset documentation.

from torch.utils.data import Dataset
print(Dataset.__doc__)
An abstract class representing a :class:`Dataset`.

    All datasets that represent a map from keys to data samples should subclass
    it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
    data sample for a given key. Subclasses could also optionally overwrite
    :meth:`__len__`, which is expected to return the size of the dataset by many
    :class:`~torch.utils.data.Sampler` implementations and the default options
    of :class:`~torch.utils.data.DataLoader`.

    .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs a index
      sampler that yields integral indices.  To make it work with a map-style
      dataset with non-integral indices/keys, a custom sampler must be provided.
    

As we can see above, Dataset is an abstract base class that requires us to implement the __getitem__ method and optionally override the __len__ method to return the size of the dataset.
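The note about non-integral keys deserves a quick illustration. Here is a small, hypothetical example (the DictDataset and KeySampler classes below are made up for this sketch): a map-style dataset keyed by strings needs a custom Sampler that yields those keys instead of integer indices.

from torch.utils.data import Dataset, DataLoader, Sampler

class DictDataset(Dataset):
    """A map-style dataset whose keys are strings rather than integers."""
    def __init__(self, mapping):
        self.mapping = mapping

    def __len__(self):
        return len(self.mapping)

    def __getitem__(self, key):
        return self.mapping[key]

class KeySampler(Sampler):
    """Yields dictionary keys so the DataLoader can index the dataset."""
    def __init__(self, keys):
        self.keys = list(keys)

    def __iter__(self):
        return iter(self.keys)

    def __len__(self):
        return len(self.keys)

mapping = {"a": 1, "b": 2, "c": 3}
dict_dataset = DictDataset(mapping)
loader = DataLoader(dict_dataset, sampler=KeySampler(mapping), batch_size=2)
for batch in loader:
    print(batch)  ## tensor([1, 2]) then tensor([3])

For the rest of this chapter we will stick to the usual case of integer indices.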

5.2 Exercise: Creating our first custom dataset class

In this exercise, we will continue from our previous linear regression example, where we trained a linear regression model using batch gradient descent, and replace batch gradient descent with mini-batch gradient descent using Dataset and DataLoader.

Let’s start by importing the required libraries and creating our linear data.

## Importing required functions
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
import seaborn as sns
import time
from IPython.display import clear_output
sns.set_style("dark")
%matplotlib inline

def plot_data(x, y, y_pred=None, label=None):
    clear_output(wait=True)
    sns.scatterplot(x=x.squeeze(), y=y)
    if y_pred is not None:
        sns.lineplot(x=x.squeeze(), y=y_pred.squeeze(), color='red')
    plt.xlabel("Input")
    plt.ylabel("Target")
    if label: 
        plt.title(label)
    plt.show()
    time.sleep(0.5)

## Generate dataset with linear property
X, y, coef = make_regression(
    n_samples=1500,
    n_features=1,
    n_informative=1,
    noise=0.3,
    coef=True,
    random_state=0,
    bias=2
)
## Converting it into a Pandas dataframe
data = pd.DataFrame({"X":X.squeeze(), "y":y})

## Visualizing the relationship b/w X and Y
plot_data(data.X, data.y, label=f"Coefficient: {coef.item():.2f}, Bias: {2}")

## Printing top 5 rows
print(data.head())

Fig 5.1. Visualizing our linear data

          X         y
0 -0.234216  1.901007
1 -2.030684  1.274535
2  0.651781  1.832122
3  2.014060  2.936113
4  0.829986  2.488750

Let’s create a custom Dataset class named TabularDataset by inheriting from the Dataset abstract base class and implementing the __len__ and __getitem__ methods.

Note

Methods with double underscores (also known as “dunder” methods) are special methods in Python. When the len() function is called on an object, Python will automatically call the object’s __len__ method to get the length of the object. Similarly, when the object is indexed using the square bracket operator (e.g. obj[key]), Python will call the object’s __getitem__ method to retrieve the value at the specified index.
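For instance, here is a tiny, self-contained illustration (unrelated to our regression data) of how Python routes len() and indexing to these dunder methods:

class Squares:
    """len(obj) dispatches to __len__, obj[idx] dispatches to __getitem__."""
    def __len__(self):
        return 5

    def __getitem__(self, idx):
        return idx ** 2

squares = Squares()
print(len(squares))  ## 5 -> calls Squares.__len__
print(squares[3])    ## 9 -> calls Squares.__getitem__

Our TabularDataset below implements exactly these two methods on top of a pandas dataframe.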

class TabularDataset(Dataset):
    def __init__(self, data):
        self.data = data.X
        self.targets = data.y

    def __len__(self): 
        ## Return the number of samples in the dataset
        return self.data.shape[0]

    def __getitem__(self, idx):
        ## Fetch the feature and the target at position idx
        current_sample = self.data.iloc[idx]
        current_target = self.targets.iloc[idx]
        return {
            "X": torch.tensor(current_sample, dtype=torch.float), 
            "y": torch.tensor(current_target, dtype=torch.float)
        }

The TabularDataset class has three methods: __init__, __len__, and __getitem__.

  • The __init__ method is called when the class is instantiated and takes a pandas dataframe as input.
  • The __len__ method returns the number of samples in the dataset.
  • The __getitem__ method returns a sample from the dataset at a given index idx, in the form of a dictionary with keys X and y.

We can create an object of the TabularDataset class using our regression example, and then call the __len__ and __getitem__ methods on it.

# create an object of the TabularDataset class
custom_dataset = TabularDataset(data)

# get the length of the dataset
size = len(custom_dataset)
print(f'Dataset size: {size} \n')

# get the sample at index 0
sample = custom_dataset[0]
print(f'Indexing on 0 index: \n {sample}')
Dataset size: 1500 

Indexing on 0 index: 
 {'X': tensor(-0.2342), 'y': tensor(1.9010)}

5.3 Dataloaders

While training a machine learning model, it is often more efficient to pass a group of samples, or a “mini-batch,” to the model at once, rather than processing one sample at a time. Additionally, we may want to reshuffle the data at the end of each epoch and use multiple worker processes to speed up data loading.

The PyTorch DataLoader class helps us achieve these goals by creating an iterable from our Dataset object. The DataLoader can efficiently batch and shuffle the data, and it can spawn multiple worker processes to speed up data loading.
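Beyond batch_size and shuffle, which we use below, DataLoader accepts several other commonly used arguments. The following is only a sketch with arbitrary values, not something this chapter requires:

from torch.utils.data import DataLoader

loader = DataLoader(
    custom_dataset,   ## any map-style Dataset, such as our TabularDataset
    batch_size=64,    ## number of samples per mini-batch
    shuffle=True,     ## reshuffle the data at the start of every epoch
    num_workers=2,    ## load batches in 2 background worker processes
    drop_last=True,   ## drop the final, smaller batch if the dataset size
                      ## is not divisible by batch_size
)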

Let’s create a dataloader object from our TabularDataset object.

from torch.utils.data import DataLoader
custom_dataloader = DataLoader(custom_dataset, batch_size=64, shuffle=True)

Now let’s look at one minibatch:

batch = next(iter(custom_dataloader))
batch
{'X': tensor([ 1.5328,  0.4394,  1.1542, -0.6743,  0.3194, -0.5863,  0.8216, -0.9489,
         -1.5408, -1.0546,  0.9501,  0.3382, -0.0357, -0.4675,  0.7231,  0.9694,
          0.8526, -1.4466, -1.0994, -1.2141, -0.7999,  1.3750, -1.1268, -0.7923,
          0.0940, -0.1043, -0.0393,  1.2961, -0.4961,  1.0170, -0.6677, -0.7946,
          0.9364,  2.5944, -0.2201, -0.5376,  1.6581,  0.2348,  0.5766, -1.6326,
          0.0175, -0.3328, -1.7442, -1.4464,  0.1047,  0.0633, -0.5963,  0.7775,
         -0.3005, -0.7565, -0.7994, -0.9605,  0.2461, -0.7047,  0.3769,  0.5410,
         -0.6524,  1.5430,  1.0480, -0.5028,  1.3676, -0.2904,  0.2671,  1.3014]),
 'y': tensor([2.2464, 2.5879, 2.7877, 1.5178, 2.2373, 1.9258, 2.1885, 2.2265, 1.4833,
         1.4586, 2.7604, 2.4890, 1.9327, 1.5933, 2.4738, 2.4766, 2.4160, 1.3819,
         1.4487, 0.8635, 1.4181, 2.8232, 1.2373, 2.0373, 1.7182, 2.0764, 2.1702,
         2.8312, 1.7150, 2.3457, 1.9804, 1.5520, 2.5604, 3.3382, 1.9031, 1.2880,
         2.9112, 1.9802, 2.0943, 1.3462, 2.0327, 1.9207, 1.2720, 1.8974, 2.5618,
         2.4288, 2.0103, 2.5764, 1.4878, 1.6772, 1.6701, 1.5360, 2.3156, 1.7014,
         2.3102, 2.1018, 2.4023, 2.0447, 2.8422, 1.3625, 2.6827, 1.9267, 2.1790,
         2.7582])}
print(f"Input feature shape: {batch['X'].shape}")
print(f"Target  shape: {batch['y'].shape}")
Input feature shape: torch.Size([64])
Target  shape: torch.Size([64])

As we can see above, we got our first batch of 64 data samples.
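Under the hood, the DataLoader’s default collate function stacks the per-sample dictionaries returned by __getitem__ into one dictionary of batched tensors. In recent PyTorch versions this function is exposed as torch.utils.data.default_collate, so we can reproduce the batching by hand (a quick check, assuming such a version):

from torch.utils.data import default_collate

## Collate three individual samples into one mini-batch manually
samples = [custom_dataset[i] for i in range(3)]
mini_batch = default_collate(samples)
print(mini_batch['X'].shape)  ## torch.Size([3])
print(mini_batch['y'].shape)  ## torch.Size([3])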

5.4 Exercise: Linear regression with mini-batch gradient descent

Let’s get our model definition from the last chapter.

class Linear:
    def __init__(self, n_in, n_out):
        self.w = torch.randn(n_in, n_out).requires_grad_(True)
        self.b = torch.randn(n_out).requires_grad_(True)
        self.params = [self.w, self.b]
        
    def forward(self, x):
        return x @ self.w + self.b
    
## Initializing model
torch.manual_seed(4)
model = Linear(X.shape[1], 1)

## Evaluating the randomly initialized model
loss = 0
with torch.no_grad():
    for batch in iter(custom_dataloader):
        y_pred = model.forward(batch['X'].unsqueeze(-1)).numpy()
        y_true = batch['y'].numpy()
        loss += sum((y_pred.squeeze() - y_true.squeeze())**2)
print(f"MSE loss: {loss/len(custom_dataset):.4f}")
MSE loss: 7.2129

This MSE of 7.2129 is poor considering that in the last chapter we were able to achieve around 0.09, but that is expected: the model’s weights are still random. Let’s update the previous chapter’s step function to work on mini-batches.

def step(custom_dataloader, model, lr=5e-3):
    ## Iterate through the mini-batches
    for batch in iter(custom_dataloader):
        ## Taking one mini-batch of inputs and targets
        y_pred = model.forward(batch['X'].unsqueeze(-1))
        y_true = batch['y']
        
        ## Computing the sum of squared errors for this mini-batch
        loss = sum((y_pred.squeeze() - y_true.squeeze())**2)
    
        ## Computing gradients per mini-batch
        loss.backward()
    
        ## Updating parameters per mini-batch
        with torch.no_grad():
            for param in model.params:
                param -= lr*param.grad.data
                param.grad.data.zero_()
                
    ## Compute loss for the epoch
    loss = 0
    with torch.no_grad():
        for batch in iter(custom_dataloader):
            y_pred = model.forward(batch['X'].unsqueeze(-1))
            y_true = batch['y']
            loss += sum((y_pred.squeeze() - y_true.squeeze())**2)
    return loss/len(custom_dataset)

Let’s run a few epochs.

model = Linear(1,1)
for epoch in range(3):
    loss = step(custom_dataloader, model)
    print(f"Epoch: {epoch}, MSE: {loss:.4f}")
    
print(f"\nTrue coefficient is {coef.item():.2f} and predicted coefficient is {model.w.item():.2f}.")
print(f"True bias term is {2} and predicted coefficient is {model.b.item():.2f}.")
Epoch: 0, MSE: 0.0879
Epoch: 1, MSE: 0.0881
Epoch: 2, MSE: 0.0885

True coefficient is 0.48 and predicted coefficient is 0.47.
True bias term is 2 and predicted bias is 1.97.

Let’s visualize the fit.

y_pred = []
with torch.no_grad():
    for batch in iter(DataLoader(custom_dataset, batch_size=64, shuffle=False)):
        y_pred.append(model.forward(batch['X'].unsqueeze(-1)).numpy())
plot_data(X, y, y_pred=np.concatenate(y_pred))

Fig 5.2. Visualizing our fit

From the results above, we can see that mini-batch gradient descent converges very quickly: after a single epoch the mean squared error (MSE) is already around 0.088, and further epochs barely change it. The performance of the model is now similar to the performance we observed in the last chapter.

5.5 Conclusion

In PyTorch, a Dataset is an abstract class that represents a dataset: it provides a way to access the data and defines how the data should be processed. To use it, you create a subclass that implements __getitem__ (and, usually, __len__).

A DataLoader is an iterable that provides access to a dataset in mini-batches. It can be used to efficiently batch and shuffle the data, and it can use multiple worker processes to speed up data loading.

The Dataset and DataLoader classes are an important part of PyTorch’s data loading and processing functionality. They are often used together when training machine learning models because they provide a convenient and efficient way to access and process data.

5.6 References