Deep learning

Deep learning is a machine learning technique that uses neural networks with multiple layers.

Neural Network

A neural network processes a numerical representation of unstructured data, fed into an input layer as a "feature vector", and generates a result at the output layer by passing it through a mathematical construct made of a network of hidden layers (see the YouTube video "Neural Network the ground up"). "Deep" learning means many hidden layers.

A classic learning example of neural network usage is image classification, such as the handwritten digits of the MNIST dataset.

A neuron holds a function that returns a number between 0 and 1. For example, in simple image classification, each input neuron may hold the grey value of one pixel of a 28x28-pixel image (784 neurons). This number is called the activation. At the output layer, the number in each neuron represents the probability that the corresponding output is the expected response. Neurons are connected, and each connection is weighted.

Convolutional neural networks (CNNs) allow the input size to change without retraining. The output of a regression neural network is numeric, and the output of a classification neural network is a class.

The value of neuron 'j' in the next layer is computed by the classical logistic equation, which takes into account the activations a(i) of the previous layer's neurons (i being the index over the n inputs, from 1 to n) and the weight w(i,j) of each connection from a(i) to neuron j:

$$ a_j = \sigma\left(\sum_{i=1}^{n} w_{ij}\, a_i + b_j\right) $$

To keep the activation between 0 and 1, the weighted sum is passed through the sigmoid function σ. The bias b_j is a number that sets the threshold at which the neuron becomes active.
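The computation described above can be sketched in plain Python. This is a minimal illustration rather than code from the referenced notebook: the function names `sigmoid` and `neuron_activation` and all sample values are assumptions chosen for the example.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_activation(inputs, weights, bias):
    """Weighted sum of the previous layer's activations plus the bias, then sigmoid."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical values: three activations from the previous layer
previous_layer = [0.0, 0.5, 1.0]
weights_to_j = [0.4, -0.2, 0.1]

# With a zero bias the weighted sum decides the activation alone
print(neuron_activation(previous_layer, weights_to_j, bias=0.0))
# A large negative bias keeps the neuron close to 0 (inactive)
print(neuron_activation(previous_layer, weights_to_j, bias=-5.0))
```

Changing the bias shifts the point at which the weighted sum starts to push the activation towards 1.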

Modern neural networks rarely use the sigmoid function in hidden layers anymore; they use the Rectified Linear Unit (ReLU) function instead.

The input and output of a neural network can be an image or a series of numbers that could represent text, audio, or a time series.

The simplest architecture is the perceptron, represented by the following diagram:

There are four types of neurons in a neural network:

  1. Input Neurons - We map each input neuron to one element in the feature vector.
  2. Hidden Neurons - Hidden neurons let the neural network build abstractions and process the input into the output. Each layer receives all the outputs of the previous layer.
  3. Output Neurons - Each output neuron calculates one part of the output.
  4. Bias Neurons - They work like the y-intercept of a linear equation: each one feeds a constant 1 as input to the next layer.

Neurons are also called nodes, units, or summations. See the sigmoid play notebook to understand the effect of bias and weights.

Training refers to the process that determines good weight values.

Different activation functions (also called transfer functions) can be used, such as hyperbolic tangent, sigmoid/logistic, Rectified Linear Unit (ReLU), Softmax (used for the output of classification neural networks), and linear (used for the output of regression neural networks, or for 2-class classification).
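To make the list of activation functions concrete, here is a small sketch in plain Python. The helper names `relu` and `softmax` and the sample scores are assumptions for illustration; deep learning frameworks ship their own optimized versions.

```python
import math


def relu(x: float) -> float:
    """Rectified Linear Unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, x)


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow in exp() without changing the result
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for three classes
probs = softmax(scores)

print([relu(x) for x in (-3.0, 0.0, 2.5)])
print(math.tanh(0.5))  # hyperbolic tangent squashes values into (-1, 1)
print(probs, sum(probs))
```

Softmax preserves the ordering of the scores, so the class with the highest raw score also gets the highest probability.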

The ReLU activation function is popular in deep learning because gradient descent needs the derivative of the activation function. With the sigmoid function, the derivative quickly saturates to zero as the input moves away from zero, which is not the case for ReLU with positive inputs.
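The saturation effect can be checked numerically. The sketch below compares both derivatives at a few input values; the function names are assumptions, and the ReLU derivative at exactly 0 is set to 0 by convention.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z: float) -> float:
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_derivative(z: float) -> float:
    """1 for positive inputs, 0 otherwise (the value at 0 is a convention)."""
    return 1.0 if z > 0 else 0.0


for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_derivative(z):.6f}  relu'={relu_derivative(z):.1f}")
```

The sigmoid derivative shrinks towards zero as z grows, while the ReLU derivative stays at 1 for every positive input, so gradients are not scaled down layer after layer.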

The two most used Python frameworks for deep learning are TensorFlow/Keras (Google) and PyTorch (Facebook).

Classification neural network architecture

The general architecture of a classification neural network.

Hyperparameter                     Classification
Input layer shape (in_features)    Same as the number of features
Hidden layer(s)                    Problem specific; minimum = 1, maximum = unlimited
Neurons per hidden layer           Problem specific; generally 10 to 512
Output layer shape (out_features)  Binary: 1; multi-class: 1 per class
Hidden layer activation            Usually ReLU, but can be many others
Output activation                  Binary: sigmoid; multi-class: softmax
Loss function                      Binary: binary cross entropy; multi-class: cross entropy
Optimizer                          SGD (stochastic gradient descent), Adam (see torch.optim for more options)
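The multi-class choices in the table above can be put together as follows. This is a sketch under stated assumptions: the feature count, class count, hidden size, learning rate, and random data are all hypothetical. Note that `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, so the model itself ends with a linear layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

NUM_FEATURES = 2  # hypothetical number of input features
NUM_CLASSES = 3   # hypothetical number of classes

model = nn.Sequential(
    nn.Linear(in_features=NUM_FEATURES, out_features=16),  # input -> hidden
    nn.ReLU(),                                             # hidden layer activation
    nn.Linear(in_features=16, out_features=NUM_CLASSES),   # one output per class
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Random stand-in data: 8 samples with integer class labels
X = torch.randn(8, NUM_FEATURES)
y = torch.randint(0, NUM_CLASSES, (8,))

# One training step
logits = model(X)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Softmax turns the logits into per-class probabilities for prediction
probs = torch.softmax(logits.detach(), dim=1)
predicted_classes = probs.argmax(dim=1)
print(logits.shape, loss.item(), predicted_classes)
```

For a binary problem, the same structure would use a single output and `nn.BCEWithLogitsLoss`, which combines the sigmoid and the binary cross entropy.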

Below is an example of a very simple NN in PyTorch:

from torch import nn

# Two linear layers: 2 input features -> 5 hidden units -> 1 output
model_0 = nn.Sequential(
    nn.Linear(in_features=2, out_features=5),  # layer 1
    nn.Linear(in_features=5, out_features=1),  # layer 2
)


Alternatively, subclass PyTorch's nn.Module, as demonstrated in the classifier.ipynb notebook, which finds the circle classes in the sklearn circles dataset, or in multiclass-classifier.ipynb for multi-class classification.
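The subclassing approach can be sketched as follows. This is not the code from the notebooks mentioned above: the class name `CircleClassifier`, the hidden size, the learning rate, and the synthetic data (points labeled by whether they fall inside the unit circle) are assumptions made for illustration.

```python
import torch
from torch import nn


class CircleClassifier(nn.Module):  # hypothetical name
    def __init__(self, in_features: int = 2, hidden_units: int = 8):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden_units),
            nn.ReLU(),  # the non-linearity lets the model learn a curved boundary
            nn.Linear(hidden_units, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)  # raw logits, one per sample


torch.manual_seed(42)
model = CircleClassifier()

# Synthetic stand-in data: label 1 when the point lies inside the unit circle
X = torch.randn(16, 2)
y = (X.pow(2).sum(dim=1) < 1.0).float().unsqueeze(1)

loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross entropy in one step
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step
model.train()
logits = model(X)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inference: logits -> probabilities -> hard 0/1 predictions
model.eval()
with torch.inference_mode():
    predictions = torch.round(torch.sigmoid(model(X)))
print(loss.item(), predictions.flatten())
```

Subclassing gives full control over `forward`, which matters once the data flow is no longer a straight sequence of layers.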


As with previous ML problems, we can use supervised learning (a picture and its corresponding class) and unsupervised learning. For images or voice, self-supervised learning generates the supervisory signals for training data sets by looking at the relationships in the input data.

Transfer learning takes what a first neural network has learned and uses it as the input to a second NN.


  1. When the training loss is much lower than the test loss, the model is "overfitting", so training time is being wasted.
  2. When both losses are identical, trying to regularize the model is a waste of time.
  3. To optimize deep learning, we need to maximize compute-bound processing by reducing the time spent on memory transfers and other overhead. Bandwidth cost comes from moving data from CPU to GPU, from one node to another, or even from CUDA global memory to CUDA shared memory.

Computer Image

Addresses how a computer sees images.

Convolutional Neural Network

A convolutional neural network processes images by assigning learnable weights and biases to various aspects/objects in the image, so that it can differentiate one from the other. It can capture the spatial and temporal dependencies in an image through the application of relevant filters. An image has three matrices of values, each matching the size of the picture (H*W), one per RGB channel. A CNN reduces the size of the matrices without losing the meaning. To do so, it uses the concept of a kernel: a window that shifts over the image.

A typical structure of a convolutional neural network:

Input layer -> [Convolutional layer -> activation layer -> pooling layer] -> Output layer

The layers between [] can be replicated.

Every layer in a neural network tries to compress data from a higher-dimensional space to a lower-dimensional space. Below is an example of these layers:

# Convolutional layer
nn.Conv2d(in_channels=input_shape, out_channels=hidden_units, kernel_size=3, stride=1, padding=1),
nn.ReLU(),  # activation layer
# pooling layer
nn.MaxPool2d(kernel_size=2, stride=2),    
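  • Conv2d slides learnable kernels over the image and compresses the information it holds into feature maps. With kernel_size=3, stride=1, and padding=1, the height and width stay the same.
  • MaxPool2d takes the maximum value from each portion of a tensor and disregards the rest. With kernel_size=2 and stride=2, it halves the height and width.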
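The effect of these layers on tensor shapes can be checked with a short sketch. The batch size, the single input channel, the 10 output channels, and the 28x28 image size are assumptions chosen for illustration.

```python
import torch
from torch import nn

conv_block = nn.Sequential(
    # Convolutional layer: 1 input channel (grayscale), 10 learned feature maps
    nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),  # activation layer
    nn.MaxPool2d(kernel_size=2, stride=2),  # pooling layer
)

# Hypothetical batch of 4 grayscale 28x28 images: (batch, channels, height, width)
images = torch.randn(4, 1, 28, 28)

after_conv = conv_block[0](images)  # convolution only
after_pool = conv_block(images)     # full block

print(after_conv.shape)  # torch.Size([4, 10, 28, 28])
print(after_pool.shape)  # torch.Size([4, 10, 14, 14])
```

The convolution changes the number of channels while the padding keeps the 28x28 size; the pooling layer is what halves the height and width.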

See this CNN explainer tool.

A simple image dataset example uses Fashion MNIST together with PyTorch Image Models, a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts.

The non-linear classifier and one CNN is in

MIT - Convolutional Neural Network presentation - video

Transfer Learning

Take an existing pre-trained model and fine-tune its parameters on our own data. This helps get better results with less data, at lower cost and in less time. In computer vision, ImageNet contains millions of images on which models have been trained.

PyTorch provides pre-trained models, and so does Hugging Face. PyTorch Image Models (timm) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts. Papers with Code is a collection of the latest state-of-the-art machine learning papers with code implementations attached.

The custom data going into the model needs to be prepared in the same way as the original training data that went into the model:

import torchvision

# DEFAULT selects the best available pretrained weights for efficientnet_b0
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
# Get the transforms used to create our pretrained weights
transformer = weights.transforms()

The transformer is used to create the data loaders:

train_dl,test_dl, classes=data_setup.create_data_loaders(

And then take an existing model. Bigger models often perform better, but the choice may be constrained by the type of device used and its hardware capacity. efficientnet_b0 has 5,288,548 parameters.
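A parameter count like the one quoted above can be obtained by summing the number of elements in each parameter tensor. The sketch below uses a tiny stand-in model instead of efficientnet_b0 so that it runs without downloading weights; `count_parameters` is a hypothetical helper, not a PyTorch function.

```python
from torch import nn


def count_parameters(model: nn.Module, trainable_only: bool = False) -> int:
    """Sum the element counts of the model's parameter tensors."""
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


# Stand-in model: Linear(2, 5) has 2*5 weights + 5 biases, Linear(5, 1) has 5 + 1
tiny_model = nn.Sequential(nn.Linear(2, 5), nn.ReLU(), nn.Linear(5, 1))
print(count_parameters(tiny_model))  # 21

# Freezing a layer removes its parameters from the trainable count
for param in tiny_model[0].parameters():
    param.requires_grad = False
print(count_parameters(tiny_model, trainable_only=True))  # 6
```

The trainable count is the more useful figure for transfer learning, where most pretrained layers are frozen and only a small head is trained.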

efficientnet_b0 parts

efficientnet_b0 comes in three main parts:

  • features: A collection of convolutional layers and other various activation layers to learn a base representation of vision data.
  • avgpool: Takes the average of the output of the features layer(s) and turns it into a feature vector.
  • classifier: Turns the feature vector into a vector with the same dimensionality as the number of required output classes (1000 by default, since efficientnet_b0 is pretrained on ImageNet with 1000 classes).

The process of transfer learning usually goes: freeze some base layers of a pretrained model (typically the features section) and then adjust the output layers (also called head/classifier layers) to suit your needs.

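for param in model.features.parameters():  # Freeze the features
    param.requires_grad = False

# Replace the classifier head with one sized for our own classes
model.classifier = torch.nn.Sequential(
    torch.nn.Dropout(p=0.2, inplace=True),
    # Assumption: the original snippet is cut off here. A linear layer mapping
    # efficientnet_b0's 1280 features to our classes is the usual completion.
    torch.nn.Linear(in_features=1280, out_features=len(classes)),
)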

Dropout layers randomly remove connections between two neural network layers with a probability of p. This practice is meant to help regularize (prevent overfitting) a model by making sure the connections that remain learn features to compensate for the removal of the other connections.
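The behavior of a dropout layer can be observed directly. In this sketch, which assumes p=0.5 and an input of all ones for readability, `nn.Dropout` zeroes elements of its input at random during training and scales the survivors by 1/(1-p); in evaluation mode it passes the input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

# Training mode: each element is zeroed with probability p,
# and the remaining ones are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is disabled and the input is returned as is
dropout.eval()
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

This is why `model.eval()` should be called before making predictions: otherwise the random zeroing would still be applied at inference time.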

See PyTorch transfer learning for image classification code.

Sources of information