Técnicas de IA para Biologia

3 - Activation, Loss and Optimization

André Lamúrias

Introduction

Summary

  • The vanishing gradients problem
  • ReLU to the rescue
  • Different activations: when and how
  • Loss functions
  • Optimizers
  • Overfitting and model selection
  • Regularization methods in ANN

Activation, Loss and Optimization

Vanishing gradients

Backpropagation in Activation and Loss

  • Output neuron $n$ of layer $k$ receives input from neuron $m$ of layer $i$ through weight $w_{mkn}$ (the superscript $j$ indexes the training example)
  • $$\Delta w_{mkn}^j = - \eta \frac{\delta E_{kn}^j}{\delta s_{kn}^j} \frac{\delta s_{kn}^j}{\delta net_{kn}^j}\frac{\delta net_{kn}^j}{\delta w_{mkn}} = \eta (t^j - s_{kn}^j) s_{kn}^j (1-s_{kn}^j) s_{im}^j = \eta \delta_{kn} s_{im}^j$$
  • For a weight $m$ on hidden layer $i$, we must propagate the output error backwards from all neurons ahead
  • $$\Delta w_{min}^j= - \eta \left( \sum\limits_{p} \frac{\delta E_{kp}^j}{\delta s_{kp}^j} \frac{\delta s_{kp}^j}{\delta net_{kp}^j}\frac{\delta net_{kp}^j}{\delta s_{in}^j} \right) \frac{\delta s_{in}^j}{\delta net_{in}^j}\frac{\delta net_{in}^j}{\delta w_{min}} $$
  • If the activation derivatives $\frac{\delta s}{\delta net}$ are small (vanishing gradients), backpropagation becomes ineffective as we increase depth
  • This happens with the sigmoid activation (and similar ones, such as tanh): the sigmoid derivative is at most $0.25$, so a product of many such factors shrinks exponentially with depth
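
A quick numerical sketch of this effect (the depth of 10 layers and the input value are illustrative, not from the slides):

```python
import tensorflow as tf

# Gradient of a chain of 10 sigmoids: the chain rule multiplies one
# sigmoid derivative (at most 0.25) per layer, so the result is tiny.
x = tf.Variable(0.5)
with tf.GradientTape() as tape:
    y = x
    for _ in range(10):  # 10 stacked sigmoid "layers"
        y = tf.sigmoid(y)
print(tape.gradient(y, x).numpy())  # on the order of 1e-7
```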

Vanishing gradients

  • Single hidden layer, sigmoid, works fine here

Vanishing gradients

  • Single hidden layer, sigmoid, doesn't work here with 8 neurons

Vanishing gradients

  • Increasing depth does not seem to help

  • Sigmoid activation saturates and gradients vanish with large coefficients

Activation, Loss and Optimization

Rectified Linear Unit

ReLU

Rectified Linear Unit (ReLU)

  • Sigmoid activation units saturate
$$y_i = \frac{1}{1+e^{-x_i}}$$

ReLU

Rectified Linear Unit (ReLU)

  • The same happens with hyperbolic tangent
$$y_i = \frac{e^{x_i} - e^{-x_i}}{e^{x_i} + e^{-x_i}}$$

ReLU

Rectified Linear Unit (ReLU)

  • Rectified linear units do not have this problem
$$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ 0 & x_i \leq 0 \\ \end{array} \right.$$

ReLU

  • Sigmoid activation, 3 layers

ReLU

  • ReLU activation, 3 layers

ReLU

  • ReLU activation, 4 layers

ReLU

Rectified Linear Unit (ReLU)

  • Advantages of ReLU activation:
    • Fast to compute
    • Does not saturate for positive values, where the gradient is always 1
  • Disadvantage:
    • ReLU units can "die" if training makes their weights very negative
    • The unit will output 0 and the gradient will become 0, so it will not "revive"
  • There are variants that try to fix this problem

ReLU

(Some) ReLU variants

  • Simple ReLU can die if coefficients are negative
$$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ 0 & x_i \leq 0 \\ \end{array} \right.$$

ReLU

ReLU variant: Leaky ReLU

  • Leaky ReLU gradient is never 0
$$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ \frac{x_i}{a_i} & x_i \leq 0 \\ \end{array} \right.$$

ReLU

ReLU variant: Leaky ReLU

  • Note: in TensorFlow the negative slope is a multiplicative factor $a_i$, as sketched below
$$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ a_i x_i & x_i \leq 0 \\ \end{array} \right.$$
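
A minimal TensorFlow sketch of this (the slope value 0.01 is an illustrative choice, not from the slides):

```python
import tensorflow as tf

# Leaky ReLU: negative inputs are multiplied by a small slope (alpha).
x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])
print(tf.nn.leaky_relu(x, alpha=0.01).numpy())
# [-0.02 -0.01  0.    1.    2.  ]
```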

ReLU

ReLU variant: Parametric ReLU

  • Same as leaky, but $a_i$ is also learned
$$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ \frac{x_i}{a_i} & x_i \leq 0 \\ \end{array} \right.$$

ReLU

ReLU variant: Randomized Leaky ReLU

  • Similar, but $a_i \sim U(l,u)$ during training (the average of $l$ and $u$ is used at test time)
$$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ a_i x_i & x_i \leq 0 \\ \end{array} \right.$$

ReLU

Comparing ReLU variants


Empirical Evaluation of Rectified Activations in Convolutional Network (Xu et al., 2015)

  • Compared on 2 data sets
    • CIFAR-10: 60000 32x32 color images in 10 classes of 6000 each
    • CIFAR-100: 60000 32x32 color images in 100 classes of 600 each

CReLU

  • Concatenated ReLU (CReLU) combines two ReLUs, applied to $x$ and $-x$
  • $$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ 0 & x_i \leq 0 \\ \end{array} \right. \qquad z_i= \left\{ \begin{array}{ll} 0 & x_i \gt 0 \\ -x_i & x_i \leq 0 \\ \end{array} \right.$$

Shang et al., Understanding and Improving CNN via CReLUs, 2016

ELU

Exponential Linear Unit

  • Exponential in negative part
$$y_i= \left\{ \begin{array}{ll} x_i & x_i \gt 0 \\ a(e^{x_i}-1) & x_i \leq 0 \\ \end{array} \right.$$

Clevert et al., Fast and Accurate Deep Network Learning by ELUs, 2015

Activation, Loss and Optimization

Activations: which, when, why?

Choosing activations

Hidden layer activations

  • Hidden layers perform nonlinear transformations
    • Without nonlinear activation functions, all layers would just amount to a single linear transformation
  • Activation functions should be fast to compute
  • Activation functions should avoid vanishing gradients
  • This is why ReLU (especially its leaky variants) is the recommended choice for hidden layers, as in the sketch after this list
    • Except for specific applications.
      • E.g. LSTM, Long short-term memory recurrent networks
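
A minimal Keras sketch of these recommendations (layer widths are placeholders, not values from the slides):

```python
import tensorflow as tf

# Hidden layers use ReLU-family activations; the output stays linear
# (appropriate for regression, as discussed in the next slides).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64),
    tf.keras.layers.LeakyReLU(),  # leaky variant as its own layer
    tf.keras.layers.Dense(1),     # linear output
])
```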

Choosing activations

Output layer activations

  • Output layers are a different case.
    • Choice depends on what we want the model to do
  • For regression, output should generally be linear
    • We do not want bounded values and there is little need for nonlinearity in the last layer
  • For binary classification, sigmoid is a good choice
    • The output value in $[0,1]$ is useful as a representation of the probability of class $C_1$, like in logistic regression
  • Sigmoid is also good for multilabel classification
    • One example may fit with several labels at the same time
    • Use one sigmoid output per label

Choosing activations

Output layer activations

  • For multiclass classification, use softmax:
    • Note: multiclass means each example fits only one of several classes
    • $$\sigma:\mathbb{R}^K \rightarrow [0,1]^K \qquad \sigma(\vec{x})_j= \frac{e^{x_j}}{\sum\limits_{k=1}^K e^{x_k}}$$
  • Softmax returns a vector where $\sigma_j \in [0,1]$ and $\sum\limits_{k=1}^K \sigma_k = 1$
  • This can represent the probability of the example belonging to each class $C_j$
  • Softmax is a generalization of the logistic function
    • It combines the activations of several neurons
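
A sketch of the output-layer choices above in Keras (output sizes are illustrative):

```python
import tensorflow as tf

regression = tf.keras.layers.Dense(1)                         # linear output
binary     = tf.keras.layers.Dense(1, activation="sigmoid")   # P(C_1)
multilabel = tf.keras.layers.Dense(5, activation="sigmoid")   # one sigmoid per label
multiclass = tf.keras.layers.Dense(10, activation="softmax")  # probabilities sum to 1
```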

Activation, Loss and Optimization

Loss and likelihood

Likelihood

Basic concepts

  • We have a set of labelled data
  • $$\left\{(\vec{x}^1,y^1), ..., (\vec{x}^n,y^n)\right\}$$
  • We want to approximate some function $F(X) : X \rightarrow Y$ by fitting our parameters
  • Given some training set, what are the best parameter values?

Simple example, linear regression

$$y = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + \theta_{n+1}$$
  • We have a set of $(x,y)$ examples and want to fit the best line:
  • $$y = \theta_1 x + \theta_2$$

Likelihood

What to optimize?

  • Assume $y$ is a function of $x$ plus some error: $$ y = F(x) + \epsilon $$
  • We want to approximate $F(x)$ with some $g(x,\theta)$
  • Assuming $\epsilon \sim N(0,\sigma^2)$ and $g(x,\theta) \approx F(x)$, then:
  • $$p(y|x)\sim\mathcal{N}(g(x,\theta),\sigma^2) $$
  • Given $\mathcal{X}=\{ x^t,y^t \}_{t=1}^{n}$ and knowing that $p(x,y)=p(y|x)p(x)$
  • $$p(X,Y)=\prod_{t=1}^{n}p(x^t,y^t)= \prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)$$

Likelihood

What to optimize?

  • The probability of $(X,Y)$ given $g(x,\theta)$ is the likelihood of $\theta$:
  • $$l(\theta|\mathcal{X})=\prod_{t=1}^{n}p(\vec{x}^t,y^t)= \prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)$$

Likelihood

  • The examples $(\vec{x},y)$ are randomly sampled from all possible values
  • But $\theta$ is not a random variable
  • Find the $\theta$ for which the data is most probable
    • In other words, find the $\theta$ of maximum likelihood

Likelihood

Maximum likelihood for linear regression

$$l(\theta|\mathcal{X})=\prod_{t=1}^{n}p(x^t,y^t)= \prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)$$
  • First, take the logarithm (same maximum)
  • $$L(\theta|\mathcal{X})=log\left(\prod_{t=1}^{n}p(y^t|x^t)\times\prod_{t=1}^{n}p(x^t)\right)$$
  • We ignore $p(X)$, since it's independent of $\theta$
  • $$L(\theta|\mathcal{X}) \propto log\left(\prod_{t=1}^{n}p(y^t|x^t)\right)$$
  • Replace the expression for the normal distribution:
  • $$\mathcal{L}(\theta|\mathcal{X})\propto log\prod_{t=1}^{n}\frac{1}{\sigma \sqrt {2\pi } } e^{- [y^t - g(x^t|\theta)]^2 /2\sigma^2 }$$

Likelihood

Maximum likelihood for linear regression

$$\mathcal{L}(\theta|\mathcal{X})\propto log\prod_{t=1}^{n}\frac{1}{\sigma \sqrt {2\pi } } e^{- [y^t - g(x^t|\theta)]^2 /2\sigma^2 }$$
  • Simplify:
  • $$\mathcal{L}(\theta|\mathcal{X})\propto log\prod_{t=1}^{n}e^{- [y^t - g(x^t|\theta)]^2}$$ $$\mathcal{L}(\theta|\mathcal{X})\propto -\sum_{t=1}^{n} [y^t - g(x^t|\theta)]^2$$

Likelihood

Maximum likelihood for linear regression

$$\mathcal{L}(\theta|\mathcal{X})\propto -\sum_{t=1}^{n} [y^t - g(x^t|\theta)]^2$$
  • Max(likelihood) = Min(squared error)
    • Note: the squared error is often written like this for convenience:
    • $$E(\theta|\mathcal{X})=\frac{1}{2}\sum_{t=1}^{n} [y^t - g(x^t|\theta)]^2$$

Likelihood

  • Having the Loss function, we do gradient descent

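
A minimal gradient-descent sketch for this squared-error loss, in TensorFlow (the data and learning rate are illustrative):

```python
import tensorflow as tf

# Fit y = theta_1 * x + theta_2 by minimizing the squared error.
x = tf.constant([0.0, 1.0, 2.0, 3.0])
y = tf.constant([1.0, 3.0, 5.0, 7.0])   # generated by y = 2x + 1

theta = tf.Variable([0.0, 0.0])          # [theta_1, theta_2]
opt = tf.keras.optimizers.SGD(learning_rate=0.05)

for step in range(200):
    with tf.GradientTape() as tape:
        pred = theta[0] * x + theta[1]
        loss = 0.5 * tf.reduce_sum((y - pred) ** 2)   # E(theta|X)
    grads = tape.gradient(loss, [theta])
    opt.apply_gradients(zip(grads, [theta]))

print(theta.numpy())  # approaches [2.0, 1.0]
```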

Maximum Likelihood

Finding a loss function by ML

$$\theta_{ML} = \underset{\theta}{\operatorname{arg}\,\operatorname{max}}\;P(Y|X;\theta) = \underset{\theta}{\operatorname{arg}\,\operatorname{max}}\;\sum\limits_{i=1}^{m}\operatorname{log}\,P(y^i|\vec{x}^i;\theta)$$
  • We want to maximize likelihood
  • This means minimizing cross entropy between model and data
  • Loss function depends on the model output:
    • Regression: linear output, mean squared error
    • Binary classification: class probability, sigmoid output, logistic loss
      • (Also for multilabel classification, with probability for each label)
    • N-ary classification: softmax output, softmax cross-entropy:
    • $$-\sum\limits_{c=1}^{C} y_c \operatorname{log}\frac{e^{a_c}}{\sum_{k=1}^{C} e^{a_k}}$$
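
These correspond to standard Keras losses; a small sketch (the tensors are illustrative):

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()         # regression, linear output
bce = tf.keras.losses.BinaryCrossentropy()       # binary/multilabel, sigmoid
cce = tf.keras.losses.CategoricalCrossentropy()  # multiclass, softmax

y_true = tf.constant([[0.0, 1.0, 0.0]])  # one-hot target: class 2 of 3
y_pred = tf.constant([[0.1, 0.8, 0.1]])  # softmax output
print(cce(y_true, y_pred).numpy())       # -log(0.8) ≈ 0.223
```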

Activation, Loss and Optimization

Optimizers

Optimizers

Minimizing the loss function

  • We want to minimize the loss function (e.g. cross-entropy for ML) to obtain $\theta$ from some data
  • Numerical optimization is outside the scope of this course
    • But it's useful to have some knowledge of the optimizers

Optimizers

Minimizing the loss function

  • So far we saw tf.optimizers.SGD
    • Basic gradient descent algorithm, single learning rate.
    • Stochastic gradient descent: use gradient computed at each example, selected at random
    • Mini-batch gradient descent: updates after computing the total gradient from a batch of randomly selected examples.
    • Can include momentum (and you should use momentum, in general)
  • This is just an alias for the tf.keras.optimizers.SGD class
    • We'll be using Keras explicitly from now on
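
A sketch of instantiating SGD in Keras, with and without momentum (0.9 is a common choice, not a value from the slides):

```python
import tensorflow as tf

sgd          = tf.keras.optimizers.SGD(learning_rate=0.01)
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
# model.compile(optimizer=sgd_momentum, loss="mse")
```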

Optimizers

Minimizing the loss function:

  • Different parameters may best be changed at different rates
    •  tf.keras.optimizers.Adagrad
    • Keeps sum of past (squared) gradients for all parameters
    • Divides learning rate of each parameter by this sum
    • Parameters with small gradients will have larger learning rates, and vice-versa
    • Since Adagrad sums previous gradients, learning rates will shrink
      • (good for convex problems)
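
A sketch of the Adagrad update described above, with gradient $g$, accumulator $r$, and a small $\epsilon$ for numerical stability:

$$r \leftarrow r + g^2 \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{r} + \epsilon}\, g$$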

Optimizers

Minimizing the loss function:

  • Some parameters may be left with too large or too small gradients
    •  tf.keras.optimizers.RMSProp
    • Keeps a moving average of the squared gradients and takes its root (RMS)
    • Divides gradient by this moving RMS
    • Updates will tend to be similar for all parameters.
    • Since it uses a moving average, learning rates don't shrink
      • Good for non-convex problems, and often used in recurrent neural networks
    • Most famous unpublished optimizer (proposed in Hinton's lecture slides)
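
A sketch of the RMSProp update, with decay rate $\rho$ (a common choice is around 0.9):

$$r \leftarrow \rho\, r + (1-\rho)\, g^2 \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{r} + \epsilon}\, g$$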

Optimizers

Minimizing the loss function

  •  tf.keras.optimizers.Adam
    • Adaptive Moment Estimation (Adam)
    • Combines momentum with per-parameter learning rates, using exponentially decaying averages of past gradients and squared gradients
  •  tf.keras.optimizers.AdamW
    • Similar to Adam, but with decoupled Weight Decay
    • Generalizes better than Adam
    • Fast to learn but may have convergence problems
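
A sketch of the two in Keras (AdamW ships with recent TensorFlow versions; the hyperparameters shown are the library defaults, as an assumption):

```python
import tensorflow as tf

adam  = tf.keras.optimizers.Adam(learning_rate=0.001)
adamw = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.004)
```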

How to choose?

  • There is no solid theoretical foundation for this
  • So you must choose empirically
    • Which is just a fancy way of saying try and see what works...

Learning Rate

Choosing the best learning rate

  • Optimizers can have other parameters, but all have a learning rate
  • Too high a learning rate can lead to convergence problems

Learning Rate

  • However, if learning rate is too small training can take too long
  • Try to make it as high as you can while still converging to low error
    • (you can experiment with a subset of your training set, even if overfitting)

Batch Normalization

Normalizing (standardizing) activations

  • Compute running averages and standard deviations during training
    • And standardize the inputs to each layer
  • Just like we do for the inputs to the network, do for hidden layers too
    • Makes learning easier by preventing extreme values
    • Eliminates shifts in mean and variance during training
    • Reduces the need for each layer to adapt to the changes in the previous one
  • This can be done easily in Keras
    • The mean, standard deviation and rescaling can all be part of backpropagation
      • AutoDiff takes care of the derivatives
    • So we can add batch normalization as an additional layer
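
A minimal sketch of batch normalization as a layer in Keras (layer sizes are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),  # standardize, then rescale
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1),
])
```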

Activation, Loss and Optimization

Overfitting and Validation

Overfitting and Validation

The goal of (supervised) learning is prediction

  • And we want to predict outside of what we know

Overfitting

  • The problem of adjusting too much to training data
    • and losing generalization
  • Two ways of solving this:
    • Select the right model: model selection
    • Adjust training: regularization

Overfitting and Validation

How to check for overfitting

  • We need to evaluate performance outside the training set
    • Test set: we need to keep this for final evaluation of error rate
  • We can use a validation set
  • Or we can use cross-validation

Overfitting and Validation

How to check for overfitting

  • Cross-Validation:
    • Split the training set into K folds; train on the other k-1 folds, validate on each held-out fold, and average the validation results


Overfitting and Validation

How to check for overfitting

  • Option 1: Cross-validation on training set, test
    • Good when data is scarcer
    • Better estimate of true error
    • More computationally demanding
  • Option 2: train set, validation set (to prevent overfitting), test set
    • Good when we have lots of data (which is generally the case for DL)
  • Cross-validation is widely used outside deep learning
  • With deep learning, a simple train/validation split is more common
    • Deep networks take some time to train
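
A sketch of the train/validation approach in Keras (random data and sizes are placeholders):

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 10).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# validation_split holds out the last 20% of the training data
history = model.fit(X, y, validation_split=0.2, epochs=10, verbose=0)
print(history.history["loss"][-1], history.history["val_loss"][-1])
```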

Overfitting and Validation

Estimating the true error

  • True error: the expected error over all possible data
    • We cannot measure this, since we would need all possible data
  • Must be estimated with a test set, outside the training set
  • This cannot be the validation set if the validation set was used to optimize hyperparameters
    • Since we choose the combination with the smallest validation error, that estimate is biased
  • Solution: reserve a test set for final estimate of true error
    • This set should not be used for any choice or optimization

Overfitting

Model Selection

  • If the model adapts too much to the data, the training error may be low but the true error high
    • Example: Auto MPG problem, 100-50-10-1 network.

Overfitting

Model Selection

  • One way of solving this problem is to use a simpler model (assuming it can fit the data)
    • Example: Auto MPG problem, 30-10-1 network.

Overfitting

Model Selection

  • If the model is too simple, then error may become high
    • (Underfitting)
    • Example: Auto MPG problem, 3-2-1 network.

Activation, Loss and Optimization

Regularization in ANN

Regularization

Penalizing parameter size

  • To reduce variance, we can force parameters to remain small by adding a penalty to the objective (cost) function:
  • $$\tilde J(\theta;X,y) = J(\theta;X,y) + \alpha \Omega(\theta)$$
  • Where $\alpha$ is the weight of the regularization
    • Note: in ANN, generally only the input weights at each neuron are penalized and not the bias weights.
  • The norm function $\Omega(\theta)$ usually takes these forms:
    • L$^2$ Regularization (ridge regression): penalize $||\theta||^2$
    • L$^1$ Regularization: penalize $\sum_i |\theta_i|$
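
A sketch of both penalties in Keras (the $\alpha$ of 0.01 is illustrative); kernel_regularizer penalizes only the input weights, leaving the bias unpenalized, as recommended above:

```python
import tensorflow as tf

l2_layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01))
l1_layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l1(0.01))
```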

Regularization

L$^2$ Regularization is weight decay

  • If we penalize $w^2$, the gradient becomes:
  • $$\nabla \tilde J(\theta;X,y) = \nabla J(\theta;X,y) + 2\alpha w$$
  • This means the update rule for the weight becomes
  • $$w \leftarrow w - \epsilon 2\alpha w - \epsilon \nabla J(\theta;X,y)$$
  • Each update multiplies the magnitude of $w$ by a factor of $(1-\epsilon 2 \alpha)$
  • This causes weights that do not contribute to reducing the cost function to shrink

Regularization

L$^1$ Regularization

  • If we penalize $|w|$, the gradient becomes:
  • $$\nabla \tilde J(\theta;X,y) = \nabla J(\theta;X,y) + \alpha \operatorname{sign}(w)$$
  • This penalizes parameters by a constant value, leading to a sparse solution
    • Some weights will have an optimal value of 0

L$^1$ vs L$^2$ Regularization

  • L$^1$ minimizes number of non-zero weights
  • L$^2$ minimizes overall weight magnitude

Regularization

Dataset augmentation

  • More data is generally better, although not always readily available
  • But sometimes we can make more data
  • E.g. Image classification:
    • Translate images. Rotate or flip, if appropriate (not for character recognition)

Wang et al., 2019, "A survey of face data augmentation".

Regularization

Dataset augmentation by noise injection

  • Noise injection is an (implicit) form of dataset augmentation
    • Carefully add noise to the inputs, or even to some hidden layers
  • Noise can also be applied to the weights
  • Or even the output
    • There may be errors in labelling
    • Or for label smoothing: use $\frac{\epsilon}{(k-1)}$ and $1-\epsilon$ instead of 0 and 1 for target
    • This prevents pushing softmax or sigmoid to infinity
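
Keras builds label smoothing into its cross-entropy losses; a sketch (note that its convention spreads $\epsilon$ over all $k$ classes, slightly different from the $\frac{\epsilon}{k-1}$ formulation above):

```python
import tensorflow as tf

loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
```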

Regularization

Early stopping

  • Use validation to stop at best point
    • Constrains weights to be closer to starting distribution
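
A sketch of early stopping as a Keras callback (the patience value is an assumption; restore_best_weights keeps the weights from the best epoch):

```python
import tensorflow as tf

stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[stop])
```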

Regularization

Bagging

  • Train a set of models on different random subsets of the data
  • Use the average response (or majority vote)
  • Improves performance, as it reduces variance without affecting bias, and ANN can have high variance
  • However, it can be costly to train and use many deep models.

Regularization

Dropout

  • "Turns off" random input and hidden neurons in each minibatch

Regularization

Dropout

  • Dropout does model averaging implicitly
  • Turning off neurons at random trains an ensemble of many different networks
  • After training, weights are scaled by the probability of being "on"
    • (same expected activation value)
  • Keras automatically adjusts for this when we use a Dropout layer
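
A minimal sketch of dropout in Keras (the rate of 0.5 is a common choice for hidden units, not a value from the slides):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # active in training, off at test time
    tf.keras.layers.Dense(1),
])
```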

Activation, Loss and Optimization

Summary

Activation and Loss

Summary

  • The vanishing gradients problem, ReLU
  • Activations for hidden and output layers
  • Loss functions
  • Optimizers, learning rate, batch normalization
  • Model selection and Regularization

Further reading:

  • Goodfellow et al., Deep Learning, Chaps. 5-7 and 11, Sects. 8.4 and 8.7.1
  • Tensorflow, activation functions:
    • https://www.tensorflow.org/api_guides/python/nn#Activation_Functions