Lecture 10

Model Capacity

Informally, a model's capacity is its ability to fit a wide variety of functions.
In statistical learning theory, capacity is quantified by the *VC-dimension*: the size of the largest training set for which the model can correctly classify every possible assignment of the labels into two classes (e.g. a linear classifier in the plane has VC-dimension 3).
By the universal approximation theorem, neural networks can have very high capacity (see previous lecture).

Underfitting & Overfitting

Underfitting: Too high a training error
Overfitting: Too large a gap between training error and test error

Model Capacity vs Error

Training and test error behave differently as capacity increases: the training error decreases, while the test error typically first decreases and then increases again (a U-shaped curve).

The capacity of a model is optimal when it matches the complexity of the data-generating process.

Regularisation

There are 3 model regimes

  1. Model family excludes the data-generating process (underfitting)
  2. Model family matches the data-generating process
  3. Model family includes the data-generating process but also many other candidate models (possible overfitting)

Regularisation attempts to move a model from regime $3 \rightarrow 2$.

Data augmentation

Many datasets can be augmented by applying label-preserving transformations to the training examples.
Examples include translating, rotating, flipping or cropping images, and injecting small amounts of noise into the inputs.
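
As a concrete illustration, here is a minimal NumPy sketch of such transformations for grayscale images; the particular transformations (flip, shift, noise) and their parameter values are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def augment_image(img, rng):
    """Return a randomly transformed copy of a (H, W) grayscale image."""
    out = img.copy()
    if rng.random() < 0.5:                              # random horizontal flip
        out = out[:, ::-1]
    shift = rng.integers(-2, 3)                         # small horizontal translation
    out = np.roll(out, shift, axis=1)
    out = out + rng.normal(0.0, 0.01, size=out.shape)   # mild Gaussian noise
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))                            # stand-in for a training image
augmented = [augment_image(image, rng) for _ in range(4)]
```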

Regularisation Methods

Early Stopping

Here we split the data into training, validation and test sets, monitor the validation error during training, and stop once it no longer improves; the parameters with the lowest validation error are kept.
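
A minimal sketch of the early-stopping loop; `train_epoch`, `validation_loss` and `get_params` are caller-supplied placeholder helpers, not names from the lecture.

```python
def train_with_early_stopping(model, train_data, val_data,
                              train_epoch, validation_loss, get_params,
                              max_epochs=200, patience=10):
    """Stop once the validation loss has not improved for `patience` epochs.
    `train_epoch`, `validation_loss` and `get_params` are caller-supplied helpers."""
    best_loss, best_params, stale_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        train_epoch(model, train_data)                 # one pass over the training set
        val_loss = validation_loss(model, val_data)    # monitor error on the validation set
        if val_loss < best_loss:
            best_loss, best_params = val_loss, get_params(model)
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                                  # validation error stopped improving
    return best_params                                 # parameters with lowest validation error
```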

Parameter Norm Penalties

Replace the cost function with:

$$\widetilde{C}(\Theta; X, y) = C(\Theta; X, y) + \alpha\,\Omega(\Theta)$$

Where $\Omega(\Theta)$ is a parameter norm penalty and $\alpha \geq 0$ is a hyperparameter weighting the penalty relative to the original cost $C$.

$L^2$ Parameter Regularisation

Assuming the parameters are weights and biases, i.e. $\Theta = (w, b)$.
Then we can define:

$$\Omega(\Theta) := \frac{1}{2}\|w\|_2^2$$

This penalises large weights without penalising the biases (weight decay).
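
A minimal sketch of how the penalty changes the cost and its gradient; the extra gradient term $\alpha w$ is what shrinks the weights. `cost` and `grad_cost_w` stand in for the unregularised $C$ and its gradient with respect to $w$ (assumed given).

```python
import numpy as np

def l2_regularised_cost(cost, grad_cost_w, w, alpha):
    """C~ = C + alpha * (1/2) * ||w||_2^2 and its gradient with respect to w.
    `cost` and `grad_cost_w` are the unregularised cost C and its gradient (assumed given)."""
    c_tilde = cost + 0.5 * alpha * np.sum(w ** 2)
    grad_tilde = grad_cost_w + alpha * w   # the alpha * w term pulls weights towards zero
    return c_tilde, grad_tilde
```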

Ensemble Methods

Combining different models often reduces generalisation error.

Idea

Train $k$ neural networks on $k$ subsets of the training data. Output the average (or modal) value of the networks' predictions.
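
A sketch of the idea; `make_network()` and `train(net, X, y)` are hypothetical placeholder helpers, the subset fraction is an illustrative choice, and `X`, `y` are assumed to be NumPy arrays.

```python
import numpy as np

def train_ensemble(X, y, k, make_network, train, rng, subset_frac=0.7):
    """Train k networks, each on a random subset of the training data.
    `make_network()` and `train(net, X, y)` are placeholder helpers."""
    n = len(X)
    size = int(subset_frac * n)
    nets = []
    for _ in range(k):
        idx = rng.choice(n, size=size, replace=False)  # random subset (bagging would sample with replacement)
        net = make_network()
        train(net, X[idx], y[idx])
        nets.append(net)
    return nets

def ensemble_predict(nets, x):
    """Average the networks' outputs; for class labels, take a majority vote instead."""
    return np.mean([net.predict(x) for net in nets], axis=0)
```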

Disadvantages

Training and evaluating $k$ separate networks is expensive in both computation and memory.

Idea 2: Dropout

In each mini-batch, deactivate some randomly selected activation units (not in the output layer)

Each selection of units corresponds to a sub-network.
With $n$ input and hidden-layer activation units, there are $2^n$ possible sub-networks.

The sub-networks share the weights.

No dropout is applied during testing; the full network then implicitly averages the outputs of all sub-networks.

Implementing Dropout

Replace each activation unit $a_j^l = \phi(z_j^l)$ in a hidden layer with a dropout activation unit:

$$\widetilde{a}_j^l = \frac{1}{1-p} \cdot d_j^l \cdot \phi(z_j^l)$$

Where $p$ is the dropout probability and $d_j^l$ is a random variable that takes the value $0$ with probability $p$ (unit dropped) and $1$ with probability $1-p$ (unit kept).

The expected value of the random variable $\widetilde{a}_j^l$ is:

$$\mathbb{E}[\widetilde{a}_j^l] = p \cdot \frac{1}{1-p} \cdot 0 \cdot \phi(z_j^l) + (1-p) \cdot \frac{1}{1-p} \cdot 1 \cdot \phi(z_j^l) = \phi(z_j^l) = a_j^l$$

Therefore, choosing the factor $\frac{1}{1-p}$ makes the expected activation identical to the activation without dropout.
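
A small NumPy sketch of the dropout activation unit with the $\frac{1}{1-p}$ scaling, together with an empirical check of the expectation above; the ReLU $\phi$ is an illustrative choice, not fixed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                       # dropout probability
phi = lambda z: np.maximum(z, 0.0)            # illustrative activation (ReLU)

def dropout_activation(z, p, rng):
    """a~ = d * phi(z) / (1 - p), with d = 0 w.p. p and d = 1 w.p. 1 - p."""
    d = (rng.random(z.shape) >= p).astype(z.dtype)
    return d * phi(z) / (1.0 - p), d

z = rng.normal(size=(5,))
samples = np.mean([dropout_activation(z, p, rng)[0] for _ in range(100_000)], axis=0)
print(np.allclose(samples, phi(z), atol=0.05))   # empirical mean matches phi(z)
```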

Backpropagation with Dropout

Given the local gradient $\delta_j^l$, the partial derivatives are:

$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l} \cdot \frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l \cdot \widetilde{a}_k^{l-1}$$

$$\frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l} \cdot \frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l$$

Local Gradient for Hidden Layers

$$\widetilde{a}_j^l = \frac{1}{1-p} \cdot d_j^l \cdot \phi(z_j^l)$$

Derivation:

$$\begin{aligned}
\delta_j^l &= \frac{\partial C}{\partial z_j^l} && \text{by definition} \\
&= \frac{\partial C}{\partial \widetilde{a}_j^l} \cdot \frac{\partial \widetilde{a}_j^l}{\partial z_j^l} && \text{chain rule} \\
&= \left( \sum_{k=1}^{m} \frac{\partial C}{\partial z_k^{l+1}} \cdot \frac{\partial z_k^{l+1}}{\partial \widetilde{a}_j^l} \right) \cdot \frac{1}{1-p} \cdot d_j^l \cdot \phi'(z_j^l) && \text{chain rule} \\
&= \left( \sum_{k=1}^{m} \delta_k^{l+1} \cdot w_{kj}^{l+1} \right) \cdot \frac{1}{1-p} \cdot d_j^l \cdot \phi'(z_j^l) && \text{by definition of } \delta_k^{l+1}
\end{aligned}$$


Or in Matrix Form:

$$\delta^l = \frac{1}{1-p} \left( (w^{l+1})^T \delta^{l+1} \right) \odot d^l \odot \phi'(z^l)$$
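
The matrix form maps directly onto NumPy operations; in this sketch `w_next` is $w^{l+1}$, `delta_next` is $\delta^{l+1}$, and `phi_prime` is the derivative of the activation, all supplied by the caller.

```python
import numpy as np

def local_gradient_with_dropout(delta_next, w_next, d_l, z_l, p, phi_prime):
    """delta^l: elementwise product of (w^{l+1}.T @ delta^{l+1}), d^l and phi'(z^l), scaled by 1/(1-p)."""
    return (w_next.T @ delta_next) * d_l * phi_prime(z_l) / (1.0 - p)
```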

Backpropagation Algorithm with Dropout

input: a training example $(x, y) \in \mathbb{R}^m \times \mathbb{R}^{m'}$

  1. Set the activation in the input layer:

$$d_j^1 \sim \text{Bernoulli}(1-p), \quad \text{for } j = 1, \ldots, m$$
$$\widetilde{a}^1 = \frac{1}{1-p}\, d^1 \odot x$$

  2. For each $l = 2, \ldots, L-1$, feed forward:

$$d_j^l \sim \text{Bernoulli}(1-p), \quad \text{for each unit } j \text{ in layer } l$$
$$z^l = w^l \widetilde{a}^{l-1} + b^l$$
$$\widetilde{a}^l = \frac{1}{1-p}\, d^l \odot \phi(z^l)$$

  3. Set the activations in the output layer (no dropout):

$$z^L = w^L \widetilde{a}^{L-1} + b^L, \qquad a^L = \phi(z^L)$$

  4. Compute the local gradient for the output layer:

$$\delta^L := \nabla_a C \odot \phi'(z^L)$$

  5. Backpropagate the local gradients through the hidden layers, i.e. for each $l = L-1, \ldots, 2$:

$$\delta^l := \frac{1}{1-p}\left( (w^{l+1})^T \delta^{l+1} \right) \odot d^l \odot \phi'(z^l)$$

  6. Return the partial derivatives:

$$\frac{\partial C}{\partial w_{jk}^l} = \delta_j^l\, \widetilde{a}_k^{l-1}, \qquad \frac{\partial C}{\partial b_j^l} = \delta_j^l$$
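
For concreteness, here is a compact NumPy sketch of steps 1-6. The sigmoid activation and quadratic cost $C = \frac{1}{2}\|a^L - y\|^2$ are illustrative choices not fixed by the lecture; `weights` and `biases` are lists holding $w^2, \ldots, w^L$ and $b^2, \ldots, b^L$.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def backprop_with_dropout(x, y, weights, biases, p, rng):
    """One forward/backward pass with dropout (inverted scaling by 1/(1-p))."""
    L = len(weights) + 1                                  # number of layers
    # Step 1: input layer with dropout
    d = [(rng.random(x.shape) >= p).astype(float)]        # d = 1 with probability 1-p
    a_tilde = [d[0] * x / (1.0 - p)]
    zs = [None]                                           # no weighted input for layer 1
    # Steps 2-3: feed forward (dropout in hidden layers, none in the output layer)
    for l in range(1, L):
        z = weights[l - 1] @ a_tilde[l - 1] + biases[l - 1]
        zs.append(z)
        if l < L - 1:                                     # hidden layer
            mask = (rng.random(z.shape) >= p).astype(float)
            d.append(mask)
            a_tilde.append(mask * sigmoid(z) / (1.0 - p))
        else:                                             # output layer: no dropout
            a_tilde.append(sigmoid(z))
    # Step 4: local gradient for the output layer (quadratic cost: grad_a C = a^L - y)
    delta = (a_tilde[-1] - y) * sigmoid_prime(zs[-1])
    grads_w = [delta[:, None] @ a_tilde[-2][None, :]]
    grads_b = [delta]
    # Step 5: backpropagate through the hidden layers
    for l in range(L - 2, 0, -1):
        delta = (weights[l].T @ delta) * d[l] * sigmoid_prime(zs[l]) / (1.0 - p)
        grads_w.append(delta[:, None] @ a_tilde[l - 1][None, :])
        grads_b.append(delta)
    # Step 6: return partial derivatives, ordered from the first to the last layer
    return grads_w[::-1], grads_b[::-1]

# Example usage: 4 inputs, one hidden layer of 5 units, 3 outputs
rng = np.random.default_rng(0)
ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
bs = [np.zeros(5), np.zeros(3)]
gw, gb = backprop_with_dropout(rng.normal(size=4), np.array([1.0, 0.0, 0.0]), ws, bs, p=0.5, rng=rng)
```

A gradient-descent step would then update each `weights[i]` with the corresponding `gw[i]` (plus any parameter norm penalty term).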