Informally, model capacity is a model's ability to fit a wide range of functions.
In statistical learning theory, model capacity is quantified by the VC dimension: the size of the largest training set whose labels the model can classify arbitrarily into two classes
By the universal approximation theorem, neural networks can have very high capacity (see previous lecture)
Underfitting & Overfitting
Underfitting: Too high a training error
Overfitting: Too large a gap between training error and test error
Model Capacity vs Error
Training and test error behave differently
Training error often decreases with capacity
Test error can increase beyond a certain capacity
A model's capacity is optimal when it matches the complexity of the data-generating process
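As a toy illustration (not from the lecture), this behaviour can be reproduced by fitting polynomials of increasing degree to noisy data from a quadratic generating process; the exact numbers depend on the noise realisation, but the training error typically keeps falling with capacity while the test error eventually rises:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: a quadratic plus Gaussian noise.
x_train = rng.uniform(-1, 1, 20)
y_train = x_train ** 2 + rng.normal(0, 0.1, 20)
x_test = rng.uniform(-1, 1, 200)
y_test = x_test ** 2 + rng.normal(0, 0.1, 200)

for degree in (1, 2, 9):  # increasing model capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Degree 1 underfits (both errors high), degree 2 matches the generating process, and degree 9 tends to fit the training noise (low training error, higher test error).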
Regularisation
There are 3 model regimes
Model family excludes data-generating process (underfitting)
Model family matches data-generating process
Model family matches the data-generating process, but also many other possible processes (possible overfitting)
Regularisation attempts to move a model from the third regime into the second
Data augmentation
Many datasets can be augmented via simple transformations, as sketched after the list of examples below
Examples include:
Mirroring
Translation
Scaling
Rotation
Noise
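A minimal sketch of such an augmentation step, assuming images stored as NumPy arrays; the function name `augment` is hypothetical, and a real pipeline would usually rely on a library such as torchvision.transforms:

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of an (H, W) or (H, W, C) image."""
    out = image.copy()
    if rng.random() < 0.5:                        # mirroring
        out = np.fliplr(out)
    dy, dx = rng.integers(-4, 5, size=2)          # translation by a few pixels
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))
    out = np.rot90(out, int(rng.integers(0, 4)))  # rotation by a multiple of 90 degrees
    out = out + rng.normal(0.0, 0.05, out.shape)  # additive Gaussian noise
    return out

# Example: augmented = augment(image, np.random.default_rng(0))
```

Arbitrary-angle rotation and geometric scaling are omitted here because they require interpolation (e.g. scipy.ndimage.rotate / zoom).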
Regularisation Methods
Early Stopping
Split the data into training, validation and test sets
Train the model on the training set and evaluate it at fixed intervals on the validation set
Stop training when the validation error starts to increase
Return the model parameters from the point where the validation loss was lowest, rather than the final parameters
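A minimal sketch of this procedure, assuming hypothetical callables `train_one_epoch(model)` and `validation_loss(model)` that stand in for whatever framework is used:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Return the model parameters with the lowest validation loss seen so far."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)      # snapshot of the best model so far
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)             # one pass over the training set
        val_loss = validation_loss(model)  # evaluate on the validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                      # validation error keeps increasing: stop

    return best_model
```

The `patience` parameter is a common refinement: rather than stopping at the first increase, training stops only after several evaluations without improvement.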
Parameter Norm Penalties
Replace the cost function with a regularised cost:
$\tilde{C}(\Theta; X, y) = C(\Theta; X, y) + \alpha\,\Omega(\Theta)$
Where:
C is the original cost function
Θ is the model's parameters
X,y is the training data
Ω is a regulariser, i.e. a function which penalises complex models
α is a hyperparameter controlling the degree of regularisation
L2 Parameter Regularisation
Assuming parameters are weights and biases, i.e. Θ=(w,b)
Then we can define:
$\Omega(\Theta) := \frac{1}{2}\lVert w \rVert_2^2$
This allows us to penalise large weights
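A minimal sketch of the penalised cost under these definitions, assuming the weights are stored as a list of NumPy arrays and the biases are excluded from the penalty:

```python
import numpy as np

def l2_regularised_cost(cost, weights, alpha):
    """C_tilde = C + alpha * (1/2) * ||w||_2^2, summed over all weight matrices."""
    penalty = 0.5 * sum(np.sum(w ** 2) for w in weights)
    return cost + alpha * penalty

# The gradient contribution of the penalty is simply alpha * w for each weight
# matrix, which is why L2 regularisation is also known as weight decay.
```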
Ensemble Methods
Combining different models often reduces generalisation error.
Idea
Train k neural networks on k subsets of the training data, then output the average (or modal) prediction of the networks (see the sketch after the list of disadvantages)
Disadvantages
Usually requires more training data
A k-fold increase in training time (if training sequentially)
Only feasible for small values of k
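A minimal sketch of the prediction step, assuming `models` holds the k trained networks and `predict(model, X)` is a hypothetical function returning class probabilities of shape (n_samples, n_classes):

```python
import numpy as np

def ensemble_predict(models, predict, X):
    """Average the class probabilities of k models and return class labels."""
    probs = np.mean([predict(m, X) for m in models], axis=0)  # average over the ensemble
    return np.argmax(probs, axis=1)                           # final class decision
```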
Idea 2: Dropout
In each mini-batch, deactivate some randomly selected activation units (not in the output layer)
Each selection of units corresponds to a sub-network.
With $n$ input and hidden-layer activation units, there are $2^n$ possible sub-networks.
The sub-networks share the weights.
No dropout is applied during testing; the output is then implicitly an average over all sub-networks
Implementing Dropout
Replace each activation unit $a_j^l = \phi(z_j^l)$ in a hidden layer with a dropout activation unit:
$a_j^l = \frac{1}{1-p} \cdot d_j^l \cdot \phi(z_j^l)$
Where:
$d_j^l \sim \mathrm{Bernoulli}(1-p)$
i.e. $d_j^l$ is 0 with probability $p$ and 1 otherwise
The expected value of the dropout activation is then $\mathbb{E}[a_j^l] = \frac{1}{1-p}\,\mathbb{E}[d_j^l]\,\phi(z_j^l) = \phi(z_j^l)$, the same as without dropout.
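A minimal sketch of an inverted-dropout hidden layer under these definitions, assuming the activation function $\phi$ is a ReLU and using NumPy only:

```python
import numpy as np

def dropout_activations(z, p, rng, training=True):
    """Apply inverted dropout to a hidden layer's pre-activations z."""
    a = np.maximum(z, 0.0)                       # phi(z), assumed to be ReLU here
    if not training:
        return a                                 # no dropout at test time
    d = rng.binomial(1, 1.0 - p, size=a.shape)   # d ~ Bernoulli(1 - p)
    return d * a / (1.0 - p)                     # rescale so E[a] = phi(z)
```

The 1/(1-p) rescaling during training is what makes the plain test-time forward pass (no dropout, no rescaling) match the expected activation.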