Lecture 11: Deep Learning

Linear Regression - Revisit

In linear regression problems we minimise the following loss:

L(x,\theta) = \sum_{i=1}^{n}\left|(\textbf{w}x_i + \textbf{b}) - y_i\right|^2

Where \theta = \{\textbf{w}, \textbf{b}\} are the parameters, updated by gradient descent:

\boldsymbol{\theta}^{j+1} = \boldsymbol{\theta}^j - \nabla_\theta L(x,\boldsymbol{\theta}^j)
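
A minimal NumPy sketch of this gradient-descent update applied to the loss above; the learning rate lr, iteration count, and synthetic data are illustrative assumptions (the update rule as written uses an implicit step size of 1):

```python
import numpy as np

# Synthetic 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0
lr = 0.001  # learning rate (assumed; not shown in the update rule above)

for _ in range(1000):
    err = (w * x + b) - y
    # Gradients of L = sum((w*x_i + b - y_i)^2) w.r.t. w and b
    grad_w = 2.0 * np.sum(err * x)
    grad_b = 2.0 * np.sum(err)
    w -= lr * grad_w
    b -= lr * grad_b
```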

Multilayer Perceptrons

Networks are made up of a series of banks (layers) of artificial neurons, with each bank feeding into the next until the final output layer.
For each neuron we need to define a function that determines how and when it fires. These are known as activation functions.

Activation Functions

Sigmoid

f(x) = \frac{1}{1+e^{-x}}

tanh

f(x) = \tanh(x)

ReLU - Rectified Linear Unit

f(x) = \max(0, x)

Leaky ReLU

f(x) = \max(0.01x, x)
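
Each of the activation functions above is a one-liner in NumPy; a purely illustrative sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to negative inputs
    return np.maximum(alpha * x, x)
```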

Tips in Practice

Design

A multilayer perceptron is composed of layers, or banks, of neurons that feed into each other and eventually into an output layer of a single neuron, or a softmax function for probabilistic networks.

The first layer can be thought of as the input data; the following n-2 layers are the hidden layers, which learn the optimal parameters. The final layer is the output layer, producing the probability or label for a given sample.
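
A tiny NumPy sketch of this layout, with one hidden layer and a softmax output; the layer sizes, weight initialisation, and random input are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Input layer (the data) -> hidden layer -> softmax output layer
n_in, n_hidden, n_out = 4, 16, 3
W1 = 0.1 * rng.normal(size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=(n_hidden, n_out)); b2 = np.zeros(n_out)

def forward(X):
    h = relu(X @ W1 + b1)         # hidden layer holds the learned parameters
    return softmax(h @ W2 + b2)   # class probabilities for each sample

probs = forward(rng.normal(size=(5, n_in)))  # shape (5, 3); each row sums to 1
```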

Convolutional Neural Networks

Dot Product and Convolution

The dot product of \vec{a} = \langle a_1, \ldots, a_n \rangle and \vec{b} = \langle b_1, \ldots, b_n \rangle is defined as:

\vec{a}\cdot\vec{b} = \sum_{i=1}^{n}a_ib_i = a_1b_1 + a_2b_2 + \cdots + a_nb_n

The convolution between an image x and a kernel \textbf{w} is given as:

G = \textbf{w} * x \qquad G[i,j] = \sum_{u=-k}^{k}\sum_{v=-k}^{k}\textbf{w}[u,v]\,x[i-u,j-v]

Convolution is also sometimes denoted as \circledast

Where u and v are indices in the kernel grid and i and j are indices in the image grid; k denotes the radius of the kernel.

Convolutional Neural Networks employ this convolution function to extract and learn features from input image data.

Kernels exist for edge detection, sharpening and blur. Networks often learn kernels that are a combination of many of these in order to extract complex features from the images.
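
A direct (unoptimised) NumPy implementation of the convolution sum above; the zero padding at the image borders and the example sharpening kernel are assumptions for illustration:

```python
import numpy as np

def convolve2d(x, w):
    """Compute G[i,j] = sum_{u,v} w[u,v] * x[i-u, j-v] with zero padding."""
    k = w.shape[0] // 2                      # kernel radius
    H, W = x.shape
    padded = np.pad(x, k)                    # zero-pad the image borders
    G = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            for u in range(-k, k + 1):
                for v in range(-k, k + 1):
                    # x[i-u, j-v] expressed in the padded image's coordinates
                    G[i, j] += w[u + k, v + k] * padded[i - u + k, j - v + k]
    return G

# Example: a common 3x3 sharpening kernel applied to a small test image
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)
image = np.arange(25, dtype=float).reshape(5, 5)
out = convolve2d(image, sharpen)
```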

Data Pre-processing

The idea behind pre-processing our input data is to make our loss function, and the resulting loss, less sensitive to changes in the model parameters, since high sensitivity makes the model hard to optimise.

We will look at 2 different normalisation methods:

Min-Max

\text{Min-Max} = \frac{\text{value}-\text{min}}{\text{max}-\text{min}}

This method of normalisation guarantees all features will have the exact same scale but does not handle outliers well.

Z-Score

\text{Z-Score} = \frac{\text{value}-\text{mean}}{\text{std}}

\text{Z-Score} = \frac{x-\bar{x}}{\sigma}

This method handles outliers well but does not produce normalised data with the exact same scale.
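
Both normalisations are simple per-feature operations; a minimal NumPy sketch, assuming features are stored column-wise:

```python
import numpy as np

def min_max(X):
    # Rescale each feature (column) to the [0, 1] range
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score(X):
    # Centre each feature at 0 with unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)
```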

PCA - Principal Component Analysis

  1. By performing principal component analysis on the original data we centre the data at 0 and rotate into the eigenbasis of the data's covariance matrix. In so doing, we de-correlate the data.
  2. Each dimension is additionally divided by the square root of its eigenvalue, transforming the data's covariance matrix into the identity matrix (whitening). Geometrically, this essentially corresponds to stretching and squeezing the data into an isotropic Gaussian blob; see the sketch below.
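
A sketch of these two steps using NumPy's eigendecomposition; the small constant eps added before the division is an assumption for numerical stability:

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)                 # 1. centre the data at 0
    cov = (X.T @ X) / X.shape[0]           # covariance matrix of the centred data
    eigvals, eigvecs = np.linalg.eigh(cov)
    X_rot = X @ eigvecs                    # rotate into the eigenbasis (de-correlates)
    return X_rot / np.sqrt(eigvals + eps)  # 2. scale so covariance becomes the identity
```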