Lecture 5

Computation Graphs

We will describe ML models using computation graphs, where nodes represent variables or elementary operations and edges indicate how values flow between them.

Example

The linear regression model, z = \sum_{i=1}^m a_i w_i + b, could be represented as a computation graph:
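
As a rough sketch of how such a graph is evaluated, the following Python snippet computes z node by node (the input values, weights and bias below are illustrative, not from the lecture):

```python
import numpy as np

# Evaluate the linear regression computation graph z = sum_i a_i*w_i + b
# node by node: the inputs a_i, the weights w_i and the bias b are leaf nodes,
# the products and the sum are interior nodes, and z is the output node.
a = np.array([1.0, 2.0, 3.0])   # input nodes a_1, ..., a_m
w = np.array([0.5, -1.0, 2.0])  # weight nodes w_1, ..., w_m
b = 0.25                        # bias node

products = a * w     # one multiplication node per (a_i, w_i) pair
s = products.sum()   # summation node
z = s + b            # output node
print(z)             # 4.75
```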

Feed-Forward Neural Network

Stepping it up, here is a simple Feed-forward Neural Network:

Where:

Training of Feed-Forward Neural Networks

The parameters of the network are the weights w and the biases b.

General Idea

To apply gradient descent to optimise a weight w (or a bias b) in a network, we apply the chain rule via an intermediate quantity v that depends on the weight:

(1)

\frac{\partial C}{\partial w} = \frac{\partial C}{\partial v}\cdot\frac{\partial v}{\partial w}

For unit j in layer l, the weighted input z_j^l and the activation a_j^l are defined as:

(2)

z_j^l = \sum_{k=1}^m w_{jk}^l a_k^{l-1} + b_j^l

(3)

a_j^l = \phi(z_j^l)

(4)

\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l}\cdot\frac{\partial z_j^l}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l}\cdot a_k^{l-1}

where a_k^{l-1} is the "activation" of unit k in layer l-1.

(5)

\frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l}\cdot \frac{\partial z_j^l}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l}\cdot 1

Hence, we can compute \frac{\partial C}{\partial w_{jk}^l} and \frac{\partial C}{\partial b_j^l} if we know:

(6)

\delta_j^l := \frac{\partial C}{\partial z_j^l}

The vector \delta^l is called the local gradient for layer l.

Local Gradient for the Output Layer

Let L denote the output layer, so that

(7)

a_j^L = \phi(z_j^L)

The local gradient for the output layer is:

(8)

\delta_j^L = \frac{\partial C}{\partial z_j^L}
By definition

= \frac{\partial C}{\partial a_j^L}\cdot\frac{\partial a_j^L}{\partial z_j^L}
By the chain rule

= \frac{\partial C}{\partial a_j^L} \cdot \phi'(z_j^L)
Because a_j^L = \phi(z_j^L)

The partial derivative \frac{\partial C}{\partial a_j^L} depends on the cost function.
For example, for a regression problem in m dimensions, one could define:

(9)

C(a_1^L,\cdots,a_m^L) := \frac{1}{2}\sum_{k=1}^m (y_k - a_k^L)^2

where y_k is the target value for the k^{th} output,

in which case,

(10)

\frac{\partial C}{\partial a_j^L} = a_j^L - y_j
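
To make (8) and (10) concrete, here is a small numeric sketch in NumPy; the sigmoid activation is an assumption for illustration, since the lecture leaves \phi unspecified:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):          # phi'(z) for the sigmoid
    s = sigmoid(z)
    return s * (1.0 - s)

# An output layer with m = 2 units.
z_L = np.array([0.5, -1.0])    # weighted inputs z_j^L
a_L = sigmoid(z_L)             # activations a_j^L = phi(z_j^L)
y = np.array([1.0, 0.0])       # regression targets y_j

# Quadratic cost (9): dC/da_j^L = a_j^L - y_j, equation (10),
# so by (8) the local gradient of the output layer is:
delta_L = (a_L - y) * sigmoid_prime(z_L)
print(delta_L)
```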

Local Gradient for Hidden Layers

Recall from (2) and (3) that:

(11)

z_k^{l+1} = \sum_r w_{kr}^{l+1} a_r^l + b_k^{l+1}

a_j^l = \phi(z_j^l)

(12)

\delta_j^l = \frac{\partial C}{\partial z_j^l}
By definition of the local gradient \delta_j^l
= \frac{\partial C}{\partial a_j^l}\cdot \frac{\partial a_j^l}{\partial z_j^l}
By the chain rule
= \left(\sum_k \frac{\partial C}{\partial z_k^{l+1}} \cdot \frac{\partial z_k^{l+1}}{\partial a_j^l}\right)\cdot \phi'(z_j^l)
By the chain rule applied to \frac{\partial C}{\partial a_j^l} (C depends on a_j^l only through the weighted inputs z_k^{l+1} of the next layer), and because \frac{\partial a_j^l}{\partial z_j^l} = \phi'(z_j^l)

= \phi'(z_j^l)\sum_k \delta_k^{l+1}\cdot w_{kj}^{l+1}
By the definition of the local gradient \delta_k^{l+1}, and because \frac{\partial z_k^{l+1}}{\partial a_j^l} = w_{kj}^{l+1}

Summary

For all weights w and biases b:

(13)

\frac{\partial C}{\partial w_{jk}^l} = \delta_j^l\cdot a_k^{l-1}

(14)

\frac{\partial C}{\partial b_j^l} = \delta_j^l

where the local gradient \delta_j^l is:

(15)

\delta_j^l = \begin{cases} \phi'(z_j^L)\cdot \frac{\partial C}{\partial a_j^L} &\text{if $l=L$ (output layer)} \\ \phi'(z_j^l)\cdot\sum_k \delta_k^{l+1}\cdot w_{kj}^{l+1} &\text{otherwise (hidden layer)} \end{cases}

Matrix Description

The back-propagation algorithm can exploit efficient implementations of matrix operations

Note: recall that for a matrix \textbf{A} \in \R^{m\times n}, A_{ij} denotes the element in the i^{th} row and j^{th} column.

For two matrices \textbf{A} \in \R^{m\times n} and \textbf{B} \in \R^{n\times l}:

| Operation | Name |
| --- | --- |
| (\textbf{A}^T)_{ij} = A_{ji} | Matrix Transpose |
| (\textbf{AB})_{ij} = \sum_k \textbf{A}_{ik}\textbf{B}_{kj} | Matrix Multiplication |

For two vectors \vec{u},\vec{v} \in \R^m:

| Operation | Name |
| --- | --- |
| \vec{u}+\vec{v} = \langle u_1+v_1,\cdots,u_m+v_m\rangle | Vector Addition |
| \vec{u}\cdot\vec{v} = \sum_{i=1}^m u_iv_i | Dot Product |
| \vec{u}\odot\vec{v} = \langle u_1v_1,\cdots,u_mv_m \rangle | Hadamard Product |
| (\vec{u}\vec{v}^T)_{ij} = u_iv_j | Outer Product |
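
These operations map directly onto NumPy, which is one reason the matrix form of back-propagation is efficient in practice (a sketch; any array library with the same primitives would do):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

A.T              # matrix transpose:      (A^T)_{ij} = A_{ji}
A @ B            # matrix multiplication: (AB)_{ij} = sum_k A_{ik} B_{kj}
u + v            # vector addition
u @ v            # dot product
u * v            # Hadamard (element-wise) product
np.outer(u, v)   # outer product: (u v^T)_{ij} = u_i v_j
```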

Weighted Inputs and Activations

(16)

z^l = (z_1^l, \cdots, z_m^l)

= \left( \sum_{k=1}^m w_{1k}^l a_k^{l-1} + b_1^l, \cdots, \sum_{k=1}^m w_{mk}^l a_k^{l-1} + b_m^l \right)

= w^l a^{l-1} + b^l

(17)

a^l = (a_1^l,\cdots, a_m^l)

= (\phi(z_1^l), \cdots, \phi(z_m^l))

= \phi(z^l)
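
In code, one layer of the forward pass (16)-(17) is a matrix-vector product followed by an element-wise activation. A minimal sketch, again assuming a sigmoid \phi and illustrative layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

m_prev, m = 4, 3                   # units in layer l-1 and layer l
W = rng.normal(size=(m, m_prev))   # w^l
b = rng.normal(size=m)             # b^l
a_prev = rng.normal(size=m_prev)   # a^{l-1}

z = W @ a_prev + b                 # z^l = w^l a^{l-1} + b^l
a = sigmoid(z)                     # a^l = phi(z^l)
```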

Local Gradients

Output Layer
(18)

\delta^L = (\delta_1^L,\cdots, \delta_m^L)

= \left( \frac{\partial C}{\partial a_1^L}\cdot \phi'(z_1^L), \cdots, \frac{\partial C}{\partial a_m^L}\cdot\phi'(z_m^L) \right)

= \nabla_{a^L}C \odot \phi'(z^L)

Hidden Layer
(19)

\delta^l = (\delta_1^l, \cdots, \delta_m^l)

= \left( \phi'(z_1^l)\cdot\sum_k \delta_k^{l+1}\cdot w_{k1}^{l+1} , \cdots, \phi'(z_m^l)\cdot \sum_k \delta_k^{l+1}\cdot w_{km}^{l+1}\right)

= \phi'(z^l) \odot \left( \sum_k ((w^{l+1})^T)_{1k}\,\delta_k^{l+1},\cdots, \sum_k ((w^{l+1})^T)_{mk}\,\delta_k^{l+1}\right)

= \phi'(z^l)\odot (w^{l+1})^T\delta^{l+1}
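
In the same vectorised form, (18) and (19) each become a single line of NumPy (a sketch continuing the sigmoid assumption; the quadratic-cost gradient a^L - y from (10) is used for \nabla_{a^L}C):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Output layer (18): delta^L = grad_{a^L}C * phi'(z^L)  (element-wise product)
def output_delta(a_L, y, z_L):
    return (a_L - y) * sigmoid_prime(z_L)

# Hidden layer (19): delta^l = phi'(z^l) * (w^{l+1})^T delta^{l+1}  (element-wise product)
def hidden_delta(W_next, delta_next, z_l):
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)
```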

Backpropagation Algorithm

Input: a training example (x,y) \in \R^m\times\R^{m'}

  1. Set the activation in the input layer
    a^1 = x
  2. For each l = 2, \cdots, L, feed forward
    z^l = w^l a^{l-1} + b^l
    a^l = \phi(z^l)
  3. Compute the local gradient for the output layer
    \delta^L := \nabla_{a^L}C \odot \phi'(z^L)
  4. Backpropagate the local gradients through the hidden layers, i.e.
    for each l = L-1, \cdots, 2
    \delta^l := \left( (w^{l+1})^T\delta^{l+1}\right)\odot \phi'(z^l)
  5. Return the partial derivatives
    \frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l
    \frac{\partial C}{\partial b_j^l} = \delta_j^l
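
The whole algorithm fits in a short NumPy function. This is a minimal sketch under stated assumptions: a sigmoid activation \phi, the quadratic cost (9), and weights stored as a list of matrices (one per layer l = 2, \cdots, L) with 1-D arrays for biases and activations; it is not the only way to organise the code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (dC/dw, dC/db) for a single training example (x, y)."""
    # 1. Set the activation in the input layer.
    a = x
    activations = [a]                 # a^1, ..., a^L
    zs = []                           # z^2, ..., z^L

    # 2. Feed forward.
    for W, b in zip(weights, biases):
        z = W @ a + b                 # z^l = w^l a^{l-1} + b^l
        a = sigmoid(z)                # a^l = phi(z^l)
        zs.append(z)
        activations.append(a)

    # 3. Local gradient for the output layer:
    #    delta^L = grad_{a^L}C * phi'(z^L), with grad_{a^L}C = a^L - y for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    grad_w[-1] = np.outer(delta, activations[-2])   # dC/dw^L = delta^L (a^{L-1})^T
    grad_b[-1] = delta                              # dC/db^L = delta^L

    # 4. Backpropagate the local gradients through the hidden layers (l = L-1, ..., 2).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l] = np.outer(delta, activations[-l - 1])
        grad_b[-l] = delta

    # 5. Return the partial derivatives.
    return grad_w, grad_b
```

A useful sanity check is to compare these gradients against finite differences of the cost on a tiny network.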

Training Feed-Forward Neural Networks

Assume n training samples

(x_1,y_1),\cdots,(x_n,y_n)

and a cost function:

C = \frac{1}{n}\sum_{i=1}^n C_i

where C_i is the cost on the i^{th} example.

For example, with the mean squared error, we can define it as:

C_i = \frac{1}{2}\|y_i - a^L\|^2

where a^L is the output of the network when a^1 = x_i.

Backpropagation gives us the gradient of the overall cost function as follows:

(20)

\frac{\partial C}{\partial w^l} = \frac{1}{n}\sum_{i=1}^n \frac{\partial C_i}{\partial w^l}

\frac{\partial C}{\partial b^l} = \frac{1}{n}\sum_{i=1}^n \frac{\partial C_i}{\partial b^l}

Note: these provide the average gradient per training example

We can now use gradient descent to optimise the weights w and biases b.
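
A single full-batch gradient-descent step then averages the per-example gradients as in (20) and moves the parameters against them (a sketch reusing the hypothetical backprop function above; eta is the learning rate):

```python
import numpy as np

def gradient_descent_step(xs, ys, weights, biases, eta):
    """One full-batch update: w <- w - eta * dC/dw, b <- b - eta * dC/db."""
    n = len(xs)
    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    for x, y in zip(xs, ys):                       # accumulate per-example gradients
        gw, gb = backprop(x, y, weights, biases)
        grad_w = [acc + g for acc, g in zip(grad_w, gw)]
        grad_b = [acc + g for acc, g in zip(grad_b, gb)]
    weights = [W - (eta / n) * gw for W, gw in zip(weights, grad_w)]
    biases = [b - (eta / n) * gb for b, gb in zip(biases, grad_b)]
    return weights, biases
```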

Mini-Batch Gradient Descent

Computing the gradients is expensive when the number of training examples, n, is large.

We can approximate the gradients:

(21)

\frac{\partial C}{\partial w^l} \approx \frac{1}{b}\sum_{i=1}^b \frac{\partial C_i}{\partial w^l}

\frac{\partial C}{\partial b^l} \approx \frac{1}{b} \sum_{i=1}^b \frac{\partial C_i}{\partial b^l}

using a random "mini-batch" of b \leq n training examples.

| Size | Name |
| --- | --- |
| 1 < b < n | Mini-batch Gradient Descent |
| b = 1 | Stochastic Gradient Descent |
| b = n | Batch Gradient Descent |

It is common to use a mini-batch size of b \in (20, 100).
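
A mini-batch step is the same update computed on a random subset of b examples, approximating (21) (again a sketch reusing the hypothetical backprop function; the default batch size and sampling without replacement are illustrative choices):

```python
import numpy as np

def minibatch_step(xs, ys, weights, biases, eta, b=32, rng=None):
    """One mini-batch gradient-descent update using b randomly chosen examples."""
    rng = rng or np.random.default_rng()
    batch = rng.choice(len(xs), size=b, replace=False)   # random mini-batch indices
    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(bias) for bias in biases]
    for i in batch:
        gw, gb = backprop(xs[i], ys[i], weights, biases)
        grad_w = [acc + g for acc, g in zip(grad_w, gw)]
        grad_b = [acc + g for acc, g in zip(grad_b, gb)]
    weights = [W - (eta / b) * gw for W, gw in zip(weights, grad_w)]
    biases = [bias - (eta / b) * gb for bias, gb in zip(biases, grad_b)]
    return weights, biases
```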