Lecture 6

Softmax

In Lecture 5 we derived the per-example cost function:

$$C_i = \sum_{j=1}^m \frac{1}{2}\left(y_j^{(i)} - a_j^L\right)^2 \tag{1}$$

where $a_j^L$ is the output of the model.

This cost was derived using the maximum likelihood method under the assumption that the predicted output $a_j^L$ has a Gaussian (normal) distribution. This assumption is acceptable for regression problems.

However, when we look at classification problems with $m$ discrete class labels $\{1,\cdots,m\}$, it makes more sense to have one output unit $P_j$ per class $j$, where $P_j$ is the probability of the example falling into class $j$.

These units therefore have to satisfy the following conditions:

$$P_j \geq 0 \quad \text{for all } j\in\{1,\cdots,m\}, \qquad \sum_{j=1}^m P_j = 1$$

Diagrammatically, we replace the last layer of our network with a "softmax" layer

$$P_j = \frac{e^{z_j^L}}{Q} \tag{2}$$

where:

$$Q = \sum_{j=1}^m e^{z_j^L} \tag{3}$$

Here we use the exponential function to make certain that each probability stays positive, and the division by $Q$ ensures that the probabilities sum to exactly 1. A quick proof of this is as follows:

$$\sum_{j=1}^m P_j = \frac{1}{Q}\sum_{j=1}^m e^{z_j^L} = \frac{Q}{Q} = 1 \tag{4}$$

Then, as our outputs now form a valid probability distribution, we can say:

$$P_y = P_{wb}(y \mid x) \tag{5}$$

i.e. the probability of class $y\in\{1,\cdots,m\}$ given input $x$.
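As a concrete illustration of equations (2)–(4), here is a minimal NumPy sketch of a softmax layer; the function name and example values are hypothetical, not from the lecture.

```python
import numpy as np

def softmax(z):
    """Softmax outputs P_j = e^{z_j} / Q with Q = sum_k e^{z_k} (equations (2)-(3))."""
    exp_z = np.exp(z)          # e^{z_j^L}, elementwise
    Q = np.sum(exp_z)          # normalising constant Q
    return exp_z / Q

z = np.array([2.0, 1.0, 0.1])  # hypothetical last-layer inputs z^L
P = softmax(z)
print(P)                       # approximately [0.659, 0.242, 0.099]
print(P.sum())                 # ~1, as shown in equation (4)
```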

Defining our loss function via the likelihood $\mathcal{L}$

Given $n$ independent observations $(x_1,y_1),\cdots,(x_n,y_n)$, where $y_i \in \{1,\cdots,m\}$ is the class corresponding to the input $x_i$, the likelihood of the weight and bias parameters $w,b$ is:

$$\mathcal{L}(w,b \mid (x_1,y_1),\cdots,(x_n,y_n)) = \prod_{i=1}^n P_{wb}(y_i \mid x_i) = \prod_{i=1}^n P_{y_i} \tag{6}$$

We can define a cost function using the maximum likelihood principle:

$$C = -\frac{1}{n}\log \mathcal{L}(w,b \mid (x_1,y_1),\cdots,(x_n,y_n)) = \frac{1}{n}\sum_{i=1}^n C_i \tag{7}$$

where:

$$C_i = -\log P_{wb}(y_i \mid x_i) = -\log P_{y_i} = \log Q - z_{y_i}^L \tag{8}$$
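As a quick sketch (the helper name and example values below are hypothetical), the per-example cost in equation (8) and the averaged cost in equation (7) can be computed directly from the last-layer inputs $z^L$:

```python
import numpy as np

def cost_i(z, y):
    """Per-example cost C_i = log Q - z_y (equation (8))."""
    Q = np.sum(np.exp(z))
    return np.log(Q) - z[y]

# Hypothetical last-layer inputs for two training examples, with class labels
Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.2]])
y = np.array([0, 1])

C = np.mean([cost_i(z_i, y_i) for z_i, y_i in zip(Z, y)])  # equation (7)
print(C)
```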

To apply gradient descent to minimise the cost function, we need to compute the gradients.

$$\frac{\partial C}{\partial w_{jk}^l} = \frac{1}{n}\sum_{i=1}^n \frac{\partial C_i}{\partial w_{jk}^l} \tag{9}$$

$$\frac{\partial C}{\partial b_{j}^l} = \frac{1}{n}\sum_{i=1}^n \frac{\partial C_i}{\partial b_{j}^l} \tag{10}$$

To compute these using backpropagation, we need to compute the local gradient for the softmax layer:

$$\delta_j^L := \frac{\partial C_i}{\partial z_j^L} \tag{11}$$

Theorem

$$\delta_j^L = \frac{\partial C_i}{\partial z_j^L} = P_j - \delta_{y_i j}$$

where
$$\delta_{ab} = \begin{cases} 1 & \text{if } a=b \\ 0 & \text{otherwise} \end{cases}$$

Proof

$$\begin{aligned}
\frac{\partial C_i}{\partial z_j^L} &= -\frac{\partial \log P_{wb}(y_i \mid x_i)}{\partial z_j^L} = -\frac{\partial \log P_{y_i}}{\partial z_j^L} \\
&= -\frac{\partial}{\partial z_j^L}\left(z_{y_i}^L - \log Q\right) \\
&= -\left(\delta_{y_i j} - \frac{\partial \log Q}{\partial z_j^L}\right) \\
&= -\left(\delta_{y_i j} - \frac{1}{Q}e^{z_j^L}\right) \\
&= P_j - \delta_{y_i j} \qquad \square
\end{aligned} \tag{12}$$
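The theorem can also be sanity-checked numerically by comparing $P_j - \delta_{y_i j}$ with a central finite-difference estimate of $\partial C_i/\partial z_j^L$; the sketch below is self-contained and uses hypothetical helper names and values.

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def cost_i(z, y):
    return np.log(np.sum(np.exp(z))) - z[y]   # equation (8)

z = np.array([2.0, 1.0, 0.1])   # hypothetical last-layer inputs z^L
y = 1                           # true class index y_i

# Analytic local gradient from the theorem: delta_j^L = P_j - [j == y_i]
analytic = softmax(z) - np.eye(len(z))[y]

# Central finite-difference estimate of dC_i / dz_j^L
eps = 1e-6
numeric = np.array([
    (cost_i(z + eps * np.eye(len(z))[j], y)
     - cost_i(z - eps * np.eye(len(z))[j], y)) / (2 * eps)
    for j in range(len(z))
])

print(np.max(np.abs(analytic - numeric)))  # agrees to roughly 1e-10
```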

Numerical Issues with Softmax

Neural networks are usually implemented with fixed-length representations of real numbers.
Obviously, these representations are finite representations of the uncountably infinite set $\mathbb{R}$.

For instance, the maximal value that the float64 data type in NumPy can represent is $\approx 1.8\times10^{308}$.
When computing the numerator in $\frac{e^{z_j^L}}{Q}$ we can easily exceed this limit.
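For example, with hypothetical but perfectly representable inputs, a naive implementation of equation (2) already overflows:

```python
import numpy as np

z = np.array([1000.0, 999.0, 998.0])   # well within float64 range
naive = np.exp(z) / np.sum(np.exp(z))  # but e^1000 overflows to inf
print(naive)                           # [nan nan nan], with overflow warnings
```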

Note:

For any constant $r$:

$$P_j = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} = \frac{e^r\cdot e^{z_j^L}}{e^r\cdot \sum_k e^{z_k^L}} = \frac{e^{z_j^L + r}}{\sum_k e^{z_k^L + r}} \tag{13}$$

To avoid overly large exponents, it is common to implement the softmax function using the rightmost expression above with the constant:

$$r := -\max_k z_k^L \tag{14}$$
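A minimal sketch of this stabilised implementation, assuming a hypothetical helper name:

```python
import numpy as np

def stable_softmax(z):
    """Softmax computed with the shift r = -max_k z_k (equations (13)-(14))."""
    shifted = z - np.max(z)            # i.e. z_k + r, so the largest exponent is 0
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([1000.0, 999.0, 998.0])   # the inputs that broke the naive version
print(stable_softmax(z))               # approximately [0.665, 0.245, 0.090]
```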