Lecture 8

Repetition: Gradient Descent

In gradient descent the inputs are a cost function $\mathcal{J}$ and a learning rate $\varepsilon$, where

$$\mathcal{J} : \mathbb{R}^m \rightarrow \mathbb{R}$$

The pseudo-code for conventional gradient descent is as follows:


$$
\begin{aligned}
& x \leftarrow \text{some initial point in } \mathbb{R}^m \\
& \text{while termination condition not met} \\
& \{ \\
& \quad x \leftarrow x - \varepsilon \cdot \nabla \mathcal{J}(x) \\
& \}
\end{aligned}
$$
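As a concrete sketch of this loop, here is a minimal NumPy implementation. The quadratic cost $\mathcal{J}(x) = \|x\|^2$, the fixed learning rate, and the gradient-norm termination test are assumptions chosen only for illustration:

```python
import numpy as np

def gradient_descent(grad_J, x0, eps=0.1, tol=1e-6, max_iter=1000):
    """Plain gradient descent: x <- x - eps * grad_J(x)."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad_J(x)
        if np.linalg.norm(g) < tol:   # termination condition (assumed)
            break
        x = x - eps * g
    return x

# Toy example (assumption): J(x) = ||x||^2, so grad_J(x) = 2x; minimum at the origin.
print(gradient_descent(lambda x: 2 * x, x0=np.array([3.0, -4.0])))
```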

Impact of learning rate $\varepsilon$ on Stochastic Gradient Descent

Gradient Descent with Momentum

$$
\begin{aligned}
& \text{Choose initial parameter, } \Theta \\
& v = 0 \\
& \text{while termination condition is not met} \\
& \{ \\
& \quad v = \alpha v - \varepsilon \nabla_{\Theta} C(\Theta) \\
& \quad \Theta = \Theta + v \\
& \}
\end{aligned}
$$

Conceptually, we can think of momentum in this context the same way as in physics:

A ball with position $\Theta$ and velocity $v$, influenced by two forces: one which pushes the ball in the direction opposite to the current gradient, and a viscous drag determined by a parameter $\alpha < 1$.

The momentum smooths out the update steps; the size of the updates grows when successive gradients point in similar directions.
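A minimal NumPy sketch of the momentum update, again on an assumed toy quadratic cost with illustrative values for $\varepsilon$ and $\alpha$:

```python
import numpy as np

def momentum_descent(grad_C, theta0, eps=0.1, alpha=0.9, tol=1e-6, max_iter=5000):
    """Gradient descent with momentum: v <- alpha*v - eps*grad_C(theta)."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(max_iter):
        g = grad_C(theta)
        if np.linalg.norm(g) < tol:      # termination condition (assumed)
            break
        v = alpha * v - eps * g          # velocity accumulates a decaying sum of past gradients
        theta = theta + v
    return theta

# Toy example (assumption): C(theta) = ||theta||^2, gradient 2*theta.
print(momentum_descent(lambda th: 2 * th, theta0=np.array([3.0, -4.0])))
```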

Two hyper-parameters: the learning rate $\varepsilon$ and the momentum parameter $\alpha$

Nesterov Momentum

$$
\begin{aligned}
& \text{Choose an initial parameter, } \Theta \\
& v = 0 \\
& \text{while termination condition is not met} \\
& \{ \\
& \quad v = \alpha v - \varepsilon \nabla_{\Theta} C(\Theta + \alpha v) \\
& \quad \Theta = \Theta + v \\
& \}
\end{aligned}
$$

Nesterov Momentum is a variant of standard momentum where the gradient is computed after the velocity is applied, i.e. at the look-ahead point $\Theta + \alpha v$.
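The same toy setup as before, sketched with the Nesterov look-ahead gradient; the cost, $\varepsilon$, $\alpha$, and termination test are again assumptions for illustration:

```python
import numpy as np

def nesterov_descent(grad_C, theta0, eps=0.1, alpha=0.9, tol=1e-6, max_iter=5000):
    """Nesterov momentum: gradient evaluated at the look-ahead point theta + alpha*v."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(max_iter):
        g = grad_C(theta + alpha * v)    # look-ahead gradient
        if np.linalg.norm(g) < tol:      # termination condition (assumed)
            break
        v = alpha * v - eps * g
        theta = theta + v
    return theta

# Toy example (assumption): C(theta) = ||theta||^2, gradient 2*theta.
print(nesterov_descent(lambda th: 2 * th, theta0=np.array([3.0, -4.0])))
```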

Adagrad

Duchi et al. 2011

$$
\begin{aligned}
& \text{Choose an initial parameter, } \Theta \\
& r = 0 \\
& \text{while termination condition is not met} \\
& \{ \\
& \quad g = \nabla_{\Theta} C(\Theta) \\
& \quad r = r + g \odot g \quad \leftarrow \text{accumulated squared gradient} \\
& \quad v = -\frac{\varepsilon}{\delta + \sqrt{r}} \odot g \\
& \quad \Theta = \Theta + v \\
& \}
\end{aligned}
$$

Here $\delta$ is a small constant for numerical stability, typically around $10^{-6}$.

Adagrad adapts a (possibly different) learning rate $\frac{\varepsilon}{\delta + \sqrt{r}}$ for each dimension, according to the accumulated squared gradient $r$.

AdaGrad works well for problems with sparse gradients

All past gradients are weighted equally in $r$, so the effective learning rate can only shrink as training proceeds.
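A minimal NumPy sketch of the Adagrad loop above, with an assumed toy quadratic cost; note how the accumulator $r$ gives each dimension its own effective learning rate:

```python
import numpy as np

def adagrad(grad_C, theta0, eps=0.1, delta=1e-6, tol=1e-6, max_iter=10000):
    """Adagrad: per-dimension learning rate eps / (delta + sqrt(r))."""
    theta = theta0.copy()
    r = np.zeros_like(theta)             # accumulated squared gradient
    for _ in range(max_iter):
        g = grad_C(theta)
        if np.linalg.norm(g) < tol:      # termination condition (assumed)
            break
        r = r + g * g                    # element-wise squared gradient
        v = -(eps / (delta + np.sqrt(r))) * g
        theta = theta + v
    return theta

# Toy example (assumption): C(theta) = ||theta||^2; theta moves toward the origin.
print(adagrad(lambda th: 2 * th, theta0=np.array([3.0, -4.0])))
```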

Adam

Kingma and Ba, 2014

$$
\begin{aligned}
& \text{Choose an initial parameter, } \Theta \\
& r = 0,\quad s = 0,\quad t = 0 \\
& \text{while termination condition not met} \\
& \{ \\
& \quad t \leftarrow t + 1 \\
& \quad g = \nabla_{\Theta} C(\Theta) \\
& \quad s = \rho_1 s + (1-\rho_1)\, g \\
& \quad r = \rho_2 r + (1-\rho_2)\, g \odot g \\
& \quad \hat{s} = \frac{s}{1-\rho_1^t},\quad \hat{r} = \frac{r}{1-\rho_2^t} \\
& \quad v = -\varepsilon \cdot \frac{\hat{s}}{\sqrt{\hat{r}} + \delta} \\
& \quad \Theta = \Theta + v \\
& \}
\end{aligned}
$$

Here the step $v$ is proportional to the bias-corrected average gradient $\hat{s}$ divided by the square root of the bias-corrected average squared gradient $\hat{r}$.

Typical hyperparameter values are $\varepsilon = 0.001$, $\rho_1 = 0.9$, $\rho_2 = 0.999$, and $\delta = 10^{-8}$, as suggested by Kingma and Ba.
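A minimal NumPy sketch of Adam using those typical values as defaults; the toy quadratic cost, the larger learning rate in the example call, and the termination test are assumptions for illustration:

```python
import numpy as np

def adam(grad_C, theta0, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8,
         tol=1e-6, max_iter=5000):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    theta = theta0.copy()
    s = np.zeros_like(theta)             # first moment (running average of gradients)
    r = np.zeros_like(theta)             # second moment (running average of squared gradients)
    for t in range(1, max_iter + 1):
        g = grad_C(theta)
        if np.linalg.norm(g) < tol:      # termination condition (assumed)
            break
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)      # bias correction
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta

# Toy example (assumption): C(theta) = ||theta||^2; larger eps so the toy converges quickly.
print(adam(lambda th: 2 * th, theta0=np.array([3.0, -4.0]), eps=0.05))
```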