Lecture 4

(1)

J(\Theta) = \mathbb{E}_{\mathcal{X},\mathcal{Y}\sim \mathcal{D}}\left[-\log P_{\text{model}}(\mathcal{Y} \mid \mathcal{X}; \Theta)\right]
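In practice the expectation is estimated by averaging the negative log-probability the model assigns to the observed labels over a sample. A minimal sketch (illustrative only; it assumes the per-example probabilities of the true labels have already been computed):

```python
import numpy as np

def nll_cost(probs_of_true_labels):
    """Monte-Carlo estimate of J(Theta): the mean negative
    log-likelihood the model assigns to the observed labels."""
    return -np.mean(np.log(probs_of_true_labels))

# Toy example: probabilities the model gave to the correct label of 4 samples.
p = np.array([0.9, 0.7, 0.95, 0.4])
print(nll_cost(p))  # lower is better; 0 would mean perfect confidence
```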

Functions of multiple variables

Vectors

(2)

\vec{v} = \langle v_1, \cdots, v_n \rangle \in \R^n

Properties of vectors

(3)

\|\vec{v}\|_p = \left(\sum_{i=1}^{n}|v_i|^p\right)^{\frac{1}{p}}
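As a quick sanity check (not part of the original notes), the p-norm can be computed directly from this definition and compared against numpy's built-in Euclidean norm:

```python
import numpy as np

def p_norm(v, p):
    """p-norm from the definition: (sum |v_i|^p)^(1/p)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

v = np.array([3.0, -4.0])
print(p_norm(v, 1))                      # 7.0 (Manhattan norm)
print(p_norm(v, 2), np.linalg.norm(v))   # 5.0 both ways (Euclidean norm)
```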

Operations on vectors

\forall a \in \R, \quad \vec{u} = \langle u_1, \cdots , u_m \rangle \in \R^m, \quad \vec{v} = \langle v_1, \cdots, v_m \rangle \in \R^m
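The notes do not list the operations explicitly here, but the standard ones for this setup are scalar multiplication, vector addition, and the dot product; a minimal numpy sketch:

```python
import numpy as np

a = 2.0
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(a * u)          # scalar multiplication: <a*u_1, ..., a*u_m>
print(u + v)          # vector addition:       <u_1+v_1, ..., u_m+v_m>
print(np.dot(u, v))   # dot product:           sum_i u_i * v_i  -> 32.0
```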

Theorem: Geometric interpretation of dot-product

(4)

\vec{u}\cdot\vec{v} = \|\vec{u}\| \cdot \|\vec{v}\|\cdot\cos\theta

where \theta is the angle between \vec{u} and \vec{v}.
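A quick numerical check of this identity (illustrative, not from the notes), using two vectors whose angle is known by construction:

```python
import numpy as np

# u along the x-axis, v at a known 45-degree angle to it.
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
theta = np.pi / 4  # angle between u and v, known from the construction

lhs = np.dot(u, v)                                            # 1.0
rhs = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(theta)   # 1.0
print(lhs, rhs)
```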

Partial Differentiation

(5)

\frac{\partial f}{\partial x_i}(u_1,\cdots,u_n) = \lim_{h\rightarrow 0} \frac{f(u_1,\cdots,u_i+h,\cdots,u_n)-f(u_1,\cdots, u_n)}{h}
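The limit suggests a simple numerical approximation (a sketch, not part of the notes): perturb only the i-th coordinate by a small h and take the difference quotient.

```python
def partial_derivative(f, u, i, h=1e-6):
    """Finite-difference approximation of df/dx_i at the point u (a list)."""
    u_shift = list(u)
    u_shift[i] += h
    return (f(*u_shift) - f(*u)) / h

# Example: f(x, y) = x**2 * y, so df/dx = 2*x*y = 12 at (2, 3).
f = lambda x, y: x**2 * y
print(partial_derivative(f, [2.0, 3.0], 0))  # approximately 12.0
```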

Gradient

(6)

\nabla f := \left(\frac{\partial f}{\partial x_1},\cdots, \frac{\partial f}{\partial x_n}\right)
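Stacking the finite-difference partials from the previous sketch coordinate by coordinate gives a numerical gradient (again illustrative):

```python
def numerical_gradient(f, u, h=1e-6):
    """Approximate the gradient of f at u by finite-difference partials."""
    grad = []
    for i in range(len(u)):
        u_shift = list(u)
        u_shift[i] += h
        grad.append((f(*u_shift) - f(*u)) / h)
    return grad

f = lambda x, y: x**2 * y
print(numerical_gradient(f, [2.0, 3.0]))  # approx [12.0, 4.0] = (2xy, x^2)
```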

Chain Rule (Special Case)

For 1D functions: if y = f(u) and u = g(x), then

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

For higher-dimensional functions: if y = f(u_1, \cdots, u_m) and u_j = g_j(x_1, \cdots, x_n) for j \in \{1, \cdots, m\}, then

\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \cdot \frac{\partial u_j}{\partial x_i}
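A small numerical check of the multivariable case (illustrative only): take y = u_1 u_2 with u_1 = x_1 + x_2 and u_2 = x_1 x_2, and compare the chain-rule sum against a direct finite difference.

```python
x1, x2, h = 2.0, 3.0, 1e-6

# Composite function: y = u1 * u2 with u1 = x1 + x2, u2 = x1 * x2.
y = lambda x1, x2: (x1 + x2) * (x1 * x2)

# Chain rule: dy/dx1 = dy/du1 * du1/dx1 + dy/du2 * du2/dx1
u1, u2 = x1 + x2, x1 * x2
chain = u2 * 1.0 + u1 * x2             # = 21.0

# Direct finite-difference check of dy/dx1.
direct = (y(x1 + h, x2) - y(x1, x2)) / h
print(chain, direct)                   # 21.0, ~21.0
```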

Directional Derivative

Definition

(7)

f : \R^m \rightarrow \R

(8)

\vec{v} = \langle v_1,\cdots,v_m\rangle, \quad \|\vec{v}\| = 1

(9)

\nabla_{\vec{v}} f(\vec{x}) := \lim_{\alpha \rightarrow 0} \frac{f(\vec{x}+ \alpha \vec{v}) - f(\vec{x})}{\alpha}

= \lim_{\alpha \rightarrow 0} \frac{f(x_1+ \alpha v_1 , \cdots, x_m + \alpha v_m)- f(x_1,\cdots, x_m)}{\alpha}

Computing the Directional Derivative

Theorem
(10)

\nabla_{\vec{v}}f(\vec{x}) = \nabla f(\vec{x})\cdot\vec{v}
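Before the proof, a numerical illustration (a sketch under assumed example functions, not from the notes): approximate the limit definition with a small \alpha and compare it against the gradient dotted with \vec{v}.

```python
import numpy as np

f = lambda x: x[0]**2 + 3.0 * x[1]          # gradient is (2x, 3)
x = np.array([1.0, 2.0])
v = np.array([0.6, 0.8])                    # unit vector: ||v|| = 1

# Left-hand side: the limit definition, approximated with a small alpha.
alpha = 1e-6
lhs = (f(x + alpha * v) - f(x)) / alpha

# Right-hand side: gradient dotted with v.
grad = np.array([2.0 * x[0], 3.0])
rhs = np.dot(grad, v)                       # 2*0.6 + 3*0.8 = 3.6
print(lhs, rhs)
```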

Proof
(11)

h(\alpha) := f(u_1,\cdots,u_m), \quad \text{where } u_i := x_i + \alpha v_i

(12)

\nabla_{\vec{v}} f(\vec{x}) := \lim_{\alpha \rightarrow 0} \frac{f(\vec{x}+ \alpha \vec{v}) - f(\vec{x})}{\alpha}

= \lim_{\alpha\rightarrow 0} \frac{h(0+\alpha)-h(0)}{\alpha}

(13)

= h'(0)

(14)

h'(\alpha) = \frac{dh}{d\alpha} = \sum_{i=1}^m \frac{\partial f}{\partial u_i}\cdot \frac{du_i}{d\alpha} = \sum_{i=1}^m \frac{\partial f}{\partial u_i}\cdot v_i

(by the chain rule, using \frac{du_i}{d\alpha} = v_i)

(15)

\text{At } \alpha = 0: \quad u_i = x_i + 0 \cdot v_i = x_i

(16)

\nabla_{\vec{v}}f(\vec{x}) = h'(0) = \sum_{i=1}^m \frac{\partial f}{\partial x_i}\cdot v_i = \nabla f(\vec{x})\cdot \vec{v}

The Gradient Points Towards the Steepest Ascent

(17)

\argmax_{\vec{v}, \|\vec{v}\|=1} \nabla_{\vec{v}}f(\vec{x})

= \argmax_{\vec{v}, \|\vec{v}\|=1} \nabla f(\vec{x}) \cdot \vec{v}

= \argmax_{\vec{v}, \|\vec{v}\|=1} \|\nabla f(\vec{x})\| \cdot \|\vec{v}\| \cdot \cos\theta

= \argmax_{\vec{v}, \|\vec{v}\|=1} \|\nabla f(\vec{x})\| \cdot \cos\theta

This is maximized when \cos\theta = 1, i.e. \theta = 0, so the maximizing unit vector is \vec{v} = \frac{\nabla f(\vec{x})}{\|\nabla f(\vec{x})\|}: the gradient direction is the direction of steepest ascent.
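A small numerical demonstration (illustrative example function, not from the notes): scan unit vectors and check that the one maximizing the directional derivative matches the normalized gradient.

```python
import numpy as np

f = lambda x: x[0]**2 + 3.0 * x[1]
x = np.array([1.0, 2.0])
grad = np.array([2.0 * x[0], 3.0])       # analytic gradient (2x, 3)

# Scan unit vectors v(t) = (cos t, sin t) and record grad . v for each.
thetas = np.linspace(0.0, 2.0 * np.pi, 3600)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
dd = dirs @ grad

best = dirs[np.argmax(dd)]
print(best, grad / np.linalg.norm(grad))  # both approx [0.5547, 0.8321]
```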

Method of Gradient Descent

Input: cost function J : \R^m \rightarrow \R
       learning rate \epsilon \in \R, \epsilon > 0

\vec{x} \leftarrow some initial point in \R^m
while termination condition not met {
    \vec{x} \leftarrow \vec{x} - \epsilon \cdot \nabla J(\vec{x})
}
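A minimal Python sketch of this loop (the function names, the example cost, and the fixed-iteration termination condition are illustrative choices, not from the notes):

```python
import numpy as np

def gradient_descent(grad_J, x0, epsilon=0.1, num_steps=100):
    """Repeatedly step against the gradient of the cost function J."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):          # simple termination condition
        x = x - epsilon * grad_J(x)
    return x

# Example: J(x) = ||x - (1, 2)||^2 has gradient 2*(x - (1, 2)) and minimum at (1, 2).
grad_J = lambda x: 2.0 * (x - np.array([1.0, 2.0]))
print(gradient_descent(grad_J, x0=[5.0, -3.0]))   # approx [1. 2.]
```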