
- The dataset consists of $n$ data points:
- $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$
- $x_i \in \mathbb{R}^d$ is the "input" for the $i$th data point, a feature vector with $d$ elements, where $d$ is the number of dimensions of the feature space (here $d = 1$).
- $y_i \in \mathbb{R}$ is the "output" for the $i$th data point, in this case the weight of the corresponding cat's heart.
- In this example, our task is: Linear Regression
- Find a "model", i.e. a function:
- $f : \mathbb{R}^d \to \mathbb{R}$
- such that, for future observations, the model's output is "close to" the true output.
- A linear regression model has the form:
- $f(x) = \left( \sum_{i=1}^{d} w_i \cdot x_i \right) + b$
- where:
- $x \in \mathbb{R}^d$ is the input vector (features)
- $w \in \mathbb{R}^d$ is the weight vector (parameters)
- $b \in \mathbb{R}$ is a bias (parameter)
- $f(x) \in \mathbb{R}$ is the predicted output
- In our cat example we have:
- $d = 1$, as "body weight" is our only feature
- $b = 0$, as intuitively we expect a cat of zero weight to have a heart of zero weight.
- Our model has a single parameter: $w$
- We want a function $J(w)$ that quantifies the error in the predictions for a given parameter $w$.
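
As a concrete illustration, here is a minimal sketch of this one-parameter model in Python; the data values and the choice $w = 0.004$ are invented placeholders, not fitted values:

```python
def predict(w: float, x: float) -> float:
    """Linear model with d = 1 and b = 0: f(x) = w * x."""
    return w * x

# Hypothetical example: body weight (kg) -> predicted heart weight (kg)
w = 0.004                      # placeholder; to be chosen by minimising J(w)
body_weights = [2.0, 2.5, 3.0]
predictions = [predict(w, x) for x in body_weights]
```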

- The following empirical loss function $J$ takes into account the errors across all $N$ data points:
- $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - w x_i)^2$
- where each term in the summation is squared so that:
- we ignore the sign
- we penalise large errors more
- To find the optimum weight, solve:
- $\frac{\partial J}{\partial w} = 0$
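
A minimal sketch of this loss in Python, assuming the data live in plain lists; the sample values are invented for illustration:

```python
def loss(w: float, xs: list[float], ys: list[float]) -> float:
    """Empirical loss J(w) = (1 / 2N) * sum_i (y_i - w * x_i)^2."""
    n = len(xs)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# Hypothetical data: body weights (kg) and heart weights (kg)
xs = [2.0, 2.5, 3.0, 3.5]
ys = [0.008, 0.010, 0.012, 0.014]
print(loss(0.004, xs, ys))  # 0.0 -- this w fits the sample data perfectly
print(loss(0.010, xs, ys))  # > 0 -- a worse choice of w
```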
Given a continuous function:
- $f : \mathbb{R}^d \to \mathbb{R}$, e.g. our loss function,
- an element $x \in \mathbb{R}^d$ is called:
- A global minimum of $f$ iff:
- $\forall y \in \mathbb{R}^d,\; f(x) \leq f(y)$
- A local minimum of $f$ iff:
- $\exists \epsilon > 0$ such that $\forall y \in \mathbb{R}^d$: if $\forall i \in \{1, \ldots, d\},\; |x_i - y_i| < \epsilon$, then $f(x) \leq f(y)$

Theorem:
For any differentiable function $f : \mathbb{R} \to \mathbb{R}$, if $x$ is a local optimum, then $f'(x) = 0$.
Definition:
The first derivative of a function $f : \mathbb{R} \to \mathbb{R}$ is:
$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$
- $(c \cdot f(x))' = c \cdot f'(x)$
- $(x^k)' = k x^{k-1}$, if $k \neq 0$
- $(f(x) + g(x))' = f'(x) + g'(x)$
- $(f(g(x)))' = f'(g(x)) \cdot g'(x)$ ← chain rule
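
As a quick sanity check, the limit definition above can be approximated numerically with a small $\Delta x$ and compared against the power rule; the choice $f(x) = x^3$ here is arbitrary:

```python
def numerical_derivative(f, x: float, dx: float = 1e-6) -> float:
    """Approximate f'(x) via the limit definition, using a small step dx."""
    return (f(x + dx) - f(x)) / dx

def cube(x: float) -> float:
    """f(x) = x^3"""
    return x ** 3

print(numerical_derivative(cube, 2.0))  # ~= 12.000006
print(3 * 2.0 ** 2)                     # power rule: (x^3)' = 3x^2 -> 12.0
```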
- Optimise $J$ by solving $J'(w) = 0$:
- $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - w x_i)^2$
- $J'(w) = \frac{1}{N} \sum_{i=1}^{N} (w x_i - y_i) x_i$ (by the chain rule)
- Setting $J'(w) = 0$:
- $\frac{1}{N} \sum_{i=1}^{N} (w x_i - y_i) x_i = 0$
- $w \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} x_i y_i$
- $w = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}$
- This equation has exactly one solution ∴ it is the global minimum.
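
A minimal sketch of this closed-form solution, reusing the hypothetical data from earlier:

```python
def fit_closed_form(xs: list[float], ys: list[float]) -> float:
    """w = (sum_i x_i * y_i) / (sum_i x_i^2), the unique zero of J'(w)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [2.0, 2.5, 3.0, 3.5]
ys = [0.008, 0.010, 0.012, 0.014]
print(fit_closed_form(xs, ys))  # 0.004
```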
- Solving $J'(w) = 0$ analytically is often difficult or impossible for non-linear models with many parameters.
Idea:
- Start with an initial guess
- While $J'(w) \neq 0$:
  - move slightly in the right direction
- To make this viable we need to define:
  - "what is the right direction?"
  - "what is slightly?"

w ← initial weight
repeat:
    if J′(w) < 0:
        w ← w + ϵ
    else if J′(w) > 0:
        w ← w − ϵ
- where ϵ is the learning rate, set manually (a hyper-parameter).
Issue with this attempt:
- $w$ may oscillate in the interval $[w_{\text{opt}} - \epsilon,\; w_{\text{opt}} + \epsilon]$
- w fails to converge
w ← initial weight
repeat:
    w ← w − ϵ · J′(w)
- The update is now scaled by J′(w), so it moves in the right direction for either sign of the derivative, and the step shrinks as $w$ approaches the optimum, avoiding the oscillation problem.
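
A minimal sketch of this gradient-descent loop for the one-parameter model, using the same hypothetical data as before; the learning rate and step count are arbitrary choices:

```python
def fit_gradient_descent(xs: list[float], ys: list[float],
                         lr: float = 0.01, steps: int = 1000) -> float:
    """Minimise J(w) by repeatedly stepping against the gradient."""
    n = len(xs)
    w = 0.0  # initial guess
    for _ in range(steps):
        # J'(w) = (1/N) * sum_i (w * x_i - y_i) * x_i
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # step scaled by the gradient; shrinks near the optimum
    return w

xs = [2.0, 2.5, 3.0, 3.5]
ys = [0.008, 0.010, 0.012, 0.014]
print(fit_gradient_descent(xs, ys))  # ~= 0.004, matching the closed form
```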