Lecture 3: Maximum Likelihood


An aside into basic probability

Probability Density Functions

(1)

$$f(x ; \mu , \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
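As a quick numerical check of the Gaussian density, the sketch below evaluates it directly from the formula. The function name `gaussian_pdf` is an illustrative choice, and `sigma2` is assumed to be the variance (not the standard deviation).

```python
import math

def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density of N(mu, sigma2) at x; sigma2 is the variance."""
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive")
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))

# At x = mu the exponential is 1, so the density equals 1 / sqrt(2 * pi * sigma2).
peak = gaussian_pdf(0.0, 0.0, 1.0)
```

Note that a density can exceed 1 for small variances; only its integral is constrained to equal 1.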

Expectation

(2)

$$\mathbb{E}_{x\sim P}[f(x)] = \int_{-\infty}^{\infty} P(x)f(x) \, dx$$
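The expectation integral can be approximated numerically. The sketch below uses a midpoint Riemann sum on a truncated interval, assuming a standard normal density for $P$ and $f(x) = x^2$; the helper names, the interval $[-10, 10]$, and the step count are all illustrative choices.

```python
import math

def std_normal_pdf(x: float) -> float:
    """Density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expectation(f, pdf, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of pdf(x) * f(x) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx  # midpoint of the i-th subinterval
        total += pdf(x) * f(x) * dx
    return total

# Second moment of N(0, 1); the exact value is the variance, 1.
second_moment = expectation(lambda x: x * x, std_normal_pdf)
```

Truncating to a finite interval is only reasonable here because the Gaussian tails are negligible beyond it.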

Joint Distributions and Independence

(3)

$$\int_D P(x_1,\cdots, x_n) \, dx_1 \cdots dx_n = \mathrm{Pr}((x_1 ,\cdots, x_n)\in D)$$

(4)

$$P_{\theta}(x_1,\cdots, x_n) = \prod_{i=1}^n P_{\theta}(x_i)$$
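Under independence the joint density factorizes into a product of marginals. A minimal sketch, assuming each $x_i$ is drawn from the same Gaussian (the names `gaussian_pdf` and `joint_density` are hypothetical), compares the product against the known closed form for two independent standard normals:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def joint_density(xs, mu=0.0, sigma2=1.0):
    """Joint density of independent draws: the product of the marginals."""
    return math.prod(gaussian_pdf(x, mu, sigma2) for x in xs)

x1, x2 = 0.5, -1.2
joint = joint_density([x1, x2])
# Closed form for two independent N(0, 1) variables.
closed_form = math.exp(-(x1 ** 2 + x2 ** 2) / 2.0) / (2.0 * math.pi)
```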

Empirical Distribution

(5)

$$\mathrm{Pr}^n(x) := \frac{1}{n}\sum_{i=1}^{n} \delta(X_i - x)$$

Note: $\mathbb{E}_{X\sim \mathrm{Pr}^n}[f(X)] = \frac{1}{n}\sum_{i=1}^n f(X_i)$, i.e. the sample average of $f$.
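Because the empirical distribution places mass $1/n$ on each observed point, an expectation under it reduces to an average over the sample. A small sketch (the function name and the sample values are made up for illustration):

```python
def empirical_expectation(f, samples):
    """Expectation of f under the empirical distribution of `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(f(x) for x in samples) / len(samples)

samples = [1.0, 2.0, 4.0, 5.0]
mean = empirical_expectation(lambda x: x, samples)               # 3.0
second_moment = empirical_expectation(lambda x: x * x, samples)  # 11.5
```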

The Learning Task, $T$

(6)

$$P_{\mathrm{model}}(y \mid x ; \theta)$$
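To make the notation concrete, here is one possible conditional model. It is purely an assumption for illustration and is not specified in these notes: a linear-Gaussian model in which $y \mid x \sim \mathcal{N}(\theta x, \sigma^2)$ with scalar $x$, $y$, and $\theta$.

```python
import math

def p_model(y, x, theta, sigma2=1.0):
    """Hypothetical model: density of y given x under y | x ~ N(theta * x, sigma2)."""
    mean = theta * x
    return math.exp(-((y - mean) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# A parameter that predicts y well assigns the observation a higher density.
good = p_model(y=2.0, x=1.0, theta=2.0)
bad = p_model(y=2.0, x=1.0, theta=0.0)
```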

Likelihood function

(7)

$$\mathcal{L}(\theta; (x_1,y_1), \cdots , (x_n,y_n)) := \prod_{i=1}^n P_{\mathrm{model}}(y_i \mid x_i ; \theta)$$
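The likelihood is a function of $\theta$ with the data held fixed. The sketch below evaluates it for a hypothetical one-parameter logistic model, $P(y = 1 \mid x; \theta) = \sigma(\theta x)$, on a tiny made-up dataset; neither the model nor the data come from the lecture.

```python
import math

def p_model(y, x, theta):
    """Hypothetical logistic model: P(y = 1 | x; theta) = sigmoid(theta * x)."""
    p1 = 1.0 / (1.0 + math.exp(-theta * x))
    return p1 if y == 1 else 1.0 - p1

def likelihood(theta, data):
    """Product of per-example conditional probabilities."""
    return math.prod(p_model(y, x, theta) for x, y in data)

# (x, y) pairs: negative inputs labelled 0, positive inputs labelled 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

l_zero = likelihood(0.0, data)   # every factor is 0.5, so this is 0.5 ** 4
l_pos = likelihood(1.0, data)    # agrees with the labels, so larger
l_neg = likelihood(-1.0, data)   # contradicts the labels, so smaller
```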

Maximum Likelihood Estimate (MLE)

(8)

$$\Theta_{\mathrm{MLE}} := \operatorname*{arg\,max}_\theta \mathcal{L}(\theta; (x_1,y_1),\cdots, (x_n,y_n)) = \operatorname*{arg\,max}_\theta \prod_{i=1}^n P_{\mathrm{model}}(y_i \mid x_i; \theta)$$
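For a scalar parameter the arg max can be approximated by brute force. The sketch below assumes a linear-Gaussian model $y \mid x \sim \mathcal{N}(\theta x, 1)$ (an illustrative choice, with made-up data) and compares a grid search over $\theta$ with the least-squares closed form $\sum_i x_i y_i / \sum_i x_i^2$ that this particular model admits.

```python
import math

def likelihood(theta, data, sigma2=1.0):
    """Likelihood under the assumed model y | x ~ N(theta * x, sigma2)."""
    return math.prod(
        math.exp(-((y - theta * x) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
        for x, y in data
    )

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Grid search: evaluate the likelihood on theta in [0, 4] with step 0.001.
grid = [i / 1000.0 for i in range(0, 4001)]
theta_grid = max(grid, key=lambda t: likelihood(t, data))

# Closed form for this specific model (ordinary least squares through the origin).
theta_closed = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
```

Grid search does not scale beyond a handful of parameters; it is used here only to show that the arg max is a property of the likelihood surface, independent of how it is found.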

Log-Likelihood

(9)

$$
\begin{aligned}
\Theta_{\mathrm{MLE}} &= \operatorname*{arg\,max}_\theta \mathcal{L}(\theta) \\
&= \operatorname*{arg\,max}_\theta \log \mathcal{L}(\theta) \\
&= \operatorname*{arg\,max}_\theta \log \prod_{i=1}^n P_{\mathrm{model}}(y_i \mid x_i ; \theta) \\
&= \operatorname*{arg\,max}_\theta \sum_{i=1}^n \log P_{\mathrm{model}}(y_i \mid x_i ; \theta) \\
&= \operatorname*{arg\,min}_\theta \frac{1}{n}\sum_{i=1}^n -\log P_{\mathrm{model}}(y_i \mid x_i ; \theta) \\
&= \operatorname*{arg\,min}_\theta \mathbb{E}_{(\mathcal{X},\mathcal{Y})\sim \mathcal{D}^n} \left[ -\log P_{\mathrm{model}}(\mathcal{Y} \mid \mathcal{X} ; \theta) \right]
\end{aligned}
$$
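One practical reason to work with the log-likelihood is floating-point underflow: a product of many probabilities quickly rounds to zero, while a sum of logs stays well scaled. The sketch below assumes, for brevity, a Bernoulli model that ignores $x$ (so $P(y = 1; \theta) = \theta$) and a synthetic dataset of 2000 labels; both are illustrative assumptions.

```python
import math

def p_model(y, theta):
    """Hypothetical Bernoulli model that ignores x: P(y = 1; theta) = theta."""
    return theta if y == 1 else 1.0 - theta

def likelihood(theta, ys):
    return math.prod(p_model(y, theta) for y in ys)

def mean_nll(theta, ys):
    """Average negative log-likelihood, the quantity being minimized."""
    return -sum(math.log(p_model(y, theta)) for y in ys) / len(ys)

ys = [1] * 1400 + [0] * 600                 # 70% ones
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

# In double precision every raw likelihood underflows to exactly 0.0,
# so the arg max over the product carries no information here.
raw = [likelihood(t, ys) for t in candidates]

# The mean negative log-likelihood is still well behaved.
theta_hat = min(candidates, key=lambda t: mean_nll(t, ys))
```

Because $\log$ is strictly increasing and dividing by $n$ is a positive rescaling, none of these transformations moves the optimum; only the numerical behaviour changes.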

Learning via Log-Likelihood

(10)

$$J(\theta) = \mathbb{E}_{(\mathcal{X},\mathcal{Y})\sim \mathcal{D}^n} \left[ -\log P_{\mathrm{model}}(\mathcal{Y} \mid \mathcal{X} ;\theta) \right]$$
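Treating $J(\theta)$ as a cost function suggests minimizing it iteratively. The following is a minimal sketch of plain gradient descent, assuming the linear-Gaussian model $y \mid x \sim \mathcal{N}(\theta x, 1)$, for which $J(\theta)$ is half the mean squared error plus a constant. The model, data, learning rate, and step count are all assumptions chosen for illustration.

```python
def grad_J(theta, data):
    """Gradient of the mean NLL for y | x ~ N(theta * x, 1): -mean((y - theta * x) * x)."""
    return -sum((y - theta * x) * x for x, y in data) / len(data)

def fit(data, lr=0.1, steps=200, theta=0.0):
    """Plain gradient descent on J(theta) from the given starting point."""
    for _ in range(steps):
        theta -= lr * grad_J(theta, data)
    return theta

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
theta_hat = fit(data)

# For this model the minimizer is also available in closed form.
theta_closed = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
```

The learning rate has to be small relative to the curvature of $J$ (here the mean of $x_i^2$) for the iteration to converge.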