Lecture 9

Previously, we looked at different optimisation algorithms.

Generalisation in Neural Networks

Hypothesis:

Neural Networks generalise from the training data, i.e. by learning the inherent structure in the data.

Test of Hypothesis: removing structure should reduce network performance

Zhang et al. (ICLR 2017) trained a network on the CIFAR-10 dataset under the following settings: the true labels, partially corrupted labels, fully random labels, shuffled pixels (one fixed permutation applied to every image), random pixels (an independent permutation per image), and Gaussian noise in place of the images. The network reached zero training error in every case, even when all structure in the data had been removed.

Perceptrons

These were a very early form of artificial neural network and are very weak compared to current approaches

f(x) = \text{sign}(w^Tx - b), \quad \text{where } \text{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{otherwise}\end{cases}
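
As a concrete illustration, here is a minimal NumPy sketch of this model; the toy data (the AND function) and the classic perceptron update rule are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# A minimal single-layer perceptron, f(x) = sign(w^T x - b),
# trained with the classic perceptron update rule on the AND function
# (which, unlike XOR, is linearly separable).

def predict(w, b, x):
    return 1 if np.dot(w, x) - b > 0 else -1   # sign(.) with sign(0) = -1

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])                  # AND: only (1, 1) maps to +1

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(20):                            # perceptron learning rule
    for x_i, y_i in zip(X, y):
        if predict(w, b, x_i) != y_i:
            w += lr * y_i * x_i
            b -= lr * y_i

print([predict(w, b, x_i) for x_i in X])       # expected: [-1, -1, -1, 1]
```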

Linear decision Boundary

The boundary between positive and negative outputs of a single-layer perceptron can be described by a linear hyperplane

For example, in 2D we get:
f(x_1,x_2) = \text{sign}(w_1x_1 + w_2x_2 - b)
w_1x_1 + w_2x_2 - b < 0 \implies x_2 < \frac{b-w_1x_1}{w_2} \quad \text{(assuming } w_2 > 0\text{)}

Minsky & Papert (1969)

A single layer perceptron cannot learn the XOR function
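
A short sketch of why (assuming the XOR targets are encoded as -1 for inputs (0,0) and (1,1), +1 for (0,1) and (1,0), with \text{sign}(z) = -1 when z \leq 0): the perceptron would need weights satisfying

-b \leq 0, \quad w_2 - b > 0, \quad w_1 - b > 0, \quad w_1 + w_2 - b \leq 0

Adding the two strict inequalities gives w_1 + w_2 > 2b, while the last constraint gives w_1 + w_2 \leq b, so b < 0, contradicting -b \leq 0. Hence no choice of (w_1, w_2, b) works.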

Lebesgue Integration

The Lebesgue Integral

\int f(x)\, d\mu(x)

This is an alternative to the Riemann integral and is defined for a broader class of functions f.
It is defined with respect to a measure \mu, which measures the size of subsets of the domain of f.

It can be defined as:

Given a non-negative function f: X \rightarrow \R, let
f^*(t) = \mu(\{x \mid f(x)>t\})
then,

\int f(x)\,d\mu(x) = \int_0^{\infty} f^*(t)\,dt

where the LHS is the Lebesgue Integral and the RHS is the Riemann Integral
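
As a quick worked example (not from the lecture), take f(x) = x^2 on [0,1] with \mu the Lebesgue measure. Then

f^*(t) = \mu(\{x \in [0,1] \mid x^2 > t\}) = 1 - \sqrt{t} \text{ for } 0 \leq t \leq 1, \qquad f^*(t) = 0 \text{ for } t > 1

so

\int_0^{\infty} f^*(t)\,dt = \int_0^1 (1-\sqrt{t})\,dt = 1 - \tfrac{2}{3} = \tfrac{1}{3} = \int_0^1 x^2\,dx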

Discriminatory Functions

Definition:

\sigma : \R \rightarrow \R is called discriminatory iff

\int_{I_N} \sigma (y^{T}x+\Theta)\, d\mu(x) = 0 \quad \forall y \in \R^N \text{ and } \Theta \in \R \implies \mu = 0

Lemma

Any bounded, measurable function \sigma: \R \rightarrow \R where:

\sigma (x) \rightarrow \begin{cases} 1 & \text{as } x\rightarrow \infty \\ 0 & \text{as } x \rightarrow -\infty \end{cases}

is discriminatory

For example, the sigmoid function \sigma(x) = \frac{1}{1+e^{-x}} is discriminatory

Theorem (Cybenko, 1989)

Let \sigma be any continuous discriminatory function. Then for any function f \in C(I_N), i.e. any continuous function on I_N = [0,1]^N, and any \varepsilon > 0, there exists a finite sum of the form:

G(x) = \sum_{j=1}^{M}\alpha_j\sigma(w_j^Tx + \Theta_j)

s.t.

|G(x) - f(x)| < \varepsilon, \quad \forall x \in I_N
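
A minimal NumPy sketch of this idea for N = 1: approximate a continuous target on [0,1] by a finite sum of sigmoids. The width M, the random choice of the w_j and \Theta_j, and the least-squares fit of the \alpha_j are illustrative assumptions, not part of the theorem, which only asserts that such a sum exists.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
M = 50                                    # number of terms in the finite sum
w = rng.normal(scale=20.0, size=M)        # weights w_j
theta = rng.uniform(-20.0, 20.0, size=M)  # offsets Theta_j

x = np.linspace(0.0, 1.0, 500)
f = np.sin(2 * np.pi * x) + 0.5 * x       # a continuous target on I_1 = [0, 1]

Phi = sigmoid(np.outer(x, w) + theta)     # features sigma(w_j x + Theta_j), shape (500, M)
alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)  # fit the alpha_j by least squares

G = Phi @ alpha                           # G(x) = sum_j alpha_j sigma(w_j x + Theta_j)
print("max |G(x) - f(x)| on the grid:", np.max(np.abs(G - f)))
```

Increasing M drives the grid error down, mirroring the \varepsilon-approximation guarantee, although the theorem itself says nothing about how the weights should be found.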

Definition: Normed Linear Space

A Normed Linear Space is a vector space X over \R together with a function \|.\|: X \rightarrow \R satisfying the following:

  1. \| x\| \geq 0, \forall x \in X
  2. \|x\| = 0 \iff x=0
  3. \|\alpha x\|= |\alpha|\cdot\|x\|, \forall \alpha \in \R \And x \in X
  4. \| x+y\| \leq \|x\| + \|y\|, \forall x,y \in X

Definition: Supremum Norm

\| f\| := \sup\{|f(x)| \mid x\in X \}

We can now measure the distance between two functions g and f by

\| f-g\|
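
As a quick worked example (not from the lecture): for f(x) = x and g(x) = x^2 on X = [0,1],

\| f-g\| = \sup\{|x - x^2| \mid x \in [0,1]\} = \tfrac{1}{4}

attained at x = \tfrac{1}{2}.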

Closure

Let Y be a subset of a normed vector space X. The closure of Y, \bar{Y}, consists of all x\in X s.t. for each \varepsilon>0, we can find an element y\in Y s.t.

\| y-x\| < \varepsilon

For example, the closure of the set of rational numbers, \mathbb{Q}, is the set of real numbers, \R.

Definition: Linear Functional

A linear functional on a real vector space X is a function L: X \rightarrow \R s.t.

  1. L(x+y) = L(x) + L(y), \forall x,y\in X
  2. L(\alpha x) = \alpha L(x), \forall x\in X, \alpha \in \R
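
For example, for a fixed measure \mu on I_N, the map L(f) = \int_{I_N} f(x)\,d\mu(x) is a linear functional on C(I_N); this is exactly the form that appears in the proof below.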

Theorem

Let (X, \|.\|) be a normed linear space, Y a subspace of X, and f\in X.

If f does not belong to the closure of Y, then there exists a bounded linear functional L:X\rightarrow \R s.t.

L(x) = 0 \text{ if } x\in Y, \text{ and } L(f) \neq 0

Proof

Let S \subset C(I_N) be the set of functions that can be written in the form of G(x) above.
The statement of the theorem is equivalent to the claim that the closure of S is C(I_N).
Assume for contradiction that the closure of S is a strict subset, R, of C(I_N).

It follows by the previous theorem that there must exist a bounded linear functional L:C(I_N) \rightarrow \R s.t.

L(g) = 0, \forall g\in R \quad \text{and} \quad L(h) \neq 0 \text{ for some } h \in C(I_N) \backslash R

By the Riesz representation theorem, there exists a signed measure \mu \neq 0 s.t.

L(f) = \int_{I_N} f(x)\, d\mu(x), \quad \forall f\in C(I_N)

Since \sigma(w_j^Tx+\Theta_j)\in R, \forall w_j, \Theta_j, we have:

L(\sigma(w_j^Tx+\Theta_j)) = \int_{I_N} \sigma(w_j^Tx+\Theta_j)\,d\mu(x) = 0

Since this holds for all w_j and \Theta_j and \sigma is discriminatory, it follows that \mu = 0, contradicting \mu \neq 0.
The theorem now follows.

\square

The power of depth

Cybenko's result does not tell us how many terms M are needed (the width may have to be extremely large), nor how to find suitable weights \alpha_j, w_j, \Theta_j from data.

However, it does tell us that a single hidden layer with enough units is, in principle, sufficient to approximate any continuous function on I_N to arbitrary accuracy.

Theorem (Eldan & Shamir, 2016)

If the activation function \sigma satisfies some weak assumptions, then there is a function g: \R^N \rightarrow \R and a probability measure \mu on \R^N s.t.

  1. g is "expressible" by a 3-layer network of width O(N^{\frac{19}{4}})
  2. Every function f expressed by a 2-layer network of width at most ce^{cN} satisfies:

\mathop{\mathbb{E}}_{x\sim\mu}(f(x)-g(x))^2 \geq c