Previously, we looked at different optimisation algorithms.
Hypothesis:
Neural networks generalise from the training data, i.e. they learn the inherent structure in the data.
Test of Hypothesis: removing structure should reduce network performance
Zhang et al. (ICLR 2017) trained a network on the CIFAR-10 dataset with the following settings (sketched in code after the list):
- True labels: the original training set (ground truth)
- Random labels: all labels are replaced with random ones
- Shuffled pixels: a random permutation of the pixels is applied to each image
- Gaussian: a Gaussian distribution is used to generate random pixels for each image
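As a rough illustration (not the authors' code), the three randomisations can be sketched in NumPy; the array shapes and function names below are assumptions for toy data shaped like CIFAR-10:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_labels(y, num_classes=10):
    """Replace every label with one drawn uniformly at random."""
    return rng.integers(0, num_classes, size=len(y))

def shuffle_pixels(X):
    """Apply a random permutation of the pixels to each image independently."""
    N, H, W, C = X.shape
    flat = X.reshape(N, H * W, C)
    out = np.stack([img[rng.permutation(H * W)] for img in flat])
    return out.reshape(N, H, W, C)

def gaussian_images(X):
    """Replace each image with Gaussian noise matching the dataset's mean and std."""
    return rng.normal(X.mean(), X.std(), size=X.shape)

# Toy stand-in data with CIFAR-10-like shapes
X, y = rng.random((8, 32, 32, 3)), rng.integers(0, 10, size=8)
print(random_labels(y), shuffle_pixels(X).shape, gaussian_images(X).shape)
```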

Findings:
- Deep neural networks easily fit random labels
- The effective capacity of neural networks is sufficient for memorising the entire dataset
- Training time increases only by a small constant factor
Perceptrons
These were a very early form of artificial neural network and are very weak compared to current approaches.
$$f(x) = \mathrm{sign}(w^T x - b), \quad \text{where } \mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{otherwise} \end{cases}$$
The boundary between the positive and negative outputs of a single-layer perceptron can be described by a linear hyperplane.
For example, in 2D we get:
$$f(x_1, x_2) = \mathrm{sign}(w_1 x_1 + w_2 x_2 - b)$$
$$w_1 x_1 + w_2 x_2 - b < 0 \iff x_2 < \frac{b - w_1 x_1}{w_2} \quad (\text{assuming } w_2 > 0)$$
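A minimal numerical check of this boundary, with arbitrary illustrative values for $w$ and $b$ (and $w_2 > 0$):

```python
import numpy as np

def perceptron(x, w, b):
    """Single-layer perceptron: +1 where w.x - b > 0, else -1."""
    return np.where(x @ w - b > 0, 1, -1)

w, b = np.array([2.0, 1.0]), 0.5            # arbitrary weights and bias (w2 > 0)
x1 = 0.3
x2_boundary = (b - w[0] * x1) / w[1]        # the x2 value on the separating line

above = np.array([[x1, x2_boundary + 0.1]])  # just above the line
below = np.array([[x1, x2_boundary - 0.1]])  # just below the line
print(perceptron(above, w, b), perceptron(below, w, b))   # [1] [-1]
```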

A single-layer perceptron cannot learn the XOR function (see the brute-force check after the list below)
- Caused controversy and led to the "AI Winter"
⟹ Reduced research funding to Neural Network research
⟹ Reduced interest among researchers
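A brute-force sketch of the XOR claim: no weight/bias combination over a coarse grid classifies all four XOR points correctly (the grid range and the ±1 labels are illustrative choices):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])                      # XOR with +-1 labels

grid = np.linspace(-5, 5, 21)                     # coarse grid of candidate parameters
solved = False
for w1 in grid:
    for w2 in grid:
        for b in grid:
            pred = np.where(X @ np.array([w1, w2]) - b > 0, 1, -1)
            if np.array_equal(pred, y):
                solved = True
print("XOR separable by a single-layer perceptron:", solved)   # False
```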
The Lebesgue Integral
$$\int f(x)\, d\mu(x)$$
This is an alternative to the Riemann integral that is defined for a larger class of functions $f$.
It is defined with respect to a measure $\mu$, which assigns a size to subsets of the domain of $f$.
It can be defined as:
Given a non-negative function $f : X \to \mathbb{R}$, let
$$f^*(t) = \mu(\{x \mid f(x) > t\})$$
then
$$\int f(x)\, d\mu(x) = \int_0^\infty f^*(t)\, dt$$
where the LHS is the Lebesgue integral and the RHS is an ordinary Riemann integral.
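A small numerical sanity check of this identity for a non-negative $f$ on $[0,1]$, with $\mu$ taken as ordinary length; the specific $f$ and grid sizes below are arbitrary choices:

```python
import numpy as np

# A non-negative f on [0, 1]; mu is Lebesgue measure (ordinary length)
f = lambda x: x**2 + 0.5

xs = np.linspace(0, 1, 10_001)
fx = f(xs)

# Left-hand side: integral of f over [0, 1] (simple Riemann-sum approximation)
lhs = fx.mean() * 1.0

# Right-hand side: integral over t of mu({x : f(x) > t}),
# with the measure approximated by the fraction of grid points exceeding t
ts = np.linspace(0, fx.max(), 10_001)
mu_gt_t = np.array([(fx > t).mean() for t in ts])
rhs = mu_gt_t.mean() * fx.max()

print(lhs, rhs)   # both close to 1/3 + 1/2 = 0.8333...
```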
Definition:
$\sigma : \mathbb{R} \to \mathbb{R}$ is called discriminatory iff, for a finite signed measure $\mu$ on $I_N = [0,1]^N$:
$$\int_{I_N} \sigma(y^T x + \Theta)\, d\mu(x) = 0 \quad \forall y \in \mathbb{R}^N, \ \Theta \in \mathbb{R} \implies \mu = 0$$
Lemma
Any function $\sigma : \mathbb{R} \to \mathbb{R}$ where
$$\lim_{x \to \infty} \sigma(x) = 1, \qquad \lim_{x \to -\infty} \sigma(x) = 0$$
is discriminatory.
For example, the sigmoid function $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ is discriminatory.
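A small numerical illustration (values below are arbitrary): the sigmoid satisfies the limit conditions of the lemma, and scaling its argument pushes it towards a hard threshold, which is the limiting step-function behaviour used when showing such functions are discriminatory:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Limit behaviour required by the lemma: sigma(x) -> 1 as x -> +inf, -> 0 as x -> -inf
print(sigmoid(50.0), sigmoid(-50.0))          # ~1.0, ~0.0

# Scaling the argument, sigma(lam * x) approaches a step function at x = 0
x = np.linspace(-1, 1, 5)
for lam in (1, 10, 100):
    print(lam, np.round(sigmoid(lam * x), 3))
```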
Theorem (Cybenko)
Let $\sigma$ be any continuous discriminatory function. Then for any function $f \in C(I_N)$, i.e. a continuous function on $I_N = [0,1]^N$, and any $\varepsilon > 0$, there exists a finite sum of the form:
$$G(x) = \sum_{j=1}^{M} \alpha_j\, \sigma(w_j^T x + \Theta_j)$$
s.t.
$$|G(x) - f(x)| < \varepsilon, \quad \forall x \in I_N$$
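As an informal one-dimensional illustration of the theorem (not Cybenko's construction): fit $G(x) = \sum_j \alpha_j \sigma(w_j x + \Theta_j)$ to a target $f$ by drawing random $w_j, \Theta_j$ and solving for the $\alpha_j$ by least squares; all specific values below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

f = lambda x: np.sin(2 * np.pi * x)           # target continuous function on [0, 1]
x = np.linspace(0, 1, 512)

M = 50                                        # number of hidden units
w = rng.normal(0, 20, size=M)                 # random inner weights w_j
theta = rng.normal(0, 10, size=M)             # random offsets Theta_j

H = sigmoid(np.outer(x, w) + theta)           # H[i, j] = sigma(w_j * x_i + Theta_j)
alpha, *_ = np.linalg.lstsq(H, f(x), rcond=None)   # choose alpha_j by least squares

G = H @ alpha
print("max |G - f| on the grid:", np.abs(G - f(x)).max())
```

Increasing $M$ (and re-fitting) tends to shrink the grid error, in line with the theorem's "for any $\varepsilon > 0$ there is a finite sum" guarantee.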

Definition: Normed Linear Space
A Normed Linear Space is a vector space $X$ over $\mathbb{R}$ together with a function $\|\cdot\| : X \to \mathbb{R}$ satisfying the following:
- $\|x\| \geq 0, \ \forall x \in X$
- $\|x\| = 0 \iff x = 0$
- $\|\alpha x\| = |\alpha| \cdot \|x\|, \ \forall \alpha \in \mathbb{R}, \ x \in X$
- $\|x + y\| \leq \|x\| + \|y\|, \ \forall x, y \in X$
Definition: Supremum Norm
$$\|f\| := \sup\{\, |f(x)| \mid x \in X \,\}$$
We can now measure the distance between two functions $f$ and $g$ by
$$\|f - g\|$$
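For instance, the supremum-norm distance can be estimated on a grid; the two functions below are arbitrary examples:

```python
import numpy as np

f = lambda x: np.sin(x)
g = lambda x: x - x**3 / 6         # Taylor approximation of sin around 0

x = np.linspace(0, 1, 10_001)
sup_distance = np.max(np.abs(f(x) - g(x)))   # grid estimate of ||f - g||
print(sup_distance)                          # ~ 0.008 (largest gap occurs at x = 1)
```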
Closure
Let $Y$ be a subset of a normed vector space $X$. The closure of $Y$, denoted $\bar{Y}$, consists of all $x \in X$ s.t. for each $\varepsilon > 0$ we can find an element $y \in Y$ s.t.
$$\|y - x\| < \varepsilon$$
For example, the closure of the set of rational numbers, $\mathbb{Q}$, is the set of real numbers, $\mathbb{R}$.
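A small illustration (with $\sqrt{2}$ as an arbitrary irrational target): for every $\varepsilon > 0$ we can exhibit a rational within $\varepsilon$ of it:

```python
from fractions import Fraction
import math

target = math.sqrt(2)                       # an irrational real number
for eps in (1e-2, 1e-4, 1e-6):
    # Best rational approximation with a bounded denominator
    q = Fraction(target).limit_denominator(int(2 / eps))
    print(q, abs(float(q) - target) < eps)  # always True: a rational within eps exists
```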
Definition: Linear Functional
A linear functional on a real vector space $X$ is a function $L : X \to \mathbb{R}$ s.t. (a quick numerical check appears after the list):
- $L(x + y) = L(x) + L(y), \ \forall x, y \in X$
- $L(\alpha x) = \alpha L(x), \ \forall x \in X, \ \alpha \in \mathbb{R}$
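A quick numerical check, assuming the example functional $L(f) = \int_0^1 f(x)\, dx$ (approximated by a Riemann sum), which is linear on $C([0,1])$; the test functions and grid are arbitrary:

```python
import numpy as np

xs = np.linspace(0, 1, 10_001)

def L(f):
    """L(f) ~ integral of f over [0, 1]: an example linear functional on C([0, 1])."""
    return f(xs).mean()

f = lambda x: np.sin(x)
g = lambda x: x**2
alpha = 3.0

# Additivity and homogeneity hold up to the numerical error of the Riemann sum
print(np.isclose(L(lambda x: f(x) + g(x)), L(f) + L(g)))        # True
print(np.isclose(L(lambda x: alpha * f(x)), alpha * L(f)))      # True
```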
Theorem
Let $(X, \|\cdot\|)$ be a normed linear space, let $Y$ be a subspace of $X$, and let $f \in X$.
If $f$ does not belong to the closure of $Y$, then there exists a bounded linear functional $L : X \to \mathbb{R}$ s.t.
$$L(x) = 0 \ \text{if } x \in Y, \quad \text{and} \quad L(f) \neq 0$$
Proof (of Cybenko's theorem)
Let $S \subset C(I_N)$ be the set of functions that can be written in the form of the finite sums $G(x)$ above.
The statement of the theorem is equivalent to the claim that the closure of $S$ is $C(I_N)$.
Assume, for contradiction, that the closure of $S$ is a strict subset $R$ of $C(I_N)$.
It follows from the previous theorem that there must exist a bounded linear functional $L : C(I_N) \to \mathbb{R}$ s.t.
$$L(g) = 0 \ \ \forall g \in R, \qquad L(h) \neq 0 \ \text{for some } h \in C(I_N) \setminus R$$
By the Riesz representation theorem, there exists a signed measure $\mu \neq 0$ s.t.
$$L(f) = \int_{I_N} f(x)\, d\mu(x), \quad \forall f \in C(I_N)$$
Since $\sigma(w_j^T x + \Theta_j) \in R$ for all $w_j, \Theta_j$, we have:
$$L\big(\sigma(w_j^T x + \Theta_j)\big) = \int_{I_N} \sigma(w_j^T x + \Theta_j)\, d\mu(x) = 0$$
However, since $\sigma$ is discriminatory, this forces $\mu = 0$, contradicting $\mu \neq 0$.
The theorem now follows
□
Cybenko's result does not tell us:
- how many units are needed in the hidden layer
- how difficult it is to train the network to approximate the function
However it does tell us:
- any continuous function on $I_N$ can be approximated to arbitrary accuracy by a network with a single hidden layer
However, depth matters: if the activation function $\sigma$ satisfies some weak assumptions, then there is a function $g : \mathbb{R}^N \to \mathbb{R}$, a probability measure $\mu$ on $\mathbb{R}^N$, and a constant $c > 0$ s.t.
- $g$ is "expressible" by a 3-layer network of width $O(n^{19/4})$
- every function $f$ expressed by a 2-layer network of width at most $c e^{cn}$ satisfies:
$$\mathbb{E}_{x \sim \mu}\big[(f(x) - g(x))^2\big] \geq c$$