Lecture 9: Self-Organising Maps

$K$-means Clustering

$K$-means is an iterative algorithm for grouping data into $K$ distinct clusters.

Algorithm

  1. Estimate the initial centroid values $c_1^0, \ldots, c_K^0$
  2. Set $i=0$
  3. $\forall n \in 1...N, \forall k \in 1...K$, calculate $d(x_n, c_k^i)$
  4. $\forall k \in 1...K$:
    1. Let $X^i(k)$ be the set of $x_n$s that are closest to $c_k^i$
    2. Define $c_k^{i+1}$ to be the average of the data points in $X^i(k)$:
      $$c_k^{i+1} = \frac{1}{|X^i(k)|} \sum_{x\in X^i(k)}x$$
    3. Set $i \leftarrow i+1$ and return to step 3
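A minimal sketch of these steps in Python, assuming NumPy and a data matrix `X` of shape $(N, D)$ (the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: X is an (N, D) data matrix, K the number of centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: initial centroids = K randomly chosen data points
    C = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 3: distances d(x_n, c_k^i) between every point and every centroid
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)   # shape (N, K)
        nearest = d.argmin(axis=1)            # index of the closest centroid per point
        # Step 4: replace each centroid by the mean of the points assigned to it
        C_new = np.array([X[nearest == k].mean(axis=0) if np.any(nearest == k) else C[k]
                          for k in range(K)])
        if np.allclose(C_new, C):             # stop once the centroids no longer move
            break
        C = C_new
    return C, nearest
```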

See slides for examples

Optimality

We can now ask the question: is the set of $K$ centroids $\hat{C}$ created via $K$-means globally optimal? I.e. does the following hold for every possible set of centroids $C$?

$$D(C,X) \geq D(\hat{C},X)$$

No! $K$-means is only guaranteed to find a local optimum, with its final result being dependent on the initial centroid guesses.
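One common response, sketched below, is to run $K$-means from several random initialisations and keep the result with the lowest distortion. This assumes the `kmeans` sketch above and takes $D(C,X)$ to be the sum of squared distances from each point to its nearest centroid, which these notes do not spell out here:

```python
import numpy as np

def distortion(C, X):
    """D(C, X): sum of squared distances from each point to its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)   # (N, K) distance matrix
    return (d.min(axis=1) ** 2).sum()

# Different initialisations can reach different local optima; keep the best run.
X = np.random.default_rng(1).normal(size=(300, 2))              # placeholder data
restarts = [kmeans(X, K=3, seed=s)[0] for s in range(10)]
best_C = min(restarts, key=lambda C: distortion(C, X))
```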

Alternative to $K$-means

In $K$-means we calculate the distances between all data points and all centroids before any update takes place. Instead, we can update centroid locations based on seeing a single data point $x_n$.

'Online' Clustering

Online clustering updates centroids with each sample using the following equation:

$$c_k^{\text{new}} = c_k^{\text{old}} + \eta(x_n - c_k^{\text{old}})$$

Where $\eta>0$ is the learning rate.

With a fixed learning rate the centroids never settle, because every new sample moves them by the same proportion. Our solution is to start with a big value of $\eta$ and to shrink it over time:

$$\eta(t) = \eta(0) \times e^{\frac{-t}{\tau}}$$

Where $\tau$ is the time scale, determining how fast $\eta$ will decrease.

This process is similar to that of annealing
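For concreteness, a small sketch of this schedule (the values $\eta(0)=0.5$ and $\tau=100$ are illustrative choices, not from the lecture):

```python
import numpy as np

def eta(t, eta0=0.5, tau=100.0):
    """Exponentially decaying learning rate: eta(t) = eta(0) * exp(-t / tau)."""
    return eta0 * np.exp(-t / tau)

# eta(0) = 0.5, eta(100) ~ 0.18, eta(300) ~ 0.025: large moves early, tiny moves late.
```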

Algorithm

  1. Choose the number of centroids, $K$

  2. Randomly choose an initial codebook $\{c_1,\ldots,c_K\}$

  3. $\forall x_n \in X$:

    1. Find the closest centroid $c_{i(n)}$
    2. Move $c_{i(n)}$ closer to $x_n$:
      $$c_{i(n)}^{\text{new}} = c_{i(n)}^{\text{old}} + \eta(t)\left(x_n-c_{i(n)}^{\text{old}}\right)$$

    where $\eta(t) > 0$ is a small learning rate that reduces with time as determined by the equation
    $$\eta(t) = \eta(0) \times e^{\frac{-t}{\tau}}$$
    where $\tau$ is the timescale
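A minimal sketch of this algorithm in Python, again assuming NumPy and an $(N, D)$ data matrix `X` (the learning-rate and timescale values are illustrative, not from the lecture):

```python
import numpy as np

def online_clustering(X, K, eta0=0.5, tau=100.0, seed=0):
    """Online (sample-by-sample) clustering with a decaying learning rate."""
    rng = np.random.default_rng(seed)
    # Step 2: random initial codebook of K data points
    C = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for t, x_n in enumerate(X):                            # step 3: one sample at a time
        i_n = np.linalg.norm(C - x_n, axis=1).argmin()     # 3.1: closest centroid c_{i(n)}
        eta_t = eta0 * np.exp(-t / tau)                    # decaying learning rate eta(t)
        C[i_n] += eta_t * (x_n - C[i_n])                   # 3.2: move it towards x_n
    return C
```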

Enhancements to Online Clustering

Neighbourhood Structure

Neighbourhood structuring is the process by which each centroid in the centroid set is connected to its nearest neighbours. The number of neighbours it is directly connected to depends on the number of dimensions the structure is imposed for.

We can imagine the connections between these centroids in a neighbourhood as elastic bands: when we move one centroid closer to a data point to further reduce the distortion of the dataset, we also move all other centroids by an amount proportional to how closely connected each one is to the centroid we just moved directly.

1 Dimensional Neighbourhood Structure

In 1 dimension the centroids form a chain: each centroid is directly connected to its immediate neighbours, and through the chain every centroid is ultimately connected to all other centroids in the neighbourhood.

To define a neighbourhood structure you require a definition of distance between the members of the neighbourhood.

So for a neighbourhood centred on centroid $j$ with distance $d$, we can define the set of all centroids whose indices differ from $j$ by at most $d$ as:

$$N_j(d) = \{c_k \,\big\vert\; |k-j| \leq d \}$$

An illustration of a 1D neighbourhood defined with a range of distances can be seen below

Note that these neighbourhoods are defined on the centroid index not the centroid coordinates!
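As a small sketch of this definition (the function name and example values are assumptions):

```python
def neighbourhood_1d(j, d, K):
    """Indices k of the centroids in N_j(d): all k with |k - j| <= d."""
    return [k for k in range(K) if abs(k - j) <= d]

# e.g. with K = 10 centroids, neighbourhood_1d(4, 2, 10) -> [2, 3, 4, 5, 6]
```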

2 Dimensional Neighbourhood Structure

Similarly we can define our neighbourhood structure in 2 dimensions with the equation:

$$N_{ij}(d) = \left\{ c_{kl} \,\Big\vert\, \left\lVert \begin{bmatrix} k \\ l \end{bmatrix} - \begin{bmatrix} i \\ j \end{bmatrix} \right\rVert \leq d \right\}$$
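A similar sketch for the 2D case, treating the norm as Euclidean distance on the index pairs (the grid-size parameters are assumptions):

```python
import numpy as np

def neighbourhood_2d(i, j, d, rows, cols):
    """Index pairs (k, l) in N_ij(d): all grid positions within distance d of (i, j)."""
    return [(k, l) for k in range(rows) for l in range(cols)
            if np.hypot(k - i, l - j) <= d]
```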

Constrained Clustering

The aim of constrained clustering using self-organising or topographic maps is to discover the $N$-dimensional structure of high-dimensional data by clustering whilst constraining our centroids to lie on an $N$-dimensional 'elastic',

where $N$ is the dimensionality of the neighbourhood structure.

Recalling our equation for Online Clustering:

$$c_{i(n)}^{\text{new}} = c_{i(n)}^{\text{old}} + \eta(t)\left(x_n-c_{i(n)}^{\text{old}}\right)$$

We can now define our update rule for constrained clustering using a topographic map as:

$$c_k^{\text{new}} = c_k^{\text{old}} + h(i(n),k) \times \eta(t) \times (x_n - c_k^{\text{old}})$$

Where $h(i(n),k)$ indicates how close the $k^{\text{th}}$ centroid is to the centroid $c_{i(n)}$ closest to $x_n$. Note that every centroid $c_k$ is now updated for each sample, weighted by $h$.

What do we choose for $h(x,y)$?

Any candidate function must satisfy:

  1. $h(i(n),i(n)) = 1$
  2. $h(i(n),k)$ decreases as $c_k$ becomes further away from $c_{i(n)}$

One possible function is:

$$h(i(n),k) = \exp\left[ \frac{-\|i(n) - k \|}{\sigma} \right]$$

Where $\sigma$ is the neighbourhood width, or the strength of the elastic.

In choosing a value for $\sigma$ we can employ simulated annealing, whereby we initially choose a high value to allow for broad cooperation between centroids but gradually reduce it as the algorithm runs.

One possible annealing schedule is:

$$\sigma(t) = \sigma(0) \times \exp\left[ \frac{-t}{\nu}\right]$$

where $\nu > 0$ is the timescale and $\sigma(0)$ is the initial neighbourhood width.
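Putting the pieces together, here is a hedged sketch of constrained clustering with a 1D neighbourhood structure, combining the update rule, the exponential $h$, and the annealed $\eta(t)$ and $\sigma(t)$ (NumPy assumed; all parameter values are illustrative, not from the lecture):

```python
import numpy as np

def som_1d(X, K, eta0=0.5, tau=200.0, sigma0=5.0, nu=200.0, seed=0):
    """1D self-organising map: online clustering plus a neighbourhood function h."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    idx = np.arange(K)                                     # centroid indices on the 1D chain
    for t, x_n in enumerate(X):
        i_n = np.linalg.norm(C - x_n, axis=1).argmin()     # winning centroid index i(n)
        eta_t = eta0 * np.exp(-t / tau)                    # annealed learning rate eta(t)
        sigma_t = sigma0 * np.exp(-t / nu)                 # annealed neighbourhood width sigma(t)
        h = np.exp(-np.abs(idx - i_n) / sigma_t)           # h(i(n), k) for every centroid k
        C += h[:, None] * eta_t * (x_n - C)                # update all centroids at once
    return C
```

The only change from plain online clustering is that every centroid moves on each sample, scaled by how close its index sits to the winner's index on the elastic.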