Lecture 7: Latent Semantic Analysis

Vector Notation for Documents

Given a set of documents $D = \left\{ d_1, d_2, \ldots, d_N \right\}$, we can consider this a corpus for information retrieval (IR). Suppose the number of distinct words in the corpus $D$ is $V$; this is our vocabulary size.

If we split each document $d \in D$ into $M$ different terms, we can write $d = \left\{ t_{i(1)}, t_{i(2)}, \ldots, t_{i(M)} \right\}$. Finally, we assign a frequency to each term s.t. term $t_{i(m)}$ occurs $f_{i(m)}$ times.

With this, the vector representation $\text{vec}(d) = \vec{d}$ of $d$ is the $V$-dimensional vector:

$$\left( 0, \ldots, 0, w_{i(1),d}, 0, \ldots, 0, w_{i(2),d}, 0, \ldots, 0, w_{i(M),d}, 0, \ldots, 0 \right)$$

where:

  • $w_{i(m),d}$ is the weight (here, the TF-IDF weight) of term $t_{i(m)}$ in document $d$, stored in component $i(m)$ of the vector
  • all other components are zero
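
To make the construction concrete, here is a minimal sketch, assuming a small pre-tokenised toy corpus and the common $\text{tf} \times \log(N/\text{df})$ weighting; the variable names (`corpus`, `doc_vector`, etc.) and the exact IDF formula are illustrative choices, not prescribed by the lecture.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of terms (assumed already tokenised).
corpus = [
    ["the", "sea", "and", "the", "beach"],
    ["the", "stock", "market", "fell"],
    ["waves", "on", "the", "beach", "by", "the", "sea"],
]

# Vocabulary: the V distinct terms in the corpus, each given a fixed index i(t).
vocab = sorted({t for doc in corpus for t in doc})
index = {t: i for i, t in enumerate(vocab)}
V, N = len(vocab), len(corpus)

# Document frequency: number of documents containing each term.
df = Counter(t for doc in corpus for t in set(doc))

def doc_vector(doc):
    """Return the V-dimensional vector vec(d) with TF-IDF weights w_{i(m),d}."""
    tf = Counter(doc)                   # f_{i(m)}: term frequencies within d
    vec = [0.0] * V
    for term, f in tf.items():
        idf = math.log(N / df[term])    # one common IDF definition (assumption)
        vec[index[term]] = f * idf      # w_{i(m),d} = tf * idf
    return vec

doc_vectors = [doc_vector(d) for d in corpus]   # rows of the word-document matrix
```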

Latent Semantic Analysis (LSA)

Suppose we have a corpus with a large number of documents; for each document $d$, the dimension of $\vec{d}$ is potentially in the thousands.

Suppose one such document contained the words 'sea' and/or 'beach'. Intuitively, if a document contains 'sea' it will often also contain 'beach'.

Mathematically, if $\vec{d}$ has a non-zero entry in the 'sea' component/dimension, it will often also have a non-zero entry in the 'beach' component.

Latent Semantic Classes

If it is possible for us to detect such structures as the one seen above, then we can start to discover relationships between words automatically, from data.

The above example can be described as an equivalence set of terms, including 'beach' and 'sea', which is about the seaside.

Finding such classes is non-trivial and involves some relatively advanced linear algebra; the following is only an outline.

Steps

  1. Construct the word-document matrix $\bf A$
    The word-document matrix is the $N\times V$ matrix whose $n^{th}$ row is the document vector of the $n^{th}$ document in the corpus.

    $$\begin{bmatrix} w_{t_1 d_1} & w_{t_2 d_1} & \cdots & w_{t_m d_1} & \cdots & w_{t_V d_1} \\ w_{t_1 d_2} & w_{t_2 d_2} & \cdots & w_{t_m d_2} & \cdots & w_{t_V d_2} \\ \vdots & \vdots & & \vdots & & \vdots \\ w_{t_1 d_n} & w_{t_2 d_n} & \cdots & w_{t_m d_n} & \cdots & w_{t_V d_n} \\ \vdots & \vdots & & \vdots & & \vdots \\ w_{t_1 d_N} & w_{t_2 d_N} & \cdots & w_{t_m d_N} & \cdots & w_{t_V d_N} \end{bmatrix}$$

  2. Decompose $\bf A$ using singular value decomposition (SVD)
    This is a common function found in packages such as MATLAB or Julia's LinearAlgebra module (a minimal NumPy sketch of steps 1 and 2 is given after this list).
    This process is similar to eigenvector decomposition: an eigenvector of a square matrix $\bf A$ is a vector $e$ s.t. $\textbf{A}e = \lambda_e e$, where $\lambda_e$ is a scalar.
    For certain matrices $\bf A$ (e.g. real symmetric matrices) we can write $\textbf{A} = \textbf{U}\textbf{D}\textbf{U}^T$, where $\bf U$ is an orthogonal matrix and $\bf D$ is diagonal.

    • The elements of $\bf D$ are the eigenvalues
    • The columns of $\bf U$ are the eigenvectors
      One can think of SVD as a more general form of eigenvector decomposition, which works for general matrices.

    In SVD we re-write $\bf A$ as $\textbf{A} = \textbf{U}\textbf{S}\textbf{V}^T$:

    $$\textbf{A} = \begin{bmatrix} u_{1,1} & u_{1,2} & \cdots & u_{1,N} \\ u_{2,1} & u_{2,2} & \cdots & u_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ u_{N,1} & u_{N,2} & \cdots & u_{N,N} \end{bmatrix} \cdot \begin{bmatrix} s_1 & 0 & 0 & \cdots & \cdots & 0 \\ 0 & s_2 & 0 & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & & & \vdots \\ 0 & 0 & \cdots & s_N & \cdots & 0 \end{bmatrix} \cdot \begin{bmatrix} v_{1,1} & v_{2,1} & \cdots & v_{m,1} & \cdots & v_{V,1} \\ v_{1,2} & v_{2,2} & \cdots & v_{m,2} & \cdots & v_{V,2} \\ \vdots &\vdots & &\vdots & &\vdots \\ v_{1,n} & v_{2,n} & \cdots & v_{m,n} & \cdots & v_{V,n} \\ \vdots &\vdots & &\vdots & &\vdots \\ v_{1,V} & v_{2,V} & \cdots & v_{m,V} & \cdots & v_{V,V} \end{bmatrix}$$
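
The following is the sketch referred to in step 2: it stacks the document vectors from the earlier sketch into `A` and calls NumPy's `np.linalg.svd`, which returns $\textbf{V}^T$ directly as `Vt`. This is a minimal sketch under those assumptions, not a prescribed implementation.

```python
import numpy as np

# Step 1: stack the N document vectors (from the earlier sketch) into the
# N x V word-document matrix A, whose n-th row is vec(d_n).
A = np.array(doc_vectors)

# Step 2: singular value decomposition, A = U S V^T.
# full_matrices=False gives the compact form: U is N x k and Vt is k x V,
# with k = min(N, V); s holds the singular values s_1 >= s_2 >= ... >= 0.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)
# Sanity check: the factors reconstruct A (up to floating-point error).
assert np.allclose(A, U @ np.diag(s) @ Vt)
```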

Interpretation

The matrices $\bf U$ and $\bf V$ are orthogonal, with real-valued entries.

The columns of $\bf V$ are $V$-dimensional unit vectors orthogonal to each other. They form a new orthonormal basis for the document vector space. Each column of $\bf V$ is a document vector corresponding to a semantic class, or topic, in the corpus. The importance of the topic $v_n$ is given by the magnitude of $s_n$.

As $v_n$ is a document vector, its $j^{th}$ value corresponds to the TF-IDF weight of the $j^{th}$ term in the vocabulary for that topic. Using this, we can work out how important the $j^{th}$ term is to a topic $v_n$: if $v_{n,j}$ is large then the term plays an important role in the topic.

While $\bf V$ describes topics as combinations of terms/words, $\bf U$ describes topics as combinations of documents.
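
As a rough illustration of this interpretation, one might list the highest-weighted terms of each topic vector, continuing the NumPy sketch above (showing the top 3 terms is an arbitrary choice):

```python
# Each row of Vt is a topic vector v_n (transposed); its largest-magnitude
# entries indicate the terms that play the most important role in that topic.
for n, (s_n, topic) in enumerate(zip(s, Vt), start=1):
    ranked = sorted(zip(topic, vocab), key=lambda pair: abs(pair[0]), reverse=True)
    terms = ", ".join(f"{term} ({weight:+.2f})" for weight, term in ranked[:3])
    print(f"topic {n}: importance s_n = {s_n:.2f}; top terms: {terms}")
```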

Topic-based representations

The columns of $\bf V$, $v_1, \ldots, v_V$, form an orthonormal basis for the document vector space.

If $d$ is a document in the corpus then $\vec{d}\cdot v_n$ is the magnitude of the component of $\vec{d}$ in the direction of $v_n$, i.e. the component of $\vec{d}$ corresponding to topic $n$. Therefore, it follows that:

$$\text{top}(d) = \begin{bmatrix} \vec{d}\cdot v_1 \\ \vec{d}\cdot v_2 \\ \vdots \\ \vec{d}\cdot v_n \\ \vdots \\ \vec{d}\cdot v_V \end{bmatrix}$$

is a topic-based representation of $d$ in terms of $v_1,\ldots,v_V$.
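
In code this is just a matrix-vector product; a minimal sketch continuing the NumPy example above (the choice of `doc_vectors[0]` is arbitrary):

```python
# top(d): the components of vec(d) along each topic direction v_n.
# With the compact SVD above, Vt has one row per retained topic, so
# Vt @ vec(d) stacks the dot products d.v_1, d.v_2, ... into one vector.
d_vec = np.array(doc_vectors[0])
top_d = Vt @ d_vec
print(top_d)
```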

Topic-Based dimension reduction

Since the singular values $s_n$ indicate the importance of the topics $v_n$, we can truncate the vector $\text{top}(d)$ when $s_n$ becomes small or insignificant.

$$\text{top}(d) \approx \begin{bmatrix} \vec{d}\cdot v_1 \\ \vec{d}\cdot v_2 \\ \vdots \\ \vec{d}\cdot v_n \end{bmatrix} = \textbf{V}_{(n)}^T \, \text{vec}(d)$$

where $\textbf{V}_{(n)}$ is the $V\times n$ matrix comprised of the first $n$ columns of $\bf V$.
We can now say that $\text{top}(d)$ is a reduced, $n$-dimensional vector representation of the document $d$.
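
A minimal sketch of the truncation, continuing the example above; the rule used here to choose $n$ (keep topics covering 90% of the summed squared singular values) is just one common heuristic and an assumption, not something prescribed by the lecture:

```python
# Pick n so that the retained topics carry most of the "energy" (sum of
# squared singular values); the 90% threshold is an arbitrary choice.
energy = np.cumsum(s**2) / np.sum(s**2)
n = int(np.searchsorted(energy, 0.90)) + 1

V_n_T = Vt[:n]                  # V_(n)^T: the first n rows of V^T  (n x V)
top_d_reduced = V_n_T @ d_vec   # reduced n-dimensional representation of d
print(n, top_d_reduced)
```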

Topic-based word representation

Suppose $w$ is the $i^{th}$ word or term in the vocabulary.

The one-hot vector $h(w)$ is the vector:

$$h(w) = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

Note: a one-hot vector is a vector in which only one entry, here the $i^{th}$, is non-zero (and equal to 1).

$h(w)$ is essentially the document vector for a document consisting of just the word $w$.

We can use $h(w)$ to find the vector $\text{top}(w)$ that describes $w$ in terms of the topics that it contributes to:

$$\text{top}(w) = \begin{bmatrix} h(w) \cdot v_1 \\ h(w) \cdot v_2 \\ \vdots \\ h(w) \cdot v_n \end{bmatrix} = \textbf{V}_{(n)}^T \, h(w)$$

where $\textbf{V}_{(n)}$ is the $V\times n$ matrix comprised of the first $n$ columns of $\bf V$.

Intuitively, given two synonymous words $w_1$ and $w_2$, we would expect them to contribute similarly to the same topics, and as such we would expect $\text{top}(w_1)$ and $\text{top}(w_2)$ to point in similar directions. $\text{top}(w)$ is often referred to as a word embedding.
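
A minimal sketch, continuing the toy example above: cosine similarity is used here as one standard way to measure "pointing in similar directions", and the `word_embedding` helper (with its default $n = 2$) is an illustrative assumption, not part of the lecture.

```python
def word_embedding(w, n=2):
    """top(w) = V_(n)^T h(w). Since h(w) is one-hot at index i(w), this is
    simply column i(w) of V_(n)^T, i.e. column i(w) of Vt cut to n rows."""
    return Vt[:n, index[w]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that co-occur across documents should receive embeddings pointing in
# similar directions (cosine similarity near 1); unrelated words should not.
print(cosine(word_embedding("sea"), word_embedding("beach")))
print(cosine(word_embedding("sea"), word_embedding("market")))
```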