Given a set of documents $D = \{d_1, d_2, \dots, d_N\}$, we can treat this as a corpus for information retrieval (IR). Suppose the number of distinct words in the corpus $D$ is $V$; this is our vocabulary size.
If we split each document $d \in D$ into $M$ different terms, we can write $d = \{t_{i(1)}, t_{i(2)}, \dots, t_{i(M)}\}$. Finally, we assign a frequency to each term s.t. term $t_{i(m)}$ occurs $f_{i(m)}$ times.
With this, the vector representation $\mathrm{vec}(d) = \mathbf{d}$ of $d$ is the $V$-dimensional vector whose entry in position $i(m)$ is $w_{i(m),d}$ and whose other entries are 0:

$$\mathrm{vec}(d) = \mathbf{d} = \begin{pmatrix} \vdots \\ w_{i(1),d} \\ \vdots \\ w_{i(M),d} \\ \vdots \end{pmatrix}$$

Here $w_{i(1),d}$ is the weighting calculated from term frequency × inverse document frequency, i.e. $w_{i(1),d} = f_{i(1),d} \times \mathrm{IDF}(i(1))$, and the subscript $i(1)$ indicates that this weight sits in the $i(1)$-th position of the vector.
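As a minimal sketch of this representation (the toy corpus, the IDF form $\log(N/\mathrm{df})$, and helper names such as `tfidf_vector` are illustrative assumptions, not from the source):

```python
import numpy as np
from collections import Counter

# Toy corpus of N documents (illustrative only)
docs = [
    "the sea and the beach".split(),
    "a walk on the beach".split(),
    "machine learning and data".split(),
]
N = len(docs)

# Vocabulary of V distinct words
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Inverse document frequency (assumed form): IDF(t) = log(N / df(t))
df = Counter(w for d in docs for w in set(d))
idf = np.array([np.log(N / df[w]) for w in vocab])

def tfidf_vector(doc):
    """V-dimensional document vector: entry for term t is tf(t, d) * IDF(t)."""
    tf = Counter(doc)
    vec = np.zeros(V)
    for w, f in tf.items():
        vec[index[w]] = f * idf[index[w]]
    return vec

d0 = tfidf_vector(docs[0])
print(d0.shape)  # (V,)
```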
Latent Semantic Analysis (LSA)
Suppose we have a corpus with a large number of documents; for each document $d$, the dimension of $\mathbf{d}$ is potentially in the thousands.
Suppose one such document contained the words 'sea' and/or 'beach'. Intuitively, if a document contains 'sea' it will often also contain 'beach'.
Mathematically, if $\mathbf{d}$ has a non-zero entry in the 'sea' component/dimension, it will often also have a non-zero entry in the 'beach' component.
Latent Semantic Classes
If it is possible for us to detect such structures as the one seen above, then we can start to discover relationships between words automatically, from data.
The above example can be described as an equivalence set of terms, including 'beach' and 'sea', which is about the seaside.
Finding such classes is non-trivial and involves some relatively advanced linear algebra; the following is only an outline.
Steps
Construct the word-document matrix A
The word-document matrix is the N×V matrix whose nth row is the document vector of the nth document in the corpus.
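Continuing the sketch above (assuming `tfidf_vector` and `docs` from the earlier example), the word-document matrix is just the document vectors stacked as rows:

```python
# N x V word-document matrix: row n is the document vector of document n
A = np.vstack([tfidf_vector(d) for d in docs])
print(A.shape)  # (N, V)
```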
Decompose A using Singular value decomposition (SVD)
This is a common function found in packages such as MATLAB or Julia's LinearAlgebra module (a NumPy sketch is given at the end of this step).
This process is similar to eigenvector decomposition: an eigenvector of a square matrix $A$ is a vector $e$ s.t. $Ae = \lambda e$, where $\lambda$ is a scalar (the corresponding eigenvalue).
For certain forms of $A$ we can write $A = UDU^T$, where $U$ is an orthogonal matrix and $D$ is diagonal.
The diagonal elements of $D$ are the eigenvalues.
The columns of $U$ are the eigenvectors.
One can think of SVD as a more general form of eigenvector decomposition that works for general (non-square) matrices. For our N×V word-document matrix we can write $A = USV^T$, where:
The matrices $U$ and $V$ are orthogonal with real-valued entries.
U is N×N where N is the number of documents in the corpus.
V is V×V where V is the vocabulary size
They satisfy: $UU^T = I = U^TU$ and $VV^T = I = V^TV$.
The singular values $s_1, \dots, s_N$ are non-negative and satisfy $s_1 \ge s_2 \ge \dots \ge s_N$.
The off-diagonal entries of $S$ are all 0; $S$ is the N×V matrix with the singular values on its diagonal.
The columns of $V$ are $V$-dimensional unit vectors orthogonal to each other. They form a new orthonormal basis for the document vector space. Each column of $V$ is a document vector corresponding to a semantic class, or topic, in the corpus. The importance of the topic $v_n$ is given by the magnitude of $s_n$.
As $v_n$ is a document vector, its $j$-th entry corresponds to the TF-IDF weight of the $j$-th vocabulary term in that topic. Using this we can work out how important the $j$-th term is to the topic $v_n$: if $v_{n,j}$ is large then the term plays an important role in the topic.
While $V$ describes topics as combinations of terms/words, $U$ describes topics as combinations of documents.
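A minimal sketch of this decomposition using NumPy (assuming the matrix `A` and the `vocab` list from the earlier examples; `full_matrices=True` gives the N×N and V×V orthogonal factors described above):

```python
# Singular value decomposition: A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=True)

print(U.shape)   # (N, N) -- topics as combinations of documents
print(Vt.shape)  # (V, V) -- rows of Vt are the columns of V (the topics)
print(s)         # singular values, s1 >= s2 >= ... >= sN

# Inspect the most important topic: terms with the largest weights in v_1
topic = Vt[0]
top_terms = [vocab[j] for j in np.argsort(-np.abs(topic))[:3]]
print(top_terms)
```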
Topic-based representations
The columns of $V$, $v_1, \dots, v_V$, form an orthonormal basis for the document vector space.
If $d$ is a document in the corpus then $d \cdot v_n$ is the magnitude of the component of $d$ in the direction of $v_n$, i.e. the component of $d$ corresponding to topic $n$. Therefore, it follows that:

$$\mathrm{top}(d) = \begin{pmatrix} d \cdot v_1 \\ d \cdot v_2 \\ \vdots \\ d \cdot v_V \end{pmatrix} = V^T d$$

is a topic-based representation of $d$ in terms of $v_1, \dots, v_V$.
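As a sketch (reusing `Vt` and the document vector `d0` from the examples above), this representation is just a change of basis:

```python
# top(d): components of d along each topic direction v_1, ..., v_V
top_d = Vt @ d0     # equivalent to V^T d
print(top_d.shape)  # (V,)
```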
Topic-based dimension reduction
Since the singular values $s_n$ indicate the importance of the topics $v_n$, we can truncate the vector $\mathrm{top}(d)$ once $s_n$ becomes small or insignificant:

$$\mathrm{top}(d) = V_{(n)}^T d$$

where $V_{(n)}$ is the V×n matrix comprised of the first $n$ columns of $V$.
We can now say that $\mathrm{top}(d)$ is a reduced, $n$-dimensional, vector representation of the document $d$.
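A sketch of the truncation (here n = 2 is an arbitrary illustrative choice; in practice n would be chosen by inspecting where the singular values become small):

```python
n = 2                       # keep only the n most important topics
Vn = Vt[:n].T               # V x n matrix: the first n columns of V
top_d_reduced = Vn.T @ d0   # n-dimensional representation of d
print(top_d_reduced.shape)  # (n,)
```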
Topic-based word representation
Suppose w is the ith word or term in the vocabulary.
The one-hot vector h(w) is the vector:
$$h(w) = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

with the single 1 in the $i$-th position.
Note: a one-hot vector is a vector where only one row has a non-zero value.
h(w) is essentially the document vector for a document consisting of just the word w
We can use h(w) to find the vector top(w) that describes w in terms of the topics that it contributes to:
$$\mathrm{top}(w) = V_{(n)}^T h(w)$$

where $V_{(n)}$ is the V×n matrix comprised of the first $n$ columns of $V$.
Intuitively, given two synonymous words $w_1$ and $w_2$, we would expect them to contribute similarly to similar topics, and as such we would expect $\mathrm{top}(w_1)$ and $\mathrm{top}(w_2)$ to point in similar directions. $\mathrm{top}(w)$ is often referred to as a word embedding.
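A sketch of such an embedding using the factorisation from the earlier examples (the `embed` and `cosine` helpers are illustrative; with a toy corpus the actual numbers are not meaningful, but the mechanics are the same):

```python
def embed(word, n=2):
    """Topic-based word embedding: top(w) = V_(n)^T h(w)."""
    h = np.zeros(V)
    h[index[word]] = 1.0  # one-hot vector for the word
    return Vt[:n] @ h     # the word's weights in the first n topics

def cosine(a, b):
    """Cosine similarity; similar words should give values near 1."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(cosine(embed("sea"), embed("beach")))  # similar words -> similar directions
```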