Lecture 4: Vector Representation of Documents

Vector Notation for Documents

Suppose we have a set of documents $D = \{ d_1, d_2, \ldots, d_N \}$. We can think of this as the corpus for information retrieval (IR).

Suppose that the number of different words in the corpus is $V$, our vocabulary size.

Now suppose a document $d \in D$ contains $M$ different terms: $\{ t_{i(1)}, t_{i(2)}, \ldots, t_{i(M)} \}$

Finally, assume term $t_{i(m)}$ occurs $f_{i(m)}$ times in $d$.

The vector representation $vec(d)$ of $d$ is the $V$-dimensional vector:

$$vec(d) = \vec{d} = \begin{pmatrix} w_{t_1,d} \\ \vdots \\ w_{t_i,d} \\ \vdots \\ w_{t_V,d} \end{pmatrix}$$

where $w_{t_i,d}$ is the weight of the $i^{th}$ term relative to document $d$, for $i = 1, \ldots, V$.
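As a minimal sketch of this construction, the Python snippet below builds $vec(d)$ for a toy corpus. It assumes whitespace tokenisation and uses the raw term frequency as the weight $w_{t_i,d}$; the notes do not fix a weighting scheme, so both are illustrative choices, and the names `build_vocabulary` and `vectorise` are hypothetical.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Return the sorted list of distinct terms in the corpus (length V)."""
    return sorted({term for doc in corpus for term in doc.lower().split()})

def vectorise(doc, vocabulary):
    """Return the V-dimensional vector of a document.

    Assumption: the weight of a term is its raw frequency in the document,
    so terms that do not occur in the document get weight 0.
    """
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

corpus = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorise(doc, vocabulary) for doc in corpus]

print(vocabulary)
for doc, vector in zip(corpus, vectors):
    print(doc, "->", vector)
```

Every vector has one component per vocabulary term, so most components are zero once the vocabulary is much larger than any single document.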

Uniqueness

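Is the mapping between documents and vectors one-to-one?

No. If two document vectors are equal, the documents contain the same words with the same weights, but the words need not appear in the same order. Two different documents can therefore share one vector.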

If $\lambda$ is a scalar and $\vec{d_1} = \lambda \vec{d_2}$, then $d_1$ and $d_2$ consist of the same words. When the weights are raw term frequencies, each word occurs $\lambda$ times as often in $d_1$ as in $d_2$.
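Both observations can be checked on a small example. The sketch below assumes whitespace tokenisation and raw term counts as weights; `term_counts` is a hypothetical helper, not something defined in these notes.

```python
from collections import Counter

def term_counts(doc):
    """Map each term to its number of occurrences (assumed weighting)."""
    return Counter(doc.lower().split())

# Same words in a different order give the same vector.
a = term_counts("dog bites man")
b = term_counts("man bites dog")
same_vector = (a == b)

# Repeating a document twice scales every component by lambda = 2.
c = term_counts("dog bites man dog bites man")
scaled_by_two = set(c) == set(a) and all(c[t] == 2 * a[t] for t in a)

print(same_vector, scaled_by_two)
```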

Document Length

Recall that the length (norm) of a vector $\vec{x} = (x_1, \ldots, x_N)$ is given by:

$$\|\vec{x}\| = \sqrt{x_1^2 + x_2^2 + \ldots + x_N^2}$$

Therefore, in the case of a document vector

$$vec(d) = (0, \ldots, 0, w_{i(1)d}, 0, \ldots, 0, w_{i(2)d}, 0, \ldots, w_{i(M)d}, 0, \ldots, 0)$$

$$\|vec(d)\| = \sqrt{w_{i(1)d}^2 + w_{i(2)d}^2 + \ldots + w_{i(M)d}^2} = \|d\|$$

where $\|d\|$ is the length of the document $d$.
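Because only the $M$ terms that occur in $d$ have non-zero weight, the length can be computed from those weights alone. A minimal sketch, assuming the document is stored as a dictionary from term to weight (a hypothetical representation chosen for illustration):

```python
import math

def doc_length(weights):
    """Euclidean norm of a document vector.

    `weights` maps each term that occurs in the document to its weight;
    terms that are absent contribute 0 and are simply not stored.
    """
    return math.sqrt(sum(w * w for w in weights.values()))

weights = {"cat": 3.0, "mat": 4.0}
length = doc_length(weights)
print(length)
```

Note that this length depends on the weights, not on the number of words in the document.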

Document Similarity

Suppose $d$ is a document and $q$ is a query.

Cosine Similarity

We define the Cosine Similarity between $d$ and $q$ by:

$$CSim(q,d) = \cos\theta$$

where $\theta$ is the angle between $\vec{q}$ and $\vec{d}$.

Similarly, the Cosine Similarity between documents $d_1$ and $d_2$ can be defined:

$$CSim(d_1,d_2) = \cos\theta$$

where $\theta$ is the angle between $\vec{d_1}$ and $\vec{d_2}$.

Recall that for two-dimensional vectors $u = (x_1, y_1)$ and $v = (x_2, y_2)$ separated by an angle $\theta$:

$$\cos(\theta) = \frac{x_1x_2 + y_1y_2}{\|u\|\|v\|} = \frac{u\cdot v}{\|u\|\|v\|}$$

Therefore, if $q$ is a query, $d$ is a document and $\theta$ is the angle between $\vec{q}$ and $\vec{d}$, then:

$$CSim(q,d) = \cos(\theta) = \frac{\vec{q}\cdot \vec{d}}{\|q\|\|d\|} = \frac{\sum_{t\in q\cap d}w_{tq}\cdot w_{td}}{\|q\|\|d\|} = Sim(q,d)$$
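The sum runs only over terms that appear in both $q$ and $d$, since any other term contributes a zero product. The sketch below illustrates this with hypothetical term-to-weight dictionaries; returning 0.0 when either vector has zero length is an assumed convention, not something stated in these notes.

```python
import math

def norm(weights):
    """Euclidean length of a sparse vector stored as term -> weight."""
    return math.sqrt(sum(w * w for w in weights.values()))

def cosine_similarity(q, d):
    """Cosine of the angle between the sparse vectors q and d."""
    shared_terms = q.keys() & d.keys()  # the terms t in both q and d
    dot = sum(q[t] * d[t] for t in shared_terms)
    denominator = norm(q) * norm(d)
    # Assumed convention: similarity is 0.0 if either vector is empty.
    return dot / denominator if denominator else 0.0

query = {"cat": 1.0, "mat": 1.0}
document = {"cat": 2.0, "mat": 2.0, "dog": 1.0}
similarity = cosine_similarity(query, document)
print(similarity)
```

A vector and any positive multiple of it have similarity 1, so under this measure repeating a document does not change how well it matches a query.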