Suppose we have a set of documents D = {d_1, d_2, …, d_N}; we can think of this as the corpus for IR
Suppose that the number of distinct words in the corpus is V, our vocabulary size
Now suppose a document d ∈ D contains M distinct terms: {t_{i(1)}, t_{i(2)}, …, t_{i(M)}}
Finally, assume term t_{i(m)} occurs f_{i(m)} times in d
The vector representation vec(d) of d is the V-dimensional vector:
vec(d) = d = (w_{t_1,d}, …, w_{t_i,d}, …, w_{t_V,d})^T
Where w_{t_i,d} is the weighting of the ith term t_i relative to document d
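As a minimal sketch, assuming raw term counts as the weighting w_{t,d} (the function name `term_vector` and the toy vocabulary are illustrative, not part of the definition):

```python
from collections import Counter

def term_vector(doc_tokens, vocabulary):
    """Map a tokenised document to a V-dimensional vector of raw term
    counts, one possible choice of weighting w_{t,d}."""
    counts = Counter(doc_tokens)
    # Counter returns 0 for absent terms, giving the zero components
    return [counts[t] for t in vocabulary]

vocab = ["cat", "dog", "fish"]       # V = 3
doc = ["cat", "dog", "cat"]          # M = 2 distinct terms
print(term_vector(doc, vocab))       # [2, 1, 0]
```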
Is the mapping between documents and vectors 1-to-1?
It is not: if two document vectors are equal, the documents contain the same words with the same weights, but the words need not appear in the same order
If λ is a scalar and d_1 = λ·d_2, then d_1 and d_2 are comprised of the same words, but d_1 has λ times as many occurrences of each word as d_2
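This scalar-multiple property can be checked directly with term-count vectors (the vocabulary and documents below are toy examples):

```python
from collections import Counter

def term_vector(tokens, vocab):
    """V-dimensional vector of raw term counts."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

vocab = ["cat", "dog"]
d2 = ["cat", "dog"]
d1 = ["cat", "dog"] * 3   # every word of d2 occurs λ = 3 times in d1

v1, v2 = term_vector(d1, vocab), term_vector(d2, vocab)
print(v1, v2)                       # [3, 3] [1, 1]
assert v1 == [3 * x for x in v2]    # vec(d1) = λ · vec(d2)
```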
Recall that the length (norm) of a vector x = (x_1, …, x_N) is given by:
∥x∥ = √(x_1² + x_2² + … + x_N²)
Therefore, in the case of a document vector
vec(d) = (0, …, 0, w_{i(1),d}, 0, …, 0, w_{i(2),d}, 0, …, 0, w_{i(M),d}, 0, …, 0)
∥vec(d)∥ = √(w_{i(1),d}² + w_{i(2),d}² + … + w_{i(M),d}²) = ∥d∥
Where ∥d∥ is the length of the document d
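A short sketch of the length computation (the name `doc_length` is illustrative); the zero components contribute nothing, so only the M non-zero weights matter:

```python
import math

def doc_length(weights):
    """∥d∥: square root of the sum of squared term weights."""
    return math.sqrt(sum(w * w for w in weights))

# Zero entries (terms absent from d) drop out: (0, 3, 0, 4, 0) has length 5
print(doc_length([0, 3, 0, 4, 0]))  # 5.0
```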
Suppose d is a document and q is a query
- If d contains the same words as q in the same proportions, then d and q will point in the same direction
- If d and q contain no words in common, then d and q will point in very different (in fact, orthogonal) directions
- Intuitively, the greater the angle between d and q, the less similar d and q are.
We define the Cosine Similarity between d and q by:
CSim(q,d)=cosθ
where θ is the angle between q and d
Similarly, the Cosine Similarity between documents d_1 and d_2 can be defined by:
CSim(d_1, d_2) = cos θ
Where θ is the angle between d_1 and d_2
For two-dimensional vectors u = (x_1, y_1) and v = (x_2, y_2), recall:
cos(θ) = (x_1·x_2 + y_1·y_2) / (∥u∥ ∥v∥) = (u · v) / (∥u∥ ∥v∥)
Therefore, if q is a query, d is a document, and θ is the angle between q and d, then:
CSim(q,d) = cos(θ) = (q · d) / (∥q∥ ∥d∥) = (Σ_{t ∈ q∩d} w_{t,q} · w_{t,d}) / (∥q∥ ∥d∥) = Sim(q,d)
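Putting the pieces together, a sketch of this similarity with raw term counts as the weights (tokenised inputs are assumed, and the function name is illustrative). The numerator only sums over terms in q ∩ d, since all other cross-terms are zero:

```python
import math
from collections import Counter

def cosine_similarity(q_tokens, d_tokens):
    """CSim(q, d) = (q · d) / (∥q∥ ∥d∥) with raw term counts as weights."""
    q, d = Counter(q_tokens), Counter(d_tokens)
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())  # sum over t ∈ q ∩ d
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0   # empty query or document: no meaningful angle
    return dot / (norm_q * norm_d)

# Same words in the same proportions → same direction → cosine ≈ 1
print(cosine_similarity(["cat", "dog"], ["cat", "dog", "cat", "dog"]))
# No shared words → orthogonal → cosine 0
print(cosine_similarity(["cat"], ["fish"]))  # 0.0
```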