Lecture 3: Stopping, Stemming & TF-IDF Similarity

Text Pre-Processing

Stop word removal
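
As a concrete illustration, a minimal sketch of this step in Python, using NLTK's English stop word list (the library choice and the example tokens are assumptions, not part of the lecture):

```python
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

# Filter a tokenized text against a standard stop word list.
stop_words = set(stopwords.words("english"))

tokens = ["the", "cat", "sat", "on", "the", "mat"]
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # ['cat', 'sat', 'mat']
```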

Stemming (Morphology)

Essentially, if a query and a document contain different forms of the same word, they should still be treated as related.

Remove surface markings from words to reveal their basic form

"form" is the stem of forms, forming, formed and former

Stemming replaces tokens (words) with equivalence classes of tokens

Equivalence classes are stems

However, not all words obey regular rules, e.g. $\texttt{run\underline{ning}} \rightarrow \texttt{run}$ (naively stripping the suffix -ing would leave "runn")

A common solution to this problem is to identify a sub-pattern of letters within words and devise rules for dealing with these patterns

Example rules [Belew, p45]

A stemmer is a piece of software which implements a stemming algorithm
The Porter Stemmer is a standard stemmer which implements a set of about 60 rules
The use of a stemmer usually reduces the vocabulary size by 10 to 50%
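
For illustration, a minimal sketch using the Porter stemmer as packaged in NLTK (the library choice is an assumption; any Porter implementation should behave similarly):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["forms", "forming", "formed", "running"]:
    print(word, "->", stemmer.stem(word))
# forms -> form, forming -> form, formed -> form, running -> run
```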

Simple Text Retrieval

Given two documents $d_1$ and $d_2$, assuming:

  1. all stopwords have been removed and
  2. stemming has been applied.

We want to know if $d_1$ and $d_2$ are about the same thing, i.e. we want to calculate their similarity.
The simplest approach is to calculate the TF-IDF similarity

Matching

Given a query $\mathbf{q}$ and a document $\mathbf{d}$ we want to define a number:

$$Sim(\mathbf{q},\mathbf{d})$$

Given the query we want to return documents $d_1, d_2, \ldots, d_N$ such that:

$$Sim(\mathbf{q}, d_1) \ge Sim(\mathbf{q}, d_2) \ge \ldots \ge Sim(\mathbf{q}, d_N)$$

Similarity

The similarity between $\mathbf{q}$ and $\mathbf{d}$ will depend on the number of terms which are common to $\mathbf{q}$ and $\mathbf{d}$

We also need to know how useful each common term is for discriminating between different documents

IDF Weighting

A popular measure of the significance of a term for discriminating between documents is the inverse document frequency (IDF)

For a token $t$ we define:

$$IDF(t) = \log \left( \frac{ND}{ND_t} \right)$$

where $ND$ is the total number of documents in the collection and $ND_t$ is the number of documents which contain $t$
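
A minimal sketch of this definition in Python (the function name and the convention of returning 0 for a term that occurs in no document are assumptions):

```python
import math

def idf(term, documents):
    """IDF(t) = log(ND / ND_t).  documents is a list of token lists,
    assumed to be already stopped and stemmed."""
    nd = len(documents)
    nd_t = sum(1 for doc in documents if term in doc)
    return math.log(nd / nd_t) if nd_t else 0.0
```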

TF-IDF Weight

Let $t$ be a term in a document $d$

The TF-IDF weight $w_{td}$ of term $t$ for document $d$ is:

$$w_{td} = f_{td}\cdot IDF(t)$$

where $f_{td}$ is the term frequency, i.e. the number of times $t$ occurs in $d$

So for $w_{td}$ to be large, $t$ must occur frequently in $d$ (large $f_{td}$) and must occur in few documents overall (large $IDF(t)$)
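
A minimal sketch of the weight itself (names are assumptions; the guard for a term in no documents follows the IDF sketch above):

```python
import math

def tf_idf_weight(term, doc, documents):
    """w_td = f_td * IDF(t).  doc is a token list, documents a list of token lists."""
    f_td = doc.count(term)                          # term frequency f_td
    nd_t = sum(1 for d in documents if term in d)   # documents containing the term
    return f_td * math.log(len(documents) / nd_t) if nd_t else 0.0
```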

Query Weights

Given a term $t$ and a query $q$

If $q$ is a long query, we can treat $q$ as a document:

$$w_{tq} = f_{tq} \cdot IDF(t)$$

where $f_{tq}$ is the query term frequency

If $q$ is a short query we define the TF-IDF weight as:

$$w_{tq} = IDF(t)$$
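
Both cases in one hedged sketch (the long_query flag and the names are assumptions):

```python
import math

def query_weight(term, query_tokens, documents, long_query=False):
    """w_tq = f_tq * IDF(t) for a long query, w_tq = IDF(t) for a short one."""
    nd_t = sum(1 for d in documents if term in d)
    idf = math.log(len(documents) / nd_t) if nd_t else 0.0
    f_tq = query_tokens.count(term) if long_query else 1
    return f_tq * idf
```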

TF-IDF Similarity

Define the similarity between query $q$ and document $d$ as:

$$Sim(q,d) = \frac{\sum_{t\in q \cap d} w_{td} \cdot w_{tq}}{\|d\|\cdot\|q\|}$$

where $\|d\|$ and $\|q\|$ are the lengths of the document and the query respectively, defined in the next section

Document Length

Given a document $d$, for each term $t \in d$ we can define the TF-IDF weight $w_{td}$

The length of the document $d$ is defined by:

$$Len(d) = \|d\| = \sqrt{\sum_{t\in d} w_{td}^2}$$
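
Combining the similarity and length definitions, a minimal sketch (representing each text as a precomputed dict mapping term to TF-IDF weight, and guarding against a zero denominator, are both assumptions):

```python
import math

def vector_length(weights):
    """Len(d) = sqrt(sum over t in d of w_td^2); weights maps term -> TF-IDF weight."""
    return math.sqrt(sum(w * w for w in weights.values()))

def tf_idf_similarity(q_weights, d_weights):
    """Sim(q,d): dot product over shared terms, normalised by both lengths."""
    shared = q_weights.keys() & d_weights.keys()
    dot = sum(q_weights[t] * d_weights[t] for t in shared)
    denom = vector_length(d_weights) * vector_length(q_weights)
    return dot / denom if denom else 0.0
```

Ranking is then just a sort, e.g. sorted(doc_ids, key=lambda i: tf_idf_similarity(q_weights, doc_weights[i]), reverse=True) (names hypothetical).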

Practical Considerations

For a query $q$ we must compute $Sim(q,d)$ for each document $d$ in the collection

The potential number of documents is very large $\implies$ the time to compute $Sim(q,d)$ for every document is huge

Assume $q$ contains a term $t$. If $t$ does not occur anywhere in the corpus, it is of no use: it matches no documents.

Therefore, we need to identify all documents $d$ which include $t$. However, this will take too long if the number of documents is very large, as it frequently is in real-world applications.

To speed up the computation we can build, in advance, a data structure called the Document Index (sketched below)

The Document Index
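
A document index is commonly realised as an inverted index, mapping each term to the set of documents that contain it; a minimal sketch under that assumption (names are illustrative):

```python
from collections import defaultdict

def build_index(documents):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(documents):
        for t in tokens:
            index[t].add(doc_id)
    return index

def candidates(query_tokens, index):
    """Only documents sharing at least one query term can have Sim(q,d) > 0."""
    docs = set()
    for t in query_tokens:
        docs |= index.get(t, set())
    return docs
```

With the index, $Sim(q,d)$ only needs to be computed for the candidate documents rather than for the whole collection.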

Summary of the IR Process