Simple techniques to remove noise words from texts
Remove common noise words (stopwords) which contribute no information to the IR process, such as "the"
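For illustration, here is a minimal sketch of stopword removal in Python; the stopword list is a tiny illustrative sample, not a standard list:

# Minimal stopword-removal sketch; the stopword list is illustrative
# only -- real systems use much longer standard lists.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is"}

def remove_stopwords(tokens):
    # Drop any token that appears in the stopword list (case-insensitive)
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("The cat sat in the hat".split()))
# -> ['cat', 'sat', 'hat']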
Stemming (Morphology)
Remove irrelevant differences from different versions of the same word
Identify different forms of the same word, such as "run" and "ran", by mapping them to a common stem
Essentially, if a query and a document contain different forms of the same word, then they are likely related.
Remove surface markings from words to reveal their basic form
forms→form, forming→form
formed→form, former→form
"form" is the stem of forms, forming, formed and former
Stemming replaces tokens (words) with equivalence classes of tokens
Equivalence classes are stems
This:
Reduces the number of different words in a corpus
Increases the number of instances of each token
However, not all words obey simple suffix-stripping rules, e.g. running→run (removing "ing" alone would leave "runn")
A common solution to this problem is to identify a sub-pattern of letters within words and devise rules for dealing with these patterns
Example rules [Belew, p45]
(.*)SSES → \1SS
Any string ending SSES is stemmed by replacing SSES with SS
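A minimal sketch of how such a rule can be applied with a regular expression; the rule list here holds only the single SSES example above, not a full rule set:

import re

# The one rule from the example above: a string ending in SSES is
# stemmed by replacing SSES with SS, e.g. "caresses" -> "caress".
# A full stemmer such as Porter's applies its rules in ordered steps.
RULES = [(re.compile(r"(.*)sses$"), r"\1ss")]

def stem(word):
    for pattern, replacement in RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word          # no rule matched; leave the word unchanged

print(stem("caresses"))  # -> caress
print(stem("form"))      # -> form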
A stemmer is a piece of software which implements a stemming algorithm
The Porter Stemmer is a standard stemmer which implements a set of about 60 rules
The use of a stemmer usually reduces vocabulary size by 10 to 50%
Simple Text Retrieval
Given two documents d1 and d2, assuming:
all stopwords have been removed and
stemming has been applied.
We want to know if d1 and d2 are about the same thing, i.e. we want to calculate the similarity.
The simplest approach is to calculate the TF-IDF similarity
Matching
Given a query q and a document d we want to define a number:
Sim(q,d)
Given the query we want to return documents d1, d2, …, dN such that:
d1 is the document for which Sim(q,d) is biggest
d2 is the next biggest etc.
Similarity
The similarity between q and d will depend on the number of terms which are common to q and d
We also need to know how useful each common term is for discriminating between different documents
IDF Weighting
A popular measure of the significance of a term for discriminating documents is the Inverse document frequency (IDF)
For a token t we define:
IDF(t)=log(ND/NDt)
where:
ND is the total number of documents in the corpus
NDt is the number of those documents that include t
Note: in this equation log denotes the natural logarithm (base e)
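A sketch of this computation over a toy corpus; the corpus and tokenisation are illustrative, with each document just a list of tokens:

import math

def idf(term, corpus):
    # IDF(t) = log(ND / NDt), natural log; assumes t occurs in
    # at least one document, otherwise NDt would be zero.
    nd = len(corpus)                                  # ND: total documents
    nd_t = sum(1 for doc in corpus if term in doc)    # NDt: documents containing t
    return math.log(nd / nd_t)

corpus = [["form", "fill"], ["form", "shape"], ["shape", "size"]]
print(idf("form", corpus))  # log(3/2) ~ 0.41
print(idf("size", corpus))  # log(3/1) ~ 1.10 (rarer term -> larger IDF)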
TF-IDF Weight
Let t be a term in d, a document
The TF-IDF weight wtd of term t for document d is:
wtd=ftd⋅IDF(t)
where ftd is the term frequency i.e. the number of times t occurs in d
So for wtd to be large:
ftd must be large, so t must occur often in d
IDF(t) must be large, so t must occur only in relatively few documents
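Continuing the sketch, the weight is then a single multiplication, reusing the idf helper above:

def tf_idf_weight(term, doc, corpus):
    # w_td = f_td * IDF(t), where f_td is the number of times t occurs in d
    f_td = doc.count(term)
    return f_td * idf(term, corpus)

corpus = [["form", "fill", "form"], ["form", "shape"], ["shape", "size"]]
print(tf_idf_weight("form", corpus[0], corpus))  # 2 * log(3/2) ~ 0.81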
Query Weights
Given a term t and a query q
If q is a long query, we can treat q as a document
wtq=ftq⋅IDF(t)
where ftq is the query term frequency
If q is a short query we define the TF-IDF weight as:
wtq=IDF(t)
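A sketch covering both cases; the cutoff separating short from long queries is an arbitrary illustrative choice, and the idf helper is the one sketched earlier:

def query_weight(term, query, corpus, short_cutoff=5):
    # Long query: treat q like a document, so w_tq = f_tq * IDF(t).
    # Short query: w_tq = IDF(t).
    # The cutoff of 5 terms is an arbitrary illustrative choice.
    if len(query) > short_cutoff:
        return query.count(term) * idf(term, corpus)
    return idf(term, corpus)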
TF-IDF Similarity
Define the similarity between query q and document d as:
Sim(q,d)=(∑t∈q∩d wtd⋅wtq)/(∥q∥⋅∥d∥)
where:
We sum over all terms that occur in both q and d
∥q∥ is the length of query q
∥d∥ is the length of document d
Document Length
Given a document d, for each term t∈d we can define the TF-IDF weight wtd
The length of the document d is defined by:
Len(d)=∥d∥=√(∑t∈d wtd²)
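Putting the definitions together, a sketch computing Len(d) and Sim(q,d); it reuses the idf and tf_idf_weight helpers above and treats q as a short query (w_tq = IDF(t)):

import math

def length(doc, corpus):
    # Len(d) = sqrt( sum of w_td^2 over the distinct terms t in d )
    return math.sqrt(sum(tf_idf_weight(t, doc, corpus) ** 2 for t in set(doc)))

def sim(query, doc, corpus):
    # Sum w_td * w_tq over the terms shared by q and d,
    # then normalise by the two lengths.
    shared = set(query) & set(doc)
    if not shared:
        return 0.0
    num = sum(tf_idf_weight(t, doc, corpus) * idf(t, corpus) for t in shared)
    # ||q||: query terms absent from the corpus contribute nothing
    q_terms = {t for t in set(query) if any(t in d for d in corpus)}
    q_len = math.sqrt(sum(idf(t, corpus) ** 2 for t in q_terms))
    return num / (length(doc, corpus) * q_len)

corpus = [["form", "fill", "form"], ["form", "shape"], ["shape", "size"]]
print(sim(["form", "size"], corpus[0], corpus))  # ~0.21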
Practical Considerations
For a query q we must:
Calculate ∥q∥ and wtq for each t∈q
This isn't too expensive
For each document d
∥d∥ can be computed in advance
wtd can be computed in advance for each term t∈d
The potential number of documents is very large ⟹ the time to compute Sim(q,d) for every document is huge
Assume q contains a term t. If t does not occur anywhere in the corpus, it contributes nothing to any similarity score.
Therefore, we need to identify all documents d which include t. However, scanning the whole collection to find them takes too long if the number of documents is very large, as it frequently is in real-world applications
To speed up this computation we can build, in advance, a data structure called the Document Index (also known as an inverted index)
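A minimal sketch of such an index: a mapping from each term to the set of documents that contain it, so the candidate documents for a query are found by a few lookups rather than a scan of the whole corpus:

from collections import defaultdict

def build_index(corpus):
    # Map each term to the set of ids of the documents containing it
    index = defaultdict(set)
    for doc_id, doc in enumerate(corpus):
        for term in doc:
            index[term].add(doc_id)
    return index

corpus = [["form", "fill"], ["form", "shape"], ["shape", "size"]]
index = build_index(corpus)
print(index["form"])  # {0, 1}

# Candidate documents for a query: the union of its terms' entries.
# Only these need to be scored with Sim(q,d).
query = ["form", "size"]
candidates = set().union(*(index[t] for t in query))
print(candidates)     # {0, 1, 2}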