Web Mining: Important Questions and Answers on Information Retrieval (IR)

In this post, we will discuss the following questions:

1. Why is cosine similarity preferred over Euclidean distance in comparing two documents?


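Euclidean distance measures how far apart two document vectors are, so it is driven by magnitude (for example, document length) rather than by whether the documents use the same terms in similar proportions. Cosine similarity measures the angle between the two vectors and ignores their length. The two measures also read in opposite directions: a Euclidean distance near zero means the documents are similar and a large distance means they are dissimilar, whereas a cosine similarity near 1 means the documents are similar and a value near 0 means they have little in common.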

When two documents are similar in content but differ in length, cosine similarity usually gives the better result.

Let's have an example:

We have three documents related to sports: Document A, Document B, and Document C. A is about Sachin, B is about Virat, and C is an excerpt from Document A. So C should be more similar to A than B is. But A and B are both long documents with hundreds of terms in common, while C is short. Because of this difference in length, the Euclidean distance between A and C can be greater than the distance between A and B, wrongly suggesting that A and B are the more similar pair. The cosine similarity between A and C, on the other hand, is higher than that between A and B, indicating that A and C are more similar, which matches reality.

Because Euclidean distance focuses on magnitude, it may give a misleading result when comparing two documents of different lengths, while cosine similarity gives a better result.
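The example can be reproduced with a small sketch. The vocabulary and the term counts below are made up for illustration (they are not taken from real documents); the only point is that C is a scaled-down version of A.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length term-count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical counts over the vocabulary [cricket, sachin, virat, century]
doc_a = [100, 80, 5, 40]   # long article about Sachin
doc_b = [100, 5, 80, 40]   # long article about Virat
doc_c = [10, 8, 0, 4]      # short excerpt taken from Document A

# Euclidean distance: A is "closer" to B than to its own excerpt C.
print(euclidean_distance(doc_a, doc_b), euclidean_distance(doc_a, doc_c))

# Cosine similarity: A and C point in almost the same direction.
print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
```

With these counts the distance from A to C is larger than the distance from A to B, while the cosine similarity of A and C is close to 1 and clearly higher than that of A and B.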

2. What is an inverted index? Discuss at least three applications of the inverted index?


An inverted index is an index data structure that allows documents to be searched quickly. When a user searches for some information, one way to find it is to scan all the documents sequentially. This is very inefficient for a large document collection, and for web search it is far too costly in terms of processing. In the inverted indexing technique, every document is given a document id. In the simplest form, each term is mapped to the list of ids of the documents that contain it, so the documents for any term can be looked up directly from the term. A more complex form stores, for each term, the document id together with the frequency of the term in that document and the offsets or positions at which it occurs. This form is more useful for searching because the frequency supports ranking and the positions support phrase and proximity queries.
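The positional form described above can be sketched in a few lines. This is a minimal illustration, not how a production search engine stores its index: the function names, the whitespace tokenizer, and the three sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}; docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical document collection
docs = {
    1: "web mining uses information retrieval",
    2: "information retrieval ranks web documents",
    3: "data mining finds patterns",
}
index = build_inverted_index(docs)

print(index["mining"])                        # {1: [1], 3: [1]}
print(search(index, "information retrieval"))  # [1, 2]
print(search(index, "web mining"))            # [1]
```

The term frequency in a document is simply the length of its position list, so the same structure covers both the simple and the more complex form.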

Applications of inverted index: 

1. The most important application of the inverted index is in search engine indexing. Search engines rely on inverted indexes because they make searching very fast, which makes it practical to search large collections of documents, images, and other media.

2. An inverted index acts as a map to content. For each term it records where the matching content is stored, so the content can be located without scanning the whole collection.

3. Another important application of the inverted index is in bioinformatics. Here it helps in sequence assembly, that is, in aligning and merging fragments of a longer DNA sequence to reconstruct the original sequence.

3. What do you understand by Relevance feedback? Explain the use of Rocchio’s method for query expansion and document classification with proper example.


An information retrieval system can have a query operation module that accepts user feedback and uses it to improve the original query. In relevance feedback, the user marks which results of a query are relevant and which are not, and this feedback is used to make the query more effective. In other words, the system uses the feedback to separate the retrieved documents into relevant and irrelevant ones.

Rocchio’s method is a very effective relevance feedback method. It expands the original query using the documents that the user has marked as relevant and irrelevant. The expanded query is computed from the original query, the relevant documents, and the irrelevant documents identified by the user.

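The formula for the expanded query:

qe = α·q + (β / |Dr|) · Σ dr − (γ / |Dir|) · Σ dir

Where q is the original query vector, the first sum runs over the relevant documents dr in Dr, the second sum runs over the irrelevant documents dir in Dir, and α, β, and γ are parameters that weight the three parts.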


Vocabulary: {run, lion, cat, dog, program}

original query: q=[1,0,1,0,0]

relevant document: Dr=[2,2,1,0,0]

irrelevant document: Dir=[2,0,1,0,3]

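α=1.0, β=1.0 and γ=0.5

qe = 1.0[1,0,1,0,0] + 1.0[2,2,1,0,0] - 0.5[2,0,1,0,3] = [2, 2, 1.5, 0, -1.5]

Negative weights are usually set to 0, so the expanded query becomes [2, 2, 1.5, 0, 0]. The term "lion", which was not in the original query, has been added, and the term "program", which appears only in the irrelevant document, is suppressed.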
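The same calculation can be checked with a short sketch. The function name `rocchio` and its signature are chosen here for illustration; the vectors and parameters are the ones from the example, and each document set is averaged so the code also works with more than one relevant or irrelevant document.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=1.0, gamma=0.5):
    """Expand a query vector using the centroids of relevant and irrelevant documents."""
    def centroid(vectors):
        # An empty set contributes nothing to the expanded query.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel = centroid(relevant)
    irr = centroid(irrelevant)
    return [alpha * q + beta * r - gamma * i for q, r, i in zip(query, rel, irr)]

# Vocabulary: {run, lion, cat, dog, program}
q = [1, 0, 1, 0, 0]
qe = rocchio(q, relevant=[[2, 2, 1, 0, 0]], irrelevant=[[2, 0, 1, 0, 3]])
print(qe)  # [2.0, 2.0, 1.5, 0.0, -1.5]

# Common practice: clip negative weights to zero.
qe_clipped = [max(0.0, w) for w in qe]
print(qe_clipped)  # [2.0, 2.0, 1.5, 0.0, 0.0]
```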


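Document classification with Rocchio's method also uses the vector space model, in which documents are represented as vectors. A prototype vector is built for each class from its training documents, and a new document is assigned to the class whose prototype vector it is most similar to, typically measured by cosine similarity.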
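A simplified sketch of this classification step follows. It is an assumption-laden illustration: each prototype is just the centroid of its class (the full Rocchio prototype also subtracts a weighted centroid of the other classes), and the class labels and training vectors are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def train_prototypes(labeled_docs):
    """Average the vectors of each class into one prototype vector per class."""
    grouped = {}
    for label, vector in labeled_docs:
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def classify(prototypes, vector):
    """Return the label of the prototype most similar to the given vector."""
    return max(prototypes, key=lambda label: cosine(prototypes[label], vector))

# Vocabulary: {run, lion, cat, dog, program}; hypothetical training data
training = [
    ("animals", [1, 2, 1, 0, 0]),
    ("animals", [0, 1, 2, 1, 0]),
    ("computing", [2, 0, 0, 0, 3]),
    ("computing", [1, 0, 0, 0, 2]),
]
prototypes = train_prototypes(training)

print(classify(prototypes, [1, 1, 0, 1, 0]))  # animals
print(classify(prototypes, [3, 0, 0, 0, 1]))  # computing
```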

4. What is the significance of idf, and why does tf-idf give good results for retrieving relevant documents?


The inverse document frequency (idf) is significant when retrieving relevant documents because it measures how important a term is across the collection. When computing tf alone, all terms are treated as equally important. But a term may appear in a large number of documents and still carry little meaning, because it does not help to tell the documents apart. So we need to weigh down terms that occur in many documents and scale up the rare ones, by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t).

Typically, the tf-idf weight is composed of two parts. The first is the normalized term frequency (TF): the number of times a term appears in a document, divided by the count of the most frequent term in that document. The second is the inverse document frequency (IDF), computed as the logarithm of the total number of documents divided by the number of documents in which the term appears. The final tf-idf weight is the product of the normalized term frequency and the inverse document frequency. Because tf is normalized, it does not simply reward longer documents, and because idf discounts terms that occur almost everywhere, the tf-idf weight is highest for terms that are frequent in a document but rare in the collection. This is why it gives good results when retrieving relevant documents.
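The weighting just described can be sketched as follows. The natural logarithm, the whitespace tokenizer, and the three sample documents are assumptions for this example; the post does not fix a log base.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (max-normalized tf times log idf)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        max_count = max(c.values(), default=1)  # count of the most frequent term
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in c.items()
        })
    return weights

# Hypothetical documents
docs = [
    "the lion chased the dog",
    "the cat chased the dog",
    "the program runs",
]
weights = tf_idf(docs)

for term in ("the", "dog", "lion"):
    print(term, weights[0][term])
```

In the first document, "the" appears in every document and gets weight 0, "dog" appears in two of the three documents and gets a small weight, and "lion" appears in only one document and gets the largest weight of the three.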

5. What are the similarities and differences between tf-idf and the Okapi measure?


Similarities- Both tf-idf and the Okapi measure indicate how relevant a document is to a query. In tf-idf, a term that occurs in many documents contributes little to relevance, while a term that occurs in fewer documents can contribute more. The Okapi measure likewise indicates how relevant a document is with respect to the query, and it is also built from term frequency and inverse document frequency.

Differences- tf-idf is a weight assigned to a term in a document. To rank documents with it, the simplest approach is to sum the tf-idf weights of each query term. This ranking is biased towards long documents, in which more of the terms appear. The Okapi measure is a ranking function used to find the degree of relevance: it computes a relevance score for each document with respect to the query, and it takes document length into account so that long documents are not favored simply for being long.
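The length handling can be seen in a sketch of Okapi BM25, one common form of the Okapi measure. The exact idf variant and the parameter values k1=1.5 and b=0.75 are conventional choices assumed here, not something stated in this post, and the sample documents are invented. The sketch assumes a non-empty collection.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every document against the query with one common BM25 formulation."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Documents longer than average are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            f = counts[term]
            # Term frequency saturates: repeated occurrences add less and less.
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical documents: the first two mention "lion" once each
docs = [
    "lion runs fast",
    "lion runs fast past the cat and the dog and the program",
    "cat sleeps",
]
scores = bm25_scores(docs, "lion")
print(scores)
```

Both of the first two documents contain "lion" exactly once, but the short document scores higher than the long one, and the document without the term scores 0.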