
Material Type: Exam; Class: Information Retrieval; Subject: Computer Science; University: University of Massachusetts - Amherst; Term: Fall 2004;
CMPSCI 646, Information Retrieval, Fall 2004, Mid-term exam
This is an in-class midterm exam. You have 75 minutes to complete it. The exam is open-book. You may use class notes, textbooks, papers, or other written resources. You may not use any device that is capable of being connected to the internet. If you have questions, please ask the TA or instructor. Corrections and clarifications (if any) will be written on the whiteboard of the room.
This exam includes four (4) questions on one (1) page. Good luck.
Assume that you have an evaluation program that calculates average precision by looking only at the top 1,000 documents in the ranked list. Any relevant documents not found in the top 1,000 are assumed to occur at rank infinity—i.e., precision for that relevant document will be zero. (This is the way that average precision is calculated at TREC.)
Assume that a query is run against a collection of one million (1,000,000) documents and that, of the 10 relevant documents, only 6 were found in the top 1,000. (1) For that query, what are the largest and smallest possible errors in average precision caused by that assumption? (2) Similarly, how will precision at 10% recall be affected?
In both cases, you do not have to compute the actual numbers, provided you show equations that will result in the correct numbers if calculated. (For example, if the answer were found by adding 4 and 5 [it isn’t], you could either answer “4+5” or “9” and either would be counted as correct.)
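For reference, the cutoff convention described above can be sketched in a few lines of Python. This is only an illustration of the convention, not the actual evaluation program: the function name `average_precision`, its parameters, and the six-document toy ranking are all assumptions made for the example, and the numbers are deliberately unrelated to the query in the question.

```python
def average_precision(ranked_relevance, num_relevant, cutoff=1000):
    """Average precision computed over the top `cutoff` ranks only.

    ranked_relevance: list of booleans, True where the document at that
                      rank is relevant (hypothetical input format).
    num_relevant:     total number of relevant documents for the query,
                      including any that fall below the cutoff.
    Relevant documents below the cutoff contribute a precision of zero.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance[:cutoff], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant doc
    return precision_sum / num_relevant if num_relevant else 0.0


# Toy ranking (assumed data): 3 relevant documents at ranks 1, 3, and 6.
ranking = [True, False, True, False, False, True]

full = average_precision(ranking, num_relevant=3, cutoff=6)
truncated = average_precision(ranking, num_relevant=3, cutoff=4)

print("AP using the full list:", full)
print("AP with cutoff at rank 4:", truncated)
print("Error introduced by the cutoff:", full - truncated)
```

In the toy case the cutoff hides the relevant document at rank 6, so its precision term is replaced by zero while the divisor stays at the total number of relevant documents.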
Borrowing ideas from query expansion (e.g., LCA), statistical stemming, and LSI, explain how you might use clustering of terms (words) to address the term independence problem. (You may certainly borrow ideas from other areas, too.) How likely do you think it is that such a system will improve effectiveness? Why do you believe that?
How do (1) stopping and (2) stemming reduce the size of inverted lists as stored on disk? Be as specific as you can.