Midterm Exam for Information Retrieval | CMPSCI 646
University of Massachusetts - Amherst, Fall 2004

CMPSCI 646, Information Retrieval, Fall 2004
Mid-term exam

This is an in-class midterm exam. You have 75 minutes to complete it. The exam is open-book: you may use class notes, textbooks, papers, or other written resources. You may not use any device capable of connecting to the internet. If you have questions, please ask the TA or instructor. Corrections and clarifications (if any) will be written on the room's whiteboard.

This exam includes four (4) questions on one (1) page. Good luck.

Problem A. (10 points, has parts 1 and 2)

Assume that you have an evaluation program that calculates average precision by looking only at the top 1,000 documents in the ranked list. Any relevant documents not found in the top 1,000 are assumed to occur at rank infinity—i.e., precision for that relevant document will be zero. (This is the way that average precision is calculated at TREC.)

Assume that a query is run against a collection of one million (1,000,000) documents and that, of the 10 relevant documents, only 6 were found in the top 1,000. (1) For that query, what are the largest and smallest possible errors in average precision caused by that assumption? (2) Similarly, how will precision at 10% recall be affected?

In both cases, you do not have to compute the actual numbers, provided you show equations that will result in the correct numbers if calculated. (For example, if the answer were found by adding 4 and 5 [it isn’t], you could answer either “4+5” or “9”, and either would be counted as correct.)
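
For reference, a minimal sketch of the truncated computation the problem describes, in Python (the function name and structure are ours, not part of the exam):

    # Average precision with a rank cutoff: relevant documents missing from
    # the top `cutoff` contribute precision 0 but still count in the
    # denominator, exactly the assumption the problem describes.
    def average_precision(ranking, relevant, cutoff=1000):
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranking[:cutoff], start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this relevant doc
        return precision_sum / len(relevant)

Note that each relevant document missing from the top 1,000 could truly occur anywhere from rank 1,001 to rank 1,000,000, which is what bounds the error the question asks about.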

Problem B. (20 points)

Borrowing ideas from query expansion (e.g., LCA), statistical stemming, and LSI, explain how you might use clustering of terms (words) to address the term independence problem. (You may certainly borrow ideas from other areas, too.) How likely do you think it is that such a system will improve effectiveness? Why do you believe that?
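
One illustrative direction only (a sketch under our own assumptions, not a model answer): represent each term by the documents it occurs in, then greedily group terms with similar occurrence vectors, so retrieval can match at the cluster level instead of treating terms as independent.

    from collections import defaultdict
    from math import sqrt

    def term_vectors(docs):
        """docs: list of token lists -> term -> {doc_id: count}."""
        vecs = defaultdict(lambda: defaultdict(int))
        for doc_id, tokens in enumerate(docs):
            for term in tokens:
                vecs[term][doc_id] += 1
        return vecs

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = lambda w: sqrt(sum(x * x for x in w.values()))
        return dot / (norm(u) * norm(v)) if u and v else 0.0

    def cluster_terms(vecs, threshold=0.7):
        clusters = []  # greedy single-pass grouping by cosine similarity
        for term, vec in vecs.items():
            for cluster in clusters:
                if cosine(vec, vecs[cluster[0]]) >= threshold:
                    cluster.append(term)
                    break
            else:
                clusters.append([term])
        return clusters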

Problem C. (15 points, has parts 1 and 2)

How does (1) stopping and (2) stemming reduce the size of inverted lists as stored on disk? Be as specific as you can.
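
As a toy illustration of both effects (the stopword list and the crude suffix stripper here are stand-ins, not a real stoplist or a real stemmer such as Porter's):

    STOPWORDS = {"the", "a", "of", "and", "is"}

    def crude_stem(term):
        # Illustrative suffix stripping only; merges a few inflected forms.
        for suffix in ("ing", "ed", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    def build_index(docs, stop=False, stem=False):
        index = {}  # term -> set of doc ids, i.e., one posting per pair
        for doc_id, tokens in enumerate(docs):
            for term in tokens:
                if stop and term in STOPWORDS:
                    continue  # stopping deletes the longest inverted lists
                if stem:
                    term = crude_stem(term)  # stemming merges related lists
                index.setdefault(term, set()).add(doc_id)
        return index

    docs = [["the", "dogs", "ran"], ["a", "dog", "is", "running"]]
    for stop, stem in ((False, False), (True, False), (True, True)):
        index = build_index(docs, stop=stop, stem=stem)
        print(stop, stem, len(index), sum(len(p) for p in index.values()))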

Problem D. (15 points, has 3 parts)

  1. What is a probable problem with query expansion (i.e., assumed-relevance feedback) in a signature-file system? (A signature-file sketch follows this list.)
  2. Describe a situation in which the zero-frequency problem would arise (the estimation slides in the language modeling lecture). Give a specific example of how it might occur. (A smoothing sketch follows this list.)
  3. Is it necessary for a distributed system to have the same stemming algorithm at every collection’s site? Why or why not?
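
For part 1, a hypothetical sketch of a superimposed-coding signature file (parameters and names are ours): each term sets a few bits, a document's signature is the OR of its terms' masks, and a document is a candidate only when every query bit is set, so each added expansion term sets more bits in the query signature.

    def make_signature(terms, width=64, bits_per_term=3):
        sig = 0
        for term in terms:
            for i in range(bits_per_term):
                sig |= 1 << (hash((term, i)) % width)  # superimposed coding
        return sig

    def is_candidate(query_terms, doc_sig, width=64):
        q = make_signature(query_terms, width)
        return q & doc_sig == q  # true matches plus possible false drops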
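
For part 2, a sketch of where zero frequency bites in query-likelihood scoring, with Jelinek-Mercer smoothing as one standard remedy (the lecture's estimation slides may use a different estimator; the mixture weight here is arbitrary):

    from collections import Counter

    def query_likelihood(query, doc, collection, lam=0.5):
        """P(query | doc). With lam = 1 (no smoothing), the whole score
        collapses to 0 as soon as one query term is absent from the
        document: the zero-frequency problem."""
        d, c = Counter(doc), Counter(collection)
        score = 1.0
        for term in query:
            p_doc = d[term] / len(doc)          # maximum-likelihood estimate
            p_coll = c[term] / len(collection)  # background collection model
            score *= lam * p_doc + (1 - lam) * p_coll
        return score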