Midterm Exam for Information Retrieval | CMPSCI 646
University of Massachusetts - Amherst, Fall 2004

CMPSCI 646, Information Retrieval, Fall 2004
Mid-term exam

This is an in-class midterm exam. You have 75 minutes to complete it. The exam is open-book: you may use class notes, textbooks, papers, or other written resources. You may not use any device capable of connecting to the internet. If you have questions, please ask the TA or instructor. Corrections and clarifications (if any) will be written on the room's whiteboard.

This exam includes four (4) questions on one (1) page. Good luck.

Problem A. (10 points, has parts 1 and 2)

Assume that you have an evaluation program that calculates average precision by looking only at the top 1,000 documents in the ranked list. Any relevant documents not found in the top 1,000 are assumed to occur at rank infinity—i.e., precision for that relevant document will be zero. (This is the way that average precision is calculated at TREC.)

Assume that a query is run against a collection of one million (1,000,000) documents and that, of the 10 relevant documents, only 6 were found in the top 1,000. (1) For that query, what are the largest and smallest possible errors in average precision caused by that assumption? (2) Similarly, how will precision at 10% recall be affected?

In both cases, you do not have to compute the actual numbers, provided you show equations that will result in the correct numbers if calculated. (For example, if the answer were found by adding 4 and 5 [it isn’t], you could answer either “4+5” or “9”, and either would be counted as correct.)
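
For reference, a minimal sketch of the truncated computation the problem describes, in Python (the function name and structure are ours, not part of the exam):

    # Average precision with a rank cutoff: relevant documents missing from
    # the top `cutoff` contribute precision 0 but still count in the
    # denominator, exactly the assumption the problem describes.
    def average_precision(ranking, relevant, cutoff=1000):
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranking[:cutoff], start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this relevant doc
        return precision_sum / len(relevant)

Note that each relevant document missing from the top 1,000 could truly occur anywhere from rank 1,001 to rank 1,000,000, which is what bounds the error the question asks about.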

Problem B. (20 points)

Borrowing ideas from query expansion (e.g., LCA), statistical stemming, and LSI, explain how you might use clustering of terms (words) to address the term independence problem. (You may certainly borrow ideas from other areas, too.) How likely do you think it is that such a system will improve effectiveness? Why do you believe that?
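
One illustrative direction only (a sketch under our own assumptions, not a model answer): represent each term by the documents it occurs in, then greedily group terms with similar occurrence vectors, so retrieval can match at the cluster level instead of treating terms as independent.

    from collections import defaultdict
    from math import sqrt

    def term_vectors(docs):
        """docs: list of token lists -> term -> {doc_id: count}."""
        vecs = defaultdict(lambda: defaultdict(int))
        for doc_id, tokens in enumerate(docs):
            for term in tokens:
                vecs[term][doc_id] += 1
        return vecs

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = lambda w: sqrt(sum(x * x for x in w.values()))
        return dot / (norm(u) * norm(v)) if u and v else 0.0

    def cluster_terms(vecs, threshold=0.7):
        clusters = []  # greedy single-pass grouping by cosine similarity
        for term, vec in vecs.items():
            for cluster in clusters:
                if cosine(vec, vecs[cluster[0]]) >= threshold:
                    cluster.append(term)
                    break
            else:
                clusters.append([term])
        return clusters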

Problem C. (15 points, has parts 1 and 2)

How does (1) stopping and (2) stemming reduce the size of inverted lists as stored on disk? Be as specific as you can.
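
As a toy illustration of both effects (the stopword list and the crude suffix stripper here are stand-ins, not a real stoplist or a real stemmer such as Porter's):

    STOPWORDS = {"the", "a", "of", "and", "is"}

    def crude_stem(term):
        # Illustrative suffix stripping only; merges a few inflected forms.
        for suffix in ("ing", "ed", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    def build_index(docs, stop=False, stem=False):
        index = {}  # term -> set of doc ids, i.e., one posting per pair
        for doc_id, tokens in enumerate(docs):
            for term in tokens:
                if stop and term in STOPWORDS:
                    continue  # stopping deletes the longest inverted lists
                if stem:
                    term = crude_stem(term)  # stemming merges related lists
                index.setdefault(term, set()).add(doc_id)
        return index

    docs = [["the", "dogs", "ran"], ["a", "dog", "is", "running"]]
    for stop, stem in ((False, False), (True, False), (True, True)):
        index = build_index(docs, stop=stop, stem=stem)
        print(stop, stem, len(index), sum(len(p) for p in index.values()))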

Problem D. (15 points, has 3 parts)

  1. What is a probable problem with query expansion (i.e., assumed-relevance feedback) in a signature-file system? (A signature-file sketch follows this list.)
  2. Describe a situation in which the zero-frequency problem would arise (the estimation slides in the language modeling lecture). Give a specific example of how it might occur. (A smoothing sketch follows this list.)
  3. Is it necessary for a distributed system to have the same stemming algorithm at every collection’s site? Why or why not?
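
For part 1, a hypothetical sketch of a superimposed-coding signature file (parameters and names are ours): each term sets a few bits, a document's signature is the OR of its terms' masks, and a document is a candidate only when every query bit is set, so each added expansion term sets more bits in the query signature.

    def make_signature(terms, width=64, bits_per_term=3):
        sig = 0
        for term in terms:
            for i in range(bits_per_term):
                sig |= 1 << (hash((term, i)) % width)  # superimposed coding
        return sig

    def is_candidate(query_terms, doc_sig, width=64):
        q = make_signature(query_terms, width)
        return q & doc_sig == q  # true matches plus possible false drops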
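
For part 2, a sketch of where zero frequency bites in query-likelihood scoring, with Jelinek-Mercer smoothing as one standard remedy (the lecture's estimation slides may use a different estimator; the mixture weight here is arbitrary):

    from collections import Counter

    def query_likelihood(query, doc, collection, lam=0.5):
        """P(query | doc). With lam = 1 (no smoothing), the whole score
        collapses to 0 as soon as one query term is absent from the
        document: the zero-frequency problem."""
        d, c = Counter(doc), Counter(collection)
        score = 1.0
        for term in query:
            p_doc = d[term] / len(doc)          # maximum-likelihood estimate
            p_coll = c[term] / len(collection)  # background collection model
            score *= lam * p_doc + (1 - lam) * p_coll
        return score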