



An introduction to the challenges of working with natural language in the context of information retrieval. It covers difficulties such as synonyms, anaphora, and slang, as well as the importance of query languages and user tasks. The document also discusses data cleaning and the 'bag of words' model, and introduces the concepts of precision, recall, and probabilistic queries.
Typology: Study notes
Chris Brooks
Department of Computer Science — University of San Francisco
Users begin with an information need. The job of an IR system is to translate that need into a query language and then find documents that satisfy that need. What are some sorts of query languages?
Two modes of information access: searching and browsing.
Searching: the user has a specific information need, and wants a document that meets that need. "Find me an explanation of the re module in Python."
Browsing: the user has a broadly defined set of interests, and wants information that satisfies his/her interests. "Find me interesting pages about Python."
These different modes have different models of success.
Searching and browsing are both pull tasks: the user is actively fetching information from a repository.
We can also think about push tasks, where selected data is delivered to a client as it is made available. This is called filtering.
RSS readers are an example of this, as is Google News.
An IR system works with a model of the document. This might be:
- A category or description (as in a library)
- A set of extracted phrases or keywords
- The full text of the document
- Full text with filtering
The "bag of words" model: order is discarded; we just count how often each word appears. No semantics are involved.
Intuition: frequently-appearing words give an indication of subject matter.
Advantage: no need to parse; computationally tractable for large collections.
Disadvantage: contextual information and meaning are lost.
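As a sketch, the bag-of-words count is a few lines of Python (the function name `bag_of_words` is ours, not from the slides):

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each word appears, discarding order entirely."""
    return Counter(text.lower().split())

counts = bag_of_words("the dog chased the other dog")
# counts["dog"] == 2, counts["the"] == 2, counts["chased"] == 1
```

Note that the counts carry no structure: "dog bites man" and "man bites dog" produce identical bags, which is exactly the contextual loss described above.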
Documents must be cleaned first:
- HTML and Javascript removed (links and structural information might be kept separately)
- Non-words removed
- Text converted to lower case
- Stopwords removed. These are words that have little or no semantic content (a, an, the, he, she, their, among, etc.)
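A minimal cleaning pipeline along these lines might look like the sketch below. The stopword list here is just the slide's handful of examples; real systems use much longer lists.

```python
import re

# Illustrative subset only -- real stopword lists have hundreds of entries.
STOPWORDS = {"a", "an", "the", "he", "she", "their", "among"}

def clean(html):
    """Strip markup, keep only words, lowercase, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", html)      # remove HTML tags
    tokens = re.findall(r"[a-zA-Z]+", text)   # keep only alphabetic words
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

clean("<p>The Dog ran among the trees</p>")   # -> ["dog", "ran", "trees"]
```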
A standard weighting combines term frequency and inverse document frequency:

TFIDF(word) = TF(word) * log(|corpus| / DF(word))

TF(word) is how frequently the word occurs in the search query (or a specific document).
DF(word) is the number of pages in the corpus that contain the word.
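A direct, unsmoothed sketch of this TF-IDF formula in Python, using a toy three-document corpus of our own (the slides give only the formula, not code):

```python
import math

def tfidf(word, doc, corpus):
    """TFIDF(word) = TF(word) * log(|corpus| / DF(word)).

    doc is a list of tokens; corpus is a list of such token lists.
    """
    tf = doc.count(word)                          # term frequency in this doc
    df = sum(1 for d in corpus if word in d)      # document frequency
    if df == 0:
        return 0.0                                # word never seen in corpus
    return tf * math.log(len(corpus) / df)

corpus = [["dog", "bites", "man"], ["man", "bites", "dog"], ["cat", "sleeps"]]
tfidf("dog", corpus[0], corpus)   # 1 * log(3/2): common words score lower
```

A word appearing in every document gets log(1) = 0, which is the sense in which frequent-everywhere words carry no discriminating weight.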
Example document: dog: 2.1; bunny: 8.2; fish: …
Conceptually, these documents can be thought of as an n-dimensional vector, where n is the number of words in the lexicon (all words in all documents) and the value of each element v[i] is the TFIDF score for that word.
Many elements of the vector are zero, since those words don't appear in that specific document.
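The sparse representation can be sketched as a Python dict mapping word to TFIDF score, using the dog and bunny scores from the example above (the fish score did not survive extraction, so this sketch stores only two entries; the lexicon is ours, for illustration):

```python
# Sparse vector: only nonzero entries are stored.
doc = {"dog": 2.1, "bunny": 8.2}

# A tiny hypothetical lexicon; real lexicons span all words in all documents.
lexicon = ["dog", "bunny", "fish", "cat"]

# Expanding to the dense n-dimensional form fills the gaps with zeros.
dense = [doc.get(word, 0.0) for word in lexicon]   # -> [2.1, 8.2, 0.0, 0.0]
```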
sim(a, b) = (a · b) / (||a|| ||b||)

This is the dot product of the two vectors, divided by the product of their magnitudes.
The dot product is computed by summing the product of the respective elements of each vector: Σ_i v1[i] * v2[i]
The magnitudes are computed by calculating the square root of the sum of the squares of each component (this is Pythagoras' rule): √(Σ_i v[i]²)
Written in terms of word weights, with d1 and d2 as word → TFIDF mappings:

sim(d1, d2) = ( Σ_{word ∈ d1 ∩ d2} d1[word] * d2[word] ) / ( √(Σ_{word ∈ d1} d1[word]²) * √(Σ_{word ∈ d2} d2[word]²) )
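Assuming the word-keyed sparse vectors described above, cosine similarity is a short function. This is our sketch, not the slides' code:

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse word -> score vectors."""
    # Dot product: only words present in both vectors contribute.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    # Magnitudes via Pythagoras' rule.
    mag1 = math.sqrt(sum(x * x for x in v1.values()))
    mag2 = math.sqrt(sum(x * x for x in v2.values()))
    if mag1 == 0 or mag2 == 0:
        return 0.0                    # guard against empty documents
    return dot / (mag1 * mag2)

cosine({"dog": 1.0}, {"dog": 1.0})    # identical vectors  -> 1.0
cosine({"dog": 1.0}, {"bunny": 1.0})  # no shared words    -> 0.0
```

Restricting the sum to words in both documents is what makes this efficient on sparse vectors: all other terms of the dot product are zero anyway.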
This is a very powerful and useful technique for comparing documents. It can also be used to compare a query to a document. We'll return to it when we study clustering.