
Understanding Information Retrieval and Text Processing in Artificial Intelligence, Study notes of Computer Science

An introduction to the challenges of working with natural language in the context of information retrieval. It covers difficulties such as synonyms, anaphora, and slang, as well as the importance of query languages and user tasks. The document also discusses data cleaning and the 'bag of words' model, and introduces the concepts of precision, recall, and probabilistic queries.

Typology: Study notes
Uploaded on 07/30/2009 by koofers-user-709

Artificial Intelligence
Programming
Information Retrieval
Chris Brooks
Department of Computer Science
University of San Francisco
Processing text
Now that we know a little bit about how to consider and
compare different states, we’ll think about a problem
with a harder representation.
English-language text
Specifically, webpages
We’ll look at several different approaches to
automatically summarizing and understanding
documents.
Department of Computer Science — University of San Francisco
Difficulties
What makes working with natural language hard?
Nonunique parses
Synonyms and multiple meanings
Anaphora
Slang and technical terms
Analogy and metaphor
Misspelling and incorrect grammar
Information Retrieval
Information retrieval deals with the storage, retrieval,
organization of, and access to information items
Overlaps with:
Databases (more of a focus on content)
AI
Search engines
Needs and queries
A user typically has an information need.
The job of an IR system is to translate that need into a
query language and then find documents that satisfy
that need.
What are some sorts of query languages?


Query Languages
What are some sorts of query languages?
Keyword - Google, Yahoo!, etc.
Natural language - Ask.com
SQL-style
Similar item - Netflix, Amazon
Multimedia - Flickr

User tasks
We’ll also distinguish between different types of user tasks. The most common are searching and browsing.
Searching - the user has a specific information need, and wants a document that meets that need.
“Find me an explanation of the re module in Python”
Browsing - the user has a broadly defined set of interests, and wants information that satisfies his/her interests.
“Find me interesting pages about Python”
These different modes have different models of success.

User tasks
Searching and browsing are both pull tasks.
The user is actively fetching information from a repository.
We can also think about push tasks, where selected data is delivered to a client as it is made available. This is called filtering.
RSS readers are an example of this, as is Google News.

Modeling a Document
In order to match a query to a document, an IR system must have a model of the document.
This might be:
A category or description (as in a library)
A set of extracted phrases or keywords
The full text of the document
Full text with filtering

“Bag of words” model
The techniques we’ll look at today treat a document as a bag of words.
Order is discarded; we just count how often each word appears. No semantics involved.
Intuition: Frequently-appearing words give an indication of subject matter.
Advantage: No need to parse; computationally tractable for large collections.
Disadvantage: Contextual information and meaning are lost.
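The counting step behind the bag-of-words model can be sketched in a few lines of Python; the document string below is just an illustrative example:

```python
from collections import Counter

def bag_of_words(text):
    """Discard word order; count how often each word appears."""
    return Counter(text.lower().split())

counts = bag_of_words("the cat sat on the mat")
print(counts["the"], counts["cat"])  # 2 1
```

Because a `Counter` is just a dictionary of word frequencies, two documents can later be compared purely by their counts, which is exactly the simplification the slide describes.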

Data cleaning
When preparing a document such as a webpage for an IR system, the data must typically be cleaned first.
HTML, Javascript removed. (Links and structural information might be kept separately)
Non-words removed.
Converted to lower case.
Stopwords removed. These are words that have little or no semantic content (a, an, the, he, she, their, among, etc.)
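A minimal cleaning pass along these lines might look as follows; the tag-stripping regex and the tiny stopword list are deliberate simplifications (a real system would use an HTML parser and a much larger stopword list):

```python
import re

# Toy stopword list for illustration only
STOPWORDS = {"a", "an", "the", "he", "she", "their", "among", "and", "of", "is"}

def clean(html):
    text = re.sub(r"<[^>]+>", " ", html)         # crudely strip HTML tags
    words = re.findall(r"[a-z]+", text.lower())  # lowercase; keep words only
    return [w for w in words if w not in STOPWORDS]

print(clean("<p>The Cat and the Dog</p>"))  # ['cat', 'dog']
```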

Probabilistic Queries
A simple extension is to allow partial matches on queries.
Score documents according to the fraction of query terms matched.
Return documents according to score.
Example: Document contains “cat cat dog bunny fish”
Query is “cat dog (bunny OR snake) bird”
Score is 3/4.
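The example above can be reproduced with a small scorer. Here the query is represented as a list of term groups, where a group matches if any of its alternatives appears in the document; this representation of OR-terms is an assumption made for illustration:

```python
def score(doc_words, query_groups):
    """Fraction of query term groups with at least one match in the document."""
    doc = set(doc_words)
    matched = sum(1 for group in query_groups if doc & set(group))
    return matched / len(query_groups)

doc = "cat cat dog bunny fish".split()
query = [["cat"], ["dog"], ["bunny", "snake"], ["bird"]]
print(score(doc, query))  # 0.75, i.e. the 3/4 from the slide
```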

Probabilistic Queries
Weaknesses:
Still requires logical queries
Doesn’t deal with word frequency
Dependent on query length - short queries will have a hard time getting differentiated scores.
The average Google query is only three words long!

Dealing with Word Frequency
Intuitively, some words in a document should matter more than others.
The word “aardvark” occurring 10 times in a document is probably more meaningful than the word “date” occurring 10 times.
We want to weight words such that words which are rare in general, but common in a document, are more highly considered.

Building a corpus
To measure how frequently words occur in general, we must construct a corpus.
This is a large collection of documents.
Must be careful to ensure that we select documents of the appropriate style.
Different types of documents have different word frequencies.
New York Times vs Livejournal
The statistical distribution of words in a corpus is called a language model.

Building a corpus
We begin by cleaning the data as before.
Construct a dictionary that maps words to the number of pages they occur in.
Don’t worry about multiple occurrences within a document.
The result is referred to as document frequency.
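Building the document-frequency table can be sketched as below; the `set()` is what makes multiple occurrences within one document count only once:

```python
from collections import defaultdict

def document_frequency(corpus):
    """Map each word to the number of documents it occurs in."""
    df = defaultdict(int)
    for doc_words in corpus:
        for word in set(doc_words):  # count each word once per document
            df[word] += 1
    return df

corpus = [["cat", "cat", "dog"], ["dog", "fish"], ["cat"]]
df = document_frequency(corpus)
print(df["cat"], df["dog"], df["fish"])  # 2 2 1
```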

TFIDF
We can now weight each word to indicate its importance in the language model.
The most common weighting scheme is TF-IDF: term frequency - inverse document frequency.
TFIDF(word) = TF(word) * log(|corpus| / DF(word))
TF(word) is how frequently the word occurs in the search query (or a specific document).
DF(word) is the number of pages in the corpus that contain the word.
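The formula translates directly into code; the corpus size and document-frequency table below are toy values chosen to echo the aardvark/date intuition from the earlier slide:

```python
import math
from collections import Counter

def tfidf(word, doc_words, df, corpus_size):
    """TFIDF(word) = TF(word) * log(|corpus| / DF(word))."""
    tf = Counter(doc_words)[word]
    return tf * math.log(corpus_size / df[word])

df = {"aardvark": 1, "date": 90}          # hypothetical document frequencies
doc = ["aardvark"] * 10 + ["date"] * 10   # each word occurs 10 times
# The rare word gets the much higher weight:
print(tfidf("aardvark", doc, df, 100) > tfidf("date", doc, df, 100))  # True
```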

TFIDF
Think about extrema:
What happens if a word occurs in exactly one document in the corpus?
What happens if a word occurs in every document in the corpus?
We want to favor words that discriminate interesting pages from non-interesting pages.

Word Weighting
We can now process each document and assign a weight to each word.
We could use this to improve the performance of the probabilistic scorer.
More interestingly, we can use it to determine how similar two documents are.
This gives us another way for users to search:
“Find more documents like this”

Documents as vectors
At this point, each document can be represented as a dictionary of words and TFIDF scores:
cat: 4.33; dog: 2.1; bunny: 8.2; fish: ...
Conceptually, these documents can be thought of as an n-dimensional vector, where n is the number of words in the lexicon (all words in all documents) and the value of v[n] is the TFIDF score for that word.
Many elements of the vector are zero, since those words don’t appear in that specific document.

Comparing vectors
We can now use well-known techniques from geometry to compare these vectors.
We could measure the angle between the vectors.
The scale is not convenient, and the calculation is complicated.
Easier is to measure the cosine of this angle.
Identical documents have a cosine of 1, and completely dissimilar documents have a cosine of zero.

Computing cosine similarity
The formula for the cosine of the angle between two vectors is:
cos(a, b) = (a · b) / (||a|| ||b||)
This is the dot product of the two vectors, divided by the product of their magnitudes.
The dot product is computed by summing the product of the respective elements of each vector:
a · b = ∑_i v1[i] * v2[i]
The magnitudes are computed by calculating the square root of the sum of the squares of each component (this is Pythagoras’ rule):
||v|| = √(∑_i v[i]²)

Computing cosine similarity
The entire formula, in terms of words in documents, looks like this:
cos(d1, d2) = (∑_{word ∈ d1 ∩ d2} d1[word] * d2[word]) / (√(∑_{word ∈ d1} d1[word]²) * √(∑_{word ∈ d2} d2[word]²))
This is a very powerful and useful technique for comparing documents.
It can also be used to compare a query to a document.
We’ll return to it when we study clustering.
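With documents as the word-to-TFIDF dictionaries described above, the formula becomes a short function; only words in the intersection contribute to the dot product, exactly as in the summation over word ∈ d1 ∩ d2:

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two documents given as word->TFIDF dicts."""
    dot = sum(d1[w] * d2[w] for w in d1.keys() & d2.keys())
    mag1 = math.sqrt(sum(v * v for v in d1.values()))
    mag2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (mag1 * mag2)

a = {"cat": 4.33, "dog": 2.1, "bunny": 8.2}
print(round(cosine_similarity(a, a), 6))    # 1.0 - identical documents
print(cosine_similarity(a, {"fish": 3.0}))  # 0.0 - no shared words
```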