Naive Bayes Algorithm and Its Application in Natural Language Processing, Slides of Artificial Intelligence

West Bengal University of Animal and Fishery Sciences Artificial Intelligence

An in-depth exploration of the Naïve Bayes algorithm, focusing on its implementation in natural language processing (NLP). The lecture covers various aspects of Naïve Bayes, including its application in text classification, conditional independence, and subtle issues. It also discusses related materials and provides exercises for further study.

Typology: Slides

2012/2013

Uploaded on 04/29/2013

shantii 🇮🇳


Applications 2 of 3:

Machine Translation and Language Learning

Lecture 32 of 41


Lecture Outline

Simple Bayes, aka Naïve Bayes
- More examples
- Classification: choosing between two classes; general case
- Robust estimation of probabilities
Learning in Natural Language Processing (NLP)
- Learning over text: problem definitions
- Case study: Newsweeder (Naïve Bayes application)
- Probabilistic framework
- Bayesian approaches to NLP
  - Issues: word sense disambiguation, part-of-speech tagging
  - Applications: spelling correction, web and document searching
Related Material, Mitchell; Pearl
- Read: “Bayesian Networks without Tears”, Charniak
- Go over Chapter 14, Russell and Norvig; Heckerman tutorial (slides)

Conditional Independence

Attributes: Conditionally Independent (CI) Given Data
- P(x, y | D) = P(x | D) · P(y | D): D “mediates” x, y (which are not necessarily independent)
- Conversely, independent variables are not necessarily CI given any function
Example: Independent but Not CI
- Suppose P(x = 0) = P(x = 1) = 0.5, P(y = 0) = P(y = 1) = 0.5, P(xy) = P(x) P(y)
- Let f(x, y) = x ∧ y
- f(x, y) = 0 ⇒ P(x = 1 | f = 0) = P(y = 1 | f = 0) = 1/3, P(x = 1, y = 1 | f = 0) = 0
- x and y are independent but not CI given f, since P(x = 1 | f = 0) · P(y = 1 | f = 0) = 1/9 ≠ 0
Example: CI but Not Independent
- Suppose P(x = 1 | f = 0) = 1, P(y = 1 | f = 0) = 0, P(x = 1 | f = 1) = 0, P(y = 1 | f = 1) = 1
- Suppose P(f = 0) = P(f = 1) = 1/2
- P(x = 1) = 1/2, P(y = 1) = 1/2, P(x = 1) · P(y = 1) = 1/4 ≠ P(x = 1, y = 1) = 0
- x and y are CI given f but not independent
Moral: Choose Evidence Carefully and Understand Dependencies
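
The "independent but not CI" example can be checked by brute-force enumeration. The sketch below is a minimal illustration, assuming two fair, independent bits and f(x, y) = x AND y; the helper name `prob` is hypothetical and not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two independent fair bits x and y.
joint = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

def prob(event, given=lambda x, y: True):
    """P(event | given), computed by enumerating the joint distribution."""
    denom = sum(p for (x, y), p in joint.items() if given(x, y))
    numer = sum(p for (x, y), p in joint.items() if given(x, y) and event(x, y))
    return numer / denom

def f_is_zero(x, y):
    """Condition f(x, y) = x AND y = 0."""
    return (x & y) == 0

# Unconditionally, x and y are independent.
independent = prob(lambda x, y: x == 1 and y == 1) == (
    prob(lambda x, y: x == 1) * prob(lambda x, y: y == 1))

# Conditioned on f = 0, the product rule fails.
p_x = prob(lambda x, y: x == 1, f_is_zero)
p_y = prob(lambda x, y: y == 1, f_is_zero)
p_xy = prob(lambda x, y: x == 1 and y == 1, f_is_zero)

print(independent)        # True
print(p_x, p_y, p_xy)     # 1/3 1/3 0
print(p_x * p_y == p_xy)  # False
```

Exact fractions are used so that the comparison with 1/3 and 0 does not depend on floating-point rounding.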

Naïve Bayes:

Example [1]

Concept: PlayTennis
Application of Naïve Bayes: Computations
- P ( PlayTennis = { Yes , No }) 2 numbers
- P ( Outlook = { Sunny , Overcast , Rain } | PT = { Yes , No }) 6 numbers
- P ( Temp = { Hot , Mild , Cool } | PT = { Yes , No }) 6 numbers
- P ( Humidity = { High , Normal } | PT = { Yes , No }) 4 numbers
- P ( Wind = { Light , Strong } | PT = { Yes , No }) 4 numbers

| Day | Outlook | Temperature | Humidity | Wind | PlayTennis? |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Light | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Light | Yes |
| 4 | Rain | Mild | High | Light | Yes |
| 5 | Rain | Cool | Normal | Light | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Light | No |
| 9 | Sunny | Cool | Normal | Light | Yes |
| 10 | Rain | Mild | Normal | Light | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Light | Yes |
| 14 | Rain | Mild | High | Strong | No |

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{i=1}^{n} \hat{P}(x_i = x_{ik} \mid v_j)$$
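
As a rough illustration of these computations, the sketch below estimates the 22 numbers by relative frequency from the 14 PlayTennis examples and scores one query. The query instance (Sunny, Cool, High, Strong) and the function names are assumptions chosen for the example; they do not appear on the slide.

```python
from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Wind, PlayTennis) rows of the PlayTennis table.
DATA = [
    ("Sunny", "Hot", "High", "Light", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"),
    ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def train(rows):
    """Estimate P(v) and the counts behind P(x_i = value | v) by relative frequency."""
    label_counts = Counter(row[-1] for row in rows)
    cond_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    for *attrs, label in rows:
        for i, value in enumerate(attrs):
            cond_counts[(i, label)][value] += 1
    priors = {v: c / len(rows) for v, c in label_counts.items()}
    return priors, cond_counts, label_counts

def scores(x, priors, cond_counts, label_counts):
    """Return P(v) * prod_i P(x_i | v) for every label v."""
    result = {}
    for v, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= cond_counts[(i, v)][value] / label_counts[v]
        result[v] = score
    return result

priors, cond_counts, label_counts = train(DATA)
query = ("Sunny", "Cool", "High", "Strong")  # hypothetical new day
s = scores(query, priors, cond_counts, label_counts)
prediction = max(s, key=s.get)
print(s, prediction)
```

For this query the unnormalized scores are about 0.0053 for Yes and 0.0206 for No, so the sketch predicts No.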

Naïve Bayes:

Subtle Issues [1]

Conditional Independence Assumption Often Violated
- CI assumption: $P(x_1, x_2, \ldots, x_n \mid v_j) = \prod_i P(x_i \mid v_j)$
- However, it works surprisingly well anyway
- Note
  - Don’t need estimated conditional probabilities $\hat{P}(v_j \mid x)$ to be correct
  - Only need
    $$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(x_i = x_{ik} \mid v_j) = \arg\max_{v_j \in V} P(v_j)\, P(x_1, x_2, \ldots, x_n \mid v_j)$$
See [Domingos and Pazzani, 1996] for analysis
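
A small numeric sketch of this point: if one feature is copied several times, the copies are perfectly dependent, so the CI assumption is violated and the Naïve Bayes posterior becomes overconfident, yet the winning label is unchanged. All numbers and the function name below are hypothetical, chosen only for illustration.

```python
def nb_posterior(prior_a, lik_a, lik_b, copies):
    """Posterior of class 'a' when Naive Bayes counts one feature `copies` times."""
    score_a = prior_a * lik_a ** copies
    score_b = (1 - prior_a) * lik_b ** copies
    return score_a / (score_a + score_b)

# Hypothetical numbers: P(a) = 0.5, P(x = 1 | a) = 0.8, P(x = 1 | b) = 0.4.
true_post = nb_posterior(0.5, 0.8, 0.4, copies=1)   # feature counted once
naive_post = nb_posterior(0.5, 0.8, 0.4, copies=3)  # same feature counted 3 times

print(round(true_post, 3), round(naive_post, 3))
# Both posteriors exceed 0.5, so the predicted label is 'a' in both cases.
same_label = (true_post > 0.5) == (naive_post > 0.5)
print(same_label)
```

The estimated posterior moves from roughly 0.667 to roughly 0.889, which is wrong as a probability but leaves the argmax intact.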

Naïve Bayes:

Subtle Issues [2]

Naïve Bayes Conditional Probabilities Often Unrealistically Close to 0 or 1
- Scenario: what if none of the training instances with target value vj have xi = xik?
  - Then $\hat{P}(x_{ik} \mid v_j) = 0$, so $\hat{P}(v_j) \prod_i \hat{P}(x_i = x_{ik} \mid v_j) = 0$
  - Ramification: one missing term is enough to disqualify the label vj
- e.g., P(Alan Greenspan | Topic = NBA) = 0 in news corpus
- Many such zero counts
Solution Approaches (See [Kohavi, Becker, and Sommerfield, 1996])
- No-match approaches: replace P = 0 with P = c / m (e.g., c = 0.5, 1) or P(v) / m
- Bayesian estimate (m-estimate) for $\hat{P}(x_{ik} \mid v_j)$:
  $$\hat{P}(x_{ik} \mid v_j) = \frac{n_{ik,j} + mp}{n_j + m}$$
  - nj ≡ number of examples with v = vj; nik,j ≡ number of examples with v = vj and xi = xik
  - p ≡ prior estimate for $\hat{P}(x_{ik} \mid v_j)$; m ≡ weight given to prior (“virtual” examples)
  - aka Laplace approaches: see Kohavi et al (P(xik | vj) = (N + f) / (n + kf))
  - f ≡ control parameter; N ≡ nik,j; n ≡ nj; 1 ≤ v ≤ k

Learning to Classify Text:

Probabilistic Framework

Target Concept Interesting?: Document → {+, –}
Problem Definition
- Representation
  - Convert each document to a vector of words (w1, w2, …, wn)
  - One attribute per word position in document
- Learning
  - Use training examples to estimate P(+), P(–), P(document | +), P(document | –)
- Assumptions
  - Naïve Bayes conditional independence assumption:
    $$P(document \mid v_j) = \prod_{i=1}^{length(document)} P(x_i = w_k \mid v_j)$$
  - Here, wk denotes word k in a vocabulary of N words (1 ≤ k ≤ N)
  - P(xi = wk | vj) = probability that the word in position i is word k, given that the document has label vj
  - ∀ i, m. P(xi = wk | vj) = P(xm = wk | vj): word CI of position given vj

Learning to Classify Text:

A Naïve Bayesian Algorithm

Algorithm Learn-Naïve-Bayes-Text ( D, V )
- 1. Collect all words, punctuation, and other tokens that occur in D
  - Vocabulary ← {all distinct words, tokens occurring in any document x ∈ D}
- 2. Calculate required P(vj) and P(xi = wk | vj) probability terms
  - FOR each target value vj ∈ V DO
    - docs[j] ← {documents x ∈ D | v(x) = vj}
    - P(vj) ← |docs[j]| / |D|
    - text[j] ← Concatenation(docs[j]) // a single document
    - n ← total number of distinct word positions in text[j]
    - FOR each word wk in Vocabulary
      - nk ← number of times word wk occurs in text[j]
      - P(wk | vj) ← (nk + 1) / (n + |Vocabulary|)
- 3. RETURN <{P(vj)}, {P(wk | vj)}>
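
The learning procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's implementation: the function names, the toy corpus, and the `classify` helper are assumptions. The classifier applies the Naïve Bayes argmax rule in log space and skips tokens outside the vocabulary, both of which are choices made here rather than details stated on the slide.

```python
import math
from collections import Counter

def learn_naive_bayes_text(docs, labels):
    """docs: list of token lists; labels: parallel list of target values."""
    vocabulary = {tok for doc in docs for tok in doc}
    priors, word_probs = {}, {}
    for v in set(labels):
        class_docs = [d for d, lab in zip(docs, labels) if lab == v]
        priors[v] = len(class_docs) / len(docs)           # |docs[j]| / |D|
        text = [tok for d in class_docs for tok in d]     # concatenation of docs[j]
        n = len(text)                                     # word positions in text[j]
        counts = Counter(text)                            # n_k for each word
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}
    return vocabulary, priors, word_probs

def classify(doc, vocabulary, priors, word_probs):
    """Return argmax_v of log P(v) + sum of log P(w | v) over known tokens."""
    def log_score(v):
        return math.log(priors[v]) + sum(
            math.log(word_probs[v][w]) for w in doc if w in vocabulary)
    return max(priors, key=log_score)

# Hypothetical toy corpus, invented for this example.
docs = [
    ["great", "game", "tonight"],
    ["team", "won", "the", "game"],
    ["new", "graphics", "card"],
    ["graphics", "algorithm", "for", "polygons"],
]
labels = ["sports", "sports", "graphics", "graphics"]

model = learn_naive_bayes_text(docs, labels)
print(classify(["graphics", "card", "game"], *model))
print(classify(["the", "team", "won"], *model))
```

Summing logarithms instead of multiplying probabilities avoids numeric underflow on long documents and does not change which label wins.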


Example:

Twenty Newsgroups

20 USENET Newsgroups
- comp.graphics misc.forsale soc.religion.christian sci.space
- comp.os.ms-windows.misc rec.autos talk.politics.guns sci.crypt
- comp.sys.ibm.pc.hardware rec.motorcycles talk.politics.mideast sci.electronics
- comp.sys.mac.hardware rec.sports.baseball talk.politics.misc sci.med
- comp.windows.x rec.sports.hockey talk.religion.misc
- alt.atheism
Problem Definition [Joachims, 1996]
- Given: 1000 training documents (posts) from each group
- Return: classifier for new documents that identifies the group it belongs to
Example: Recent Article from comp.graphics.algorithms

Hi all

I'm writing an adaptive marching cube algorithm, which must deal with cracks. I got the vertices of the cracks in a list (one list per crack).

Does there exist an algorithm to triangulate a concave polygon? Or how can I bisect the polygon so, that I get a set of connected convex polygons.

The cases of occuring polygons are these:

...

Learning Curve for

Twenty Newsgroups

Performance of Newsweeder (Naïve Bayes): 89% Accuracy

Newsweeder Performance: Training Set Size versus Test Accuracy
- 1/3 holdout for testing
Found: Superset of “Useful and Interesting” Articles
- Evaluation criterion: user feedback (ratings elicited while reading)

[Figure: learning curve; axis labels "Articles" and "% Classification Accuracy"]

Learning Framework for Natural Language:

Linear Statistical Queries (LSQ) Hypotheses

Linear Statistical Queries (LSQ) Hypothesis [Kearns, 1993; Roth, 1999]
- Predicts vLSQ(x) (e.g., ∈ {+, –}) given x ∈ X when
  $$v_{LSQ}(x') = \arg\max_{v_j \in V} \sum_{x'_i} f_{x'_i, v_j}\left(\left\{\hat{P}^{D}_{x'_i, v_j}\right\}\right) \cdot I_{\{x'_i = 1\}}$$
- What does this mean? LSQ classifier…
  - Takes a query example x
  - Asks its built-in SQ oracle for estimates $\hat{P}^{D}_{x'_i, v_j}$ on each xi’ (that satisfy the error bound)
  - Computes fi,j (estimated conditional probability), coefficients for xi’, label vj
  - Returns the most likely label according to this linear discriminator
What Does This Framework Buy Us?
- Naïve Bayes is one of a large family of LSQ learning algorithms
- Includes: BOC (must transform x); (hidden) Markov models; max entropy

Learning Framework for Natural Language:

Naïve Bayes and LSQ

Key Result: Naïve Bayes is a Case of LSQ
$$f_{1, v_j} = \lg \hat{P}^{D}_{1, v_j}, \qquad f_{x_i, v_j} = \lg \frac{\hat{P}^{D}_{x_i, v_j}}{\hat{P}^{D}_{1, v_j}}, \quad i = 1, 2, \ldots, n$$
Variants of Naïve Bayes: Dealing with Missing Values
- Q: What can we do when xi is missing?
- A: Depends on whether xi is unknown or truly missing (not recorded or corrupt)
  - Method 1: just leave it out (use when truly missing) - standard LSQ
  - Method 2: treat as false or a known default value - modified LSQ
  - Method 3 [Domingos and Pazzani, 1996]: introduce a new value, “?”
- See [Roth, 1999] and [Kohavi, Becker, and Sommerfield, 1996] for more info
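
To make the "Naïve Bayes is a case of LSQ" claim concrete, the sketch below scores a query two ways: as a Naïve Bayes product and as a linear sum of log coefficients over the active features. The probabilities, labels, and function names are hypothetical. As an assumption of this sketch, only features equal to 1 contribute to either score.

```python
import math
from itertools import combinations

# Hypothetical estimates for two labels and three binary features.
prior = {"+": 0.6, "-": 0.4}                          # P(v_j)
cond = {"+": [0.9, 0.2, 0.5], "-": [0.3, 0.7, 0.5]}   # P(x_i = 1 | v_j)

def lsq_coefficients(v):
    """Bias lg P(v) and one weight lg P(x_i = 1 | v) per feature."""
    return math.log2(prior[v]), [math.log2(p) for p in cond[v]]

def lsq_predict(active):
    """Linear discriminator: bias plus the weights of the active features."""
    def linear_score(v):
        bias, weights = lsq_coefficients(v)
        return bias + sum(weights[i] for i in active)
    return max(prior, key=linear_score)

def nb_predict(active):
    """The same rule written as a product over the active features."""
    def product_score(v):
        return prior[v] * math.prod(cond[v][i] for i in active)
    return max(prior, key=product_score)

# Compare the two predictors on every subset of active features.
subsets = [c for r in range(4) for c in combinations(range(3), r)]
agreement = all(lsq_predict(a) == nb_predict(a) for a in subsets)
print(agreement)
print(lsq_predict((0, 1)), lsq_predict((1, 2)))
```

Because the logarithm is monotonic, taking lg of the product turns it into a sum without changing the argmax, which is all the linear form requires.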

NLP Issues:

Word Sense Disambiguation (WSD)

Problem Definition
- Given: m sentences, each containing a usage of a particular ambiguous word
- Example: “The can will rust.” (auxiliary verb versus noun)
- Label: vj ≡ s ≡ correct word sense (e.g., s ∈ {auxiliary verb, noun})
- Representation: m examples (labeled attribute vectors <(w1, w2, …, wn), s>)
- Return: classifier f : X → V that disambiguates new x ≡ (w1, w2, …, wn)
Solution Approach: Use Bayesian Learning (e.g., Naïve Bayes)
- $P(w_1, w_2, \ldots, w_n \mid s) = \prod_{i=1}^{n} P(w_i \mid s)$
- Caveat: can’t observe s in the text!
- A solution: treat s in P(wi | s) as missing value, impute s (assign by inference)
- [Pedersen and Bruce, 1998]: fill in using Gibbs sampling, EM algorithm (later)
- [Roth, 1998]: Naïve Bayes, sparse network of Winnows (SNoW), TBL
Recent Research
- T. Pedersen’s research home page: http://www.d.umn.edu/~tpederse/
- D. Roth’s Cognitive Computation Group: http://l2r.cs.uiuc.edu/~cogcomp/



NLP Issues:

Part-of-Speech (POS) Tagging

Problem Definition
- Given: m sentences containing untagged words
- Example: “The can will rust.”
- Label (one per word, out of ~30-150): vj ≡ s ∈ (art, n, aux, vi)
- Representation: labeled examples <(w1, w2, …, wn), s>
- Return: classifier f : X → V that tags x ≡ (w1, w2, …, wn)
- Applications: WSD, dialogue acts (e.g., “That sounds OK to me.” → ACCEPT)
Solution Approaches: Use Transformation-Based Learning (TBL)
- [Brill, 1995]: TBL - mistake-driven algorithm that produces sequences of rules
  - Each rule of the form (ti, v): a test condition (constructed attribute) and a tag
  - ti: “w occurs within ±k words of wi” (context words); collocations (windows)
- For more info: see [Roth, 1998], [Samuel, Carberry, Vijay-Shanker, 1998]
Recent Research
- E. Brill’s page: http://www.cs.jhu.edu/~brill/
- K. Samuel’s page: http://www.eecis.udel.edu/~samuel/work/research.html

[Figure labels: Discourse Labeling; Speech Acts; Natural Language; Parsing / POS Tagging; Lexical Analysis]
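
The (ti, v) rule form can be illustrated by applying a short, fixed rule sequence to the example sentence. This sketch shows only how rules are applied, not how TBL learns them from mistakes. The initial lexicon and both rules are invented for this example and are not taken from [Brill, 1995].

```python
# Hypothetical initial lexicon: one default tag per word (an assumption).
INITIAL_TAG = {"the": "art", "can": "aux", "will": "aux", "rust": "n"}

# Each rule pairs a test condition t_i with a tag v.
# Both rules below are invented for illustration, not learned from data.
RULES = [
    # An auxiliary directly after an article is retagged as a noun.
    (lambda words, tags, i: i > 0 and tags[i - 1] == "art" and tags[i] == "aux", "n"),
    # A noun directly after an auxiliary is retagged as an intransitive verb.
    (lambda words, tags, i: i > 0 and tags[i - 1] == "aux" and tags[i] == "n", "vi"),
]

def tag(words):
    """Assign default tags, then apply each rule in order to every position."""
    tags = [INITIAL_TAG[w.lower()] for w in words]
    for test, new_tag in RULES:
        snapshot = list(tags)  # tests see the tags as they were before this rule
        tags = [new_tag if test(words, snapshot, i) else snapshot[i]
                for i in range(len(words))]
    return tags

result = tag(["The", "can", "will", "rust"])
print(result)  # ['art', 'n', 'aux', 'vi']
```

The order of the rules matters: the second rule fires on "rust" only because "will" keeps its aux tag after the first rule has retagged "can".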
