

Lecture 35 of 41
Machine Learning: Version Spaces and Decision Trees

Example: Learning a Concept (EnjoySport) from Data

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
0        Sunny  Warm     Normal   Strong  Warm   Same      Yes
1        Sunny  Warm     High     Strong  Warm   Same      Yes
2        Rainy  Cold     High     Strong  Warm   Change    No
3        Sunny  Warm     High     Strong  Cool   Change    Yes

  • Specification for Training Examples
    • Similar to a data type definition
    • 6 variables (aka attributes, features): Sky, AirTemp, Humidity, Wind, Water, Forecast
    • Nominal-valued (symbolic) attributes - enumerative data type

• Binary (Boolean-Valued or H-Valued) Concept

  • Supervised Learning Problem: Describe the General Concept

Typical Concept Learning Tasks

  • Given
    • Instances X: possible days, each described by attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
    • Target function c ≡ EnjoySport: X → {0, 1}, where X ≡ {Rainy, Sunny} × {Warm, Cold} × {Normal, High} × {None, Mild, Strong} × {Cool, Warm} × {Same, Change}
    • Hypotheses H: conjunctions of literals (e.g., <?, Cold, High, ?, ?, ?>)
    • Training examples D: positive and negative examples of the target function
  • Determine
    • Hypothesis h ∈ H such that h(x) = c(x) for all x ∈ D
    • Such h are consistent with the training data
  • Training Examples
    • Assumption: no missing X values
    • Noise in values of c (contradictory labels)?

D ≡ {<x1, c(x1)>, …, <xm, c(xm)>}
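As a concrete illustration (not from the slides), the EnjoySport training set above can be written in Python as a list of <x, c(x)> pairs, with attribute order Sky, AirTemp, Humidity, Wind, Water, Forecast:

```python
# EnjoySport training data D as <instance, label> pairs.
# Attribute order: Sky, AirTemp, Humidity, Wind, Water, Forecast.
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),  # Yes
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),  # Yes
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),  # No
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),  # Yes
]
```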

Inductive Learning Hypothesis

  • Fundamental Assumption of Inductive Learning
  • Informal Statement
    • Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples
    • Definitions deferred: sufficiently large, approximate well, unobserved
  • Formal Statements, Justification, Analysis
    • Statistical (Mitchell, Chapter 5; statistics textbook)
    • Probabilistic (R&N, Chapters 14-15 and 19; Mitchell, Chapter 6)
    • Computational (R&N, Section 18.6; Mitchell, Chapter 7)
  • More on This Topic: Machine Learning and Pattern Recognition (CIS732)
  • Next: How to Find This Hypothesis?

Find-S Algorithm

1. Initialize h to the most specific hypothesis in H

H: the hypothesis space (a partially ordered set under the relation Less-Specific-Than)

2. For each positive training instance x

For each attribute constraint ai in h
    IF the constraint ai in h is satisfied by x
    THEN do nothing
    ELSE replace ai in h by the next more general constraint that is satisfied by x

3. Output hypothesis h
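A minimal Python sketch of Find-S for conjunctive hypotheses over nominal attributes; the '0' / '?' encoding and the function name are illustrative choices, not prescribed by the slides:

```python
# Find-S: start from the most specific hypothesis and minimally generalize it
# on each positive example. '0' means "no value allowed", '?' means "any value".
def find_s(examples, n_attributes):
    h = ['0'] * n_attributes              # most specific hypothesis in H
    for x, label in examples:
        if label != 1:                    # Find-S ignores negative examples
            continue
        for i, (hi, xi) in enumerate(zip(h, x)):
            if hi == '0':
                h[i] = xi                 # adopt the first positive example's value
            elif hi != xi:
                h[i] = '?'                # conflicting values: generalize to "any"
    return tuple(h)

# EnjoySport data from the table above (Sky, AirTemp, Humidity, Wind, Water, Forecast).
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]
print(find_s(D, 6))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```

This reproduces the trace shown on the next slide.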

Hypothesis Space Search by Find-S

Instances X:
    x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +
    x2 = <Sunny, Warm, High, Strong, Warm, Same>, +
    x3 = <Rainy, Cold, High, Strong, Warm, Change>, -
    x4 = <Sunny, Warm, High, Strong, Cool, Change>, +

Hypotheses H:
    h1 = <Ø, Ø, Ø, Ø, Ø, Ø>
    h2 = <Sunny, Warm, Normal, Strong, Warm, Same>
    h3 = <Sunny, Warm, ?, Strong, Warm, Same>
    h4 = <Sunny, Warm, ?, Strong, Warm, Same>
    h5 = <Sunny, Warm, ?, Strong, ?, ?>

  • Shortcomings of Find-S
    • Can’t tell whether it has learned the concept
    • Can’t tell when the training data are inconsistent
    • Picks a maximally specific h (why?)
    • Depending on H, there might be several!

[Figure: Find-S search through the partially ordered hypothesis space, moving from the instances x1 through x4 to increasingly general hypotheses]

Candidate Elimination Algorithm [1]
1. Initialization

G ← (singleton) set containing the most general hypothesis in H, denoted {<?, …, ?>}
S ← set of most specific hypotheses in H, denoted {<Ø, …, Ø>}

2. For each training example d

If d is a positive example (Update-S)
    Remove from G any hypotheses inconsistent with d
    For each hypothesis s in S that is not consistent with d
        Remove s from S
        Add to S all minimal generalizations h of s such that
            1. h is consistent with d
            2. Some member of G is more general than h
        (These are the greatest lower bounds, or meets, s ∧ d, in VS_H,D)
        Remove from S any hypothesis that is more general than another hypothesis in S (remove any dominated elements)

Candidate Elimination Algorithm [2] (continued)

If d is a negative example (Update-G)
    Remove from S any hypotheses inconsistent with d
    For each hypothesis g in G that is not consistent with d
        Remove g from G
        Add to G all minimal specializations h of g such that
            1. h is consistent with d
            2. Some member of S is more specific than h
        (These are the least upper bounds, or joins, g ∨ d, in VS_H,D)
        Remove from G any hypothesis that is less general than another hypothesis in G (remove any dominating elements)
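A minimal sketch of the two core update operations used above, assuming conjunctive hypotheses over nominal attributes represented as tuples ('?' for "any value", '0' for "no value"); the function names are illustrative, and the full algorithm would still apply the pruning steps listed above:

```python
# Core Candidate-Elimination operations for conjunctive hypotheses.

def min_generalization(s, x):
    """Minimal generalization of hypothesis s that covers positive example x."""
    new = []
    for si, xi in zip(s, x):
        if si == '0':
            new.append(xi)        # first positive example: adopt its value
        elif si == xi or si == '?':
            new.append(si)        # already covers x on this attribute
        else:
            new.append('?')       # conflicting values: relax to "any"
    return tuple(new)

def min_specializations(g, x, domains):
    """All minimal specializations of g that exclude negative example x."""
    specs = []
    for i, (gi, xi) in enumerate(zip(g, x)):
        if gi == '?':
            # Replace one '?' with any legal value other than x's value on attribute i.
            for v in domains[i]:
                if v != xi:
                    specs.append(g[:i] + (v,) + g[i + 1:])
    return specs

# Example: specialize the most general hypothesis against the negative example
# x3 = <Rainy, Cold, High, Strong, Warm, Change> (yields 7 minimal specializations).
domains = [("Sunny", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("None", "Mild", "Strong"), ("Cool", "Warm"), ("Same", "Change")]
print(min_specializations(('?',) * 6,
                          ("Rainy", "Cold", "High", "Strong", "Warm", "Change"),
                          domains))
```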

An Unbiased Learner

  • Example of A Biased H
    • Conjunctive concepts with don’t cares
    • What concepts can H not express? (Hint: what are its syntactic limitations?)
  • Idea
    • Choose H’ that expresses every teachable concept
    • i.e., H’ is the power set of X
    • Recall: |A → B| = |B|^|A| (A = X; B = {labels}; H’ = A → B)
    • {Rainy, Sunny} × {Warm, Cold} × {Normal, High} × {None, Mild, Strong} × {Cool, Warm} × {Same, Change} → {0, 1}
  • An Exhaustive Hypothesis Language
    • Consider: H’ = disjunctions (∨), conjunctions (∧), negations (¬) over the previous H
    • |H’| = 2^(2 · 2 · 2 · 3 · 2 · 2) = 2^96; |H| = 1 + (3 · 3 · 3 · 4 · 3 · 3) = 973
  • What Are S, G For The Hypothesis Language H’?
    • S ← disjunction of all positive examples
    • G ← conjunction of all negated negative examples
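A quick arithmetic check of these counts (plain Python, just verifying the numbers on the slide):

```python
# |H'| is the number of subsets of the instance space X (one hypothesis per subset).
size_X = 2 * 2 * 2 * 3 * 2 * 2        # 96 possible instances
print(2 ** size_X)                     # |H'| = 2**96, roughly 7.9e28
# |H|: each attribute takes a specific value or '?', plus one hypothesis for
# the empty concept (any attribute set to the empty value collapses to it).
print(1 + 3 * 3 * 3 * 4 * 3 * 3)       # |H| = 973
```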

Decision Trees

  • Classifiers: Instances (Unlabeled Examples)
  • Internal Nodes: Tests for Attribute Values
    • Typical: equality test (e.g., “Wind = ?”)
    • Inequality, other tests possible
  • Branches: Attribute Values
    • One-to-one correspondence (e.g., “Wind = Strong”, “Wind = Light”)
  • Leaves: Assigned Classifications (Class Labels)
  • Representational Power: Propositional Logic (Why?)

Decision Tree for Concept PlayTennis

[Figure: root node Outlook? with branches Sunny → Humidity? (High → No, Normal → Yes), Overcast → Maybe, and Rain → Wind? (Strong → No, Light → Maybe)]
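To make the representation concrete, here is an illustrative encoding (not from the slides) of that tree as nested Python dicts, plus a classifier that walks it:

```python
# Internal nodes map an attribute name to {branch value: subtree};
# leaves are plain class-label strings.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Maybe",
    "Rain":     {"Wind": {"Strong": "No", "Light": "Maybe"}},
}}

def classify(node, instance):
    """Follow branches matching the instance's attribute values until a leaf."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

print(classify(tree, {"Outlook": "Rain", "Wind": "Light"}))   # Maybe
```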

Decision Tree Learning: Top-Down Induction (ID3)

[Figure: two candidate root attributes for the same 64 examples [29+, 35-]: A1 splits them into [21+, 5-] (True) and [8+, 30-] (False); A2 splits them into [18+, 33-] (True) and [11+, 2-] (False)]

  • Algorithm Build-DT(Examples, Attributes)

IF all examples have the same label
THEN RETURN (leaf node with label)
ELSE IF the set of attributes is empty
THEN RETURN (leaf with majority label)
ELSE
    Choose best attribute A as root
    FOR each value v of A
        Create a branch out of the root for the condition A = v
        IF {x ∈ Examples : x.A = v} = Ø
        THEN RETURN (leaf with majority label)
        ELSE Build-DT({x ∈ Examples : x.A = v}, Attributes ~ {A})

(A Python sketch of this recursion follows the next bullet.)

  • But Which Attribute Is Best?
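The recursion above can be sketched in Python as follows; the data layout (examples as dicts of attribute values) and the pluggable choose_best_attribute heuristic are assumptions for illustration, with the heuristic itself discussed next:

```python
from collections import Counter

def build_dt(examples, attributes, choose_best_attribute):
    """examples: list of (attribute_dict, label) pairs.
    choose_best_attribute(examples, attributes) -> attribute name."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                        # all examples share one label
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                               # no attributes left to test
        return majority
    A = choose_best_attribute(examples, attributes)  # pick the root attribute
    tree = {A: {}}
    for v in {x[A] for x, _ in examples}:            # one branch per observed value
        subset = [(x, c) for x, c in examples if x[A] == v]
        remaining = [a for a in attributes if a != A]
        tree[A][v] = build_dt(subset, remaining, choose_best_attribute)
    return tree
```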

Choosing the “Best” Root Attribute

  • Objective
    • Construct a decision tree that is as small as possible (Occam’s Razor)
    • Subject to: consistency with labels on training data
  • Obstacles
    • Finding the minimal consistent hypothesis (i.e., decision tree) is NP-hard (D’oh!)
    • Recursive algorithm ( Build-DT )
      • A greedy heuristic search for a simple tree
      • Cannot guarantee optimality (D’oh!)
  • Main Decision: Next Attribute to Condition On
    • Want: attributes that split examples into sets that are relatively pure in one label
    • Result: closer to a leaf node
    • Most popular heuristic
      • Developed by J. R. Quinlan
      • Based on information gain
      • Used in ID3 algorithm

Entropy: Information-Theoretic Definition

  • Components
    • D: a set of examples {<x1, c(x1)>, <x2, c(x2)>, …, <xm, c(xm)>}
    • p+ = Pr(c(x) = +), p- = Pr(c(x) = -)
  • Definition
    • H is defined over a probability density function p
    • D contains examples whose frequency of + and - labels indicates p+ and p- for the observed data
    • The entropy of D relative to c is: H(D) ≡ - p+ log_b(p+) - p- log_b(p-)
  • What Units is H Measured In?
    • Depends on the base b of the log (bits for b = 2, nats for b = e , etc.)
    • A single bit is required to encode each example in the worst case ( p+ = 0.5)
    • If there is less uncertainty (e.g., p+ = 0.8), we can use less than 1 bit each
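A small sketch of the binary entropy computation (illustrative function name), checking the two cases just mentioned:

```python
import math

def entropy(p_pos, base=2):
    """H = -p*log(p) - (1-p)*log(1-p), with 0*log(0) taken as 0."""
    p_neg = 1.0 - p_pos
    return -sum(p * math.log(p, base) for p in (p_pos, p_neg) if p > 0)

print(entropy(0.5))   # 1.0 bit: maximum uncertainty, one bit per example
print(entropy(0.8))   # ~0.722 bits: less uncertainty, under 1 bit per example
```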

Information Gain: Information-Theoretic Definition

  • Partitioning on Attribute Values
    • Recall: a partition of D is a collection of disjoint subsets whose union is D
    • Goal: measure the uncertainty removed by splitting on the value of attribute A
  • Definition
    • The information gain of D relative to attribute A is the expected reduction in entropy due to splitting (“sorting”) on A:

where Dv is {x ∈ D : x.A = v}, the set of examples in D where attribute A has value v

  • Idea: partition on A; weight the entropy of each subset Dv by its relative size |Dv| / |D|
  • Which Attribute Is Best?

Gain(D, A) ≡ H(D) - Σ_{v ∈ values(A)} (|Dv| / |D|) · H(Dv)
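A small sketch (hypothetical helper names) that evaluates Gain(D, A) directly from (+, -) label counts, applied to the A1 / A2 comparison shown in the figure below:

```python
import math

def entropy_counts(pos, neg):
    """Entropy (base 2) of a sample with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, children):
    """Gain(D, A) = H(D) - sum over v of |Dv|/|D| * H(Dv), from (pos, neg) count pairs."""
    n = sum(parent)
    return entropy_counts(*parent) - sum(
        (p + q) / n * entropy_counts(p, q) for p, q in children)

# Counts from the figure: D = [29+, 35-].
print(gain((29, 35), [(21, 5), (8, 30)]))    # A1: about 0.27 bits
print(gain((29, 35), [(18, 33), (11, 2)]))   # A2: about 0.12 bits, so A1 is the better split
```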

[Figure: the same comparison as above; A1 splits [29+, 35-] into [21+, 5-] (True) and [8+, 30-] (False), while A2 splits [29+, 35-] into [18+, 33-] (True) and [11+, 2-] (False)]