Lecture 35 of 41
Machine Learning:
Version Spaces and Decision Trees
Example:
Learning A Concept ( EnjoySport ) from Data
Example | Sky   | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
--------|-------|---------|----------|--------|-------|----------|-----------
0       | Sunny | Warm    | Normal   | Strong | Warm  | Same     | Yes
1       | Sunny | Warm    | High     | Strong | Warm  | Same     | Yes
2       | Rainy | Cold    | High     | Strong | Warm  | Change   | No
3       | Sunny | Warm    | High     | Strong | Cool  | Change   | Yes
- Specification for Training Examples
- Similar to a data type definition
- 6 variables (aka attributes, features): Sky, AirTemp, Humidity, Wind, Water, Forecast
- Nominal-valued (symbolic) attributes - enumerative data type
- Binary (Boolean-valued) concept: label EnjoySport ∈ {Yes, No}
- Supervised Learning Problem: Describe the General Concept
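As a concrete illustration (not from the slides), one minimal way this training set might be encoded in Python; the identifiers ATTRIBUTES and TRAINING_DATA are hypothetical names chosen here and are reused in later sketches:

```python
# EnjoySport training data from the table above.
# Each instance: (Sky, AirTemp, Humidity, Wind, Water, Forecast); label in {"Yes", "No"}.
ATTRIBUTES = ("Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast")

TRAINING_DATA = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
```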
Typical Concept Learning Tasks
- Given
- Instances X: possible days, each described by attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
- Target function c ≡ EnjoySport: X → {0, 1}, where X ≡ {Rainy, Sunny} × {Warm, Cold} × {Normal, High} × {None, Mild, Strong} × {Cool, Warm} × {Same, Change}
- Hypotheses H: conjunctions of literals (e.g., <?, Cold, High, ?, ?, ?>)
- Training examples D : positive and negative examples of the target function
- Determine
- Hypothesis h ∈ H such that h(x) = c(x) for all x ∈ D
- Such h are consistent with the training data
- Training Examples
- Assumption: no missing X values
- Noise in values of c (contradictory labels)?
- D = {<x1, c(x1)>, …, <xm, c(xm)>}
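To make "consistent with the training data" concrete, a minimal sketch assuming the tuple encoding above; matches() and consistent() are hypothetical helper names introduced here:

```python
def matches(h, x):
    """True if conjunctive hypothesis h covers instance x.
    '?' matches any value; the all-Ø hypothesis matches nothing."""
    return all(hi == "?" or hi == xi for hi, xi in zip(h, x))

def consistent(h, data):
    """h is consistent with D iff h(x) = c(x) for every <x, c(x)> in D."""
    return all(matches(h, x) == (label == "Yes") for x, label in data)
```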
Inductive Learning Hypothesis
- Fundamental Assumption of Inductive Learning
- Informal Statement
- Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples
- Definitions deferred: sufficiently large, approximate well, unobserved
- Formal Statements, Justification, Analysis
- Statistical (Mitchell, Chapter 5; statistics textbook)
- Probabilistic (R&N, Chapters 14-15 and 19; Mitchell, Chapter 6)
- Computational (R&N, Section 18.6; Mitchell, Chapter 7)
- More on This Topic: Machine Learning and Pattern Recognition (CIS732)
- Next: How to Find This Hypothesis?
Find-S Algorithm
1. Initialize h to the most specific hypothesis in H
H : the hypothesis space (partially ordered set under relation Less-Specific-Than )
2. For each positive training instance x
For each attribute constraint ai in h:
   IF the constraint ai in h is satisfied by x
   THEN do nothing
   ELSE replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
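A minimal Python sketch of Find-S for conjunctive hypotheses over nominal attributes, reusing the hypothetical TRAINING_DATA defined earlier; this is an illustration, not the slides' reference implementation:

```python
def find_s(data, n_attrs=6):
    """Find-S: start with the most specific hypothesis and minimally
    generalize it on each positive example; negatives are ignored."""
    h = ["Ø"] * n_attrs                      # most specific hypothesis <Ø, ..., Ø>
    for x, label in data:
        if label != "Yes":                   # Find-S ignores negative examples
            continue
        for i, (hi, xi) in enumerate(zip(h, x)):
            if hi == "Ø":                    # first positive: copy its values
                h[i] = xi
            elif hi != xi:                   # conflicting value: generalize to '?'
                h[i] = "?"
    return tuple(h)

# On the EnjoySport data this yields <Sunny, Warm, ?, Strong, ?, ?>.
print(find_s(TRAINING_DATA))
```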
Hypothesis Space Search
by Find-S
Instances X:
  x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +
  x2 = <Sunny, Warm, High, Strong, Warm, Same>, +
  x3 = <Rainy, Cold, High, Strong, Warm, Change>, -
  x4 = <Sunny, Warm, High, Strong, Cool, Change>, +
Hypotheses H:
  h1 = <Ø, Ø, Ø, Ø, Ø, Ø>
  h2 = <Sunny, Warm, Normal, Strong, Warm, Same>
  h3 = <Sunny, Warm, ?, Strong, Warm, Same>
  h4 = <Sunny, Warm, ?, Strong, Warm, Same>
  h5 = <Sunny, Warm, ?, Strong, ?, ?>
- Shortcomings of Find-S
- Can’t tell whether it has learned concept
- Can’t tell when training data inconsistent
- Picks a maximally specific h (why?)
- Depending on H , there might be several!
[Figure: instance space X (x1–x4) and hypothesis space H, showing the chain of increasingly general hypotheses traced by Find-S]
Candidate Elimination Algorithm [1]
1. Initialization
   G ← (singleton) set containing the most general hypothesis in H, denoted {<?, …, ?>}
   S ← set of most specific hypotheses in H, denoted {<Ø, …, Ø>}
2. For each training example d
   If d is a positive example (Update-S)
      Remove from G any hypotheses inconsistent with d
      For each hypothesis s in S that is not consistent with d
         Remove s from S
         Add to S all minimal generalizations h of s such that
         - h is consistent with d
         - Some member of G is more general than h
           (These are the greatest lower bounds, or meets, s ∧ d, in VS_H,D)
      Remove from S any hypothesis that is more general than another hypothesis in S (remove any dominated elements)
Candidate Elimination Algorithm [2]
(continued)
   If d is a negative example (Update-G)
      Remove from S any hypotheses inconsistent with d
      For each hypothesis g in G that is not consistent with d
         Remove g from G
         Add to G all minimal specializations h of g such that
         - h is consistent with d
         - Some member of S is more specific than h
           (These are the least upper bounds, or joins, g ∨ d, in VS_H,D)
      Remove from G any hypothesis that is less general than another hypothesis in G (remove any dominating elements)
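A compact Python sketch of both updates for conjunctive hypotheses, reusing the hypothetical matches() and TRAINING_DATA from the earlier sketches; min_generalize, min_specializations, and VALUES are illustrative names and assumptions, not the slides' notation, and some boundary-set pruning is simplified:

```python
def min_generalize(s, x):
    """Minimal generalization of conjunctive hypothesis s so that it covers x."""
    return tuple(xi if si == "Ø" else (si if si == xi else "?")
                 for si, xi in zip(s, x))

def min_specializations(g, x, values):
    """Minimal specializations of g that exclude the negative instance x."""
    specs = []
    for i, gi in enumerate(g):
        if gi == "?":
            for v in values[i]:
                if v != x[i]:
                    specs.append(g[:i] + (v,) + g[i + 1:])
    return specs

def more_general(h1, h2):
    """True if h1 is at least as general as h2."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def candidate_elimination(data, values):
    n = len(values)
    G = {("?",) * n}                     # most general boundary
    S = {("Ø",) * n}                     # most specific boundary
    for x, label in data:
        if label == "Yes":               # Update-S
            G = {g for g in G if matches(g, x)}
            S = {s if matches(s, x) else min_generalize(s, x) for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:                            # Update-G
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                else:
                    new_G.update(h for h in min_specializations(g, x, values)
                                 if any(more_general(h, s) for s in S))
            G = {g for g in new_G        # drop dominated (less general) members
                 if not any(h != g and more_general(h, g) for h in new_G)}
    return S, G

VALUES = [{"Sunny", "Rainy"}, {"Warm", "Cold"}, {"Normal", "High"},
          {"None", "Mild", "Strong"}, {"Cool", "Warm"}, {"Same", "Change"}]
S, G = candidate_elimination(TRAINING_DATA, VALUES)
# Expected: S = {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
#           G = {('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')}
```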
An Unbiased Learner
- Example of A Biased H
- Conjunctive concepts with don’t cares
- What concepts can H not express? (Hint: what are its syntactic limitations?)
- Idea
- Choose H’ that expresses every teachable concept
- i.e., H’ is the power set of X
- Recall: |A → B| = |B|^|A| (A ≡ X; B ≡ {labels}; H’ ≡ A → B)
- {{Rainy, Sunny} × {Warm, Cold} × {Normal, High} × {None, Mild, Strong} × {Cool, Warm} × {Same, Change}} → {0, 1}
- An Exhaustive Hypothesis Language
- Consider: H’ = disjunctions (∨), conjunctions (∧), negations (¬) over previous H
- |H’| = 2^(2 · 2 · 2 · 3 · 2 · 2) = 2^96; |H| = 1 + (3 · 3 · 3 · 4 · 3 · 3) = 973
- What Are S, G For The Hypothesis Language H’?
- S ← disjunction of all positive examples
- G ← conjunction of all negated negative examples
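A quick sanity check of these counts (simple arithmetic over the attribute value-set sizes above; not part of the slides):

```python
# Sizes of the attribute value sets: Sky, AirTemp, Humidity, Wind, Water, Forecast
sizes = [2, 2, 2, 3, 2, 2]

n_instances = 1
for s in sizes:
    n_instances *= s                     # |X| = 2*2*2*3*2*2 = 96

n_unbiased = 2 ** n_instances            # |H'| = |power set of X| = 2^96
n_conjunctive = 1
for s in sizes:
    n_conjunctive *= (s + 1)             # each attribute: one of its values or '?'
n_conjunctive += 1                       # plus the single all-Ø (empty) hypothesis

print(n_instances, n_unbiased, n_conjunctive)   # 96, 2**96, 973
```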
Decision Trees
- Classifiers: map instances (unlabeled examples) to class labels
- Internal Nodes: Tests for Attribute Values
- Typical: equality test (e.g., “Wind = ?”)
- Inequality, other tests possible
- Branches: Attribute Values
- One-to-one correspondence (e.g., “Wind = Strong”, “Wind = Light”)
- Leaves: Assigned Classifications (Class Labels)
- Representational Power: Propositional Logic ( Why? )
[Figure: Decision Tree for Concept PlayTennis]
Outlook?
- Sunny → Humidity?
  - High → No
  - Normal → Yes
- Overcast → Maybe
- Rain → Wind?
  - Strong → No
  - Light → Maybe
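One way to render that tree in code, as a hedged sketch: the nested-dict encoding and the classify() helper are illustrative choices made here, not the slides' representation.

```python
# Internal nodes: attribute tests; branches: attribute values; leaves: class labels.
PLAY_TENNIS_TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Maybe",
    "Rain":     {"Wind": {"Strong": "No", "Light": "Maybe"}},
}}

def classify(tree, instance):
    """Walk the tree: test the node's attribute, follow the branch for the
    instance's value, and return the label at the leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

print(classify(PLAY_TENNIS_TREE, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```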
Decision Tree Learning:
Top-Down Induction ( ID3 )
[Figure: two candidate root attributes over the same 64 examples —
A1: [29+, 35-] splits into True [21+, 5-] and False [8+, 30-];
A2: [29+, 35-] splits into True [18+, 33-] and False [11+, 2-]]
- Algorithm Build-DT ( Examples , Attributes )
IF all examples have the same label THEN RETURN (leaf node with label)
ELSE IF set of attributes is empty THEN RETURN (leaf with majority label)
ELSE
   Choose best attribute A as root
   FOR each value v of A
      Create a branch out of the root for the condition A = v
      IF {x ∈ Examples : x.A = v} = Ø THEN RETURN (leaf with majority label)
      ELSE Build-DT ({x ∈ Examples : x.A = v}, Attributes − {A})
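A compact Python rendering of Build-DT, sketched under the assumption that examples are dicts mapping attribute names to values plus a "label" key; choose_attribute is a hypothetical parameter standing in for the attribute-selection heuristic discussed next:

```python
from collections import Counter

def build_dt(examples, attributes, choose_attribute):
    """Top-down induction of a decision tree (ID3-style skeleton)."""
    labels = [ex["label"] for ex in examples]
    if len(set(labels)) == 1:                      # all examples share one label
        return labels[0]
    if not attributes:                             # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    A = choose_attribute(examples, attributes)     # e.g., highest information gain
    tree = {A: {}}
    for v in {ex[A] for ex in examples}:           # branch only on observed values
        subset = [ex for ex in examples if ex[A] == v]
        remaining = [a for a in attributes if a != A]
        tree[A][v] = build_dt(subset, remaining, choose_attribute)
    return tree
```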
- But Which Attribute Is Best?
Choosing the “Best” Root Attribute
- Objective
- Construct a decision tree that is as small as possible (Occam’s Razor)
- Subject to: consistency with labels on training data
- Obstacles
- Finding the minimal consistent hypothesis (i.e., decision tree) is NP-hard (D’oh!)
- Recursive algorithm ( Build-DT )
- A greedy heuristic search for a simple tree
- Cannot guarantee optimality (D’oh!)
- Main Decision: Next Attribute to Condition On
- Want: attributes that split examples into sets that are relatively pure in one label
- Result: closer to a leaf node
- Most popular heuristic
- Developed by J. R. Quinlan
- Based on information gain
- Used in ID3 algorithm
Entropy:
Information Theoretic Definition
- Components
- D: a set of examples {<x1, c(x1)>, <x2, c(x2)>, …, <xm, c(xm)>}
- p+ = Pr(c(x) = +), p- = Pr(c(x) = -)
- Definition
- H is defined over a probability distribution p
- D contains examples whose frequency of + and - labels indicates p+ and p- for the observed data
- The entropy of D relative to c is: H(D) ≡ -p+ log_b(p+) - p- log_b(p-)
- What Units is H Measured In?
- Depends on the base b of the log (bits for b = 2, nats for b = e , etc.)
- A single bit is required to encode each example in the worst case ( p+ = 0.5)
- If there is less uncertainty (e.g., p+ = 0.8), we can use less than 1 bit each
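A small worked check of the binary entropy in base 2; the entropy() helper is a name introduced here and reused in a later sketch:

```python
from math import log2

def entropy(p_pos):
    """Binary entropy H(D) = -p+ log2 p+ - p- log2 p-, with 0 log 0 taken as 0."""
    p_neg = 1.0 - p_pos
    terms = [p * log2(p) for p in (p_pos, p_neg) if p > 0]
    return -sum(terms)

print(entropy(0.5))   # 1.0 bit: maximal uncertainty
print(entropy(0.8))   # ~0.722 bits: less uncertainty, fewer bits per example
```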
Information Gain:
Information Theoretic Definition
- Partitioning on Attribute Values
- Recall: a partition of D is a collection of disjoint subsets whose union is D
- Goal: measure the uncertainty removed by splitting on the value of attribute A
- Definition
- The information gain of D relative to attribute A is the expected reduction in entropy due to splitting (“sorting”) on A:
  Gain(D, A) ≡ H(D) − Σ_{v ∈ values(A)} (|Dv| / |D|) · H(Dv)
  where Dv ≡ {x ∈ D : x.A = v}, the set of examples in D where attribute A has value v
- Idea: partition on A; scale entropy to the size of each subset Dv
- Which Attribute Is Best?
[Figure (repeated): A1: [29+, 35-] → True [21+, 5-], False [8+, 30-]; A2: [29+, 35-] → True [18+, 33-], False [11+, 2-]]
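Putting the pieces together, a hedged sketch computing the gain of each candidate split from the counts in the figure; it reuses the entropy() helper above, and gain_from_counts is a hypothetical name:

```python
def gain_from_counts(parent, children):
    """Information gain from (pos, neg) counts of a parent node and its children."""
    total = sum(parent)
    h_parent = entropy(parent[0] / total)
    expected = sum((p + n) / total * entropy(p / (p + n)) for p, n in children)
    return h_parent - expected

# Counts from the figure: [29+, 35-] split by A1 and by A2.
gain_a1 = gain_from_counts((29, 35), [(21, 5), (8, 30)])   # ~0.27 bits
gain_a2 = gain_from_counts((29, 35), [(18, 33), (11, 2)])  # ~0.12 bits
print(gain_a1, gain_a2)   # A1 removes more uncertainty, so A1 is the better split
```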