Title: Learning from Observation
1 Learning from Observation
- CS570 Lecture Notes
- by Jin Hyung Kim
- Computer Science Department
- KAIST
2 Contents
- Introduction
- Inductive learning
- Learning decision trees
3 Learning
- A change in the content and organization of a system's knowledge that enables it to improve its performance on a task (Simon)
- when it acquires new knowledge from the environment
- when it reorganizes its current knowledge
- Learning from Observation
- ranges from trivial memorization to the creation of scientific theories
- Inductive Inference
- a new, consistent interpretation of the data (observations)
- general conclusions from examples
- infer the association between input and output
- with some confidence
4 Depending on Available Feedback
- Supervised learning
- the environment provides examples of correct input/output pairs - Induction
- Unsupervised learning
- no hint at all about the correct outputs
- clustering or consistent interpretation
- Reinforcement learning
- receives no examples, but rewards or punishments at the end
- Transduction / Semi-supervised learning
- training with both labeled and unlabeled examples
5 Issues in Learning Algorithms
- Prior Knowledge
- prior knowledge can help in learning
- assumptions on parametric forms and ranges of values
- Incremental learning
- update old knowledge whenever a new example arrives
- Batch learning
- apply the learning algorithm to the entire set of examples
- Data Mining
- learning rules from large sets of data
- the availability of large databases allows the application of machine learning to real problems
6 Inductive Learning
- Given training examples
- correct input-output pairs
- Recover the unknown function from data generated by that function
- generalization ability for unseen inputs
- Classification: the output is discrete
- Concept learning: the output is binary
7 Classification of Inductive Learning
- Supervised Learning
- Unsupervised Learning
- no correct input-output pairs
- needs another source for determining correctness
- Reinforcement learning: yes/no answer only
- example: chess playing
- Clustering: group into clusters of common characteristics
- Map Learning: explore unknown territory
- Discovery Learning: uncover new relationships
8 Problems of Induction
- Example
- a pair (x, f(x)), where x is the input and f(x) is the output
- also called training examples
- Induction
- the task of finding an h that approximates f from given examples of f
- Hypothesis
- h, an approximation of f
- Bias
- a preference for one hypothesis over others
- How well will the hypothesis generalize?
9 Consistent Linear Hypotheses
- William of Ockham (also Occam), 1285?-1349?
- English scholastic philosopher
- Prefer the simplest hypothesis consistent with the data
- a definition of "simple" is not easy
- For a nondeterministic function, there is a tradeoff between the complexity of the hypothesis and the degree of fit
10 Theory of Inductive Inference
- Concept C ⊆ X
- Examples are given as (x, y), where x ∈ X and
- y = 1 if x ∈ C, y = 0 if x ∉ C
- Find F such that F(x) = 1 if x ∈ C, and F(x) = 0 if x ∉ C
- Inductive bias
- constraints on the hypothesis space
- a table of all observations is not a choice
- restricted hypothesis-space biases
- preference biases
- Occam's razor (Ockham): the simplest hypothesis is best
11 Probably Approximately Correct Theory of Inductive Inference
- Error(F) = Σ_{x∈D} Pr(x), where
- D = { x | (F(x) = 0 ∧ x ∈ C) ∨ (F(x) = 1 ∧ x ∉ C) }
- Approximately correct with ε: Error(F) ≤ ε
- Probably Approximately Correct (PAC):
- Pr(Error(F) > ε) < δ
- PAC whenever
- samples > ln(δ/|H|) / ln(1 − ε)
- for a given H, the required number of samples grows slowly
- however, H is large (all Boolean functions on n attributes: 2^(2^n))
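The sample bound above can be evaluated numerically. A minimal sketch (the function name and the example parameter values are illustrative, not from the slides):

```python
import math

def pac_samples(epsilon, delta, log_h):
    # Smallest integer m satisfying m > ln(delta/|H|) / ln(1 - epsilon),
    # where log_h = ln|H|. Both logs are negative, so the ratio is positive.
    bound = (math.log(delta) - log_h) / math.log(1 - epsilon)
    return math.floor(bound) + 1  # strict inequality

# All Boolean functions on n = 6 attributes: |H| = 2^(2^6), so ln|H| = 64 ln 2
m = pac_samples(epsilon=0.1, delta=0.05, log_h=64 * math.log(2))
```

Even for the huge hypothesis space of all Boolean functions on 6 attributes, the bound asks for only a few hundred samples, illustrating the "grows slowly" claim.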
12 Learning General Logical Descriptions
- Find (general) logical descriptions consistent with the sample data (examples)
- logical connections among examples and the goal (concept)
- Iteratively refine the hypothesis space by observing examples
- false-negative example
- H says negative, but the example is positive
- needs generalization
- false-positive example
- H says positive, but the example is negative
- needs specialization
13 Generalization / Specialization
- Specialization and generalization relationship
- C1 ⊆ C2, e.g. (blue ∧ book) ⊆ book
- the relationship is transitive
- Generalization example
- Hypothesis: ∀x boy(x) ∧ KAIST(x) → smart(x)
- Example (false negative): ¬boy(x1) ∧ KAIST(x1) ∧ smart(x1)
- Generalization: ∀x KAIST(x) → smart(x)
- Specialization example
- Hypothesis: ∀x KAIST(x) → smart(x)
- Example (false positive): boy(x2) ∧ KAIST(x2) ∧ ¬smart(x2)
- Specialization: ∀x ¬boy(x) ∧ KAIST(x) → smart(x)
14 Why Can Pure Inductive Inference Be Learning?
- Learning can be seen as learning the representation of a function
- a hypothesis is an approximate representation
- pure inductive inference finds the hypothesis
- Function representations
- logical sentences
- polynomials
- sets of weights (neural networks)
- ...
15 Logical Sentences
- Logic
- a target language for learning algorithms
- expressiveness and well-understood semantics
- a major tool for AI research
- Two approaches
- decision tree
- version space
16 Decision Tree
- A tree in which every internal node has a test and every leaf node has a decision
- Select the decision based on attribute values
- Example: Credit Card Approval

    salary ≥ 20,000             → Yes
    salary < 20,000 : education
        graduate                → Yes
        others                  → No

- Approve when 20,000 ≤ salary, or when 20,000 > salary and education = graduate
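The tree above can be written directly as nested tests; a small sketch (the function name is illustrative):

```python
def approve(salary, education):
    # Root test: salary threshold of 20,000
    if salary >= 20000:
        return "Yes"
    # salary < 20,000: test education at the second level
    return "Yes" if education == "graduate" else "No"
```

Each root-to-leaf path corresponds to one conjunction of attribute tests, matching the rule stated on the slide.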
17 Example: WillWait (Will we wait for a table at a restaurant?)
Figure 18.4
18 Expressiveness of Decision Trees
- Restriction
- a single object (implicitly)
- cannot represent tests relating two or more objects
- e.g., "Is there a cheaper restaurant nearby?"
- Fully expressive
- within the class of propositional languages
- any Boolean function can be represented as a decision tree
- Bad cases
- parity functions or majority functions
- an exponentially large decision tree is needed
19 Inducing Decision Trees from Examples
- Terminology
- Classification
- the value of the goal predicate (e.g., Yes/No)
- Examples
- positive/negative
- noise
- Training Set
- the example set used for inducing the decision tree
- Test Set
- the example set used for checking the quality of the decision tree
20 Examples for the Restaurant Domain
Figure 18.5
21 Inducing a Decision Tree from Examples
- Simple way
- one path for each example
- just memorization of the observations
- Extraction of patterns
- to describe a large number of cases in a concise way
- General principle of inductive learning: Occam's razor
- the most likely hypothesis is the simplest one that is consistent with all observations
- Finding the smallest decision tree is an intractable problem
- → use heuristics (greedy)
- Idea: most important attribute first
- examples in the resulting partitions are in one class, if possible
- discriminating power
- otherwise, make each partition as close to one class as possible
22 Splitting the Examples by Testing on Attributes
Patrons is a good attribute to test first.
Type is a bad attribute to test first.
23 Splitting the Examples by Testing on Attributes (cont.)
Hungry is a fairly good second test, given that Patrons is the first test.
24 Decision Tree Induced from the 12-Example Training Set
25 Decision Tree Learning
- Remember features that distinguish positive from negative examples
- Build a decision tree for classification
- non-terminal node: a question (attribute)
- each answer (attribute value) leads to a child
- terminal node: a class (concept)
- a path from root to terminal is a conjunction of features for that terminal's concept
- How do we implement Occam's razor?
26 Decision Tree Learning Algorithm (recursive)
- Mixed examples
- choose the best attribute and split
- All positive or all negative
- make a leaf node
- No examples left
- a condition that was not observed
- No attributes left, but still mixed
- incorrect example data (noise)
- the attributes don't describe the situation sufficiently
- the domain is truly nondeterministic
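The four cases above can be sketched as a recursive procedure. A minimal version, assuming each example is a dict with a "class" key; for brevity it takes the first remaining attribute rather than the best one (a real implementation would choose by information gain):

```python
from collections import Counter

def plurality(examples):
    # Majority class among the examples (ties broken arbitrarily)
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples):
    if not examples:                      # no examples left: unseen condition,
        return plurality(parent_examples) # fall back to the parent's majority
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                 # all positive or all negative: leaf
        return classes.pop()
    if not attributes:                    # attributes exhausted but still mixed:
        return plurality(examples)        # noise or a nondeterministic domain
    A = attributes[0]                     # placeholder for "choose best attribute"
    tree = {"attr": A, "branches": {}}
    for v in {e[A] for e in examples}:    # split on each observed value of A
        subset = [e for e in examples if e[A] == v]
        rest = [a for a in attributes if a != A]
        tree["branches"][v] = learn_tree(subset, rest, examples)
    return tree
```

The returned structure is either a class label (leaf) or a dict naming the test attribute and its branches.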
27 Building a Decision Tree
- Finding the smallest tree is NP-hard
- How many distinct decision trees are there with n Boolean attributes?
- = the number of Boolean functions
- = the number of distinct truth tables with 2^n rows = 2^(2^n)
- e.g., with 6 Boolean attributes: 18,446,744,073,709,551,616 trees
- Heuristic methods of acceptable performance
- best attribute first (DTBA)
28Decision Tree Building Algorithm(DTBA)
All Examples in a class ?
Choose an attribute A
quit
Apply DTBA recursively on each children node
Partition Examples by value of A
Create New nodes for each non-empty subset of
examples
Set the new nodes as the children of node
29 Choosing an Attribute
- Choose the best attribute first
- Definition of "best"
- examples in the resulting partitions are in one class, if possible
- otherwise, make each partition as close to one class as possible
- Which is better?
- (AAABB) or (AAAAB)?
- (AABBCC) or (ABBBCC)?
- prefer the split with the smaller disorder
30 Information Theory
- C. E. Shannon, 1948 and 1949 papers
- Information, I(e): the average number of binary questions required to identify an event e
- For a random variable E ∈ {e1, e2, ..., en}, take the probability-weighted average:
- H(E) = −Σ_i P(ei) log2 P(ei)
- called Entropy, H: a measure of disorder, randomness, information, uncertainty, and complexity of choice
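The entropy formula above computes directly; a minimal sketch:

```python
import math

def entropy(probs):
    # H(E) = -sum_i p_i * log2(p_i), with 0 * log 0 taken as 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin needs one binary question: H = 1 bit.
# Four equally likely outcomes need two questions: H = 2 bits.
```

The "binary questions" reading is visible in the examples: H equals the number of yes/no questions needed when the outcomes are equally likely.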
31 Information Gain
- If there are N examples of class A and P examples of class B, the set's entropy is H(N/(N+P), P/(N+P))
- Information Gain of attribute A, G(A)
- the difference between the entropy of the original set O and the weighted sum of the entropies of the subsets S1, S2, ..., Sn produced by partitioning on attribute A
- G(A) = H(O) − Σ_i (|Si|/|O|) H(Si)
- Best attribute A*
- A* = argmax_i G(Ai)
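G(A) can be computed from the positive/negative counts before and after the split. A sketch, checked against the restaurant example from the lecture (splitting 6 positive / 6 negative examples on Patrons: None = 0+/2−, Some = 4+/0−, Full = 2+/4−):

```python
import math

def entropy(pos, neg):
    # Entropy of a set containing pos positive and neg negative examples
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def gain(pos, neg, partitions):
    # partitions: list of (pos_i, neg_i) counts after splitting on attribute A.
    # G(A) = H(original) - sum_i |S_i|/|O| * H(S_i)
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(pos, neg) - remainder
```

For the Patrons split this yields about 0.541 bits, which is why Patrons is the best first test in the restaurant domain.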
32 Gain Ratio
- Gain favors attributes with a large number of values
- for an attribute D with a distinct value for each record, Info(D,T) is 0, so Gain(D,T) is maximal
- Use a ratio instead of Gain:
- GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
- SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, ..., |Tm|/|T|)
- where T1, T2, ..., Tm is the partition of T induced by the values of D
33 Noise and Over-fitting
- More than one class in a leaf node
- interpret it as a probability distribution
- To prevent over-fitting
- i.e., depending too much on training data that is not a good representative
- Decision tree pruning
- if the information gain is small, prune the subtree
- irrelevant attributes - chi-square pruning
- Cross-validation
- how well does the current hypothesis predict unseen data?
- training set / test set partition
34 Continuous-Valued Attributes
- Discretize
- find the threshold f0 that maximizes the gain, then recurse
- linear discriminant
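Finding the threshold f0 can be sketched as follows: try midpoints between consecutive sorted values and keep the one with the highest information gain (the function names are illustrative):

```python
import math

def H(labels):
    # Entropy of a multiset of class labels
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def best_threshold(xs, ys):
    # Candidate thresholds f0: midpoints between consecutive distinct values.
    # Returns the (threshold, gain) pair maximizing the information gain.
    pairs = sorted(zip(xs, ys))
    base = H(ys)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        g = base - (len(left) * H(left) + len(right) * H(right)) / len(pairs)
        if g > best[1]:
            best = (t, g)
    return best
```

The same procedure can then be applied recursively within each side of the split, as the slide suggests.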
35 Assessing the Performance
- 1. Collect a large set of examples
- 2. Divide it into two disjoint sets
- training set / test set
- 3. Generate a decision tree using the training set
- 4. Measure the decision tree's accuracy using the test set
- Repeat steps 1 to 4 for randomly selected training sets of different sizes
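The repeated split-and-measure loop above can be sketched generically; the learner and classifier are passed in as functions, and all names here are illustrative:

```python
import random

def evaluate(examples, learn, classify, train_frac=0.7, trials=10):
    # Repeatedly split into disjoint training/test sets and average the
    # test-set accuracy, as in steps 1-4 above.
    accs = []
    for _ in range(trials):
        shuffled = examples[:]
        random.shuffle(shuffled)
        k = int(len(shuffled) * train_frac)
        train, test = shuffled[:k], shuffled[k:]
        h = learn(train)                   # step 3: induce a hypothesis
        correct = sum(classify(h, e) == e["class"] for e in test)
        accs.append(correct / len(test))   # step 4: measure on unseen data
    return sum(accs) / len(accs)
```

Varying `train_frac` and plotting the resulting accuracy gives the learning curve discussed on the next slide.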
36 Performance Evaluation
- How do you know that h ≈ f?
- Computational learning theory
- bounds on h based on the number of training samples
- The learning curve shows prediction accuracy as a function of the number of observed examples
- Prediction quality increases as the training set grows