Title: Learning from Observation
1 Learning from Observation
- CS570 Lecture Notes
- by Jin Hyung Kim
- Computer Science Department
- KAIST
2 Contents
- Introduction
- Inductive learning
- Learning decision trees
3 Learning
- A change in the content and organization of a system's knowledge that enables it to improve its performance on a task (Simon)
- when it acquires new knowledge from the environment
- when it reorganizes its current knowledge
- Learning from Observation
- ranges from trivial memorization to the creation of scientific theories
- Inductive Inference
- a new, consistent interpretation of the data (observations)
- general conclusions from examples
- infer the association between input and output
- with some confidence
4 Depending on Available Feedback
- Supervised learning
- the environment provides examples of correct input/output pairs - Induction
- Unsupervised learning
- no hint at all about the correct outputs
- clustering or consistent interpretation
- Reinforcement learning
- receives no examples, but rewards or punishments at the end
- Transduction / Semi-supervised learning
- training with both labeled and unlabeled examples
5 Issues in Learning Algorithms
- Prior Knowledge
- prior knowledge can help in learning
- assumptions on parametric forms and ranges of values
- Incremental learning
- update old knowledge whenever a new example arrives
- Batch learning
- apply the learning algorithm to the entire set of examples
- Data Mining
- learning rules from large sets of data
- the availability of large databases allows the application of machine learning to real problems
6 Inductive Learning
- Given training examples
- correct input-output pairs
- Recover the unknown function from data generated by that function
- generalization ability for unseen inputs
- Classification: the output is discrete
- Concept learning: the output is binary
7 Classification of Inductive Learning
- Supervised Learning
- Unsupervised Learning
- no correct input-output pairs
- needs another source for determining correctness
- Reinforcement learning: yes/no answer only
- example: chess playing
- Clustering: group into clusters of common characteristics
- Map Learning: explore unknown territory
- Discovery Learning: uncover new relationships
8 Problems of Induction
- Example
- a pair (x, f(x)), where x is the input and f(x) is the output
- also called training examples
- Induction
- the task of finding an h that approximates f from given examples of f
- Hypothesis
- h, an approximation of f
- Bias
- a preference for one hypothesis over others
- How well will the hypothesis generalize?
9 Consistent Linear Hypotheses
- William of Ockham (also Occam), 1285?-1349?
- English scholastic philosopher
- Prefer the simplest hypothesis consistent with the data
- a definition of "simple" is not easy
- For a nondeterministic function, there is a tradeoff between the complexity of the hypothesis and the degree of fit
10 Theory of Inductive Inference
- Concept C ⊆ X
- Examples are given as (x, y), where x ∈ X and
- y = 1 if x ∈ C, y = 0 if x ∉ C
- Find F such that F(x) = 1 if x ∈ C, and F(x) = 0 if x ∉ C
- Inductive bias
- constraints on the hypothesis space
- a table of all observations is not a choice
- restricted hypothesis-space biases
- preference biases
- Occam's razor (Ockham): the simplest hypothesis is best
11 Probably Approximately Correct Theory of Inductive Inference
- Error(F) = Σ_{x∈D} Pr(x), where
- D = { x | (F(x) = 0 ∧ x ∈ C) ∨ (F(x) = 1 ∧ x ∉ C) }
- Approximately correct with ε: Error(F) ≤ ε
- Probably Approximately Correct (PAC):
- Pr(Error(F) > ε) < δ
- PAC whenever
- samples > ln(δ/|H|) / ln(1 − ε)
- for a given H, the required number of samples grows slowly
- however, H is large (all Boolean functions on n attributes: 2^(2^n))
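The sample bound above can be evaluated numerically. A minimal sketch (the function name and the example parameter values are illustrative, not from the slides):

```python
import math

def pac_samples(epsilon, delta, log_h):
    # Smallest integer m satisfying m > ln(delta/|H|) / ln(1 - epsilon),
    # where log_h = ln|H|. Both logs are negative, so the ratio is positive.
    bound = (math.log(delta) - log_h) / math.log(1 - epsilon)
    return math.floor(bound) + 1  # strict inequality

# All Boolean functions on n = 6 attributes: |H| = 2^(2^6), so ln|H| = 64 ln 2
m = pac_samples(epsilon=0.1, delta=0.05, log_h=64 * math.log(2))
```

Even for the huge hypothesis space of all Boolean functions on 6 attributes, the bound asks for only a few hundred samples, illustrating the "grows slowly" claim.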
12 Learning General Logical Descriptions
- Find (general) logical descriptions consistent with the sample data (examples)
- logical connections among examples and the goal (concept)
- Iteratively refine the hypothesis space by observing examples
- false-negative example
- H says negative, but the example is positive
- needs generalization
- false-positive example
- H says positive, but the example is negative
- needs specialization
13 Generalization / Specialization
- Specialization and generalization relationship
- C1 ⊆ C2, e.g. (blue ∧ book) ⊆ book
- the relationship is transitive
- Generalization example
- Hypothesis: ∀x boy(x) ∧ KAIST(x) → smart(x)
- Example (false negative): ¬boy(x1) ∧ KAIST(x1) ∧ smart(x1)
- Generalization: ∀x KAIST(x) → smart(x)
- Specialization example
- Hypothesis: ∀x KAIST(x) → smart(x)
- Example (false positive): boy(x2) ∧ KAIST(x2) ∧ ¬smart(x2)
- Specialization: ∀x ¬boy(x) ∧ KAIST(x) → smart(x)
14 Why Can Pure Inductive Inference Be Learning?
- Learning can be seen as learning the representation of a function
- a hypothesis is an approximate representation
- pure inductive inference finds the hypothesis
- Function representations
- logical sentences
- polynomials
- sets of weights (neural networks)
- ...
15 Logical Sentences
- Logic
- a target language for learning algorithms
- expressiveness and well-understood semantics
- a major tool for AI research
- Two approaches
- decision tree
- version space
16 Decision Tree
- A tree in which every internal node has a test and every leaf node has a decision
- Select the decision based on attribute values
- Example: Credit Card Approval

    salary ≥ 20,000             → Yes
    salary < 20,000 : education
        graduate                → Yes
        others                  → No

- Approve when 20,000 ≤ salary, or when 20,000 > salary and education = graduate
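The tree above can be written directly as nested tests; a small sketch (the function name is illustrative):

```python
def approve(salary, education):
    # Root test: salary threshold of 20,000
    if salary >= 20000:
        return "Yes"
    # salary < 20,000: test education at the second level
    return "Yes" if education == "graduate" else "No"
```

Each root-to-leaf path corresponds to one conjunction of attribute tests, matching the rule stated on the slide.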
17 Example: WillWait (Will we wait for a table at a restaurant?)
Figure 18.4
18 Expressiveness of Decision Trees
- Restriction
- a single object (implicitly)
- cannot represent tests relating two or more objects
- e.g., "Is there a cheaper restaurant nearby?"
- Fully expressive
- within the class of propositional languages
- any Boolean function can be represented as a decision tree
- Bad cases
- parity functions or majority functions
- an exponentially large decision tree is needed
19 Inducing Decision Trees from Examples
- Terminology
- Classification
- the value of the goal predicate (e.g., Yes/No)
- Examples
- positive/negative
- noise
- Training Set
- the example set used for inducing the decision tree
- Test Set
- the example set used for checking the quality of the decision tree
20 Examples for the Restaurant Domain
Figure 18.5
21 Inducing a Decision Tree from Examples
- Simple way
- one path for each example
- just memorization of the observations
- Extraction of patterns
- to describe a large number of cases in a concise way
- General principle of inductive learning: Occam's razor
- the most likely hypothesis is the simplest one that is consistent with all observations
- Finding the smallest decision tree is an intractable problem
- → use heuristics (greedy)
- Idea: most important attribute first
- examples in the resulting partitions are in one class, if possible
- discriminating power
- otherwise, make each partition as close to one class as possible
22 Splitting the Examples by Testing on Attributes
Patrons is a good attribute to test first.
Type is a bad attribute to test first.
23 Splitting the Examples by Testing on Attributes (cont.)
Hungry is a fairly good second test, given that Patrons is the first test.
24 Decision Tree Induced from the 12-Example Training Set
25 Decision Tree Learning
- Remember features that distinguish positive from negative examples
- Build a decision tree for classification
- non-terminal node: a question (attribute)
- each answer (attribute value) leads to a child
- terminal node: a class (concept)
- a path from root to terminal is a conjunction of features for that terminal's concept
- How do we implement Occam's razor?
26 Decision Tree Learning Algorithm (recursive)
- Mixed examples
- choose the best attribute and split
- All positive or all negative
- make a leaf node
- No examples left
- a condition that was not observed
- No attributes left, but still mixed
- incorrect example data (noise)
- the attributes don't describe the situation sufficiently
- the domain is truly nondeterministic
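The four cases above can be sketched as a recursive procedure. A minimal version, assuming each example is a dict with a "class" key; for brevity it takes the first remaining attribute rather than the best one (a real implementation would choose by information gain):

```python
from collections import Counter

def plurality(examples):
    # Majority class among the examples (ties broken arbitrarily)
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples):
    if not examples:                      # no examples left: unseen condition,
        return plurality(parent_examples) # fall back to the parent's majority
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                 # all positive or all negative: leaf
        return classes.pop()
    if not attributes:                    # attributes exhausted but still mixed:
        return plurality(examples)        # noise or a nondeterministic domain
    A = attributes[0]                     # placeholder for "choose best attribute"
    tree = {"attr": A, "branches": {}}
    for v in {e[A] for e in examples}:    # split on each observed value of A
        subset = [e for e in examples if e[A] == v]
        rest = [a for a in attributes if a != A]
        tree["branches"][v] = learn_tree(subset, rest, examples)
    return tree
```

The returned structure is either a class label (leaf) or a dict naming the test attribute and its branches.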
27 Building a Decision Tree
- Finding the smallest tree is NP-hard
- How many distinct decision trees are there with n Boolean attributes?
- = the number of Boolean functions
- = the number of distinct truth tables with 2^n rows = 2^(2^n)
- e.g., with 6 Boolean attributes: 18,446,744,073,709,551,616 trees
- Heuristic methods of acceptable performance
- best attribute first (DTBA)
28Decision Tree Building Algorithm(DTBA)
All Examples in a class ?
Choose an attribute A
quit
Apply DTBA recursively on each children node
Partition Examples by value of A
Create New nodes for each non-empty subset of
examples
Set the new nodes as the children of node
29 Choosing an Attribute
- Choose the best attribute first
- Definition of "best"
- examples in the resulting partitions are in one class, if possible
- otherwise, make each partition as close to one class as possible
- Which is better?
- (AAABB) or (AAAAB)?
- (AABBCC) or (ABBBCC)?
- prefer the split with the smaller disorder
30 Information Theory
- C. E. Shannon, 1948 and 1949 papers
- Information, I(e): the average number of binary questions required to identify an event e
- For a random variable E ∈ {e1, e2, ..., en}, take the probability-weighted average:
- H(E) = −Σ_i P(ei) log2 P(ei)
- called Entropy, H: a measure of disorder, randomness, information, uncertainty, and complexity of choice
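The entropy formula above computes directly; a minimal sketch:

```python
import math

def entropy(probs):
    # H(E) = -sum_i p_i * log2(p_i), with 0 * log 0 taken as 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin needs one binary question: H = 1 bit.
# Four equally likely outcomes need two questions: H = 2 bits.
```

The "binary questions" reading is visible in the examples: H equals the number of yes/no questions needed when the outcomes are equally likely.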
31 Information Gain
- If there are N examples of class A and P examples of class B, the set's entropy is H(N/(N+P), P/(N+P))
- Information Gain of attribute A, G(A)
- the difference between the entropy of the original set O and the weighted sum of the entropies of the subsets S1, S2, ..., Sn produced by partitioning on attribute A
- G(A) = H(O) − Σ_i (|Si|/|O|) H(Si)
- Best attribute A*
- A* = argmax_i G(Ai)
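G(A) can be computed from the positive/negative counts before and after the split. A sketch, checked against the restaurant example from the lecture (splitting 6 positive / 6 negative examples on Patrons: None = 0+/2−, Some = 4+/0−, Full = 2+/4−):

```python
import math

def entropy(pos, neg):
    # Entropy of a set containing pos positive and neg negative examples
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def gain(pos, neg, partitions):
    # partitions: list of (pos_i, neg_i) counts after splitting on attribute A.
    # G(A) = H(original) - sum_i |S_i|/|O| * H(S_i)
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(pos, neg) - remainder
```

For the Patrons split this yields about 0.541 bits, which is why Patrons is the best first test in the restaurant domain.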
32 Gain Ratio
- Gain favors attributes with a large number of values
- for an attribute D with a distinct value for each record, Info(D,T) is 0, so Gain(D,T) is maximal
- Use a ratio instead of Gain:
- GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
- SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, ..., |Tm|/|T|)
- where T1, T2, ..., Tm is the partition of T induced by the values of D
33 Noise and Over-fitting
- More than one class in a leaf node
- interpret it as a probability distribution
- To prevent over-fitting
- i.e., depending too much on training data that is not a good representative
- Decision tree pruning
- if the information gain is small, prune the subtree
- irrelevant attributes - chi-square pruning
- Cross-validation
- how well does the current hypothesis predict unseen data?
- training set / test set partition
34 Continuous-Valued Attributes
- Discretize
- find the threshold f0 that maximizes the gain, then recurse
- linear discriminant
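Finding the threshold f0 can be sketched as follows: try midpoints between consecutive sorted values and keep the one with the highest information gain (the function names are illustrative):

```python
import math

def H(labels):
    # Entropy of a multiset of class labels
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def best_threshold(xs, ys):
    # Candidate thresholds f0: midpoints between consecutive distinct values.
    # Returns the (threshold, gain) pair maximizing the information gain.
    pairs = sorted(zip(xs, ys))
    base = H(ys)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        g = base - (len(left) * H(left) + len(right) * H(right)) / len(pairs)
        if g > best[1]:
            best = (t, g)
    return best
```

The same procedure can then be applied recursively within each side of the split, as the slide suggests.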
35 Assessing the Performance
- 1. Collect a large set of examples
- 2. Divide it into two disjoint sets
- training set / test set
- 3. Generate a decision tree using the training set
- 4. Measure the decision tree's accuracy using the test set
- Repeat steps 1 to 4 for randomly selected training sets of different sizes
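The repeated split-and-measure loop above can be sketched generically; the learner and classifier are passed in as functions, and all names here are illustrative:

```python
import random

def evaluate(examples, learn, classify, train_frac=0.7, trials=10):
    # Repeatedly split into disjoint training/test sets and average the
    # test-set accuracy, as in steps 1-4 above.
    accs = []
    for _ in range(trials):
        shuffled = examples[:]
        random.shuffle(shuffled)
        k = int(len(shuffled) * train_frac)
        train, test = shuffled[:k], shuffled[k:]
        h = learn(train)                   # step 3: induce a hypothesis
        correct = sum(classify(h, e) == e["class"] for e in test)
        accs.append(correct / len(test))   # step 4: measure on unseen data
    return sum(accs) / len(accs)
```

Varying `train_frac` and plotting the resulting accuracy gives the learning curve discussed on the next slide.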
36 Performance Evaluation
- How do you know that h ≈ f?
- Computational learning theory
- bounds on h based on the number of training samples
- The learning curve shows prediction accuracy as a function of the number of observed examples
- Prediction quality increases as the training set grows