1
  • Learning from Observation
  • CS570 Lecture Notes
  • by Jin Hyung Kim
  • Computer Science Department
  • KAIST

2
Contents
  • Introduction
  • Inductive learning
  • Learning decision trees

3
Learning
  • Change in the content and organization of a
    system's knowledge that enables it to improve its
    performance on a task - Simon
  • When it acquires new knowledge from the environment
  • When it reorganizes its current knowledge
  • Learning from Observation
  • from trivial memorization to the creation of
    scientific theories
  • Inductive Inference
  • New consistent interpretation of data
    (observations)
  • General conclusions from examples
  • Infer associations between input and output
  • with some confidence

4
Depending on Available Feedback
  • supervised learning
  • Environment provides examples of correct
    input/output pairs
  • Induction
  • unsupervised learning
  • No hint at all about the correct outputs
  • Clustering or consistent interpretation
  • reinforcement learning
  • Receives no examples, but rewards or punishments
    at the end
  • Transduction / semi-supervised learning
  • Training with both labeled and unlabeled examples

5
Issues in Learning Algorithms
  • Prior Knowledge
  • Prior knowledge can help in learning.
  • Assumptions on parametric forms and ranges of
    values
  • Incremental learning
  • Update old knowledge whenever a new example arrives
  • Batch learning
  • Apply the learning algorithm to the entire set of
    examples
  • Data Mining
  • Learning rules from large sets of data
  • Availability of large databases allows the
    application of machine learning to real problems

6
Inductive Learning
  • given training examples
  • correct input-output pairs
  • recover the unknown function from data generated by
    that function
  • generalization ability for unseen inputs
  • classification: the function is discrete
  • concept learning: the output is binary

7
Classification of Inductive Learning
  • Supervised Learning
  • Unsupervised Learning
  • No correct input-output pairs
  • needs another source for determining correctness
  • reinforcement learning: yes/no answer only
  • example: chess playing
  • Clustering: group into clusters of common
    characteristics
  • Map learning: explore unknown territory
  • Discovery learning: uncover new relationships

8
Problems of Induction
  • Example
  • Pair (x, f(x)), where x is the input and f(x) is
    the output
  • Also called training examples
  • Induction
  • Task: find h that approximates f from the given
    examples of f
  • Hypothesis
  • h, an approximation of f
  • Bias
  • Preference for one hypothesis over others
  • How well will the hypothesis generalize?

9
Consistent Linear Hypotheses
  • William of Ockham (also Occam), 1285?-1349?
  • English scholastic philosopher
  • Prefer the simplest hypothesis consistent with the
    data
  • Defining "simple" is not easy
  • For nondeterministic functions, there is a tradeoff
    between the complexity of the hypothesis and the
    degree of fit

10
Theory of Inductive Inference
  • Concept C ⊆ X
  • Examples are given as (x, y), where x ∈ X and
  • y = 1 if x ∈ C, y = 0 if x ∉ C
  • Find F such that F(x) = 1 if x ∈ C, and F(x) = 0
    if x ∉ C
  • Inductive bias
  • constraints on the hypothesis space
  • A table of all observations is not an option
  • Restricted hypothesis-space biases
  • Preference biases
  • Occam's razor (Ockham): the simplest hypothesis is
    best

11
Probably Approximately Correct
Theory of Inductive Inference
  • Error(F) = Σx∈D Pr(x), where
  • D = { x : (F(x) = 0 ∧ x ∈ C) ∨ (F(x) = 1 ∧ x ∉ C) }
  • Approximately correct with error bound ε:
    Error(F) ≤ ε
  • Probably Approximately Correct (PAC):
  • Pr(Error(F) > ε) < δ
  • PAC whenever
  • samples ≥ ln(δ / |H|) / ln(1 - ε)
  • for a given H, the required number of samples grows
    slowly
  • However, H is large (there are 2^(2^n) Boolean
    functions on n attributes)
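
A quick numerical check of the bound above, as a minimal Python sketch
(the slides give no code; the n = 6, ε = 0.1, δ = 0.05 values are
illustrative assumptions):

    import math

    def pac_sample_bound(n_attributes, epsilon, delta):
        """Samples needed so that Pr(Error(F) > epsilon) < delta when H is
        all Boolean functions on n attributes, i.e. |H| = 2**(2**n)."""
        ln_h = (2 ** n_attributes) * math.log(2)      # ln|H| = 2^n * ln 2
        # From |H| * (1 - eps)^m <= delta:  m >= ln(delta/|H|) / ln(1 - eps)
        return (math.log(delta) - ln_h) / math.log(1.0 - epsilon)

    print(pac_sample_bound(6, 0.1, 0.05))   # about 450 samples
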
12
Learning General Logical Descriptions
  • Find (general) logical descriptions consistent
    with the sample data (examples)
  • logical connections among examples and the
    goal (concept)
  • Iteratively refine the hypothesis space by
    observing examples
  • false-negative example
  • H says negative, but the example is positive
  • needs generalization
  • false-positive example
  • H says positive, but the example is negative
  • needs specialization

13
Generalization / Specialization
  • Specialization and generalization relationship
  • C1 ⊆ C2, e.g. (blue ∧ book) ⊆ book
  • Transitive relationships hold
  • Generalization example
  • Hypothesis: ∀x boy(x) ∧ KAIST(x) ⇒ smart(x)
  • example: ¬boy(x1) ∧ KAIST(x1) ∧ smart(x1)
  • Generalization: ∀x KAIST(x) ⇒ smart(x)
  • Specialization example
  • Hypothesis: ∀x KAIST(x) ⇒ smart(x)
  • example: boy(x2) ∧ KAIST(x2) ∧ ¬smart(x2)
  • Specialization: ∀x ¬boy(x) ∧ KAIST(x) ⇒ smart(x)

14
Why Can Pure Inductive Inference Be Learning?
  • Learning can be seen as learning the
    representation of a function.
  • A hypothesis is an approximate representation.
  • Pure inductive inference finds the hypothesis.
  • Function representation
  • Logical sentences
  • Polynomials
  • Set of weights (Neural Networks)

15
Logical Sentences
  • Logic
  • Target language for learning algorithms
  • Expressiveness and well-understood semantics
  • A major tool for AI research
  • Two approaches
  • Decision tree
  • Version-space

16
Decision Tree
  • A tree in which every internal node has a test and
    every leaf node has a decision
  • Select decisions based on attribute values

Example: Credit Card Approval
  • Root tests salary: if salary ≥ 20,000, approve (Yes)
  • Otherwise test education: graduate → Yes,
    others → No
  • i.e., a card is also approved when 20,000 > salary
    and education = graduate
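
The same tree written as nested tests, as a minimal Python sketch (the
function name is mine; the attribute names and threshold follow the
figure above):

    def approve_credit_card(salary, education):
        """Decision tree from the figure: test salary first, then education."""
        if salary >= 20000:
            return "Yes"
        # salary < 20,000: fall through to the education test
        if education == "graduate":
            return "Yes"
        return "No"

    print(approve_credit_card(15000, "graduate"))   # Yes
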
17
Example: WillWait (will we wait for a table at a
restaurant?)
Figure 18.4
18
Expressiveness of Decision Trees
  • Restriction
  • Single object (implicitly)
  • Cannot represent tests relating two or more objects
  • "Is there a cheaper restaurant nearby?"
  • Fully expressive
  • within the class of propositional languages
  • Any Boolean function can be represented as a
    decision tree
  • Bad cases
  • Parity function or majority function
  • Exponentially large decision tree needed

19
Inducing Decision Trees from Examples
  • Terminology
  • Classification
  • The value of the goal predicate (e.g., Yes/No)
  • Examples
  • Positive/Negative
  • Noise
  • Training set
  • Example set used for inducing the decision tree
  • Test set
  • Example set used for checking the quality of the
    decision tree

20
Examples for Restaurant Domain
Figure 18.5
21
Inducing a Decision Tree from Examples
  • Simple way
  • One path for each example
  • Just memorization of observations
  • Extracting patterns
  • To describe a large number of cases in a concise
    way
  • General principle of inductive learning:
    Ockham's razor
  • The most likely hypothesis is the simplest one
    that is consistent with all observations.
  • Finding the smallest decision tree is an
    intractable problem
  • => use heuristics (greedy)
  • Idea: most important attribute first
  • Examples in the resulting partitions are in one
    class, if possible
  • discriminating power
  • Otherwise, make each partition as close to one
    class as possible

22
Splitting the examples by testing on attributes
Patrons is a good attribute to test first
Type is a bad attribute to test first
23
Splitting the examples by testing on attributes
(cont)
Hungry is a fairly good second test, given that
Patrons is the first test
24
Decision tree induced from the 12-example
training set
25
Decision Tree Learning
  • Remember features that distinguish positive from
    negative examples
  • Build a decision tree for classification
  • Non-terminal node: question (attribute)
  • answer (attribute value) leads to a child node
  • Terminal node: class (concept)
  • Path from root to terminal: conjunction of
    features for the terminal's concept
  • How to implement Occam's razor?

26
Decision Tree Learning Algorithm (recursive)
  • Mixed examples
  • choose the best attribute and split
  • All positive or all negative
  • leaf node
  • No examples left
  • condition not observed in the training data
  • No attributes left, but examples are still mixed
  • incorrect example data (noise)
  • attributes don't describe the situation sufficiently
  • the domain is truly nondeterministic

27
Building Decision Tree
  • Finding the smallest tree is NP-hard
  • How many distinct decision trees are there with n
    Boolean attributes?
  • number of Boolean functions
  • number of distinct truth tables with 2^n rows:
    2^(2^n)
  • e.g., with 6 Boolean attributes:
    18,446,744,073,709,551,616 trees
  • Heuristic methods of acceptable performance
  • best attribute first (DTBA)
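
A one-line check of that count (Python, just to verify the arithmetic):

    # Number of Boolean functions on n attributes = number of truth tables
    # with 2^n rows, each row labeled 0 or 1
    n = 6
    print(2 ** (2 ** n))   # 18446744073709551616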

28
Decision Tree Building Algorithm (DTBA)
  • If all examples at the node are in one class, quit
    (the node becomes a leaf)
  • Otherwise, choose an attribute A
  • Partition the examples by the value of A
  • Create a new node for each non-empty subset of
    examples
  • Set the new nodes as the children of the node
  • Apply DTBA recursively on each child node
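
A compact Python sketch of this recursion (names are mine; the
choose_attribute heuristic is passed in and is filled in by the
information-gain criterion introduced on the following slides):

    from collections import Counter

    def dtba(examples, attributes, choose_attribute, default=None):
        """Recursive decision-tree building (DTBA) sketch.
        examples: list of (attribute_dict, label) pairs."""
        if not examples:                      # no examples left: unobserved condition
            return default
        labels = [label for _, label in examples]
        if len(set(labels)) == 1:             # all examples in one class: leaf node
            return labels[0]
        if not attributes:                    # mixed, but no attributes left (noise etc.)
            return Counter(labels).most_common(1)[0][0]
        a = choose_attribute(examples, attributes)        # pick the "best" attribute
        majority = Counter(labels).most_common(1)[0][0]
        tree = {"attribute": a, "children": {}}
        for value in {ex[a] for ex, _ in examples}:       # partition by the value of a
            subset = [(ex, label) for ex, label in examples if ex[a] == value]
            tree["children"][value] = dtba(
                subset, [x for x in attributes if x != a], choose_attribute, majority)
        return tree
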
29
Choosing Attribute
  • Choose the best attribute first
  • Definition of "best"
  • examples in the resulting partitions are in one
    class, if possible
  • otherwise, make each partition as close to one
    class as possible
  • Which is better?
  • (AAABB) or (AAAAB)?
  • (AABBCC) or (ABBBCC)?
  • smaller disorder is better

30
Information Theory
  • C. E. Shannon, 1948-1949 papers
  • Information, I(e): average number of binary
    questions required to identify an event e
  • For a random variable E over {e1, e2, ..., en},
    take the probability-weighted average
    H(E) = -Σi P(ei) log2 P(ei)
  • H is called entropy: a measure of disorder,
    randomness, information, uncertainty, and the
    complexity of choice
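
A minimal Python sketch of the entropy computation just defined (the
function name is mine):

    import math

    def entropy(probabilities):
        """H = -sum(p * log2(p)) over the outcome probabilities (zero terms skipped)."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))    # 1.0 bit: a fair binary choice
    print(entropy([1.0]))         # 0.0 bits: no uncertainty
    print(entropy([3/5, 2/5]))    # ~0.97 bits: the (AAABB) split from the previous slide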

31
Information Gain
  • If a set contains N examples of one class and P
    examples of the other, its entropy is
    H(O) = -(N/(N+P)) log2(N/(N+P)) - (P/(N+P)) log2(P/(N+P))
  • Information gain of an attribute A, G(A):
  • the difference between the entropy of the original
    set O and the weighted sum of the entropies of the
    subsets S1, S2, ..., Sn obtained by partitioning
    on attribute A
  • G(A) = H(O) - Σi (|Si| / |O|) H(Si)
  • Best attribute A*
  • A* = argmax_Ai G(Ai)
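
A self-contained Python sketch of the gain computation (the helper
names are mine; the tiny data set is illustrative, not from the
slides):

    import math

    def entropy(labels):
        """Entropy of a list of class labels, in bits."""
        total = len(labels)
        return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                    for c in set(labels))

    def information_gain(examples, attribute):
        """G(A) = H(O) - sum_i (|Si|/|O|) * H(Si), examples = [(attrs, label), ...]."""
        labels = [label for _, label in examples]
        remainder = 0.0
        for value in {ex[attribute] for ex, _ in examples}:
            subset = [label for ex, label in examples if ex[attribute] == value]
            remainder += (len(subset) / len(examples)) * entropy(subset)
        return entropy(labels) - remainder

    data = [({"Patrons": "Full"}, "No"), ({"Patrons": "Some"}, "Yes"),
            ({"Patrons": "Some"}, "Yes"), ({"Patrons": "None"}, "No")]
    print(information_gain(data, "Patrons"))   # 1.0: the split separates the classes perfectly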

32
Gain Ratio
  • Gain favors attributes with a large number of
    values.
  • For an attribute D that has a distinct value for
    each record, the remaining entropy Info(D,T) is 0,
    so Gain(D,T) is maximal.
  • Use a ratio instead of the raw gain:
  • GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
  • SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, ..., |Tm|/|T|)
  • where T1, T2, ..., Tm is the partition of T
    induced by the values of D.
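
The same ratio in code (a minimal sketch; the partition sizes below
are illustrative):

    import math

    def split_info(partition_sizes):
        """SplitInfo = I(|T1|/|T|, ..., |Tm|/|T|): entropy of the partition proportions."""
        total = sum(partition_sizes)
        return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

    def gain_ratio(gain, partition_sizes):
        """GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)."""
        return gain / split_info(partition_sizes)

    # A gain of 1.0 achieved by splitting 4 records into subsets of sizes 1, 2, 1
    print(gain_ratio(1.0, [1, 2, 1]))   # 1.0 / 1.5 = 0.667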

33
Noise and Over-fitting
  • More than one class in a leaf node
  • interpret it as a probability distribution
  • To prevent over-fitting
  • i.e., depending too much on training data that is
    not a good representative of the domain
  • Decision tree pruning
  • if the information gain is small, prune the subtree
  • Irrelevant attributes: chi-square pruning
  • Cross-validation
  • how well does the current hypothesis predict unseen
    data?
  • training set / test set partition

34
Continuous Valued Attribute
  • Discretize
  • Find the threshold f0 that maximizes the gain, then
    recurse on the resulting intervals
  • linear discriminant

(figure: candidate threshold f0 on the continuous attribute axis)
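
A minimal sketch of the threshold search (Python; trying midpoints
between consecutive sorted values is a common choice, not something
the slide specifies):

    import math

    def entropy(labels):
        total = len(labels)
        return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                    for c in set(labels))

    def best_threshold(values, labels):
        """Find the split point f0 on a continuous attribute that maximizes
        information gain, trying midpoints between consecutive sorted values."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best_f0, best_gain = None, -1.0
        for i in range(1, len(pairs)):
            f0 = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [l for v, l in pairs if v < f0]
            right = [l for v, l in pairs if v >= f0]
            remainder = (len(left) / len(pairs)) * entropy(left) \
                      + (len(right) / len(pairs)) * entropy(right)
            if base - remainder > best_gain:
                best_f0, best_gain = f0, base - remainder
        return best_f0, best_gain

    print(best_threshold([1.0, 2.0, 3.0, 4.0], ["No", "No", "Yes", "Yes"]))   # (2.5, 1.0)
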
35
Assessing the Performance
  • Collect a large set of examples
  • Divide it into two disjoint sets
  • training set / test set
  • Generate a decision tree using the training set
  • Measure the decision tree's accuracy on the test set
  • Repeat steps 1 to 4 for randomly selected training
    sets of different sizes
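
A sketch of this assessment loop (Python; train_tree and accuracy are
placeholders for whatever tree builder and scoring function are used,
e.g. the DTBA sketch earlier):

    import random

    def learning_curve(examples, train_tree, accuracy, sizes, trials=20):
        """For each training-set size, repeatedly split the data at random,
        train on the training part, and score on the held-out part."""
        curve = []
        for size in sizes:
            scores = []
            for _ in range(trials):
                shuffled = random.sample(examples, len(examples))
                train, test = shuffled[:size], shuffled[size:]
                tree = train_tree(train)
                scores.append(accuracy(tree, test))
            curve.append((size, sum(scores) / len(scores)))
        return curve   # prediction accuracy as a function of training-set size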

36
Performance Evaluation
  • How do you know that h ≈ f?
  • Computational learning theory
  • bounds on h based on the number of training
    samples
  • A learning curve shows the prediction accuracy as a
    function of the number of observed examples
  • Prediction quality increases as the training set
    grows