Learning Parameters

1
PGM Tirgul 11: Naïve Bayesian Classifier and Tree
Augmented Naïve Bayes (adapted from a tutorial by
Nir Friedman and Moises Goldszmidt)
2
The Classification Problem
Age  Sex  ChestPain  RestBP  Cholesterol  BloodSugar  ECG
MaxHeartRate  Angina  OldPeak  STSlope  Vessels  Thal  |  Heart Disease
  • From a data set describing objects by vectors of
    features and a class
  • Find a function F: features → class to classify a
    new object

Vector1: <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3>  Presence
Vector2: <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3>  Presence
Vector3: <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3>  Presence
Vector4: <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6>  Absence
Vector5: <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6>  Absence
Vector6: <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3>  Presence
Vector7: <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6>  Absence
3
Examples
  • Predicting heart disease
    • Features: cholesterol, chest pain, angina, age, etc.
    • Class: present, absent
  • Finding lemons in cars
    • Features: make, brand, miles per gallon, acceleration, etc.
    • Class: normal, lemon
  • Digit recognition
    • Features: matrix of pixel descriptors
    • Class: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0
  • Speech recognition
    • Features: signal characteristics, language model
    • Class: pause/hesitation, retraction

4
Approaches
  • Memory based
    • Define a distance between samples
    • Nearest neighbor, support vector machines
  • Decision surface
    • Find the best partition of the space
    • CART, decision trees
  • Generative models
    • Induce a model and impose a decision rule
    • Bayesian networks

5
Generative Models
  • Bayesian classifiers
  • Induce a probability distribution describing the data,
    P(A1, …, An, C)
  • Impose a decision rule. Given a new object
    <a1, …, an>, return
    c* = argmaxc P(C = c | a1, …, an)  (written out below)
  • We have shifted the problem to learning
    P(A1, …, An, C)
  • We will learn how to do this efficiently: learn a
    Bayesian network representation for P(A1, …, An, C)
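Writing the decision rule out (a standard Bayes-rule manipulation,
added here for clarity): since P(a1, …, an) does not depend on the
class, maximizing the conditional is the same as maximizing the
learned joint distribution:

c^* = \arg\max_{c} P(C = c \mid a_1, \ldots, a_n)
    = \arg\max_{c} \frac{P(a_1, \ldots, a_n, C = c)}{P(a_1, \ldots, a_n)}
    = \arg\max_{c} P(a_1, \ldots, a_n, C = c)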

6
Optimality of the decision rule: minimizing the error rate
  • Let ci be the true class, and let lj be the class
    returned by the classifier.
  • A decision by the classifier is correct if ci = lj,
    and in error if ci ≠ lj.
  • The error incurred by choosing label lj is the
    probability that the true class differs from lj
    (a sketch of the argument follows below).
  • Thus, had we had access to P, we would minimize the
    error rate by choosing the label li with the highest
    posterior probability, which is the decision rule of
    the Bayesian classifier.
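A minimal sketch of the zero-one-loss argument behind this rule,
where P(· | a1, …, an) is the posterior over classes given the
observed attributes:

P(\mathrm{error} \mid l_j, a_1, \ldots, a_n)
  = \sum_{i : c_i \neq l_j} P(c_i \mid a_1, \ldots, a_n)
  = 1 - P(l_j \mid a_1, \ldots, a_n)

so the error is minimized by choosing
l_j = \arg\max_{c} P(c \mid a_1, \ldots, a_n).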

7
Advantages of the Generative Model Approach
  • Output: a rank over the outcomes (likelihood of
    present vs. absent)
  • Explanation: what is the profile of a typical
    person with heart disease?
  • Missing values: handled both in training and testing
  • Value of information: if the person has high
    cholesterol and blood sugar, which other test
    should be conducted?
  • Validation: confidence measures over the model
    and its parameters
  • Background knowledge: priors and structure

8
Evaluating the performance of a classifier:
n-fold cross-validation
  • Partition the data set into n segments
  • Do n times
    • Train the classifier on the n-1 remaining segments
    • Test accuracy on the held-out segment
  • Compute statistics on the n runs
    • Variance
    • Mean accuracy
  • Accuracy on test data of size m:
    Acc = (number of correctly classified test instances) / m
    (a code sketch of the whole procedure follows the figure
    note below)

[Figure: the original data set partitioned into segments
D1, D2, D3, …, Dn; run i trains on all segments except Di
and tests on Di.]
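A minimal sketch of the procedure in Python; the data set and the
train_fn / predict_fn names are illustrative placeholders, not part
of the original slides:

import random

def n_fold_cross_validation(data, labels, train_fn, predict_fn, n=5, seed=0):
    """Estimate mean accuracy and variance over n train/test runs."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n] for i in range(n)]            # n disjoint segments
    accuracies = []
    for i in range(n):
        held_out = set(folds[i])
        train = [(data[j], labels[j]) for j in indices if j not in held_out]
        test = [(data[j], labels[j]) for j in folds[i]]
        model = train_fn(train)                           # train on n-1 segments
        correct = sum(predict_fn(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))            # Acc = #correct / m
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / n
    return mean, variance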
9
Advantages of Using a Bayesian Network
  • Efficiency in learning and query answering
  • Combine knowledge engineering and statistical
    induction
  • Algorithms for decision making, value of
    information, diagnosis and repair

Heart disease: accuracy 85% (data source: UCI repository)
10
Problems with BNs as classifiers
  • When evaluating a Bayesian network, we examine
    the likelihood of the model B given the data D
    and try to maximize it
  • When learning structure, we also add a penalty for
    structure complexity and seek a balance between
    the two terms (MDL or a variant; one common form of
    this score is sketched below). The following
    properties follow:
  • A Bayesian network minimizes the error over all
    the variables in the domain, and not necessarily
    the local error of the class given the attributes
    (OK with enough data).
  • Because of the penalty, a Bayesian network in
    effect looks at only a small subset of the variables
    that affect a given node (its Markov blanket)
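One common form of such a penalized score (an assumption here; the
slides do not spell out the exact variant) is the MDL score, which
trades the log-likelihood off against the number of parameters |B|
for N training instances:

\mathrm{Score}_{\mathrm{MDL}}(B \mid D) = LL(B \mid D) - \frac{\log N}{2}\,|B|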

11
Problems with BNs as classifiers (cont.)
  • Let's look closely at the likelihood term
    (the decomposition is sketched below)
  • The first term estimates just what we want: the
    probability of the class given the attributes.
    The second term estimates the joint probability
    of the attributes.
  • When there are many attributes, the second term
    starts to dominate (the magnitude of the log term
    grows for small probability values).
  • Why not use just the first term? Because then we can
    no longer factorize and the calculations become much
    harder.
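A sketch of the decomposition being discussed, reconstructed in
standard notation for a data set D of N instances:

LL(B \mid D) = \sum_{k=1}^{N} \log P_B\!\left(c^{(k)} \mid a_1^{(k)}, \ldots, a_n^{(k)}\right)
             + \sum_{k=1}^{N} \log P_B\!\left(a_1^{(k)}, \ldots, a_n^{(k)}\right)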

12
The Naïve Bayesian Classifier
Diabetes in Pima Indians (from UCI repository)
  • Fixed structure encoding the assumption that
    features are independent of each other given the
    class.
  • Learning amounts to estimating the parameters of
    P(Fi | C) for each feature Fi (a minimal code sketch
    follows).
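A minimal sketch of such a classifier in Python, assuming discrete
feature values; the function names are illustrative, and a small
pseudo-count is added only to avoid zero probabilities (the slides
themselves use plain frequency estimates):

import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_tuple, class_label) pairs.
    Returns class counts and per-feature value counts."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    for features, c in examples:
        class_counts[c] += 1
        for i, f in enumerate(features):
            feature_counts[(i, c)][f] += 1
    return class_counts, feature_counts

def classify(model, features):
    """Return argmax_c P(c) * prod_i P(f_i | c), computed in log space."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = math.log(nc / total)
        for i, f in enumerate(features):
            counts = feature_counts[(i, c)]
            k = len(counts) + 1              # crude smoothing over observed values
            score += math.log((counts[f] + 1) / (nc + k))
        if score > best_score:
            best_class, best_score = c, score
    return best_class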

13
The Naïve Bayesian Classifier (cont.)
  • What do we gain?
  • We ensure that in the learned network, the
    probability P(C | A1, …, An) takes every attribute
    into account.
  • We will show a polynomial-time algorithm for
    learning the network.
  • Estimates are robust: they consist of low-order
    statistics that require few instances.
  • It has proven to be a powerful classifier, often
    exceeding unrestricted Bayesian networks.

14
The Naïve Bayesian Classifier (cont.)
  • Common practice is to estimate the parameters by
    relative frequency counts (sketched below)
  • These estimates are identical to the MLE for
    multinomials
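A sketch of the frequency-count estimates referred to here, with
N(·) denoting empirical counts over the N training instances:

\hat{P}(C = c) = \frac{N(C = c)}{N}, \qquad
\hat{P}(F_i = f \mid C = c) = \frac{N(F_i = f,\, C = c)}{N(C = c)}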

15
Improving Naïve Bayes
  • Naïve Bayes encodes assumptions of independence
    that may be unreasonable
    • Are pregnancy and age independent given
      diabetes?
  • Problem: the same evidence may be incorporated
    multiple times (a rare glucose level and a rare
    insulin level over-penalize the class variable)
  • The success of naïve Bayes is attributed to
    • Robust estimation
    • Decisions may be correct even if probabilities are
      inaccurate
  • Idea: improve on naïve Bayes by weakening the
    independence assumptions
    • Bayesian networks provide the appropriate
      mathematical language for this task

16
Tree Augmented Naïve Bayes (TAN)
  • Approximate the dependence among features with a
    tree Bayes net
  • Tree induction algorithm
    • Optimality: maximum-likelihood tree
    • Efficiency: polynomial algorithm
    • Robust parameter estimation

17
Optimal Tree construction algorithm
  • The procedure of Chow and Liu constructs a tree
    structure BT that maximizes LL(BT | D)
    (a code sketch of the procedure follows this list)
  • Compute the mutual information between every pair
    of attributes
  • Build a complete undirected graph in which the
    vertices are the attributes and each edge is
    annotated with the corresponding mutual
    information as its weight
  • Build a maximum weighted spanning tree of this
    graph
  • Complexity: O(n²N) + O(n²) + O(n² log n) = O(n²N),
    where n is the number of attributes and N is the
    sample size
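A minimal sketch of the Chow-Liu steps in Python, assuming discrete
attributes given as parallel value lists; the function and variable
names are illustrative, not from the original slides:

import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) from two parallel value lists."""
    n = len(xs)
    pxy = Counter(zip(xs, ys)); px = Counter(xs); py = Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def chow_liu_tree(columns):
    """columns: list of n value-lists, one per attribute. Returns the edges
    of a maximum weighted spanning tree over the attributes (Prim's algorithm)."""
    n = len(columns)
    weights = {(i, j): mutual_information(columns[i], columns[j])
               for i, j in combinations(range(n), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # pick the heaviest edge crossing the current tree boundary
        i, j = max(((i, j) for (i, j) in weights
                    if (i in in_tree) != (j in in_tree)),
                   key=lambda e: weights[e])
        edges.append((i, j))
        in_tree.update((i, j))
    return edges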

18
Tree construction algorithm (cont.)
  • It is easy to plant the optimal tree in the TAN by
    revising the algorithm to use a conditional measure
    that takes conditioning on the class into account
    (sketched below)
  • This measure is the gain in the log-likelihood of
    adding Ai as a parent of Aj when C is already a
    parent.
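In the standard TAN formulation, this conditional measure is the
conditional mutual information between attributes given the class,
computed from the empirical probabilities:

I_{\hat{P}}(A_i; A_j \mid C) = \sum_{a_i, a_j, c} \hat{P}(a_i, a_j, c)\,
    \log \frac{\hat{P}(a_i, a_j \mid c)}{\hat{P}(a_i \mid c)\,\hat{P}(a_j \mid c)}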

19
Problem with TAN
  • When evaluating parameters we estimate the
    conditional probability P(Ai | Parents(Ai)). This
    is done by partitioning the data according to the
    possible values of Parents(Ai).
  • When a partition contains just a few instances we
    get an unreliable estimate
  • In naïve Bayes the partition was only on the
    values of the class variable (and we have to assume
    that is adequate)
  • In TAN we have twice the number of partitions and
    get unreliable estimates, especially for small
    data sets.
  • Solution: use a smoothed estimate (one form is
    sketched below), where s is the smoothing bias and
    is typically small.
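One standard form of this smoothed estimate (an assumption here,
following the usual TAN treatment) interpolates the conditional
frequency estimate toward the unconditional marginal of Ai:

\theta^{s}(a_i \mid \mathrm{Parents}(A_i)) =
    \frac{N(a_i, \mathrm{Parents}(A_i)) + s \cdot \hat{P}(a_i)}
         {N(\mathrm{Parents}(A_i)) + s}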

20
Performance: TAN vs. Naïve Bayes
25 data sets from the UCI repository (medical, signal
processing, financial, games). Accuracy based on
5-fold cross-validation; no parameter tuning.
[Scatter plot: Naïve Bayes accuracy (y-axis, 65-100)
versus TAN accuracy (x-axis, 65-100), one point per data set.]
21
Performance: TAN vs. C4.5
25 data sets from the UCI repository (medical, signal
processing, financial, games). Accuracy based on
5-fold cross-validation; no parameter tuning.
[Scatter plot: C4.5 accuracy (y-axis, 65-100) versus
TAN accuracy (x-axis, 65-100), one point per data set.]
22
Beyond TAN
  • Can we do better by learning a more flexible
    structure?
  • Experiment: learn a Bayesian network without
    restrictions on the structure

23
Performance: TAN vs. Bayesian Networks
25 data sets from the UCI repository (medical, signal
processing, financial, games). Accuracy based on
5-fold cross-validation; no parameter tuning.
[Scatter plot: unrestricted Bayesian network accuracy
versus TAN accuracy, one point per data set.]
24
Classification Summary
  • Bayesian networks provide a useful language to
    improve Bayesian classifiers
  • Lesson: we need to be aware of the task at hand,
    the amount of training data vs. the dimensionality
    of the problem, etc.
  • Additional benefits
  • Missing values
  • Compute the tradeoffs involved in finding out
    feature values
  • Compute misclassification costs
  • Recent progress
  • Combine generative probabilistic models, such as
    Bayesian networks, with decision surface
    approaches such as Support Vector Machines