Title: Learning Parameters
1. PGM Tirgul 11: Naïve Bayesian Classifier and Tree Augmented Naïve Bayes
(adapted from a tutorial by Nir Friedman and Moises Goldszmidt)
2. The Classification Problem

Attributes: Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRt, Angina, OldPeak; Class: Heart Disease

- From a data set describing objects by vectors of features and a class
- Find a function F: features → class to classify a new object

Vector1: <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3> → Presence
Vector2: <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3> → Presence
Vector3: <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3> → Presence
Vector4: <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6> → Absence
Vector5: <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6> → Absence
Vector6: <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3> → Presence
Vector7: <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6> → Absence
3. Examples

- Predicting heart disease
  - Features: cholesterol, chest pain, angina, age, etc.
  - Class: present, absent
- Finding lemons in cars
  - Features: make, brand, miles per gallon, acceleration, etc.
  - Class: normal, lemon
- Digit recognition
  - Features: matrix of pixel descriptors
  - Class: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0
- Speech recognition
  - Features: signal characteristics, language model
  - Class: pause/hesitation, retraction
4. Approaches

- Memory based
  - Define a distance between samples
  - Nearest neighbor, support vector machines
- Decision surface
  - Find the best partition of the space
  - CART, decision trees
- Generative models
  - Induce a model and impose a decision rule
  - Bayesian networks
5. Generative Models

- Bayesian classifiers
  - Induce a probability distribution describing the data: P(A1, …, An, C)
  - Impose a decision rule. Given a new object <a1, …, an>, choose
    c = argmax_c P(C = c | a1, …, an)
    (a code sketch of this rule follows below)
- We have shifted the problem to learning P(A1, …, An, C)
- We will learn how to do this efficiently: learn a Bayesian network representation for P(A1, …, An, C)
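As a concrete illustration of the decision rule (added here; not part of the original slides), a minimal Python sketch assuming a hypothetical joint model joint_prob(features, c) that returns P(a1, …, an, C = c):

    def classify(joint_prob, features, classes):
        """Return argmax_c P(C = c | a1, ..., an).

        Since P(c | a) = P(a, c) / P(a) and P(a) does not depend on c,
        maximizing the joint probability P(a, c) over c is equivalent.
        """
        return max(classes, key=lambda c: joint_prob(features, c))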
6. Optimality of the Decision Rule: Minimizing the Error Rate

- Let ci be the true class, and let lj be the class returned by the classifier.
- A decision by the classifier is correct if ci = lj, and in error if ci ≠ lj.
- The error incurred by choosing label lj is 1 - P(C = lj | a1, …, an).
- Thus, had we had access to P, we would minimize the error rate by choosing the li that maximizes P(C = li | a1, …, an), which is the decision rule for the Bayesian classifier (a short derivation follows below).
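A short worked derivation of this optimality claim (added for completeness; notation follows the slide):

    % Expected error rate of a classifier F under the true distribution P:
    \[
      \Pr(\mathrm{error})
        = \sum_{a_1,\dots,a_n} P(a_1,\dots,a_n)\,
          \bigl(1 - P\bigl(C = F(a_1,\dots,a_n) \mid a_1,\dots,a_n\bigr)\bigr)
    \]
    % Each summand is minimized independently by choosing
    \[
      F(a_1,\dots,a_n) = \arg\max_{c}\, P(C = c \mid a_1,\dots,a_n),
    \]
    % which is exactly the Bayesian classifier's decision rule.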
7. Advantages of the Generative Model Approach

- Output: a ranking over the outcomes, e.g. likelihood of present vs. absent
- Explanation: what is the profile of a typical person with heart disease?
- Missing values: handled both in training and in testing
- Value of information: if the person has high cholesterol and blood sugar, which other test should be conducted?
- Validation: confidence measures over the model and its parameters
- Background knowledge: priors and structure
8. Evaluating the Performance of a Classifier: n-Fold Cross-Validation

- Partition the data set into n segments
- Do n times (a code sketch follows the figure below):
  - Train the classifier on n-1 of the segments
  - Test accuracy on the held-out segment
- Compute statistics over the n runs
  - Mean accuracy
  - Variance
  - Accuracy on a test set of size m (Acc)
[Figure: the original data set partitioned into segments D1, D2, D3, …, Dn; in each of Run 1 through Run n a different segment is held out for testing while the remaining segments are used for training.]
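A minimal sketch of the procedure (added; train and accuracy are placeholder functions standing in for whatever learner is being evaluated):

    import random
    import statistics

    def n_fold_cross_validation(data, n, train, accuracy):
        """Estimate classifier accuracy by n-fold cross-validation.

        `train(examples)` is assumed to return a fitted classifier and
        `accuracy(classifier, examples)` its accuracy on `examples`.
        """
        data = list(data)
        random.shuffle(data)
        folds = [data[i::n] for i in range(n)]          # partition into n segments
        scores = []
        for i in range(n):
            test = folds[i]                              # held-out segment
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(accuracy(train(training), test))
        return statistics.mean(scores), statistics.variance(scores)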
9. Advantages of Using a Bayesian Network

- Efficiency in learning and query answering
- Combines knowledge engineering and statistical induction
- Algorithms for decision making, value of information, diagnosis and repair

Heart disease example: accuracy 85% (data source: UCI repository)
10. Problems with BNs as Classifiers

- When evaluating a Bayesian network, we examine the likelihood of the model B given the data D and try to maximize it.
- When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:
  - A Bayesian network minimizes the error over all the variables in the domain, and not necessarily the local error of the class given the attributes (OK with enough data).
  - Because of the penalty, a Bayesian network in effect looks only at the small subset of variables that affect a given node (its Markov blanket).
11. Problems with BNs as Classifiers (cont.)

- Let's look closely at the log-likelihood term. It decomposes over the training instances as
  LL(B | D) = Σ_i log P_B(c^i | a1^i, …, an^i) + Σ_i log P_B(a1^i, …, an^i)
- The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.
- When there are many attributes, the second term starts to dominate (the magnitude of the log grows as the probabilities get small).
- Why not use just the first term? We can no longer factorize, and calculations become much harder.
12. The Naïve Bayesian Classifier

Diabetes in Pima Indians (from the UCI repository)

- Fixed structure encoding the assumption that features are independent of each other given the class.
- Learning amounts to estimating the parameters P(Fi | C) for each feature Fi.
13. The Naïve Bayesian Classifier (cont.)

- What do we gain?
  - We ensure that in the learned network, the probability P(C | A1, …, An) takes every attribute into account.
  - We will show a polynomial-time algorithm for learning the network.
  - Estimates are robust: they consist of low-order statistics that require few instances.
  - It has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.
14. The Naïve Bayesian Classifier (cont.)

- Common practice is to estimate the parameters from frequency counts in the training data, e.g.
  P̂(Fi = f | C = c) = N(Fi = f, C = c) / N(C = c)   and   P̂(C = c) = N(C = c) / N
  (a code sketch follows below).
- These estimates are identical to the MLE for multinomials.
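A minimal sketch of these counting estimates and the resulting classifier (added; assumes discrete features given as tuples, with no smoothing):

    from collections import Counter, defaultdict

    def train_naive_bayes(examples):
        """MLE estimates for a naive Bayes model from (features, label) pairs.

        Returns class priors P(C) and per-feature conditionals P(F_i | C),
        both as plain count-based frequencies.
        """
        class_counts = Counter(label for _, label in examples)
        feature_counts = defaultdict(Counter)   # (i, label) -> Counter over values of F_i
        for features, label in examples:
            for i, value in enumerate(features):
                feature_counts[(i, label)][value] += 1

        n = len(examples)
        prior = {c: class_counts[c] / n for c in class_counts}
        cond = {
            key: {v: cnt / class_counts[key[1]] for v, cnt in counter.items()}
            for key, counter in feature_counts.items()
        }
        return prior, cond

    def predict(prior, cond, features):
        """Return argmax_c P(c) * prod_i P(f_i | c)."""
        def score(c):
            p = prior[c]
            for i, value in enumerate(features):
                p *= cond[(i, c)].get(value, 0.0)
            return p
        return max(prior, key=score)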
15. Improving Naïve Bayes

- Naïve Bayes encodes assumptions of independence that may be unreasonable
  - Are pregnancy and age independent given diabetes?
  - Problem: the same evidence may be incorporated multiple times (a rare glucose level and a rare insulin level over-penalize the class variable)
- The success of naïve Bayes is attributed to
  - Robust estimation
  - The decision may be correct even if the probabilities are inaccurate
- Idea: improve on naïve Bayes by weakening the independence assumptions
  - Bayesian networks provide the appropriate mathematical language for this task
16. Tree Augmented Naïve Bayes (TAN)

- Approximate the dependence among features with a tree Bayes net
- Tree induction algorithm
  - Optimality: maximum likelihood tree
  - Efficiency: polynomial algorithm
- Robust parameter estimation
17. Optimal Tree Construction Algorithm

- The procedure of Chow and Liu constructs a tree structure BT that maximizes LL(BT | D):
  - Compute the mutual information between every pair of attributes.
  - Build a complete undirected graph in which the vertices are the attributes and each edge is weighted by the corresponding mutual information.
  - Build a maximum weighted spanning tree of this graph.
- Complexity: O(n²N) + O(n²) + O(n² log n) = O(n²N), where n is the number of attributes and N is the sample size. A sketch of the procedure follows below.
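A minimal sketch of the Chow-Liu procedure (added; assumes discrete data given as a list of attribute tuples, and uses a simple Prim-style greedy loop rather than an optimized spanning-tree implementation):

    import math
    from collections import Counter

    def mutual_information(data, i, j):
        """Empirical mutual information I(A_i; A_j) from a list of attribute tuples."""
        n = len(data)
        pij = Counter((row[i], row[j]) for row in data)
        pi = Counter(row[i] for row in data)
        pj = Counter(row[j] for row in data)
        return sum(
            (c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
            for (a, b), c in pij.items()
        )

    def chow_liu_tree(data, num_attrs):
        """Edges of a maximum-weight spanning tree over the attributes,
        with edges weighted by pairwise mutual information."""
        weights = {
            (i, j): mutual_information(data, i, j)
            for i in range(num_attrs) for j in range(i + 1, num_attrs)
        }
        in_tree, edges = {0}, []
        while len(in_tree) < num_attrs:
            # pick the heaviest edge connecting the current tree to a new attribute
            i, j = max(
                (e for e in weights if (e[0] in in_tree) != (e[1] in in_tree)),
                key=lambda e: weights[e],
            )
            edges.append((i, j))
            in_tree.update((i, j))
        return edges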
18. Tree Construction Algorithm (cont.)

- It is easy to plant the optimal tree in the TAN by revising the algorithm to use the conditional mutual information I(Ai; Aj | C), which takes conditioning on the class into account (see the sketch below).
- This measures the gain in the log-likelihood of adding Ai as a parent of Aj when C is already a parent.
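A sketch of that conditional measure, the empirical conditional mutual information I(Ai; Aj | C), following the same data conventions as the Chow-Liu sketch above (the class label is assumed to sit at index c_idx of each row):

    import math
    from collections import Counter

    def conditional_mutual_information(data, i, j, c_idx):
        """Empirical I(A_i; A_j | C), the edge weight used for TAN tree construction."""
        n = len(data)
        pijc = Counter((row[i], row[j], row[c_idx]) for row in data)
        pic = Counter((row[i], row[c_idx]) for row in data)
        pjc = Counter((row[j], row[c_idx]) for row in data)
        pc = Counter(row[c_idx] for row in data)
        return sum(
            (cnt / n) * math.log(
                (cnt / n) * (pc[c] / n) / ((pic[(a, c)] / n) * (pjc[(b, c)] / n))
            )
            for (a, b, c), cnt in pijc.items()
        )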
19. Problem with TAN

- When estimating parameters we estimate the conditional probability P(Ai | Parents(Ai)). This is done by partitioning the data according to the possible values of Parents(Ai).
- When a partition contains just a few instances we get an unreliable estimate.
- In naïve Bayes the partition was only on the values of the class (and we have to assume that is adequate).
- In TAN we have twice the number of partitions and get unreliable estimates, especially for small data sets.
- Solution: smooth each conditional estimate toward the unconditional frequency, for example
  P̂_s(ai | pa_i) = (N(pa_i) · P̂(ai | pa_i) + s · P̂(ai)) / (N(pa_i) + s),
  where s is the smoothing bias and typically small.
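A minimal sketch of a smoothed estimate of this form (the exact formula above is a reconstruction, so treat this as an assumption rather than the slide's definition):

    def smoothed_conditional(count_joint, count_parent, marginal, s):
        """Smoothed estimate of P(a_i | pa_i).

        count_joint  : N(a_i, pa_i), instances matching the value and its parent configuration
        count_parent : N(pa_i), instances matching the parent configuration
        marginal     : unconditional frequency estimate P_hat(a_i)
        s            : smoothing bias (typically small); when count_parent is small
                       the estimate falls back toward the unconditional frequency
        """
        return (count_joint + s * marginal) / (count_parent + s)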
20. Performance: TAN vs. Naïve Bayes

25 data sets from the UCI repository (medical, signal processing, financial, games). Accuracy based on 5-fold cross-validation; no parameter tuning.

[Scatter plot: per-data-set accuracy, TAN on the x-axis vs. Naïve Bayes on the y-axis, both ranging from 65 to 100.]
21. Performance: TAN vs. C4.5

25 data sets from the UCI repository (medical, signal processing, financial, games). Accuracy based on 5-fold cross-validation; no parameter tuning.

[Scatter plot: per-data-set accuracy, TAN on the x-axis vs. C4.5 on the y-axis, both ranging from 65 to 100.]
22. Beyond TAN

- Can we do better by learning a more flexible structure?
- Experiment: learn a Bayesian network without restrictions on the structure
23. Performance: TAN vs. Bayesian Networks

25 data sets from the UCI repository (medical, signal processing, financial, games). Accuracy based on 5-fold cross-validation; no parameter tuning.

[Scatter plot: per-data-set accuracy, TAN on one axis vs. unrestricted Bayesian networks on the other.]
24. Classification Summary

- Bayesian networks provide a useful language for improving Bayesian classifiers
- Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.
- Additional benefits
  - Handling missing values
  - Computing the tradeoffs involved in finding out feature values
  - Computing misclassification costs
- Recent progress
  - Combining generative probabilistic models, such as Bayesian networks, with decision surface approaches such as Support Vector Machines