Title: Model Averaging with Discrete Bayesian Network Classifiers
1. Model Averaging with Discrete Bayesian Network Classifiers
- Denver Dash and Gregory F. Cooper
- In the Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (AISTATS 2003)
2. Contents
- Model averaging over a class of discrete Bayesian network classifiers
  - A partial ordering and a bounded in-degree k.
- Theoretical results (for N nodes)
  - The class contains super-exponentially many distinct structures.
  - The exact summation can nevertheless be performed in time polynomial in N (for fixed k).
  - Approximate averaging in O(N) time.
- Experiments
  - The technique can be beneficial even when the generating distribution is not a member of the class.
  - Characterize the performance over several parameters.
3. Bayesian network classifiers
- Naïve Bayes classifier
- General Bayesian network classifiers
[Figure: naïve Bayes structure — the class node C is the sole parent of the feature nodes F1, F2, ..., FN]
- Optimal under zero-one loss.
- Poor generalization performance could be improved by Bayesian model averaging, but the space of network structures is super-exponential.
[Figure: a general Bayesian network classifier — C and F1, F2, ..., FN connected by an arbitrary DAG]
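As a concrete baseline for the classifiers discussed above, here is a minimal sketch of a multinomial naïve Bayes classifier with Dirichlet/Laplace smoothing; the function names and data layout are mine, not from the paper:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples, alpha=1.0):
    """Fit class priors and per-feature conditionals from (label, features)
    pairs, smoothing each count by the hyperparameter alpha."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feat_values = defaultdict(set)       # feature index -> observed values
    for label, feats in examples:
        class_counts[label] += 1
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
            feat_values[i].add(v)
    return class_counts, feat_counts, feat_values, alpha

def predict(model, feats):
    """Return argmax over classes of log P(C) + sum_i log P(F_i | C)."""
    class_counts, feat_counts, feat_values, alpha = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, nc in class_counts.items():
        score = math.log(nc / total)
        for i, v in enumerate(feats):
            num = feat_counts[(i, label)][v] + alpha
            den = nc + alpha * len(feat_values[i])
            score += math.log(num / den)
        if score > best_score:
            best, best_score = label, score
    return best
```

This is the structure on the left of the slide: C alone conditions every feature, so the joint factorizes into one conditional per feature.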
4. In this paper
- Bayesian model averaging over a restricted class of Bayesian network classifiers
  - A partial order (π) and a bounded in-degree (k).
- Contributions
  - Apply the factorization of the conditionals to the task of classification.
  - Show that MA over this class can be approximated by a single network S, allowing calculation in O(N) time.
  - Empirical evaluation of the method compared with
    - a single naïve Bayes classifier,
    - a single Bayesian network learned by a greedy search,
    - exact MA over naïve Bayes classifiers.
5. Notation
- The classification problem
  - A set of features F = {F1, F2, ..., FN}.
  - Relabel X0 = C, X1 = F1, ..., XN = FN → X (the variables of the Bayesian network).
  - A set of classes C = {C1, C2, ..., CNC}.
  - A database D = {D1, D2, ..., DR}.
- A Bayesian network
  - G(X): a DAG structure.
  - Each Xi: a multinomial distribution.
  - Pi: the parents of Xi.
  - θijk: a parameter; θ: the full parameter set.
- Other assumptions: parameter independence, Dirichlet priors, complete data.
6. Fixed network structures
- With a fixed network structure, the parameters θ can be averaged out.
- Bayesian averaging over the parameters with conjugate priors has a closed form.
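That closed form is the standard result for multinomial networks with Dirichlet priors, sketched here in the usual counting notation (the index conventions are mine):

$$P(X = x \mid D, S) \;=\; \prod_{i=0}^{N} \frac{N_{ijk} + \alpha_{ijk}}{N_{ij} + \alpha_{ij}},$$

where, for each node Xi, j is the configuration of its parents Pi that x assigns, k is the value x assigns to Xi, N_ijk is the number of cases in D matching that (value, configuration) pair, N_ij = Σ_k N_ijk, and the α terms are the corresponding Dirichlet hyperparameters with α_ij = Σ_k α_ijk.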
7. Averaging with a fixed ordering (1)
- For a structural feature, e.g., the edge XL → XM:
  - the posterior probability P(XL → XM | D),
  - structure modularity (the structure prior decomposes over families),
  - the marginal likelihood (decomposable).
8. Averaging with a fixed ordering (2)
- Then, the posterior probability of a structural feature can be represented as,
9. Averaging with a fixed ordering (3)
- Enumerating the possible parent sets of Xi given a partial ordering
  - π = <X1, X3, X2, X4>, k = 2.
  - P20 = {}, P21 = {X1}, P22 = {X3}, P23 = {X1, X3}.
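The enumeration above can be sketched in a few lines; given a (total) ordering and the in-degree bound k, each node's candidate parent sets are just the subsets of its predecessors of size at most k:

```python
from itertools import combinations

def candidate_parent_sets(order, k):
    """For each variable, enumerate every parent set of size <= k drawn
    from its predecessors in the ordering."""
    sets = {}
    for i, x in enumerate(order):
        preds = order[:i]
        sets[x] = [frozenset(c)
                   for r in range(min(k, len(preds)) + 1)
                   for c in combinations(preds, r)]
    return sets

# Slide example: ordering <X1, X3, X2, X4> with k = 2 gives, for X2,
# exactly the four sets P20..P23 listed above.
ps = candidate_parent_sets(["X1", "X3", "X2", "X4"], 2)
```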
10. Averaging with a fixed ordering (4)
11. Averaging with a fixed ordering (5)
12. Averaging with a fixed ordering (6)
- Dynamic programming solution
- Finally,
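The key consequence of structure modularity plus a decomposable marginal likelihood is that, in the posterior of a local feature, the sums over every other node's parent sets cancel, leaving a ratio of sums local to one node. A hedged sketch (the `family_score` callback, returning the prior-weighted marginal likelihood of one family, is a hypothetical stand-in for the paper's local score):

```python
def edge_posterior(child, parent, parent_sets, family_score):
    """P(parent -> child | D), averaged over all structures consistent
    with the ordering and in-degree bound.  Decomposability reduces the
    super-exponential structure sum to two sums over the child's own
    candidate parent sets."""
    total = sum(family_score(child, P) for P in parent_sets[child])
    with_edge = sum(family_score(child, P)
                    for P in parent_sets[child] if parent in P)
    return with_edge / total
```

Computing `total` and `with_edge` for every node costs one pass over the enumerated parent sets, which is the polynomial-time summation the slides refer to.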
13. Model averaging for predictions
- The probability of a new example can be calculated similarly to the probability of a structural feature.
- The parameter value θijk is used in place of the Kronecker delta function.
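Concretely, once the Kronecker delta is replaced by the averaged parameter, the structure sum factorizes per node; one way to write the result (the weights w_i are my notation for the normalized local family scores, not the paper's):

$$P(x \mid D) \;=\; \prod_{i=0}^{N} \sum_{P_i} w_i(P_i)\,\hat{\theta}_{ijk}(P_i),$$

where the inner sum ranges over the candidate parent sets of Xi, and \hat{\theta}_{ijk}(P_i) is the posterior-expected parameter for the value and parent configuration that x assigns to Xi under P_i.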
14. Approximating the model averaging
- The time bound is still severe even for moderate cases (k = 3 or 4).
- One approximation:
  - Order the set of possible parents for Xi by a scoring function f(Xi, Pi | D) and prune them.
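A hedged sketch of that pruning step, under my own assumptions about its shape: score each predecessor individually with a hypothetical relevance function `f` (e.g. a local marginal-likelihood or mutual-information estimate), keep only the top few, and enumerate parent sets from that small pool:

```python
from itertools import combinations

def prune_parent_candidates(order, k, f, max_candidates):
    """Shrink each node's candidate-parent pool to its top-scoring
    predecessors before enumerating parent sets, reducing the per-node
    enumeration from O(N^k) sets toward a constant."""
    pruned = {}
    for i, x in enumerate(order):
        pool = sorted(order[:i], key=lambda p: f(x, p),
                      reverse=True)[:max_candidates]
        pruned[x] = [frozenset(c)
                     for r in range(min(k, len(pool)) + 1)
                     for c in combinations(pool, r)]
    return pruned
```

With a constant-size pool per node, the remaining per-node work is constant, which is how an O(N) overall approximation becomes possible.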
15. Experimental evaluation (1)
- Performance metric: d = (R1 - R2) / (T - R2)
- Synthetic data sets
- Comparisons between exact averaging and the approximation
16. Experimental evaluation (2)
- Approximate model averaging vs. greedy thick-thin search
17. Experimental evaluation (3)
- Synthetic data from the ALARM network
- AMA vs. GTT
18. Experimental evaluation (4)
- Real classification data sets from the UCI repository
19. Discussion
- Approximate model averaging outperforms a single BN classifier.
- Simplicity of the implementation.
- Future work
  - Find a better method for optimizing the ordering.
  - Applications to real-world problems.
  - Relax the assumption of complete data.