Title: Testing the Significance of Attribute Interactions
1. Testing the Significance of Attribute Interactions
- Aleks Jakulin, Ivan Bratko
- Faculty of Computer and Information Science
- University of Ljubljana
- Slovenia
2. Overview
- Interactions
  - The key to understanding many peculiarities in machine learning.
  - Feature importance measures the 2-way interaction between an attribute and the label, but there are interactions of higher orders.
- An information-theoretic view of interactions
  - Information theory provides a simple algebra of interactions, based on summing and subtracting entropy terms (e.g., mutual information).
- Part-to-whole approximations
  - An interaction is an irreducible dependence. Information-theoretic expressions are model comparisons!
- Significance testing
  - As with all model comparisons, we can investigate the significance of the model difference.
3. Example 1: Feature Subset Selection with NBC
- The calibration of the classifier (expected likelihood of an instance's label) first improves, then deteriorates as we add attributes. The optimal number is 8 attributes. The first few attributes are important, the rest is noise?
4. Example 1: Feature Subset Selection with NBC
- NO! We sorted the attributes from the worst to the best. It is some of the best attributes that deteriorate the performance! Why?
5. Attribute Correlation: Pure Evil?
- No!
- Attributes: independent but noisy readings of the thermometer. Label: the weather.
- These attributes are correlated (label-conditionally correlated too!), but we should not select among them; we should use all of them.
- We cannot, however, assume them to be independent in the Naïve Bayesian model (see the sketch below).
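Below is a minimal sketch of this effect, not the authors' experiment: the data generator, the number of noisy copies, and the use of scikit-learn's GaussianNB and log-loss are all illustrative assumptions. Adding ever more correlated readings of the same underlying attribute makes Naive Bayes overconfident, so its calibration (log-loss) eventually deteriorates.

```python
# Sketch: redundant, correlated attributes hurt Naive Bayes calibration.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                       # binary label ("the weather")
signal = y + rng.normal(0, 1.0, n)              # the underlying "thermometer"
# ten correlated noisy readings of the same thermometer
readings = np.column_stack([signal + rng.normal(0, 0.3, n) for _ in range(10)])

for k in range(1, 11):
    X_tr, X_te, y_tr, y_te = train_test_split(readings[:, :k], y, random_state=0)
    proba = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)
    print(k, round(log_loss(y_te, proba), 3))   # log-loss tends to worsen as k grows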
6. Example 2: Spiral/XOR/Parity Problems
- Either attribute (x, y) is irrelevant when alone. Together, they make a perfect blue/red classifier.
7. Example 2: Spiral/XOR/Parity Problems
- Warning: they may be correlated at the same time.
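A tiny numeric illustration of the parity case (synthetic data and helper functions of my own; entropies are estimated from samples): each attribute alone shares essentially no information with the label, while the pair determines it completely.

```python
# Sketch: mutual information in the XOR/parity problem.
import numpy as np
from collections import Counter

def entropy(samples):
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10000)
y = rng.integers(0, 2, 10000)
c = x ^ y                                          # parity label

print(mutual_information(x, c))                    # ~0 bits: x alone is irrelevant
print(mutual_information(y, c))                    # ~0 bits: y alone is irrelevant
print(mutual_information(list(zip(x, y)), c))      # ~1 bit: together, a perfect classifier
```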
8. What is going on? Interactions
9. Quantification: Shannon's Entropy
[Figure: Venn-diagram view of entropy for variables such as the attribute A and the label C.]
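The slide content is a diagram; the standard quantities it rests on (not specific to this deck) are:

```latex
\begin{aligned}
H(A)          &= -\textstyle\sum_{a} P(a)\,\log P(a) \\
H(A,B)        &= -\textstyle\sum_{a,b} P(a,b)\,\log P(a,b) \\
I(A;B)        &= H(A) + H(B) - H(A,B) \\
I(A;B \mid C) &= H(A \mid C) + H(B \mid C) - H(A,B \mid C)
\end{aligned}
```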
10. Interaction Information
- How informative are A and B together? I(A,B;C)
- Interaction information:
  I(A;B;C) = I(A,B;C) - I(B;C) - I(A;C)
           = I(B;C|A) - I(B;C)
           = I(A;C|B) - I(A;C)
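As a concrete check of the identity above, here is a small sketch (the helper names and the NumPy joint-table representation are my own) that evaluates I(A;B;C) from entropies; for the XOR problem of slides 6-7 it returns +1 bit, a purely synergistic interaction.

```python
# Sketch: interaction information from a discrete joint distribution P(A,B,C).
import numpy as np

def marginal_entropy(p, axes):
    """Entropy (bits) of the marginal of joint table p over the given axes."""
    sum_out = tuple(i for i in range(p.ndim) if i not in axes)
    m = p.sum(axis=sum_out) if sum_out else p
    m = m[m > 0]
    return float(-(m * np.log2(m)).sum())

def interaction_information(p):
    """I(A;B;C) = -H(A)-H(B)-H(C) + H(A,B)+H(A,C)+H(B,C) - H(A,B,C)."""
    H = lambda *axes: marginal_entropy(p, set(axes))
    return -H(0) - H(1) - H(2) + H(0, 1) + H(0, 2) + H(1, 2) - H(0, 1, 2)

# XOR joint distribution: A, B uniform and independent, C = A xor B.
p = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        p[a, b, a ^ b] = 0.25

print(interaction_information(p))   # 1.0 bit
```

The entropy form used in the docstring follows from expanding each mutual-information term of the identity above.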
(Partial) history of independent reinventions:
- Quastler '53 (Info. Theory in Biology) - measure of specificity
- McGill '54 (Psychometrika) - interaction information
- Han '80 (Information and Control) - multiple mutual information
- Yeung '91 (IEEE Trans. on Inf. Theory) - mutual information
- Grabisch & Roubens '99 (I. J. of Game Theory) - Banzhaf interaction index
- Matsuda '00 (Physical Review E) - higher-order mutual inf.
- Brenner et al. '00 (Neural Computation) - average synergy
- Demšar '02 (a thesis in machine learning) - relative information gain
- Bell '03 (NIPS02, ICA2003) - co-information
- Jakulin '02 - interaction gain
11. Applications: Interaction Graphs
- CMC domain: the label is the contraceptive method used by a couple.
12. Interaction as Attribute Proximity
[Figure: attribute clustering by interaction strength - weakly interacting attributes form loose clusters, strongly interacting attributes form tight clusters.]
13. Part-to-Whole Approximation
- Mutual information
  - Whole: P(A,B); Parts: P(A), P(B)
  - Approximation: the product of the parts, P(A)P(B)
- Kullback-Leibler divergence as the measure of difference
  - Also applies for predictive accuracy
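In symbols (a standard identity, not specific to this deck), the quality of the part-to-whole approximation of P(A,B) by P(A)P(B) is exactly the mutual information:

```latex
I(A;B) \;=\; D\bigl(P(A,B)\,\|\,P(A)\,P(B)\bigr)
       \;=\; \sum_{a,b} P(a,b)\,\log\frac{P(a,b)}{P(a)\,P(b)}
```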
14. Kirkwood Superposition Approximation
- It is a closed-form part-to-whole approximation, a special case of the Kikuchi and mean-field approximations.
- The approximation P'(a,b,c) = P(a,b)P(a,c)P(b,c) / (P(a)P(b)P(c)) is not normalized, explaining the negative interaction information.
- It is not optimal (loglinear models beat it).
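A minimal numeric sketch of the point above (an arbitrary random joint distribution and my own variable names): the Kirkwood approximation generally does not sum to 1, and the KL-style divergence built from it equals the interaction information, which is why that quantity can be negative.

```python
# Sketch: the Kirkwood superposition approximation is not normalized.
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()                                  # arbitrary joint distribution P(A,B,C)

p_ab, p_ac, p_bc = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)
p_a, p_b, p_c = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))

kirkwood = (p_ab[:, :, None] * p_ac[:, None, :] * p_bc[None, :, :]
            / (p_a[:, None, None] * p_b[None, :, None] * p_c[None, None, :]))

print(kirkwood.sum())                         # generally != 1
print((p * np.log2(p / kirkwood)).sum())      # interaction information I(A;B;C); can be < 0
```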
15. Significance Testing
- Tries to answer the question: when is the full model P much better than the approximation P'?
- It is based on the realization that even the correct probabilistic model P can expect to make an error for a sample of finite size.
- The notion of self-loss captures the distribution of loss of the complex model (variance).
- The notion of approximation loss captures the loss caused by using a simpler model (bias).
- P is significantly better than P' when the error made by P' is greater than the self-loss in 99.5% of cases. The P-value can be at most 0.05.
16. Test-Bootstrap Protocol
- To obtain the self-loss distribution, we perturb the test data, which is a bootstrap sample from the whole data set. As the loss function, we employ KL-divergence.
- Very similar to assuming that D(P||P') has a χ² distribution.
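A rough sketch of the idea only, not the authors' exact protocol; the toy data, the choice of an "independent parts" approximation, and the number of bootstrap replicates are illustrative assumptions. We bootstrap the data, measure how far each model is from the perturbed empirical distribution, and count how often the approximation's loss stays within the full model's self-loss.

```python
# Sketch: bootstrap comparison of a full joint model vs. an independence approximation.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p.ravel() + eps, q.ravel() + eps
    return float((p * np.log(p / q)).sum())

def joint_table(rows, shape):
    t = np.zeros(shape)
    for r in rows:
        t[tuple(r)] += 1
    return t / len(rows)

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 500)
b = rng.integers(0, 2, 500)
noise = (rng.random(500) < 0.1).astype(int)
c = (a ^ b) ^ noise                                     # label depends on a and b jointly
samples = np.column_stack([a, b, c])

shape = (2, 2, 2)
p_full = joint_table(samples, shape)                    # the "whole" model P
p_parts = (p_full.sum(axis=(1, 2))[:, None, None]       # the "parts" model P(a)P(b)P(c)
           * p_full.sum(axis=(0, 2))[None, :, None]
           * p_full.sum(axis=(0, 1))[None, None, :])

self_loss, approx_loss = [], []
for _ in range(1000):
    boot = samples[rng.integers(0, len(samples), len(samples))]
    p_boot = joint_table(boot, shape)
    self_loss.append(kl(p_boot, p_full))                # loss even the full model makes
    approx_loss.append(kl(p_boot, p_parts))             # loss of the simpler model

p_value = float(np.mean(np.array(approx_loss) <= np.array(self_loss)))
print(p_value)    # small value suggests the approximation is significantly worse
```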
18. Self-Loss
19. Cross-Validation Protocol
- P-values ignore the variation in approximation loss and the generalization power of a classifier.
- CV-values are based on the following perturbation procedure [figure not transcribed; see the sketch below]:
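The slide's figure of the procedure is not reproduced; the sketch below only illustrates the general idea under my own assumptions (a synthetic XOR-like task, Naive Bayes as the "independent" model, logistic regression with an explicit product term as the "interaction" model): repeated cross-validation yields a distribution of loss differences, and the CV-value is read off that distribution rather than from a single average.

```python
# Sketch: repeated CV gives a distribution of complex-vs-simple loss differences.
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)          # XOR-like label

def with_product(Z):
    return np.column_stack([Z, Z[:, 0] * Z[:, 1]])        # add an interaction feature

diffs = []
for tr, te in RepeatedKFold(n_splits=10, n_repeats=10, random_state=0).split(X):
    simple = GaussianNB().fit(X[tr], y[tr])
    complex_ = LogisticRegression(max_iter=1000).fit(with_product(X[tr]), y[tr])
    loss_simple = log_loss(y[te], simple.predict_proba(X[te]), labels=[0, 1])
    loss_complex = log_loss(y[te], complex_.predict_proba(with_product(X[te])), labels=[0, 1])
    diffs.append(loss_simple - loss_complex)

diffs = np.array(diffs)
# One possible CV-style summary: in what fraction of folds does the interaction model win?
print(float(np.mean(diffs > 0)))
```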
20. The Myth of Average Performance
- How much do the mode/median/mean of the above distribution tell you about which model to select?
- ← interaction (complex) wins | approximation (simple) wins →
21. Summary
- The existence of an interaction implies the need for a more complex model that joins the attributes.
- Feature relevance is an interaction of order 2.
- If there is no interaction, a complex model is unnecessary.
- Information theory provides an approximate algebra for investigating interactions.
- The difference between two models is a distribution, not a scalar.
- Occam's P-Razor: pick the simplest model among those that are not significantly worse than the best one.