Title: Testing the Significance of Attribute Interactions
1. Testing the Significance of Attribute Interactions
- Aleks Jakulin, Ivan Bratko
- Faculty of Computer and Information Science
- University of Ljubljana
- Slovenia
2. Overview
- Interactions
  - The key to understanding many peculiarities in machine learning.
  - Feature importance measures the 2-way interaction between an attribute and the label, but there are interactions of higher orders.
- An information-theoretic view of interactions
  - Information theory provides a simple algebra of interactions, based on summing and subtracting entropy terms (e.g., mutual information).
- Part-to-whole approximations
  - An interaction is an irreducible dependence. Information-theoretic expressions are model comparisons!
- Significance testing
  - As with all model comparisons, we can investigate the significance of the model difference.
3. Example 1: Feature Subset Selection with NBC
- The calibration of the classifier (expected likelihood of an instance's label) first improves, then deteriorates as we add attributes. The optimal number is 8 attributes. The first few attributes are important, the rest is noise?
4. Example 1: Feature Subset Selection with NBC
- NO! We sorted the attributes from the worst to the best. It is some of the best attributes that deteriorate the performance! Why?
5. Attribute Correlation: Pure Evil?
- No!
- Attributes: independent but noisy readings of the thermometer. Label: the weather.
- These attributes are correlated (label-conditionally correlated too!), but we should not select among them; we should use all of them.
- We cannot, however, assume them to be independent in the Naïve Bayesian model (see the sketch below).
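Below is a minimal sketch of this effect, not the authors' experiment: the data generator, the number of noisy copies, and the use of scikit-learn's GaussianNB and log-loss are all illustrative assumptions. Adding ever more correlated readings of the same underlying attribute makes Naive Bayes overconfident, so its calibration (log-loss) eventually deteriorates.

```python
# Sketch: redundant, correlated attributes hurt Naive Bayes calibration.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                       # binary label ("the weather")
signal = y + rng.normal(0, 1.0, n)              # the underlying "thermometer"
# ten correlated noisy readings of the same thermometer
readings = np.column_stack([signal + rng.normal(0, 0.3, n) for _ in range(10)])

for k in range(1, 11):
    X_tr, X_te, y_tr, y_te = train_test_split(readings[:, :k], y, random_state=0)
    proba = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)
    print(k, round(log_loss(y_te, proba), 3))   # log-loss tends to worsen as k grows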
6. Example 2: Spiral/XOR/Parity Problems
- Either attribute (x, y) is irrelevant when alone. Together, they make a perfect blue/red classifier.
7. Example 2: Spiral/XOR/Parity Problems
- Warning: they may be correlated at the same time.
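A tiny numeric illustration of the parity case (synthetic data and helper functions of my own; entropies are estimated from samples): each attribute alone shares essentially no information with the label, while the pair determines it completely.

```python
# Sketch: mutual information in the XOR/parity problem.
import numpy as np
from collections import Counter

def entropy(samples):
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10000)
y = rng.integers(0, 2, 10000)
c = x ^ y                                          # parity label

print(mutual_information(x, c))                    # ~0 bits: x alone is irrelevant
print(mutual_information(y, c))                    # ~0 bits: y alone is irrelevant
print(mutual_information(list(zip(x, y)), c))      # ~1 bit: together, a perfect classifier
```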
8. What is going on? Interactions
9. Quantification: Shannon's Entropy
[Figure: Venn-diagram view of entropy for variables such as the attribute A and the label C.]
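The slide content is a diagram; the standard quantities it rests on (not specific to this deck) are:

```latex
\begin{aligned}
H(A)          &= -\textstyle\sum_{a} P(a)\,\log P(a) \\
H(A,B)        &= -\textstyle\sum_{a,b} P(a,b)\,\log P(a,b) \\
I(A;B)        &= H(A) + H(B) - H(A,B) \\
I(A;B \mid C) &= H(A \mid C) + H(B \mid C) - H(A,B \mid C)
\end{aligned}
```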
10. Interaction Information
- How informative are A and B together? I(A,B;C)
- Interaction information:
  I(A;B;C) = I(A,B;C) - I(B;C) - I(A;C)
           = I(B;C|A) - I(B;C)
           = I(A;C|B) - I(A;C)
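As a concrete check of the identity above, here is a small sketch (the helper names and the NumPy joint-table representation are my own) that evaluates I(A;B;C) from entropies; for the XOR problem of slides 6-7 it returns +1 bit, a purely synergistic interaction.

```python
# Sketch: interaction information from a discrete joint distribution P(A,B,C).
import numpy as np

def marginal_entropy(p, axes):
    """Entropy (bits) of the marginal of joint table p over the given axes."""
    sum_out = tuple(i for i in range(p.ndim) if i not in axes)
    m = p.sum(axis=sum_out) if sum_out else p
    m = m[m > 0]
    return float(-(m * np.log2(m)).sum())

def interaction_information(p):
    """I(A;B;C) = -H(A)-H(B)-H(C) + H(A,B)+H(A,C)+H(B,C) - H(A,B,C)."""
    H = lambda *axes: marginal_entropy(p, set(axes))
    return -H(0) - H(1) - H(2) + H(0, 1) + H(0, 2) + H(1, 2) - H(0, 1, 2)

# XOR joint distribution: A, B uniform and independent, C = A xor B.
p = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        p[a, b, a ^ b] = 0.25

print(interaction_information(p))   # 1.0 bit
```

The entropy form used in the docstring follows from expanding each mutual-information term of the identity above.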
(Partial) history of independent reinventions:
- Quastler '53 (Info. Theory in Biology) - measure of specificity
- McGill '54 (Psychometrika) - interaction information
- Han '80 (Information and Control) - multiple mutual information
- Yeung '91 (IEEE Trans. on Inf. Theory) - mutual information
- Grabisch & Roubens '99 (I. J. of Game Theory) - Banzhaf interaction index
- Matsuda '00 (Physical Review E) - higher-order mutual inf.
- Brenner et al. '00 (Neural Computation) - average synergy
- Demšar '02 (a thesis in machine learning) - relative information gain
- Bell '03 (NIPS02, ICA2003) - co-information
- Jakulin '02 - interaction gain
11. Applications: Interaction Graphs
- CMC domain: the label is the contraceptive method used by a couple.
12. Interaction as Attribute Proximity
[Figure: attribute clustering by interaction strength - weakly interacting attributes form loose clusters, strongly interacting attributes form tight clusters.]
13. Part-to-Whole Approximation
- Mutual information
  - Whole: P(A,B); Parts: P(A), P(B)
  - Approximation: the product of the parts, P(A)P(B)
- Kullback-Leibler divergence as the measure of difference
  - Also applies for predictive accuracy
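In symbols (a standard identity, not specific to this deck), the quality of the part-to-whole approximation of P(A,B) by P(A)P(B) is exactly the mutual information:

```latex
I(A;B) \;=\; D\bigl(P(A,B)\,\|\,P(A)\,P(B)\bigr)
       \;=\; \sum_{a,b} P(a,b)\,\log\frac{P(a,b)}{P(a)\,P(b)}
```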
14. Kirkwood Superposition Approximation
- It is a closed-form part-to-whole approximation, a special case of the Kikuchi and mean-field approximations.
- The approximation P'(a,b,c) = P(a,b)P(a,c)P(b,c) / (P(a)P(b)P(c)) is not normalized, explaining the negative interaction information.
- It is not optimal (loglinear models beat it).
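A minimal numeric sketch of the point above (an arbitrary random joint distribution and my own variable names): the Kirkwood approximation generally does not sum to 1, and the KL-style divergence built from it equals the interaction information, which is why that quantity can be negative.

```python
# Sketch: the Kirkwood superposition approximation is not normalized.
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()                                  # arbitrary joint distribution P(A,B,C)

p_ab, p_ac, p_bc = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)
p_a, p_b, p_c = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))

kirkwood = (p_ab[:, :, None] * p_ac[:, None, :] * p_bc[None, :, :]
            / (p_a[:, None, None] * p_b[None, :, None] * p_c[None, None, :]))

print(kirkwood.sum())                         # generally != 1
print((p * np.log2(p / kirkwood)).sum())      # interaction information I(A;B;C); can be < 0
```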
15. Significance Testing
- Tries to answer the question: when is the full model P much better than the approximation P'?
- It is based on the realization that even the correct probabilistic model P can expect to make an error for a sample of finite size.
- The notion of self-loss captures the distribution of loss of the complex model (variance).
- The notion of approximation loss captures the loss caused by using a simpler model (bias).
- P is significantly better than P' when the error made by P' is greater than the self-loss in 99.5% of cases. The P-value can be at most 0.05.
16. Test-Bootstrap Protocol
- To obtain the self-loss distribution, we perturb the test data, which is a bootstrap sample from the whole data set. As the loss function, we employ KL-divergence.
- Very similar to assuming that D(P||P') has a χ² distribution.
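A rough sketch of the idea only, not the authors' exact protocol; the toy data, the choice of an "independent parts" approximation, and the number of bootstrap replicates are illustrative assumptions. We bootstrap the data, measure how far each model is from the perturbed empirical distribution, and count how often the approximation's loss stays within the full model's self-loss.

```python
# Sketch: bootstrap comparison of a full joint model vs. an independence approximation.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p.ravel() + eps, q.ravel() + eps
    return float((p * np.log(p / q)).sum())

def joint_table(rows, shape):
    t = np.zeros(shape)
    for r in rows:
        t[tuple(r)] += 1
    return t / len(rows)

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 500)
b = rng.integers(0, 2, 500)
noise = (rng.random(500) < 0.1).astype(int)
c = (a ^ b) ^ noise                                     # label depends on a and b jointly
samples = np.column_stack([a, b, c])

shape = (2, 2, 2)
p_full = joint_table(samples, shape)                    # the "whole" model P
p_parts = (p_full.sum(axis=(1, 2))[:, None, None]       # the "parts" model P(a)P(b)P(c)
           * p_full.sum(axis=(0, 2))[None, :, None]
           * p_full.sum(axis=(0, 1))[None, None, :])

self_loss, approx_loss = [], []
for _ in range(1000):
    boot = samples[rng.integers(0, len(samples), len(samples))]
    p_boot = joint_table(boot, shape)
    self_loss.append(kl(p_boot, p_full))                # loss even the full model makes
    approx_loss.append(kl(p_boot, p_parts))             # loss of the simpler model

p_value = float(np.mean(np.array(approx_loss) <= np.array(self_loss)))
print(p_value)    # small value suggests the approximation is significantly worse
```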
18. Self-Loss
19. Cross-Validation Protocol
- P-values ignore the variation in approximation loss and the generalization power of a classifier.
- CV-values are based on the following perturbation procedure [figure not transcribed; see the sketch below]:
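The slide's figure of the procedure is not reproduced; the sketch below only illustrates the general idea under my own assumptions (a synthetic XOR-like task, Naive Bayes as the "independent" model, logistic regression with an explicit product term as the "interaction" model): repeated cross-validation yields a distribution of loss differences, and the CV-value is read off that distribution rather than from a single average.

```python
# Sketch: repeated CV gives a distribution of complex-vs-simple loss differences.
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)          # XOR-like label

def with_product(Z):
    return np.column_stack([Z, Z[:, 0] * Z[:, 1]])        # add an interaction feature

diffs = []
for tr, te in RepeatedKFold(n_splits=10, n_repeats=10, random_state=0).split(X):
    simple = GaussianNB().fit(X[tr], y[tr])
    complex_ = LogisticRegression(max_iter=1000).fit(with_product(X[tr]), y[tr])
    loss_simple = log_loss(y[te], simple.predict_proba(X[te]), labels=[0, 1])
    loss_complex = log_loss(y[te], complex_.predict_proba(with_product(X[te])), labels=[0, 1])
    diffs.append(loss_simple - loss_complex)

diffs = np.array(diffs)
# One possible CV-style summary: in what fraction of folds does the interaction model win?
print(float(np.mean(diffs > 0)))
```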
20. The Myth of Average Performance
- How much do the mode/median/mean of the above distribution tell you about which model to select?
- ← interaction (complex) wins | approximation (simple) wins →
21. Summary
- The existence of an interaction implies the need for a more complex model that joins the attributes.
- Feature relevance is an interaction of order 2.
- If there is no interaction, a complex model is unnecessary.
- Information theory provides an approximate algebra for investigating interactions.
- The difference between two models is a distribution, not a scalar.
- Occam's P-Razor: pick the simplest model among those that are not significantly worse than the best one.