Testing the Significance of Attribute Interactions


Transcript and Presenter's Notes
1
Testing the Significance of Attribute Interactions
  • Aleks Jakulin, Ivan Bratko
  • Faculty of Computer and Information Science
  • University of Ljubljana
  • Slovenia

2
Overview
  • Interactions
  • The key to understanding many peculiarities in
    machine learning.
  • Feature importance measures the 2-way interaction
    between an attribute and the label, but there are
    interactions of higher orders.
  • An information-theoretic view of interactions
  • Information theory provides a simple algebra of
    interactions, based on summing and subtracting
    entropy terms (e.g., mutual information).
  • Part-to-whole approximations
  • An interaction is an irreducible dependence.
    Information-theoretic expressions are model
    comparisons!
  • Significance testing
  • As with all model comparisons, we can investigate
    the significance of the model difference.

3
Example 1: Feature Subset Selection with NBC
  • The calibration of the classifier (expected
    likelihood of an instance's label) first improves,
    then deteriorates as we add attributes. The
    optimal number is 8 attributes. Are the first few
    attributes important, and the rest just noise?

4
Example 1: Feature Subset Selection with NBC
  • NO! We sorted the attributes from the worst to
    the best. It is some of the best attributes that
    deteriorate the performance! Why?

5
Attribute Correlation: Pure Evil?
  • No!
  • Attributes independent but noisy readings of the
    thermometer.
  • Label the weather.
  • These attributes are correlated
    (label-conditionally correlated too!), but we
    should not select among them, we should use all
    of them.
  • We cannot, however, assume them to be independent
    In the Naïve Bayesian model.
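Why conditionally correlated attributes hurt NBC calibration can be seen in a minimal sketch (hypothetical numbers, not from the slides): duplicated copies of one noisy reading make the Naïve Bayes posterior overconfident.

```python
# Toy numbers (ours): label C = weather, P(hot) = 0.5; each thermometer
# reading agrees with the label with probability 0.8.
p_c = 0.5
p_agree = 0.8

def nb_posterior(k):
    """Naive-Bayes posterior P(hot | k identical 'hot' readings).

    NB multiplies the k likelihoods as if the readings were
    independent, although here they are exact copies of one reading.
    """
    num = (p_agree ** k) * p_c
    return num / (num + ((1 - p_agree) ** k) * p_c)

for k in (1, 2, 5):
    print(k, round(nb_posterior(k), 3))
# 1 0.8
# 2 0.941
# 5 0.999
# The correct posterior stays 0.8 (copies add nothing), but the NB
# confidence races toward 1.0: the calibration damage of Example 1.
```

Selecting among the copies would throw information away when the readings are merely noisy rather than identical; the point of the slide is that the right fix is modelling the dependence, not dropping attributes.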

6
Example 2: Spiral/XOR/Parity Problems
  • Either attribute (x or y) is irrelevant when alone.
    Together, they make a perfect blue/red classifier.

7
Example 2: Spiral/XOR/Parity Problems
  • Warning: they may be correlated at the same time.

8
What is going on? Interactions
9
Quantification: Shannon's Entropy
[Figure: Venn diagram of the entropies of the attributes and the label C]
10
Interaction Information
How informative are A and B together?
I(A;B;C) = I(A,B;C) − I(A;C) − I(B;C)
         = I(B;C|A) − I(B;C)
         = I(A;C|B) − I(A;C)
(Partial) history of independent reinventions:
  • Quastler '53 (Info. Theory in Biology): measure of specificity
  • McGill '54 (Psychometrika): interaction information
  • Han '80 (Information &amp; Control): multiple mutual information
  • Yeung '91 (IEEE Trans. on Inf. Theory): mutual information
  • Grabisch &amp; Roubens '99 (Int. J. of Game Theory): Banzhaf interaction index
  • Matsuda '00 (Physical Review E): higher-order mutual inf.
  • Brenner et al. '00 (Neural Computation): average synergy
  • Demšar '02 (a thesis in machine learning): relative information gain
  • Bell '03 (NIPS'02, ICA2003): co-information
  • Jakulin '02: interaction gain
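As a concrete check, interaction information can be computed from counts. A minimal sketch (helper names are ours), run on the XOR problem of Example 2:

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(counts):
    """Entropy (bits) of the empirical distribution in a Counter."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_info(pairs):
    """I(X;Y) in bits, estimated from a list of (x, y) samples."""
    h_x = entropy(Counter(x for x, _ in pairs))
    h_y = entropy(Counter(y for _, y in pairs))
    h_xy = entropy(Counter(pairs))
    return h_x + h_y - h_xy

# The XOR data of Example 2: A, B uniform bits, label C = A xor B.
data = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

i_ab_c = mutual_info([((a, b), c) for a, b, c in data])  # I(A,B;C)
i_a_c = mutual_info([(a, c) for a, _, c in data])        # I(A;C) = 0
i_b_c = mutual_info([(b, c) for _, b, c in data])        # I(B;C) = 0

# I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C)
print(i_ab_c - i_a_c - i_b_c)  # 1.0 bit of pure synergy
```

Each attribute alone carries zero information about the label, yet the pair determines it completely, so the interaction information is a full bit.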
11
Applications: Interaction Graphs
CMC domain: the label is the contraceptive
method used by a couple.
12
Interaction as Attribute Proximity
[Figure: attribute dendrogram; weakly interacting attributes form
loose clusters, strongly interacting attributes form tight clusters]
13
Part-to-Whole Approximation
  • Mutual information
  • Whole: P(A,B). Parts: P(A), P(B).
  • Approximation: P̂(A,B) = P(A) P(B)
  • Kullback-Leibler divergence as the measure of
    difference: I(A;B) = D( P(A,B) ‖ P̂(A,B) )
  • Also applies to predictive accuracy.
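The identity can be verified numerically; a short sketch with a joint distribution of our choosing:

```python
from math import log2

# Toy joint distribution P(A,B) (ours), clearly not independent.
p_ab = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_a = {a: sum(pr for (x, _), pr in p_ab.items() if x == a) for a in (0, 1)}
p_b = {b: sum(pr for (_, y), pr in p_ab.items() if y == b) for b in (0, 1)}

# KL-divergence from the whole P(A,B) to the product of its parts,
# which is exactly the mutual information I(A;B).
kl = sum(pr * log2(pr / (p_a[a] * p_b[b])) for (a, b), pr in p_ab.items())
print(round(kl, 4))  # 0.2781 bits
```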

14
Kirkwood Superposition Approximation
  • A closed-form part-to-whole approximation, a
    special case of the Kikuchi and mean-field
    approximations:
    P̂(A,B,C) = P(A,B) P(B,C) P(A,C) / ( P(A) P(B) P(C) )
    The approximation is not normalized, which explains
    the negative interaction information. It is not
    optimal (loglinear models beat it).
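A small numerical illustration (our toy distribution, not from the slides): for fully redundant attributes A = B = C, the Kirkwood approximation sums to 2 over the whole cube, and the interaction information is correspondingly negative (I(A;B;C) = −1 bit).

```python
from itertools import product

# Toy joint (ours): fully redundant attributes, A = B = C.
p = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

def marg(keep):
    """Marginal of p over the index positions in `keep`."""
    out = {}
    for cell, pr in p.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

p_ab, p_bc, p_ac = marg((0, 1)), marg((1, 2)), marg((0, 2))
p_a, p_b, p_c = marg((0,)), marg((1,)), marg((2,))

def kirkwood(a, b, c):
    """Kirkwood superposition: P(a,b) P(b,c) P(a,c) / (P(a) P(b) P(c))."""
    num = (p_ab.get((a, b), 0.0) * p_bc.get((b, c), 0.0)
           * p_ac.get((a, c), 0.0))
    return num / (p_a[(a,)] * p_b[(b,)] * p_c[(c,)])

total = sum(kirkwood(*cell) for cell in product((0, 1), repeat=3))
print(total)  # 2.0: the approximation sums to 2, not 1
```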

15
Significance Testing
  • Tries to answer the question:
  • When is the model P much better than the
    approximation P̂?
  • It is based on the realization that even the
    correct probabilistic model P can expect to make
    an error for a sample of finite size.
  • The notion of self-loss captures the distribution
    of loss of the complex model (variance).
  • The notion of approximation loss captures the
    loss caused by using a simpler model (bias).
  • P is significantly better than P̂ when the error
    made by P̂ is greater than the self-loss in 99.5%
    of cases. The P-value can be at most 0.05.

16
Test-Bootstrap Protocol
  • To obtain the self-loss distribution, we perturb
    the test data, which is a bootstrap sample from
    the whole data set. As the loss function, we
    employ the KL-divergence.

VERY similar to assuming that D(P̂ ‖ P) has a χ²
distribution.
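A rough sketch of this protocol (toy model and sample sizes of our choosing): bootstrap the test sample, measure the KL-loss of both the true model P and an independence approximation P̂, and compare the approximation loss against the self-loss distribution.

```python
import random
from collections import Counter
from math import log2

random.seed(0)

# Toy setup (ours): the "true" model P over a pair of binary
# attributes, and an independence approximation P^ of it.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_hat = {k: 0.25 for k in p}  # P(A)P(B) happens to be uniform here

data = random.choices(list(p), weights=p.values(), k=500)  # "test" sample

def kl_from_sample(sample, model):
    """KL-divergence D(empirical || model) in bits."""
    n = len(sample)
    return sum(c / n * log2((c / n) / model[k])
               for k, c in Counter(sample).items())

self_loss, approx_loss = [], []
for _ in range(1000):
    boot = random.choices(data, k=len(data))  # perturb the test data
    self_loss.append(kl_from_sample(boot, p))        # variance of P itself
    approx_loss.append(kl_from_sample(boot, p_hat))  # bias of P^

# P beats P^ when the approximation loss exceeds the bulk of the self-loss.
threshold = sorted(self_loss)[int(0.95 * len(self_loss))]
frac = sum(l > threshold for l in approx_loss) / len(approx_loss)
print(round(frac, 2))
```

With these toy numbers the approximation loss sits far above the self-loss distribution, so the complex model is significantly better; this mirrors the χ² view above, since 2N ln(2) times the KL-loss in bits is asymptotically χ²-distributed under the true model.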
18
Self-Loss
19
Cross-Validation Protocol
  • P-values ignore the variation in approximation
    loss and the generalization power of a
    classifier.
  • CV-values are based on the following perturbation
    procedure:

20
The Myth of Average Performance
  • The distribution of the loss difference between the
    two models:

How much do the mode/median/mean of the above
distribution tell you about which model to select?
← interaction (complex) wins ... approximation (simple) wins →
21
Summary
  • The existence of an interaction implies the need
    for a more complex model that joins the
    attributes.
  • Feature relevance is an interaction of order 2.
  • If there is no interaction, a complex model is
    unnecessary.
  • Information theory provides an approximate
    algebra for investigating interactions.
  • The difference between two models is a
    distribution, not a scalar.
  • Occam's P-Razor: pick the simplest model among
    those that are not significantly worse than the
    best one.