Machine Learning based on Attribute Interactions

Transcript and Presenter's Notes
1
Machine Learning based on Attribute Interactions
  • Aleks Jakulin
  • Advised by Acad. Prof. Dr. Ivan Bratko

2003-2005
2
Learning = Modelling
  • data shapes the model
  • model is made of possible hypotheses
  • model is generated by an algorithm
  • utility is the goal of a model

[Diagram: Data, the Hypothesis Space, the Learning Algorithm and the Utility (-Loss) together determine the MODEL. A bounds B: the fixed data sample restricts the model to be consistent with it.]
3
Our Assumptions about Models
  • Probabilistic Utility: logarithmic loss (alternatives: classification accuracy, Brier score, RMSE)
  • Probabilistic Hypotheses: multinomial distribution, mixture of Gaussians (alternatives: classification trees, linear models)
  • Algorithm: maximum likelihood (greedy), Bayesian integration (exhaustive)
  • Data: instances × attributes

4
Expected Minimum Loss = Entropy
The diagram is a visualization of a probabilistic model P(A,C).
[Venn-style diagram relating the entropies of A and C]
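Under logarithmic loss this can be stated in a few lines of code; a minimal sketch (mine, not from the slides), using an arbitrary made-up distribution P(C):

```python
import numpy as np

# Hypothetical class distribution P(C); any probability vector would do.
p = np.array([0.5, 0.3, 0.2])

# Entropy H(C) in bits: the expected log-loss when the model equals the truth.
entropy = -np.sum(p * np.log2(p))

# Any other model Q pays H(C) + D(P || Q) >= H(C)  (Gibbs' inequality).
q = np.array([0.4, 0.4, 0.2])
expected_loss_q = -np.sum(p * np.log2(q))

print(f"H(C) = {entropy:.3f} bits  (minimum achievable expected loss)")
print(f"expected log-loss of Q = {expected_loss_q:.3f} bits")
```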
5
2-Way Interactions
  • Probabilistic models take the form of P(A,B)
  • We have two models:
  • Interaction allowed: P_Y(a,b) = F(a,b)
  • Interaction disallowed: P_N(a,b) = P(a)P(b) = F(a)G(b)
  • The error that P_N makes when approximating P_Y:
  • D(P_Y || P_N) = E_{x~P_Y}[L(x, P_N)] = I(A;B) (mutual information)
  • Also applies to predictive models
  • Also applies to Pearson's correlation coefficient

P is a bivariate Gaussian, obtained via maximum likelihood.
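To make the two models concrete, here is a small illustrative sketch (assuming discrete attributes summarized in a contingency table of counts, which is not shown on the slide): both P_Y and P_N are estimated by maximum likelihood, and their divergence is exactly the mutual information.

```python
import numpy as np

def mutual_information(joint_counts):
    """I(A;B) = D(P_Y || P_N) in bits, from a table of co-occurrence counts."""
    p_y = joint_counts / joint_counts.sum()      # interaction allowed: P_Y(a,b)
    p_a = p_y.sum(axis=1, keepdims=True)         # marginal P(a)
    p_b = p_y.sum(axis=0, keepdims=True)         # marginal P(b)
    p_n = p_a * p_b                              # interaction disallowed: P_N(a,b) = P(a)P(b)
    nz = p_y > 0
    return np.sum(p_y[nz] * np.log2(p_y[nz] / p_n[nz]))

# Hypothetical 2x3 contingency table for attributes A and B.
counts = np.array([[30, 10, 10],
                   [ 5, 25, 20]])
print(f"I(A;B) = {mutual_information(counts):.3f} bits")
```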
6
Rajski's Distance
  • The attributes that have more in common can be
    visualized as closer in some imaginary Euclidean
    space.
  • How to avoid the influence of many/few-valued
    attributes? (Complex attributes seem to have more
    in common.)
  • Rajski's distance: d(A,B) = 1 - I(A;B)/H(A,B)
  • This is a metric (e.g., it satisfies the triangle inequality)

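A short sketch of this normalization, assuming the standard form d(A,B) = 1 - I(A;B)/H(A,B) and a count table as input (the counts are made up):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def rajski_distance(joint_counts):
    """d(A,B) = 1 - I(A;B)/H(A,B): 0 for identical attributes, 1 for independent ones."""
    p = joint_counts / joint_counts.sum()
    h_a, h_b = entropy(p.sum(axis=1)), entropy(p.sum(axis=0))
    h_ab = entropy(p.ravel())
    i_ab = h_a + h_b - h_ab                      # mutual information
    return 1.0 - i_ab / h_ab if h_ab > 0 else 0.0

counts = np.array([[30, 10, 10],
                   [ 5, 25, 20]])
print(f"d(A,B) = {rajski_distance(counts):.3f}")
```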
7
Interactions between US Senators
[Interaction matrix of US Senators, with Democrats and Republicans forming two blocks. Dark: strong interaction, high mutual information; light: weak interaction, low mutual information.]
8
A Taxonomy of Machine Learning Algorithms
Interaction dendrogram
CMC dataset
9
3-Way Interactions
10
Interaction Information
How informative are A and B together?
I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C)
         = I(B;C|A) - I(B;C)
         = I(A;C|B) - I(A;C)
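Written out in joint entropies, the same quantity can be estimated from a 3-dimensional contingency table. A minimal sketch; the XOR example is my own choice, picked because it produces a purely synergistic (positive) interaction:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def interaction_information(counts):
    """I(A;B;C) = I(A;B|C) - I(A;B), in bits, from a 3-D array of counts."""
    p = counts / counts.sum()
    def H(*keep):  # entropy of the marginal over the kept axes
        drop = tuple(ax for ax in range(3) if ax not in keep)
        return entropy(p.sum(axis=drop).ravel())
    return (H(0, 1) + H(0, 2) + H(1, 2)
            - H(0) - H(1) - H(2) - entropy(p.ravel()))

# XOR-like data: C = A xor B, so A and B are individually useless but jointly decisive.
counts = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        counts[a, b, a ^ b] = 25
print(f"I(A;B;C) = {interaction_information(counts):+.3f} bits")   # +1 bit: synergy
```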
(Partial) history of independent reinventions:
Quastler '53 (Info. Theory in Biology) - measure of specificity
McGill '54 (Psychometrika) - interaction information
Han '80 (Information & Control) - multiple mutual information
Yeung '91 (IEEE Trans. on Inf. Theory) - mutual information
Grabisch & Roubens '99 (Int. J. of Game Theory) - Banzhaf interaction index
Matsuda '00 (Physical Review E) - higher-order mutual information
Brenner et al. '00 (Neural Computation) - average synergy
Demšar '02 (a thesis in machine learning) - relative information gain
Bell '03 (NIPS'02, ICA 2003) - co-information
Jakulin '02 - interaction gain
11
Interaction Dendrogram
In classification tasks we are only interested in those interactions that involve the label.
[Dendrogram with a cluster of useful attributes (farming, soil, vegetation) separated from the useless attributes]
12
Interaction Graph
  • The Titanic data set
  • Label: survived?
  • Attributes describe the passenger or crew member
  • 2-way interactions
  • Sex first, then Class; Age is not as important
  • 3-way interactions
  • negative: the Crew dummy is wholly contained within Class; Sex largely explains the death rate among the crew.
  • positive:
  • Children from the first and second class were prioritized.
  • Men from the second class mostly died (third-class men and the crew were better off).
  • Female crew members had good odds of survival.

blue: redundancy, negative interaction; red: synergy, positive interaction
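A hedged sketch of how such a graph could be populated (not the thesis code; the data below are random placeholders standing in for the Titanic table): estimate the interaction gain I(A;B;Y) with the label for every attribute pair and colour the edge by its sign.

```python
import itertools
import numpy as np
import pandas as pd

def H(df, cols):
    p = df.groupby(list(cols)).size() / len(df)
    return -np.sum(p * np.log2(p))

def interaction_gain(df, a, b, y):
    """I(A;B;Y) = I(A;B|Y) - I(A;B), estimated from discrete columns."""
    return (H(df, [a, b]) + H(df, [a, y]) + H(df, [b, y])
            - H(df, [a]) - H(df, [b]) - H(df, [y]) - H(df, [a, b, y]))

# Placeholder data; in the real graph these would come from the Titanic data set.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sex":      rng.choice(["male", "female"], 1000),
    "class":    rng.choice(["1st", "2nd", "3rd", "crew"], 1000),
    "age":      rng.choice(["adult", "child"], 1000),
    "survived": rng.choice(["yes", "no"], 1000),
})

for a, b in itertools.combinations(["sex", "class", "age"], 2):
    g = interaction_gain(df, a, b, "survived")
    colour = "red (synergy)" if g > 0 else "blue (redundancy)"
    print(f"I({a};{b};survived) = {g:+.4f} bits -> {colour}")
```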
13
An Interaction Drilled
  • Data for 600 people
  • What's the loss, assuming no interaction between eye colour and hair colour?
  • Area corresponds to probability
  • black square: actual probability
  • colored square: predicted probability
  • Colors encode the type of error. The more saturated the color, the more significant the error. Codes:
  • blue: overestimate
  • red: underestimate
  • white: correct estimate

KL-d = 0.178
14
Rules = Constraints
No interaction
  • Rule 1: Blonde hair is connected with blue or green eyes.
  • Rule 2: Black hair is connected with brown eyes.

KL-d = 0.178
KL-d = 0.045
KL-d = 0.134
15
Attribute Value Taxonomies
  • Interactions can also be computed between pairs of attribute (or label) values. This way, we can structure attributes with many values (e.g., via Cartesian products).

ADULT/CENSUS
16
Attribute Selection with Interactions
  • 2-way interactions I(A;Y) are the staple of attribute selection
  • Examples: information gain, Gini ratio, etc.
  • Myopia! We ignore both positive and negative interactions (a sketch of this myopic ranking follows below).
  • Compare this with controlled 2-way interactions I(A;Y | B,C,D,E,...)
  • Examples: Relief, regression coefficients
  • We have to build a model on all attributes anyway, making many assumptions. What does it buy us?
  • We add another attribute, and the usefulness of a previous attribute is reduced?
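A minimal sketch of the myopic criterion being criticized (my illustration, with made-up attributes): each attribute is scored by I(A;Y) alone, so a label that is the XOR of two attributes makes every individual score collapse to roughly zero.

```python
import numpy as np
import pandas as pd

def H(df, cols):
    p = df.groupby(list(cols)).size() / len(df)
    return -np.sum(p * np.log2(p))

def info_gain(df, a, y):
    """Myopic 2-way score I(A;Y); interactions with other attributes are ignored."""
    return H(df, [a]) + H(df, [y]) - H(df, [a, y])

rng = np.random.default_rng(1)
df = pd.DataFrame({"a1": rng.integers(0, 2, 500),
                   "a2": rng.integers(0, 3, 500),
                   "a3": rng.integers(0, 2, 500)})
df["y"] = df["a1"] ^ df["a3"]          # the label is the XOR of a1 and a3

scores = {a: info_gain(df, a, "y") for a in ("a1", "a2", "a3")}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
# All scores are near zero: myopia misses the purely positive (3-way) interaction.
```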

17
Attribute Subset Selection with NBC
  • The calibration of the classifier (the expected likelihood of an instance's label) first improves, then deteriorates as we add attributes. The optimal number is 8 attributes. Are the first few attributes important, and the rest just noise?

18
Attribute Subset Selection with NBC
  • NO! We sorted the attributes from the worst to
    the best. It is some of the best attributes that
    ruin the performance! Why? NBC gets confused by
    redundancies.

19
Accounting for Redundancies
  • At each step, we pick the next best attribute, accounting for the attributes already in the model
  • Fleuret's procedure (a greedy sketch in this spirit follows below)
  • Our procedure
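A greedy sketch in the spirit of Fleuret's criterion (my reading, not the code from the thesis): the next attribute is the one whose worst-case conditional information gain I(A;Y|B), taken over the attributes B already selected, is largest, so a redundant duplicate scores near zero.

```python
import numpy as np
import pandas as pd

def H(df, cols):
    p = df.groupby(list(cols)).size() / len(df)
    return -np.sum(p * np.log2(p))

def mi(df, a, y):
    return H(df, [a]) + H(df, [y]) - H(df, [a, y])

def cond_mi(df, a, y, b):
    """I(A;Y|B) = H(A,B) + H(Y,B) - H(B) - H(A,Y,B)."""
    return H(df, [a, b]) + H(df, [y, b]) - H(df, [b]) - H(df, [a, y, b])

def select(df, label, k):
    candidates = [c for c in df.columns if c != label]
    chosen = [max(candidates, key=lambda a: mi(df, a, label))]   # first pick: plain info gain
    while len(chosen) < k:
        rest = [c for c in candidates if c not in chosen]
        # An attribute is only as good as its worst conditional gain given what we already have.
        chosen.append(max(rest, key=lambda a: min(cond_mi(df, a, label, b) for b in chosen)))
    return chosen

rng = np.random.default_rng(0)
a, c = rng.integers(0, 4, 400), rng.integers(0, 2, 400)
df = pd.DataFrame({"a": a, "b": a, "c": c, "y": 2 * a + c})      # b is a redundant copy of a
print(select(df, "y", 2))   # ['a', 'c']: the duplicate 'b' is skipped
```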

20
Example: the naïve Bayesian Classifier
[Figure annotations: interaction-proof vs. myopic]
21
Predicting with Interactions
  • Interactions are meaningful, self-contained views of the data.
  • Can we use these views for prediction?
  • It's easy if the views do not overlap: we just multiply them together and normalize, P(a,b)P(c)P(d,e,f)
  • If they do overlap?
  • In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and the intersections-of-intersections.
  • Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, and use this to classify (a sketch of the non-overlapping case follows below).
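A minimal sketch of the easy, non-overlapping case only; the Kikuchi machinery for overlapping views is not shown, and the data, smoothing and names are placeholders of mine. Each view contributes a conditional factor P(x_view | y); the factors are multiplied with the class prior and normalized over the classes.

```python
import numpy as np
import pandas as pd

def predict(df, views, label, instance):
    """P(y|x) proportional to P(y) * prod over views v of P(x_v | y), for disjoint views."""
    classes = sorted(df[label].unique())
    scores = []
    for y in classes:
        sub = df[df[label] == y]
        score = len(sub) / len(df)                              # class prior P(y)
        for view in views:
            wanted = pd.Series({v: instance[v] for v in view})
            match = (sub[list(view)] == wanted).all(axis=1)
            score *= (match.sum() + 1) / (len(sub) + 2)         # crude Laplace smoothing
        scores.append(score)
    scores = np.array(scores)
    return dict(zip(classes, scores / scores.sum()))            # normalize over the classes

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(0, 2, 300),
                   "b": rng.integers(0, 2, 300),
                   "c": rng.integers(0, 2, 300)})
df["y"] = (df["a"] & df["b"]) | df["c"]

views = [("a", "b"), ("c",)]            # one 2-way interaction and one lone attribute
print(predict(df, views, "y", {"a": 1, "b": 1, "c": 0}))
```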

22
Interaction Models
  • Transparent and intuitive
  • Efficient
  • Quick
  • Can be improved by replacing Kikuchi with
    Conditional MaxEnt, and Cartesian product with
    something better.

23
Summary of the Talk
  • Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but do not have to be.
  • Probability is crucial for real-world problems.
  • Watch your assumptions (utility, model,
    algorithm, data)
  • Information theory provides solid notation.
  • The Bayesian approach to modelling is very robust
    (naïve Bayes and Bayes nets are not Bayesian
    approaches)

24
Summary of Contributions
  • Practice
  • A number of novel visualization methods.
  • A heuristic for efficient non-myopic attribute
    selection.
  • An interaction-centered machine learning method,
    Kikuchi-Bayes
  • A family of Bayesian priors for consistent
    modelling with interactions.
  • Theory
  • A meta-model of machine learning.
  • A formal definition of a k-way interaction,
    independent of the utility and hypothesis space.
  • A thorough historic overview of related work.
  • A novel view on interaction significance tests.

25
Information Graphs
  • Zoo: a data set of many different animals. We picked 3 attributes:
  • Does an animal breathe?
  • Does it give milk?
  • Does it lay eggs?

I(A;B|C) = I(A;B) + I(A;B;C)
I(A,B;C) = I(A;C) + I(B;C) + I(A;B;C)
26
The Significance of an Interaction
  • Is the loss of P_N approximating P_Y very large in comparison to the range of losses P_N makes on samples drawn from P_N?
  • We can reject P_N: even if P_N were true, the amount of error it suffered on that particular sample would be very unlikely. Null: P_N
  • Is the loss of P_N approximating P_Y very large in comparison to the loss P_Y makes on a randomly drawn sample of the same size from P_Y?
  • We can reject P_N: P_Y is safely a winner. Null: P_Y
  • Cross-validation is a way of drawing samples from P_Y (all interactions assumed to exist) without replacement.

27
Self-Loss
P̂: a sample from the truth; P: the truth
28
Confidence Intervals
  • Procedure
  • Sample from the null hypothesis
  • Estimate the distribution of the difference in
    loss between the null and the alternative
    hypothesis.
  • Use a confidence interval to summarize such a
    distribution

D(P̂(A,B) || P̂(A)P̂(B)) - D(P̂(A,B) || P(A,B))
Mutual information (max. likelihood): I(A;B) = 0.081, 99% confidence interval [0.053, 0.109]
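A sketch of one such resampling estimate (my own, with made-up counts): bootstrap contingency tables from the empirical joint distribution, recompute the maximum-likelihood mutual information on each, and read off a 99% percentile interval.

```python
import numpy as np

def mutual_information(joint_counts):
    p = joint_counts / joint_counts.sum()
    p_a, p_b = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / (p_a * p_b)[nz]))

rng = np.random.default_rng(0)

observed = np.array([[40, 15],           # hypothetical contingency table for A and B
                     [10, 35]])
n = observed.sum()
point = mutual_information(observed)

# Resample tables of the same size from the empirical joint (the "all interactions" model P_Y).
boot = []
for _ in range(2000):
    table = rng.multinomial(n, (observed / n).ravel()).reshape(observed.shape)
    boot.append(mutual_information(table))

lo, hi = np.percentile(boot, [0.5, 99.5])
print(f"I(A;B) = {point:.3f} bits, 99% bootstrap interval [{lo:.3f}, {hi:.3f}]")
```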
29
The Myth of Average Performance
  • The distribution of the loss difference between the yes-interaction model and the no-interaction model, estimated from samples.

[Two example distributions; annotations: interaction (complex) wins to one side, approximation (simple) wins to the other]
Difficult choice: how much do the mode/median/mean of the above distribution tell you about which model to select?
Easy choice: virtually the whole distribution favours one of the models.