Title: Machine Learning based on Attribute Interactions
1. Machine Learning based on Attribute Interactions
- Aleks Jakulin
- Advised by Acad. Prof. Dr. Ivan Bratko
2003-2005
2. Learning = Modelling
- data shapes the model
- model is made of possible hypotheses
- model is generated by an algorithm
- utility is the goal of a model
(Diagram: Data, the Hypothesis Space, Utility/Loss and the Learning Algorithm together determine the MODEL. "A bounds B": the fixed data sample restricts the model to be consistent with it.)
3. Our Assumptions about Models
- Probabilistic Utility: logarithmic loss (alternatives: classification accuracy, Brier score, RMSE)
- Probabilistic Hypotheses: multinomial distribution, mixture of Gaussians (alternatives: classification trees, linear models)
- Algorithm: maximum likelihood (greedy), Bayesian integration (exhaustive)
- Data: instances × attributes
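Logarithmic loss is the utility assumed throughout the talk; a minimal sketch of what it measures, with made-up class probabilities:

```python
import math

# Logarithmic loss of a probabilistic prediction: the negative log of the
# probability the model assigned to the class that actually occurred.
predicted = {"survived": 0.7, "died": 0.3}   # hypothetical prediction
actual = "died"
log_loss = -math.log2(predicted[actual])     # measured in bits
print(f"log loss = {log_loss:.3f} bits")     # about 1.737 bits
```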
4. Expected Minimum Loss = Entropy
The diagram is a visualization of a probabilistic model P(A,C).
5. 2-Way Interactions
- Probabilistic models take the form of P(A,B).
- We have two models:
  - Interaction allowed: P_Y(a,b) = F(a,b)
  - Interaction disallowed: P_N(a,b) = P(a)P(b) = F(a)G(b)
- The error that P_N makes when approximating P_Y:
  - D(P_Y || P_N) = E_{x~P_Y}[L(x, P_N)] = I(A;B) (mutual information)
- Also applies for predictive models.
- Also applies for Pearson's correlation coefficient (P is a bivariate Gaussian, obtained via max. likelihood).
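A minimal sketch of the quantity above: the KL divergence between the joint P_Y(a,b) and the independence model P_N(a,b) = P(a)P(b) equals the mutual information I(A;B). The contingency table here is a made-up example.

```python
import numpy as np

# Empirical joint P_Y(a,b) from a (hypothetical) 2x2 contingency table.
counts = np.array([[30., 10.],
                   [ 5., 55.]])
p_joint = counts / counts.sum()

# Independence model P_N(a,b) = P(a) P(b), built from the marginals.
p_a = p_joint.sum(axis=1, keepdims=True)
p_b = p_joint.sum(axis=0, keepdims=True)
p_indep = p_a @ p_b

# D(P_Y || P_N) = sum P_Y log2(P_Y / P_N) = I(A;B), in bits.
nz = p_joint > 0
mi = (p_joint[nz] * np.log2(p_joint[nz] / p_indep[nz])).sum()
print(f"I(A;B) = {mi:.3f} bits")
```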
6. Rajski's Distance
- The attributes that have more in common can be visualized as closer in some imaginary Euclidean space.
- How to avoid the influence of many/few-valued attributes? (Complex attributes seem to have more in common.)
- Rajski's distance: d(A,B) = 1 - I(A;B)/H(A,B)
- This is a metric (e.g., it satisfies the triangle inequality).
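A small sketch of this quantity computed from a contingency table; it is 0 for attributes that determine one another and 1 for independent ones (the function names and the toy counts are illustrative).

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability array (zero entries are ignored)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def rajski_distance(counts):
    """Rajski's distance d(A,B) = 1 - I(A;B) / H(A,B) from a contingency table."""
    p = counts / counts.sum()
    h_joint = entropy(p.ravel())
    mi = entropy(p.sum(axis=1)) + entropy(p.sum(axis=0)) - h_joint
    return 1.0 - mi / h_joint

counts = np.array([[30., 10.],
                   [ 5., 55.]])   # hypothetical A-by-B contingency table
print(f"d(A,B) = {rajski_distance(counts):.3f}")
```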
7. Interactions between US Senators
(Interaction matrix: dark = strong interaction, high mutual information; light = weak interaction, low mutual information. Panels mark the Democrats and the Republicans.)
8. A Taxonomy of Machine Learning Algorithms
(Interaction dendrogram for the CMC dataset.)
9. 3-Way Interactions
10. Interaction Information
How informative are A and B together?
I(A;B;C) = I(A,B;C) - I(B;C) - I(A;C)
         = I(B;C|A) - I(B;C)
         = I(A;C|B) - I(A;C)
(Partial) history of independent reinventions:
- Quastler '53 (Info. Theory in Biology): measure of specificity
- McGill '54 (Psychometrika): interaction information
- Han '80 (Information and Control): multiple mutual information
- Yeung '91 (IEEE Trans. on Inf. Theory): mutual information
- Grabisch & Roubens '99 (Int. J. of Game Theory): Banzhaf interaction index
- Matsuda '00 (Physical Review E): higher-order mutual inf.
- Brenner et al. '00 (Neural Computation): average synergy
- Demšar '02 (a thesis in machine learning): relative information gain
- Bell '03 (NIPS02, ICA2003): co-information
- Jakulin '02: interaction gain
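A minimal sketch of the definition above, estimating I(A;B;C) from empirical entropies; the XOR-style toy data and function names are assumptions for illustration (a positive value signals synergy, a negative one redundancy).

```python
import numpy as np

def joint_entropy(*cols):
    """Entropy in bits of the empirical joint distribution of the given columns."""
    rows = np.stack(cols, axis=1)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def interaction_information(a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C), expanded into joint entropies."""
    h = joint_entropy
    return (h(a, b) + h(a, c) + h(b, c)
            - h(a) - h(b) - h(c) - h(a, b, c))

# XOR-like label: a purely synergistic 3-way pattern, so the interaction
# information comes out positive (close to +1 bit).
rng = np.random.default_rng(1)
a = rng.integers(0, 2, 1000)
b = rng.integers(0, 2, 1000)
c = a ^ b
print(f"I(A;B;C) = {interaction_information(a, b, c):.3f} bits")
```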
11. Interaction Dendrogram
- In classification tasks we are only interested in those interactions that involve the label.
(Dendrogram labels: useful attributes vs. useless attributes; the attributes shown include farming, soil and vegetation.)
12. Interaction Graph
- The Titanic data set
- Label: survived?
- Attributes describe the passenger or crew member.
- 2-way interactions:
  - Sex, then Class; Age not as important.
- 3-way interactions:
  - negative: the Crew dummy is wholly contained within Class; Sex largely explains the death rate among the crew.
  - positive:
    - Children from the first and second class were prioritized.
    - Men from the second class mostly died (third-class men and the crew were better off).
    - Female crew members had good odds of survival.
(Legend: blue = redundancy, negative interaction; red = synergy, positive interaction.)
13. An Interaction Drilled
- Data for 600 people.
- What's the loss assuming no interaction between eyes and hair?
- Area corresponds to probability:
  - black square: actual probability
  - colored square: predicted probability
- Colors encode the type of error; the more saturated the color, the more significant the error. Codes:
  - blue: overestimate
  - red: underestimate
  - white: correct estimate
KL-d = 0.178
14. Rules = Constraints
- No interaction
- Rule 1: Blonde hair is connected with blue or green eyes.
- Rule 2: Black hair is connected with brown eyes.
(Figure panels: KL-d = 0.178, KL-d = 0.045, KL-d = 0.134.)
15. Attribute Value Taxonomies
- Interactions can also be computed between pairs of attribute (or label) values. This way, we can structure attributes with many values (e.g., Cartesian products).
(ADULT/CENSUS dataset)
16. Attribute Selection with Interactions
- 2-way interactions I(A;Y) are the staple of attribute selection.
  - Examples: information gain, Gini ratio, etc.
  - Myopia! We ignore both positive and negative interactions.
- Compare this with controlled 2-way interactions I(A;Y | B,C,D,E,...).
  - Examples: Relief, regression coefficients.
  - We have to build a model on all attributes anyway, making many assumptions. What does it buy us?
  - We add another attribute, and the usefulness of a previous attribute is reduced?
17. Attribute Subset Selection with NBC
- The calibration of the classifier (expected likelihood of an instance's label) first improves, then deteriorates as we add attributes. The optimal number is 8 attributes. The first few attributes are important, and the rest is noise?
18. Attribute Subset Selection with NBC
- NO! We sorted the attributes from the worst to
the best. It is some of the best attributes that
ruin the performance! Why? NBC gets confused by
redundancies.
19. Accounting for Redundancies
- At each step, we pick the next best attribute, accounting for the attributes already in the model.
- Fleuret's procedure (see the sketch below)
- Our procedure
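A minimal sketch of a Fleuret-style greedy selection (conditional mutual information maximization), assuming discrete attributes; the scoring rule of "our procedure" from the thesis is not reproduced here, and the helper names and toy data are illustrative.

```python
import numpy as np

def joint_entropy(*cols):
    """Entropy in bits of the empirical joint distribution of the given columns."""
    rows = np.stack(cols, axis=1)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mi(x, y):
    """I(X;Y)."""
    return joint_entropy(x) + joint_entropy(y) - joint_entropy(x, y)

def cond_mi(x, y, z):
    """I(X;Y|Z)."""
    return (joint_entropy(x, z) + joint_entropy(y, z)
            - joint_entropy(z) - joint_entropy(x, y, z))

def cmim_select(X, y, k):
    """Greedily pick k columns of X. A candidate's score is its worst-case
    relevance min over already-selected Z of I(X;Y|Z), so a redundant
    attribute stops looking attractive once a near-copy is in the model."""
    selected = []
    while len(selected) < k:
        scores = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            if selected:
                scores[j] = min(cond_mi(X[:, j], y, X[:, s]) for s in selected)
            else:
                scores[j] = mi(X[:, j], y)
        selected.append(max(scores, key=scores.get))
    return selected

# Toy usage: attribute 3 is a copy of attribute 0, so only one of them is kept.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, (500, 4))
X[:, 3] = X[:, 0]
y = X[:, 0] * 2 + X[:, 1]
print(cmim_select(X, y, 2))   # e.g. [0, 1], not both redundant copies
```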
20. Example: the naïve Bayesian Classifier
(Chart labels: interaction-proof vs. myopic.)
21. Predicting with Interactions
- Interactions are meaningful, self-contained views of the data.
- Can we use these views for prediction?
- It's easy if the views do not overlap: we just multiply them together and normalize, e.g. P(a,b)P(c)P(d,e,f) (see the sketch below).
- If they do overlap:
  - In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and intersections-of-intersections.
- Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, use this to classify.
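For the non-overlapping case above, a minimal sketch of fusing views by multiplication and normalization; the view tables are random placeholders, and the Kikuchi machinery needed for overlapping views is not shown.

```python
import numpy as np

# Fusing non-overlapping "views" (tables over disjoint attribute subsets)
# into a joint distribution by multiplying them and normalizing.
rng = np.random.default_rng(0)
card = {"a": 2, "b": 3, "c": 2}                   # hypothetical cardinalities

view_ab = rng.random((card["a"], card["b"]))      # stands in for P(a,b)
view_ab /= view_ab.sum()
view_c = rng.random(card["c"])                    # stands in for P(c)
view_c /= view_c.sum()

# Joint over (a, b, c): outer product of the views, then renormalize.
joint = view_ab[:, :, None] * view_c[None, None, :]
joint /= joint.sum()

# With disjoint views the product is already normalized, so the last step is
# a no-op; it starts to matter once views overlap and share attributes.
print(joint.shape, round(joint.sum(), 6))
```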
22. Interaction Models
- Transparent and intuitive
- Efficient
- Quick
- Can be improved by replacing Kikuchi with Conditional MaxEnt, and the Cartesian product with something better.
23. Summary of the Talk
- Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but do not have to be.
- Probability is crucial for real-world problems.
- Watch your assumptions (utility, model, algorithm, data).
- Information theory provides solid notation.
- The Bayesian approach to modelling is very robust (naïve Bayes and Bayes nets are not Bayesian approaches).
24. Summary of Contributions
- Practice
  - A number of novel visualization methods.
  - A heuristic for efficient non-myopic attribute selection.
  - An interaction-centered machine learning method, Kikuchi-Bayes.
  - A family of Bayesian priors for consistent modelling with interactions.
- Theory
  - A meta-model of machine learning.
  - A formal definition of a k-way interaction, independent of the utility and hypothesis space.
  - A thorough historic overview of related work.
  - A novel view on interaction significance tests.
25. Information Graphs
- Zoo data set of many different animals. We picked 3 attributes:
  - Does an animal breathe?
  - Does it give milk?
  - Does it lay eggs?
I(A;B|C) = I(A;B) + I(A;B;C)
I(A,B;C) = I(A;C) + I(B;C) + I(A;B;C)
26. The Significance of an Interaction
- Is the loss of P_N approximating P_Y very large in comparison to the range of losses P_N makes on samples drawn from P_N?
  - We can reject P_N: even if P_N were true, the amount of error it suffered on that particular sample is very unlikely. Null: P_N.
- Is the loss of P_N approximating P_Y very large in comparison to the loss P_Y makes on a randomly drawn sample of the same size from P_Y?
  - We can reject P_N: P_Y is safely a winner. Null: P_Y.
- Cross-validation is a way of drawing samples from P_Y (all interactions assumed to exist) without replacement.
27. Self-Loss
(P′: a sample from the truth; P: the truth.)
28. Confidence Intervals
- Procedure:
  - Sample from the null hypothesis.
  - Estimate the distribution of the difference in loss between the null and the alternative hypothesis.
  - Use a confidence interval to summarize such a distribution.
D(P′(A,B) || P(A)P(B)) - D(P′(A,B) || P(A,B))
Mutual information (max. likelihood): I(A;B) = 0.081; 99% confidence interval: [0.053, 0.109]
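A minimal sketch of one way to realize the procedure above, using a permutation of one column as a stand-in for sampling from the null (independence) hypothesis; the toy data, sample size, and function names are assumptions, not the exact procedure from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y):
    """Maximum-likelihood estimate of I(X;Y) in bits from two discrete columns."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum()

# Toy sample of two weakly dependent discrete attributes (hypothetical data).
n = 600
x = rng.integers(0, 3, n)
y = (x + rng.integers(0, 2, n)) % 3

observed = mutual_information(x, y)

# Sample from the null (independence) by permuting one column; each permuted
# sample's MI is the extra loss the null suffers relative to the alternative.
null_mi = np.array([mutual_information(x, rng.permutation(y))
                    for _ in range(2000)])

low, high = np.quantile(null_mi, [0.005, 0.995])
print(f"I(X;Y) = {observed:.3f} bits, null 99% interval [{low:.3f}, {high:.3f}]")
```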
29. The Myth of Average Performance
(Figure: the distribution, over samples, of the loss difference between the yes-interaction and the no-interaction model; one panel is a difficult choice, the other an easy one. On one side the interaction (complex) model wins, on the other the approximation (simple) wins.)
How much do the mode/median/mean of the above distribution tell you about which model to select?