Title: Machine Learning based on Attribute Interactions
1. Machine Learning based on Attribute Interactions
- Aleks Jakulin
- Advised by Acad. Prof. Dr. Ivan Bratko
2003-2005
2. Learning = Modelling
- data shapes the model
- model is made of possible hypotheses
- model is generated by an algorithm
- utility is the goal of a model
(Diagram: Data, the Hypothesis Space, Utility/Loss and the Learning Algorithm together determine the MODEL. "A bounds B": the fixed data sample restricts the model to be consistent with it.)
3. Our Assumptions about Models
- Probabilistic Utility: logarithmic loss (alternatives: classification accuracy, Brier score, RMSE)
- Probabilistic Hypotheses: multinomial distribution, mixture of Gaussians (alternatives: classification trees, linear models)
- Algorithm: maximum likelihood (greedy), Bayesian integration (exhaustive)
- Data: instances × attributes
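Logarithmic loss is the utility assumed throughout the talk; a minimal sketch of what it measures, with made-up class probabilities:

```python
import math

# Logarithmic loss of a probabilistic prediction: the negative log of the
# probability the model assigned to the class that actually occurred.
predicted = {"survived": 0.7, "died": 0.3}   # hypothetical prediction
actual = "died"
log_loss = -math.log2(predicted[actual])     # measured in bits
print(f"log loss = {log_loss:.3f} bits")     # about 1.737 bits
```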
4. Expected Minimum Loss = Entropy
The diagram is a visualization of a probabilistic model P(A,C).
5. 2-Way Interactions
- Probabilistic models take the form of P(A,B).
- We have two models:
  - Interaction allowed: P_Y(a,b) = F(a,b)
  - Interaction disallowed: P_N(a,b) = P(a)P(b) = F(a)G(b)
- The error that P_N makes when approximating P_Y:
  - D(P_Y || P_N) = E_{x~P_Y}[L(x, P_N)] = I(A;B) (mutual information)
- Also applies for predictive models.
- Also applies for Pearson's correlation coefficient (P is a bivariate Gaussian, obtained via max. likelihood).
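A minimal sketch of the quantity above: the KL divergence between the joint P_Y(a,b) and the independence model P_N(a,b) = P(a)P(b) equals the mutual information I(A;B). The contingency table here is a made-up example.

```python
import numpy as np

# Empirical joint P_Y(a,b) from a (hypothetical) 2x2 contingency table.
counts = np.array([[30., 10.],
                   [ 5., 55.]])
p_joint = counts / counts.sum()

# Independence model P_N(a,b) = P(a) P(b), built from the marginals.
p_a = p_joint.sum(axis=1, keepdims=True)
p_b = p_joint.sum(axis=0, keepdims=True)
p_indep = p_a @ p_b

# D(P_Y || P_N) = sum P_Y log2(P_Y / P_N) = I(A;B), in bits.
nz = p_joint > 0
mi = (p_joint[nz] * np.log2(p_joint[nz] / p_indep[nz])).sum()
print(f"I(A;B) = {mi:.3f} bits")
```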
6. Rajski's Distance
- The attributes that have more in common can be visualized as closer in some imaginary Euclidean space.
- How to avoid the influence of many/few-valued attributes? (Complex attributes seem to have more in common.)
- Rajski's distance: d(A,B) = 1 - I(A;B)/H(A,B)
- This is a metric (e.g., it satisfies the triangle inequality).
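A small sketch of this quantity computed from a contingency table; it is 0 for attributes that determine one another and 1 for independent ones (the function names and the toy counts are illustrative).

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability array (zero entries are ignored)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def rajski_distance(counts):
    """Rajski's distance d(A,B) = 1 - I(A;B) / H(A,B) from a contingency table."""
    p = counts / counts.sum()
    h_joint = entropy(p.ravel())
    mi = entropy(p.sum(axis=1)) + entropy(p.sum(axis=0)) - h_joint
    return 1.0 - mi / h_joint

counts = np.array([[30., 10.],
                   [ 5., 55.]])   # hypothetical A-by-B contingency table
print(f"d(A,B) = {rajski_distance(counts):.3f}")
```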
7. Interactions between US Senators
(Interaction matrix: dark = strong interaction, high mutual information; light = weak interaction, low mutual information. Panels mark the Democrats and the Republicans.)
8. A Taxonomy of Machine Learning Algorithms
(Interaction dendrogram for the CMC dataset.)
9. 3-Way Interactions
10. Interaction Information
How informative are A and B together?
I(A;B;C) = I(A,B;C) - I(B;C) - I(A;C)
         = I(B;C|A) - I(B;C)
         = I(A;C|B) - I(A;C)
(Partial) history of independent reinventions:
- Quastler '53 (Info. Theory in Biology): measure of specificity
- McGill '54 (Psychometrika): interaction information
- Han '80 (Information and Control): multiple mutual information
- Yeung '91 (IEEE Trans. on Inf. Theory): mutual information
- Grabisch & Roubens '99 (Int. J. of Game Theory): Banzhaf interaction index
- Matsuda '00 (Physical Review E): higher-order mutual inf.
- Brenner et al. '00 (Neural Computation): average synergy
- Demšar '02 (a thesis in machine learning): relative information gain
- Bell '03 (NIPS02, ICA2003): co-information
- Jakulin '02: interaction gain
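A minimal sketch of the definition above, estimating I(A;B;C) from empirical entropies; the XOR-style toy data and function names are assumptions for illustration (a positive value signals synergy, a negative one redundancy).

```python
import numpy as np

def joint_entropy(*cols):
    """Entropy in bits of the empirical joint distribution of the given columns."""
    rows = np.stack(cols, axis=1)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def interaction_information(a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C), expanded into joint entropies."""
    h = joint_entropy
    return (h(a, b) + h(a, c) + h(b, c)
            - h(a) - h(b) - h(c) - h(a, b, c))

# XOR-like label: a purely synergistic 3-way pattern, so the interaction
# information comes out positive (close to +1 bit).
rng = np.random.default_rng(1)
a = rng.integers(0, 2, 1000)
b = rng.integers(0, 2, 1000)
c = a ^ b
print(f"I(A;B;C) = {interaction_information(a, b, c):.3f} bits")
```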
11. Interaction Dendrogram
- In classification tasks we are only interested in those interactions that involve the label.
(Dendrogram labels: useful attributes vs. useless attributes; the attributes shown include farming, soil and vegetation.)
12. Interaction Graph
- The Titanic data set
- Label: survived?
- Attributes describe the passenger or crew member.
- 2-way interactions:
  - Sex, then Class; Age not as important.
- 3-way interactions:
  - negative: the Crew dummy is wholly contained within Class; Sex largely explains the death rate among the crew.
  - positive:
    - Children from the first and second class were prioritized.
    - Men from the second class mostly died (third-class men and the crew were better off).
    - Female crew members had good odds of survival.
(Legend: blue = redundancy, negative interaction; red = synergy, positive interaction.)
13. An Interaction Drilled
- Data for 600 people.
- What's the loss assuming no interaction between eyes and hair?
- Area corresponds to probability:
  - black square: actual probability
  - colored square: predicted probability
- Colors encode the type of error; the more saturated the color, the more significant the error. Codes:
  - blue: overestimate
  - red: underestimate
  - white: correct estimate
KL-d = 0.178
14. Rules = Constraints
- No interaction
- Rule 1: Blonde hair is connected with blue or green eyes.
- Rule 2: Black hair is connected with brown eyes.
(Figure panels: KL-d = 0.178, KL-d = 0.045, KL-d = 0.134.)
15. Attribute Value Taxonomies
- Interactions can also be computed between pairs of attribute (or label) values. This way, we can structure attributes with many values (e.g., Cartesian products).
(ADULT/CENSUS dataset)
16. Attribute Selection with Interactions
- 2-way interactions I(A;Y) are the staple of attribute selection.
  - Examples: information gain, Gini ratio, etc.
  - Myopia! We ignore both positive and negative interactions.
- Compare this with controlled 2-way interactions I(A;Y | B,C,D,E,...).
  - Examples: Relief, regression coefficients.
  - We have to build a model on all attributes anyway, making many assumptions. What does it buy us?
  - We add another attribute, and the usefulness of a previous attribute is reduced?
17. Attribute Subset Selection with NBC
- The calibration of the classifier (expected likelihood of an instance's label) first improves, then deteriorates as we add attributes. The optimal number is 8 attributes. The first few attributes are important, and the rest is noise?
18. Attribute Subset Selection with NBC
- NO! We sorted the attributes from the worst to
the best. It is some of the best attributes that
ruin the performance! Why? NBC gets confused by
redundancies.
19. Accounting for Redundancies
- At each step, we pick the next best attribute, accounting for the attributes already in the model.
- Fleuret's procedure (see the sketch below)
- Our procedure
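A minimal sketch of a Fleuret-style greedy selection (conditional mutual information maximization), assuming discrete attributes; the scoring rule of "our procedure" from the thesis is not reproduced here, and the helper names and toy data are illustrative.

```python
import numpy as np

def joint_entropy(*cols):
    """Entropy in bits of the empirical joint distribution of the given columns."""
    rows = np.stack(cols, axis=1)
    _, counts = np.unique(rows, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mi(x, y):
    """I(X;Y)."""
    return joint_entropy(x) + joint_entropy(y) - joint_entropy(x, y)

def cond_mi(x, y, z):
    """I(X;Y|Z)."""
    return (joint_entropy(x, z) + joint_entropy(y, z)
            - joint_entropy(z) - joint_entropy(x, y, z))

def cmim_select(X, y, k):
    """Greedily pick k columns of X. A candidate's score is its worst-case
    relevance min over already-selected Z of I(X;Y|Z), so a redundant
    attribute stops looking attractive once a near-copy is in the model."""
    selected = []
    while len(selected) < k:
        scores = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            if selected:
                scores[j] = min(cond_mi(X[:, j], y, X[:, s]) for s in selected)
            else:
                scores[j] = mi(X[:, j], y)
        selected.append(max(scores, key=scores.get))
    return selected

# Toy usage: attribute 3 is a copy of attribute 0, so only one of them is kept.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, (500, 4))
X[:, 3] = X[:, 0]
y = X[:, 0] * 2 + X[:, 1]
print(cmim_select(X, y, 2))   # e.g. [0, 1], not both redundant copies
```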
20. Example: the naïve Bayesian Classifier
(Chart labels: interaction-proof vs. myopic.)
21. Predicting with Interactions
- Interactions are meaningful, self-contained views of the data.
- Can we use these views for prediction?
- It's easy if the views do not overlap: we just multiply them together and normalize, e.g. P(a,b)P(c)P(d,e,f) (see the sketch below).
- If they do overlap:
  - In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and intersections-of-intersections.
- Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, use this to classify.
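For the non-overlapping case above, a minimal sketch of fusing views by multiplication and normalization; the view tables are random placeholders, and the Kikuchi machinery needed for overlapping views is not shown.

```python
import numpy as np

# Fusing non-overlapping "views" (tables over disjoint attribute subsets)
# into a joint distribution by multiplying them and normalizing.
rng = np.random.default_rng(0)
card = {"a": 2, "b": 3, "c": 2}                   # hypothetical cardinalities

view_ab = rng.random((card["a"], card["b"]))      # stands in for P(a,b)
view_ab /= view_ab.sum()
view_c = rng.random(card["c"])                    # stands in for P(c)
view_c /= view_c.sum()

# Joint over (a, b, c): outer product of the views, then renormalize.
joint = view_ab[:, :, None] * view_c[None, None, :]
joint /= joint.sum()

# With disjoint views the product is already normalized, so the last step is
# a no-op; it starts to matter once views overlap and share attributes.
print(joint.shape, round(joint.sum(), 6))
```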
22. Interaction Models
- Transparent and intuitive
- Efficient
- Quick
- Can be improved by replacing Kikuchi with Conditional MaxEnt, and the Cartesian product with something better.
23. Summary of the Talk
- Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but do not have to be.
- Probability is crucial for real-world problems.
- Watch your assumptions (utility, model, algorithm, data).
- Information theory provides solid notation.
- The Bayesian approach to modelling is very robust (naïve Bayes and Bayes nets are not Bayesian approaches).
24. Summary of Contributions
- Practice
  - A number of novel visualization methods.
  - A heuristic for efficient non-myopic attribute selection.
  - An interaction-centered machine learning method, Kikuchi-Bayes.
  - A family of Bayesian priors for consistent modelling with interactions.
- Theory
  - A meta-model of machine learning.
  - A formal definition of a k-way interaction, independent of the utility and hypothesis space.
  - A thorough historic overview of related work.
  - A novel view on interaction significance tests.
25. Information Graphs
- Zoo data set of many different animals. We picked 3 attributes:
  - Does an animal breathe?
  - Does it give milk?
  - Does it lay eggs?
I(A;B|C) = I(A;B) + I(A;B;C)
I(A,B;C) = I(A;C) + I(B;C) + I(A;B;C)
26. The Significance of an Interaction
- Is the loss of P_N approximating P_Y very large in comparison to the range of losses P_N makes on samples drawn from P_N?
  - We can reject P_N: even if P_N were true, the amount of error it suffered on that particular sample is very unlikely. Null: P_N.
- Is the loss of P_N approximating P_Y very large in comparison to the loss P_Y makes on a randomly drawn sample of the same size from P_Y?
  - We can reject P_N: P_Y is safely a winner. Null: P_Y.
- Cross-validation is a way of drawing samples from P_Y (all interactions assumed to exist) without replacement.
27. Self-Loss
(P′: a sample from the truth; P: the truth.)
28. Confidence Intervals
- Procedure:
  - Sample from the null hypothesis.
  - Estimate the distribution of the difference in loss between the null and the alternative hypothesis.
  - Use a confidence interval to summarize such a distribution.
D(P′(A,B) || P(A)P(B)) - D(P′(A,B) || P(A,B))
Mutual information (max. likelihood): I(A;B) = 0.081; 99% confidence interval: [0.053, 0.109]
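A minimal sketch of one way to realize the procedure above, using a permutation of one column as a stand-in for sampling from the null (independence) hypothesis; the toy data, sample size, and function names are assumptions, not the exact procedure from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y):
    """Maximum-likelihood estimate of I(X;Y) in bits from two discrete columns."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum()

# Toy sample of two weakly dependent discrete attributes (hypothetical data).
n = 600
x = rng.integers(0, 3, n)
y = (x + rng.integers(0, 2, n)) % 3

observed = mutual_information(x, y)

# Sample from the null (independence) by permuting one column; each permuted
# sample's MI is the extra loss the null suffers relative to the alternative.
null_mi = np.array([mutual_information(x, rng.permutation(y))
                    for _ in range(2000)])

low, high = np.quantile(null_mi, [0.005, 0.995])
print(f"I(X;Y) = {observed:.3f} bits, null 99% interval [{low:.3f}, {high:.3f}]")
```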
29. The Myth of Average Performance
(Figure: the distribution, over samples, of the loss difference between the yes-interaction and the no-interaction model; one panel is a difficult choice, the other an easy one. On one side the interaction (complex) model wins, on the other the approximation (simple) wins.)
How much do the mode/median/mean of the above distribution tell you about which model to select?