Model Accuracy - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Model Accuracy


1
Lecture 8
  • Model Accuracy

2
Model Accuracy
  • The accuracy of a model is written as
  • P(D|B)
  • (the probability of the data given the Bayesian
    network)
  • The network B defines the joint probability of
    the model variables, and so P(D|B) is the
    product of that joint probability evaluated at
    each data case

3
Example
  • Suppose we have a Bayesian network with just two
    nodes, X and Y.
  • The joint probability is P(X,Y) = P(X)P(Y|X)
  • The model accuracy with a set of data (xi, yi) is
  • P(D|B) = ∏i P(xi) P(yi|xi)
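The product above can be sketched in a few lines of Python. The slide's own data table was an image and did not survive extraction, so the four data cases and the probability values below are hypothetical, chosen only to illustrate the calculation:

```python
# Sketch: model accuracy P(D|B) for the two-node network X -> Y.
# Data cases and probability values are hypothetical (the deck's
# actual table was lost); the calculation is the slide's formula.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (xi, yi) pairs

p_x = {0: 0.5, 1: 0.5}        # prior P(X)
p_y_given_x = {(0, 0): 0.8,   # P(Y=y | X=x), keyed as (y, x)
               (1, 0): 0.2,
               (0, 1): 0.3,
               (1, 1): 0.7}

accuracy = 1.0
for x, y in data:
    accuracy *= p_x[x] * p_y_given_x[(y, x)]

print(accuracy)  # product of P(xi) P(yi|xi) over the four cases
```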

4
Data set
  • Suppose we have the following data set and model

5
Probability of the data set
6
Consider a different model with the same data
7
Log Likelihood
  • The value of P(D|B) goes down dramatically with
    the number of data points. Consequently it is
    more common to use the log likelihood
  • log2(P(D|B))
  • Taking log2 creates an information measure
    since log2 N bits are required to represent
    integers up to N

8
Log Likelihood of our simple example
9
Model Size
  • It is not surprising that the first model
    represents the data better.
  • It uses six probability values (prior and
    conditional) rather than four
  • Thus log likelihood is not sufficient to compare
    competing models since the one with the most arcs
    will always be the most likely.

10
Number of Parameters
  • To estimate the model size, one measure is the
    number of parameters.
  • Each prior probability is a vector of m
    probability values. It can be represented by m-1
    parameters
  • (the mth probability value can be calculated from
    the others).
  • Each link matrix has n × m probability values which
    can be represented by (n-1) × m parameters.
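The counting rules above fit in two one-line functions. A minimal sketch (the function names are mine, not the deck's):

```python
# Sketch of the slide's parameter-counting rules.
def prior_params(m):
    """A prior over m values needs m - 1 parameters."""
    return m - 1

def link_matrix_params(n, m):
    """A link matrix P(child | parent) with n child values and
    m parent values needs (n - 1) * m parameters."""
    return (n - 1) * m

# Binary X -> binary Y: 1 prior parameter + 2 link parameters = 3,
# matching the connected model of the running example.
print(prior_params(2) + link_matrix_params(2, 2))  # 3
```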

11
Representing each parameter
  • As the data set increases the parameters can be
    represented more accurately.
  • We noted before that to represent integers up to
    N requires at most log2N bits.
  • Thus given N data points we expect that we need
    at most log2N bits to represent each parameter.
  • (On average we need (log2N)/2)

12
Information measure of model size
  • Putting this all together we can define a measure
    of model size
  • Size(B) = |B| (log2 N)/2
  • where |B| is the number of parameters we need to
    represent the probabilities (prior and
    conditional)
  • and N is the number of data cases used to
    calculate them
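As a sketch, the size measure is a direct transcription of the formula (the function name is mine):

```python
from math import log2

# Sketch: Size(B) = |B| * log2(N) / 2, where |B| is the number of
# parameters and N the number of data cases.
def model_size(num_params, num_cases):
    return num_params * log2(num_cases) / 2

print(model_size(3, 4))  # connected model of the example: 3.0
print(model_size(2, 4))  # independent model of the example: 2.0
```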

13
Small is Beautiful
  • The minimum description length principle states
    that the best model to represent a set of data is
    the smallest that will produce the required
    accuracy.
  • This suggests a metric as follows
  • MDLScore = ModelSize - ModelAccuracy

14
The Minimum Description Length measure
  • Combining the two we get
  • MDL(B|D) = |B| (log2 N)/2 - log2(P(D|B))
  • We can now assert that between competing models
    the best one has the lowest MDL score.
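The combined score is again a one-liner. A sketch (function and argument names are mine; the numbers in the usage line come from the second example later in the deck, where N = 8 and -log2 P(D|B) = 11.25 for the connected model):

```python
from math import log2

# Sketch: MDL(B|D) = |B| * log2(N)/2 - log2(P(D|B)).
def mdl_score(num_params, num_cases, p_data_given_model):
    return num_params * log2(num_cases) / 2 - log2(p_data_given_model)

# N = 8 cases, 3 parameters, -log2 P(D|B) = 11.25:
# score = 4.5 + 11.25 = 15.75
print(mdl_score(3, 8, 2 ** -11.25))
```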

15
Calculating the model size for our example
  • Number of data points N = 4
  • (log2 N)/2 = 1
  • Number of parameters |B| = 3
  • For example, we can construct the prior
    probability of X from P(x0) and the link matrix
    from P(y0|x0) and P(y0|x1).
  • Thus model size = 3 × 1 = 3

16
Calculating the model size for our example 2
  • Number of data points N = 4
  • (log2 N)/2 = 1
  • Number of parameters |B| = 2
  • For example, we can construct the prior
    probability of X from P(x0) and the prior
    probability of Y from P(y0)
  • Thus model size = |B| (log2 N)/2 = 2 × 1 = 2

17
Comparing the MDL scores
18
Problem break
  • Given a completely independent data set of two
    variables, what does the MDL tell us about the
    preferred network?

19
Solution
  • We have that
  • P(x0) = P(x1) = P(y0) = P(y1) = 0.5
  • P(y0|x0) = P(y0|x1) = P(y1|x0) = P(y1|x1) = 0.5
  • So the model accuracy is the same for both
    networks.
  • However, the model size will be smaller for the
    independent network, so it will be preferred.

20
Second Example
  • Suppose now we have a different data set

21
Second Example 2
  • Choosing the independent model

22
Second Example 3
  • The model size this time is bigger, reflecting
    the fact that we have twice as many points
    (N = 8, so (log2 N)/2 = 3/2)
  • Connected: |B| (log2 N)/2 = 3 × 3/2 = 4.5
  • Independent: |B| (log2 N)/2 = 2 × 3/2 = 3

23
Second Example 4
  • The data this time had a very strong correlation
    between X and Y and the connected model fits the
    data much better.
  • MDL(Connected) = 4.5 + 11.25 = 15.75
  • MDL(Independent) = 3 + 15.6 = 18.6
  • The MDL metric prefers the connected model
    despite its greater size.
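The comparison can be checked directly. A sketch in Python, where N = 8 and the -log2 P(D|B) values 11.25 and 15.6 are taken from the slides:

```python
from math import log2

# Sketch: MDL comparison for the second example (N = 8 data cases).
# The log-likelihood magnitudes 11.25 and 15.6 come from the deck.
n = 8
mdl_connected = 3 * log2(n) / 2 + 11.25    # 4.5 + 11.25 = 15.75
mdl_independent = 2 * log2(n) / 2 + 15.6   # 3 + 15.6 = 18.6

print(mdl_connected)
print(mdl_independent)
print(mdl_connected < mdl_independent)  # connected model preferred
```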

24
Search for the best network
  • Having defined a metric on the network and the
    data, we can choose between competing models.
  • This leads to algorithms based on searching the
    space of possible models.

25
Exhaustive Search
  • Let us suppose we have four variables A,B,C and
    D.
  • We start with the independent network (no arcs)
    and make a list of possible arcs.
  • There is just 1 network
  • There are 4C2 = 6 possible arcs
  • A-B, A-C, A-D, B-C, B-D, C-D

26
Exhaustive search
  • Now we make all possible networks with one arc.
  • There are 6 possible networks with one arc,
    each with five missing arcs.
  • From each of these we can make five networks with
    two arcs, each having four possible new arcs
  • We continue until three arcs have been added so
    that the network is connected

27
Size of the search tree
  • From the above we can see that the size of the
    search tree is 6 × 5 × 4 + 1, which we write as
    6!/3! + 1
  • in general if we write the number of possible
    arcs as Cn = n!/(2!(n-2)!) the search tree will
    have
  • Cn!/(Cn-(n-1))! + 1 nodes
  • for four nodes this is 121 and exhaustive search
    is possible.
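The formula can be checked mechanically. A sketch (function name is mine) that reproduces the counts quoted here and on the next slide:

```python
from math import comb, factorial

# Sketch: search-tree size Cn!/(Cn - (n-1))! + 1, with Cn = n choose 2.
# Arcs are added one at a time until the n-1 arcs of a connected
# network are in place.
def tree_size(n):
    c = comb(n, 2)  # Cn, the number of possible arcs
    return factorial(c) // factorial(c - (n - 1)) + 1

print(tree_size(4))  # 6 * 5 * 4 + 1 = 121
print(tree_size(5))  # 5041
print(tree_size(6))  # 360361
```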

28
Limiting the search
  • The search tree grows very fast: for 5 variables
    it has 5041 nodes and for 6 variables 360361
    nodes
  • We could eliminate from the tree any network that
    is not singly connected.
  • Even so, the tree for more than five nodes will
    be too large to compute the parameters for a
    large data set.

29
Depth first search
  • Search methods therefore usually proceed depth
    first.
  • At each stage we choose the network with the
    lowest MDL score and add the arc with the largest
    mutual information.
  • This strategy will help to direct the search
    towards networks that fit the data well.
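The deck gives no code for the arc-ranking step, so the following is a minimal sketch of empirical mutual information between two variables, the quantity used to choose the next arc; the data set in the usage line is hypothetical:

```python
from math import log2

# Sketch: empirical mutual information between two variables,
# used here to rank candidate arcs. The data set is hypothetical.
def mutual_information(pairs):
    n = len(pairs)
    px, py, pxy = {}, {}, {}
    for x, y in pairs:                      # count marginals and joints
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
        pxy[(x, y)] = pxy.get((x, y), 0) + 1
    mi = 0.0
    for (x, y), count in pxy.items():       # sum p * log2(p / (px * py))
        p = count / n
        mi += p * log2(p / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly correlated binary variables share 1 bit of information.
print(mutual_information([(0, 0), (1, 1), (0, 0), (1, 1)]))  # 1.0
```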

30
Terminating the search
  • We can terminate the search at any time, and
    choose the network with the best MDL score.
  • The longer we search the more likely it is that
    we will find the optimal network.
  • We also expect that the MDL score will tend to
    increase as we generate more structures (each
    added arc increases the size term), but this is
    just a heuristic rule

31
Causal Directions and MDL score
  • When we compute the MDL we need to have causal
    directions in our network.
  • If we have nodes connected to a known root
    (hypothesis) node we can ascribe directions.
  • Otherwise we can simply apply them in a
    consistent manner.
  • The MDL score should remain the same

32
Bayesian learning methods
  • The term Bayesian learning methods usually refers
    to the case where there are several competing
    networks for which we have some prior probability
    value.