Model Accuracy - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Model Accuracy


1
Lecture 8
  • Model Accuracy

2
Model Accuracy
  • The accuracy of a model is written as
  • P(D|B)
  • (the probability of the data given the Bayesian
    network)
  • The network B defines the joint probability of
    the model variables, and so P(D|B) is the
    product of that joint probability evaluated at
    each data case

3
Example
  • Suppose we have a Bayesian network with just two
    nodes, X and Y.
  • The joint probability is P(X,Y) = P(X)P(Y|X)
  • The model accuracy with a set of data (xi, yi) is
  • P(D|B) = ∏i P(xi) P(yi|xi)
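The product above can be sketched in a few lines of Python. The slide's own data table was an image and did not survive extraction, so the four data cases and the probability values below are hypothetical, chosen only to illustrate the calculation:

```python
# Sketch: model accuracy P(D|B) for the two-node network X -> Y.
# Data cases and probability values are hypothetical (the deck's
# actual table was lost); the calculation is the slide's formula.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (xi, yi) pairs

p_x = {0: 0.5, 1: 0.5}        # prior P(X)
p_y_given_x = {(0, 0): 0.8,   # P(Y=y | X=x), keyed as (y, x)
               (1, 0): 0.2,
               (0, 1): 0.3,
               (1, 1): 0.7}

accuracy = 1.0
for x, y in data:
    accuracy *= p_x[x] * p_y_given_x[(y, x)]

print(accuracy)  # product of P(xi) P(yi|xi) over the four cases
```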

4
Data set
  • Suppose we have the following data set and model

5
Probability of the data set
6
Consider a different model with the same data
7
Log Likelihood
  • The value of P(D|B) goes down dramatically with
    the number of data points. Consequently it is
    more common to use the log likelihood
  • log2(P(D|B))
  • Taking log2 creates an information measure
    since log2 N bits are required to represent
    integers up to N

8
Log Likelihood of our simple example
9
Model Size
  • It is not surprising that the first model
    represents the data better.
  • It uses six probability values (prior and
    conditional) rather than four
  • Thus log likelihood is not sufficient to compare
    competing models since the one with the most arcs
    will always be the most likely.

10
Number of Parameters
  • To estimate the model size, one measure is the
    number of parameters.
  • Each prior probability is a vector of m
    probability values. It can be represented by m-1
    parameters
  • (the mth probability value can be calculated from
    the others).
  • Each link matrix has n × m probability values which
    can be represented by (n-1) × m parameters.
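The counting rules above fit in two one-line functions. A minimal sketch (the function names are mine, not the deck's):

```python
# Sketch of the slide's parameter-counting rules.
def prior_params(m):
    """A prior over m values needs m - 1 parameters."""
    return m - 1

def link_matrix_params(n, m):
    """A link matrix P(child | parent) with n child values and
    m parent values needs (n - 1) * m parameters."""
    return (n - 1) * m

# Binary X -> binary Y: 1 prior parameter + 2 link parameters = 3,
# matching the connected model of the running example.
print(prior_params(2) + link_matrix_params(2, 2))  # 3
```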

11
Representing each parameter
  • As the data set increases the parameters can be
    represented more accurately.
  • We noted before that to represent integers up to
    N requires at most log2N bits.
  • Thus given N data points we expect that we need
    at most log2N bits to represent each parameter.
  • (On average we need (log2N)/2)

12
Information measure of model size
  • Putting this all together we can define a measure
    of model size
  • Size(B) = |B| (log2 N)/2
  • where |B| is the number of parameters we need to
    represent the probabilities (prior and
    conditional)
  • and N is the number of data cases used to
    calculate them
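As a sketch, the size measure is a direct transcription of the formula (the function name is mine):

```python
from math import log2

# Sketch: Size(B) = |B| * log2(N) / 2, where |B| is the number of
# parameters and N the number of data cases.
def model_size(num_params, num_cases):
    return num_params * log2(num_cases) / 2

print(model_size(3, 4))  # connected model of the example: 3.0
print(model_size(2, 4))  # independent model of the example: 2.0
```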

13
Small is Beautiful
  • The minimum description length principle states
    that the best model to represent a set of data is
    the smallest that will produce the required
    accuracy.
  • This suggests a metric as follows
  • MDLScore = ModelSize - ModelAccuracy

14
The Minimum Description Length measure
  • Combining the two we get
  • MDL(B|D) = |B| (log2 N)/2 - log2(P(D|B))
  • We can now assert that between competing models
    the best one has the lowest MDL score.
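The combined score is again a one-liner. A sketch (function and argument names are mine; the numbers in the usage line come from the second example later in the deck, where N = 8 and -log2 P(D|B) = 11.25 for the connected model):

```python
from math import log2

# Sketch: MDL(B|D) = |B| * log2(N)/2 - log2(P(D|B)).
def mdl_score(num_params, num_cases, p_data_given_model):
    return num_params * log2(num_cases) / 2 - log2(p_data_given_model)

# N = 8 cases, 3 parameters, -log2 P(D|B) = 11.25:
# score = 4.5 + 11.25 = 15.75
print(mdl_score(3, 8, 2 ** -11.25))
```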

15
Calculating the model size for our example
  • Number of data points N = 4
  • (log2 N)/2 = 1
  • Number of parameters |B| = 3
  • For example, we can construct the prior
    probability of X from P(x0) and the link matrix
    from P(y0|x0) and P(y0|x1).
  • Thus model size = 3 × 1 = 3

16
Calculating the model size for our example 2
  • Number of data points N = 4
  • (log2 N)/2 = 1
  • Number of parameters |B| = 2
  • For example, we can construct the prior
    probability of X from P(x0) and the prior
    probability of Y from P(y0)
  • Thus model size = |B| (log2 N)/2 = 2 × 1 = 2

17
Comparing the MDL scores
18
Problem break
  • Given a completely independent data set of two
    variables, what does the MDL tell us about the
    preferred network?

19
Solution
  • We have that
  • P(x0) = P(x1) = P(y0) = P(y1) = 0.5
  • P(y0|x0) = P(y0|x1) = P(y1|x0) = P(y1|x1) = 0.5
  • So the model accuracy is the same for both
    networks.
  • However, the model size will be smaller for the
    independent network, so it will be preferred.

20
Second Example
  • Suppose now we have a different data set

21
Second Example 2
  • Choosing the independent model

22
Second Example 3
  • The model size this time is bigger, reflecting
    the fact that we have twice as many points
    (N = 8, so (log2 N)/2 = 3/2)
  • Connected: |B| (log2 N)/2 = 3 × 3/2 = 4.5
  • Independent: |B| (log2 N)/2 = 2 × 3/2 = 3

23
Second Example 4
  • The data this time had a very strong correlation
    between X and Y and the connected model fits the
    data much better.
  • MDL(Connected) = 4.5 + 11.25 = 15.75
  • MDL(Independent) = 3 + 15.6 = 18.6
  • The MDL metric prefers the connected model
    despite its greater size.
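The comparison can be checked directly. A sketch in Python, where N = 8 and the -log2 P(D|B) values 11.25 and 15.6 are taken from the slides:

```python
from math import log2

# Sketch: MDL comparison for the second example (N = 8 data cases).
# The log-likelihood magnitudes 11.25 and 15.6 come from the deck.
n = 8
mdl_connected = 3 * log2(n) / 2 + 11.25    # 4.5 + 11.25 = 15.75
mdl_independent = 2 * log2(n) / 2 + 15.6   # 3 + 15.6 = 18.6

print(mdl_connected)
print(mdl_independent)
print(mdl_connected < mdl_independent)  # connected model preferred
```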

24
Search for the best network
  • Having defined a metric on the network and the
    data, we can choose between competing models.
  • This leads to algorithms based on searching the
    space of possible models.

25
Exhaustive Search
  • Let us suppose we have four variables A,B,C and
    D.
  • We start with the independent network (no arcs)
    and make a list of possible arcs.
  • There is just 1 network
  • There are 4C2 = 6 possible arcs
  • A-B, A-C, A-D, B-C, B-D, C-D

26
Exhaustive search
  • Now we make all possible networks with one arc.
  • There are 6 possible networks with one arc,
    each with five missing arcs.
  • From each of these we can make five networks with
    two arcs, each having four possible new arcs
  • We continue until three arcs have been added so
    that the network is connected

27
Size of the search tree
  • From the above we can see that the size of the
    search tree is 6 × 5 × 4 + 1, which we write as
    6!/3! + 1
  • in general if we write the number of possible
    arcs as Cn = n!/(2!(n-2)!) the search tree will
    have
  • Cn!/(Cn-(n-1))! + 1 nodes
  • for four nodes this is 121 and exhaustive search
    is possible.
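The formula can be checked mechanically. A sketch (function name is mine) that reproduces the counts quoted here and on the next slide:

```python
from math import comb, factorial

# Sketch: search-tree size Cn!/(Cn - (n-1))! + 1, with Cn = n choose 2.
# Arcs are added one at a time until the n-1 arcs of a connected
# network are in place.
def tree_size(n):
    c = comb(n, 2)  # Cn, the number of possible arcs
    return factorial(c) // factorial(c - (n - 1)) + 1

print(tree_size(4))  # 6 * 5 * 4 + 1 = 121
print(tree_size(5))  # 5041
print(tree_size(6))  # 360361
```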

28
Limiting the search
  • The search tree grows very fast: for 5 variables
    it has 5041 nodes and for 6 variables 360361
    nodes
  • We could eliminate from the tree any network that
    is not singly connected.
  • Even so, the tree for more than five nodes will
    be too large to compute the parameters for a
    large data set.

29
Depth first search
  • Search methods therefore usually proceed depth
    first.
  • At each stage we choose the network with the
    lowest MDL score and add the arc with the largest
    mutual information.
  • This strategy will help to direct the search
    towards networks that fit the data well.
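The deck gives no code for the arc-ranking step, so the following is a minimal sketch of empirical mutual information between two variables, the quantity used to choose the next arc; the data set in the usage line is hypothetical:

```python
from math import log2

# Sketch: empirical mutual information between two variables,
# used here to rank candidate arcs. The data set is hypothetical.
def mutual_information(pairs):
    n = len(pairs)
    px, py, pxy = {}, {}, {}
    for x, y in pairs:                      # count marginals and joints
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
        pxy[(x, y)] = pxy.get((x, y), 0) + 1
    mi = 0.0
    for (x, y), count in pxy.items():       # sum p * log2(p / (px * py))
        p = count / n
        mi += p * log2(p / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly correlated binary variables share 1 bit of information.
print(mutual_information([(0, 0), (1, 1), (0, 0), (1, 1)]))  # 1.0
```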

30
Terminating the search
  • We can terminate the search at any time, and
    choose the network with the best MDL score.
  • The longer we search the more likely it is that
    we will find the optimal network.
  • We also expect that the MDL score will tend to
    increase as we generate more structures (each
    added arc increases the size term), but this is
    just a heuristic rule

31
Causal Directions and MDL score
  • When we compute the MDL we need to have causal
    directions in our network.
  • If we have nodes connected to a known root
    (hypothesis) node we can ascribe directions.
  • Otherwise we can simply apply them in a
    consistent manner.
  • The MDL score should remain the same

32
Bayesian learning methods
  • The term Bayesian learning methods usually refers
    to the case where there are several competing
    networks for which we have some prior probability
    value.