Title: Computer Science Department
1. Bayesian Learning
2. Bayesian Learning
- Probabilistic approach to inference
- Assumption: quantities of interest are governed by probability distributions
- Optimal decisions can be made by reasoning about probabilities together with observations
- Provides a quantitative approach to weighing how evidence supports alternative hypotheses
3. Why is Bayesian Learning Important?
- Some Bayesian approaches (like naive Bayes) are very practical learning methods, competitive with other approaches
- Provides a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities
4. Important Features
- The model is incrementally updated with training examples
- Prior knowledge can be combined with observed data to determine the final probability of a hypothesis, by
  - asserting a prior probability for each candidate hypothesis
  - asserting a probability distribution over observations for each hypothesis
- Can accommodate methods that make probabilistic predictions
- New instances can be classified by combining the predictions of multiple hypotheses
- Can provide a gold standard for evaluating hypotheses
5. Practical Problems
- Typically require initial knowledge of many probabilities, which can be estimated from
  - background knowledge
  - previously available data
  - assumptions about the distribution
- Significant computational cost of determining the Bayes optimal hypothesis
  - linear in the number of hypotheses in the general case
  - significantly lower in certain special situations
6. Bayes Theorem
- Goal: learn the best hypothesis
- Assumption in Bayesian learning: the best hypothesis is the most probable hypothesis
- Bayes theorem allows computation of the most probable hypothesis based on
  - the prior probability of the hypothesis
  - the probability of observing certain data given the hypothesis
  - the observed data itself
7. Notation
- P(h): prior probability of hypothesis h
- P(D): prior probability of data D
- P(D|h): probability of D given h
  - the likelihood of the data D given h
- P(h|D): probability that h holds given the data D
  - the posterior probability of h
8. Bayes Theorem
- Relates P(D|h) and P(h|D):

  P(h|D) = P(D|h) P(h) / P(D)
9. Maximum A Posteriori Hypothesis
- Many learning algorithms try to identify the most probable hypothesis h ∈ H given the observations D
- This is the maximum a posteriori (MAP) hypothesis
10. Identifying the MAP Hypothesis using Bayes Theorem
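The slide's equations are not in this transcript; a reconstruction of the standard formulation:

$$
h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)
$$

P(D) can be dropped because it is the same for every hypothesis.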
11. Equally Probable Hypotheses
- If every hypothesis in H is assumed equally probable a priori, only P(D|h) needs to be considered
- Any hypothesis that maximizes P(D|h) is a maximum likelihood (ML) hypothesis
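In symbols (standard notation, added here since the slide's equation is missing):

$$
h_{ML} \equiv \arg\max_{h \in H} P(D \mid h)
$$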
12. Bayes Theorem and Concept Learning
- Concept learning task
  - H: hypothesis space
  - X: instance space
  - c: X → {0, 1} (target concept)
13. Brute-Force MAP Learning Algorithm
- For each hypothesis h in H, calculate the posterior probability P(h|D)
- Output the hypothesis with the highest posterior probability
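A minimal Python sketch of the brute-force algorithm, assuming the caller supplies functions prior(h) and likelihood(data, h) (hypothetical names, not from the slides):

```python
def brute_force_map(hypotheses, data, prior, likelihood):
    """Return the MAP hypothesis by scoring every h in H.

    prior(h)            -> P(h)
    likelihood(data, h) -> P(D|h)
    P(D) is constant across hypotheses, so the unnormalized
    posterior P(D|h) * P(h) suffices for the argmax.
    """
    best_h, best_score = None, float("-inf")
    for h in hypotheses:
        score = likelihood(data, h) * prior(h)   # proportional to P(h|D)
        if score > best_score:
            best_h, best_score = h, score
    return best_h
```

As the slide notes, this is linear in |H|, which is what makes the brute-force approach impractical for large hypothesis spaces.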
14. To Apply Brute-Force MAP Learning
- Specify P(h)
- Specify P(D|h)
15. An Example
- Assume
  - the training data D is noise free (di = c(xi))
  - the target concept c is contained in H
  - we have no a priori reason to believe one hypothesis is more likely than any other
16. Probability of Data Given Hypothesis
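The slide's equations are missing from the transcript; under the three assumptions of the example, the standard choices are:

$$
P(h) = \frac{1}{|H|} \ \text{ for all } h \in H,
\qquad
P(D \mid h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \text{ for every } d_i \in D \\ 0 & \text{otherwise} \end{cases}
$$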
17. Apply the Algorithm
- Step 1 (two cases)
  - Case 1: D is inconsistent with h
  - Case 2: D is consistent with h
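Filling in the standard result for each case (reconstructed), where $VS_{H,D}$ is the version space, i.e. the set of hypotheses in H consistent with D:

$$
\text{Case 1: } P(h \mid D) = \frac{0 \cdot \frac{1}{|H|}}{P(D)} = 0
\qquad
\text{Case 2: } P(h \mid D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)}
             = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}}
             = \frac{1}{|VS_{H,D}|}
$$

using $P(D) = \sum_{h \in H} P(D \mid h)\, P(h) = |VS_{H,D}| / |H|$.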
18. Step 2
- Every consistent hypothesis has posterior probability 1/|VSH,D|
- Every inconsistent hypothesis has posterior probability 0
19. MAP Hypothesis and Consistent Learners
- Under these assumptions, every consistent learner outputs a MAP hypothesis
- FIND-S (finds the maximally specific consistent hypothesis)
- Candidate-Elimination (finds all consistent hypotheses)
20. Maximum Likelihood and Least-Squared Error Learning
- New problem: learning a continuous-valued target function
- Will show that, under certain assumptions, any learning algorithm that minimizes the squared error between its hypothesis's predictions and the training data outputs a maximum likelihood hypothesis
21. Problem Setting
- Learner L
- Instance space X
- Hypothesis space H: each h : X → R
- The task of L is to learn an unknown target function f : X → R
- Have m training examples
- The target value of each example is corrupted by random noise drawn from a Normal distribution
22. Work Through the Derivation
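A reconstruction of the derivation (the slide's equations are not in the transcript), assuming each observed target value is $d_i = f(x_i) + e_i$ with noise $e_i \sim N(0, \sigma^2)$:

$$
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} p(D \mid h)
        = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}
          \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right) \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2}
        = \arg\min_{h \in H} \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2
\end{aligned}
$$

Taking the log is monotone and the constants do not depend on h, so maximizing the likelihood is exactly minimizing the squared error.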
23. Why a Normal Distribution for Noise?
- It is easy to work with
- It is a good approximation of many physical processes
- Important point: we are only dealing with noise in the target function, not in the attribute values
24. Bayes Optimal Classifier
- Two questions
  - What is the most probable hypothesis given the training data? (Find the MAP hypothesis)
  - What is the most probable classification given the training data?
25. Example
- Three hypotheses
  - P(h1|D) = 0.35
  - P(h2|D) = 0.45
  - P(h3|D) = 0.20
- New instance x
  - h1 predicts negative
  - h2 predicts positive
  - h3 predicts negative
- What is the predicted class using hMAP?
- What is the predicted class using all hypotheses?
26. Bayes Optimal Classification
- The most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
- Suppose the set of possible classifications is V (each possible value is vj)
- The probability that vj is the correct classification for the new instance is given by the formula below
- Pick the vj with the maximum probability as the predicted class
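The defining formula (the standard Bayes optimal classifier, reconstructed here):

$$
P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D),
\qquad
v_{OB} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)
$$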
27. Bayes Optimal Classifier
- Apply this to the previous example
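Working the example from slide 25 (the computation itself is not in the transcript): hMAP = h2, so the MAP prediction is positive. Combining all hypotheses,

$$
P(+ \mid D) = 0.45, \qquad P(- \mid D) = 0.35 + 0.20 = 0.55,
$$

so the Bayes optimal classification is negative, which differs from the MAP hypothesis's prediction.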
28. Bayes Optimal Classification
- Gives the optimal error-minimizing solution to prediction and classification problems
- Requires the probability of the exact combination of evidence
- All classification methods can be viewed as approximations of Bayes rule with varying assumptions about the conditional probabilities
  - assume they come from some distribution
  - assume conditional independence
  - assume an underlying model of a specific form (linear combination of evidence, decision tree)
29. Simplifications of Bayes Rule
- Given observations of attribute values a1, a2, ..., an, compute the most probable target value vMAP
- Use Bayes theorem to rewrite it (see below)
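The rewrite (standard notation, reconstructed since the slide's equation is missing):

$$
v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)
        = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)
$$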
30. Naïve Bayes
- The most common simplification of Bayes rule is to assume conditional independence of the observations
  - because it is approximately true
  - because it is computationally convenient
- Assume the probability of observing the conjunction a1, a2, ..., an is the product of the probabilities of the individual attributes (see the formula below)
- Learning consists of estimating these probabilities
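The resulting classifier (the standard naïve Bayes rule):

$$
v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)
$$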
31. Simple Example
- Two classes, C1 and C2
- Two features
  - a1: Male, Female
  - a2: Blue eyes, Brown eyes
- Instance (Male with blue eyes): what is the class?

  Probability        C1    C2
  P(Ci)              0.4   0.6
  P(Male | Ci)       0.1   0.2
  P(BlueEyes | Ci)   0.3   0.2
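Working this out with the naïve Bayes rule above (scores are proportional to the posteriors):

$$
\text{score}(C_1) = 0.4 \times 0.1 \times 0.3 = 0.012,
\qquad
\text{score}(C_2) = 0.6 \times 0.2 \times 0.2 = 0.024,
$$

so the instance (Male, Blue eyes) is classified as C2.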
32. Estimating Probabilities (Classifying Executables)
- Two classes (Malicious, Benign)
- Features
  - a1: GUI present (yes/no)
  - a2: Deletes files (yes/no)
  - a3: Allocates memory (yes/no)
  - a4: Length (< 1K, 1-10K, > 10K)
33.

  Instance   a1    a2    a3    a4    Class
  1          Yes   No    No    Yes   B
  2          Yes   No    No    No    B
  3          No    Yes   Yes   No    M
  4          No    No    Yes   Yes   M
  5          Yes   No    No    Yes   B
  6          Yes   No    No    No    M
  7          Yes   Yes   Yes   No    M
  8          Yes   Yes   No    Yes   M
  9          No    No    No    Yes   B
  10         No    No    Yes   No    M
34. Classify the Following Instance
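A Python sketch of naïve Bayes over the table on slide 33. The query instance at the bottom is hypothetical (the slide's instance is not in the transcript), and the probabilities use the simple nc/n estimates discussed on the next slide:

```python
from collections import Counter, defaultdict

# Training data from slide 33: (a1, a2, a3, a4) -> class
data = [
    (("Yes", "No", "No", "Yes"), "B"),
    (("Yes", "No", "No", "No"), "B"),
    (("No", "Yes", "Yes", "No"), "M"),
    (("No", "No", "Yes", "Yes"), "M"),
    (("Yes", "No", "No", "Yes"), "B"),
    (("Yes", "No", "No", "No"), "M"),
    (("Yes", "Yes", "Yes", "No"), "M"),
    (("Yes", "Yes", "No", "Yes"), "M"),
    (("No", "No", "No", "Yes"), "B"),
    (("No", "No", "Yes", "No"), "M"),
]

class_counts = Counter(label for _, label in data)
# value_counts[label][i][value] = number of examples of class `label`
# whose i-th attribute equals `value`
value_counts = defaultdict(lambda: defaultdict(Counter))
for attrs, label in data:
    for i, value in enumerate(attrs):
        value_counts[label][i][value] += 1

def classify(instance):
    """Return the class maximizing P(v) * prod_i P(a_i | v), with nc/n estimates."""
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / len(data)                            # P(v)
        for i, value in enumerate(instance):
            score *= value_counts[label][i][value] / n   # P(a_i | v); 0 if never seen
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical query instance (not the one from the slide)
print(classify(("Yes", "No", "Yes", "No")))
```

Note that an attribute value never observed for a class drives that class's score to 0, which is exactly the estimation problem the next slides address.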
35. Estimating Probabilities
- To estimate P(C|D)
  - let n be the number of training examples labeled D
  - let nc be the number labeled D that are also labeled C
  - P(C|D) was estimated as nc/n
- Problems
  - this is a biased underestimate of the probability
  - when the estimate is 0, it dominates all the other terms
36. Use the m-Estimate of Probability
- p is the prior estimate of the probability we are trying to determine (often assume attribute values are equally probable)
- m is a constant called the equivalent sample size; view it as augmenting the n actual observations with m virtual samples distributed according to p
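The m-estimate itself (the standard formula, reconstructed since it is missing from the transcript):

$$
P(C \mid D) \approx \frac{n_c + m\,p}{n + m}
$$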
37. Repeat the Estimates
- Use equal priors for the attribute values
- Use an m value of 1
38. Bayesian Belief Networks
- Naïve Bayes is based on the assumption of conditional independence
- Bayesian networks provide a tractable method for specifying dependencies among variables
39. Terminology
- A Bayesian belief network describes the probability distribution over a set of random variables Y1, Y2, ..., Yn
- Each variable Yi can take on the set of values V(Yi)
- The joint space of the set of variables Y is the cross product V(Y1) × V(Y2) × ... × V(Yn)
- Each item in the joint space corresponds to one possible assignment of values to the tuple of variables <Y1, ..., Yn>
- The joint probability distribution specifies the probabilities of the items in the joint space
- A Bayesian network provides a way to describe the joint probability distribution in a compact manner
40. Conditional Independence
- Let X, Y, and Z be three discrete-valued random variables
- We say that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z
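In symbols (the standard definition):

$$
P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)
\quad \text{for all } x_i \in V(X),\ y_j \in V(Y),\ z_k \in V(Z)
$$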
41. Bayesian Belief Network
- A set of random variables makes up the nodes of the network
- A set of directed links or arrows connects pairs of nodes; the intuitive meaning of an arrow from X to Y is that X has a direct influence on Y
- Each node has a conditional probability table (CPT) that quantifies the effects the parents have on the node; the parents of a node are all those nodes with arrows pointing to it
- The graph has no directed cycles (it is a DAG)
42. Example (from Judea Pearl)
- You have a new burglar alarm installed at home.
It is fairly reliable at detecting a burglary,
but also responds on occasion to minor
earthquakes. You also have two neighbors, John
and Mary, who have promised to call you at work
when they hear the alarm. John always calls when
he hears the alarm, but sometimes confuses the
telephone ringing with the alarm and calls then,
too. Mary, on the other hand, likes rather loud
music and sometimes misses the alarm altogether.
Given the evidence of who has or has not called,
we would like to estimate the probability of a
burglary.
43. Step 1
- Determine what the propositional (random) variables should be
- Determine causal (or other influence) relationships and develop the topology of the network
44. Topology of the Belief Network
- [Figure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]
45. Step 2
- Specify a conditional probability table (CPT) for each node
- Each row in the table contains the conditional probability of each node value for a conditioning case (a possible combination of values for the parent nodes)
- In the example, the possible values for each node are true/false
- The probabilities for the values of a node given a particular conditioning case sum to 1
46. Example: CPT for the Alarm Node

  Burglary  Earthquake  P(Alarm=True | B,E)  P(Alarm=False | B,E)
  True      True        0.950                0.050
  True      False       0.940                0.060
  False     True        0.290                0.710
  False     False       0.001                0.999
47. Complete Belief Network
- Burglary: P(B) = 0.001    Earthquake: P(E) = 0.002
- Alarm CPT, P(A | B, E):
    B=T, E=T: 0.95    B=T, E=F: 0.94    B=F, E=T: 0.29    B=F, E=F: 0.001
- JohnCalls CPT, P(J | A):  A=T: 0.90    A=F: 0.05
- MaryCalls CPT, P(M | A):  A=T: 0.70    A=F: 0.01
48. Semantics of Belief Networks
- View 1: a belief network is a representation of the joint probability distribution (the "joint") of a domain
- The joint completely specifies an agent's probability assignments to all propositions in the domain (both simple and complex)
49. Network as a Representation of the Joint
- A generic entry in the joint probability distribution is the probability of a conjunction of particular assignments to each variable, such as P(Y1 = y1 ∧ ... ∧ Yn = yn)
- Each entry in the joint is the product of the appropriate elements of the CPTs in the belief network (see the formula below)
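The factorization (the standard belief-network semantics, reconstructed here):

$$
P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P\big(y_i \mid \mathrm{Parents}(Y_i)\big)
$$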
50. Example Calculation
- Calculate the probability of the event that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both John and Mary call:
    P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
      = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
      = 0.90 × 0.70 × 0.001 × 0.999 × 0.998
      ≈ 0.00062
51. Semantics
- View 2: an encoding of a collection of conditional independence statements
  - e.g., JohnCalls is conditionally independent of the other variables in the network given the value of Alarm
- This view is useful for understanding inference procedures for the networks
52. Inference Methods for Bayesian Networks
- We may want to infer the value of some target variable (e.g., Burglary) given observed values for other variables
- What we generally want is the probability distribution over the target variable
- Inference is straightforward if all other values in the network are known
- In the more general case, if we know a subset of the variables' values, we can infer a probability distribution over the other variables
  - this is an NP-hard problem
  - but approximations work well
53. Learning Bayesian Belief Networks
- The focus of a great deal of research
- Several situations of varying complexity
  - the network structure may be given or not
  - all variables may be observable, or some variables may be unobservable
- If the network structure is known and all variables can be observed, the CPTs can be computed as they were for naïve Bayes
54. Gradient Ascent Training of Bayesian Networks
- Method developed by Russell et al.
- Maximizes P(D|h) by following the gradient of ln P(D|h)
- Let wijk be a single CPT entry: the probability that variable Yi takes on value yij given that its immediate parents Ui take on the values given by uik
55. Illustration
- [Figure: a parent node Ui = uik with an arrow to node Yi = yij, labeled wijk = P(Yi = yij | Ui = uik)]
56. Result
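The gradient and the resulting update rule (reconstructed; this is the standard result for this method), where d ranges over the training examples and η is a small learning-rate constant:

$$
\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}}
  = \sum_{d \in D} \frac{P(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}},
\qquad
w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P(y_{ij},\, u_{ik} \mid d)}{w_{ijk}}
$$

After each update the wijk are renormalized so that each conditional distribution remains a valid probability distribution.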
57. Example
- In the burglary network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls), to update the CPT entry for P(A|B,E) we would need P(A, B, E | d) for each training example d
58. EM Algorithm
- The EM algorithm is a general-purpose algorithm used in many settings, including
  - unsupervised learning
  - learning CPTs for Bayesian networks
  - learning hidden Markov models
- A two-step algorithm for learning with hidden variables
59. Two-Step Process
- For a specific problem we have three quantities
  - X: observed data for the instances
  - Z: unobserved data for the instances (this is usually what we are trying to learn)
  - Y: the full data (X plus Z)
- General approach
  - determine an initial hypothesis for the values governing Z
  - Step 1 (Estimation): compute a function Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y
  - Step 2 (Maximization): replace h by the hypothesis h' that maximizes the Q function
60. K-Means Algorithm
- Assume the data come from two Gaussian distributions; the means (μ1, μ2) are unknown
- [Figure: the mixture density P(x) plotted over x]
61. Generation of Data
- Select one of the normal distributions at random
- Generate a single random instance xi using that distribution
62. Example: Select Initial Values for h
- h = <μ1, μ2>
- [Figure: data points plotted with the initial estimates μ1 and μ2 marked]
63. E-Step: Compute the Probability That Datum xi Was Generated by Each Component
- h = <μ1, μ2>
- see the formula below
- [Figure: data points plotted with the current estimates μ1 and μ2 marked]
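A reconstruction of the standard E-step for this two-Gaussian setting: the expected value of the hidden indicator zij (1 if component j generated xi, else 0), assuming a shared, known variance σ²:

$$
E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)}
          = \frac{\exp\!\big(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\big)}
                 {\sum_{n=1}^{2} \exp\!\big(-\frac{1}{2\sigma^2}(x_i - \mu_n)^2\big)}
$$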
64. M-Step: Replace Hypothesis h with the h' That Maximizes Q
- h = <μ1, μ2>
- [Figure: data points plotted with the updated estimates of μ1 and μ2 marked]
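The corresponding M-step update (reconstructed): each mean is re-estimated as the weighted average of the data, using the E-step expectations as weights:

$$
\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}
$$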