Title: Machine Learning
1 Machine Learning
- Probability and Bayesian Networks
2 An Introduction
- Bayesian Decision Theory came long before Version
Spaces, Decision Tree Learning and Neural
Networks. It was studied in the field of
Statistical Theory and more specifically, in the
field of Pattern Recognition.
3 An Introduction
- Bayesian Decision Theory is at the basis of
important learning schemes such as:
- the Naïve Bayes Classifier
- Bayesian Belief Networks
- the EM Algorithm
- Bayesian Decision Theory is also useful because it
provides a framework within which many
non-Bayesian classifiers can be studied
- See Mitchell, Sections 6.3, 6.4, 6.5, and 6.6.
4 Discrete Random Variables
- A is a Boolean random variable if it denotes an
event where there is uncertainty about whether it
occurs
- Examples:
- The next US president will be Barack Obama
- You will get an A in the course
- P(A) = the probability of A = the fraction of all
possible worlds where A is true
5 Visualizing P(A)
[Figure: the space of all possible worlds, with the subset of worlds where A is true shaded]
6 Axioms of Probability
- Let there be a space S composed of a countable
number of events
- The probability of each event is between 0 and 1
- The probability of the whole sample space is 1
- When two events are mutually exclusive, their
probabilities are additive
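Stated symbolically (this rendering is added here; the slide gives the axioms only in words):
\[
0 \le P(A) \le 1, \qquad P(S) = 1, \qquad P(A \vee B) = P(A) + P(B) \ \text{when } A \text{ and } B \text{ are mutually exclusive.}
\]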
7 Visualizing Two Boolean RVs
[Figure: two overlapping regions of possible worlds, one where A is true and one where B is true]
8 Conditional Probability
The conditional probability of A given B is
represented by the following formula:
P(A | B) = P(A ∧ B) / P(B)
[Figure: Venn diagram of the overlapping regions for A and B]
P(A | B) = P(A) only if A and B are independent
9 Independence
- Variables A and B are said to be independent if
knowing the value of A gives you no knowledge
about the likelihood of B, and vice versa:
- P(A | B) = P(A) and P(B | A) = P(B)
10 An Example: Cards
- Take a standard deck of 52 cards.
- On the first draw I pull the Ace of Spades.
- I don't replace the card.
- What is the probability I'll pull the Ace of
Spades on the second draw?
- Now, I replace the Ace after the 1st draw,
shuffle, and draw again.
- What is the chance I'll draw the Ace of Spades on
the 2nd draw?
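The answers, worked out here for completeness (they are not printed on the slide), follow directly from counting:
\[
P(\text{Ace of Spades on 2nd draw} \mid \text{no replacement}) = \frac{0}{51} = 0,
\qquad
P(\text{Ace of Spades on 2nd draw} \mid \text{replacement}) = \frac{1}{52}
\]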
11 Discrete Random Variables
- A is a discrete random variable if it takes a
countable number of distinct values
- Examples:
- Your grade G in the course
- The number of heads k in n coin flips
- P(A = k) = the fraction of all possible worlds
where A equals k
- Notation: P_D(A = k) = the probability relative to a
distribution D
- e.g. P_fair grading(G = A), P_cheating(G = A)
12 Bayes Theorem
- Definition of Conditional Probability
- Corollary
- The Chain Rule
- Bayes Rule
- (Thomas Bayes, 1763)
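The formulas these bullets refer to (the slide's rendered equations are not reproduced in this transcript) are the standard ones:
\[
P(A \mid B) = \frac{P(A \wedge B)}{P(B)} \qquad \text{(definition of conditional probability)}
\]
\[
P(A \wedge B) = P(A \mid B)\,P(B) \qquad \text{(corollary)}
\]
\[
P(A_1, \ldots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1}) \qquad \text{(chain rule)}
\]
\[
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} \qquad \text{(Bayes rule)}
\]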
13 ML in a Bayesian Framework
- Any ML technique can be expressed as reasoning
about probabilities
- Goal: find the hypothesis h that is most probable
given training data D
- Provides a more explicit way of describing /
encoding our assumptions
14 Some Definitions
- Prior probability of h, P(h)
- the background knowledge we have about the chance
that h is a correct hypothesis (before having
observed the data).
- Prior probability of D, P(D)
- the probability that training data D will be
observed given no knowledge about which
hypothesis h holds.
- Conditional probability of D, P(D | h)
- the probability of observing data D given that
hypothesis h holds.
- Posterior probability of h, P(h | D)
- the probability that h is true, given the
observed training data D.
- the quantity that Machine Learning researchers
are interested in.
15 Maximum A Posteriori (MAP)
- Goal: to find the most probable hypothesis h
from a set of candidate hypotheses H given the
observed data D.
- MAP hypothesis, hMAP
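In the usual notation (the slide's equation image is not reproduced here):
\[
h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\,P(h)
\]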
16 Maximum Likelihood (ML)
- The ML hypothesis is a special case of the MAP
hypothesis, where all hypotheses are equally
likely to begin with
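With a uniform prior P(h), the P(h) factor in the MAP expression is constant and drops out, leaving:
\[
h_{ML} = \arg\max_{h \in H} P(D \mid h)
\]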
17 Example: Brute Force MAP Learning
- Assumptions
- The training data D is noise-free
- The target concept c is in the hypothesis set H
- All hypotheses are equally likely
- Choice: the probability of D given h, P(D | h), is 1
if h is consistent with D and 0 otherwise
18 Brute Force MAP (continued)
Apply Bayes Theorem, given our assumptions.
VS_H,D is the version space (the set of hypotheses in H consistent with D).
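Carrying out that application (the slide's derivation images are not reproduced; this is the standard result under the three assumptions above):
\[
P(h \mid D) =
\begin{cases}
\dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\
0 & \text{otherwise}
\end{cases}
\]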
19 Find-S as MAP Learning
- We can characterize the FIND-S learner (chapter
2) in Bayesian terms
- Again, P(D | h) is 1 if h is consistent with D, and
0 otherwise
- P(h) increases with the specificity of h
- Then the MAP hypothesis = the output of Find-S
20 Neural Nets in a Bayesian Framework
- Under certain assumptions regarding noise in the
data, minimizing the mean squared error (what
multilayer perceptrons do) corresponds to
computing the maximum likelihood hypothesis.
21 Least Squared Error ML
Assume e is drawn from a normal distribution
22 Least Squared Error ML
23 Least Squared Error ML
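The derivation these three slides walk through (their equation images are not reproduced here) is the standard one: model each training value as the target value plus Gaussian noise, d_i = f(x_i) + e_i with e_i ~ N(0, sigma^2); then
\[
h_{ML} = \arg\max_{h} \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\Big)
       = \arg\max_{h} \sum_i -\frac{(d_i - h(x_i))^2}{2\sigma^2}
       = \arg\min_{h} \sum_i (d_i - h(x_i))^2
\]
so the maximum likelihood hypothesis is the one that minimizes the sum of squared errors.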
24 Decision Trees in a Bayes Framework
- A decent choice for P(h): simpler hypotheses have
higher probability (Occam's razor)
- This can be encoded in terms of finding the
Minimum Description Length encoding
- Provides a way to trade off hypothesis size for
training error
- Potentially prevents overfitting
25 Most Compact Coding
- Let's minimize the bits used to encode a message
- Idea:
- Assign shorter codes to more probable messages
- According to Shannon & Weaver, an optimal code
assigns -log2 P(i) bits to encode item i
- thus, the most probable items get the shortest codes
26 Minimum Description Length (MDL)
27 Minimum Description Length (MDL)
28 Minimum Description Length (MDL)
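The formulation behind these slides (their equations are not reproduced in the transcript; this is the usual statement, as in Mitchell Sec. 6.6):
\[
h_{MDL} = \arg\min_{h \in H} \; L_{C_1}(h) + L_{C_2}(D \mid h)
\]
where L_C(x) denotes the description length of x under coding scheme C.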
29 What does all that mean?
- The optimal hypothesis is the one that is the
smallest when we count:
- How long the hypothesis description must be
- How long the data description must be, given the
hypothesis
- Key idea: since we're given h, we need only
encode h's mistakes
30 What does all that mean?
- If the hypothesis is perfect, we don't need to
encode any data.
- For each misclassification, we must:
- Say which item is misclassified
- Takes log2 m bits, where m = the size of the dataset
- Say what the right classification is
- Takes log2 k bits, where k = the number of classes
31 The best MDL hypothesis
- The best hypothesis is the best tradeoff between
- Complexity of the hypothesis description
- Number of times we have to tell people where it
screwed up.
32 Is MDL always MAP?
- Only given significant assumptions
- If we know a representation scheme such that the size
of h in H is -log2 P(h)
- Likewise, the size of the exception
representation must be -log2 P(D | h)
- THEN: MDL = MAP
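Spelled out (added here for completeness), the equivalence is just a log transform of the MAP criterion:
\[
h_{MAP} = \arg\max_{h} P(D \mid h)\,P(h)
        = \arg\min_{h} \big(-\log_2 P(D \mid h) - \log_2 P(h)\big)
\]
which is exactly the MDL criterion when the two code lengths are -log2 P(h) and -log2 P(D | h).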
33 Making Predictions
- The reason we learned h to begin with
- Does it make sense to choose just one h?
h1: Looks matter
h2: Money matters
h3: Ideas matter
Obama Elected President
We want a prediction: yes or no?
34 Maximum A Posteriori (MAP)
- Find the most probable hypothesis
- Use the predictions of that hypothesis
h1: Looks matter
h2: Money matters
h3: Ideas matter
... but do we really want to ignore the other
hypotheses? Imagine 8 hypotheses. Seven of them
say "yes" and have a posterior probability of 0.1 each.
One says "no" and has a posterior probability of 0.3. Who
do you believe?
35 Bayes Optimal Classifier
- Bayes Optimal Classification: the most probable
classification of a new instance is obtained by
combining the predictions of all hypotheses,
weighted by their posterior probabilities:
- v_OB = argmax_{v in V} Σ_{h in H} P(v | h) P(h | D)
- where V is the set of all the values a
classification can take and v is one possible
such classification.
- No other method using the same H and prior
knowledge is better (on average).
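As a concrete (if toy) illustration, here is a minimal Python sketch, not from the original slides, that contrasts MAP classification with Bayes optimal classification using the eight-hypothesis example from the previous slide; the hypothesis names and their deterministic yes/no votes are assumptions made for the example.

```python
# Posteriors P(h | D) from the slide: seven hypotheses at 0.1, one at 0.3.
posteriors = {
    "h1": 0.1, "h2": 0.1, "h3": 0.1, "h4": 0.1,
    "h5": 0.1, "h6": 0.1, "h7": 0.1, "h8": 0.3,
}
# The (assumed deterministic) classification each hypothesis assigns to the new instance.
predictions = {
    "h1": "yes", "h2": "yes", "h3": "yes", "h4": "yes",
    "h5": "yes", "h6": "yes", "h7": "yes", "h8": "no",
}

# MAP: pick the single most probable hypothesis and use its prediction.
h_map = max(posteriors, key=posteriors.get)
print("MAP hypothesis:", h_map, "->", predictions[h_map])      # h8 -> no

# Bayes optimal: weight every hypothesis's vote by its posterior.
votes = {}
for h, p in posteriors.items():
    votes[predictions[h]] = votes.get(predictions[h], 0.0) + p
print("Bayes optimal:", max(votes, key=votes.get), votes)       # yes, {'yes': 0.7, 'no': 0.3}
```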
36 Naïve Bayes Classifier
- Unfortunately, the Bayes Optimal Classifier is
usually too costly to apply! => the Naïve Bayes
Classifier
- We'll be seeing more of these
37 The Joint Distribution
- Make a truth table listing all combinations of
variable values
- Assign a probability to each row
- Make sure the probabilities sum to 1
38 Using The Joint Distribution
- Find P(A)
- Sum the probabilities of all rows where A = 1
- P(A = 1) = 0.05 + 0.2 + 0.25 + 0.05 = 0.55
39 Using The Joint Distribution
- Find P(A | B)
- P(A = 1 | B = 1) = P(A = 1, B = 1) / P(B = 1)
= (0.25 + 0.05) / (0.25 + 0.05 + 0.1 + 0.05) = 0.30 / 0.45 ≈ 0.67
40 Using The Joint Distribution
- Are A and B independent? NO. They are NOT independent:
P(A = 1 | B = 1) ≈ 0.67 ≠ P(A = 1) = 0.55
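To make slides 37-40 runnable, here is a small Python sketch over a hypothetical three-variable joint table; the individual row probabilities below are assumptions chosen only to be consistent with the sums quoted on the slides (P(A=1) = 0.55, P(B=1) = 0.45, P(A=1, B=1) = 0.30).

```python
# Hypothetical joint distribution over three Boolean variables (A, B, C).
# Only the sums on the slides are given; the individual rows are assumed.
joint = {
    # (A, B, C): probability
    (1, 1, 1): 0.25, (1, 1, 0): 0.05,
    (1, 0, 1): 0.20, (1, 0, 0): 0.05,
    (0, 1, 1): 0.10, (0, 1, 0): 0.05,
    (0, 0, 1): 0.10, (0, 0, 0): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9   # the probabilities sum to 1

def prob(pred):
    """Sum the probabilities of all rows satisfying the predicate."""
    return sum(p for row, p in joint.items() if pred(row))

p_a  = prob(lambda r: r[0] == 1)                   # P(A=1)      = 0.55
p_b  = prob(lambda r: r[1] == 1)                   # P(B=1)      = 0.45
p_ab = prob(lambda r: r[0] == 1 and r[1] == 1)     # P(A=1, B=1) = 0.30
print(p_a, p_ab / p_b)   # 0.55 vs P(A=1 | B=1) ≈ 0.67, so A and B are not independent
```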
41 Why not use the Joint Distribution?
- Given m Boolean variables, we need to estimate
2^m - 1 values.
- 20 yes-no questions => about a million values
- How do we get around this combinatorial
explosion?
- Assume independence of variables!!
42 Back to Independence
- The probability I have an apple in my lunch bag
is independent of the probability of a blizzard
in Japan.
- This is DOMAIN knowledge, typically supplied by
the problem designer
43 Naïve Bayes Classifier
- Cases are described by a conjunction of attribute
values
- These attributes are our independent hypotheses
- The target function has a finite set of values, V
- Could be solved using the joint distribution
table
- But what if we have 50,000 attributes?
- Attribute j is a Boolean signaling the presence or
absence of the jth word from the dictionary in my
latest email.
44 Naïve Bayes Classifier
45 Naïve Bayes Continued
Conditional independence step:
Instead of one table of size 2^50,000 we have
50,000 tables of size 2
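The step itself, in the usual notation (the slide's equations are not reproduced in the transcript): starting from the MAP classification of a new case described by attributes a_1, ..., a_n and assuming the attributes are conditionally independent given the target value,
\[
v_{MAP} = \arg\max_{v \in V} P(a_1, \ldots, a_n \mid v)\,P(v)
\;\longrightarrow\;
v_{NB} = \arg\max_{v \in V} P(v) \prod_{j=1}^{n} P(a_j \mid v)
\]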
46 Bayesian Belief Networks
- Bayes Optimal Classifier
- Often too costly to apply (uses the full joint
probability)
- Naïve Bayes Classifier
- Assumes conditional independence to lower costs
- This assumption is often overly restrictive
- Bayesian belief networks
- provide an intermediate approach
- allow conditional independence assumptions that
apply to subsets of the variables.
47 Example
- I'm at work, neighbor John calls to say my alarm
is ringing, but neighbor Mary doesn't call.
Sometimes it's set off by minor earthquakes. Is
there a burglar?
- Variables: Burglary, Earthquake, Alarm,
JohnCalls, MaryCalls
- Network topology reflects "causal" knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call
48 Example contd.
49 Bayesian Networks [Pearl 91]
- Qualitative part
- Directed acyclic graph (DAG)
- Nodes: random variables
- Edges: direct influence
50 Compactness
- A CPT for Boolean Xi with k Boolean parents has
2^k rows for the combinations of parent values
- Each row requires one number p for Xi = true (the
number for Xi = false is just 1 - p)
- If each variable has no more than k parents, the
complete network requires O(n * 2^k) numbers
- I.e., it grows linearly with n, vs. O(2^n) for the
full joint distribution
- For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers
(vs. 2^5 - 1 = 31)
51 Semantics
- The full joint distribution is defined as the
product of the local conditional distributions:
- P(X1, ..., Xn) = Π_{i=1..n} P(Xi | Parents(Xi))
- Example:
- P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
- = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
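A short Python sketch of this factorization for the burglary network follows. Since the slide's CPT figure is not reproduced here, the numbers below are the standard textbook values for this example and should be read as assumed, not as taken from the slide.

```python
# Joint probability P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) factored along the network structure.
# CPT values are the usual textbook numbers (assumed; the slide's figure is missing).
P_b = 0.001                                   # P(Burglary)
P_e = 0.002                                   # P(Earthquake)
P_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_j_given_a = {True: 0.90, False: 0.05}       # P(JohnCalls | Alarm)
P_m_given_a = {True: 0.70, False: 0.01}       # P(MaryCalls | Alarm)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = (P_j_given_a[True] * P_m_given_a[True]
     * P_a_given[(False, False)] * (1 - P_b) * (1 - P_e))
print(p)   # ≈ 0.00063
```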
52 Learning BB Networks: 3 cases
- 1. The network structure is given in advance and all
the variables are fully observable in the
training examples.
- Trivial case: just estimate the conditional
probabilities.
- 2. The network structure is given in advance but
only some of the variables are observable in the
training data.
- Similar to learning the weights for the hidden
units of a Neural Net: Gradient Ascent Procedure
- 3. The network structure is not known in advance.
- Use a heuristic search or constraint-based
technique to search through potential structures.
53 Constructing Bayesian networks
- 1. Choose an ordering of variables X1, ..., Xn
- 2. For i = 1 to n:
- add Xi to the network
- select parents from X1, ..., Xi-1 such that
- P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1)
- This choice of parents guarantees:
- P(X1, ..., Xn) = Π_{i=1..n} P(Xi | X1, ..., Xi-1)
(chain rule)
- = Π_{i=1..n} P(Xi | Parents(Xi)) (by construction)
54 Example
- Suppose we choose the ordering M, J, A, B, E
- P(J | M) = P(J)?
55 Example
- Suppose we choose the ordering M, J, A, B, E
- P(J | M) = P(J)? No
- P(A | J, M) = P(A | J)?
56 Example
- Suppose we choose the ordering M, J, A, B, E
- P(J | M) = P(J)? No
- P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
- P(B | A, J, M) = P(B | A)?
- P(B | A, J, M) = P(B)?
57 Example
- Suppose we choose the ordering M, J, A, B, E
- P(J | M) = P(J)? No
- P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
- P(B | A, J, M) = P(B | A)? Yes
- P(B | A, J, M) = P(B)? No
- P(E | B, A, J, M) = P(E | A)?
- P(E | B, A, J, M) = P(E | A, B)?
58 Example
- Suppose we choose the ordering M, J, A, B, E
- P(J | M) = P(J)? No
- P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
- P(B | A, J, M) = P(B | A)? Yes
- P(B | A, J, M) = P(B)? No
- P(E | B, A, J, M) = P(E | A)? No
- P(E | B, A, J, M) = P(E | A, B)? Yes
59 Example contd.
- Deciding conditional independence is hard in
noncausal directions
- Causal models and conditional independence seem
hardwired for humans!
- The resulting network is less compact
60 Inference in BB Networks
- A Bayesian Network can be used to compute the
probability distribution for any subset of
network variables given the values or
distributions for any subset of the remaining
variables.
- Unfortunately, exact inference of probabilities
in general for an arbitrary Bayesian Network is
known to be NP-hard (#P-complete)
- In theory, approximate techniques (such as Monte
Carlo methods) can also be NP-hard, though in
practice many such methods are shown to be
useful.
61 Expectation Maximization Algorithm
- Learning unobservable relevant variables
- Example: assume that data points have been
uniformly generated from k distinct Gaussians
with the same known variance. The problem is to
output a hypothesis h = <mu_1, mu_2, ..., mu_k>
that describes the means of each of the k
distributions. In particular, we are looking for
a maximum likelihood hypothesis for these means.
- We extend the problem description as follows: for
each point x_i, there are k hidden variables
z_i1, ..., z_ik such that z_il = 1 if x_i was generated by
normal distribution l, and z_iq = 0 for all q ≠ l.
62 The EM Algorithm (Contd)
- An arbitrary initial hypothesis h = <mu_1, mu_2, ...,
mu_k> is chosen.
- The EM Algorithm iterates over two steps:
- Step 1 (Estimation, E): calculate the expected
value E[z_ij] of each hidden variable z_ij,
assuming that the current hypothesis h = <mu_1, mu_2,
..., mu_k> holds.
- Step 2 (Maximization, M): calculate a new maximum
likelihood hypothesis h' = <mu_1', mu_2', ..., mu_k'>,
assuming the value taken on by each hidden variable z_ij
is its expected value E[z_ij] calculated in step 1.
- Then replace the hypothesis h = <mu_1, mu_2, ..., mu_k>
by the new hypothesis h' = <mu_1', mu_2', ..., mu_k'> and
iterate.
- The EM Algorithm can be applied to more general
problems.
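A minimal Python sketch of these two steps for the problem above (k Gaussians, known and equal variance, only the means estimated); the data generation and parameter values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def em_gaussian_means(x, k, sigma, n_iters=50, seed=0):
    """EM for the means of k Gaussians with known, equal variance sigma**2."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)        # arbitrary initial hypothesis h = <mu_1..mu_k>
    for _ in range(n_iters):
        # E step: E[z_il] = P(x_i came from Gaussian l | current means)
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)
        # M step: new ML means, treating E[z_il] as soft assignments
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return mu

# Illustrative usage: points drawn from two Gaussians with means -2 and 3.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1.0, 200), rng.normal(3, 1.0, 200)])
print(em_gaussian_means(x, k=2, sigma=1.0))          # roughly [-2, 3] (order may vary)
```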
63 Gibbs Classifier
- Bayes optimal classification can be too hard to
compute
- Instead, randomly pick a single hypothesis
(according to the posterior probability distribution
over the hypotheses)
- Use this hypothesis to classify new cases