Transcript and Presenter's Notes

Title: Machine Learning


1
Machine Learning
  • Probability and Bayesian Networks

2
An Introduction
  • Bayesian Decision Theory came long before Version
    Spaces, Decision Tree Learning and Neural
    Networks. It was studied in the field of
    Statistical Theory and more specifically, in the
    field of Pattern Recognition.

3
An Introduction
  • Bayesian Decision Theory is the basis of
    important learning schemes such as
  • Naïve Bayes Classifier
  • Bayesian Belief Networks
  • EM Algorithm
  • Bayesian Decision Theory is also useful as it
    provides a framework within which many
    non-Bayesian classifiers can be studied
  • See Mitchell, Sections 6.3, 6.4, 6.5, and 6.6.

4
Discrete Random Variables
  • A is a Boolean random variable if it denotes an
    event where there is uncertainty about whether it
    occurs
  • Examples
  • The next US president will be Barack Obama
  • You will get an A in the course
  • P(A), the probability of A, is the fraction of
    all possible worlds where A is true

5
Visualizing P(A)
(Figure: the space of all possible worlds, with the subset of worlds where A is true)
6
Axioms of Probability
  • Let there be a space S composed of a countable
    number of events
  • The probability of each event is between 0 and 1
  • The probability of the whole sample space is 1
  • When two events are mutually exclusive, their
    probabilities are additive
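In symbols (a standard statement of these axioms):

\[
0 \le P(A) \le 1, \qquad P(S) = 1, \qquad
P(A \vee B) = P(A) + P(B) \ \text{if } A \text{ and } B \text{ are mutually exclusive}
\]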

7
Visualizing Two Boolean RVs

(Figure: overlapping regions of worlds where A is true and where B is true)
8
Conditional Probability
The conditional probability of A given B is
represented by the following formula

(Figure: overlapping regions for A and B. Note: P(A | B) = P(A) only if A and B are independent.)
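The formula referred to above is the standard definition of conditional probability:

\[
P(A \mid B) = \frac{P(A \wedge B)}{P(B)}
\]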
9
Independence
  • Variables A and B are said to be independent if
    knowing the value of A gives you no knowledge
    about the likelihood of B, and vice versa
  • P(A|B) = P(A) and P(B|A) = P(B)

10
An Example Cards
  • Take a standard deck of 52 cards.
  • On the first draw I pull the Ace of Spades.
  • I don't replace the card.
  • What is the probability I'll pull the Ace of
    Spades on the second draw?
  • Now, I replace the Ace after the 1st draw,
    shuffle, and draw again.
  • What is the chance I'll draw the Ace of Spades on
    the 2nd draw?

11
Discrete Random Variables
  • A is a discrete random variable if it takes a
    countable number of distinct values
  • Examples
  • Your grade G in the course
  • The number of heads k in n coin flips
  • P(A = k) is the fraction of all possible worlds
    where A equals k
  • Notation: P_D(A = k) is the probability relative
    to a distribution D
  • e.g., P_fair-grading(G = A), P_cheating(G = A)

12
Bayes Theorem
  • Definition of Conditional Probability
  • Corollary
  • The Chain Rule
  • Bayes Rule
  • (Thomas Bayes, 1763)
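In standard notation, these are:

\[
P(A \mid B) = \frac{P(A \wedge B)}{P(B)} \quad \text{(definition)}, \qquad
P(A \wedge B) = P(A \mid B)\,P(B) \quad \text{(corollary)}
\]
\[
P(A_1 \wedge \dots \wedge A_n) = \prod_{i=1}^{n} P(A_i \mid A_1 \wedge \dots \wedge A_{i-1}) \quad \text{(chain rule)}
\]
\[
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} \quad \text{(Bayes rule)}
\]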

13
ML in a Bayesian Framework
  • Any ML technique can be expressed as reasoning
    about probabilities
  • Goal: find the hypothesis h that is most probable
    given training data D
  • Provides a more explicit way of describing and
    encoding our assumptions

14
Some Definitions
  • Prior probability of h, P(h)
  • The background knowledge we have about the chance
    that h is a correct hypothesis (before having
    observed the data).
  • Prior probability of D, P(D)
  • the probability that training data D will be
    observed given no knowledge about which
    hypothesis h holds.
  • Conditional probability of D, P(D|h)
  • the probability of observing data D given that
    hypothesis h holds.
  • Posterior probability of h, P(h|D)
  • the probability that h is true, given the
    observed training data D.
  • the quantity that Machine Learning researchers
    are interested in.

15
Maximum A Posteriori (MAP)
  • Goal: to find the most probable hypothesis h
    from a set of candidate hypotheses H given the
    observed data D.
  • MAP hypothesis, h_MAP:
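In standard form:

\[
h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\,P(h)
\]

(P(D) can be dropped because it does not depend on h.)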

16
Maximum Likelihood (ML)
  • The ML hypothesis is a special case of the MAP
    hypothesis in which all hypotheses are a priori
    equally likely
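Dropping the now-constant prior P(h) from the MAP expression gives:

\[
h_{ML} = \arg\max_{h \in H} P(D \mid h)
\]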

17
Example Brute Force MAP Learning
  • Assumptions
  • The training data D is noise-free
  • The target concept c is in the hypothesis set H
  • All hypotheses are equally likely
  • Choice: P(D|h) = 1 if h is consistent with D, and
    0 otherwise

18
Brute Force MAP (continued)
Bayes Theorem:
Given our assumptions:
VS_{H,D} is the version space of H with respect to D
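Under the assumptions above, the posterior works out to the standard result (as in Mitchell, Chapter 6):

\[
P(h \mid D) =
\begin{cases}
\dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\
0 & \text{otherwise}
\end{cases}
\]

so every hypothesis consistent with D is a MAP hypothesis.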
19
Find-S as MAP Learning
  • We can characterize the FIND-S learner (chapter
    2) in Bayesian terms
  • Again, P(D|h) is 1 if h is consistent with D, and
    0 otherwise
  • P(h) increases with the specificity of h
  • Then the MAP hypothesis is the output of Find-S

20
Neural Nets in a Bayesian Framework
  • Under certain assumptions regarding noise in the
    data, minimizing the mean squared error (what
    multilayer perceptrons do) corresponds to
    computing the maximum likelihood hypothesis.

21
Least Squared Error ML
Assume the error e in each training value is drawn from a zero-mean normal distribution
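A standard sketch of the derivation the next slides carry out (as in Mitchell, Chapter 6), assuming each training value is d_i = f(x_i) + e_i with e_i drawn from a zero-mean Gaussian of variance \sigma^2:

\[
h_{ML} = \arg\max_{h} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}
         e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}
       = \arg\max_{h} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2}
       = \arg\min_{h} \sum_{i=1}^{m} (d_i - h(x_i))^2
\]

The middle step takes the logarithm, which is monotonic and so preserves the argmax.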
22
Least Squared Error ML
23
Least Squared Error ML
24
Decision Trees in Bayes Framework
  • A decent choice for P(h): simpler hypotheses have
    higher probability
  • Occam's razor
  • This can be encoded in terms of finding the
    Minimum Description Length encoding
  • Provides a way to trade off hypothesis size for
    training error
  • Potentially prevents overfitting

25
Most Compact Coding
  • Let's minimize the bits used to encode a message
  • Idea
  • Assign shorter codes to more probable messages
  • According to Shannon and Weaver
  • An optimal code assigns -log2 P(i) bits to encode
    item i
  • thus

26
Minimum Description Length (MDL)
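The principle in its standard form (with L_C(x) denoting the description length of x under encoding C):

\[
h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]
\]

which coincides with h_{MAP} = \arg\min_h [\, -\log_2 P(h) - \log_2 P(D \mid h) \,] when the code lengths match those probabilities (see slide 32).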
27
Minimum Description Length (MDL)
28
Minimum Description Length (MDL)
29
What does all that mean?
  • The optimal hypothesis is the one that is the
    smallest when we count
  • How long the hypothesis description must be
  • How long the data description must be, given the
    hypothesis
  • Key idea: since we're given h, we need only
    encode h's mistakes

30
What does all that mean?
  • If the hypothesis is perfect, we don't need to
    encode any data.
  • For each misclassification, we must
  • Say which item is misclassified
  • Takes log2 m bits, where m = the size of the
    dataset
  • Say what the right classification is
  • Takes log2 k bits, where k = the number of classes
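Putting the two costs together, with hypothetical numbers for illustration: each mistake costs about log2 m + log2 k bits, so for m = 1000 examples and k = 2 classes that is roughly 10 + 1 = 11 bits per mistake, and the total description length is approximately

\[
L(h) + (\text{number of mistakes}) \cdot (\log_2 m + \log_2 k)
\]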

31
The best MDL hypothesis
  • The best hypothesis is the best tradeoff between
  • Complexity of the hypothesis description
  • Number of times we have to tell people where it
    screwed up.

32
Is MDL always MAP?
  • Only given significant assumptions
  • If we know a representation scheme such that size
    of h in H is -log2P(h)
  • Likewise, the size of the exception
    representation must be log2P(Dh)
  • THEN
  • MDL MAP

33
Making Predictions
  • The reason we learned h to begin with
  • Does it make sense to choose just one h?

h1: Looks matter
h2: Money matters
h3: Ideas matter
Obama elected president?
We want a prediction: yes or no?
Doug Downey (adapted from Bryan Pardo,
Northwestern University)
34
Maximum A Posteriori (MAP)
  • Find most probable hypothesis
  • Use the predictions of that hypothesis

h1: Looks matter
h2: Money matters
h3: Ideas matter
... but do we really want to ignore the other
hypotheses? Imagine 8 hypotheses. Seven of them
say yes and have a probability of 0.1 each.
One says no and has a probability of 0.3. Who
do you believe?
35
Bayes Optimal Classifier
  • Bayes Optimal Classification: the most probable
    classification of a new instance is obtained by
    combining the predictions of all hypotheses,
    weighted by their posterior probabilities (see
    the formula below)
  • where V is the set of all the values a
    classification can take and v is one possible
    such classification.
  • No other method using the same H and prior
    knowledge is better (on average).
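In standard notation (as in Mitchell, Chapter 6):

\[
v^{*} = \arg\max_{v \in V} \sum_{h \in H} P(v \mid h)\, P(h \mid D)
\]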

36
Naïve Bayes Classifier
  • Unfortunately, the Bayes Optimal Classifier is
    usually too costly to apply! → Naïve Bayes
    Classifier
  • We'll be seeing more of these

37
The Joint Distribution
  • Make a truth table listing all combinations of
    variable values
  • Assign a probability to each row
  • Make sure the probabilities sum to 1
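A short Python sketch of the idea. The table below is hypothetical: the rows with A = 1 and with B = 1 are chosen to match the probabilities used on the next few slides, and the two remaining rows are made up so everything sums to 1.

# A hypothetical joint distribution over three Boolean variables (A, B, C).
joint = {
    (1, 1, 1): 0.25, (1, 1, 0): 0.05,
    (1, 0, 1): 0.20, (1, 0, 0): 0.05,
    (0, 1, 1): 0.10, (0, 1, 0): 0.05,
    (0, 0, 1): 0.10, (0, 0, 0): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # the probabilities must sum to 1

def prob(pred):
    """Sum the probabilities of all rows (worlds) satisfying a predicate."""
    return sum(p for world, p in joint.items() if pred(world))

p_a = prob(lambda w: w[0] == 1)                   # P(A = 1)        -> 0.55
p_b = prob(lambda w: w[1] == 1)                   # P(B = 1)        -> 0.45
p_ab = prob(lambda w: w[0] == 1 and w[1] == 1)    # P(A = 1, B = 1) -> 0.30
print(p_a, p_ab / p_b)   # P(A=1) vs P(A=1 | B=1): not equal, so A and B are not independent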

38
Using The Joint Distribution
  • Find P(A)
  • Sum the probabilities of all rows where A = 1
  • P(A = 1) = 0.05 + 0.2 + 0.25 + 0.05 = 0.55

39
Using The Joint Distribution
  • Find P(A|B)
  • P(A = 1 | B = 1) = P(A = 1, B = 1) / P(B = 1)
    = (0.25 + 0.05) / (0.25 + 0.05 + 0.1 + 0.05)
    = 0.30 / 0.45 ≈ 0.67

40
Using The Joint Distribution
  • Are A and B Independent?

NO. They are NOT independent: P(A = 1 | B = 1) ≈ 0.67, but P(A = 1) = 0.55
41
Why not use the Joint Distribution?
  • Given m Boolean variables, we need to estimate
    2^m - 1 values.
  • 20 yes-no questions → about a million values
  • How do we get around this combinatorial
    explosion?
  • Assume independence of variables!!

42
back to Independence
  • The probability I have an apple in my lunch bag
    is independent of the probability of a blizzard
    in Japan.
  • This is DOMAIN Knowledge, typically supplied by
    the problem designer

43
Naïve Bayes Classifier
  • Cases are described by a conjunction of attribute
    values
  • The attributes are assumed conditionally
    independent given the target value
  • The target function has a finite set of values, V
  • Could be solved using the joint distribution
    table
  • What if we have 50,000 attributes?
  • Attribute j is a Boolean signaling the presence or
    absence of the jth word from the dictionary in my
    latest email.

44
Naïve Bayes Classifier
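The classification rule being set up here, in standard form (as in Mitchell, Chapter 6):

\[
v_{MAP} = \arg\max_{v \in V} P(v \mid a_1, \dots, a_n)
        = \arg\max_{v \in V} P(a_1, \dots, a_n \mid v)\, P(v)
\]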
45
Naïve Bayes Continued
Conditional independence step
Instead of one table of size 2^50,000 we have
50,000 tables of size 2
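With the conditional independence assumption, P(a_1, ..., a_n | v) = \prod_j P(a_j \mid v), giving the Naïve Bayes rule v_{NB} = \arg\max_{v \in V} P(v) \prod_j P(a_j \mid v). Below is a minimal Python sketch of such a classifier for the email example; the training set and word choices are made up for illustration, not taken from the slides.

from collections import defaultdict
import math

def train(examples):
    """examples: list of (set_of_words, label). Returns priors and P(word present | class)."""
    label_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    n = len(examples)
    priors = {c: label_counts[c] / n for c in label_counts}
    # Laplace smoothing avoids zero probabilities for unseen words
    cond = {c: {w: (word_counts[c][w] + 1) / (label_counts[c] + 2) for w in vocab}
            for c in label_counts}
    return priors, cond, vocab

def classify(words, priors, cond, vocab):
    """Return the class maximizing log P(c) + sum_j log P(a_j | c)."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = cond[c][w]
            score += math.log(p if w in words else 1 - p)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical usage with a toy training set
examples = [({"cheap", "pills"}, "spam"), ({"meeting", "tomorrow"}, "ham"),
            ({"cheap", "meeting"}, "ham"), ({"pills", "offer"}, "spam")]
priors, cond, vocab = train(examples)
print(classify({"cheap", "offer"}, priors, cond, vocab))  # expected: "spam"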
46
Bayesian Belief Networks
  • Bayes Optimal Classifier
  • Often too costly to apply (uses full joint
    probability)
  • Naïve Bayes Classifier
  • Assumes conditional independence to lower costs
  • This assumption often overly restrictive
  • Bayesian belief networks
  • provide an intermediate approach
  • allow conditional independence assumptions that
    apply to subsets of the variables

47
Example
  • I'm at work, neighbor John calls to say my alarm
    is ringing, but neighbor Mary doesn't call.
    Sometimes it's set off by minor earthquakes. Is
    there a burglar?
  • Variables Burglary, Earthquake, Alarm,
    JohnCalls, MaryCalls
  • Network topology reflects "causal" knowledge
  • A burglar can set the alarm off
  • An earthquake can set the alarm off
  • The alarm can cause Mary to call
  • The alarm can cause John to call

48
Example contd.
49
Bayesian Networks
Pearl 91
  • Qualitative part
  • Directed acyclic graph (DAG)
  • Nodes - random vars.
  • Edges - direct influence

50
Compactness
  • A CPT for Boolean Xi with k Boolean parents has
    2^k rows for the combinations of parent values
  • Each row requires one number p for Xi = true (the
    number for Xi = false is just 1 - p)
  • If each variable has no more than k parents, the
    complete network requires O(n · 2^k) numbers
  • I.e., grows linearly with n, vs. O(2^n) for the
    full joint distribution
  • For the burglary net, 1 + 1 + 4 + 2 + 2 = 10
    numbers (vs. 2^5 - 1 = 31)

51
Semantics
  • The full joint distribution is defined as the
    product of the local conditional distributions
  • P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | Parents(Xi))
  • Example
  • P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
    = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
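A small Python sketch of this calculation. The CPT values are the ones commonly quoted for this textbook burglary example (Russell & Norvig); they are assumed values, not taken from these slides.

# Priors and conditional probability tables (assumed, standard textbook values)
P_b = 0.001                         # P(Burglary)
P_e = 0.002                         # P(Earthquake)
P_a = {                             # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_j = {True: 0.90, False: 0.05}     # P(JohnCalls | Alarm)
P_m = {True: 0.70, False: 0.01}     # P(MaryCalls | Alarm)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(p)  # about 0.00063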
52
Learning BB Networks 3 cases
  • 1. The network structure is given in advance and
    all the variables are fully observable in the
    training examples.
  • Trivial case: just estimate the conditional
    probabilities.
  • 2. The network structure is given in advance but
    only some of the variables are observable in the
    training data.
  • Similar to learning the weights for the hidden
    units of a neural net: use a gradient ascent
    procedure
  • 3. The network structure is not known in advance.
  • Use a heuristic search or constraint-based
    technique to search through potential structures.

53
Constructing Bayesian networks
  • 1. Choose an ordering of variables X1, ..., Xn
  • 2. For i = 1 to n
  • add Xi to the network
  • select parents from X1, ..., Xi-1 such that
  • P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1)
  • This choice of parents guarantees
  • P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | X1, ..., Xi-1)
    (chain rule)
  • = ∏_{i=1}^{n} P(Xi | Parents(Xi)) (by construction)
54
Example
  • Suppose we choose the ordering M, J, A, B, E
  • P(J | M) = P(J)?

55
Example
  • Suppose we choose the ordering M, J, A, B, E
  • P(J | M) = P(J)? No
  • P(A | J, M) = P(A | J)?

56
Example
  • Suppose we choose the ordering M, J, A, B, E
  • P(J | M) = P(J)? No
  • P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
  • P(B | A, J, M) = P(B | A)?
  • P(B | A, J, M) = P(B)?

57
Example
  • Suppose we choose the ordering M, J, A, B, E
  • P(J | M) = P(J)? No
  • P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
  • P(B | A, J, M) = P(B | A)? Yes
  • P(B | A, J, M) = P(B)? No
  • P(E | B, A, J, M) = P(E | A)?
  • P(E | B, A, J, M) = P(E | A, B)?

58
Example
  • Suppose we choose the ordering M, J, A, B, E
  • P(J | M) = P(J)? No
  • P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
  • P(B | A, J, M) = P(B | A)? Yes
  • P(B | A, J, M) = P(B)? No
  • P(E | B, A, J, M) = P(E | A)? No
  • P(E | B, A, J, M) = P(E | A, B)? Yes

59
Example contd.
  • Deciding conditional independence is hard in
    noncausal directions
  • Causal models and conditional independence seem
    hardwired for humans!
  • Network is less compact

60
Inference in BB Networks
  • A Bayesian Network can be used to compute the
    probability distribution for any subset of
    network variables given the values or
    distributions for any subset of the remaining
    variables.
  • Unfortunately, exact inference of probabilities
    in general for an arbitrary Bayesian Network is
    known to be NP-hard (#P-complete)
  • In theory, approximate techniques (such as Monte
    Carlo methods) can also be NP-hard, though in
    practice many such methods have been shown to be
    useful.

61
Expectation Maximization Algorithm
  • Learning unobservable relevant variables
  • Example: Assume that data points have been
    uniformly generated from k distinct Gaussians
    with the same known variance. The problem is to
    output a hypothesis h = <μ1, μ2, ..., μk>
    that describes the means of each of the k
    distributions. In particular, we are looking for
    a maximum likelihood hypothesis for these means.
  • We extend the problem description as follows: for
    each point xi, there are k hidden variables
    zi1, ..., zik such that zil = 1 if xi was generated
    by normal distribution l and ziq = 0 for all q ≠ l.

62
The EM Algorithm (Contd)
  • An arbitrary initial hypothesis h = <μ1, μ2, ...,
    μk> is chosen.
  • The EM Algorithm iterates over two steps
  • Step 1 (Estimation, E): Calculate the expected
    value E[zij] of each hidden variable zij,
    assuming that the current hypothesis h = <μ1, μ2,
    ..., μk> holds.
  • Step 2 (Maximization, M):
  • Calculate a new maximum likelihood hypothesis
    h' = <μ1', μ2', ..., μk'>, assuming the value taken
    on by each hidden variable zij is its expected
    value E[zij] calculated in Step 1.
  • Then replace the hypothesis h = <μ1, μ2, ..., μk>
    by the new hypothesis h' = <μ1', μ2', ..., μk'>
    and iterate.
  • The EM Algorithm can be applied to more general
    problems
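A minimal Python sketch of the procedure just described, for k one-dimensional Gaussians with a known shared variance. The data and initial means below are made up for illustration.

import math, random

def em_gaussian_means(xs, k, sigma2, iters=50):
    """Estimate the k means by EM, with a known shared variance sigma2."""
    mus = random.sample(list(xs), k)  # arbitrary initial hypothesis h = <mu_1, ..., mu_k>
    for _ in range(iters):
        # E-step: E[z_ij] = probability that point x_i was generated by component j
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma2)) for mu in mus]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: new means are the responsibility-weighted averages of the points
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(k)]
    return mus

# Hypothetical usage: points drawn near 0 and near 5
data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
print(sorted(em_gaussian_means(data, k=2, sigma2=1.0)))  # roughly [0, 5]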

63
Gibbs Classifier
  • Bayes optimal classification can be too hard to
    compute
  • Instead, randomly pick a single hypothesis
    (according to the posterior distribution over
    the hypotheses)
  • Use this hypothesis to classify new cases
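A tiny Python sketch of the idea: sample one hypothesis according to its posterior probability and use it to predict. The hypotheses and posterior values below are hypothetical.

import random

def gibbs_classify(hypotheses, posteriors, x):
    """Sample one hypothesis h according to P(h | D) and use it to classify x."""
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)

# Hypothetical usage with three hypotheses and made-up posterior probabilities
hs = [lambda x: "yes", lambda x: "no", lambda x: "yes"]
print(gibbs_classify(hs, posteriors=[0.5, 0.3, 0.2], x=None))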

