Introduction to Statistical Learning - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Introduction to Statistical Learning


1
Introduction to Statistical Learning
  • Reading Ch. 20, Sec. 1-4, AIMA 2nd Ed.

2
Learning under uncertainty
  • How to learn probabilistic models such as
    Bayesian networks, Markov models, HMMs, etc.?
  • Examples
  • Class confusion example: how did we come up with
    the CPTs?
  • Earthquake-burglary network structure?
  • How do we learn HMMs for speech recognition?
  • Kalman model (e.g., mass, friction) parameters?
  • User models encoded as Bayesian networks for HCI?

3
Hypotheses and Bayesian theory
  • Problem
  • Two kinds of candy, lemon and chocolate
  • Packed in five types of unmarked bags: (100% C, 0% L) 10% of
    the time, (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of
    the time, (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of
    the time
  • Task: open a bag, (unwrap a candy, observe it, ...), then
    predict what the next one will be
  • Formulation
  • H (Hypothesis): h1 = (100, 0), h2 = (75, 25), h3 = (50, 50),
    h4 = (25, 75), or h5 = (0, 100)
  • di (Data): the i-th opened candy, L (lemon) or C (chocolate)
  • Goal
  • Predict d_{i+1} after seeing D = d0, d1, ..., di, i.e., compute
    P( d_{i+1} | D )

4
Bayesian learning
  • Bayesian solution: estimate probabilities of the hypotheses
    (candy bag types), then predict the data (candy type)
  • Hypothesis posterior:  P( hi | D ) ∝ P( D | hi ) P( hi ),
    where P( hi ) is the hypothesis prior and P( D | hi ) is the
    data likelihood
  • Prediction:  P( d_{i+1} | D ) = Σ_hi P( d_{i+1} | hi ) P( hi | D )
  • Data likelihood for I.I.D. (independently, identically
    distributed) data points:
    P( D | hi ) = P( d0 | hi ) × ... × P( di | hi )
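As a concrete illustration of the update and prediction equations above, here is a minimal sketch for the candy problem; the priors and per-bag chocolate probabilities are the ones given on slides 3 and 5.

    # Minimal sketch: Bayesian updating and prediction for the candy problem.
    # Priors P(h_i) and per-bag chocolate probabilities P(C | h_i) are from the slides.
    PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
    P_CHOC = [1.0, 0.75, 0.5, 0.25, 0.0]     # P(chocolate | h_i)

    def posterior(data):
        """P(h_i | D) for a sequence of 'C'/'L' observations (i.i.d. given h_i)."""
        unnorm = []
        for prior, pc in zip(PRIORS, P_CHOC):
            lik = 1.0
            for d in data:
                lik *= pc if d == 'C' else 1.0 - pc
            unnorm.append(prior * lik)
        z = sum(unnorm)
        return [u / z for u in unnorm]

    def predict(candy, data):
        """P(d_{i+1} = candy | D) = sum_i P(candy | h_i) P(h_i | D)."""
        return sum((pc if candy == 'C' else 1.0 - pc) * w
                   for pc, w in zip(P_CHOC, posterior(data)))

    print(posterior(['C', 'C', 'C']))     # posterior after three chocolates
    print(predict('L', ['C', 'C', 'C']))  # probability the next candy is a lemon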
5
Example
  • P( hi ) = ?
  • P( hi ) = (0.1, 0.2, 0.4, 0.2, 0.1)
  • P( di | hi ) = ?
  • P( chocolate | h1 ) = 1, P( lemon | h3 ) = 0.5
  • P( C, C, C, C, C | h4 ) = ?
  • P( C, C, C, C, C | h4 ) = 0.25^5
  • P( h5 | C, C, C, C, C ) = ?
  • P( h5 | C, C, C, C, C ) ∝ P( C, C, C, C, C | h5 ) P( h5 )
    = 0^5 × 0.1 = 0
  • P( lemon | C, C, C, C, C ) = ?
  • P( lemon | C, C, C, C, C ) = P( lemon | h1 ) P( h1 | C, C, C, C, C )
    + ... + P( lemon | h5 ) P( h5 | C, C, C, C, C )
    = 0 × 0.6244 + 0.25 × 0.2963 + 0.50 × 0.0780 + 0.75 × 0.0012 + 1 × 0
    ≈ 0.1140
  • P( chocolate | C, C, C, C, ... ) = ?
  • P( chocolate | C, C, C, C, ... ) → 1 as more and more chocolates
    are observed
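A short arithmetic check of the numbers on this slide (the posterior after five chocolates and the prediction for the next candy):

    # Check of the posterior and prediction after observing five chocolates.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]
    joint = [pr * pc ** 5 for pr, pc in zip(priors, p_choc)]
    post = [j / sum(joint) for j in joint]    # ~ [0.6244, 0.2963, 0.0780, 0.0012, 0.0]
    p_lemon = sum((1 - pc) * w for pc, w in zip(p_choc, post))
    print(post, p_lemon)                      # p_lemon ~ 0.1140, p_chocolate ~ 0.8860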

6
Bayesian prediction properties
  • True hypothesis eventually dominates
  • Bayesian prediction is optimal (minimizes
    prediction error)
  • Comes at a price: usually many hypotheses and an
    intractable summation

7
Approximations to Bayesian prediction
  • MAP (Maximum a posteriori): P( d | D ) ≈ P( d | hMAP ),
    where hMAP = arg max_hi P( hi | D )  (easier to compute)
  • Role of the prior P( hi ): it penalizes complex hypotheses
  • ML (Maximum likelihood): P( d | D ) ≈ P( d | hML ),
    where hML = arg max_hi P( D | hi )
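A small sketch of the two approximations on the candy setup; both pick a single hypothesis and predict with it, and ML simply drops the prior:

    # Sketch: MAP vs. ML hypothesis selection for the candy problem.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]
    data = ['C'] * 5

    def likelihood(pc):
        """P(D | h) for a bag whose chocolate probability is pc."""
        lik = 1.0
        for d in data:
            lik *= pc if d == 'C' else 1.0 - pc
        return lik

    liks = [likelihood(pc) for pc in p_choc]
    h_ml  = max(range(5), key=lambda i: liks[i])               # arg max P(D | h_i)
    h_map = max(range(5), key=lambda i: liks[i] * priors[i])   # arg max P(D | h_i) P(h_i)
    # Prediction is then approximated by P(d | h_MAP) or P(d | h_ML).
    print(h_ml + 1, h_map + 1)   # after five chocolates both select h1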

8
Learning from complete data
  • Learn parameters of Bayesian models from data
  • e.g., learn the probabilities of C and L for a bag of candy
    whose proportions of C and L are unknown, by observing opened
    candies from that bag
  • Candy problem parameters: θu = probability of C in bag u,
    πu = probability of bag u

Bag    Parameter
1      π1
2      π2
3      π3
4      π4
5      π5

Bag u
Candy  Parameter
C      θu
L      1 - θu
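To make the notation concrete, a hypothetical way of holding the two parameter sets in code (the values shown are the true proportions from slide 3, which learning should recover from data):

    # Hypothetical containers for the candy-problem parameters:
    # pi[u] = P(bag = u), theta[u] = P(C | bag = u), and P(L | bag = u) = 1 - theta[u].
    pi    = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}
    theta = {1: 1.0, 2: 0.75, 3: 0.5, 4: 0.25, 5: 0.0}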
9
ML Learning from complete data
  • ML approach: select model parameters that maximize the
    likelihood of the seen data
  • Need to assume a distribution model that determines how the
    samples (of candy) in a bag are distributed
  • Select the parameters of the model that maximize the
    likelihood of the seen data

model (binomial):  P( C | θu ) = θu,  P( L | θu ) = 1 - θu
likelihood:        L(θu) = P( D | θu ) = θu^c (1 - θu)^l, for c chocolates and l lemons observed from bag u
log-likelihood:    log L(θu) = c log θu + l log(1 - θu)
10
Maximum likelihood learning (binomial
distribution)
  • How to find a solution to the above problem?

11
Maximum likelihood learning (contd)
  • Take the first derivative of the (log-)likelihood and set it
    to zero:  c/θu - l/(1 - θu) = 0  ⇒  θML = c / (c + l)
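A quick numerical check of this closed form (the counts c and l below are made up): maximizing the log-likelihood over a grid of θ values lands on the count ratio.

    import math

    # Numerical check that theta_ML = c / (c + l) maximizes the binomial log-likelihood.
    c, l = 7, 3                                 # hypothetical counts of chocolates and lemons

    def loglik(t):
        return c * math.log(t) + l * math.log(1 - t)

    grid = [i / 1000 for i in range(1, 1000)]   # theta values in (0, 1)
    theta_grid = max(grid, key=loglik)
    print(theta_grid, c / (c + l))              # both ~ 0.7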

12
Naïve Bayes model
  • One set of causes, multiple independent sources
    of evidence

(Graph: class node C with children E1, E2, ..., Ei, ..., EN)

  • Example: C ∈ {spam, not spam}, Ei ∈ {token i present,
    token i absent}
  • A limiting assumption, but it often works well in practice

13
Inference and Decision in the NB model
  • Inference:  P( C | E1, ..., EN ) ∝ P( C ) Π_i P( Ei | C );
    in log form, the hypothesis (class) score is the prior score
    plus the sum of the per-token evidence scores
  • Decision: via the log odds ratio, e.g. declare spam when
    log [ P( spam | E ) / P( not spam | E ) ] > 0
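A sketch of this inference/decision step for the spam example, with made-up CPT values for three binary tokens:

    import math

    # Sketch: Naive Bayes log-odds decision for spam filtering (made-up CPTs).
    # p_token[c][i] = P(E_i = 1 | C = c), for c in {'spam', 'ham'}.
    p_class = {'spam': 0.4, 'ham': 0.6}
    p_token = {'spam': [0.8, 0.1, 0.6],
               'ham':  [0.2, 0.3, 0.1]}

    def log_odds(evidence):
        """log [ P(spam | e) / P(not spam | e) ] for a binary evidence vector e."""
        score = math.log(p_class['spam']) - math.log(p_class['ham'])   # prior score
        for i, e in enumerate(evidence):
            ps, ph = p_token['spam'][i], p_token['ham'][i]
            # evidence score of token i, depending on whether it is present or absent
            score += math.log(ps if e else 1 - ps) - math.log(ph if e else 1 - ph)
        return score

    print('SPAM' if log_odds([1, 0, 1]) > 0 else 'NOT_SPAM')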
14
Learning in NB models
  • Example: given a set of K email messages, each with tokens
    D = { dj = (e1j, ..., eNj) }, eij ∈ {0, 1}, and labels C = { cj }
    (SPAM or NOT_SPAM), find the best set of CPTs P( Ei | C ) and
    P( C ). Assume P( Ei | C = c ) is binomial with parameter θi,c
    and P( C ) is binomial with parameter θc.

2N + 1 parameters
  • ML learning: maximize the likelihood of the K messages, each
    one belonging to one of the two classes,
    L(θ) = Π_j P( C = cj ) Π_i P( Ei = eij | C = cj )

(cj = label of message j; eij = token i in message j present/absent)
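ML learning here reduces to counting; a sketch with a tiny made-up message set (θc is the fraction of class-c messages, θi,c the fraction of class-c messages containing token i):

    # Sketch: ML estimates for the Naive Bayes spam model are relative frequencies.
    # Each message j is a token vector (e_1j, ..., e_Nj) with a label c_j; data are made up.
    messages = [([1, 0, 1], 'spam'),
                ([1, 1, 0], 'spam'),
                ([0, 0, 1], 'ham'),
                ([0, 1, 0], 'ham'),
                ([0, 0, 0], 'ham')]
    N = 3                                                 # number of tokens

    K = len(messages)
    K_c = {c: sum(1 for _, lab in messages if lab == c) for c in ('spam', 'ham')}
    theta_c  = {c: K_c[c] / K for c in K_c}               # estimate of P(C = c)
    theta_ic = {c: [sum(e[i] for e, lab in messages if lab == c) / K_c[c]
                    for i in range(N)]
                for c in ('spam', 'ham')}                 # estimate of P(E_i = 1 | C = c)
    print(theta_c, theta_ic)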
15
Learning of Bayesian network parameters
  • Naïve Bayes learning can be extended to BNs!
    How?
  • Model each CPT as a binomial/multinomial
    distribution, then maximize the likelihood of the
    data given the BN.

Sample E B A N C
1 1 0 0 1 0
2 1 0 1 1 0
3 1 1 0 0 1
4 0 1 0 1 1
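For instance, using the four complete samples in the table above (and assuming E and B are the parents of A, as in the alarm network), one CPT entry is estimated as a relative frequency:

    # Sketch: estimating one CPT entry of the alarm network from the complete data above.
    samples = [  # (E, B, A, N, C)
        (1, 0, 0, 1, 0),
        (1, 0, 1, 1, 0),
        (1, 1, 0, 0, 1),
        (0, 1, 0, 1, 1),
    ]
    matching = [s for s in samples if s[0] == 1 and s[1] == 0]   # samples with E=1, B=0
    p_a1 = sum(s[2] for s in matching) / len(matching)           # ML estimate of P(A=1 | E=1, B=0)
    print(p_a1)                                                  # 0.5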
16
BN Learning (contd)
  • Issues
  • Priors on parameters. What if an estimated probability comes
    out as 0 because no such samples were observed? Should we
    trust it? Maybe always add some small pseudo-count α?
  • How do we learn a BN graph (structure)? Test all
    possible structures, then pick the one with the
    highest data likelihood?
  • What if we do not observe some nodes (evidence
    not on all nodes)?
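The pseudo-count idea from the first bullet amounts to adding a small constant to every count before normalizing; a sketch using the counts from the previous slide (alpha is a hypothetical smoothing constant):

    # Sketch: pseudo-count (smoothed) estimate of P(A=1 | E=1, B=0).
    alpha = 1.0                      # hypothetical pseudo-count
    n_matching, n_a1 = 2, 1          # samples with E=1, B=0; of which A=1
    p_ml       = n_a1 / n_matching                           # plain ML estimate: 0.5
    p_smoothed = (n_a1 + alpha) / (n_matching + 2 * alpha)   # never exactly 0 or 1
    print(p_ml, p_smoothed)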

17
Learning from incomplete data
  • Example
  • In the alarm network, we received data where we
    only know Newscast, Call, Earthquake, and Burglary,
    but have no idea what the Alarm state is.
  • In SPAM model, we do not know if a message is
    spam or not (missing label).

Sample E B A N C
1 1 0 N/A 1 0
2 1 0 N/A 1 0
3 1 1 N/A 0 1
4 0 1 N/A 1 1
  • Solution? We can still try to find the network
    parameters that maximize the likelihood of the
    incomplete data.

Hidden variable: A
18
Completing the data
  • Maximizing incomplete data likelihood is tricky.
  • If we could somehow complete the data, we would
    know how to select model parameters that maximize
    the likelihood of the completed data.
  • How do we complete the missing data?
  • Randomly complete?
  • Estimate the missing data from the evidence,
    P( h | Evidence ).

19
EM Algorithm
  • With completed data, Dc, maximize the completed
    (log-)likelihood by weighting the contribution of
    each sample with P( h | d )
  • E(xpectation)-M(aximization) algorithm
  • Pick initial parameter estimates θ0.
  • error = Inf
  • While (error > max_error)
  • E-step: complete the data, Dc, based on θk-1.
  • M-step: compute new parameters θk that maximize the
    completed-data likelihood.
  • error = L( D | θk ) - L( D | θk-1 )

20
EM Example
  • Candy problem, but now we do not know which bag
    the candy came from (bag label missing).
  • E-step: compute P( bag = u | dj ) ∝ πu P( dj | θu ) for every
    candy dj
  • M-step: re-estimate πu (the prior probability of bag u) and
    θu (the candy (C) probability in bag u) from the expected
    counts: πu = (1/K) Σj P( bag = u | dj ), and θu = the sum of
    P( bag = u | dj ) over chocolate candies divided by
    Σj P( bag = u | dj )
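A runnable sketch of these E- and M-steps for a candy mixture (only two bags here to keep it small; the observations and initial parameters are made up):

    # Sketch: EM for the candy problem when the bag label of each candy is missing.
    data  = ['C', 'C', 'L', 'C', 'L', 'L', 'C', 'C']   # made-up observations
    pi    = [0.5, 0.5]        # initial P(bag = u)
    theta = [0.6, 0.4]        # initial P(C | bag = u)

    for _ in range(50):
        # E-step: responsibilities P(bag = u | d_j) under the current parameters.
        resp = []
        for d in data:
            w = [pi[u] * (theta[u] if d == 'C' else 1 - theta[u]) for u in range(2)]
            z = sum(w)
            resp.append([wu / z for wu in w])
        # M-step: re-estimate the parameters from the expected (completed) counts.
        for u in range(2):
            n_u = sum(r[u] for r in resp)
            pi[u]    = n_u / len(data)
            theta[u] = sum(r[u] for r, d in zip(resp, data) if d == 'C') / n_u

    print(pi, theta)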
21
EM Learning of HMM parameters
  • An HMM needs EM for parameter learning (unless we
    know exactly the hidden state at every time
    instant)
  • Need to learn transition and emission
    parameters.
  • E.g.
  • Learning of HMMs for speech modeling.
  • Assume a general (word/language) model.
  • E-step: recognize (your own) speech using this
    model (Viterbi decoding).
  • M-step: tweak the parameters to recognize your speech
    a bit better (ML parameter fitting).
  • Repeat from the E-step.