Introduction to Statistical Learning - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Introduction to Statistical Learning


1
Introduction to Statistical Learning
  • Reading Ch. 20, Sec. 1-4, AIMA 2nd Ed.

2
Learning under uncertainty
  • How to learn probabilistic models such as
    Bayesian networks, Markov models, HMMs, etc.?
  • Examples
  • Class confusion example: how did we come up with
    the CPTs?
  • Earthquake-burglary network structure?
  • How do we learn HMMs for speech recognition?
  • Kalman model (e.g., mass, friction) parameters?
  • User models encoded as Bayesian networks for HCI?

3
Hypotheses and Bayesian theory
  • Problem
  • Two kinds of candy, lemon and chocolate
  • Packed in five types of unmarked bags: (100% C, 0% L) 10% of
    the time, (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of
    the time, (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of
    the time
  • Task: open a bag, (unwrap a candy, observe it, ...), then
    predict what the next one will be
  • Formulation
  • H (Hypothesis): h1 = (100, 0), h2 = (75, 25), h3 = (50, 50),
    h4 = (25, 75), or h5 = (0, 100)
  • di (Data): the i-th opened candy, L (lemon) or C (chocolate)
  • Goal
  • Predict d_{i+1} after seeing D = d0, d1, ..., di, i.e., compute
    P( d_{i+1} | D )

4
Bayesian learning
  • Bayesian solution: estimate probabilities of the hypotheses
    (candy bag types), then predict the data (candy type)
  • Hypothesis posterior:  P( hi | D ) ∝ P( D | hi ) P( hi ),
    where P( hi ) is the hypothesis prior and P( D | hi ) is the
    data likelihood
  • Prediction:  P( d_{i+1} | D ) = Σ_hi P( d_{i+1} | hi ) P( hi | D )
  • Data likelihood for I.I.D. (independently, identically
    distributed) data points:
    P( D | hi ) = P( d0 | hi ) × ... × P( di | hi )
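As a concrete illustration of the update and prediction equations above, here is a minimal sketch for the candy problem; the priors and per-bag chocolate probabilities are the ones given on slides 3 and 5.

    # Minimal sketch: Bayesian updating and prediction for the candy problem.
    # Priors P(h_i) and per-bag chocolate probabilities P(C | h_i) are from the slides.
    PRIORS = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
    P_CHOC = [1.0, 0.75, 0.5, 0.25, 0.0]     # P(chocolate | h_i)

    def posterior(data):
        """P(h_i | D) for a sequence of 'C'/'L' observations (i.i.d. given h_i)."""
        unnorm = []
        for prior, pc in zip(PRIORS, P_CHOC):
            lik = 1.0
            for d in data:
                lik *= pc if d == 'C' else 1.0 - pc
            unnorm.append(prior * lik)
        z = sum(unnorm)
        return [u / z for u in unnorm]

    def predict(candy, data):
        """P(d_{i+1} = candy | D) = sum_i P(candy | h_i) P(h_i | D)."""
        return sum((pc if candy == 'C' else 1.0 - pc) * w
                   for pc, w in zip(P_CHOC, posterior(data)))

    print(posterior(['C', 'C', 'C']))     # posterior after three chocolates
    print(predict('L', ['C', 'C', 'C']))  # probability the next candy is a lemon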
5
Example
  • P( hi ) = ?
  • P( hi ) = (0.1, 0.2, 0.4, 0.2, 0.1)
  • P( di | hi ) = ?
  • P( chocolate | h1 ) = 1, P( lemon | h3 ) = 0.5
  • P( C, C, C, C, C | h4 ) = ?
  • P( C, C, C, C, C | h4 ) = 0.25^5
  • P( h5 | C, C, C, C, C ) = ?
  • P( h5 | C, C, C, C, C ) ∝ P( C, C, C, C, C | h5 ) P( h5 )
    = 0^5 × 0.1 = 0
  • P( lemon | C, C, C, C, C ) = ?
  • P( lemon | C, C, C, C, C ) = P( lemon | h1 ) P( h1 | C, C, C, C, C )
    + ... + P( lemon | h5 ) P( h5 | C, C, C, C, C )
    = 0 × 0.6244 + 0.25 × 0.2963 + 0.50 × 0.0780 + 0.75 × 0.0012 + 1 × 0
    ≈ 0.1140
  • P( chocolate | C, C, C, C, ... ) = ?
  • P( chocolate | C, C, C, C, ... ) → 1 as more and more chocolates
    are observed
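A short arithmetic check of the numbers on this slide (the posterior after five chocolates and the prediction for the next candy):

    # Check of the posterior and prediction after observing five chocolates.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]
    joint = [pr * pc ** 5 for pr, pc in zip(priors, p_choc)]
    post = [j / sum(joint) for j in joint]    # ~ [0.6244, 0.2963, 0.0780, 0.0012, 0.0]
    p_lemon = sum((1 - pc) * w for pc, w in zip(p_choc, post))
    print(post, p_lemon)                      # p_lemon ~ 0.1140, p_chocolate ~ 0.8860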

6
Bayesian prediction properties
  • True hypothesis eventually dominates
  • Bayesian prediction is optimal (minimizes
    prediction error)
  • Comes at a price: usually many hypotheses and an
    intractable summation

7
Approximations to Bayesian prediction
  • MAP (Maximum a posteriori): P( d | D ) ≈ P( d | hMAP ),
    where hMAP = arg max_hi P( hi | D )  (easier to compute)
  • Role of the prior P( hi ): it penalizes complex hypotheses
  • ML (Maximum likelihood): P( d | D ) ≈ P( d | hML ),
    where hML = arg max_hi P( D | hi )
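A small sketch of the two approximations on the candy setup; both pick a single hypothesis and predict with it, and ML simply drops the prior:

    # Sketch: MAP vs. ML hypothesis selection for the candy problem.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]
    data = ['C'] * 5

    def likelihood(pc):
        """P(D | h) for a bag whose chocolate probability is pc."""
        lik = 1.0
        for d in data:
            lik *= pc if d == 'C' else 1.0 - pc
        return lik

    liks = [likelihood(pc) for pc in p_choc]
    h_ml  = max(range(5), key=lambda i: liks[i])               # arg max P(D | h_i)
    h_map = max(range(5), key=lambda i: liks[i] * priors[i])   # arg max P(D | h_i) P(h_i)
    # Prediction is then approximated by P(d | h_MAP) or P(d | h_ML).
    print(h_ml + 1, h_map + 1)   # after five chocolates both select h1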

8
Learning from complete data
  • Learn parameters of Bayesian models from data
  • e.g., learn the probabilities of C and L for a bag of candy
    whose proportions of C and L are unknown, by observing opened
    candies from that bag
  • Candy problem parameters: θu = probability of C in bag u,
    πu = probability of bag u

Bag    Parameter
1      π1
2      π2
3      π3
4      π4
5      π5

Bag u
Candy  Parameter
C      θu
L      1 - θu
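To make the notation concrete, a hypothetical way of holding the two parameter sets in code (the values shown are the true proportions from slide 3, which learning should recover from data):

    # Hypothetical containers for the candy-problem parameters:
    # pi[u] = P(bag = u), theta[u] = P(C | bag = u), and P(L | bag = u) = 1 - theta[u].
    pi    = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}
    theta = {1: 1.0, 2: 0.75, 3: 0.5, 4: 0.25, 5: 0.0}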
9
ML Learning from complete data
  • ML approach: select model parameters that maximize the
    likelihood of the seen data
  • Need to assume a distribution model that determines how the
    samples (of candy) in a bag are distributed
  • Select the parameters of the model that maximize the
    likelihood of the seen data

model (binomial):  P( C | θu ) = θu,  P( L | θu ) = 1 - θu
likelihood:        L(θu) = P( D | θu ) = θu^c (1 - θu)^l, for c chocolates and l lemons observed from bag u
log-likelihood:    log L(θu) = c log θu + l log(1 - θu)
10
Maximum likelihood learning (binomial
distribution)
  • How to find a solution to the above problem?

11
Maximum likelihood learning (contd)
  • Take the first derivative of the (log-)likelihood and set it
    to zero:  c/θu - l/(1 - θu) = 0  ⇒  θML = c / (c + l)
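A quick numerical check of this closed form (the counts c and l below are made up): maximizing the log-likelihood over a grid of θ values lands on the count ratio.

    import math

    # Numerical check that theta_ML = c / (c + l) maximizes the binomial log-likelihood.
    c, l = 7, 3                                 # hypothetical counts of chocolates and lemons

    def loglik(t):
        return c * math.log(t) + l * math.log(1 - t)

    grid = [i / 1000 for i in range(1, 1000)]   # theta values in (0, 1)
    theta_grid = max(grid, key=loglik)
    print(theta_grid, c / (c + l))              # both ~ 0.7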

12
Naïve Bayes model
  • One set of causes, multiple independent sources
    of evidence

(Graph: class node C with children E1, E2, ..., Ei, ..., EN)

  • Example: C ∈ {spam, not spam}, Ei ∈ {token i present,
    token i absent}
  • A limiting assumption, but it often works well in practice

13
Inference and Decision in the NB model
  • Inference:  P( C | E1, ..., EN ) ∝ P( C ) Π_i P( Ei | C );
    in log form, the hypothesis (class) score is the prior score
    plus the sum of the per-token evidence scores
  • Decision: via the log odds ratio, e.g. declare spam when
    log [ P( spam | E ) / P( not spam | E ) ] > 0
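A sketch of this inference/decision step for the spam example, with made-up CPT values for three binary tokens:

    import math

    # Sketch: Naive Bayes log-odds decision for spam filtering (made-up CPTs).
    # p_token[c][i] = P(E_i = 1 | C = c), for c in {'spam', 'ham'}.
    p_class = {'spam': 0.4, 'ham': 0.6}
    p_token = {'spam': [0.8, 0.1, 0.6],
               'ham':  [0.2, 0.3, 0.1]}

    def log_odds(evidence):
        """log [ P(spam | e) / P(not spam | e) ] for a binary evidence vector e."""
        score = math.log(p_class['spam']) - math.log(p_class['ham'])   # prior score
        for i, e in enumerate(evidence):
            ps, ph = p_token['spam'][i], p_token['ham'][i]
            # evidence score of token i, depending on whether it is present or absent
            score += math.log(ps if e else 1 - ps) - math.log(ph if e else 1 - ph)
        return score

    print('SPAM' if log_odds([1, 0, 1]) > 0 else 'NOT_SPAM')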
14
Learning in NB models
  • Example: given a set of K email messages, each with tokens
    D = { dj = (e1j, ..., eNj) }, eij ∈ {0, 1}, and labels C = { cj }
    (SPAM or NOT_SPAM), find the best set of CPTs P( Ei | C ) and
    P( C ). Assume P( Ei | C = c ) is binomial with parameter θi,c
    and P( C ) is binomial with parameter θc.

2N + 1 parameters
  • ML learning: maximize the likelihood of the K messages, each
    one belonging to one of the two classes,
    L(θ) = Π_j P( C = cj ) Π_i P( Ei = eij | C = cj )

(cj = label of message j; eij = token i in message j present/absent)
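ML learning here reduces to counting; a sketch with a tiny made-up message set (θc is the fraction of class-c messages, θi,c the fraction of class-c messages containing token i):

    # Sketch: ML estimates for the Naive Bayes spam model are relative frequencies.
    # Each message j is a token vector (e_1j, ..., e_Nj) with a label c_j; data are made up.
    messages = [([1, 0, 1], 'spam'),
                ([1, 1, 0], 'spam'),
                ([0, 0, 1], 'ham'),
                ([0, 1, 0], 'ham'),
                ([0, 0, 0], 'ham')]
    N = 3                                                 # number of tokens

    K = len(messages)
    K_c = {c: sum(1 for _, lab in messages if lab == c) for c in ('spam', 'ham')}
    theta_c  = {c: K_c[c] / K for c in K_c}               # estimate of P(C = c)
    theta_ic = {c: [sum(e[i] for e, lab in messages if lab == c) / K_c[c]
                    for i in range(N)]
                for c in ('spam', 'ham')}                 # estimate of P(E_i = 1 | C = c)
    print(theta_c, theta_ic)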
15
Learning of Bayesian network parameters
  • Naïve Bayes learning can be extended to BNs!
    How?
  • Model each CPT as a binomial/multinomial
    distribution, then maximize the likelihood of the
    data given the BN.

Sample E B A N C
1 1 0 0 1 0
2 1 0 1 1 0
3 1 1 0 0 1
4 0 1 0 1 1
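For instance, using the four complete samples in the table above (and assuming E and B are the parents of A, as in the alarm network), one CPT entry is estimated as a relative frequency:

    # Sketch: estimating one CPT entry of the alarm network from the complete data above.
    samples = [  # (E, B, A, N, C)
        (1, 0, 0, 1, 0),
        (1, 0, 1, 1, 0),
        (1, 1, 0, 0, 1),
        (0, 1, 0, 1, 1),
    ]
    matching = [s for s in samples if s[0] == 1 and s[1] == 0]   # samples with E=1, B=0
    p_a1 = sum(s[2] for s in matching) / len(matching)           # ML estimate of P(A=1 | E=1, B=0)
    print(p_a1)                                                  # 0.5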
16
BN Learning (contd)
  • Issues
  • Priors on parameters. What if an estimated probability comes
    out as 0 because no such samples were observed? Should we
    trust it? Maybe always add some small pseudo-count α?
  • How do we learn a BN graph (structure)? Test all
    possible structures, then pick the one with the
    highest data likelihood?
  • What if we do not observe some nodes (evidence
    not on all nodes)?
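The pseudo-count idea from the first bullet amounts to adding a small constant to every count before normalizing; a sketch using the counts from the previous slide (alpha is a hypothetical smoothing constant):

    # Sketch: pseudo-count (smoothed) estimate of P(A=1 | E=1, B=0).
    alpha = 1.0                      # hypothetical pseudo-count
    n_matching, n_a1 = 2, 1          # samples with E=1, B=0; of which A=1
    p_ml       = n_a1 / n_matching                           # plain ML estimate: 0.5
    p_smoothed = (n_a1 + alpha) / (n_matching + 2 * alpha)   # never exactly 0 or 1
    print(p_ml, p_smoothed)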

17
Learning from incomplete data
  • Example
  • In the alarm network, we received data where we
    only know Newscast, Call, Earthquake, and Burglary,
    but have no idea what the Alarm state is.
  • In SPAM model, we do not know if a message is
    spam or not (missing label).

Sample E B A N C
1 1 0 N/A 1 0
2 1 0 N/A 1 0
3 1 1 N/A 0 1
4 0 1 N/A 1 1
  • Solution? We can still try to find the network
    parameters that maximize the likelihood of the
    incomplete data.

Hidden variable: A
18
Completing the data
  • Maximizing incomplete data likelihood is tricky.
  • If we could somehow complete the data, we would
    know how to select model parameters that maximize
    the likelihood of the completed data.
  • How do we complete the missing data?
  • Randomly complete?
  • Estimate the missing data from the evidence,
    P( h | Evidence ).

19
EM Algorithm
  • With completed data, Dc, maximize the completed
    (log-)likelihood by weighting the contribution of
    each sample with P( h | d )
  • E(xpectation)-M(aximization) algorithm
  • Pick initial parameter estimates θ0.
  • error = Inf
  • While (error > max_error)
  • E-step: complete the data, Dc, based on θk-1.
  • M-step: compute new parameters θk that maximize the
    completed-data likelihood.
  • error = L( D | θk ) - L( D | θk-1 )

20
EM Example
  • Candy problem, but now we do not know which bag
    the candy came from (bag label missing).
  • E-step: compute P( bag = u | dj ) ∝ πu P( dj | θu ) for every
    candy dj
  • M-step: re-estimate πu (the prior probability of bag u) and
    θu (the candy (C) probability in bag u) from the expected
    counts: πu = (1/K) Σj P( bag = u | dj ), and θu = the sum of
    P( bag = u | dj ) over chocolate candies divided by
    Σj P( bag = u | dj )
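A runnable sketch of these E- and M-steps for a candy mixture (only two bags here to keep it small; the observations and initial parameters are made up):

    # Sketch: EM for the candy problem when the bag label of each candy is missing.
    data  = ['C', 'C', 'L', 'C', 'L', 'L', 'C', 'C']   # made-up observations
    pi    = [0.5, 0.5]        # initial P(bag = u)
    theta = [0.6, 0.4]        # initial P(C | bag = u)

    for _ in range(50):
        # E-step: responsibilities P(bag = u | d_j) under the current parameters.
        resp = []
        for d in data:
            w = [pi[u] * (theta[u] if d == 'C' else 1 - theta[u]) for u in range(2)]
            z = sum(w)
            resp.append([wu / z for wu in w])
        # M-step: re-estimate the parameters from the expected (completed) counts.
        for u in range(2):
            n_u = sum(r[u] for r in resp)
            pi[u]    = n_u / len(data)
            theta[u] = sum(r[u] for r, d in zip(resp, data) if d == 'C') / n_u

    print(pi, theta)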
21
EM Learning of HMM parameters
  • An HMM needs EM for parameter learning (unless we
    know exactly the hidden state at every time
    instant)
  • Need to learn transition and emission
    parameters.
  • E.g.
  • Learning of HMMs for speech modeling.
  • Assume a general (word/language) model.
  • E-step: recognize (your own) speech using this
    model (Viterbi decoding).
  • M-step: tweak the parameters to recognize your speech
    a bit better (ML parameter fitting).
  • Repeat from the E-step.