Title: Introduction to Statistical Learning
1 Introduction to Statistical Learning
- Reading: Ch. 20, Sec. 1-4, AIMA 2nd Ed.
2 Learning under uncertainty
- How to learn probabilistic models such as Bayesian networks, Markov models,
  HMMs, ...?
- Examples
  - Class confusion example: how did we come up with the CPTs?
  - Earthquake-burglary network: where does the structure come from?
  - How do we learn HMMs for speech recognition?
  - Kalman model (e.g., mass, friction) parameters?
  - User models encoded as Bayesian networks for HCI?
3 Hypotheses and Bayesian theory
- Problem
  - Two kinds of candy, lemon and chocolate
  - Packed in five types of unmarked bags: (100% C, 0% L) 10% of the time,
    (75% C, 25% L) 20% of the time, (50% C, 50% L) 40% of the time,
    (25% C, 75% L) 20% of the time, (0% C, 100% L) 10% of the time
  - Task: open a bag, (unwrap a candy, observe it, ...), then predict what the
    next one will be
- Formulation
  - H (hypothesis): h1 = (100, 0), h2 = (75, 25), h3 = (50, 50),
    h4 = (25, 75), or h5 = (0, 100)
  - di (data): the i-th opened candy, L (lemon) or C (chocolate)
- Goal
  - Predict di+1 after seeing D = { d0, d1, ..., di }
  - i.e., compute P( di+1 | D )
4 Bayesian learning
- Bayesian solution: estimate probabilities of hypotheses (candy bag types),
  then predict data (candy type)
  - Hypothesis posterior:  P( hi | D ) ∝ P( D | hi ) P( hi )
    (data likelihood × hypothesis prior)
  - Prediction:  P( di+1 | D ) = Σhi P( di+1 | hi ) P( hi | D )
  - Data likelihood, for I.I.D. (independently, identically distributed) data
    points:  P( D | hi ) = P( d0 | hi ) × ... × P( di | hi )
5 Example
- P( hi ) = ?
  - P( hi ) = (0.1, 0.2, 0.4, 0.2, 0.1)
- P( di | hi ) = ?
  - e.g., P( chocolate | h1 ) = 1, P( lemon | h3 ) = 0.5
- P( C, C, C, C, C | h4 ) = ?
  - P( C, C, C, C, C | h4 ) = 0.25^5 ≈ 0.001
- P( h5 | C, C, C, C, C ) = ?
  - P( h5 | C, C, C, C, C ) ∝ P( C, C, C, C, C | h5 ) P( h5 ) = 0^5 × 0.1 = 0
- P( lemon | C, C, C, C, C ) = ?
  - P( lemon | C, C, C, C, C )
    = P( lemon | h1 ) P( h1 | C, C, C, C, C ) + ... + P( lemon | h5 ) P( h5 | C, C, C, C, C )
    = 0 × 0.6244 + 0.25 × 0.2963 + 0.50 × 0.0780 + 0.75 × 0.0012 + 1 × 0
    ≈ 0.1140
- P( chocolate | C, C, C, C, C ) = ?
  - P( chocolate | C, C, C, C, C ) = 1 - 0.1140 ≈ 0.886, and it → 1 as more
    chocolates are observed
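The update and prediction above can be checked with a short Python sketch of
the slide-4 equations (the code and variable names are illustrative, but it
reproduces the numbers above for five chocolates in a row):

```python
# Bayesian update and prediction for the candy problem (minimal sketch).
prior  = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_i), hypothesis prior
p_choc = [1.0, 0.75, 0.5, 0.25, 0.0]    # P(C | h_i), chocolate probability per bag type

def posterior(data):
    """P(h_i | D) for a candy string like 'CCCCC' (i.i.d. observations)."""
    unnorm = []
    for p_h, p_c in zip(prior, p_choc):
        lik = 1.0
        for d in data:                   # data likelihood P(D | h_i) = product over candies
            lik *= p_c if d == 'C' else (1.0 - p_c)
        unnorm.append(lik * p_h)
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lemon(data):
    """P(d_{i+1} = lemon | D) = sum_i P(lemon | h_i) P(h_i | D)."""
    return sum((1.0 - p_c) * p for p_c, p in zip(p_choc, posterior(data)))

print([round(p, 4) for p in posterior('CCCCC')])   # [0.6244, 0.2963, 0.078, 0.0012, 0.0]
print(round(predict_lemon('CCCCC'), 4))            # 0.114
```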
6 Bayesian prediction properties
- The true hypothesis eventually dominates the posterior
- Bayesian prediction is optimal (minimizes prediction error)
- Comes at a price: usually many hypotheses and an intractable summation
7 Approximations to Bayesian prediction
- MAP (maximum a posteriori): P( d | D ) ≈ P( d | hMAP ),
  hMAP = arg maxhi P( hi | D )  (easier to compute)
  - Role of the prior P( hi ): penalizes complex hypotheses
- ML (maximum likelihood): P( d | D ) ≈ P( d | hML ),
  hML = arg maxhi P( D | hi )
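As a rough illustration of how the two approximations differ from the full
Bayesian prediction, a small sketch reusing the posterior() helper from the
earlier snippet (again with five observed chocolates):

```python
# Assumes prior, p_choc and posterior() from the earlier sketch; D = 'CCCCC'.
post  = posterior('CCCCC')
h_map = max(range(5), key=lambda i: post[i])           # arg max_i P(h_i | D)
h_ml  = max(range(5), key=lambda i: p_choc[i] ** 5)    # arg max_i P(D | h_i), prior ignored
print(h_map, h_ml)            # 0 0 -> both pick h1 = (100% C, 0% L) here
print(1.0 - p_choc[h_map])    # MAP prediction P(lemon | D) = 0.0 (full Bayes gave 0.114)
```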
8 Learning from complete data
- Learn parameters of Bayesian models from data
  - e.g., learn the probabilities of C vs. L for a bag of candy whose
    proportions of C/L are unknown, by observing opened candy from that bag
- Candy problem parameters:
  - θu = probability of C in bag u
  - πu = probability of bag u

  Bag   Parameter
  1     π1
  2     π2
  3     π3
  4     π4
  5     π5

  Bag u:
  Candy   Parameter
  C       θu
  L       1 - θu
9 ML Learning from complete data
- ML approach: select model parameters to maximize the likelihood of the seen
  data
- Need to assume a distribution model that determines how the samples (of
  candy) are distributed in a bag
  - model: binomial
- Select the parameters of the model that maximize the likelihood of the seen
  data
  - likelihood: L(θu) = P( D | θu ) = θu^c (1 - θu)^ℓ, where c and ℓ are the
    numbers of chocolate and lemon candies observed from bag u
  - log-likelihood: log L(θu) = c log θu + ℓ log(1 - θu)
10 Maximum likelihood learning (binomial distribution)
- How do we find a solution to the above maximization problem?
11 Maximum likelihood learning (cont'd)
- Take the first derivative of the (log-)likelihood and set it to zero
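Carrying out that step for the binomial log-likelihood from slide 9 (with c
chocolates and ℓ lemons observed) gives the familiar counting estimate:

```latex
\frac{\partial}{\partial\theta}\Big[\,c\log\theta + \ell\log(1-\theta)\,\Big]
  \;=\; \frac{c}{\theta} - \frac{\ell}{1-\theta} \;=\; 0
  \quad\Longrightarrow\quad
  \theta_{ML} \;=\; \frac{c}{c+\ell}
```

For example, observing 7 chocolates and 3 lemons from a bag gives
θML = 7/10 = 0.7.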
12 Naïve Bayes model
- One set of causes, multiple independent sources of evidence

  [Diagram: class node C with evidence children E1, E2, ..., EN; each Ei is
  conditionally independent of the others given C]

- Example: C ∈ { spam, not spam }, Ei ∈ { token i present, token i absent }
- Limiting assumption, but often works well in practice
13 Inference (decision) in the NB model
- Decide by the sign of the log odds ratio (the hypothesis/class score):
  log [ P( c1 | e1, ..., eN ) / P( c2 | e1, ..., eN ) ]
    = log [ P( c1 ) / P( c2 ) ]                    (prior score)
      + Σi log [ P( ei | c1 ) / P( ei | c2 ) ]     (evidence scores)
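A minimal Python sketch of this decision rule; the token names and CPT values
below are invented for illustration, not taken from the slides:

```python
import math

# Hypothetical CPTs: P(token present | class) for each token, and the class prior.
p_spam         = 0.3                                  # prior P(C = spam)
p_tok_spam     = {'viagra': 0.20, 'meeting': 0.05}    # P(E_i = 1 | spam)
p_tok_not_spam = {'viagra': 0.01, 'meeting': 0.30}    # P(E_i = 1 | not spam)

def log_odds(tokens_present):
    """log P(spam | e) - log P(not spam | e): prior score + sum of evidence scores."""
    score = math.log(p_spam) - math.log(1.0 - p_spam)          # prior score
    for tok in p_tok_spam:
        present = tok in tokens_present
        p_s  = p_tok_spam[tok]     if present else 1.0 - p_tok_spam[tok]
        p_ns = p_tok_not_spam[tok] if present else 1.0 - p_tok_not_spam[tok]
        score += math.log(p_s) - math.log(p_ns)                # evidence score for token i
    return score

print('SPAM' if log_odds({'viagra'}) > 0 else 'NOT_SPAM')      # SPAM
```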
14 Learning in NB models
- Example: given a set of K email messages, each with tokens
  D = { dj = (e1j, ..., eNj) }, eij ∈ { 0, 1 }, and labels C = cj (SPAM or
  NOT_SPAM), find the best set of CPTs P( Ei | C ) and P( C ). Assume
  P( Ei | C = c ) is binomial with parameter θi,c and P( C ) is binomial with
  parameter θc.
  - 2N + 1 parameters
- ML learning: maximize the likelihood of the K messages, each one in one of
  the two classes,
  L = Πj P( cj ) Πi P( eij | cj )
  where cj is the label of message j and eij indicates whether token i is
  present or absent in message j
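The ML solution reduces to counting relative frequencies. A minimal sketch on
made-up toy messages (the data and variable names are illustrative
assumptions):

```python
# Toy data: K = 4 messages, N = 2 token indicators each, with class labels.
msgs   = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = ['SPAM', 'SPAM', 'NOT_SPAM', 'NOT_SPAM']

def ml_estimates(msgs, labels):
    """Relative-frequency (ML) estimates of P(C) and P(E_i = 1 | C)."""
    classes  = sorted(set(labels))
    n_tokens = len(msgs[0])
    theta_c  = {c: labels.count(c) / len(labels) for c in classes}   # P(C = c)
    theta_ic = {}                                                    # P(E_i = 1 | C = c)
    for c in classes:
        rows = [m for m, lab in zip(msgs, labels) if lab == c]
        for i in range(n_tokens):
            theta_ic[(i, c)] = sum(r[i] for r in rows) / len(rows)
    return theta_c, theta_ic

theta_c, theta_ic = ml_estimates(msgs, labels)
print(theta_c)                  # {'NOT_SPAM': 0.5, 'SPAM': 0.5}
print(theta_ic[(0, 'SPAM')])    # 1.0: token 0 appears in every SPAM message
```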
15 Learning of Bayesian network parameters
- Naïve Bayes learning can be extended to BNs! How?
- Model each CPT as a binomial/multinomial distribution. Maximize the
  likelihood of the data given the BN.

  Sample  E  B  A  N  C
  1       1  0  0  1  0
  2       1  0  1  1  0
  3       1  1  0  0  1
  4       0  1  0  1  1
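A counting sketch for the four complete samples above, assuming the usual
alarm-network structure (E and B are parents of A, and C depends on A); each
CPT entry is the relative frequency of the child value given its parent
configuration:

```python
# The four complete samples from the table above, columns (E, B, A, N, C).
data = [
    (1, 0, 0, 1, 0),
    (1, 0, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 1, 0, 1, 1),
]
E, B, A, N, C = range(5)

def cpt_entry(data, child, child_val, parents, parent_vals):
    """ML estimate of P(child = child_val | parents = parent_vals) by counting."""
    rows = [r for r in data if all(r[p] == v for p, v in zip(parents, parent_vals))]
    if not rows:
        return None     # parent configuration never observed: estimate undefined (next slide)
    return sum(1 for r in rows if r[child] == child_val) / len(rows)

print(cpt_entry(data, A, 1, [E, B], [1, 0]))   # P(A=1 | E=1, B=0) = 0.5
print(cpt_entry(data, C, 1, [A], [0]))         # P(C=1 | A=0) = 2/3
print(cpt_entry(data, A, 1, [E, B], [0, 0]))   # None: no samples with E=0, B=0
```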
16 BN Learning (cont'd)
- Issues
  - Priors on parameters. What if a count is zero, so an estimated probability
    comes out exactly 0? Should we trust it? Maybe always add some small
    pseudo-count α?
  - How do we learn a BN graph (structure)? Test all possible structures, then
    pick the one with the highest data likelihood?
  - What if we do not observe some nodes (evidence not on all nodes)?
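One simple version of the pseudo-count fix for a binary variable (a sketch,
assuming a symmetric pseudo-count α added to both outcomes):

```python
def smoothed_estimate(count_true, count_total, alpha=1.0):
    """(c + alpha) / (n + 2*alpha): never exactly 0 or 1, defined even when n = 0."""
    return (count_true + alpha) / (count_total + 2.0 * alpha)

print(smoothed_estimate(0, 0))   # 0.5 instead of 0/0
print(smoothed_estimate(0, 3))   # 0.2 instead of 0.0
```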
17 Learning from incomplete data
- Example
  - In the alarm network, we received data where we only know Newscast, Call,
    Earthquake, and Burglary, but have no idea what the Alarm state is.
  - In the SPAM model, we do not know whether a message is spam or not
    (missing label).

  Sample  E  B  A    N  C        (A is the hidden variable)
  1       1  0  N/A  1  0
  2       1  0  N/A  1  0
  3       1  1  N/A  0  1
  4       0  1  N/A  1  1

- Solution? We can still try to find network parameters that maximize the
  likelihood of the incomplete data.
18 Completing the data
- Maximizing the incomplete-data likelihood is tricky.
- If we could somehow complete the data, we would know how to select model
  parameters that maximize the likelihood of the completed data.
- How do we complete the missing data?
  - Randomly?
  - Estimate the missing values from the evidence, P( h | Evidence ).
19 EM Algorithm
- With completed data Dc, maximize the completed (log-)likelihood by weighting
  the contribution of each sample with P( h | d )
- E(xpectation)-M(aximization) algorithm
  - Pick initial parameter estimates θ0.
  - error = Inf
  - While ( error > max_error )
    - E-step: complete the data, Dc, based on θk-1.
    - M-step: compute new parameters θk that maximize the completed-data
      likelihood.
    - error = L( D | θk ) - L( D | θk-1 )
20 EM Example
- Candy problem, but now we do not know which bag the candy came from (bag
  label missing).
- E-step: for each opened candy dj, compute P( bag u | dj ) under the current
  parameters.
- M-step: from these responsibility-weighted counts, re-estimate
  - πu, the prior probability of bag u, and
  - θu, the candy (C) probability in bag u.
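A minimal EM sketch for this setting (the candy sequence, initial parameters,
and variable names below are made up for illustration): each E-step computes
P( bag u | candy dj ) for every sample, and each M-step re-estimates πu and θu
from the resulting soft counts.

```python
# Candies observed without bag labels; parameters pi[u] = P(bag u), theta[u] = P(C | bag u).
data  = ['C', 'C', 'L', 'C', 'L', 'C', 'C', 'L']   # made-up observations
pi    = [0.2] * 5                                  # initial guesses (theta_0 of slide 19)
theta = [0.9, 0.7, 0.5, 0.3, 0.1]

for _ in range(50):   # fixed iteration count instead of the likelihood-change test of slide 19
    # E-step: responsibilities r[j][u] = P(bag u | candy d_j) under current parameters
    resp = []
    for d in data:
        un = [pi[u] * (theta[u] if d == 'C' else 1.0 - theta[u]) for u in range(5)]
        z  = sum(un)
        resp.append([x / z for x in un])
    # M-step: re-estimate parameters from the expected (soft) counts
    for u in range(5):
        n_u      = sum(r[u] for r in resp)
        pi[u]    = n_u / len(data)                                          # prior probability of bag u
        theta[u] = sum(r[u] for r, d in zip(resp, data) if d == 'C') / n_u  # candy (C) probability in bag u

print([round(p, 3) for p in pi], [round(t, 3) for t in theta])
```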
21 EM Learning of HMM parameters
- An HMM needs EM for parameter learning (unless we know exactly the hidden
  state at every time instance)
  - Need to learn the transition and emission parameters.
- E.g., learning HMMs for speech modeling:
  1. Assume a general (word/language) model.
  2. E-step: recognize (your own) speech using this model (Viterbi decoding).
  3. M-step: tweak the parameters to recognize your speech a bit better
     (ML parameter fitting).
  4. Go to 2.