Principles and Applications of Probabilistic Learning - PowerPoint PPT Presentation

About This Presentation
Title:

Principles and Applications of Probabilistic Learning

Description:

Principles and Applications of Probabilistic Learning. Padhraic Smyth, Department of Computer Science, University of California, Irvine. www.ics.uci.edu/~smyth – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Principles and Applications of Probabilistic Learning


1
Principles and Applications of Probabilistic
Learning
  • Padhraic Smyth
  • Department of Computer Science
  • University of California, Irvine
  • www.ics.uci.edu/~smyth

2
New Slides
NEW
  • Original slides created in mid-July for ACM
  • Some new slides have been added
  • new logo in upper left

3
New Slides
UPDATED
  • Original slides created in mid-July for ACM
  • Some new slides have been added
  • new logo in upper left
  • A few slides have been updated
  • updated logo in upper left
  • Current slides (including new and updated) at
  • www.ics.uci.edu/~smyth/talks

4
NEW
  • From the tutorial Web page
  • The intent of this tutorial is to provide a
    starting point for students and researchers

5
Probabilistic Modeling vs. Function Approximation
  • Two major themes in machine learning
  • 1. Function approximation/black box methods
  • e.g., for classification and regression
  • Learn a flexible function y = f(x)
  • e.g., SVMs, decision trees, boosting, etc
  • 2. Probabilistic learning
  • e.g., for regression, model p(y|x) or p(y,x)
  • e.g., graphical models, mixture models, hidden
    Markov models, etc
  • Both approaches are useful in general
  • In this tutorial we will focus only on the 2nd
    approach, probabilistic modeling

6
Motivations for Probabilistic Modeling
  • leverage prior knowledge
  • generalize beyond data analysis in vector-spaces
  • handle missing data
  • combine multiple types of information into an
    analysis
  • generate calibrated probability outputs
  • quantify uncertainty about parameters, models,
    and predictions in a statistical manner

7
Learning object models in vision Weber,
Welling, Perona, 2000
NEW
8
Learning object models in vision Weber,
Welling, Perona, 2000
NEW
9
Learning to Extract Information from Documents
NEW
e.g., Seymore, McCallum, Rosenfeld, 1999
10
NEW
11
NEW
Segal, Friedman, Koller, et al, Nature Genetics,
2005
12
P(Data | Parameters)
Probabilistic Model
Real World Data
13
P(Data | Parameters)
Probabilistic Model
Real World Data
P(Parameters | Data)
14
(Generative Model)
P(Data | Parameters)
Probabilistic Model
Real World Data
P(Parameters | Data)
(Inference)
15
Outline
  • Review of probability
  • Graphical models
  • Connecting probability models to data
  • Models with hidden variables
  • Case studies
  • (i) Simulating and forecasting rainfall data
    (ii) Curve clustering with cyclone trajectories
    (iii) Topic modeling from text documents

16
Part 1: Review of Probability
17
Notation and Definitions
  • X is a random variable
  • Lower-case x is some possible value for X
  • X = x is a logical proposition that X takes
    value x
  • There is uncertainty about the value of X
  • e.g., X is the Dow Jones index at 5pm tomorrow
  • p(X = x) is the probability that proposition X = x
    is true
  • often shortened to p(x)
  • If the set of possible x's is finite, we have a
    probability distribution and Σ p(x) = 1
  • If the set of possible x's is infinite, p(x) is a
    density function, and p(x) integrates to 1 over
    the range of X

18
Example
  • Let X be the Dow Jones Index (DJI) at 5pm Monday
    August 22nd (tomorrow)
  • X can take real values from 0 to some large
    number
  • p(x) is a density representing our uncertainty
    about X
  • This density could be constructed from historical
    data, e.g.,
  • After 5pm p(x) becomes infinitely narrow around
    the true known x (no uncertainty)

19
Probability as Degree of Belief
  • Different agents can have different p(x)s
  • Your p(x) and the p(x) of a Wall Street expert
    might be quite different
  • OR if we were on vacation we might not have
    access to stock market information
  • we would still be uncertain about p(x) after 5pm
  • So we should really think of p(x) as p(x | B_I)
  • where B_I is background information available to
    agent I
  • (will drop explicit conditioning on B_I in
    notation)
  • Thus, p(x) represents the degree of belief that
    agent I has in proposition x, conditioned on
    available background information

20
Comments on Degree of Belief
  • Different agents can have different probability
    models
  • There is no necessarily correct p(x)
  • Why? Because p(x) is a model built on whatever
    assumptions or background information we use
  • Naturally leads to the notion of updating
  • p(x | B_I) -> p(x | B_I, C_I)
  • This is the subjective Bayesian interpretation of
    probability
  • Generalizes other interpretations (such as
    frequentist)
  • Can be used in cases where frequentist reasoning
    is not applicable
  • We will use degree of belief as our
    interpretation of p(x) in this tutorial
  • Note!
  • Degree of belief is just our semantic
    interpretation of p(x)
  • The mathematics of probability (e.g., Bayes rule)
    remain the same regardless of our semantic
    interpretation

21
Multiple Variables
  • p(x, y, z)
  • Probability that X=x AND Y=y AND Z=z
  • Possible values: cross-product of X, Y, Z
  • e.g., X, Y, Z each take 10 possible values
  • x,y,z can take 10^3 possible values
  • p(x,y,z) is a 3-dimensional array/table
  • Defines 10^3 probabilities
  • Note the exponential increase as we add more
    variables
  • e.g., X, Y, Z are all real-valued
  • x,y,z live in a 3-dimensional vector space
  • p(x,y,z) is a positive function defined over this
    space, integrates to 1

22
Conditional Probability
  • p(x | y, z)
  • Probability of x given that Y=y and Z=z
  • Could be
  • hypothetical, e.g., if Y=y and if Z=z
  • observational, e.g., we observed values y and z
  • can also have p(x, y | z), etc
  • all probabilities are conditional probabilities
  • Computing conditional probabilities is the basis
    of many prediction and learning problems, e.g.,
  • p(DJI tomorrow | DJI index last week)
  • expected value of DJI tomorrow | DJI index last
    week
  • most likely value of parameter a given observed
    data

23
Computing Conditional Probabilities
  • Variables A, B, C, D
  • All distributions of interest related to A,B,C,D
    can be computed from the full joint distribution
    p(a,b,c,d)
  • Examples, using the Law of Total Probability
  • p(a) = Σ_{b,c,d} p(a, b, c, d)
  • p(c,d) = Σ_{a,b} p(a, b, c, d)
  • p(a,c | d) = Σ_b p(a, b, c | d)
  • where p(a, b, c | d) = p(a,b,c,d)/p(d)
  • These are standard probability manipulations;
    however, we will see how to use these to make
    inferences about parameters and unobserved
    variables, given data (a small numpy sketch of
    these manipulations follows below)
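As an illustration of these manipulations (not part of the original slides), here is a small numpy sketch that builds an arbitrary joint table p(a,b,c,d) and computes the marginals and conditionals listed above; the cardinality K and the random table are assumptions made purely for the example.

```python
# Marginals and conditionals from a full joint table p(a,b,c,d).
import numpy as np

K = 3  # assume each of A, B, C, D takes K values
rng = np.random.default_rng(0)
joint = rng.random((K, K, K, K))
joint /= joint.sum()                       # p(a,b,c,d), sums to 1

p_a  = joint.sum(axis=(1, 2, 3))           # p(a)   = sum_{b,c,d} p(a,b,c,d)
p_cd = joint.sum(axis=(0, 1))              # p(c,d) = sum_{a,b}   p(a,b,c,d)
p_d  = joint.sum(axis=(0, 1, 2))           # p(d)
p_abc_given_d = joint / p_d                # p(a,b,c|d) = p(a,b,c,d)/p(d), broadcasts over d
p_ac_given_d  = p_abc_given_d.sum(axis=1)  # p(a,c|d)   = sum_b p(a,b,c|d)

print(p_a.sum(), p_ac_given_d[:, :, 0].sum())  # both ≈ 1
```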

24
Conditional Independence
  • A is conditionally independent of B given C iff
  • p(a | b, c) = p(a | c)
  • (also implies that B is conditionally independent
    of A given C)
  • In words, B provides no information about A, if
    value of C is known
  • Example
  • a = patient has upset stomach
  • b = patient has headache
  • c = patient has flu
  • Note that conditional independence does not imply
    marginal independence

25
Two Practical Problems
  • (Assume for simplicity each variable takes K
    values)
  • Problem 1 Computational Complexity
  • Conditional probability computations scale as
    O(K^N)
  • where N is the number of variables being summed
    over
  • Problem 2 Model Specification
  • To specify a joint distribution we need a table
    of O(K^N) numbers
  • Where do these numbers come from?

26
Two Key Ideas
  • Problem 1 Computational Complexity
  • Idea Graphical models
  • Structured probability models lead to tractable
    inference
  • Problem 2 Model Specification
  • Idea Probabilistic learning
  • General principles for learning from data

27
Part 2: Graphical Models
28
probability theory is more fundamentally
concerned with the structure of reasoning and
causation than with numbers.
Glenn Shafer and Judea Pearl Introduction to
Readings in Uncertain Reasoning, Morgan Kaufmann,
1990
29
Graphical Models
  • Represent dependency structure with a directed
    graph
  • Node <-> random variable
  • Edges encode dependencies
  • Absence of edge -> conditional independence
  • Directed and undirected versions
  • Why is this useful?
  • A language for communication
  • A language for computation
  • Origins
  • Wright 1920s
  • Independently developed by Spiegelhalter and
    Lauritzen in statistics and Pearl in computer
    science in the late 1980s

30
Examples of 3-way Graphical Models
Marginal Independence: p(A,B,C) = p(A) p(B) p(C)
31
Examples of 3-way Graphical Models
Conditionally independent effects: p(A,B,C) =
p(B|A) p(C|A) p(A). B and C are conditionally
independent given A, e.g., A is a disease, and we
model B and C as conditionally
independent symptoms given A
32
Examples of 3-way Graphical Models
Independent Causes: p(A,B,C) = p(C|A,B) p(A) p(B)
33
Examples of 3-way Graphical Models
Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)
34
Real-World Example
  • Monitoring Intensive-Care Patients
  • 37 variables
  • 509 parameters
  • instead of 2^37
  • (figure courtesy of Kevin
  • Murphy/Nir Friedman)

35
Directed Graphical Models
p(A,B,C) = p(C|A,B) p(A) p(B)

36
Directed Graphical Models
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X1, X2, ..., XN) = Π p(Xi |
parents(Xi))

37
Directed Graphical Models
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X1, X2, ..., XN) = Π p(Xi |
parents(Xi))
  • Probability model has simple factored form
  • Directed edges => direct dependence
  • Absence of an edge => conditional independence
  • Also known as belief networks, Bayesian
    networks, causal networks


38
Example
D
B
E
C
A
F
G
39
Example
D
B
E
C
A
F
G
p(A, B, C, D, E, F, G) = Π p( variable |
parents ) = p(A|B) p(C|B) p(B|D) p(F|E) p(G|E)
p(E|D) p(D)
40
Example
D
B
E
c
A
g
F
Say we want to compute p(a | c, g)
41
Example
D
B
E
c
A
g
F
Direct calculation: p(a|c,g) = Σ_{b,d,e,f} p(a,b,d,e,f
| c,g). Complexity of the sum is O(K^4)
42
Example
D
B
E
c
A
g
F
Reordering (using factorization): Σ_b p(a|b) Σ_d
p(b|d,c) Σ_e p(d|e) Σ_f p(e,f | g)
43
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f
p(e,f | g), where the innermost sum Σ_f p(e,f | g) gives
p(e|g)
44
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e)
p(e|g), where Σ_e p(d|e) p(e|g) gives
p(d|g)
45
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) p(d|g), where
Σ_d p(b|d,c) p(d|g) gives p(b|c,g)
46
Example
D
B
E
c
A
g
F
Reordering: Σ_b p(a|b) p(b|c,g) gives
p(a|c,g)
Complexity is O(K), compared to O(K^4) (see the einsum sketch below)
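To make the reordering idea concrete, here is a hedged numpy sketch (not from the talk) for this seven-variable example; the random CPTs and K = 4 values per variable are assumptions, and np.einsum is simply allowed to sum variables out one at a time, reproducing the brute-force answer at far lower cost.

```python
# Brute-force vs. reordered summation for p(a | c, g) under the factorization
# p(A|B) p(C|B) p(B|D) p(F|E) p(G|E) p(E|D) p(D).
import numpy as np

K = 4
rng = np.random.default_rng(1)
def cpt(*shape):                       # random conditional table, normalized over axis 0
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

pA_B, pC_B, pB_D = cpt(K, K), cpt(K, K), cpt(K, K)
pF_E, pG_E, pE_D, pD = cpt(K, K), cpt(K, K), cpt(K, K), cpt(K)

# Full joint p(a,b,c,d,e,f,g) by brute force: O(K^7) entries.
joint = np.einsum('ab,cb,bd,fe,ge,ed,d->abcdefg',
                  pA_B, pC_B, pB_D, pF_E, pG_E, pE_D, pD)
c_obs, g_obs = 0, 2                    # observed values of C and G
slice_cg = joint[:, :, c_obs, :, :, :, g_obs]
p_a_brute = slice_cg.sum(axis=(1, 2, 3, 4))
p_a_brute /= p_a_brute.sum()           # p(a | c, g)

# Same answer, but einsum can eliminate b, d, e, f one at a time (never builds the joint).
p_a_fast = np.einsum('ab,b,bd,fe,e,ed,d->a',
                     pA_B, pC_B[c_obs], pB_D, pF_E, pG_E[g_obs], pE_D, pD)
p_a_fast /= p_a_fast.sum()
print(np.allclose(p_a_brute, p_a_fast))   # True
```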
47
A More General Algorithm
  • Message Passing (MP) Algorithm
  • Pearl, 1988 Lauritzen and Spiegelhalter, 1988
  • Declare 1 node (any node) to be a root
  • Schedule two phases of message-passing
  • nodes pass messages up to the root
  • messages are distributed back to the leaves
  • In time O(N), we can compute P(.)

48
Sketch of the MP algorithm in action
49
Sketch of the MP algorithm in action
1
50
Sketch of the MP algorithm in action
2
1
51
Sketch of the MP algorithm in action
2
1
3
52
Sketch of the MP algorithm in action
2
1
3
4
53
Complexity of the MP Algorithm
  • Efficient
  • Complexity scales as O(N K^m)
  • N = number of variables
  • K = arity of variables
  • m = maximum number of parents for any node
  • Compare to O(K^N) for brute-force method

54
Graphs with loops
D
B
E
C
A
F
G
Message passing algorithm does not work
when there are multiple paths between 2 nodes
55
Graphs with loops
D
B
E
C
A
F
G
General approach cluster variables together to
convert graph to a tree
56
Reduce to a Tree
D
B, E
C
A
F
G
57
Reduce to a Tree
D
B, E
C
A
F
G
Good news: can perform MP algorithm on this
tree. Bad news: complexity is now O(K^2)
58
Probability Calculations on Graphs
  • Structure of the graph reveals
  • Computational strategy
  • Dependency relations
  • Complexity is typically O(K^(max number of
    parents))
  • If single parents (e.g., tree), -> O(K)
  • The sparser the graph the lower the complexity
  • Technique can be automated
  • i.e., a fully general algorithm for arbitrary
    graphs
  • For continuous variables
  • replace sum with integral
  • For identification of most likely values
  • Replace sum with max operator

59
Hidden Markov Model (HMM)
Observed
Y3
Yn
Y1
Y2
Hidden
S3
Sn
S1
S2
Two key assumptions: 1. hidden state sequence is
Markov; 2. observation Yt is CI of all
other variables given St. Widely used in speech
recognition, protein sequence models. Motivation:
switching dynamics, low-dimensional representation of Y's,
etc
60
HMMs as graphical models
  • Computations of interest
  • p( Y ) = Σ_s p(Y, S = s)    -> forward-backward
    algorithm (see the sketch below)
  • arg max_s p(S = s | Y)    -> Viterbi
    algorithm
  • Both algorithms...
  • computation time linear in T
  • special cases of MP algorithm
  • Many generalizations and extensions...
  • Make state S continuous -> Kalman filters
  • Add inputs -> convolutional decoding
  • Add additional dependencies in the model
  • Generalized HMMs
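A minimal sketch of the forward recursion behind the forward-backward computation of p(Y) = Σ_s p(Y, S = s); the parameter names pi (initial state probabilities), A (transition matrix), and B (discrete emission matrix) are assumptions for the example, and the rescaling trick is one common way to avoid numerical underflow.

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """obs: observation ids; pi: (K,), A: (K,K) with A[i,j]=p(S_t=j|S_{t-1}=i), B: (K,V)."""
    alpha = pi * B[:, obs[0]]              # alpha_1(s) = p(y_1, S_1 = s)
    c = alpha.sum(); alpha /= c            # rescale to avoid underflow
    log_p = np.log(c)
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]      # propagate one step, then weight by the emission
        c = alpha.sum(); alpha /= c
        log_p += np.log(c)
    return log_p                           # log p(Y) = log sum_s p(Y, S = s)

# Tiny usage example with assumed (random, row-normalized) parameters
rng = np.random.default_rng(0)
K, V, T = 3, 5, 50
pi = np.ones(K) / K
A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((K, V)); B /= B.sum(axis=1, keepdims=True)
obs = rng.integers(V, size=T)
print(hmm_log_likelihood(obs, pi, A, B))   # cost is linear in T
```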


61
Part 3: Connecting Probability Models to Data
  • Recommended References for this Section
  • All of Statistics, L. Wasserman, Chapman and
    Hall, 2004 (Chapters 6,9,11)
  • Pattern Classification and Scene Analysis, 1st
    ed, R. Duda and P. Hart, Wiley, 1973, Chapter
    3.

62
(Generative Model)
P(Data | Parameters)
Probabilistic Model
Real World Data
P(Parameters | Data)
(Inference)
63
Conditionally Independent Observations
NEW
q
Model parameters
Data
yn-1
yn
y1
y2
64
Plate Notation
q
Model parameters
Data = y1, ..., yn
yi
i = 1..n
Plate = rectangle in graphical model;
variables within a plate are replicated
in a conditionally independent manner
65
Example Gaussian Model
σ
μ
yi
i = 1..n
Generative model: p(y1, ..., yn | μ, σ) = Π p(yi | μ, σ)
= p(data | parameters)
= p(D | θ), where θ = {μ, σ}
66
The Likelihood Function
  • Likelihood = p(data | parameters)
  • = p( D | θ )
  • = L(θ)
  • Likelihood tells us how likely the observed data
    are conditioned on a particular setting of the
    parameters
  • Details
  • Constants that do not involve θ can be dropped in
    defining L(θ)
  • Often easier to work with log L(θ)

67
Comments on the Likelihood Function
  • Constructing a likelihood function L(θ) is the
    first step in probabilistic modeling
  • The likelihood function implicitly assumes an
    underlying probabilistic model M with parameters
    θ
  • L(θ) connects the model to the observed data
  • Graphical models provide a useful language for
    constructing likelihoods

68
Binomial Likelihood
NEW
  • Binomial model
  • N memoryless trials
  • probability θ of success at each trial
  • Observed data
  • r successes in n trials
  • Defines a likelihood
  • L(θ) = p(D | θ)
  • = p(successes) x
    p(non-successes)
  • = θ^r (1-θ)^(n-r)  (see the sketch below)

[Plate diagram: θ → yi, i = 1..n]
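A small numeric sketch (the counts n = 20, r = 7 are made up) of this binomial likelihood: evaluating log L(θ) = r log θ + (n - r) log(1 - θ) on a grid shows the maximum at the intuitive value r/n.

```python
import numpy as np

n, r = 20, 7                                  # assumed data: 7 successes in 20 trials
theta = np.linspace(0.001, 0.999, 999)
log_L = r * np.log(theta) + (n - r) * np.log(1 - theta)
print(theta[np.argmax(log_L)], r / n)         # both ≈ 0.35
```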
69
Binomial Likelihood Examples
NEW
70
Gaussian Model and Likelihood
Model assumptions 1. ys are
conditionally independent given model 2.
each y comes from a Gaussian (Normal) density
71
(No Transcript)
72
Conditional Independence (CI)
  • CI in a likelihood model means that we are
    assuming data points provide no information about
    each other, if the model parameters are assumed
    known.
  • p( D | θ ) = p(y1, ..., yN | θ ) = Π
    p(yi | θ )
  • Works well for (e.g.)
  • Patients randomly arriving at a clinic
  • Web surfers randomly arriving at a Web site
  • Does not work well for
  • Time-dependent data (e.g., stock market)
  • Spatial data (e.g., pixel correlations)

CI assumption
73
Example Markov Likelihood
  • Motivation wish to model data in a sequence
    where there is sequential dependence,
  • e.g., a first-order Markov chain for a DNA
    sequence
  • Markov modeling assumption
  • p(y_t | y_{t-1}, y_{t-2}, ..., y_1) = p(y_t | y_{t-1})
  • θ = K x K matrix of transition
    probabilities
  • L( θ ) = p( D | θ ) = p(y_1, ..., y_N | θ ) = Π
    p(y_t | y_{t-1}, θ )  (see the counting sketch below)
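A hedged sketch of ML estimation for this Markov likelihood on an assumed toy DNA string: maximizing L(θ) reduces to counting observed transitions and normalizing each row of the count matrix.

```python
import numpy as np

seq = "ACGTACGGTCAACGTTA"                     # assumed toy sequence
symbols = sorted(set(seq)); idx = {s: i for i, s in enumerate(symbols)}
K = len(symbols)
counts = np.zeros((K, K))
for prev, curr in zip(seq[:-1], seq[1:]):
    counts[idx[prev], idx[curr]] += 1         # count transitions y_{t-1} -> y_t
theta_ML = counts / counts.sum(axis=1, keepdims=True)   # row k = p(y_t | y_{t-1} = k)
print(np.round(theta_ML, 2))
```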

74
Maximum Likelihood (ML) Principle
(R. Fisher 1922)
Model parameters
θ
Data = y1, ..., yn
yi
i = 1..n
L(θ) = p(Data | θ ) = Π p(yi | θ ). Maximum
Likelihood: θ_ML = arg max
Likelihood(θ). Select the parameters that make
the observed data most likely
75
Example ML for Gaussian Model
Maximum Likelihood Estimate θ_ML
76
Maximizing the Likelihood
  • More generally, we analytically solve for the θ
    value that maximizes the function L(θ)
  • With p parameters, L(θ) is a scalar function
    defined over a p-dimensional space
  • 2 situations
  • We can analytically solve for the maxima of L(θ)
  • This is rare
  • We have to resort to iterative techniques to find
    θ_ML
  • More common
  • General approach
  • Define a generative probabilistic model
  • Define an associated likelihood (connect model to
    data)
  • Solve an optimization problem to find θ_ML

77
Analytical Solution for Gaussian Likelihood
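As a sketch of what the analytical solution looks like in practice (synthetic data assumed): setting the derivatives of log L(μ, σ) to zero gives the sample mean and the (1/n, i.e. biased) sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=1000)   # assumed synthetic data
mu_ML = y.mean()                                # mu_ML = (1/n) sum y_i
sigma2_ML = ((y - mu_ML) ** 2).mean()           # sigma^2_ML = (1/n) sum (y_i - mu_ML)^2
print(mu_ML, np.sqrt(sigma2_ML))                # ≈ 3.0 and ≈ 2.0
```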
78
Graphical Model for Regression
xi
s
q
yi
i1n
79
Example
y
x
80
Example
y
f(x ; θ) (this is unknown)
x
81
Example ML for Linear Regression
  • Generative model
  • y = ax + b + Gaussian noise
  • p(y | x) = N(ax + b, σ)
  • Conditional Likelihood
  • L(θ) = p(y1, ..., yN | x1, ..., xN, θ )
  • = Π p(yi | xi , θ ),
    θ = {a, b}
  • Can show (homework problem!) that
  • log L(θ) ∝ - Σ (yi - (a xi + b))^2
  • i.e., finding a,b to maximize log-
    likelihood is the same as finding a,b that
    minimizes least squares (see the sketch below)
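A brief sketch (synthetic x, y with assumed true a = 2.5, b = -1) of the equivalence claimed above: the least-squares solution also maximizes the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.5 * x - 1.0 + rng.normal(0.0, 1.0, size=200)    # assumed true a = 2.5, b = -1

# Negative log-likelihood under Gaussian noise, up to additive/multiplicative constants
neg_log_L = lambda a, b: np.sum((y - (a * x + b)) ** 2)

# Least-squares fit via the normal equations
X = np.column_stack([x, np.ones_like(x)])
a_ls, b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(a_ls, b_ls)                                     # close to 2.5 and -1.0

# Moving away from the least-squares solution can only increase the negative log-likelihood
print(neg_log_L(a_ls, b_ls) <= neg_log_L(a_ls + 0.05, b_ls - 0.05))   # True
```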

82
ML and Regression
  • Multivariate case
  • multiple xs, multiple regression coefficients
  • with Gaussian noise, the ML solution is again
    equivalent to least-squares (solutions to a set
    of linear equations)
  • Non-linear multivariate model
  • With Gaussian noise we get
  • log L(q) - S yi - f (xi q) 2
  • Conditions for the q that maximizes L(q) leads
    to a set of p non-linear equations in p variables
  • e.g., f (xi q) a multilayer neural network
    with 1000 weights
  • Optimization finding the maximum of a
    non-convex function in 1000 dimensional space!
  • Typically use iterative local search based on
    gradient (many possible variations)

83
Probabilistic Learning and Classification
NEW
  • 2 main approaches
  • p(c | x) = p(x|c) p(c) / p(x) ∝ p(x|c) p(c)
  • -> learn a model for p(x|c) for each class,
    use Bayes rule to classify
  • - example: naïve Bayes
  • - advantage: theoretically optimal if
    p(x|c) is correct
  • - disadvantage: not directly
    optimizing predictive accuracy
  • Learn p(c|x) directly, e.g.,
  • logistic regression (see tutorial notes from D.
    Lewis)
  • other regression methods such as neural networks,
    etc.
  • Often quite effective in practice very useful
    for ranking, scoring, etc
  • Contrast with purely discriminative methods such
    as SVMs, trees

84
The Bayesian Approach to Learning
α
Prior(θ) = p( θ | α )
θ
yi
i = 1..n
Maximum A Posteriori: θ_MAP = arg max
Likelihood(θ) x Prior(θ). Fully Bayesian:
p( θ | Data) = p(Data | θ ) p(θ) / p(Data)
85
The Bayesian Approach
Fully Bayesian: p( θ | Data) = p(Data | θ ) p(θ)
/ p(Data) = Likelihood x Prior
/ Normalization term. Estimating p( θ | Data) can
be viewed as inference in a graphical model. ML
is a special case: MAP with a flat prior
α
θ
yi
i = 1..n
86
More Comments on Bayesian Learning
  • fully Bayesian: report full posterior density
    p(θ | D)
  • For simple models, we can calculate p(θ | D)
    analytically
  • Otherwise we empirically estimate p(θ | D)
  • Monte Carlo sampling methods are very useful
  • Bayesian prediction (e.g., for regression)
  • p(y | x, D ) = integral of p(y, θ | x, D) dθ
  • = integral of p(y | θ, x) p(θ |
    D) dθ
  • -> prediction at each θ is
    weighted by p(θ | D)
  • theoretically preferable to picking a single θ
    (as in ML)

87
More Comments on Bayesian Learning
  • In practice
  • Fully Bayesian is theoretically optimal but not
    always the most practical approach
  • E.g., computational limitations with large
    numbers of parameters
  • assessing priors can be tricky
  • Bayesian approach particularly useful for small
    data sets
  • For large data sets, Bayesian, MAP, ML tend to
    agree
  • ML/MAP are much simpler => often used in practice

88
Example of Bayesian Estimation
  • Definition of Beta prior
  • Definition of Binomial likelihood
  • Form of Beta posterior
  • Examples of plots with prior + likelihood ->
    posterior

89
Beta Density as a Prior
NEW
  • Let θ be a proportion,
  • e.g., fraction of customers that respond to an
    email ad
  • p(θ) is a prior for θ
  • e.g. p(θ | a, b) = Beta density with parameters a
    and b
  • p(θ | a, b) ∝ θ^(a-1) (1-θ)^(b-1)
  • a /(a + b) influences the location
  • a + b controls the width

[Graphical model: hyperparameters a, b → θ]
90
Examples of Beta Density Priors
NEW
91
Binomial Likelihood
NEW
  • Binomial model
  • N memoryless trials
  • probability θ of success at each trial
  • Observed data
  • r successes in n trials
  • Defines a likelihood
  • p(D | θ) = p(successes) x
    p(non-successes)
  • = θ^r (1-θ)^(n-r)

92
Beta x Binomial -> Beta
NEW
  • p(θ | D) = Posterior ∝ Likelihood x Prior
  • = Binomial x
    Beta
  • ∝ θ^r (1-θ)^(n-r)
    x θ^(a-1) (1-θ)^(b-1)
  • = Beta(a + r, b +
    n - r)
  • Prior is updated using data
  • Parameters: a -> a + r, b -> b + n - r
  • Sample size: a + b -> a + b + n
  • Mean: a /(a + b) -> (a + r)/(a + b + n)
  • (a worked numerical sketch follows below)
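A worked numeric sketch of this conjugate update (the prior a = 2, b = 6 and the data r = 7, n = 20 are assumed values): the posterior is Beta(a + r, b + n - r), and its mean shifts from the prior mean toward r/n.

```python
from scipy.stats import beta

a, b = 2, 6          # assumed Beta prior
n, r = 20, 7         # assumed data: r successes in n trials
posterior = beta(a + r, b + n - r)
prior_mean = a / (a + b)                      # 0.25
posterior_mean = (a + r) / (a + b + n)        # ≈ 0.321
print(prior_mean, posterior_mean, posterior.mean())
```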

93
NEW
94
NEW
95
NEW
96
Extensions
NEW
  • K categories with K probabilities that sum to 1
  • Dirichlet prior x Multinomial likelihood ->
    Dirichlet posterior
  • Used in text modeling, protein alignment
    algorithms, etc
  • E.g. Biological Sequence Analysis, R. Durbin et
    al., Cambridge University Press, 1998.
  • Hierarchical modeling
  • Multiple trials for different individuals
  • Each individual has their own θ
  • The θ's ~ common population distribution
  • For applications in marketing see
  • Market Segmentation Conceptual and
    Methodological Foundations, M. Wedel and W. A.
    Kamakura, Kluwer, 1998

97
Example Bayesian Gaussian Model
a
b
s
m
yi
i1n
Note priors and parameters are assumed
independent here
98
Example Bayesian Regression
β
α
xi
σ
θ
yi
i = 1..n
Model: yi = f(xi ; θ) + e, e ~ N(0, σ^2)
p(yi | xi) = N( f(xi ; θ), σ^2 )
99
Other Examples
UPDATED
  • Bayesian examples
  • Bayesian neural networks
  • Richer probabilistic models
  • Random effects models
  • E.g., Learning to align curves
  • Learning model structure
  • Chow-Liu trees
  • General graphical model structures
  • e.g. gene regulation networks
  • Comprehensive reference
  • Bayesian Data Analysis, A. Gelman, J. B. Carlin.
    H. S. Stern, and D. B. Rubin, Chapman and Hall,
    2nd edition, 2003.

100
Learning Shapes and Shifts
Data: smoothed growth acceleration data from
teenagers. EM used to learn a spline model +
time-shift for each curve
Data after Learning
Original data
101
Learning to Track People Sidenbladh, Black ,
Fleet, 2000
102
Model Uncertainty
  • How do we know what model M to select for our
    likelihood function?
  • In general, we don't!
  • However, we can use the data to help us infer
    which model from a set of possible models is best

103
Method 1 Bayesian Approach
  • Can evaluate the evidence for each model,
  • p(M | D) = p(D|M) p(M)/ p(D)
  • Can get p(D|M) by integrating p(D, θ | M) over
    parameter space (this is the marginal
    likelihood)
  • in theory p(M | D) is how much evidence exists in
    the data for model M
  • More complex models are automatically penalized
    because of the integration over
    higher-dimensional parameter spaces
  • in practice p(M|D) can rarely be computed
    directly
  • Monte Carlo schemes are popular
  • Also approximations such as BIC, Laplace, etc

104
Comments on Bayesian Approach
  • Bayesian Model Averaging (BMA)
  • Instead of selecting the single best model, for
    prediction average over all available models
    (theoretically the correct thing to do)
  • Weights used for averaging are p(M|D)
  • Empirical alternatives
  • e.g., Stacking, Bagging
  • Idea is to learn a set of unconstrained combining
    weights from the data, weights that optimize
    predictive accuracy
  • emulate BMA approach
  • may be more effective in practice

105
Method 2 Predictive Validation
  • Instead of the Bayesian approach, we could use
    the probability of new unseen test data as our
    metric for selecting models
  • E.g., 2 models
  • If p(D | M1) > p(D | M2) then M1 is assigning
    higher probability to new data than M2
  • This will (with enough data) select the model
    that predicts the best, in a probabilistic sense
  • Useful for problems where we have very large
    amounts of data and it is easy to create a large
    validation data set D

106
The Prediction Game
NEW
[Sketch: observed data points x on the interval 0 to 10.
What is a good guess at p(x)?
Two candidate densities are shown: Model A for p(x) and Model B for p(x)]
107
Which of Model A or B is better?
NEW
Test data generated from the true underlying q(x)
Model A
Model B
We can score each model in terms of p(new data |
model). Asymptotically, this is a fair, unbiased
score (irrespective of the complexities of
the models). Note: the empirical average of log
p(data) scores -> negative entropy (see the scoring sketch below)
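A small sketch of this scoring procedure; the "true" generating density and the two candidate models below are assumptions chosen only to illustrate comparing models by the average log-probability they assign to held-out data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
test = rng.normal(5.0, 1.0, size=5000)        # test data from the (assumed) true q(x)

model_A = norm(loc=5.0, scale=1.2)            # assumed candidate densities for p(x)
model_B = norm(loc=3.0, scale=3.0)

score_A = np.mean(model_A.logpdf(test))       # empirical average of log p(x | model)
score_B = np.mean(model_B.logpdf(test))
print(score_A, score_B)                       # the model with the higher score predicts better
```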
108
NEW
Model-based clustering and visualization of
navigation patterns on a Web site Cadez et al,
Journal of Data Mining and Knowledge Discovery,
2003
109
Simple Model Class
110
Data-generating process (truth)
Simple Model Class
111
Data-generating process (truth)
Closest model in terms of KL distance
Simple Model Class
Best model is relatively far from Truth => High
Bias
112
Data-generating process (truth)
Simple Model Class
Complex Model Class
113
Data-generating process (truth)
Simple Model Class
Best model is closer to Truth => Low Bias
Complex Model Class
114
However, this could be the model that best fits
the observed data => High Variance
Data-generating process (truth)
Simple Model Class
Complex Model Class
115
Part 4: Models with Hidden Variables
116
Hidden or Latent Variables
  • In many applications there are 2 sets of
    variables
  • Variables whose values we can directly measure
  • Variables that are hidden, cannot be measured
  • Examples
  • Speech recognition
  • Observed acoustic voice signal
  • Hidden label of the word spoken
  • Face tracking in images
  • Observed pixel intensities
  • Hidden position of the face in the image
  • Text modeling
  • Observed counts of words in a document
  • Hidden topics that the document is about

117
Mixture Models
Pearson, 1894, Phil. Trans. Roy. Soc. A.
p(Y) = Σ_k p(Y | S = k) p(S = k)
S
Hidden discrete variable
Y
Observed variable(s)
Motivation 1. models a true process (e.g., fish
example) 2. approximation for a
complex process
118
(No Transcript)
119
(No Transcript)
120
(No Transcript)
121
A Graphical Model for Clustering
Hidden discrete (cluster) variable
S
Yj
Y1
Yd
Observed variable(s) (assumed conditionally
independent given S)
Clusters = p(Y1, ..., Yd | S = s). Probabilistic
clustering = learning these probability
distributions from data
122
Hidden Markov Model (HMM)
Observed
Y3
Yn
Y1
Y2
Hidden
S3
Sn
S1
S2
Two key assumptions: 1. hidden state sequence is
Markov; 2. observation Yt is CI of all
other variables given St. Widely used in speech
recognition, protein sequence models. Motivation?
- S can provide non-linear switching
- S can encode low-dim time-dependence for
high-dim Y
123
Generalizing HMMs
Y3
Yn
Y1
Y2
S3
Sn
S1
S2
T3
Tn
T1
T2
Two independent state variables, e.g., two
processes evolving at different time-scales
124
Generalizing HMMs
Y3
Yn
Y1
Y2
S3
Sn
S1
S2
I3
In
I1
I2
  • Inputs I provide context to influence switching,
  • e.g., external forcing variables
  • Model is still a tree -> inference is still linear

125
Generalizing HMMs
Y3
Yn
Y1
Y2
S3
Sn
S1
S2
I3
In
I1
I2
  • Add direct dependence between Ys to better model
    persistence
  • Can merge each St and Yt to construct a
    tree-structured model

126
Mixture Model
θ
Si
yi
i = 1..n
Likelihood(θ) = p(Data | θ ) = Π_i
p(yi | θ ) = Π_i Σ_k p(yi | si =
k , θ ) p(si = k)
127
Learning with Missing Data
  • Guess at some initial parameters θ_0
  • E-step (Inference)
  • For each case, and each unknown variable compute
  • p(S | known data, θ_0 )
  • M-step (Optimization)
  • Maximize L(θ) using p(S | ...)
  • This yields new parameter estimates θ_1
  • This is the EM algorithm
  • Guaranteed to converge to a (local) maximum of
    L(θ)
  • Dempster, Laird, Rubin, 1977
  • (a small EM sketch for a Gaussian mixture follows below)
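A compact sketch of EM for a two-component 1-D Gaussian mixture (the synthetic data and K = 2 are assumptions): the E-step computes membership probabilities p(S_i = k | y_i, θ), and the M-step re-estimates weights, means, and variances from them.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])   # assumed data

K = 2
w = np.full(K, 1.0 / K)                     # mixture weights
mu = rng.choice(y, K)                       # initial guesses theta_0
sd = np.full(K, y.std())

for _ in range(100):
    # E-step: responsibilities r[i, k] = p(S_i = k | y_i, theta)
    dens = np.stack([w[k] * norm.pdf(y, mu[k], sd[k]) for k in range(K)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    Nk = r.sum(axis=0)
    w = Nk / len(y)
    mu = (r * y[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(np.round(w, 2), np.round(mu, 2), np.round(sd, 2))
```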

128
E-Step
q
Si
yi
i1n
129
M-Step
q
Si
yi
i1n
130
E-Step
q
Si
yi
i1n
131
The E (Expectation) Step
Current K components and parameters
n objects
E-step: compute p(object i is in group k)
132
The M (Maximization) Step
New parameters for the K components
n objects
M-step: compute θ, given n objects and memberships
133
Complexity of EM for mixtures
K models
n objects
Complexity per iteration scales as O( n K f(d) )
134
Data from Prof. Christine McLaren, Dept of
Epidemiology, UC Irvine
135
(No Transcript)
136
(No Transcript)
137
(No Transcript)
138
(No Transcript)
139
(No Transcript)
140
(No Transcript)
141
Control Group
Anemia Group
142
(No Transcript)
143
Example of a Log-Likelihood Surface

Mean 2
Log Scale for Sigma 2
144
(No Transcript)
145
HMMs
Y3
YN
Y1
Y2
S3
SN
S1
S2
146
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
147
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
q2
148
E-Step (linear inference)
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
q2
149
M-Step (closed form)
q1
Y3
YN
Y1
Y2
S3
SN
S1
S2
q2
150
Alternatives to EM
  • Method of Moments
  • EM is more efficient
  • Direct optimization
  • e.g., gradient descent, Newton methods
  • EM is usually simpler to implement
  • Sampling (e.g., MCMC)
  • Minimum distance, e.g.,

151
Mixtures as Data Simulators
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    x_i ~ p(x | class_k)
end
(see the sampling sketch below)
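A short sketch of this simulator for a Gaussian mixture; the weights, means, and standard deviations are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [0.5, 0.3, 0.2]                 # assumed p(class_1), ..., p(class_K)
means, sds = [0.0, 4.0, 9.0], [1.0, 0.5, 2.0]

N = 1000
classes = rng.choice(len(weights), size=N, p=weights)              # class_k ~ p(class_1..K)
x = rng.normal(np.array(means)[classes], np.array(sds)[classes])   # x_i ~ p(x | class_k)
print(np.bincount(classes) / N)           # ≈ weights
```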
152
Mixtures with Markov Dependence
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K | class_{i-1})
    x_i ~ p(x | class_k)
end
Current class depends on previous class (Markov
dependence). This is a hidden Markov model
153
Mixtures of Sequences
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    while non-end state:
        x_ij ~ p(x_j | x_{j-1}, class_k)   (Markov sequence model)
end
Produces a variable-length sequence
154
Mixtures of Curves
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    L_i ~ p(L_i | class_k)                       (length of curve)
    for j = 1 to L_i
        y_ij = f(x_j ; class_k) + e_k            (independent variable x; class-dependent curve model)
    end
end
155
Mixtures of Image Models
For i = 1 to N
    class_k ~ p(class_1, class_2, ..., class_K)
    size_i ~ p(size | class_k)                   (global scale)
    for j = 1 to V_i - 1                         (number of vertices)
        intensity_j ~ p(intensity | class_k)     (pixel generation model)
    end
end
156
More generally..
Generative Model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g.,
  sets of sequences, spatial patterns, etc
Note: given p(D_i | c_k), we can define an EM algorithm
157
References
NEW
  • The EM Algorithm and Mixture Models
  • The EM Algorithm and ExtensionsG. McLachlan and
    T. Krishnan. John Wiley and Sons, New York, 1997.
  • Mixture models
  • Statistical Analysis of Finite Mixture
    Distributions,
  • D. M. Titterington, A. F. M. Smith, and U. E.
    Makov, Wiley & Sons, Inc., New York, 1985.
  • Finite Mixture Models
  • G.J. McLachlan and D. Peel, New York Wiley
    (2000)
  • Model-based clustering, discriminant analysis,
    and density estimation, C. Fraley and A. E.
    Raftery, Journal of the American Statistical
    Association, 97:611-631 (2002).

158
References
NEW
  • Hidden Markov Models
  • A tutorial on hidden Markov models and selected
    applications in speech recognition, L. R.
    Rabiner, Proceedings of the IEEE, vol. 77, no.2,
    257-287, 1989.
  • Probabilistic independence networks for hidden
    Markov modelsP. Smyth, D. Heckerman, and M.
    Jordan, Neural Computation , vol.9, no. 2,
    227-269, 1997.
  • Hidden Markov models, A. Moore, online tutorial
    slides,
    http://www.autonlab.org/tutorials/hmm12.pdf

159
Part 5: Case Studies
(i) Simulating and forecasting rainfall data
(ii) Curve clustering with cyclone trajectories
(iii) Topic modeling from text documents
and if time permits...
(iv) Sequence clustering for Web data
(v) Analysis of time-course gene expression data
160
Case Study 1: Simulating and Predicting Rainfall
Patterns
Joint work with Andy Robertson,
International Research Institute for Climate
Prediction, and Sergey Kirshner, Department of
Computer Science, UC Irvine
161
Spatio-Temporal Rainfall Data
Northeast Brazil 1975-2002 90-day time
series 24 years 10 stations
162
(No Transcript)
163
Modeling Goals
  • Downscaling
  • Modeling interannual variability
  • coupling rainfall to large-scale effects like El
    Nino
  • Prediction
  • e.g., hindcasting of missing data
  • Seasonal Forecasts
  • E.g. on Dec 1 produce simulations of likely
    90-day winters

164
HMMs for Rainfall Modeling
Y3
YN
Y1
Y2
S3
SN
S1
S2
I3
IN
I1
I2
S = unobserved weather state; Y = spatial
rainfall pattern (outputs); I = atmospheric
variables (inputs)
165
Learned Weather States
  • States provide an interpretable view of
    spatio-temporal relationships in the data

166
(No Transcript)
167
WeatherStatesfor Kenya
168
(No Transcript)
169
(No Transcript)
170
  • Spatial Chow-Liu Trees
  • Spatial distribution given a state is a tree
    structure
  • (a graphical model)
  • Useful intermediate between full pair-wise model
    and conditional independence
  • Optimal topology learned from data using minimum
    spanning tree algorithm
  • Can use priors based on distance, topography
  • Tree-structure over time also

171
Missing Data
172
Error rate v. fraction of missing data
173
References
NEW
  • Trees and Hidden Markov Models
  • Conditional Chow-Liu tree structures for modeling
    discrete-valued vector time series, S. Kirshner,
    P. Smyth, and A. Robertson, in Proceedings of the
    20th International Conference on Uncertainty in
    AI, 2004.
  • Applications to rainfall modeling
  • Hidden Markov models for modeling daily rainfall
    occurrence over Brazil, A. Robertson, S. Kirshner,
    and P. Smyth, Journal of Climate, November 2005.

174
Summary
  • Simple empirical probabilistic models can be
    very helpful in interpreting large scientific
    data sets
  • e.g., HMM states provide scientists with a basic
    but useful classification of historical spatial
    rainfall patterns
  • Graphical models provide glue to link together
    different information
  • Spatial
  • Temporal
  • Hidden states, etc
  • Generative aspect of probabilistic models can
    be quite useful, e.g., for simulation
  • Missing data is handled naturally in a
    probabilistic framework

175
Case Study 2: Clustering Cyclone Trajectories
Joint work with Suzana Camargo and
Andy Robertson, International Research Institute
for Climate Prediction, and Scott Gaffney, Department
of Computer Science, UC Irvine
176
Storm Trajectories
177
Microarray Gene Expression Data
178
Clustering non-vector data
  • Challenges with the data.
  • May be of different lengths, sizes, etc
  • Not easily representable in vector spaces
  • Distance is not naturally defined a priori
  • Possible approaches
  • convert into a fixed-dimensional vector space
  • Apply standard vector clustering but loses
    information
  • use hierarchical clustering
  • But O(N2) and requires a distance measure
  • probabilistic clustering with mixtures
  • Define a generative mixture model for the data
  • Learn distance and clustering simultaneously

179
Graphical Models for Curves
Data: (y1, t1), ..., (yT, tT)
t
θ
y
T
y = f(t ; θ ), e.g., y =
at^2 + bt + c, θ = {a, b, c}
180
Graphical Models for Curves
t
q
s
y
T
y ~ Gaussian density with mean f(t ; θ ),
variance σ^2
181
Example
y
t
182
Example
f(t ; θ ) <- this is hidden
y
t
183
Graphical Models for Sets of Curves
q
t
s
y
T
N curves
Each curve: P(yi | ti, θ ) = product of
Gaussians
184
Curve-Specific Transformations
Note we can learn function parameters and
shifts simultaneously with EM
q
t
a
s
y
T
N curves
e.g., yi = a t^2 + b t + c + α_i,  θ = {a, b, c,
α_1, ..., α_N}
185
Learning Shapes and Shifts
Data: smoothed growth acceleration data from
teenagers. EM used to learn a spline model +
time-shift for each curve
Data after Learning
Original data
186
Clustering Mixtures of Curves
c
q
t
a
s
y
T
N curves
Each set of trajectory points comes from 1 of K
models. Model for group k is a Gaussian curve
model. Marginal probability for a trajectory =
mixture model
187
The Learning Problem
  • K cluster models
  • Each cluster is a shape model E[Y] = f(X ; θ) with
    its own parameters
  • N observed curves: for each curve we learn
  • P(cluster k | curve data)
  • distribution on alignments, shifts, scaling, etc,
    given data
  • Requires simultaneous learning of
  • Cluster models
  • Curve transformation parameters
  • Results in an EM algorithm where E and M step are
    tractable
188
(No Transcript)
189
(No Transcript)
190
Results on Simulated Data
[Results table for Standard EM, averaged over 50 train/test sets]
191
Clusters of Trajectories
192
Cluster Shapes for Pacific Cyclones
193
TROPICAL CYCLONES Western North Pacific 1983-2002
194
(No Transcript)
195
References on Curve Clustering
NEW
  • Functional Data Analysis
  • J. O. Ramsay and B. W. Silverman, Springer,
    1997.
  • Probabilistic curve-aligned clustering and
    prediction with regression mixture models, S. J.
    Gaffney, PhD Thesis, Department of Computer
    Science, University of California, Irvine, March
    2004.
  • Joint probabilistic curve clustering and
    alignment, S. Gaffney and P. Smyth, Advances in
    Neural Information Processing 17, in press,
    2005.
  • Probabilistic clustering of extratropical
    cyclones using regression mixture models, S.
    Gaffney, A. Robertson, P. Smyth, S. Camargo, M.
    Ghil, preprint, online at www.datalab.uci.edu.

196
Summary
  • Graphical models provide a flexible
    representational language for modeling complex
    scientific data
  • can build complex models from simpler building
    blocks
  • Systematic variability in the data can be handled
    in a principled way
  • Variable length time-series
  • Misalignments in trajectories
  • Generative probabilistic models are interpretable
    and understandable by scientists

197
Case Study 3: Topic Modeling from Text Documents
Joint work with Mark Steyvers, Dave
Newman, Chaitanya Chemudugunta, UC
Irvine; Michal Rosen-Zvi, Hebrew University,
Jerusalem; Tom Griffiths, Brown University
198
Enron email data
250,000 emails 5000 authors 1999-2002
199
Questions of Interest
  • What topics do these documents span?
  • Which documents are about a particular topic?
  • How have topics changed over time?
  • What does author X write about?
  • Who is likely to write about topic Y?
  • Who wrote this specific document?
  • and so on..

200
Graphical Model for Clustering
Cluster for document
z
f
Cluster-Word distributions
w
Word
n
D
201
Graphical Model for Topics
Document-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
202
Topic = probability distribution over words
203
Key Features of Topic Models
NEW
  • Generative model for documents in form of bags of
    words
  • Allows a document to be composed of multiple
    topics
  • Much more powerful than 1 doc -> 1 cluster
  • Completely unsupervised
  • Topics learned directly from data
  • Leverages strong dependencies at word level AND
    large data sets
  • Learning algorithm
  • Gibbs sampling is the method of choice (see the sketch below)
  • Scalable
  • Linear in number of word tokens
  • Can be run on millions of documents
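A minimal sketch of collapsed Gibbs sampling for this kind of topic model, written for illustration rather than as the implementation used in the talk; the corpus format, the symmetric priors alpha and beta, and the number of topics T are all assumptions. Each sweep touches every word token once, which is where the linear scaling in the number of tokens comes from.

```python
import numpy as np

def lda_gibbs(docs, V, T, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of lists of word ids in 0..V-1; returns topic-word and doc-topic counts."""
    rng = np.random.default_rng(seed)
    ndt = np.zeros((len(docs), T))        # doc-topic counts
    ntw = np.zeros((T, V))                # topic-word counts
    nt  = np.zeros(T)                     # tokens per topic
    z = [rng.integers(T, size=len(d)) for d in docs]   # random initial assignments
    for d, doc in enumerate(docs):        # initialize the count tables
        for i, w in enumerate(doc):
            k = z[d][i]; ndt[d, k] += 1; ntw[k, w] += 1; nt[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]               # remove the current assignment
                ndt[d, k] -= 1; ntw[k, w] -= 1; nt[k] -= 1
                # p(z = k | all other assignments) ∝ (n_dk + alpha)(n_kw + beta)/(n_k + V beta)
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
                k = rng.choice(T, p=p / p.sum())
                z[d][i] = k               # add back with the newly sampled topic
                ndt[d, k] += 1; ntw[k, w] += 1; nt[k] += 1
    return ntw, ndt
```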

204
Topics vs. Other Approaches
  • Clustering documents
  • Computationally simpler
  • But a less accurate and less flexible model
  • LSI/LSA
  • Projects words into a K-dimensional hidden space
  • Less interpretable
  • Not generalizable
  • E.g., authors or other side-information
  • Not as accurate
  • E.g., precision-recall Hoffman, Blei et al,
    Buntine, etc
  • Topic Models (aka LDA model)
  • next-generation text modeling, after LSI
  • More flexible and more accurate (in prediction)
  • Linear time complexity in fitting the model

205
Examples of Topics learned from Proceedings of
the National Academy of Sciences
Griffiths and Steyvers, 2004
NEW
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
206
What can Topic Models be used for?
  • Queries
  • Who writes on this topic?
  • e.g., finding experts or reviewers in a
    particular area
  • What topics does this person do research on?
  • Comparing groups of authors or documents
  • Discovering trends over time
  • Detecting unusual papers and authors
  • Interactive browsing of a digital library via
    topics
  • Parsing documents (and parts of documents) by
    topic
  • and more..

207
What is this paper about?
NEW
  • Empirical Bayes screening for multi-item
    associations
  • Bill DuMouchel and Daryl Pregibon, ACM SIGKDD
    2001
  • Most likely topics according to the model are
  • data, mining, discovery, association, attribute..
  • set, subset, maximal, minimal, complete,
  • measurements, correlation, statistical,
    variation,
  • Bayesian, model, prior, data, mixture,..

208
(No Transcript)
209
(No Transcript)
210
Pennsylvania Gazette
NEW
(courtesy of David Newman Sharon Block, UC
Irvine)
1728-1800 80,000 articles
211
Historical Trends in Pennsylvania Gazette
NEW
(courtesy of David Newman Sharon Block, UC
Irvine)
STATE GOVERNMENT CONSTITUTION LAW UNITED POWER CITIZEN PEOPLE PUBLIC CONGRESS
SILK COTTON DITTO WHITE BLACK LINEN CLOTH WOMEN BLUE WORSTED
212
Enron email data
250,000 emails 5000 authors 1999-2002
213
Enron email topics
214
Non-work Topics
215
Topical Topics
216
Using Topic Models for Information Retrieval
UPDATED
217
Author-Topic Models
  • The author-topic model
  • a probabilistic model linking authors and topics
  • authors -> topics -> words
  • Topic = distribution over words
  • Author = distribution over topics
  • Document generated from a mixture of author
    distributions
  • Learns about entities based on associated text
  • Can be generalized
  • Replace author with any categorical doc
    information
  • e.g., publication type, source, year, country of
    origin, etc

218
Author-Topic Graphical Model
a
x
Author
Author-Topic distributions
q
z
Topic
f
Topic-Word distributions
w
Word
n
D
219
Learning Author-Topic Models from Text
  • Full probabilistic model
  • Power of statistical learning can be leveraged
  • Learning algorithm is linear in number of word
    occurrences
  • Scalable to very large data sets
  • Completely automated (no tweaking required)
  • completely unsupervised, no labels
  • Query answering
  • A wide variety of queries can be answered
  • Which authors write on topic X?
  • What are the spatial patterns in usage of topic
    Y?
  • How have authors A, B and C changed over time?
  • Queries answered using probabilistic inference
  • Query time is real-time (learning is offline)

220
Author-Topic Models for CiteSeer
221
Author-Profiles
  • Author Andrew McCallum, U Mass
  • Topic 1 classification, training,
    generalization, decision, data,
  • Topic 2 learning, machine, examples,
    reinforcement, inductive,..
  • Topic 3 retrieval, text, document, information,
    content,
  • Author Hector Garcia-Molina, Stanford
  • - Topic 1 query, index, data, join, processing,
    aggregate.
  • - Topic 2 transaction, concurrency, copy,
    permission, distributed.
  • - Topic 3 source, separation, paper,
    heterogeneous, merging..
  • Author Jerry Friedman, Stanford
  • Topic 1 regression, estimate, variance, data,
    series,
  • Topic 2 classification, training, accuracy,
    decision, data,
  • Topic 3 distance, metric, similarity, measure,
    nearest,

222
(No Transcript)
223
PubMed-Query Topics
224
PubMed-Query Topics
225
PubMed Topics by Country
226
PubMed-Query Topics by Country
227
Extended Models
  • Conditioning on non-authors
  • side-information other than authors
  • e.g., date, publication venue, country, etc
  • can use citations as authors
  • Fictitious authors and common author
  • Allow 1 unique fictitious author per document
  • Captures document specific effects
  • Assign 1 common fictitious author to each
    document
  • Captures broad topics that are used in many
    documents
  • Semantics and syntax model
  • Semantic topics topics that are specific to
    certain documents
  • Syntactic topics broad, across many documents
  • Probabilistic model that learns each type
    automatically

228
Scientific syntax and semantics (Griffiths et
al., NIPS 2004; slides courtesy of Mark Steyvers
and Tom Griffiths, PNAS Symposium presentation,
2003)
Factorization of language based on statistical
dependency patterns: long-range, document-
specific dependencies vs. short-range
dependencies constant across all documents
semantics = probabilistic topics
q
z
z
z
w
w
w
x
x
x
syntax = probabilistic regular grammar
229
x 2
OF 0.6 FOR 0.3 BETWEEN 0.1
x 1
0.8
z 1 0.4
z 2 0.6
HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY
0.2
SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK
0.2 RESEARCH 0.2 MATHEMATICS 0.2
0.7
0.1
0.3
0.2
x 3
THE 0.6 A 0.3 MANY 0.1
0.9
230
x 2
OF 0.6 FOR 0.3 BETWEEN 0.1
x 1
0.8
z 1 0.4
z 2 0.6
HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY
0.2
SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK
0.2 RESEARCH 0.2 MATHEMATICS 0.2
0.7
0.1
0.3
0.2
x 3
THE 0.6 A 0.3 MANY 0.1
0.9
THE
231
x 2
OF 0.6 FOR 0.3 BETWEEN 0.1
x 1
0.8
z 1 0.4
z 2 0.6
HEART 0.2 LOVE 0.2 SOUL 0.2 TEARS 0.2 JOY
0.2
SCIENTIFIC 0.2 KNOWLEDGE 0.2 WORK
0.2 RESEARCH 0.2 MATHEMATICS 0.2
0.7
0.1
0.3
0.2
x 3
THE 0.6 A 0.3 MANY 0.1
0.9
THE LOVE