Title: Probabilistic Model of Sequences
1. Probabilistic Model of Sequences
- Ata Kaban
- The University of Birmingham
2. Sequence
- Example 1: a b a c a b a b a c
- Example 2: 1 0 0 1 1 0 1 0 0 1
- Example 3: 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3
- Roll a six-sided die N times. You get a sequence.
- Roll it again; you get another sequence.
- Here is a sequence of characters; can you see it?
- What is a sequence?
- Alphabet 1: {a, b, c}; Alphabet 2: {0, 1}; Alphabet 3: {1, 2, 3, 4, 5, 6}
3. Probabilistic Model
- Model: a system that simulates the sequence under consideration.
- Probabilistic model: a model that produces different outcomes with different probabilities; it includes uncertainty.
- It can therefore simulate a whole class of sequences, and it assigns a probability to each individual sequence.
- Could you simulate any of the sequences on the previous slide?
4. Random sequence model
- Back to the die example (the die can possibly be loaded).
- A model of a roll has 6 parameters: p1, p2, p3, p4, p5, p6.
- Here, p_i is the probability of throwing i.
- To be probabilities, these must be non-negative and must sum to one.
- What is the probability of the sequence 1, 6, 3? It is p1 · p6 · p3.
- NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model (a small code sketch follows below).
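A minimal sketch of this computation in Python; the fair-die probabilities below are illustrative, and any non-negative values summing to one would do:

```python
# Minimal sketch of the random (i.i.d.) sequence model for a six-sided die.
probs = {i: 1/6 for i in range(1, 7)}  # illustrative: a fair die; must sum to one

def sequence_probability(sequence, probs):
    """P(sequence) = product of per-symbol probabilities (symbols are independent)."""
    p = 1.0
    for symbol in sequence:
        p *= probs[symbol]
    return p

print(sequence_probability([1, 6, 3], probs))  # p1 * p6 * p3 = (1/6)**3 ~ 0.00463
```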
5. Maximum Likelihood parameter estimation
- The parameters of a probabilistic model are typically estimated from a large set of trusted examples, called a training set.
- Example (t = tail, h = head): t t t h t h h t
- Count up the frequencies: t: 5, h: 3.
- Compute probabilities: p(t) = 5/(5+3), p(h) = 3/(5+3).
- These are the Maximum Likelihood (ML) estimates of the parameters of the coin (sketched in code below).
- Does it make sense?
- What if you know the coin is fair?
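The counting procedure in code, using the training sequence from the slide:

```python
from collections import Counter

# ML estimation for the coin: relative frequencies in the training set.
training = ['t', 't', 't', 'h', 't', 'h', 'h', 't']

counts = Counter(training)                      # t -> 5, h -> 3
total = sum(counts.values())                    # 8
ml_estimates = {sym: n / total for sym, n in counts.items()}

print(ml_estimates)  # {'t': 0.625, 'h': 0.375}, i.e. 5/(5+3) and 3/(5+3)
```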
6. Overfitting
- A fair coin has probabilities p(t) = 0.5, p(h) = 0.5.
- If you throw it 3 times and get t, t, t, then the ML estimates for this sequence are p(t) = 1, p(h) = 0.
- Consequently, from these estimates, the probability of e.g. the sequence h, t, h, t is 0.
- This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning!
- Solution 1: get more data.
- Solution 2: build what you already know into the model. (We will return to this during the module; one simple version is sketched below.)
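One standard way to build prior knowledge into the estimate, shown here as an illustration rather than as the method the module returns to, is to add pseudocounts, so that no symbol is ever estimated as impossible:

```python
from collections import Counter

# Add-one (Laplace) smoothing: pretend each symbol was seen `pseudocount`
# extra times, so no probability estimate is ever exactly 0.
def smoothed_estimates(sequence, alphabet, pseudocount=1):
    counts = Counter(sequence)
    total = len(sequence) + pseudocount * len(alphabet)
    return {sym: (counts[sym] + pseudocount) / total for sym in alphabet}

print(smoothed_estimates(['t', 't', 't'], alphabet=['t', 'h']))
# {'t': 0.8, 'h': 0.2} -- h is rare but no longer impossible
```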
7. Why is it called Maximum Likelihood?
- It can be shown that using the frequencies to compute probabilities maximises the total probability of all the sequences given the model (the likelihood), P(Data | parameters).
8. Probabilities
- Take two dice, D1 and D2.
- The probability of rolling i given die D1 is written P(i | D1). This is a conditional probability.
- Pick a die at random with probability P(Dj), j = 1 or 2. The probability of picking die Dj and rolling i is called a joint probability: P(i, Dj) = P(Dj) P(i | Dj).
- For any events X and Y, P(X, Y) = P(X | Y) P(Y).
- If we know P(X, Y), then the so-called marginal probability P(X) can be computed as P(X) = Σ_Y P(X, Y), summing the joint over all values of Y (a small sketch follows below).
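A small numeric sketch; the die probabilities below are made up for illustration (D1 fair, D2 loaded towards 6, each die picked with probability 0.5):

```python
# Joint and marginal probabilities with two dice (illustrative numbers).
p_die = {'D1': 0.5, 'D2': 0.5}                               # P(Dj)
p_roll_given_die = {
    'D1': {i: 1/6 for i in range(1, 7)},                     # P(i | D1)
    'D2': {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},  # P(i | D2)
}

def joint(i, d):
    # P(i, Dj) = P(i | Dj) * P(Dj)
    return p_roll_given_die[d][i] * p_die[d]

def marginal(i):
    # P(i) = sum over dice of P(i, Dj)
    return sum(joint(i, d) for d in p_die)

print(joint(6, 'D2'))   # 0.5 * 0.5 = 0.25
print(marginal(6))      # 1/6 * 0.5 + 0.5 * 0.5 ~ 0.3333
```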
9. Now, we show that maximising P(Data | parameters) for the random sequence model leads to the frequency-based computation that we did intuitively. (A reconstruction of the derivation follows below.)
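The derivation itself did not survive the extraction; a standard reconstruction for the two-symbol (coin) case, consistent with the estimates on slide 5, is:

\[
P(\mathrm{Data} \mid p) = p^{n_t}(1-p)^{n_h}, \qquad p = p(t),\; 1-p = p(h),
\]

where $n_t$ and $n_h$ count the tails and heads in the training set. Setting the derivative of the log-likelihood to zero:

\[
\frac{d}{dp}\big[\, n_t \log p + n_h \log(1-p) \,\big] = \frac{n_t}{p} - \frac{n_h}{1-p} = 0
\quad\Rightarrow\quad \hat{p} = \frac{n_t}{n_t + n_h},
\]

which is exactly the frequency-based estimate $5/(5+3)$ from the coin example. (The general multinomial case adds a Lagrange multiplier for the sum-to-one constraint and gives $\hat{p}_i = n_i / N$.)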
10. Why did we bother? Because in more complicated models we cannot guess the result.
11. Markov Chains
- Further examples of sequences:
- Bio-sequences
- Web page request sequences while browsing
- These are no longer random sequences; they have a time structure.
- How many parameters would such a model have?
- We need to make simplifying assumptions to end up with a reasonable number of parameters.
- The first-order Markov assumption: an observation depends only on the immediately previous one, not on any longer history.
- Markov Chain: a sequence model which makes the Markov assumption.
12. Markov Chains
- The probability of a Markov sequence x1, x2, ..., xN factorises as P(x1) P(x2 | x1) P(x3 | x2) ... P(xN | xN-1) (see the sketch below).
- The alphabet's symbols are also called states.
- Once the parameters are estimated from training data, the Markov chain can be used for prediction.
- Amongst others, Markov Chains are successful for web browsing behaviour prediction.
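A minimal sketch of this computation; the initial and transition probabilities below are illustrative, not from the slides:

```python
# Probability of a sequence under a first-order Markov chain:
# P(x1, ..., xN) = P(x1) * product over t of P(x_t | x_{t-1}).
initial = {'a': 0.5, 'b': 0.5}                   # P(x1), illustrative
transition = {'a': {'a': 0.9, 'b': 0.1},         # P(next | current), illustrative
              'b': {'a': 0.4, 'b': 0.6}}

def markov_probability(sequence, initial, transition):
    p = initial[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        p *= transition[prev][curr]
    return p

print(markov_probability(['a', 'a', 'b'], initial, transition))  # 0.5*0.9*0.1
```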
13. Markov Chains
- A Markov Chain is stationary if it has the same transition probabilities at any time.
- We assume stationary models here.
- Then the parameters of the model consist of the transition probability matrix and the initial state probabilities.
14. ML parameter estimation
- We can derive how to compute the parameters of a Markov Chain from data using Maximum Likelihood, as we did for random sequences.
- The ML estimate of the transition matrix will again be very intuitive (see the sketch below).
- Remember that P(X, Y) = P(X | Y) P(Y).
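A minimal sketch of the ML estimate, on a made-up training sequence: count each observed transition, then normalise each row.

```python
from collections import defaultdict

# ML estimate of a Markov chain's transition matrix: count each observed
# transition, then normalise over all transitions out of the same state.
def estimate_transitions(sequence):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1
    return {state: {s: n / sum(row.values()) for s, n in row.items()}
            for state, row in counts.items()}

print(estimate_transitions(list('aababbaab')))  # illustrative sequence
# 'a' is followed by 'a' twice and by 'b' three times -> P(a|a)=0.4, P(b|a)=0.6
```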
15. Simple example
- If it is raining today, it will rain tomorrow with probability 0.8 ⇒ the contrary has probability 0.2.
- If it is not raining today, it will rain tomorrow with probability 0.6 ⇒ the contrary has probability 0.4.
- Build the transition matrix (a sketch follows below).
- Be careful which numbers need to sum to one and which don't. Such a matrix is called a stochastic matrix.
- Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for the day after tomorrow? (Homework)
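A sketch of the matrix and of k-step-ahead prediction; the one-step numbers are already on the slide, and the two-step case is left for the homework:

```python
import numpy as np

# Transition matrix for the rain example; row = today's state,
# column = tomorrow's state, ordered (rain, no rain). Each ROW sums to one.
P = np.array([[0.8, 0.2],    # raining today
              [0.6, 0.4]])   # not raining today

assert np.allclose(P.sum(axis=1), 1.0)  # stochastic-matrix check

def predict(current_distribution, P, k=1):
    # k-step-ahead prediction: multiply the state distribution by P^k.
    return current_distribution @ np.linalg.matrix_power(P, k)

today = np.array([1.0, 0.0])     # it is raining today
print(predict(today, P, k=1))    # distribution for tomorrow: [0.8, 0.2]
```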
16. Examples of Web Applications
- HTTP request prediction: predict the probabilities of the next requests from the same user, based on the history of requests from that client.
- Adaptive Web navigation: build a navigation agent which suggests which other links would be of interest to the user, based on the statistics of previous visits. The predicted link does not strictly have to be a link present in the Web page currently being viewed.
- Tour generation: takes the starting URL as input and generates a sequence of states (or URLs) using the Markov chain process (a sketch follows below).
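A minimal sketch of tour generation; the page URLs and transition probabilities below are made up, none of them come from the slides:

```python
import random

# Tour generation: a random walk through the Markov chain from a start state.
transition = {
    '/home':    {'/courses': 0.7, '/contact': 0.3},
    '/courses': {'/home': 0.2, '/courses': 0.5, '/contact': 0.3},
    '/contact': {'/home': 1.0},
}

def generate_tour(start, transition, length=5, seed=0):
    random.seed(seed)
    tour = [start]
    for _ in range(length - 1):
        nexts = transition[tour[-1]]
        # sample the next state according to the transition probabilities
        tour.append(random.choices(list(nexts), weights=list(nexts.values()))[0])
    return tour

print(generate_tour('/home', transition))
```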
17. Building Markov Models from Web Log Files
- A Web log file is a collection of records of user requests for documents on a Web site; an example record is shown below.
- The transition matrix can be seen as a graph.
- Link pair: (r - referrer, u - requested page, w - hyperlink weight).
- Link graph: this is also called the state diagram of the Markov Chain.
- It is a directed weighted graph.
- It forms a hierarchy from the homepage down to multiple levels.
177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
18. Link Graph: an example (University of Ulster site)
Zhu et al. 2002
State diagram: nodes = states; weighted arrows = number of transitions.
19. Experimental Results (Sarukkai, 2000)
- Simulations
- "Correct link" refers to the actual link chosen at the next step.
- The depth of the correct link is measured by counting the number of links which have a probability greater than or equal to that of the correct link.
- Over 70% of correct links are in the top 20 scoring states.
- Difficulties: very large state space.
20. Simple exercise
- Build the Markov transition matrix of the following sequence:
- a b b a c a b c b b d e e d e d e d
- State space: {a, b, c, d, e}.
21. Further topics
- Hidden Markov Model
- Does not make the Markov assumption on the observed sequence.
- Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this other sequence is assumed to be Markovian.
- More powerful.
- Estimation is more complicated.
- Aggregate Markov model
- Useful for clustering sub-graphs of a transition graph.
22. HMM at an intuitive level
- Suppose that we know all the parameters of the following HMM, as shown on the state diagram below (the diagram is not reproduced in this text). What is the probability of observing the sequence A, B if the initial state is S1? The same question if the initial state is chosen randomly with equal probabilities.
- ANSWER: If the initial state is S1: 0.2 × (0.4 × 0.8 + 0.6 × 0.7) = 0.148. In the second case: 0.5 × 0.148 + 0.5 × 0.3 × (0.3 × 0.7 + 0.7 × 0.8) = 0.1895. (A code sketch reproducing these numbers follows below.)
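Since the state diagram did not survive the extraction, the parameters below are reconstructed from the worked answer; they reproduce both numbers, but treat them as an assumption: transitions P(S1→S1) = 0.6, P(S1→S2) = 0.4, P(S2→S1) = 0.3, P(S2→S2) = 0.7; emissions P(A|S1) = 0.2, P(B|S1) = 0.7, P(A|S2) = 0.3, P(B|S2) = 0.8. A minimal forward-style computation:

```python
# Forward-style computation of P(A, B) for the two-state HMM.
# All parameter values are reconstructed from the slide's worked answer,
# since the state diagram is missing from the extracted text.
transition = {'S1': {'S1': 0.6, 'S2': 0.4},
              'S2': {'S1': 0.3, 'S2': 0.7}}
emission = {'S1': {'A': 0.2, 'B': 0.7},
            'S2': {'A': 0.3, 'B': 0.8}}

def observation_probability(obs, initial):
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: p * emission[s][obs[0]] for s, p in initial.items()}
    for symbol in obs[1:]:
        alpha = {s: sum(alpha[r] * transition[r][s] for r in alpha)
                    * emission[s][symbol] for s in alpha}
    return sum(alpha.values())

print(observation_probability(['A', 'B'], {'S1': 1.0, 'S2': 0.0}))  # 0.148
print(observation_probability(['A', 'B'], {'S1': 0.5, 'S2': 0.5}))  # 0.1895
```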
23. Conclusions
- Probabilistic Model
- Maximum Likelihood parameter estimation
- Random sequence model
- Markov chain model
- ---------------------------------
- Hidden Markov Model
- Aggregate Markov Model
24. Any questions?