1
Probabilistic Suffix Trees
  • CMPUT 606

Maria Cutumisu
October 13, 2004
2
Goal
  • Provide efficient prediction for protein families
  • Probabilistic Suffix Trees (PSTs) are variable-length
    Markov models (VMMs)

3
Conceptual Map
4
Background
  • PSTs were introduced by Ron, Singer, Tishby
  • Bejerano, Yona made further improvements (bPST)
  • Poulin introduced efficient PSTs (ePSTs)
  • PSTs are also known as prediction suffix trees

5
Higher Order Markov Models
  • A k-order Markov chain uses a history of length k
    for its conditional probabilities
  • Exponential storage requirements (see the sketch
    below)
  • As the order of the chain increases, the amount of
    training data must increase to maintain estimation
    accuracy
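
To make the exponential blow-up concrete, here is a minimal Python
sketch (assuming the 20-letter amino-acid alphabet; the numbers are
illustrative only):

    # Free parameters of a k-order Markov chain over an alphabet of
    # size |A|: one distribution over |A| symbols for each of the
    # |A|^k histories, i.e. |A|^k * (|A| - 1) independent values.
    ALPHABET_SIZE = 20  # amino acids

    for k in range(1, 6):
        n_params = ALPHABET_SIZE ** k * (ALPHABET_SIZE - 1)
        print(f"order {k}: {n_params:,} free parameters")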

6
Variable Length Markov Models (VMMs)
  • Space- and parameter-estimation-efficient
  • Variable-length histories are used for prediction
  • Only the needed parameters are stored
  • Can be created from less training data

[Figure: a set of training sequences and a test query
">T1 AHGSGYMNAB" (is test sequence T1 in the training set?)]
7
VMMs
  • P(sequence) is the product of the probabilities of
    each amino acid given those that precede it (see
    the sketch below)
  • The conditional probability is based on the context
    of each amino acid
  • A context function k() can select the history
    length based on the context x_1 ... x_{i-1} of x_i
  • VMMs were first introduced as PSTs
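
A minimal sketch of this factorization, assuming a toy model table
mapping (context, symbol) pairs to probabilities (the names model and
max_context are illustrative, not from the slides):

    def sequence_probability(seq, model, max_context):
        """P(seq) as a product of P(x_i | context), where the context
        is the longest stored suffix of x_1 .. x_{i-1}."""
        p = 1.0
        for i, symbol in enumerate(seq):
            for start in range(max(0, i - max_context), i + 1):
                context = seq[start:i]  # longest candidate history first
                if (context, symbol) in model:
                    p *= model[(context, symbol)]
                    break  # the context function chose this history length
        return p

    # Toy model; root entries ("", a) give every symbol a fallback.
    model = {("", "A"): 0.5, ("", "B"): 0.5, ("A", "B"): 0.9}
    print(sequence_probability("AB", model, max_context=3))  # 0.5 * 0.9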

8
PSTs
  • VMMs for efficient prediction
  • Pruned during training to contain only required
    parameters
  • bPST represents histories
  • ePST represents sequences

9
bPST
  • Used to represent the histories for prediction
    instead of the training sequences
  • The possible histories are the reversed strings
    of all the substrings of the training sequences

10
Prediction with bPSTs
  • The conditional probabilities P(x_i | x_1 ... x_{i-1})
    are obtained for each position by tracing a path from
    the root that matches the preceding residues

11
Construction bPST
  • We add histories from the training data
  • Nodes store parameters that estimate the conditional
    probabilities (see the lookup sketch below):
  • γ_history(a) = P(a | history)
  • P_bPST(x_i | x_{i-1}, ..., x_1) = γ_{x_1 ... x_{i-1}}(x_i)
    if x_1 ... x_{i-1} is in the bPST
  • else γ_{x_2 ... x_{i-1}}(x_i) if x_2 ... x_{i-1} is in
    the bPST, etc.
  • else γ(x_i) (the root, i.e. the empty history)
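
A minimal lookup sketch of the fallback rule above, assuming the bPST
is stored as a dict from each retained history string to its
next-symbol distribution (a hypothetical representation, not the
paper's own data structure):

    def predict(bpst, history, symbol):
        """Return P(symbol | history) from the longest stored history."""
        for start in range(len(history) + 1):
            context = history[start:]  # drop the oldest symbols first
            if context in bpst:
                return bpst[context][symbol]  # gamma_context(symbol)
        raise KeyError("the bPST must contain the empty history (root)")

    # Toy bPST over {0, 1}; "" is the root distribution.
    bpst = {"": {"0": 13 / 27, "1": 14 / 27},
            "0": {"0": 5 / 13, "1": 8 / 13}}
    print(predict(bpst, "010", "1"))  # falls back from "010" to "0": 8/13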

12
[Figure: bPST created and pruned using the training string
010010010011110101100010111 (Brett Poulin)]

P(01001) = P(0) · P(1|0) · P(0|01) · P(0|010) · P(1|0100)
         = γ(0) · γ_0(1) · γ_01(0) · γ_0(0) · γ_00(1)
         = (13/27)(8/13)(5/8)(5/13)(4/5) = 10400/182520
         ≈ 0.057
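
A quick exact-arithmetic check of the product above:

    from fractions import Fraction
    from math import prod

    # gamma(0), gamma_0(1), gamma_01(0), gamma_0(0), gamma_00(1)
    factors = [Fraction(13, 27), Fraction(8, 13), Fraction(5, 8),
               Fraction(5, 13), Fraction(4, 5)]
    p = prod(factors)
    print(p, float(p))  # 20/351 (= 10400/182520) ~ 0.057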
13
Complexity bPST
  • bPST building process requires O(Ln2) time
  • L is the length limit of the tree
  • n is the total length of the training set.
  • bPST building requires all training sequences at
    once (in order to get all the reverse substrings)
    and cannot be done online (the bPST cannot be
    built as the training data is encountered)
  • Prediction O(mL), m sequence length

14
Improved bPST
  • Idea tree with training sequences
  • n length of all training sequences
  • m length of tested sequence
  • Result (theoretical)
  • linear time building O(n)
  • linear time prediction O(m).

15
Efficient PST (ePST)
  • Used for predicting protein function
  • ePST represents sequences
  • Linear construction and prediction

16
[Figure: example ePST (Brett Poulin)]
17
Prediction with ePSTs
  • The probabilities for a substring are obtained
    for each position by tracing the path
    representing the sequence from the root
  • If the entire sequence is not found in the tree,
    suffix links are followed (see the sketch below)
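
A hypothetical sketch of this walk; the node layout (children,
suffix_link, prob) is illustrative and only approximates the ePST of
Poulin's thesis:

    class Node:
        def __init__(self):
            self.children = {}       # symbol -> child Node
            self.suffix_link = None  # node for the string minus its first symbol
            self.prob = {}           # symbol -> conditional probability here

    def score(root, seq, floor=1e-9):
        """Trace seq from the root; on a mismatch, follow suffix links
        so the matched context shortens instead of restarting."""
        node, p = root, 1.0
        for symbol in seq:
            while symbol not in node.children and node is not root:
                node = node.suffix_link  # shorten the context
            p *= node.prob.get(symbol, floor)  # floor guards unseen symbols
            if symbol in node.children:
                node = node.children[symbol]
        return p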

18
Construction ePST
  • ePSTs gain efficiency by representing the
    training sequences in the PST
  • Nodes store counts of the subsequence occurrences
    in the training data (with respect to the
    complete tree)
  • Conditional probabilities derived from the counts
    are stored as well (see the sketch below)
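
A minimal sketch of deriving a conditional probability from stored
counts, using the AYYYA sequence from the next slide (the counts table
is hand-filled for illustration):

    def conditional(counts, context, symbol):
        """P(symbol | context) = count(context + symbol) / count(context)."""
        return counts.get(context + symbol, 0) / counts[context]

    # Occurrence counts of the substrings of AYYYA (overlaps included).
    counts = {"A": 2, "Y": 3, "AY": 1, "YY": 2, "YA": 1}
    print(conditional(counts, "A", "Y"))  # P(Y | A) = 1/2
    print(conditional(counts, "Y", "Y"))  # P(Y | Y) = 2/3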

19
[Figure: example ePST built from the training sequence AYYYA (Brett Poulin)]
20
Complexity ePST
  • Linear time and space with respect to the combined
    length of the training sequences: O(n)
  • Linear prediction time: O(m)

21
Advantages and Disadvantages
  • Avoids the exponential space requirements and
    parameter-estimation problems of higher-order
    Markov chains
  • Pruned during training to contain only the required
    parameters
  • bPSTs used for local predictions give more accurate
    predictions than global ones
  • Some loss in classification performance (Pfam, SCOP)

22
Conclusions
  • PSTs require less training and prediction time
    than HMMs
  • Despite some loss in classification performance,
    PSTs compete with HMMs due to their reduced
    resource demands
  • As VMMs, PSTs take advantage of higher-order
    correlations

23
References
  • [1] B. Poulin, Sequence-based Protein Function
    Prediction, Master's Thesis, University of Alberta,
    2004
  • [2] G. Bejerano, G. Yona, Modeling protein families
    using probabilistic suffix trees, RECOMB '99
  • [3] G. Bejerano, Algorithms for variable length Markov
    chain modeling, Bioinformatics Applications Note,
    20(5):788-789, 2004

24
PSTs and HMMs
  • HMMs do not capture any higher-order
    correlations. An HMM assumes that the identity of
    a particular position is independent of the
    identity of all other positions. [1]
  • PSTs are variable-length Markov models for
    efficient prediction. The prediction uses the
    longest available context matching the history of
    the current amino acid.
  • For protein prediction in general, the main
    advantage of PSTs over HMMs is that the training
    and prediction time requirements of PSTs are much
    less than for the equivalent HMMs. [1]

25
Suffix Trees (ST)
[Figure: suffix tree example (Brett Poulin)]
26
bPST
  • Histories added to the tree must occur more
    frequently than a threshold P_min
  • The substrings are added in order of length, from
    smallest to largest (see the sketch below)
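
A hypothetical construction loop illustrating both rules; p_min and
the shortest-first scan follow the slide, while the names and the
naive counting are illustrative only:

    def frequent_substrings(training, max_len, p_min):
        """Collect substrings, shortest first, whose empirical
        probability meets the P_min threshold; these are candidate
        bPST histories (the real bPST stores them reversed)."""
        kept = {""}  # the root (empty history) is always present
        n = len(training)
        for length in range(1, max_len + 1):
            for i in range(n - length + 1):
                s = training[i:i + length]
                if s in kept:
                    continue
                # Overlapping occurrence count; this naive scan is
                # where the O(Ln^2) build cost noted earlier comes from.
                occ = sum(training.startswith(s, j)
                          for j in range(n - length + 1))
                if occ / (n - length + 1) >= p_min:
                    kept.add(s)
        return kept

    print(sorted(frequent_substrings("010010010011110101100010111",
                                     3, 0.15)))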

27
bPST vs ST
  • The string s is only added to the tree if the
    resulting conditional probability at the node to
    be created will be greater than the minimum
    prediction probability ?min a and the
    probability for the prefix of the string is
    different (with some ratio r) from the
    probability assigned to the next shortest
    substring suf(s) (which is already in the tree).
    After all the substrings are added to the tree,
    the probabilities are smoothed according to the
    parameter ?min.
  • The smoothing (as calculated by the equation
    below) prevents any probability from being less
    than ?min
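
A minimal sketch of that smoothing, assuming dist covers the whole
alphabet so len(dist) equals |Σ|:

    def smooth(dist, gamma_min):
        """Shrink each P(symbol | context) toward gamma_min so no
        smoothed probability falls below it; the result still sums to 1."""
        sigma = len(dist)  # alphabet size, assuming full coverage
        return {a: (1 - sigma * gamma_min) * p + gamma_min
                for a, p in dist.items()}

    print(smooth({"0": 5 / 8, "1": 3 / 8}, gamma_min=0.01))
    # {'0': 0.6225, '1': 0.3775} -- both >= 0.01, total still 1.0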
