1
Probabilistic Suffix Trees
  • CMPUT 606

Maria Cutumisu
October 13, 2004
2
Goal
  • Provide efficient prediction for protein families
  • Probabilistic Suffix Trees (PSTs) are variable-length
    Markov models (VMMs)

3
Conceptual Map
4
Background
  • PSTs were introduced by Ron, Singer, Tishby
  • Bejerano, Yona made further improvements (bPST)
  • Poulin introduced efficient PSTs (ePSTs)
  • PSTs are also known as prediction suffix trees

5
Higher Order Markov Models
  • A k-order Markov chain uses a history of length k
    for its conditional probabilities
  • Exponential storage requirements (see the sketch
    below)
  • As the order of the chain increases, the amount of
    training data must increase to maintain estimation
    accuracy
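
To make the exponential blow-up concrete, here is a minimal Python
sketch (assuming the 20-letter amino-acid alphabet; the numbers are
illustrative only):

    # Free parameters of a k-order Markov chain over an alphabet of
    # size |A|: one distribution over |A| symbols for each of the
    # |A|^k histories, i.e. |A|^k * (|A| - 1) independent values.
    ALPHABET_SIZE = 20  # amino acids

    for k in range(1, 6):
        n_params = ALPHABET_SIZE ** k * (ALPHABET_SIZE - 1)
        print(f"order {k}: {n_params:,} free parameters")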

6
Variable Length Markov Models (VMMs)
  • Space- and parameter-estimation-efficient
  • Variable-length histories are used for prediction
  • Only the needed parameters are stored
  • Can be created from less training data

[Figure: a set of training sequences and a test query
">T1 AHGSGYMNAB" (is test sequence T1 in the training set?)]
7
VMMs
  • P(sequence) is the product of the probabilities of
    each amino acid given those that precede it (see
    the sketch below)
  • The conditional probability is based on the context
    of each amino acid
  • A context function k() can select the history
    length based on the context x_1 ... x_{i-1} of x_i
  • VMMs were first introduced as PSTs
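
A minimal sketch of this factorization, assuming a toy model table
mapping (context, symbol) pairs to probabilities (the names model and
max_context are illustrative, not from the slides):

    def sequence_probability(seq, model, max_context):
        """P(seq) as a product of P(x_i | context), where the context
        is the longest stored suffix of x_1 .. x_{i-1}."""
        p = 1.0
        for i, symbol in enumerate(seq):
            for start in range(max(0, i - max_context), i + 1):
                context = seq[start:i]  # longest candidate history first
                if (context, symbol) in model:
                    p *= model[(context, symbol)]
                    break  # the context function chose this history length
        return p

    # Toy model; root entries ("", a) give every symbol a fallback.
    model = {("", "A"): 0.5, ("", "B"): 0.5, ("A", "B"): 0.9}
    print(sequence_probability("AB", model, max_context=3))  # 0.5 * 0.9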

8
PSTs
  • VMMs for efficient prediction
  • Pruned during training to contain only required
    parameters
  • bPST represents histories
  • ePST represents sequences

9
bPST
  • Used to represent the histories for prediction
    instead of the training sequences
  • The possible histories are the reversed strings
    of all the substrings of the training sequences

10
Prediction with bPSTs
  • The conditional probabilities P(x_i | x_1 ... x_{i-1})
    are obtained for each position by tracing a path from
    the root that matches the preceding residues

11
Construction bPST
  • We add histories from the training data
  • Nodes store parameters that estimate the conditional
    probabilities (see the lookup sketch below):
  • γ_history(a) = P(a | history)
  • P_bPST(x_i | x_{i-1}, ..., x_1) = γ_{x_1 ... x_{i-1}}(x_i)
    if x_1 ... x_{i-1} is in the bPST
  • else γ_{x_2 ... x_{i-1}}(x_i) if x_2 ... x_{i-1} is in
    the bPST, etc.
  • else γ(x_i) (the root, i.e. the empty history)
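
A minimal lookup sketch of the fallback rule above, assuming the bPST
is stored as a dict from each retained history string to its
next-symbol distribution (a hypothetical representation, not the
paper's own data structure):

    def predict(bpst, history, symbol):
        """Return P(symbol | history) from the longest stored history."""
        for start in range(len(history) + 1):
            context = history[start:]  # drop the oldest symbols first
            if context in bpst:
                return bpst[context][symbol]  # gamma_context(symbol)
        raise KeyError("the bPST must contain the empty history (root)")

    # Toy bPST over {0, 1}; "" is the root distribution.
    bpst = {"": {"0": 13 / 27, "1": 14 / 27},
            "0": {"0": 5 / 13, "1": 8 / 13}}
    print(predict(bpst, "010", "1"))  # falls back from "010" to "0": 8/13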

12
[Figure: bPST created and pruned using the training string
010010010011110101100010111 (Brett Poulin)]

P(01001) = P(0) · P(1|0) · P(0|01) · P(0|010) · P(1|0100)
         = γ(0) · γ_0(1) · γ_01(0) · γ_0(0) · γ_00(1)
         = (13/27)(8/13)(5/8)(5/13)(4/5) = 10400/182520
         ≈ 0.057
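
A quick exact-arithmetic check of the product above:

    from fractions import Fraction
    from math import prod

    # gamma(0), gamma_0(1), gamma_01(0), gamma_0(0), gamma_00(1)
    factors = [Fraction(13, 27), Fraction(8, 13), Fraction(5, 8),
               Fraction(5, 13), Fraction(4, 5)]
    p = prod(factors)
    print(p, float(p))  # 20/351 (= 10400/182520) ~ 0.057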
13
Complexity bPST
  • bPST building process requires O(Ln2) time
  • L is the length limit of the tree
  • n is the total length of the training set.
  • bPST building requires all training sequences at
    once (in order to get all the reverse substrings)
    and cannot be done online (the bPST cannot be
    built as the training data is encountered)
  • Prediction O(mL), m sequence length

14
Improved bPST
  • Idea tree with training sequences
  • n length of all training sequences
  • m length of tested sequence
  • Result (theoretical)
  • linear time building O(n)
  • linear time prediction O(m).

15
Efficient PST (ePST)
  • Used for predicting protein function
  • ePST represents sequences
  • Linear construction and prediction

16
[Figure: example ePST (Brett Poulin)]
17
Prediction with ePSTs
  • The probabilities for a substring are obtained
    for each position by tracing the path
    representing the sequence from the root
  • If the entire sequence is not found in the tree,
    suffix links are followed (see the sketch below)
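
A hypothetical sketch of this walk; the node layout (children,
suffix_link, prob) is illustrative and only approximates the ePST of
Poulin's thesis:

    class Node:
        def __init__(self):
            self.children = {}       # symbol -> child Node
            self.suffix_link = None  # node for the string minus its first symbol
            self.prob = {}           # symbol -> conditional probability here

    def score(root, seq, floor=1e-9):
        """Trace seq from the root; on a mismatch, follow suffix links
        so the matched context shortens instead of restarting."""
        node, p = root, 1.0
        for symbol in seq:
            while symbol not in node.children and node is not root:
                node = node.suffix_link  # shorten the context
            p *= node.prob.get(symbol, floor)  # floor guards unseen symbols
            if symbol in node.children:
                node = node.children[symbol]
        return p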

18
Construction ePST
  • ePSTs gain efficiency by representing the
    training sequences in the PST
  • Nodes store counts of the subsequence occurrences
    in the training data (with respect to the
    complete tree)
  • Conditional probabilities derived from the counts
    are stored as well (see the sketch below)
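
A minimal sketch of deriving a conditional probability from stored
counts, using the AYYYA sequence from the next slide (the counts table
is hand-filled for illustration):

    def conditional(counts, context, symbol):
        """P(symbol | context) = count(context + symbol) / count(context)."""
        return counts.get(context + symbol, 0) / counts[context]

    # Occurrence counts of the substrings of AYYYA (overlaps included).
    counts = {"A": 2, "Y": 3, "AY": 1, "YY": 2, "YA": 1}
    print(conditional(counts, "A", "Y"))  # P(Y | A) = 1/2
    print(conditional(counts, "Y", "Y"))  # P(Y | Y) = 2/3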

19
[Figure: example ePST built from the training sequence AYYYA (Brett Poulin)]
20
Complexity ePST
  • Linear time and space with respect to the combined
    length of the training sequences: O(n)
  • Linear prediction time: O(m)

21
Advantages and Disadvantages
  • Avoids the exponential space requirements and
    parameter-estimation problems of higher-order
    Markov chains
  • Pruned during training to contain only the required
    parameters
  • bPSTs used for local predictions give more accurate
    predictions than global ones
  • Some loss in classification performance (Pfam, SCOP)

22
Conclusions
  • PSTs require less training and prediction time
    than HMMs
  • Despite some loss in classification performance,
    PSTs compete with HMMs due to their reduced
    resource demands
  • As VMMs, PSTs take advantage of higher-order
    correlations

23
References
  • [1] B. Poulin, Sequence-based Protein Function
    Prediction, Master's Thesis, University of Alberta,
    2004
  • [2] G. Bejerano, G. Yona, Modeling protein families
    using probabilistic suffix trees, RECOMB '99
  • [3] G. Bejerano, Algorithms for variable length Markov
    chain modeling, Bioinformatics Applications Note,
    20(5):788-789, 2004

24
PSTs and HMMs
  • HMMs do not capture any higher-order
    correlations. An HMM assumes that the identity of
    a particular position is independent of the
    identity of all other positions. [1]
  • PSTs are variable-length Markov models for
    efficient prediction. The prediction uses the
    longest available context matching the history of
    the current amino acid.
  • For protein prediction in general, the main
    advantage of PSTs over HMMs is that the training
    and prediction time requirements of PSTs are much
    less than for the equivalent HMMs. [1]

25
Suffix Trees (ST)
[Figure: suffix tree example (Brett Poulin)]
26
bPST
  • Histories added to the tree must occur more
    frequently than a threshold P_min
  • The substrings are added in order of length, from
    smallest to largest (see the sketch below)
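
A hypothetical construction loop illustrating both rules; p_min and
the shortest-first scan follow the slide, while the names and the
naive counting are illustrative only:

    def frequent_substrings(training, max_len, p_min):
        """Collect substrings, shortest first, whose empirical
        probability meets the P_min threshold; these are candidate
        bPST histories (the real bPST stores them reversed)."""
        kept = {""}  # the root (empty history) is always present
        n = len(training)
        for length in range(1, max_len + 1):
            for i in range(n - length + 1):
                s = training[i:i + length]
                if s in kept:
                    continue
                # Overlapping occurrence count; this naive scan is
                # where the O(Ln^2) build cost noted earlier comes from.
                occ = sum(training.startswith(s, j)
                          for j in range(n - length + 1))
                if occ / (n - length + 1) >= p_min:
                    kept.add(s)
        return kept

    print(sorted(frequent_substrings("010010010011110101100010111",
                                     3, 0.15)))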

27
bPST vs ST
  • The string s is only added to the tree if the
    resulting conditional probability at the node to
    be created will be greater than the minimum
    prediction probability ?min a and the
    probability for the prefix of the string is
    different (with some ratio r) from the
    probability assigned to the next shortest
    substring suf(s) (which is already in the tree).
    After all the substrings are added to the tree,
    the probabilities are smoothed according to the
    parameter ?min.
  • The smoothing (as calculated by the equation
    below) prevents any probability from being less
    than ?min
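
A minimal sketch of that smoothing, assuming dist covers the whole
alphabet so len(dist) equals |Σ|:

    def smooth(dist, gamma_min):
        """Shrink each P(symbol | context) toward gamma_min so no
        smoothed probability falls below it; the result still sums to 1."""
        sigma = len(dist)  # alphabet size, assuming full coverage
        return {a: (1 - sigma * gamma_min) * p + gamma_min
                for a, p in dist.items()}

    print(smooth({"0": 5 / 8, "1": 3 / 8}, gamma_min=0.01))
    # {'0': 0.6225, '1': 0.3775} -- both >= 0.01, total still 1.0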
