Pedro Ferreira, Paulo Azevedo - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Protein Sequence Classification Through Relevant
Sequence Mining and Bayes Classifiers
Pedro Ferreira, Paulo Azevedo
Department of Informatics, University of Minho
12th EPIA 2005, CMB workshop, Covilhã, Portugal
5 December 2005
2
Outline
  • Motivation
  • Types of Patterns
  • Method
  • Results
  • Conclusions

3
Motivation
Protein sequence classification is one of the
most important problems in protein sequence
analysis, with applications in many domains.
Because of the exponential growth in newly
generated sequences, it requires automatic and
efficient methods.
Sequence patterns, or motifs, are elements
conserved across different proteins. Since these
patterns are tightly related to the function and
structure of proteins, they can be used as a tool
to classify the function or family of a protein.
Automatic classification of protein sequence
patterns is the focus of considerable effort in
the bioinformatics and data mining (DM)
communities!
4
Some Notations
A linear sequence is a sequence composed of
successive atomic elements, generically called
events (amino acids).
A sequence pattern is frequent if it is a
subsequence of a number of sequences in the
dataset greater than or equal to a specified
threshold value, the minimum support.
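The support definition above can be illustrated with a minimal sketch. The function names and the toy sequences are made up for illustration and are not from the presentation; a pattern is treated here as a plain ordered subsequence, with arbitrary gaps allowed between its events.

```python
from typing import Iterable, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if the events of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    # `event in remaining` consumes the iterator up to the first match,
    # so each event must be found after the previous one.
    return all(event in remaining for event in pattern)


def support(pattern: Sequence[str], dataset: Iterable[Sequence[str]]) -> int:
    """Number of sequences in the dataset that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in dataset)


def is_frequent(pattern, dataset, min_support: int) -> bool:
    """A pattern is frequent when its support reaches the minimum support."""
    return support(pattern, dataset) >= min_support


if __name__ == "__main__":
    toy_dataset = ["MKVLA", "MAVLK", "KVL", "MLLA"]  # made-up sequences
    print(support("VL", toy_dataset), is_frequent("VL", toy_dataset, 3))
```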
5
Types of Patterns
Patterns or motifs are typically classified into
two types:
Deterministic patterns consist of words over a
defined syntax. Besides the protein alphabet
(amino acids), they may contain wild-cards and
fixed- or variable-length gaps to enhance their
expressive power. Ex.: PROSITE database,
C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H
Probabilistic patterns describe a model that
assigns a probability of the pattern matching a
given sequence. Ex.: PWM (Position Weight Matrix).
We will only consider deterministic patterns!
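A deterministic pattern in PROSITE-style syntax can be checked against a sequence by translating it into a regular expression. The sketch below handles only a small subset of the syntax (single residues, `x`, `x(n)`, `x(m,n)` and bracketed residue classes) and assumes that LIVMFYWC in the example is a residue class, written `[LIVMFYWC]`. The function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE-style syntax into a Python regex."""
    element_re = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        core = "." if core == "x" else core      # x is the wild-card
        if low is None:
            repeat = ""                          # exactly one position
        elif high is None:
            repeat = "{%s}" % low                # fixed-length gap
        else:
            repeat = "{%s,%s}" % (low, high)     # variable-length gap
        parts.append(core + repeat)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex(
        "C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H"
    )
    # Made-up sequence built to satisfy the pattern.
    toy = "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
    print(regex, bool(re.search(regex, toy)))
```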
6
Types of Patterns
Consider patterns of the form
A1 - x(p1, q1) - A2 - x(p2, q2) - ... - An
Flexible gap patterns (FPs) contain gaps of
variable size, greater than or equal to zero:
pi ≤ qi for any i. From a biological point of
view, FPs allow relations to be found in larger
sets of proteins, over a larger span!
Rigid gap patterns (RPs): gaps have a fixed size
for all occurrences of the sequence pattern in the
database: pi = qi for any i. RPs express strongly
conserved regions, tightly related to the function
or structure of the proteins!
Relevant pattern: a pattern that is frequent and
satisfies a minimal length.
7
Patterns Constraints
Event constraints define the set of allowed
events.
Gap constraints (maxGap and minGap) define the
minimum and maximum distance between adjacent
events.
Window constraints define the window distance of
the pattern.
8
Example
1 2 2 3 4 1 2 3 4 5 1 6 3 7 5
Flexible Pattern: 1 - x(1, 2) - 3 4
1 2 2 3 4 1 2 3 4 5 1 6 3 7 5
Rigid Pattern: 1 . 3 . 5
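The two patterns of the example can be matched mechanically. The sketch below represents a pattern as a list of events plus a list of (minGap, maxGap) pairs between adjacent events. It rests on two assumptions that the flattened slide does not confirm: that the 15 events form three sequences of five events each, and that "3 4" in the flexible pattern means 3 immediately followed by 4. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Gap = Tuple[int, int]  # (minGap, maxGap) between two adjacent events


def matches_at(seq: Sequence[int], start: int,
               events: List[int], gaps: List[Gap]) -> bool:
    """True if the pattern occurs with its first event at index `start`."""
    if start >= len(seq) or seq[start] != events[0]:
        return False
    if len(events) == 1:
        return True
    min_gap, max_gap = gaps[0]
    # Try every allowed gap size before the next event.
    return any(
        matches_at(seq, start + gap + 1, events[1:], gaps[1:])
        for gap in range(min_gap, max_gap + 1)
    )


def occurs(seq, events, gaps) -> bool:
    """True if the pattern occurs anywhere in the sequence."""
    return any(matches_at(seq, i, events, gaps) for i in range(len(seq)))


def support(dataset, events, gaps) -> int:
    """Number of sequences in which the pattern occurs."""
    return sum(occurs(seq, events, gaps) for seq in dataset)


def is_rigid(gaps: List[Gap]) -> bool:
    """Rigid gap pattern: every gap has a fixed size (pi = qi)."""
    return all(low == high for low, high in gaps)


if __name__ == "__main__":
    # Assumed split of the example into three sequences.
    dataset = [[1, 2, 2, 3, 4], [1, 2, 3, 4, 5], [1, 6, 3, 7, 5]]
    flexible = ([1, 3, 4], [(1, 2), (0, 0)])   # 1 - x(1,2) - 3 4
    rigid = ([1, 3, 5], [(1, 1), (1, 1)])      # 1 . 3 . 5
    for events, gaps in (flexible, rigid):
        print(events, gaps, is_rigid(gaps), support(dataset, events, gaps))
```

Under these assumptions each pattern is supported by two of the three sequences, and only the second one is rigid.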
9
Method Goal
  • Our goal is to suggest a robust and adaptable
    classification method using a straightforward
    algorithm.
  • Our method:
  • performs multi-class classification
  • does not require sequence transformation (direct
    sequence classifier)
  • does not require multiple alignment or
    background knowledge

10
Method formulation
  • Given a collection of classified sequences D, a
    query sequence Q, a minimum support s and a
    minimal length L, determine the similarity of Q
    w.r.t. all the classes in D.
  • We use a query-driven sequence mining algorithm
    that, for each Q, D, s and L, extracts the number
    of relevant patterns and the average length of
    those patterns.
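The two parameters extracted per class can be illustrated with a deliberately naive sketch. It is not the authors' query-driven miner, which the slides do not describe: it substitutes contiguous substrings of Q (no gaps) for gap patterns, and counts support as the number of class sequences containing the substring. The names `relevant_patterns` and `features` and the toy database are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def relevant_patterns(query: str, class_seqs: List[str],
                      min_support: int, min_length: int) -> Set[str]:
    """Substrings of the query that are frequent in the class and long enough."""
    found = set()
    for i in range(len(query)):
        for j in range(i + min_length, len(query) + 1):
            candidate = query[i:j]
            count = sum(candidate in seq for seq in class_seqs)
            if count < min_support:
                break  # extending an infrequent substring cannot raise its support
            found.add(candidate)
    return found


def features(query: str, class_seqs: List[str],
             min_support: int, min_length: int) -> Tuple[int, float]:
    """(number of relevant patterns, average pattern length) for one class."""
    patterns = relevant_patterns(query, class_seqs, min_support, min_length)
    if not patterns:
        return 0, 0.0
    return len(patterns), sum(map(len, patterns)) / len(patterns)


if __name__ == "__main__":
    database: Dict[str, List[str]] = {          # made-up toy families
        "famA": ["ACDEFG", "XACDEY", "ACDZZ"],
        "famB": ["GGGHHH", "HHHGGG"],
    }
    for family, seqs in database.items():
        print(family, features("ACDE", seqs, min_support=2, min_length=3))
```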

11
Method Bayes Classifier
The goal is to assign a probability to Q w.r.t.
each of the classes C1, C2, ..., Cn, based on the
vector of observed parameters.
This can be achieved through the conditional
probability of the class given the parameters,
using Bayes' theorem (Eq. 1), which involves:
- the a priori probability of the class
- the probability of the parameters (class
independent)
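The equation itself did not survive in this transcript. For reference, Bayes' theorem in its standard form is given below. The notation is an assumption, not the authors': f stands for the vector of observed parameters (the number of relevant patterns and their average length).

```latex
\[
P(C_i \mid \mathbf{f}) \;=\;
\frac{P(\mathbf{f} \mid C_i)\, P(C_i)}{P(\mathbf{f})},
\qquad i = 1, \dots, n
\]
```

Here P(C_i) is the a priori probability of the class and P(f) is the probability of the parameters, which does not depend on the class and can therefore be ignored when comparing classes.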
12
Method Bayes Classifier
We weight the parameters and rewrite the Bayes
formula (Eq. 2). We assume that the parameters are
statistically independent (which is not entirely
true). Eq. 2 includes a term that is constant for
the respective class Ci.
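Eq. 2 is also missing from the transcript. One plausible form, under the stated independence assumption, factorizes the likelihood over the individual parameters and weights each factor with an exponent. The symbols f_j and the weights alpha_j are assumed notation, and the authors' actual weighting scheme may differ.

```latex
\[
P(C_i \mid \mathbf{f}) \;\propto\;
P(C_i) \prod_{j} P(f_j \mid C_i)^{\alpha_j}
\]
```

With all alpha_j = 1 this reduces to the usual naive Bayes rule.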
13
Method Bayes Classifier
We suggest three models based on Eq. 2:
(A) The a priori probability of the classes is not
taken into account.
(B) The inverse a priori probability is used: to
avoid bias due to different family sizes, the
a priori probability is normalized by the length
of the class.
(C) The term for the number of patterns is raised
to the power of three, so that this parameter is
given a greater relative weight.
14
Method Bayes Classifier
Given a query sequence Q and the respective
parameter vector, the classification is simply
given by the class with the highest probability.
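The decision rule can be sketched as an argmax over per-class scores. The exact formulas of models A, B and C are not recoverable from the transcript, so the scoring below is only one reading of the slide text: A drops the prior, B divides by it, and C drops the prior and cubes the number-of-patterns term. The per-class probabilities are invented placeholders, and `score` and `classify` are hypothetical names.

```python
from typing import Dict, Tuple

# Per class: (a priori probability, P(number of patterns | class),
#             P(average length | class)). All values are placeholders.
ClassStats = Dict[str, Tuple[float, float, float]]


def score(model: str, prior: float, p_num: float, p_len: float) -> float:
    """Unnormalized class score under an independence assumption."""
    if model == "A":              # assumed: prior not taken into account
        return p_num * p_len
    if model == "B":              # assumed: inverse prior
        return p_num * p_len / prior
    if model == "C":              # assumed: number-of-patterns term cubed
        return p_num ** 3 * p_len
    raise ValueError(f"unknown model: {model!r}")


def classify(model: str, stats: ClassStats) -> str:
    """Return the class with the highest score."""
    return max(stats, key=lambda c: score(model, *stats[c]))


if __name__ == "__main__":
    stats = {"famA": (0.8, 0.5, 0.4), "famB": (0.2, 0.3, 0.9)}
    for model in "ABC":
        print(model, classify(model, stats))
```

With these placeholder values, cubing the number-of-patterns term changes the decision, which is the kind of re-weighting that model C is described as performing.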
15
Results Setup
  • Use Query Driven Miner to extract Rigid Gap
    Patterns, with maxGap = 15 and WindowSize = 20.
  • Three collections of protein families:
  • Pfam version 17.0 (26 families, mostly taken from
    the top-20 list, April 2005)
  • Pfam version 1.0 (50 families)
  • Prosite Receptors Group (27 families)
  • Competitors: Probabilistic Suffix Trees (PST) and
    Sparse Markov Transducers (SMT).
  • Evaluation based on the leave-one-out methodology,
    according to the precision rate (PR)
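The leave-one-out protocol can be sketched as follows, assuming the precision rate is the fraction of held-out sequences assigned to their true family. The classifier used here is a stand-in based on shared 3-mers, not the method of the presentation, and the toy families are made up.

```python
from typing import Callable, Dict, List, Set

Database = Dict[str, List[str]]


def kmers(seq: str, k: int = 3) -> Set[str]:
    """All contiguous substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_classify(query: str, database: Database) -> str:
    """Stand-in classifier: family sharing the most 3-mers with the query."""
    q = kmers(query)
    return max(
        database,
        key=lambda fam: sum(len(q & kmers(s)) for s in database[fam]),
    )


def leave_one_out_precision(database: Database,
                            classify: Callable[[str, Database], str]) -> float:
    """Hold out each sequence in turn and classify it against the rest."""
    correct = total = 0
    for family, seqs in database.items():
        for i, query in enumerate(seqs):
            reduced = dict(database)
            reduced[family] = seqs[:i] + seqs[i + 1:]   # remove the query
            correct += classify(query, reduced) == family
            total += 1
    return correct / total


if __name__ == "__main__":
    toy_db = {
        "famA": ["ACDEFG", "ACDEFH", "XACDEF"],
        "famB": ["GGGHHH", "GGGHHK", "YGGGHH"],
    }
    print(leave_one_out_precision(toy_db, toy_classify))
```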

16
Results
Pfam version 17.0 (26 families)
Average results
17
Results
Prosite receptors (27 families)
Similarity matrix based on true positives (main
diagonal) and false negatives.
18
Results
Pfam 1.0 (50 families)
  • A two-tailed signed-rank test was applied to test
    the null hypothesis that the medians of the paired
    classifiers (C vs. PST, and C vs. SMT) are equal.
  • The medians for C and PST are significantly
    different.
  • For C and SMT the null hypothesis is not rejected:
    there is no significant difference.
  • Previously published results
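A two-tailed signed-rank test on paired per-family results can be sketched in plain Python. This version uses the normal approximation without tie or continuity corrections, so it is only a rough illustration. The paired values in the usage example are invented placeholders, not results from the presentation.

```python
import math
from typing import Sequence, Tuple


def signed_rank_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Wilcoxon signed-rank statistic W and two-tailed p (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Group tied absolute differences and give them their average rank.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return w, p_two_tailed


if __name__ == "__main__":
    # Invented per-family precision rates (%) for two classifiers.
    classifier_1 = [90, 85, 78, 92, 88, 95, 70, 81]
    classifier_2 = [85, 80, 79, 85, 80, 90, 65, 78]
    w, p = signed_rank_test(classifier_1, classifier_2)
    print(w, p, "reject H0" if p < 0.05 else "do not reject H0")
```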

19
Conclusions Factors Performance
  • Model C has a higher precision rate
  • The average length parameter has a bigger impact
  • Lower support values result in a higher precision
    rate (they allow patterns to be found among
    smaller subsets of sequences)
  • The support values used are a trade-off between
    precision and performance
  • Patterns reveal local and global similarity
  • The method does not discriminate patterns based
    on biological or statistical relevance

20
Conclusions
  • We propose a straightforward method to perform
    multi-class and multi-domain classification.
  • Based on a Bayesian classifier, three
    probabilistic models are suggested.
  • The method shows performance equivalent to
    state-of-the-art methods.
  • Greatest drawback: the a priori determination of
    support values.
  • In order to improve the precision of the method,
    patterns need to be discriminated according to
    their biological and statistical relevance.

21
Thanks! Questions?