1
An Efficient Algorithm for Large-Scale Detection
of Protein Families (TRIBE-MCL)
A. J. Enright, S. Van Dongen and C. A. Ouzounis
  • Seminar (Fachseminar), D. Simmen
  • 5 April 2008

2
Overview
  • Motivation
  • Proteins and problems
  • Earlier solutions
  • MCL
  • Example
  • Conclusion

3
Goals
  • Understand MCL
  • See why this works with proteins
  • Understand the example

4
Motivation
  • Why do we want to group proteins into families?
  • Detection of protein families in large databases
    is one of the principal research objectives in
    structural and functional genomics
  • Protein family classification can significantly
    contribute to:
  • the delineation of functional diversity of
    homologous proteins
  • the prediction of function based on domain
    architecture or the presence of sequence motifs
  • comparative genomics
  • valuable evolutionary insights

5
Proteins
  • Linear structure
  • Made of amino acids joined by peptide bonds
  • Consists of one or more domains
  • A domain is an independent structural unit

6
Problems
  • Multi-domain structure of many protein families
  • confounds grouping methods (shared domain vs.
    biochemical function)
  • results in the incorrect grouping of proteins
  • Promiscuous domains: smaller, quite widespread
    protein modules
  • Families based on such domains are unlikely to
    share a common evolutionary history

7
Example P1
Families: (A), (CD), (BEFH), (G)
8
Earlier Solutions
  • Attempted Solutions
  • detection of individual domains
  • using BLAST reports
  • domain database dictionaries
  • iterative sequence comparison
  • manual intervention for family assignment of
    multi-domain proteins
  • Drawbacks
  • either too computationally intensive
  • or somewhat inaccurate
  • or not fully automatic

9
Is there another way?
  • Represent proteins as a graph
  • Protein as node
  • Similarity as edge
  • Higher similarity -> shorter edge
  • Ignores domains and the other problems stated
    above
  • Families are represented by clusters

10
Clusters
11
MCL
  • Detect clusters in graphs
  • Uses
  • A Markov matrix
  • Expansion operation
  • Inflation operation

12
Markov matrix
  • Represents a graph with the probabilities of
    moving from one node to another
  • Each column can be read as the probabilities of
    passing from the corresponding node to each
    other node
  • Each column sums to one
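
As a minimal sketch of this construction, the following Python snippet turns a small matrix of edge weights into a column-stochastic Markov matrix. The three-node graph and the function name `to_markov` are made up for illustration and do not come from the slides.

```python
def to_markov(weights):
    """Column-normalize a square matrix of non-negative edge weights."""
    n = len(weights)
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    if any(s == 0 for s in col_sums):
        raise ValueError("every node needs at least one edge or a self-loop")
    return [[weights[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Hypothetical graph: nodes 0 and 1 are strongly linked, node 2 only loosely.
weights = [
    [1.0, 4.0, 0.0],
    [4.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
M = to_markov(weights)

# Column j holds the probabilities of leaving node j for each node i.
for row in M:
    print(["%.3f" % x for x in row])
```

Each printed column sums to one, which is the property the slide relies on.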

13
Markov chains
  • Given a Markov matrix M, we can compute the
    probability of being at a given node after two
    steps
  • To do this we have to compute the product M·M
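
To make the matrix product concrete, here is a small sketch in plain Python. Multiplying a column-stochastic matrix by itself yields the two-step transition probabilities. The matrix values are invented for illustration; `matmul` is a helper defined here, not part of any library.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical column-stochastic matrix: M[i][j] is the probability of
# moving from node j to node i in a single step.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

# M2[i][j] is the probability of reaching node i from node j in two steps.
M2 = matmul(M, M)
print([row[0] for row in M2])  # two-step distribution when starting at node 0
```

The columns of `M2` still sum to one, so the product is again a Markov matrix.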

14
Expansion
  • corresponds to computing random walks of higher
    length (which means random walks with many
    steps)
  • Premise: higher-length paths are more common
    within clusters than between different clusters
  • the probabilities associated with node pairs
    within a cluster will, in general, be relatively
    large
  • Inflation will then have the effect of boosting
    the probabilities of intra-cluster walks and will
    demote inter-cluster walks
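
The premise can be checked on a toy graph. The sketch below assumes a made-up graph of two triangles joined by a single bridge edge, with a self-loop on every node (an assumption of this example, not something stated on the slides). After one expansion step, most of the probability mass of a walk that starts inside a triangle is still inside it.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expand(m, power=2):
    """Expansion: raise the Markov matrix to an integer power >= 1."""
    result = m
    for _ in range(power - 1):
        result = matmul(result, m)
    return result


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, bridge 2-3.
n = 6
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0

col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
M = [[adj[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

M2 = expand(M, 2)
inside = sum(M2[i][0] for i in (0, 1, 2))   # two-step walks from node 0 that stay
outside = sum(M2[i][0] for i in (3, 4, 5))  # two-step walks from node 0 that leave
print("inside: %.4f  outside: %.4f" % (inside, outside))
```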

15
Hadamard Product
16
Inflation
  • corresponds with taking the Hadamard power of a
    matrix (taking powers entry wise)
  • followed by a scaling step, such that the
    resulting matrix is stochastic again
  • Γ_r -> inflation operator
  • M -> stochastic matrix
  • r -> a real number (power coefficient), r > 1
  • M_ij -> probability of going from j to i
  • For values of r > 1, inflation changes the
    probabilities by favoring more probable walks
    over less probable walks
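
A minimal sketch of the inflation step as described above: raise each entry to the power r, then rescale every column so that it sums to one again. The function name `inflate` and the 2x2 matrix are illustrative assumptions.

```python
def inflate(m, r):
    """Entry-wise (Hadamard) power r, followed by column rescaling."""
    n = len(m)
    powered = [[m[i][j] ** r for j in range(n)] for i in range(n)]
    col_sums = [sum(powered[i][j] for i in range(n)) for j in range(n)]
    return [[powered[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Hypothetical column-stochastic matrix.
M = [
    [0.6, 0.2],
    [0.4, 0.8],
]

inflated = inflate(M, 2)
for row in inflated:
    print(["%.4f" % x for x in row])
```

In each column the larger entry grows and the smaller one shrinks, which is the "favoring more probable walks" effect. With r = 1 the matrix is returned unchanged.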

17
All together
  • MCL simulates random walks by alternating
    expansion and inflation
  • As long as the matrix keeps changing, MCL
    repeats these two steps
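
The loop can be sketched in a few lines of plain Python. This is an illustrative toy version, not the reference MCL implementation: the stopping tolerance, the iteration cap, the self-loops and the test graph (two triangles joined by one edge) are all assumptions of this example.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    n = len(m)
    powered = [[m[i][j] ** r for j in range(n)] for i in range(n)]
    col_sums = [sum(powered[i][j] for i in range(n)) for j in range(n)]
    return [[powered[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def mcl(m, r=2.0, max_iter=100, tol=1e-9):
    """Alternate expansion (M*M) and inflation until the matrix stops changing."""
    n = len(m)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        nxt = inflate(matmul(m, m), r)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    return m, iterations


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5}, bridge 2-3, self-loops.
n = 6
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0
col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
M = [[adj[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

result, iterations = mcl(M, r=2.0)
print("stopped after", iterations, "iterations")
for row in result:
    print(["%.2f" % x for x in row])
```

In the final matrix, the columns of nodes 0 to 2 carry essentially all their mass in rows 0 to 2, and likewise for nodes 3 to 5, so the two triangles come out as separate clusters.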

18
Result
  • The resulting matrix has to be interpreted as a
    graph
  • The resulting matrix shows the clusters according
    to the chosen granularity
  • The clusters look like stars with one node as
    center

19
TRIBE-MCL
  • The actual algorithm for proteins
  • BLASTp computes the similarities
  • With these values a Markov matrix is created
  • MCL runs on this matrix and returns a result
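
The slides do not spell out how the BLAST values become a Markov matrix. The sketch below shows one plausible route under stated assumptions: each E-value is turned into a weight via -log10(E), the weights are made symmetric, a self-loop is added to every protein, and the columns are normalized. The protein names, the E-values, the cap of 200 and the self-loop rule are all hypothetical choices for this example.

```python
import math


def evalues_to_markov(names, hits, cap=200.0):
    """Build a column-stochastic matrix from pairwise E-values (illustrative)."""
    index = {name: k for k, name in enumerate(names)}
    n = len(names)
    w = [[0.0] * n for _ in range(n)]
    for (a, b), e in hits.items():
        # Smaller E-value means higher similarity; cap guards against E = 0.
        score = cap if e <= 0.0 else min(cap, max(0.0, -math.log10(e)))
        i, j = index[a], index[b]
        w[i][j] = w[j][i] = max(w[i][j], score)  # keep the matrix symmetric
    for k in range(n):
        # Assumed self-loop rule: strongest edge of the node, at least 1.0.
        w[k][k] = max(max(w[k]), 1.0)
    col_sums = [sum(w[i][j] for i in range(n)) for j in range(n)]
    return [[w[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Hypothetical BLAST hits between three proteins.
names = ["P1", "P2", "P3"]
hits = {("P1", "P2"): 1e-50, ("P2", "P3"): 1e-5}

M = evalues_to_markov(names, hits)
for name, row in zip(names, M):
    print(name, ["%.3f" % x for x in row])
```

The resulting matrix can then be handed to the expansion and inflation loop.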

20
Example P2
  • Similarities between Proteins
  • Based on E-values given by BLASTp

21
Example P3
  • Corresponding Markov matrix

22
Example P4
  • r = 5
  • iterations = 7

       A  B  C  D  E  F  G  H
    A  1  0  0  0  0  0  0  0
    B  0  1  0  0  0  0  0  0
    C  0  0  1  0  0  0  0  0
    D  0  0  0  1  0  0  0  0
    E  0  0  0  0  1  0  0  0
    F  0  0  0  0  0  1  0  0
    G  0  0  0  0  0  0  1  0
    H  0  0  0  0  0  0  0  1

23
Example P5
  • r = 1
  • iterations = 8

         A       B       C       D       E       F       G       H
    A  0.0953  0.0953  0.0953  0.0953  0.0953  0.0953  0.0953  0.0953
    B  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588
    C  0.0946  0.0946  0.0946  0.0946  0.0946  0.0946  0.0946  0.0946
    D  0.0946  0.0946  0.0946  0.0946  0.0946  0.0946  0.0946  0.0946
    E  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588
    F  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588
    G  0.0806  0.0806  0.0806  0.0806  0.0806  0.0806  0.0806  0.0806
    H  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588  0.1588

24
Example P6
  • r = 1.1
  • iterations = 456

        A    B    C    D    E    F    G    H
    A   0    0    0    0    0    0    0    0
    B   0    0    0    0    0    0    0    0
    C   0    0    0    0    0    0    0    0
    D   0    0    0    0    0    0    0    0
    E  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5
    F  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5
    G   0    0    0    0    0    0    0    0
    H   0    0    0    0    0    0    0    0

25
Example P7
  • r = 2.1
  • iterations = 63

       A  B  C  D  E  F  G  H
    A  1  0  0  0  0  0  0  0
    B  0  0  0  0  0  0  0  0
    C  0  0  1  1  0  0  0  0
    D  0  0  0  0  0  0  0  0
    E  0  0  0  0  0  0  0  0
    F  0  1  0  0  1  1  0  1
    G  0  0  0  0  0  0  1  0
    H  0  0  0  0  0  0  0  0
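
Reading the clusters off such a matrix can be sketched as follows: every non-empty row is an attractor, and the columns with a non-zero entry in that row are the members of its cluster. The matrix below is copied from this slide; the function name and the threshold `eps` are assumptions of this example.

```python
def clusters_from_limit(matrix, labels, eps=1e-6):
    """Collect one cluster per distinct non-empty row of the limit matrix."""
    found = []
    for row in matrix:
        members = "".join(labels[j] for j, value in enumerate(row) if value > eps)
        # Skip empty rows and duplicate clusters (several attractors, one cluster).
        if members and members not in found:
            found.append(members)
    return found


labels = "ABCDEFGH"
limit = [
    [1, 0, 0, 0, 0, 0, 0, 0],  # A
    [0, 0, 0, 0, 0, 0, 0, 0],  # B
    [0, 0, 1, 1, 0, 0, 0, 0],  # C
    [0, 0, 0, 0, 0, 0, 0, 0],  # D
    [0, 0, 0, 0, 0, 0, 0, 0],  # E
    [0, 1, 0, 0, 1, 1, 0, 1],  # F
    [0, 0, 0, 0, 0, 0, 1, 0],  # G
    [0, 0, 0, 0, 0, 0, 0, 0],  # H
]

families = clusters_from_limit(limit, labels)
print(families)  # ['A', 'CD', 'BEFH', 'G']
```

The duplicate check matters for results like the r = 1.1 run, where two rows describe the same single cluster.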

26
Example P8
[Figure: graph of the proteins A to H, grouped into the predicted clusters]
Families predicted: (A), (CD), (BEFH), (G)
27
Performance
  • Examples of tests done so far

28
Conclusion
  • Fast and remarkably accurate algorithm
  • Highly experimental
  • MCL has so far been used as a solution in about
    200 different applications

29
Questions
  • Questions?