Author: Jason Weston et al.


1
Protein Ranking: From Local to Global Structure
in the Protein Similarity Network

Author: Jason Weston et al.
PNAS
Presented by Tie Wang

2
Outline

Introduction
Background
Method
Experiment
Analysis

3
Introduction

Subtle pairwise sequence similarities imply
structural, functional, and evolutionary relations
among DNA and protein sequences.
Searching an online biosequence database is
analogous to searching the WWW: a search engine
searches the database for the query and returns a
ranked list.
A protein ranking algorithm is presented for
biosequence queries.

4
Background

Early algorithms focus only on pairwise sequence
similarity (Smith-Waterman local alignment search).
Statistical models use multiple alignments for
similarity search (profile-based methods, PSI-BLAST).
Global similarity search can be mapped onto a
protein similarity network.

5
How to perform protein ranking?

Underlying idea: Google ranking.
Key feature: exploiting global structure by
inferring it from the local hyperlink structure.
Construct a protein similarity network.
Add the query sequence.
Diffuse weights through the network.
Rank proteins upon convergence.
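
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' algorithm: the function name `rank_by_diffusion`, the parameter `alpha`, the column normalization, and the update rule `y = y0 + alpha * W.T @ y` are all choices made here so that the iteration provably converges.

```python
import numpy as np

def rank_by_diffusion(K, query_sim, alpha=0.9, tol=1e-10, max_iter=1000):
    """Rank database proteins by diffusing query similarity over a network.

    K         : (n, n) non-negative weights, K[j, i] = weight of edge j -> i
    query_sim : (n,) non-negative similarity of the query to each protein
    alpha     : diffusion strength in (0, 1); an assumed parameter
    """
    K = np.asarray(K, dtype=float)
    y0 = np.asarray(query_sim, dtype=float)
    # Normalize incoming weights so every column sums to 1 (or 0).
    # With alpha < 1 this makes the update a contraction, so it converges.
    col_sums = K.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    W = K / col_sums
    y = y0.copy()
    for _ in range(max_iter):
        y_next = y0 + alpha * W.T @ y
        converged = np.max(np.abs(y_next - y)) < tol
        y = y_next
        if converged:
            break
    # Highest activation first.
    return np.argsort(-y), y

# Toy chain 0 - 1 - 2; the query is directly similar to protein 0 only.
K = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
order, scores = rank_by_diffusion(K, [1.0, 0.0, 0.0])
```

In the toy chain, protein 2 has no direct similarity to the query but still receives a positive score through protein 1, which is the point of using global rather than purely local structure.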

6
Algorithm
7
Experiment

Use the protein 3-D structure database SCOP as the gold
standard.
Sequences have no more than 95% similarity.
The 7329 proteins are split into 379 superfamilies
for training and 332 for testing.
Three networks are generated using BLAST and
PSI-BLAST.

8
Experiment

Value
Compare with two other experiments:
1. only local structure is considered
2. non-local edges are used, but without weak edges
The results show that the second one is only
slightly worse than our algorithm.

where Sj(i) is the E-value assigned to protein i
given query j.
9
Analysis
Bower et al., Science, vol. 306, 2004
Cluster structure
10
Motif-based protein ranking by network propagation

Author: Kuang Rui et al.
Bioinformatics
Presented by Tie Wang

11
Outline

Introduction
Background
Method
Experiment
Analysis

12
Background

Direct measures of pairwise sequence similarity have
proved effective for classification.
Performance drops when detecting subtle,
remotely homologous sequences.
Such sequences share a conserved structure, at
least in some components.
The problem is formulated based on this observation.

13
Protein motif bipartite network

Each protein contains a set of motifs.
Each motif belongs to a set of proteins.
These relationships are mapped to a
bipartite graph, as shown on the left.
The edge weight indicates the probability
that motif x is in protein y.
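
As a rough illustration of the bipartite graph described above, the connectivity can be stored as a protein-by-motif matrix. The helper `build_connectivity`, the protein and motif names, and the weights are hypothetical toy values, not data from the presentation.

```python
import numpy as np

def build_connectivity(protein_motifs, proteins, motifs):
    """Return H with H[p, f] = weight of the edge between protein p and motif f."""
    p_index = {p: i for i, p in enumerate(proteins)}
    f_index = {f: j for j, f in enumerate(motifs)}
    H = np.zeros((len(proteins), len(motifs)))
    for protein, hits in protein_motifs.items():
        for motif, weight in hits.items():
            H[p_index[protein], f_index[motif]] = weight
    return H

# Hypothetical toy data: weight = probability that the motif occurs in the protein.
protein_motifs = {
    "protA": {"m1": 0.9, "m2": 0.4},
    "protB": {"m2": 0.7},
    "protC": {"m1": 0.2, "m3": 0.8},
}
H = build_connectivity(protein_motifs,
                       ["protA", "protB", "protC"],
                       ["m1", "m2", "m3"])
```

Rows of `H` correspond to proteins and columns to motifs; a zero entry means there is no edge between that protein and that motif.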

14
MotifProp Algorithm

Set P represents protein sequences and set F
represents motifs. H is the connectivity matrix.

A row-normalized version of H is used.
A vector of initial values is defined for F.
A vector of initial values is defined for P.
15
MotifProp Algorithm

The convergence of MotifProp is guaranteed.
The problem is reformulated based on the
following rule,

A row-normalized version of H is used.
A vector of initial values is defined for F.
A vector of initial values is defined for P.
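
The update rule itself did not survive in this transcript, so the following is only a sketch of one propagation scheme of this kind over a protein-motif matrix H. The update form, the parameter `alpha`, and the names `row_normalize` and `motif_propagation` are assumptions made for illustration. Because both normalized matrices have rows summing to at most 1 and `alpha` is below 1, this particular iteration is a contraction and therefore converges.

```python
import numpy as np

def row_normalize(M):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    sums = M.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    return M / sums

def motif_propagation(H, p0, f0, alpha=0.8, tol=1e-10, max_iter=1000):
    """Alternate activation between protein nodes (P) and motif nodes (F)."""
    H = np.asarray(H, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    H_pf = row_normalize(H)      # motifs -> proteins
    H_fp = row_normalize(H.T)    # proteins -> motifs
    p, f = p0.copy(), f0.copy()
    for _ in range(max_iter):
        # Assumed update: mix the initial values with propagated activation.
        p_next = (1 - alpha) * p0 + alpha * H_pf @ f
        f_next = (1 - alpha) * f0 + alpha * H_fp @ p_next
        delta = max(np.max(np.abs(p_next - p)), np.max(np.abs(f_next - f)))
        p, f = p_next, f_next
        if delta < tol:
            break
    return p, f

# Toy graph: protein 0 has motif 0, protein 1 has both, protein 2 has motif 1.
H_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0]])
p, f = motif_propagation(H_toy, p0=[1.0, 0.0, 0.0], f0=[0.0, 0.0])
```

In the toy graph, activation starts on protein 0 only, reaches protein 1 through their shared motif, and reaches protein 2 through the motif it shares with protein 1.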
16
Edge weighting scheme

A PSI-BLAST E-value is assigned to each pair of
protein nodes.
Gaussian edge weights are calculated from the E-values.
The Gaussian weights from the query to each protein
are assigned as initial values.
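
The slide gives no formula for the edge weights. A minimal sketch, assuming the weight decays exponentially with the E-value as exp(-E / sigma), is shown below; both this functional form and the default `sigma` are assumptions chosen for illustration, and the function name `evalue_to_weight` is hypothetical.

```python
import numpy as np

def evalue_to_weight(evalues, sigma=100.0):
    """Map E-values to edge weights in (0, 1]; smaller E-value, larger weight.

    The form exp(-E / sigma) and the value of sigma are assumptions.
    """
    e = np.asarray(evalues, dtype=float)
    return np.exp(-e / sigma)

# Hypothetical E-values from a query to three database proteins.
w = evalue_to_weight([0.0, 1.0, 100.0])
```

A strong hit (E-value near 0) maps to a weight near 1, while weak hits are pushed toward 0; `sigma` controls how quickly the weight falls off.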

17
Value estimation

Sq(i) is the E-value between protein i and query q.
Eq(j) is the E-value between the jth motif and the ith
protein.

18
Estimation on substitution score

The substitution score between a k-mer f and a sequence
x can be estimated as,
where
and
sl is a log value, which implies that an S score
below the threshold can count as a motif hit against
sequence x.

19
Sequential MotifProp

Empirical experiments suggest that using a
weighted linear combination of multiple motifs
does not improve the results.
A simple scheme for multiple motif sets is applied instead.
Motif nodes F can be partitioned into n sets,
where F(i) is
the set of motifs from the ith motif set.
Each F set, rather than an individual motif,
represents the motifs.

20
Motif-rich regions
21
Experiments

7329 protein domains with known 3D structure from
SCOP.
They are divided into training (4246) and testing
(3083) sets.
An additional 10602 sequences from the Swiss-Prot database are used.
Evaluation is based on ROC curves.
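
To make the ROC-based evaluation concrete, the area under the ROC curve for one query can be computed directly from ranking scores and true labels. This is a generic sketch, not the evaluation code used in the experiments; the function name `roc_auc` and the toy scores are illustrative.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    is scored higher; ties count as half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Toy query: label 1 means the protein is in the same superfamily as the query.
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed = roc_auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0])
```

A ranking that places every true homolog above every non-homolog scores 1.0, and a ranking no better than chance scores about 0.5.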

22
Results of classification
23
Results of classification (cont)
24
Results on Motif rich region
25
Conclusion

Two methods for protein classification
using protein ranking are presented.
The similarity matrix and the protein/motif propagation
network are the base structures.
The methods are simple, but the formulation is innovative.
Results are better than those of current approaches.
Analysis of the results plays an important role.
