Title: Testing sequence comparison methods with structure similarity
1. Testing sequence comparison methods with structure similarity
- at Organon, Oss
- 2006-02-07
- Tim Hulsen
2. Introduction
- Main goal: transfer function of proteins in model organisms to proteins in humans
- Make use of orthology: proteins evolved from a common ancestor in different species (very similar function!)
- Several ortholog identification methods, relying on:
  - Sequence comparisons
  - (Phylogenies)
3. Introduction
- Quality of ortholog identification depends on:
  - 1.) Quality of the sequence comparison algorithm
    - Smith-Waterman vs. BLAST, FASTA, etc.
    - Z-value vs. E-value
  - 2.) Quality of the ortholog identification itself (phylogenies, clustering, etc.)
- 2.) → previous research
- 1.) → this presentation
4. Previous research
- Comparison of several ortholog identification methods
- Orthologs should have similar function
- Functional data of orthologs should behave similarly:
  - Gene expression data
  - Protein interaction data
  - InterPro IDs
  - Gene order
5. Orthology method comparison
- Compared methods:
  - BBH: Best Bidirectional Hit
  - INP: InParanoid
  - KOG: euKaryotic Orthologous Groups
  - MCL: OrthoMCL
  - PGT: PhyloGenetic Tree
  - Z1H: Z-value > 1 Hundred
6. Orthology method comparison
- e.g. correlation in expression profiles
- Affymetrix human and mouse expression data, using SNOMED tissue classification
- Check if the expression profile of a protein is similar to the expression profile of its ortholog (sketch below)
- (figure: Hs vs. Mm expression profiles)
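A minimal sketch of this expression-profile check, assuming each gene's profile is a vector of expression values over the same ordered SNOMED tissue categories; the gene names and values below are made up for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles of a human gene (Hs) and its mouse
# ortholog (Mm), measured over the same ordered SNOMED tissue categories.
hs_profile = [120.0, 85.0, 300.0, 15.0, 42.0]
mm_profile = [110.0, 90.0, 280.0, 20.0, 55.0]

print(f"expression correlation Hs vs Mm: {pearson(hs_profile, mm_profile):.2f}")
```

A high correlation between the Hs and Mm profiles supports the ortholog assignment of that method.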
7. Orthology method comparison
8. Orthology method comparison
- e.g. conservation of protein interactions
- DIP (Database of Interacting Proteins)
- Check if the orthologs of two interacting proteins are still interacting in the other species → calculate the fraction (sketch below)
- (figure: interacting Hs protein pair vs. orthologous Mm protein pair)
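A minimal sketch of this interaction-conservation check, assuming interaction sets (e.g. taken from DIP) are given as unordered protein pairs and an ortholog map links human to mouse identifiers; all identifiers below are hypothetical:

```python
from itertools import product

# Hypothetical data: interacting pairs per species and an orthology map
# (human protein -> set of mouse orthologs).
hs_interactions = {frozenset({"HsA", "HsB"}), frozenset({"HsC", "HsD"})}
mm_interactions = {frozenset({"MmA", "MmB"})}
orthologs = {"HsA": {"MmA"}, "HsB": {"MmB"}, "HsC": {"MmC"}, "HsD": {"MmD"}}

def conserved_fraction(hs_ints, mm_ints, orth):
    """Fraction of Hs interactions whose orthologous Mm pair also interacts."""
    testable = conserved = 0
    for pair in hs_ints:
        p, q = tuple(pair)
        if p not in orth or q not in orth:
            continue                      # no orthologs -> pair not testable
        testable += 1
        # conserved if any combination of the orthologs interacts in mouse
        if any(frozenset({m1, m2}) in mm_ints
               for m1, m2 in product(orth[p], orth[q])):
            conserved += 1
    return conserved / testable if testable else 0.0

print(f"conserved fraction: "
      f"{conserved_fraction(hs_interactions, mm_interactions, orthologs):.2f}")
```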
9. Orthology method comparison
10. Orthology method comparison
11. Orthology method comparison
- Trade-off between sensitivity and selectivity
- BBH and INP are most sensitive but also most selective
- Results can differ depending on which sequence comparison algorithm is used:
  - BLAST, FASTA, Smith-Waterman?
  - E-value or Z-value?
12. E-value or Z-value?
- Smith-Waterman with Z-value statistics
- 100 randomized shuffles to test the significance of the SW score (sketch below)
- (figure: original sequence MFTGQEYHSV and shuffles, e.g. GQHMSVFTEY, YMSHQFTVGE, etc.; histogram of shuffled SW scores (number of seqs vs. SW score); an original score 5 SD above the shuffled mean gives Z = 5, i.e. Z = (SW_ori − mean(SW_rnd)) / SD(SW_rnd))
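A minimal sketch of the Z-value idea, using Biopython's PairwiseAligner as a stand-in Smith-Waterman scorer (the study itself used Biofacet); for brevity it shuffles only the query, whereas the full 2 x 100 protocol shuffles each sequence in turn:

```python
import random
from statistics import mean, stdev
from Bio.Align import PairwiseAligner, substitution_matrices

# Stand-in SW scorer configured with the parameters used in the study
# (BLOSUM62, gap open 12, gap extension 1); note that gap-penalty
# conventions differ slightly between tools.
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -12
aligner.extend_gap_score = -1

def z_value(query, target, n_shuffles=100):
    """Number of SDs the real SW score lies above the shuffled-score mean."""
    original = aligner.score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(query)
        random.shuffle(residues)          # randomize the residue order
        shuffled_scores.append(aligner.score("".join(residues), target))
    return (original - mean(shuffled_scores)) / stdev(shuffled_scores)

# Query taken from the slide's figure; the target sequence is made up.
print(z_value("MFTGQEYHSV", "MFSGQDYHSV"))
```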
13. E-value or Z-value?
- Z-value calculation takes much time (2 x 100 randomizations)
- Comet et al. (1999) and Bastien et al. (2004): the Z-value is theoretically more sensitive and more selective than the E-value?
- The advantage of the Z-value has never been proven by experimental results
14. How to compare?
- Structural comparison is better than sequence comparison
- ASTRAL SCOP: Structural Classification Of Proteins
- e.g. a.2.1.3, c.1.2.4 (class.fold.superfamily.family): same number = same structure (sketch below)
- Use the structural classification as a benchmark for sequence comparison methods
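A minimal sketch of using SCOP identifiers as the benchmark: two domains count as structurally related (a true positive at family level) when their identifiers agree down to the family position:

```python
def same_scop_level(id_a, id_b, level="family"):
    """Compare SCOP classification strings such as 'a.2.1.3'
    (class.fold.superfamily.family) down to the requested level."""
    depth = {"class": 1, "fold": 2, "superfamily": 3, "family": 4}[level]
    return id_a.split(".")[:depth] == id_b.split(".")[:depth]

print(same_scop_level("a.2.1.3", "a.2.1.3"))                  # True: same family
print(same_scop_level("a.2.1.3", "c.1.2.4"))                  # False: different fold
print(same_scop_level("a.2.1.3", "a.2.1.7", "superfamily"))   # True: same superfamily
```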
15. ASTRAL SCOP statistics
16. Methods (1)
- Smith-Waterman algorithms
  - dynamic programming: computationally intensive
- Paracel with e-value (PA E)
  - SW implementation of Paracel
- Biofacet with z-value (BF Z)
  - SW implementation of Gene-IT
- ParAlign with e-value (PA E)
  - SW implementation of Sencel
- SSEARCH with e-value (SS E)
  - SW implementation of FASTA (see next page)
17. Methods (2)
- Heuristic algorithms
- FASTA (FA E)
  - Pearson & Lipman, 1988
  - Heuristic approximation; performs better than BLAST with strongly diverged proteins
- BLAST (BL E)
  - Altschul et al., 1990
  - Heuristic approximation; stretches local alignments (HSPs) to a global alignment
  - Should be faster than FASTA
18. Method parameters
- All methods:
  - matrix: BLOSUM62
  - gap open penalty: 12
  - gap extension penalty: 1
- Biofacet with z-value: 100 randomizations
19. Receiver Operating Characteristic
- R.O.C.: statistical measure, mostly used in clinical medicine
- Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis
20. ROC50 Example
- Take the 100 best hits
- True positives are in the same SCOP family; false positives are not in the same family
- For each of the first 50 false positives, count the number of true positives higher in the list (0, 4, 4, 4, 5, 5, 6, 9, 12, 12, 12, 12, 12)
- Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (size of the family - 1) → ROC50 (0.167)
- Take the average of the ROC50 scores over all entries (sketch below)
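A minimal sketch of this ROC50 procedure for a single query, given the ranked hit list as booleans (True = hit in the same SCOP family); the toy data are made up:

```python
def roc_n(ranked_is_tp, total_possible_tps, n=50):
    """ROC50-style score from a ranked hit list.

    ranked_is_tp: booleans, best hit first; True = same SCOP family.
    total_possible_tps: size of the query's SCOP family minus one.
    """
    tp_count = 0
    tps_above_fp = []                     # true positives seen before each FP
    for is_tp in ranked_is_tp:
        if is_tp:
            tp_count += 1
        else:
            tps_above_fp.append(tp_count)
            if len(tps_above_fp) == n:    # stop at the n-th false positive
                break
    return sum(tps_above_fp) / (n * total_possible_tps)

# Toy example: 100 ranked hits with a few true positives near the top.
hits = [True, True, False, True] + [False] * 96
print(f"ROC50 = {roc_n(hits, total_possible_tps=10):.3f}")
```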
21. ROC50 results
22. Coverage vs. Error
- C.V.E.: Coverage vs. Error (Brenner et al., 1998)
- E.P.Q. (errors per query): selectivity indicator (how many false positives?)
- Coverage: sensitivity indicator (how many true positives out of the total?)
23. CVE Example
- Vary the threshold at which a hit is counted as a positive, e.g. E-value 10, 1, 0.1, 0.01
- True positives are in the same SCOP family; false positives are not in the same family
- For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives
- For each threshold, calculate the errors-per-query: the number of false positives divided by the number of queries (sketch below)
- Plot coverage on the x-axis and errors-per-query on the y-axis → bottom right is best
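A minimal sketch of the coverage and errors-per-query calculation, assuming hits are pooled over all queries as (E-value, true positive?) pairs; the hit list and counts below are made up:

```python
def coverage_vs_error(hits, total_possible_tps, n_queries, thresholds):
    """Coverage and errors-per-query at each E-value threshold.

    hits: list of (e_value, is_true_positive) pairs pooled over all queries.
    """
    points = []
    for t in thresholds:
        accepted = [is_tp for e, is_tp in hits if e <= t]
        tps = sum(accepted)
        fps = len(accepted) - tps
        coverage = tps / total_possible_tps     # sensitivity
        epq = fps / n_queries                   # selectivity (errors per query)
        points.append((t, coverage, epq))
    return points

# Hypothetical pooled hit list: (E-value, same SCOP family?)
hits = [(1e-30, True), (1e-5, True), (0.02, False),
        (0.5, True), (3.0, False), (8.0, False)]
for t, cov, epq in coverage_vs_error(hits, total_possible_tps=5, n_queries=2,
                                     thresholds=[10, 1, 0.1, 0.01]):
    print(f"E <= {t:>5}: coverage={cov:.2f}, errors/query={epq:.2f}")
```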
24. CVE results
- (only PDB095)
25. Mean Average Precision
- A.P.: borrowed from information retrieval (Salton, 1991)
- Recall: true positives divided by the number of homologs
- Precision: true positives divided by the number of hits
- A.P.: approximate integral to calculate the area under the recall-precision curve
26. Mean AP Example
- Take the 100 best hits
- True positives are in the same SCOP family; false positives are not in the same family
- For each of the true positives, divide its true-positive rank (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) by its overall rank in the hit list (2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20)
- Divide the sum of all of these numbers by the total number of hits (100) → AP (0.140)
- Take the average of the AP scores over all entries → mean AP (sketch below)
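A minimal sketch of the per-query average precision, following the recall/precision definitions above: the precision at each true positive (its true-positive rank divided by its overall rank) is summed and, as an assumption here, normalized by the number of possible true positives (homologs); the toy ranks mirror the example above:

```python
def average_precision(ranked_is_tp, n_homologs):
    """Per-query average precision over a ranked hit list.

    ranked_is_tp: booleans, best hit first; True = same SCOP family.
    n_homologs: number of possible true positives for this query.
    """
    tp_so_far = 0
    precision_sum = 0.0
    for rank, is_tp in enumerate(ranked_is_tp, start=1):
        if is_tp:
            tp_so_far += 1
            precision_sum += tp_so_far / rank   # true-positive rank / overall rank
    return precision_sum / n_homologs           # assumed normalization (see lead-in)

# Toy ranked list with true positives at overall ranks 2, 3, 4, 5, 9, 12.
hits = [False, True, True, True, True, False, False, False, True,
        False, False, True] + [False] * 88
print(f"AP = {average_precision(hits, n_homologs=50):.3f}")
```

The mean AP is then the average of these per-query AP values over all entries.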
27. Mean AP results
28. Time consumption
- PDB095 all-against-all comparison
- Biofacet: multiple days (Z-value calculation!)
- BLAST: 2d 4h 16m
- SSEARCH: 5h 49m
- ParAlign: 47m
- FASTA: 40m
29. Preliminary conclusions
- SSEARCH gives the best results
- When time is important, FASTA is a good alternative
- The Z-value seems to have no advantage over the E-value
30. Problems
- Bias in PDB?
  - Sequence length
  - Amino acid composition
- Difference in matrices?
- Difference in SW implementations?
31. Bias in PDB: sequence length?
→ Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets
32. Bias in PDB: amino acid distribution?
→ No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets
33. Difference in matrices?
34. Difference in SW implementations?
35. Conclusions
- E-value better than Z-value!
- The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all
- A larger structural comparison database is needed for better analysis
36. Credits
- NV Organon
  - Peter Groenen
  - Wilco Fleuren
- Wageningen UR
  - Jack Leunissen