Title: Testing sequence comparison methods with structure similarity
1. Testing sequence comparison methods with structure similarity
- at Organon, Oss
- 2006-02-07
- Tim Hulsen
2. Introduction
- Main goal: transfer function of proteins in model organisms to proteins in humans
- Make use of orthology: proteins evolved from a common ancestor in different species (very similar function!)
- Several ortholog identification methods, relying on:
  - Sequence comparisons
  - (Phylogenies)
3. Introduction
- Quality of ortholog identification depends on:
  - 1.) Quality of the sequence comparison algorithm
    - Smith-Waterman vs. BLAST, FASTA, etc.
    - Z-value vs. E-value
  - 2.) Quality of the ortholog identification itself (phylogenies, clustering, etc.)
- 2.) → previous research
- 1.) → this presentation
4. Previous research
- Comparison of several ortholog identification methods
- Orthologs should have similar function
- Functional data of orthologs should behave similarly:
  - Gene expression data
  - Protein interaction data
  - InterPro IDs
  - Gene order
5. Orthology method comparison
- Compared methods:
  - BBH: Best Bidirectional Hit
  - INP: InParanoid
  - KOG: euKaryotic Orthologous Groups
  - MCL: OrthoMCL
  - PGT: PhyloGenetic Tree
  - Z1H: Z-value > 1 Hundred
6. Orthology method comparison
- e.g. correlation in expression profiles
- Affymetrix human and mouse expression data, using SNOMED tissue classification
- Check if the expression profile of a protein is similar to the expression profile of its ortholog (sketch below)
- (figure: Hs vs. Mm expression profiles)
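A minimal sketch of this expression-profile check, assuming each gene's profile is a vector of expression values over the same ordered SNOMED tissue categories; the gene names and values below are made up for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles of a human gene (Hs) and its mouse
# ortholog (Mm), measured over the same ordered SNOMED tissue categories.
hs_profile = [120.0, 85.0, 300.0, 15.0, 42.0]
mm_profile = [110.0, 90.0, 280.0, 20.0, 55.0]

print(f"expression correlation Hs vs Mm: {pearson(hs_profile, mm_profile):.2f}")
```

A high correlation between the Hs and Mm profiles supports the ortholog assignment of that method.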
7. Orthology method comparison
8. Orthology method comparison
- e.g. conservation of protein interactions
- DIP (Database of Interacting Proteins)
- Check if the orthologs of two interacting proteins are still interacting in the other species → calculate the fraction (sketch below)
- (figure: interacting Hs protein pair vs. orthologous Mm protein pair)
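A minimal sketch of this interaction-conservation check, assuming interaction sets (e.g. taken from DIP) are given as unordered protein pairs and an ortholog map links human to mouse identifiers; all identifiers below are hypothetical:

```python
from itertools import product

# Hypothetical data: interacting pairs per species and an orthology map
# (human protein -> set of mouse orthologs).
hs_interactions = {frozenset({"HsA", "HsB"}), frozenset({"HsC", "HsD"})}
mm_interactions = {frozenset({"MmA", "MmB"})}
orthologs = {"HsA": {"MmA"}, "HsB": {"MmB"}, "HsC": {"MmC"}, "HsD": {"MmD"}}

def conserved_fraction(hs_ints, mm_ints, orth):
    """Fraction of Hs interactions whose orthologous Mm pair also interacts."""
    testable = conserved = 0
    for pair in hs_ints:
        p, q = tuple(pair)
        if p not in orth or q not in orth:
            continue                      # no orthologs -> pair not testable
        testable += 1
        # conserved if any combination of the orthologs interacts in mouse
        if any(frozenset({m1, m2}) in mm_ints
               for m1, m2 in product(orth[p], orth[q])):
            conserved += 1
    return conserved / testable if testable else 0.0

print(f"conserved fraction: "
      f"{conserved_fraction(hs_interactions, mm_interactions, orthologs):.2f}")
```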
9. Orthology method comparison
10. Orthology method comparison
11. Orthology method comparison
- Trade-off between sensitivity and selectivity
- BBH and INP are most sensitive but also most selective
- Results can differ depending on which sequence comparison algorithm is used:
  - BLAST, FASTA, Smith-Waterman?
  - E-value or Z-value?
12. E-value or Z-value?
- Smith-Waterman with Z-value statistics
- 100 randomized shuffles to test the significance of the SW score (sketch below)
- (figure: original sequence MFTGQEYHSV and shuffles, e.g. GQHMSVFTEY, YMSHQFTVGE, etc.; histogram of shuffled SW scores (number of seqs vs. SW score); an original score 5 SD above the shuffled mean gives Z = 5, i.e. Z = (SW_ori − mean(SW_rnd)) / SD(SW_rnd))
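A minimal sketch of the Z-value idea, using Biopython's PairwiseAligner as a stand-in Smith-Waterman scorer (the study itself used Biofacet); for brevity it shuffles only the query, whereas the full 2 x 100 protocol shuffles each sequence in turn:

```python
import random
from statistics import mean, stdev
from Bio.Align import PairwiseAligner, substitution_matrices

# Stand-in SW scorer configured with the parameters used in the study
# (BLOSUM62, gap open 12, gap extension 1); note that gap-penalty
# conventions differ slightly between tools.
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -12
aligner.extend_gap_score = -1

def z_value(query, target, n_shuffles=100):
    """Number of SDs the real SW score lies above the shuffled-score mean."""
    original = aligner.score(query, target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        residues = list(query)
        random.shuffle(residues)          # randomize the residue order
        shuffled_scores.append(aligner.score("".join(residues), target))
    return (original - mean(shuffled_scores)) / stdev(shuffled_scores)

# Query taken from the slide's figure; the target sequence is made up.
print(z_value("MFTGQEYHSV", "MFSGQDYHSV"))
```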
13. E-value or Z-value?
- Z-value calculation takes much time (2 x 100 randomizations)
- Comet et al. (1999) and Bastien et al. (2004): the Z-value is theoretically more sensitive and more selective than the E-value?
- The advantage of the Z-value has never been proven by experimental results
14. How to compare?
- Structural comparison is better than sequence comparison
- ASTRAL SCOP: Structural Classification Of Proteins
- e.g. a.2.1.3, c.1.2.4 (class.fold.superfamily.family): same number = same structure (sketch below)
- Use the structural classification as a benchmark for sequence comparison methods
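A minimal sketch of using SCOP identifiers as the benchmark: two domains count as structurally related (a true positive at family level) when their identifiers agree down to the family position:

```python
def same_scop_level(id_a, id_b, level="family"):
    """Compare SCOP classification strings such as 'a.2.1.3'
    (class.fold.superfamily.family) down to the requested level."""
    depth = {"class": 1, "fold": 2, "superfamily": 3, "family": 4}[level]
    return id_a.split(".")[:depth] == id_b.split(".")[:depth]

print(same_scop_level("a.2.1.3", "a.2.1.3"))                  # True: same family
print(same_scop_level("a.2.1.3", "c.1.2.4"))                  # False: different fold
print(same_scop_level("a.2.1.3", "a.2.1.7", "superfamily"))   # True: same superfamily
```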
15. ASTRAL SCOP statistics
16. Methods (1)
- Smith-Waterman algorithms
  - dynamic programming: computationally intensive
- Paracel with e-value (PA E)
  - SW implementation of Paracel
- Biofacet with z-value (BF Z)
  - SW implementation of Gene-IT
- ParAlign with e-value (PA E)
  - SW implementation of Sencel
- SSEARCH with e-value (SS E)
  - SW implementation of FASTA (see next page)
17. Methods (2)
- Heuristic algorithms
- FASTA (FA E)
  - Pearson & Lipman, 1988
  - Heuristic approximation; performs better than BLAST with strongly diverged proteins
- BLAST (BL E)
  - Altschul et al., 1990
  - Heuristic approximation; stretches local alignments (HSPs) to a global alignment
  - Should be faster than FASTA
18. Method parameters
- All methods:
  - matrix: BLOSUM62
  - gap open penalty: 12
  - gap extension penalty: 1
- Biofacet with z-value: 100 randomizations
19. Receiver Operating Characteristic
- R.O.C.: statistical measure, mostly used in clinical medicine
- Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis
20. ROC50 Example
- Take the 100 best hits
- True positives are in the same SCOP family; false positives are not in the same family
- For each of the first 50 false positives, count the number of true positives higher in the list (0, 4, 4, 4, 5, 5, 6, 9, 12, 12, 12, 12, 12)
- Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (size of the family - 1) → ROC50 (0.167)
- Take the average of the ROC50 scores over all entries (sketch below)
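A minimal sketch of this ROC50 procedure for a single query, given the ranked hit list as booleans (True = hit in the same SCOP family); the toy data are made up:

```python
def roc_n(ranked_is_tp, total_possible_tps, n=50):
    """ROC50-style score from a ranked hit list.

    ranked_is_tp: booleans, best hit first; True = same SCOP family.
    total_possible_tps: size of the query's SCOP family minus one.
    """
    tp_count = 0
    tps_above_fp = []                     # true positives seen before each FP
    for is_tp in ranked_is_tp:
        if is_tp:
            tp_count += 1
        else:
            tps_above_fp.append(tp_count)
            if len(tps_above_fp) == n:    # stop at the n-th false positive
                break
    return sum(tps_above_fp) / (n * total_possible_tps)

# Toy example: 100 ranked hits with a few true positives near the top.
hits = [True, True, False, True] + [False] * 96
print(f"ROC50 = {roc_n(hits, total_possible_tps=10):.3f}")
```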
21. ROC50 results
22. Coverage vs. Error
- C.V.E.: Coverage vs. Error (Brenner et al., 1998)
- E.P.Q. (errors per query): selectivity indicator (how many false positives?)
- Coverage: sensitivity indicator (how many true positives out of the total?)
23. CVE Example
- Vary the threshold at which a hit is counted as a positive, e.g. E-value 10, 1, 0.1, 0.01
- True positives are in the same SCOP family; false positives are not in the same family
- For each threshold, calculate the coverage: the number of true positives divided by the total number of possible true positives
- For each threshold, calculate the errors-per-query: the number of false positives divided by the number of queries (sketch below)
- Plot coverage on the x-axis and errors-per-query on the y-axis → bottom right is best
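A minimal sketch of the coverage and errors-per-query calculation, assuming hits are pooled over all queries as (E-value, true positive?) pairs; the hit list and counts below are made up:

```python
def coverage_vs_error(hits, total_possible_tps, n_queries, thresholds):
    """Coverage and errors-per-query at each E-value threshold.

    hits: list of (e_value, is_true_positive) pairs pooled over all queries.
    """
    points = []
    for t in thresholds:
        accepted = [is_tp for e, is_tp in hits if e <= t]
        tps = sum(accepted)
        fps = len(accepted) - tps
        coverage = tps / total_possible_tps     # sensitivity
        epq = fps / n_queries                   # selectivity (errors per query)
        points.append((t, coverage, epq))
    return points

# Hypothetical pooled hit list: (E-value, same SCOP family?)
hits = [(1e-30, True), (1e-5, True), (0.02, False),
        (0.5, True), (3.0, False), (8.0, False)]
for t, cov, epq in coverage_vs_error(hits, total_possible_tps=5, n_queries=2,
                                     thresholds=[10, 1, 0.1, 0.01]):
    print(f"E <= {t:>5}: coverage={cov:.2f}, errors/query={epq:.2f}")
```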
24. CVE results
- (only PDB095)
25. Mean Average Precision
- A.P.: borrowed from information retrieval (Salton, 1991)
- Recall: true positives divided by the number of homologs
- Precision: true positives divided by the number of hits
- A.P.: approximate integral to calculate the area under the recall-precision curve
26. Mean AP Example
- Take the 100 best hits
- True positives are in the same SCOP family; false positives are not in the same family
- For each of the true positives, divide its true-positive rank (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) by its overall rank in the hit list (2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20)
- Divide the sum of all of these numbers by the total number of hits (100) → AP (0.140)
- Take the average of the AP scores over all entries → mean AP (sketch below)
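A minimal sketch of the per-query average precision, following the recall/precision definitions above: the precision at each true positive (its true-positive rank divided by its overall rank) is summed and, as an assumption here, normalized by the number of possible true positives (homologs); the toy ranks mirror the example above:

```python
def average_precision(ranked_is_tp, n_homologs):
    """Per-query average precision over a ranked hit list.

    ranked_is_tp: booleans, best hit first; True = same SCOP family.
    n_homologs: number of possible true positives for this query.
    """
    tp_so_far = 0
    precision_sum = 0.0
    for rank, is_tp in enumerate(ranked_is_tp, start=1):
        if is_tp:
            tp_so_far += 1
            precision_sum += tp_so_far / rank   # true-positive rank / overall rank
    return precision_sum / n_homologs           # assumed normalization (see lead-in)

# Toy ranked list with true positives at overall ranks 2, 3, 4, 5, 9, 12.
hits = [False, True, True, True, True, False, False, False, True,
        False, False, True] + [False] * 88
print(f"AP = {average_precision(hits, n_homologs=50):.3f}")
```

The mean AP is then the average of these per-query AP values over all entries.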
27. Mean AP results
28. Time consumption
- PDB095 all-against-all comparison
- Biofacet: multiple days (Z-value calculation!)
- BLAST: 2d 4h 16m
- SSEARCH: 5h 49m
- ParAlign: 47m
- FASTA: 40m
29. Preliminary conclusions
- SSEARCH gives the best results
- When time is important, FASTA is a good alternative
- The Z-value seems to have no advantage over the E-value
30. Problems
- Bias in PDB?
  - Sequence length
  - Amino acid composition
- Difference in matrices?
- Difference in SW implementations?
31. Bias in PDB: sequence length?
→ Yes! Short sequences are over-represented in the ASTRAL SCOP PDB sets
32. Bias in PDB: amino acid distribution?
→ No! Approximately equal amino acid distribution in the ASTRAL SCOP PDB sets
33. Difference in matrices?
34. Difference in SW implementations?
35. Conclusions
- E-value better than Z-value!
- The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with e-value scores best of all
- A larger structural comparison database is needed for better analysis
36. Credits
- NV Organon
  - Peter Groenen
  - Wilco Fleuren
- Wageningen UR
  - Jack Leunissen