1
Testing sequence comparison methods with
structure similarity
  • @ Organon, Oss
  • 2006-02-07
  • Tim Hulsen

2
Introduction
  • Main goal: transfer function of proteins in model
    organisms to proteins in humans
  • Make use of orthology: proteins evolved from a
    common ancestor in different species (very
    similar function!)
  • Several ortholog identification methods, relying
    on:
  • Sequence comparisons
  • (Phylogenies)

3
Introduction
  • Quality of ortholog identification depends on:
  • 1.) Quality of the sequence comparison algorithm
  • - Smith-Waterman vs. BLAST, FASTA, etc.
  • - Z-value vs. E-value
  • 2.) Quality of the ortholog identification itself
    (phylogenies, clustering, etc.)
  • 2 → previous research
  • 1 → this presentation

4
Previous research
  • Comparison of several ortholog identification
    methods
  • Orthologs should have similar function
  • Functional data of orthologs should behave
    similarly
  • Gene expression data
  • Protein interaction data
  • Interpro IDs
  • Gene order

5
Orthology method comparison
  • Compared methods
  • BBH, Best Bidirectional Hit
  • INP, InParanoid
  • KOG, euKaryotic Orthologous Groups
  • MCL, OrthoMCL
  • PGT, PhyloGenetic Tree
  • Z1H, Z-value > 1 Hundred

6
Orthology method comparison
  • e.g. correlation in expression profiles
  • Affymetrix human and mouse expr. data, using
    SNOMED tissue classification
  • Check if the expression profile of a protein is
    similar to the expression profile of its ortholog

[Diagram: expression profile of a human (Hs) gene compared with that of its mouse (Mm) ortholog across tissues]
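
A minimal sketch of this check in Python, assuming hypothetical per-tissue expression values (the real analysis used Affymetrix data with SNOMED tissue classes):

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical expression values, same tissue order in both species
hs_profile = [5.2, 1.1, 3.8, 0.4, 2.9]  # human gene
mm_profile = [4.9, 0.8, 4.1, 0.6, 3.3]  # its mouse ortholog

# A high correlation suggests the orthologs behave similarly across tissues
print(f"expression profile correlation: {correlation(hs_profile, mm_profile):.2f}")
```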
7
Orthology method comparison
8
Orthology method comparison
  • e.g. conservation of protein interaction
  • DIP (Database of Interacting Proteins)
  • Check if the orthologs of two interacting
    proteins are still interacting in the other
    species → calculate fraction

[Diagram: two interacting human (Hs) proteins and their mouse (Mm) orthologs]
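
A sketch of the fraction calculation, with hypothetical data structures (DIP provides the actual interaction lists):

```python
def conserved_fraction(hs_pairs, mm_pairs, ortholog):
    """Fraction of human interactions conserved in mouse.

    hs_pairs:  iterable of (proteinA, proteinB) human interactions
    mm_pairs:  set of frozenset({a, b}) mouse interactions
    ortholog:  dict mapping human protein ids to mouse ortholog ids
    """
    conserved = total = 0
    for a, b in hs_pairs:
        if a in ortholog and b in ortholog:  # both partners have mouse orthologs
            total += 1
            if frozenset((ortholog[a], ortholog[b])) in mm_pairs:
                conserved += 1
    return conserved / total if total else 0.0
```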
9
Orthology method comparison
10
Orthology method comparison
11
Orthology method comparison
  • Trade-off between sensitivity and selectivity
  • BBH and INP are most sensitive but also most
    selective
  • Results can differ depending on which sequence
    comparison algorithm is used
  • - BLAST, FASTA, Smith-Waterman?
  • - E-value or Z-value?

12
E-value or Z-value?
  • Smith-Waterman with Z-value statistics
  • 100 randomized shuffles to test significance of
    SW score

[Figure: distribution of SW scores for shuffled sequences (e.g. original MFTGQEYHSV, shuffles GQHMSVFTEY, YMSHQFTVGE, etc.); an original score 5 SD above the random mean gives Z = 5]
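
A sketch of the Z-value idea in Python; `smith_waterman` is a hypothetical stand-in for any SW implementation (Biofacet shuffles both sequences, hence the 2 × 100 randomizations mentioned on the next slide):

```python
import random

def z_value(query, target, sw_score, n_shuffles=100):
    """Z-value of an SW score: how many standard deviations the real
    score lies above the mean score of shuffled sequences."""
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(query)
        random.shuffle(shuffled)  # e.g. MFTGQEYHSV -> GQHMSVFTEY
        scores.append(smith_waterman("".join(shuffled), target))
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return (sw_score - mean) / sd  # a score 5 SD above the mean gives Z = 5
```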
13
E-value or Z-value?
  • Z-value calculation takes much time (2 × 100
    randomizations)
  • Comet et al. (1999) and Bastien et al. (2004):
    the Z-value is theoretically more sensitive and
    more selective than the E-value
  • This advantage of the Z-value has never been
    proven by experimental results

14
How to compare?
  • Structural comparison is better than sequence
    comparison
  • ASTRAL SCOP: Structural Classification Of
    Proteins
  • e.g. a.2.1.3, c.1.2.4: same number = same
    structure
  • Use structural classification as benchmark for
    sequence comparison methods (see the sketch below)
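
The benchmark labelling is compact to express: SCOP sccs identifiers have the form class.fold.superfamily.family, so two domains are in the same family exactly when all four fields match (a minimal sketch, not the deck's actual tooling):

```python
# sccs identifiers: class.fold.superfamily.family, e.g. a.2.1.3
def same_family(sccs_a, sccs_b):
    return sccs_a.split(".")[:4] == sccs_b.split(".")[:4]

assert same_family("a.2.1.3", "a.2.1.3")
assert not same_family("a.2.1.3", "c.1.2.4")  # different class/fold/...
```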

15
ASTRAL SCOP statistics
16
Methods (1)
  • Smith-Waterman algorithms
  • dynamic programming: computationally
    intensive
  • Paracel with e-value (PC E)
  • SW implementation of Paracel
  • Biofacet with z-value (BF Z)
  • SW implementation of Gene-IT
  • ParAlign with e-value (PA E)
  • SW implementation of Sencel
  • SSEARCH with e-value (SS E)
  • SW implementation of FASTA (see next page)

17
Methods (2)
  • Heuristic algorithms
  • FASTA (FA E)
  • Pearson & Lipman, 1988
  • Heuristic approximation; performs better than
    BLAST with strongly diverged proteins
  • BLAST (BL E)
  • Altschul et al., 1990
  • Heuristic approximation; extends local
    alignments (HSPs) towards a global alignment
  • Should be faster than FASTA

18
Method parameters
  • all:
  • matrix: BLOSUM62
  • gap open penalty: 12
  • gap extension penalty: 1
  • Biofacet with z-value: 100 randomizations

19
Receiver Operating Characteristic
  • R.O.C.: statistical value, mostly used in
    clinical medicine
  • Proposed by Gribskov & Robinson (1996) for use
    in sequence comparison analysis

20
ROC50 Example
  • Take the 100 best hits
  • True positives: in same SCOP family; false
    positives: not in same family
  • For each of the first 50 false positives,
    calculate the number of true positives higher in
    the list (0,4,4,4,5,5,6,9,12,12,12,12,12)
  • - Divide the sum of these numbers by the number
    of false positives (50) and by the total number
    of possible true positives (size of family - 1):
    ROC50 (0.167)
  • - Take the average of the ROC50 scores for all
    entries (a sketch of this recipe follows below)
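
A minimal Python sketch of this recipe; the hit list is assumed to be already ranked and reduced to booleans (True = same SCOP family), which is an illustrative simplification:

```python
def roc50(ranked_hits, n_possible_tp, n_fp=50):
    """ROC50 for one query; n_possible_tp is the family size minus 1."""
    tp, counts = 0, []
    for is_tp in ranked_hits:
        if is_tp:
            tp += 1
        else:
            counts.append(tp)  # true positives ranked above this false positive
            if len(counts) == n_fp:
                break
    return sum(counts) / (n_fp * n_possible_tp)
```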

21
ROC50 results
22
Coverage vs. Error
  • C.V.E.: Coverage vs. Error (Brenner et al.,
    1998)
  • E.P.Q. (errors per query): selectivity indicator
    (how many false positives?)
  • Coverage: sensitivity indicator (how many true
    positives out of the total?)

23
CVE Example
  • Vary the threshold above which a hit is seen as
    a positive, e.g. E = 10, 1, 0.1, 0.01
  • True positives: in same SCOP family; false
    positives: not in same family
  • For each threshold, calculate the coverage: the
    number of true positives divided by the total
    number of possible true positives
  • For each threshold, calculate the
    errors-per-query: the number of false positives
    divided by the number of queries
  • - Plot coverage on the x-axis and errors-per-query
    on the y-axis; right-bottom is best (a sketch
    follows below)
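
A sketch of the CVE computation under the same simplifying assumptions (pooled hits carrying an E-value and a true/false label; all names are illustrative):

```python
def coverage_vs_error(hits, n_queries, n_possible_tp,
                      thresholds=(10, 1, 0.1, 0.01)):
    """One (coverage, errors-per-query) point per E-value threshold.
    hits: iterable of (e_value, is_true_positive) pairs over all queries."""
    points = []
    for t in thresholds:
        tp = sum(1 for e, ok in hits if e <= t and ok)
        fp = sum(1 for e, ok in hits if e <= t and not ok)
        points.append((tp / n_possible_tp,  # coverage (x-axis)
                       fp / n_queries))     # errors per query (y-axis)
    return points  # right-bottom of the plot is best
```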

24
CVE results
(only PDB095)
25
Mean Average Precision
  • A.P.: borrowed from information retrieval
    (Salton, 1991)
  • Recall: true positives divided by the number of
    homologs
  • Precision: true positives divided by the number
    of hits
  • A.P.: approximate integral to calculate the area
    under the recall-precision curve

26
Mean AP Example
  • - Take the 100 best hits
  • - True positives: in same SCOP family; false
    positives: not in same family
  • For each of the true positives, divide the true
    positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the
    overall hit rank (2,3,4,5,9,12,14,15,16,18,19,20)
  • Divide the sum of all of these numbers by the
    total number of hits (100): AP (0.140)
  • Take the average of the AP scores for all
    entries: mean AP (a sketch follows below)
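
The per-query AP as a sketch; note that it follows the slide's recipe and divides by the total number of hits, where classic AP would divide by the number of true positives:

```python
def average_precision(ranked_hits):
    """AP for one query; ranked_hits are booleans for the 100 best hits."""
    tp, total = 0, 0.0
    for rank, is_tp in enumerate(ranked_hits, start=1):
        if is_tp:
            tp += 1
            total += tp / rank  # true-positive rank / overall hit rank
    return total / len(ranked_hits)  # slide's recipe: divide by all hits
```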

27
Mean AP results
28
Time consumption
  • PDB095 all-against-all comparison
  • Biofacet: multiple days (z-value calc.!)
  • BLAST: 2d 4h 16m
  • SSEARCH: 5h 49m
  • ParAlign: 47m
  • FASTA: 40m

29
Preliminary conclusions
  • SSEARCH gives the best results
  • When time is important, FASTA is a good
    alternative
  • Z-value seems to have no advantage over E-value

30
Problems
  • Bias in PDB?
  • Sequence length
  • Amino acid composition
  • Difference in matrices?
  • Difference in SW implementations?

31
Bias in PDB sequence length?
→ Yes! Short sequences are over-represented in
the ASTRAL SCOP PDB sets
32
Bias in PDB aa distribution?
→ No! Approximately equal amino acid distribution
in the ASTRAL SCOP PDB sets
33
Difference in matrices?
34
Difference in SW implementations?
35
Conclusions
  • E-value better than Z-value!
  • SW implementations perform (more or less) the
    same (SSEARCH, ParAlign and Biofacet), but
    SSEARCH with e-value scores best of all
  • A larger structural comparison database is
    needed for better analysis

36
Credits
  • NV Organon
  • Peter Groenen
  • Wilco Fleuren
  • Wageningen UR
  • Jack Leunissen