DDPIn: Distance and Density Based Protein Indexing

1
DDPIn: Distance and Density Based Protein Indexing
  • David Hoksza
  • Charles University in Prague, Department of
    Software Engineering, Czech Republic

2
Presentation Outline
  • Biological background
  • Similarity search in protein structure databases
  • DDPIn
    • feature vector extraction
    • metrics
    • querying
      • one-step approach
      • multi-step approach
  • Experimental results
  • Conclusion

3
Biological Background
  • Proteins
  • molecules
  • translated from mRNA in ribosomes
  • DNA → RNA → protein
  • sequence of amino acids (20 AAs)
  • coded by codon (triplet of nucleotides)
  • Function of a protein is derived from its
    three-dimensional structure
  • → similar proteins have similar functions
  • similar proteins have a common ancestor
  • Identifying protein structure → finding similar
    proteins → getting a clue to the function

4
Similarity Search in Protein Databases
  • Similarity between a pair of proteins
  • alignment similarity score
  • RMSD, TM-score,
  • visual inspection
  • DALI, CE, SAP, VAST
  • Classification
  • SCOP (Structural Classification of Proteins)
  • no need for an alignment
  • indexing various features
  • PSI, PSIST, ProGreSS, CTSS, DDPIn
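
As a reminder of what an alignment-based score such as RMSD measures, here is a minimal sketch. It assumes the two structures are already aligned and superposed, so that point i in one list corresponds to point i in the other; the function name `rmsd` and the plain-tuple coordinate format are illustrative choices, not part of any tool named on the slide.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: the structures are already aligned and superposed, so
    coords_a[i] corresponds to coords_b[i].
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        # Squared Euclidean distance between corresponding atoms.
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))
```

Finding the alignment and the superposition is the expensive part that DALI, CE, SAP and VAST address; the indexing methods in the second group avoid it.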

5
DDPIn - Overview
  • Distance and Density based Protein Indexing
  • Classification method
  • Indexing of protein features
  • distances among Cα atoms used
  • each AA represents a feature → a protein p consists
    of |p| features
  • various semantics used
  • based on clustering Cα atoms into rings
  • metric indexing employed (M-tree)
  • kNN querying
  • outcomes of several searches are merged to obtain
    final results

6
DDPIn - Feature Extraction
  • Features
    • n-dimensional vectors of real numbers
    • AA viewpoint → VPT (viewpoint tag)
  • sDens
    • density of AAs in rings with a predefined width
  • sDensSSE
    • enhanced with SSE information
  • sRad
    • widths of rings containing a predefined
      percentage of AAs
  • sRadSSE
    • enhanced with SSE information
  • sDir
    • number of AAs in a ring pointing from the
      viewpoint
    • sDens enhanced with direction information
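
The sDens and sRad semantics above can be sketched roughly as follows. This is one possible reading of the slide, not the author's implementation: the function names, the normalisation of ring counts by the number of other residues, and the parameters `ring_width`, `ring_fraction` and `n_rings` are all assumptions made for illustration.

```python
import math

def distances_from_viewpoint(ca_coords, vp_index):
    """Sorted distances from the viewpoint C-alpha atom to all other C-alpha atoms."""
    vp = ca_coords[vp_index]
    return sorted(math.dist(vp, c) for i, c in enumerate(ca_coords) if i != vp_index)

def s_dens(ca_coords, vp_index, ring_width, n_rings):
    """Fraction of residues falling into each concentric ring of fixed width.

    Assumption: density is the ring count divided by the number of other residues.
    """
    dists = distances_from_viewpoint(ca_coords, vp_index)
    counts = [0] * n_rings
    for d in dists:
        ring = int(d // ring_width)
        if ring < n_rings:          # residues beyond the last ring are ignored
            counts[ring] += 1
    return [c / len(dists) for c in counts] if dists else [0.0] * n_rings

def s_rad(ca_coords, vp_index, ring_fraction, n_rings):
    """Widths of successive rings that each hold `ring_fraction` of the residues."""
    dists = distances_from_viewpoint(ca_coords, vp_index)
    widths, inner = [], 0.0
    for j in range(1, n_rings + 1):
        # Small epsilon guards against floating-point overshoot before ceil().
        count = min(len(dists), math.ceil(j * ring_fraction * len(dists) - 1e-9))
        outer = dists[count - 1] if count > 0 else inner
        widths.append(outer - inner)
        inner = outer
    return widths
```

Calling either function once per residue, with that residue as the viewpoint, would yield one feature vector (VPT) per amino acid.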

7
DDPIn - Similarity of VPTs
  • Metrics
  • L2
  • weighted L2
  • close neighborhood of VPs is more important
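
A minimal sketch of the two metrics follows. The slide does not give the weights, so the geometric decay in `decreasing_weights` is purely an assumption; it only illustrates the idea that rings close to the viewpoint contribute more. With strictly positive weights, the weighted L2 distance is still a metric, which metric indexing requires.

```python
import math

def l2(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_l2(u, v, weights):
    """Euclidean distance with a positive weight per dimension."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

def decreasing_weights(n, decay=0.9):
    """Hypothetical weighting: dimension i (ring i) gets weight decay**i."""
    return [decay ** i for i in range(n)]
```

For example, with `decreasing_weights(2, 0.5)` a unit difference in the innermost ring counts twice as much (before the square root) as the same difference in the next ring.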

8
DDPIn Indexing Structure
  • M-tree (Metric tree)
  • Dynamic, hierarchical indexing structure
  • Data space divided into ball shaped data regions
    (hyper-spheres)
  • root node represents a data region covering all data
  • child nodes represent regions covering parts
    of the space
  • data regions form a balanced hierarchical structure
  • inner nodes → routing entries
  • leaf nodes → ground entries
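
The reason ball-shaped regions help can be shown with the basic pruning test used in metric trees: by the triangle inequality, a whole region can be skipped when the query ball and the data ball do not intersect. The sketch below shows only that test, not a full M-tree; the `Ball` class and `can_prune` function are illustrative names.

```python
import math
from dataclasses import dataclass

@dataclass
class Ball:
    """A data region: every object in it lies within `radius` of `center`."""
    center: tuple
    radius: float

def can_prune(query, query_radius, ball, dist=math.dist):
    """True if no object inside `ball` can lie within `query_radius` of `query`.

    Follows from the triangle inequality: the two balls are disjoint when the
    distance between their centers exceeds the sum of their radii.
    """
    return dist(query, ball.center) > query_radius + ball.radius
```

During a search, a routing entry whose region passes this test is skipped together with its entire subtree, so only a fraction of the stored vectors needs a distance computation.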

9
Querying / Classification
  • One-step
    • extracting VPTs from query → n queries
    • ranking scheme
  • Two-step
    • healing
    • reclassification with Smith-Waterman algorithm on
      sequences
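
The one-step approach can be sketched as follows: each VPT of the query protein issues its own kNN query, and the labels of the returned neighbours are merged into a single decision. The slide does not specify the ranking scheme, so the reciprocal-rank vote below is a hypothetical stand-in, and the brute-force scan in `knn` merely replaces the M-tree lookup to keep the example self-contained.

```python
import math
from collections import defaultdict

def knn(query_vec, database, k):
    """k nearest (vector, label) entries by L2; brute-force stand-in for an M-tree."""
    return sorted(database, key=lambda item: math.dist(query_vec, item[0]))[:k]

def classify(query_vpts, database, k):
    """Run one kNN query per VPT and merge the results by a rank-weighted vote.

    Assumption: a neighbour at rank r (0-based) adds 1 / (r + 1) to its label's
    score. This is an illustrative scheme, not the one used by DDPIn.
    """
    scores = defaultdict(float)
    for vpt in query_vpts:
        for rank, (_, label) in enumerate(knn(vpt, database, k)):
            scores[label] += 1.0 / (rank + 1)
    if not scores:
        raise ValueError("no neighbours found; empty query or database")
    return max(scores, key=scores.get)
```

Here `database` would hold the VPTs of all indexed proteins, each tagged with its SCOP superfamily or fold label.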

10
Experimental Results
  • SCOP 1.65 dataset
    • class → fold → superfamily → family
    • 1810 proteins
    • 181 superfamilies
    • at least 10 proteins each
    • all α, all β, α+β and α/β classes
  • query set
    • reduced - 181 queries
    • full
    • used also by PSI, ProGreSS, PSIST methods
  • Testing of
    • superfamily classification accuracy
    • fold classification accuracy

11
Finding Optimal k for kNN Queries

12
Accuracy of VPT Semantics

13
Accuracy for Increasing Dimension

14
Accuracy of Various Metrics

15
Suitability of Pairs of VPT Semantics for Healing
  • identical correct classification
  • identical wrong classification

16
Comparison of Classification Methods

17
Conclusion
  • We have proposed
    • new representation of protein structures
    • distance and density of Cα atoms
    • ranking scheme
    • two-step classification
  • We implemented
    • M-tree indexing for proposed representation
    • classification against SCOP
  • Experimental results
    • best results among methods using identical
      classification
    • 98.9% superfamily classification accuracy
    • 100% fold classification accuracy
    • comparable run time