Title: DDPIn: Distance and Density Based Protein Indexing
1DDPIn: Distance and Density Based Protein Indexing
- David Hoksza
- Charles University in Prague, Department of Software Engineering, Czech Republic
2Presentation Outline
- Biological background
- Similarity search in protein structure databases
- DDPIn
- feature vector extraction
- metrics
- querying
- one-step approach
- multi-step approach
- Experimental results
- Conclusion
3Biological Background
- Proteins
- molecules
- translated from mRNA in ribosomes
- DNA → RNA → protein
- sequence of amino acids (20 AAs)
- coded by codon (triplet of nucleotides)
- Function of a protein is derived from its three-dimensional structure
- → similar proteins have similar functions
- similar proteins have a common ancestor
- Identifying protein structure → finding similar proteins → getting a clue to the function
4Similarity Search in Protein Databases
- Similarity between a pair of proteins
- alignment similarity score
- RMSD, TM-score, ...
- visual inspection
- DALI, CE, SAP, VAST
- Classification
- SCOP (Structural Classification of Proteins)
- no need for an alignment
- indexing various features
- PSI, PSIST, ProGreSS, CTSS, DDPIn
5DDPIn - Overview
- Distance and Density based Protein Indexing
- Classification method
- Indexing of protein features
- distances among Cα atoms used
- each AA represents a feature → protein p consists of |p| features
- various semantics used
- based on clustering Cα atoms into rings
- metric indexing employed (M-tree)
- kNN querying
- outcomes of several searches are merged to obtain
final results
6DDPIn - Feature Extraction
- Features
- n-dimensional vectors of real numbers
- AA viewpoint → VPT (viewpoint tag)
- sDens
- density of AAs in rings with a predefined width
- sDensSSE
- enhanced with SSE information
- sRad
- widths of rings containing a predefined percentage of AAs
- sRadSSE
- enhanced with SSE information
- sDir
- number of AAs in a ring pointing from the viewpoint
- sDens enhanced with direction information
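The sDens and sRad semantics above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default ring width and ring count, the use of raw counts as "density", and the cumulative fractions are all assumptions made for illustration. Input is a list of Cα coordinates, and one residue is chosen as the viewpoint.

```python
import math

def viewpoint_distances(ca_coords, viewpoint_idx):
    """Sorted distances from the viewpoint residue to every other C-alpha atom."""
    vp = ca_coords[viewpoint_idx]
    return sorted(math.dist(vp, c)
                  for i, c in enumerate(ca_coords) if i != viewpoint_idx)

def s_dens(ca_coords, viewpoint_idx, ring_width=4.0, n_rings=8):
    """Assumed sDens: number of atoms in each concentric ring of fixed width.

    Ring i covers distances [i*ring_width, (i+1)*ring_width); atoms beyond
    the last ring are ignored. Raw counts stand in for density here.
    """
    counts = [0] * n_rings
    for d in viewpoint_distances(ca_coords, viewpoint_idx):
        ring = int(d // ring_width)
        if ring < n_rings:
            counts[ring] += 1
    return counts

def s_rad(ca_coords, viewpoint_idx, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Assumed sRad: widths of successive rings, where ring j ends at the
    radius enclosing the cumulative fraction fractions[j] of the atoms."""
    dists = viewpoint_distances(ca_coords, viewpoint_idx)
    widths, inner = [], 0.0
    for f in fractions:
        k = max(1, math.ceil(f * len(dists)))  # atoms enclosed so far
        outer = dists[k - 1]
        widths.append(outer - inner)
        inner = outer
    return widths

# Toy chain laid out on a line; residue 0 is the viewpoint.
chain = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (9, 0, 0), (13, 0, 0)]
print(s_dens(chain, 0, ring_width=4.0, n_rings=3))
print(s_rad(chain, 0, fractions=(0.5, 1.0)))
```

Computing one such vector per residue gives the |p| viewpoint tags of a protein p; both variants yield fixed-length vectors regardless of protein size.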
7DDPIn - Similarity of VPTs
- Metrics
- L2
- weighted L2
- close neighborhood of VPs is more important
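A small sketch of the two metrics, assuming each VPT is a list of per-ring values ordered from the innermost ring outward. The geometric weight decay is a hypothetical choice that only illustrates "close neighborhood is more important"; the slides do not state the actual weights.

```python
import math

def l2(u, v):
    """Plain Euclidean distance between two VPT vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def weighted_l2(u, v, weights):
    """Euclidean distance with a per-dimension weight on each squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))

def decaying_weights(n, decay=0.5):
    """Hypothetical weights: ring i gets decay**i, so inner rings count more."""
    return [decay ** i for i in range(n)]

w = decaying_weights(3)
# The same unit difference costs more in the innermost ring than in the outermost.
print(weighted_l2([1, 0, 0], [0, 0, 0], w), weighted_l2([0, 0, 1], [0, 0, 0], w))
```

With strictly positive weights, weighted L2 is still a metric (it equals plain L2 after rescaling each coordinate), so it remains usable with a metric index.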
8DDPIn Indexing Structure
- M-tree (Metric tree)
- Dynamic, hierarchical indexing structure
- Data space divided into ball-shaped data regions (hyper-spheres)
- root node represents a data region covering all data
- child nodes represent regions covering parts of the space
- data regions form a balanced hierarchical structure
- inner nodes → routing entries
- leaf nodes → ground entries
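The benefit of ball-shaped regions is that a whole subtree can be skipped using only the triangle inequality. The sketch below shows that test for a range query; it assumes a routing entry reduced to a pivot object and a covering radius (the class and function names are illustrative, and a real M-tree entry stores additional fields such as the distance to the parent and a pointer to the subtree).

```python
import math
from dataclasses import dataclass

@dataclass
class RoutingEntry:
    pivot: tuple            # routing object at the centre of the ball region
    covering_radius: float  # every object in the subtree lies within this radius

def can_prune(query, query_radius, entry, dist=math.dist):
    """True if the query ball cannot intersect the entry's ball region.

    By the triangle inequality, any object o in the subtree satisfies
    dist(query, o) >= dist(query, pivot) - covering_radius, so the subtree
    holds no result when that lower bound exceeds query_radius.
    """
    return dist(query, entry.pivot) > query_radius + entry.covering_radius

region = RoutingEntry(pivot=(0.0, 0.0), covering_radius=1.0)
print(can_prune((5.0, 0.0), 1.0, region))  # far away: subtree can be skipped
print(can_prune((1.5, 0.0), 1.0, region))  # balls overlap: subtree must be visited
```

A kNN search can reuse the same test by treating the distance to the current k-th nearest candidate as a shrinking query radius.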
9Querying / Classification
- One-step
- extracting VPTs from query → n queries
- ranking scheme
- Two-step
- healing
- reclassification with Smith-Waterman algorithm on
sequences
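The one-step approach merges the answers of the n kNN queries (one per query VPT) into a single classification. The slides do not spell out the ranking scheme, so the sketch below substitutes a simple assumed one: every retrieved neighbour votes for its class with weight 1/rank. The function name and the labels are hypothetical.

```python
from collections import defaultdict

def classify_by_voting(knn_results):
    """Merge per-VPT kNN answers into one predicted class.

    knn_results: one list per query VPT, holding the class labels of its
    k nearest database VPTs, nearest first.
    Assumed scheme: a hit at rank r contributes 1/r to its class score.
    """
    scores = defaultdict(float)
    for hits in knn_results:
        for rank, label in enumerate(hits, start=1):
            scores[label] += 1.0 / rank
    return max(scores, key=scores.get) if scores else None

# Three query VPTs with k = 2; "sfB" collects the highest total score.
print(classify_by_voting([["sfA", "sfB"], ["sfB", "sfA"], ["sfB", "sfC"]]))
```

In a two-step setting, the top-scoring candidates from such a merge would be the ones passed on for reclassification, for example by sequence alignment.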
10Experimental Results
- SCOP 1.65 dataset
- class → fold → superfamily → family
- 1810 proteins
- 181 superfamilies
- at least 10 proteins each
- all-α, all-β, α+β and α/β classes
- query set
- reduced - 181 queries
- full
- used also by PSI, ProGreSS, PSIST methods
- Testing of
- superfamily classification accuracy
- fold classification accuracy
11Finding Optimal k for kNN Queries
12Accuracy of VPT Semantics
13Accuracy for Increasing Dimension
14Accuracy of Various Metrics
15Suitability of Pairs of VPT Semantics for Healing
- figure legend: identical correct classification, identical wrong classification
16Comparison of Classification Methods
17Conclusion
- We have proposed
- new representation of protein structures
- distance and density of Cα atoms
- ranking scheme
- two-step classification
- We implemented
- M-tree indexing for proposed representation
- classification against SCOP
- Experimental results
- best results among methods using identical classification
- 98.9% superfamily classification accuracy
- 100% fold classification accuracy
- comparable run time