Title: Unsupervised Pattern Discovery in Protein Structures
1Unsupervised Pattern Discovery in Protein
Structures
Tom Milledge and Giri Narasimhan Bioinformatics
Research Group (BioRG) School of Computing and
Information Sciences, FIU Contact
tmille01_at_cs.fiu.edu, giri_at_cs.fiu.edu
III. Overview of the Approach
I. Description and Motivation
II. Background Information
- What is the project trying to do?
- To implement unsupervised pattern discovery tools
for protein structure data by using the
ggeometric hashing technique - Why is it important?
- To identify common substructures in proteins
- To create multiple 3-D structural alignments
- To identify functional regions in proteins.
- What is the expected output?
- A database of protein structure patterns.
- Case Study
- Zinc finger domains
- Dehydrogenase domains
- At the begin of the search, the query protein is
decomposed into triangles and the query protein
data is then copied to all nodes. The hash table
created in the preprocessing phase is also split
across cluster nodes by protein, with protein
attribute information stored in a separate table.
- The initial search entails matching the query
triangles with the database of (target)
triangles. - The substructure creation (extension) phase
entails the building a graph of the largest
substructures common to both.
- Geometric hashing is a computational technique
for matching geometric features against a
database of such features.
Search Phase For any given query protein, find
the matching triangles in the hash table
Preprocessing Phase Extract triangle information
from target proteins and store them in a hash
table
Extension phase Find the largest matching
substructures
Query Protein
Target Protein
Extract triangles
Extract triangles
Matching triangles
Query hits
Target hits
IV. Case Study Dehydrogenase domains
V. Conclusions and Future Work
Results Case studies have been performed on the
Zinc Finger DNA binding domain family and the
Dehydrogenase enzyme super family. In the former
case, two substructures were detected
corresponding to the known zinc-binding and DNA
binding functions. In the latter case, a large
sub-structural motif was detected in proteins
that were not related at the family level.
1B3R Hydrolase (Rat)
1CJC Reductase (Cow)
1CF2 Dehydrogenase (Bacteria)
Reoccurring substructure
Future Work Perform the search in an
unsupervised manner on the entire Protein Data
Bank (PDB) The result of such a search will be a
database (or knowledgebase) of structure patterns
in proteins. Such databases will help in the
important task of functional annotation of
proteins.
Edge-extended substructure comprising
zinc-binding residues. RMSD 0.35 angstroms
Structural alignment of two zinc finger DNA
binding motifs. Query yellow Target blue RMSD
0.95 angstroms
Other overlapping triangle matches are extended
from initial triangle to find largest common
substructure
Triangle from query protein (green) matches
triangle from target protein (pink)
RMSD is measured at each extension step to ensure
validity of the larger match
Edge-extended substructure comprising DNA-binding
residues. RMSD 0.46 angstroms
RMSD (Root Mean Square distance) less than 1.0
Angstrom indicates a good match
VI. Acknowledgement
- This research was supported in part by NIH grant
OCI-0537464 .
RMSD 0.32 Angstroms
RMSD 0.66 Angstroms