Title: SPAM Project IV
1 SPAM Project IV Julia Ponomarenko
San Diego Supercomputer Center, University of
California, San Diego La Jolla, CA, USA
2 Main Aims of Project IV
Specific studies performed and components
developed
Projects
Enhancing structural alignment with interaction
patterns for DNA-binding domains
1. Improve the CE and MC-CE algorithms for pairwise and
multiple structure comparison,
respectively
Study and fully-automated classification of
DNA-binding protein domains (accomplished and
published in Bioinformatics, 2002)
The classification and annotation resource for
DNA-binding protein domains has been built and
made available for the community
2. With the improved structure alignments,
characterize structures according to new and
revised domain assignments and associated
domain-level annotation
Study of enhanced CE algorithm: (i) detailed
representation of residues, (ii) multiple HSPs
Study of structural and functional annotation and
classification of three helical bundle motifs
(3HB) containing HTH DNA-binding motifs (in
progress)
Literature review of structural and functional
specifics of HTH and 3HBs
3. Provide all algorithms and associated
annotated domain databases to a worldwide
community
Analysis of HTH and 3HB motifs, and associated
domains in PDB from the point of view of their
annotation and classification
5 Nuclear ribonucleoprotein A1
Papillomavirus-1 E2
6 Nuclear ribonucleoprotein A1
Papillomavirus-1 E2
7 SCOP (different superfamilies): 2bopa d.58.8.1, 2up1a d.58.7.1
CATH (similar homologous superfamily): 2bopa 3.30.70.330, 2up1a 3.30.70.330
DALI (different functional families): 2bopa DC_5_3_15, 2up1a DC_5_3_26
Transcription regulator, binds dsDNA
Telomere length regulator, binds ssDNA
8 Study and Classification of DNA-binding protein domains: Data and algorithms used and developed
- PDB: Protein Data Bank, used as the source of original structural data
- PDP: Protein Domain Parser (Alexandrov and Shindyalov, Bioinformatics, 2003)
- CE: protein structure alignment by Combinatorial Extension (Shindyalov, Bourne, 1998)
- Enhanced structural alignment: optimization of the CE-based alignment using DNA-binding interaction patterns
- Domain classification algorithm using a composite scoring function whose parameters represent domain structural similarity and matching of interaction patterns
- Comparative analysis methodology for domain classifications using a 2x2 table representation and a set of seven statistical coefficients
9 Building representative set of DNA-binding domains
PDB
17,304 entries (02/13/2002)
19,006 entries (10/22/2002)
983
805 chains
1,547 domains
1,254
1,085 domains
338 domains
399
Calculating classification of DNA-binding protein
domains
10 Selection of DNA-binding protein chains/domains by analyzing DNA-protein contacts
- The DNA fragment is at least 5 bp long.
- At least 5 different protein residues are involved in the interaction with DNA.
- The contact distance cutoff between interacting atoms was < 5 Å.
- We did not take into account the different types of DNA (A, B, Z) because this annotation is insufficient in the PDB.
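The selection criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names, the flat `(residue_id, (x, y, z))` atom representation, and the assumption that the DNA length in base pairs is already known are all hypothetical.

```python
from math import dist


def contacting_residues(protein_atoms, dna_atoms, cutoff=5.0):
    """Return ids of protein residues that have at least one atom closer
    than `cutoff` (in Å) to any DNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z))  -- assumed layout
    dna_atoms:     iterable of (x, y, z)                -- assumed layout
    """
    dna = list(dna_atoms)
    hits = set()
    for res_id, xyz in protein_atoms:
        if res_id in hits:
            continue  # this residue is already known to contact DNA
        if any(dist(xyz, d) < cutoff for d in dna):
            hits.add(res_id)
    return hits


def is_dna_binding(protein_atoms, dna_atoms, dna_length_bp,
                   min_bp=5, min_residues=5, cutoff=5.0):
    """Apply the slide's criteria: DNA of at least 5 bp and at least
    5 distinct protein residues within the contact cutoff."""
    if dna_length_bp < min_bp:
        return False
    n_contacts = len(contacting_residues(protein_atoms, dna_atoms, cutoff))
    return n_contacts >= min_residues
```

A brute-force all-against-all distance check is used here for clarity; for whole-PDB scans a spatial index would likely be preferable.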
11 Representatives are different from each other as defined by the following criteria:
- Rmsd, root mean squared deviation between two aligned and compared protein domains, > 2.0 Å
- Z-score, statistically founded score obtained from CE, < 4.5
- Sequence identity in the alignment, < 90%
- Rnar, ratio of the number of structurally aligned residues to the smallest domain length, < 90%
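As an illustration, the four thresholds above can be evaluated for one pairwise comparison as follows. The slide does not state whether the criteria must hold jointly or individually, so the sketch reports each criterion separately and leaves the combination rule as an explicit parameter; the `PairComparison` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairComparison:
    rmsd: float          # Å, from the structural alignment
    z_score: float       # CE Z-score
    seq_identity: float  # percent identity in the alignment
    rnar: float          # percent of the smaller domain that is aligned


def difference_criteria(p):
    """Evaluate each threshold listed on the slide for one domain pair."""
    return {
        "rmsd": p.rmsd > 2.0,
        "z_score": p.z_score < 4.5,
        "seq_identity": p.seq_identity < 90.0,
        "rnar": p.rnar < 90.0,
    }


def are_different(p, combine=any):
    """Combine the criteria. ASSUMPTION: the combination rule is not given
    in the source, so pass `any` or `all` explicitly as needed."""
    return combine(difference_criteria(p).values())
```

For example, `are_different(pair, combine=all)` treats two domains as distinct representatives only when every threshold is met.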
12 Structural comparison of representative DNA-binding protein domains to each other was performed using the CE algorithm. Two classes of parameters measuring domain similarity are considered:
- Parameters measuring structural similarity: Rmsd, Z-score, Rnar
- Parameter measuring the match between DNA-protein contact patterns: Rmat
[Figure legend: residue contacts with DNA; matching contacts]
A and B are DNA-binding protein domains. RmatX is the
ratio of the number of matched (structurally
aligned) contact residues to the total number of
residues involved in contacts with DNA in
protein X. Rmat = min(RmatA, RmatB)
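The definition of Rmat can be made concrete with a short sketch. It assumes that a contact residue counts as "matched" when it is structurally aligned to a residue of the other domain that also contacts DNA; the source does not spell this out, so treat it as one possible reading. The function name and argument layout are hypothetical.

```python
def rmat(contacts_a, contacts_b, aligned_pairs):
    """Return (RmatA, RmatB, Rmat) as percentages.

    contacts_a, contacts_b: residue ids contacting DNA in domains A and B
    aligned_pairs: iterable of (residue_in_A, residue_in_B) from the
                   structural alignment
    ASSUMPTION: a contact is matched only if both aligned residues
    contact DNA.
    """
    contacts_a, contacts_b = set(contacts_a), set(contacts_b)
    if not contacts_a or not contacts_b:
        raise ValueError("both domains need at least one DNA contact")
    matched = [(a, b) for a, b in aligned_pairs
               if a in contacts_a and b in contacts_b]
    matched_a = {a for a, _ in matched}
    matched_b = {b for _, b in matched}
    rmat_a = 100.0 * len(matched_a) / len(contacts_a)
    rmat_b = 100.0 * len(matched_b) / len(contacts_b)
    return rmat_a, rmat_b, min(rmat_a, rmat_b)
```

Taking the minimum makes Rmat conservative: a pair scores high only when the contact patterns of both domains are largely covered by the alignment.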
13 Illustration of the parameter Rmat measuring the match of structurally aligned residues involved in interaction with DNA
1IMHC2 (199-365) - NUCLEAR FACTOR OF ACTIVATED T CELLS 5
1RAMA1 (19-193) - TRANSCRIPTION FACTOR NF-KB
The total number of residues involved in the interaction with DNA: 1IMHC2 18, 1RAMA1 18
Among them are structurally aligned: 1IMHC2 72% (red), 1RAMA1 74% (green)
Rmat = 72%
[Figure panels: 1IMHC2, 1RAMA1]
14 Realignment using a scoring function that takes into account the structural similarity between two protein domains and the protein-DNA contact pattern
Similarity matrix
Structure similarity term
Protein-DNA contact pattern term
where m denotes a protein residue, X a protein-DNA complex, and C3 is a scaling constant
15 Illustration of the result of the realignment procedure
1IMHC2 (199-365) - NUCLEAR FACTOR OF ACTIVATED T CELLS 5
1RAMA1 (19-193) - TRANSCRIPTION FACTOR NF-KB
Rmsd 2.4 Å, Rnar 82%, Rmat 72%
16 Comparative analysis methodology for domain classifications using a 2x2 table representation and a set of statistical coefficients
- NTT, NTF, NFT, NFF: counts of matches/mismatches between two classifications
- T: true (match), F: false (mismatch)
Jaccard's coefficient = NTT / (NTT + NTF + NFT)
Yule's (colligation) coefficient = (NTT x NFF - NTF x NFT) / (NTT x NFF + NTF x NFT)
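A minimal sketch of how such a 2x2 table and the two coefficients might be computed is given below. It assumes the table is built over pairs of domains, with T meaning "both domains fall in the same class" in a given classification; the slide does not say how the counts are obtained, so this pair-counting scheme and all function names are assumptions. The Yule expression is implemented exactly as written on the slide, without square roots.

```python
from itertools import combinations


def pair_table(class_a, class_b):
    """Count domain pairs by agreement between two classifications.

    class_a, class_b: dict mapping domain id -> class label.
    Key 'TF' means: same class in A (T), different classes in B (F).
    ASSUMPTION: pair counting over the domains present in both inputs.
    """
    common = sorted(set(class_a) & set(class_b))
    n = {"TT": 0, "TF": 0, "FT": 0, "FF": 0}
    for x, y in combinations(common, 2):
        same_a = class_a[x] == class_a[y]
        same_b = class_b[x] == class_b[y]
        n[("T" if same_a else "F") + ("T" if same_b else "F")] += 1
    return n


def jaccard(n):
    # Raises ZeroDivisionError when no pair is co-classified anywhere.
    return n["TT"] / (n["TT"] + n["TF"] + n["FT"])


def yule(n):
    # Raises ZeroDivisionError when both cross products are zero.
    ad = n["TT"] * n["FF"]
    bc = n["TF"] * n["FT"]
    return (ad - bc) / (ad + bc)
```

Both coefficients equal 1 when the two classifications group the domains identically; the Yule expression falls to 0 when the two cross products are equal.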
17 Comparison of the classification for 263 (of
338) DNA-binding domain representatives with SCOP
at various threshold parameters
18 Comparison of two structural classifications accounting (A) and not accounting (B) for protein-DNA contacts
[Figure panels: A, B]
19 Preferred choice of parameters for the best classification of representative DNA-binding domains
[Figure: decision diagram with axes Rmsd, Å (ticks 3, 5) and Rnar, % (ticks 70, 85, 100); Z-score ≥ 3.5; regions labeled "Not similar", "Similar", and "Similar if Rmat ≥ 80%"]
24 SPDC (Structural Protein Domain Classification): Implementation
Sun HPC 10000 / 64 cpu (parallel using MPI)
Sun HPC 4000 / 4 cpu (single-cpu)
320 cpu hours: building representative set of domains
2 cpu hours: building DNA-binding domain classification
Fully automated updates (quarterly -> monthly -> weekly, current with PDB)
25 1JMCA (183-299): SCOP b.40.4.3 - Single strand
DNA-binding domain; b.40 - OB-fold
28 Helix-turn-helix DNA binding motif
Myb Proto-Oncogene Protein a.4.1.3
MATA-1 (homeodomain) a.4.1.1 1.10.10.6
HIN RECOMBINASE a.4.1.2 1.10.10.6
Rmsd 1.7 Å, Z-score 3.9, sequence identity 12.2%
Rmsd 2.3 Å, Z-score 3.7, sequence identity 13.5%
Rmsd 2.0 Å, Z-score 3.7, sequence identity 15.4%
29 TFIIB C-domain binds DNA
Retinoblastoma A pocket
Cell division protein kinase
E7 peptide
TFIIB N-domain binds TBP protein
Cyclin A3
Rmsd 1.2 Å, Z-score 4.6, sequence identity 18%
Rmsd 2.4 Å, Z-score 3.7, sequence identity 6%
The 3HB motif could be involved in the interaction
with DNA as well as with other proteins
30 Study of structural and functional annotation and
classification of three helical bundle motifs
(3HB) containing HTH DNA-binding motifs
Representative 3HB motifs containing the HTH
DNA-binding motif, from Luscombe et al., 2000.
31 Study of structural and functional annotation and classification of three helical bundle motifs (3HB) containing HTH DNA-binding motifs
Plan:
- Comparison of the representative 3HBs with PDB structures using the enhanced CE algorithm: (i) detailed representation of residues, (ii) multiple HSPs.
- Post-filtering of structural neighbours based on similarity with the representative 3HBs using features (solvent accessibility, polarity of environment, secondary structure, structural similarity).
- Study of the resulting set of 3HBs: sequence similarity, structural environment, clustering, functional classification.
32 Carbamoyl phosphate synthetase
γδ-resolvase
Rmsd 1.3 Å, Z-score 4.1, sequence identity 14%
The 3HB motif in carbamoyl phosphate synthetase
appears to be involved in the stabilization of the
tetramer