Title: Graph Spectral Analysis
2- Graph Spectral Analysis
- Active Site Analysis
- Folding Clusters
- Quaternary Association of Proteins
- Domain Identification and Domain Interface Residues
- Recurrence Quantification Analysis
- Structural Identification
- Protein Aggregation/ Folding
- Structural and Topological properties
- Global Network Partitioning
- Protein Structural Organization
- Autonomous Folding Units
- Folding-important residues
- Native vs Decoy discrimination
3- RQA
- Nonlinear technique
- Transform original series into its embedding matrix (EM) based on delays
- Rows of embedding matrix correspond to windows of length 4
- Based on computation of Euclidean distance between the rows of the EM
- Looking for epochs close to each other
4- Recurrence
- Point that repeats itself
- Most basic of relations
- Strictly local and independent of any mathematical assumption about the system
- Calculation of recurrence requires no transformation of data
- Can be used for both linear and nonlinear systems
5EMBEDDING DIMENSION 4 LAG 1
Series: 10 11 21 32 41 35 40 19
EMBEDDING MATRIX (each row is a window of length 4, shifted by a lag of 1)
10 11 21 32
11 21 32 41
21 32 41 35
32 41 35 40
41 35 40 19
6- Pairwise distances between each pair of rows of the EM are calculated
- If d < r, then a dot is placed on the recurrence plot
- Application of this computation produces a Recurrence Plot (RP)
- Symmetrical M x M matrix
- Point placed at (i,j) whenever ||Xi - Xj|| < r, where Xi and Xj are rows of the EM and r is the radius
- Graphically represented by a dot
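The embedding and recurrence-plot steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `embedding_matrix` and `recurrence_plot` are hypothetical, and the radius of 20 is an arbitrary value chosen for the toy series, not one taken from the slides.

```python
import numpy as np

def embedding_matrix(series, dim=4, lag=1):
    """Rows are delayed windows (x[i], x[i+lag], ..., x[i+(dim-1)*lag])."""
    x = np.asarray(series, dtype=float)
    n_rows = len(x) - (dim - 1) * lag
    if n_rows < 1:
        raise ValueError("series too short for this dim/lag")
    return np.array([x[i : i + dim * lag : lag] for i in range(n_rows)])

def recurrence_plot(em, radius):
    """Boolean M x M matrix: True where the Euclidean distance ||Xi - Xj|| < radius."""
    diff = em[:, None, :] - em[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist < radius

series = [10, 11, 21, 32, 41, 35, 40, 19]
em = embedding_matrix(series, dim=4, lag=1)   # 5 rows of length 4
rp = recurrence_plot(em, radius=20.0)         # radius is an assumed, illustrative value
```

Each `True` entry of `rp` corresponds to one dot on the recurrence plot; the matrix is symmetric and its main diagonal is always recurrent.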
7Recurrence Quantification Analysis (RQA)
PROTEIN SIGNALS
HISTORY
KOENIGSBERG
8Recurrence Quantification Analysis (RQA)
PROTEIN SIGNALS RECURRENCE PLOT
9Thermophilic and Mesophilic Proteins
10Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
- Rubredoxins
- Member of the family of redox metalloenzymes
- Relatively short polypeptide chain (53 AA)
- Ferrous/Ferric ion tetrahedrally bound to four Cys residues
- Huge difference in stability
- Pyrococcus furiosus (Rubr Pyrfu)
- Thermophilic
- Clostridium pasteurianum (Rubr Clopa)
- Most similar in terms of 3D structure
- Mesophilic
11Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
Thermophilic Proteins
Mesophilic Proteins
12Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
13Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
14Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
- Kurtosis of recurrence distributions
- Kurtosis: index of the relative peaked character of a distribution
- Higher kurtosis implies more concentrated recurrence patterns
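As a reminder of what the index measures, here is a minimal sketch of kurtosis as the fourth standardized moment. Which kurtosis convention the original analysis used (plain or excess) is not stated on the slide, so the plain form below is an assumption, and the two sample arrays are made-up values.

```python
import numpy as np

def kurtosis(values):
    """Fourth standardized moment (a normal distribution gives about 3)."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    variance = (centered ** 2).mean()
    return (centered ** 4).mean() / variance ** 2

flat = kurtosis([1, 2, 3, 4, 5])                         # evenly spread values
peaked = kurtosis([0, 0, 0, 0, 0, 0, 0, 0, 10, -10])     # mass concentrated at one value
```

The second sample, with most of its mass at a single value and two outliers, gives the larger kurtosis.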
15Protein Aggregation
16Protein Aggregation
THE PROBLEM
- Protein aggregation is an important problem
- Involvement of protein misfolding in Alzheimer's, Huntington's and prion diseases
- Possibility of forming aggregates intrinsic to proteins
- How to delineate between proteins that form aggregates and proteins that don't?
- In principle, a change of environmental conditions could drive any protein to form multimeric aggregates
17Protein Aggregation
THE PROBLEM
- Main driving force in protein folding
- Need to be soluble in water
- Proteins fold so as to
- Hide hydrophobic residues in core
- Expose polar residues to solvent
- Protein-protein interaction can be considered as similar to protein folding
- Mainly hydrophobic character of folding processes led to the study of hydrophobicity patterning along the sequence using RQA
18Protein Aggregation
RQA
- 9 groups studied
- A: almost pure natural α-helix
- B: almost pure natural β-sheet
- C: mainly α-helix, polymerizing
- D: aggregating systems, e.g. amyloids
- E: natively unfolded proteins
- F: proteins undergoing a lot of PPI, e.g. DNA repair systems
- G: synthetic α-helices
- H: synthetic β-sheets
- I: ?-helices
19Protein Aggregation
RQA
- Self-aggregating systems (D) similar to DNA-processing proteins (F) and ?-helices (I)
- Connection between D and F may be linked to the capability of forming multimeric aggregates
- Polymerizing proteins (C) have a distinct patterning
- G and H have clear-cut peaks
20Protein Aggregation
RQA CDA
- Canonical Discriminant Analysis (CDA)
- Variant of linear model
- Y = XB + e
- Y: dependent variables
- X: independent variables
- B: regression coefficients
- e: errors
- Model used to calculate variables which can best separate the data into classes
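To make the linear model Y = XB + e concrete, the sketch below fits B by ordinary least squares on synthetic data. Note that this is plain least squares for the general linear model, not canonical discriminant analysis itself; the data, the matrix `B_true`, and the noise level are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: 50 observations, 3 independent variables, 2 dependent variables.
X = rng.normal(size=(50, 3))
B_true = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])   # made-up coefficients
Y = X @ B_true + 0.01 * rng.normal(size=(50, 2))           # Y = XB + e

# Least-squares estimate of the regression coefficients B.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ B_hat                                   # estimate of the errors e
```

With low noise, `B_hat` recovers `B_true` closely; CDA builds on the same linear framework but chooses the combinations of variables that best separate predefined classes.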
21Protein Aggregation
RQA CDA
- Canonical Variates plot
- CV1 vs CV2
- CV3 vs CV4
- CV1 vs CV2
- CV1 is a measure of shape
- Synthetic α-helices (G) and natural β-sheets (B) situated at extremes
- Regular arrangement of secondary structures
- H (artificial β-sheets) is also unique in that it is most periodic
- Natural amyloid proteins are towards the center
22Protein Aggregation
RQA CDA
- CV3 vs CV4
- CV3 models the opposition between D and G on one hand and all other proteins on the other
- CV4 represents a fine tuning of different spectra
- D (amyloid) is near DNA repair (F) and also artificial β-sheets (H)
23Protein Aggregation
RQA CLUSTERING
- K-means clustering was carried out on CDA variables
- D group goes together with A (natural α-helices), F (DNA repair) and I (?-helices)
- Implications
- β-sheets not a prerequisite for aggregation
- Oligomerization has a pivotal role to play in amyloid-forming systems
- Formation of high polymers of collagen type (C) follows a different mechanism
24Protein Secondary Structures
25Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
- RQA shows that higher dimensional recurrences could be captured by single variables
- By creating time/space delayed versions of the signal
- Setting a radius
- For multi-dimensional data, no need to do embedding
- For PDB structures
- x, y, z co-ordinates available
- No need for embedding matrix
- Recurrence plots are mathematically equivalent to simple contact (adjacency) matrices
26Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
- Distance matrices were built for proteins with the Cα x,y,z co-ordinates
- Euclidean distance between each pair of residues was calculated
- Radius: 6 Å
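The distance-matrix step can be sketched as follows. The coordinates are a made-up toy chain (four points spaced 3.8 Å apart on a line), `contact_map` is a hypothetical helper name, and treating percent recurrence as the share of recurrent residue pairs above the diagonal is an assumption about how the 3D recurrence measure is defined.

```python
import numpy as np

def contact_map(ca_coords, radius=6.0):
    """Return (distance matrix, boolean contact matrix) from C-alpha coordinates."""
    coords = np.asarray(ca_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist, dist < radius

# Toy chain, NOT a real structure: residues 3.8 Angstrom apart along the x axis.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dist, contacts = contact_map(toy, radius=6.0)

# Percent recurrence over distinct residue pairs (upper triangle, diagonal excluded).
upper = np.triu_indices(len(toy), k=1)
percent_rec = 100.0 * contacts[upper].mean()
```

In this toy chain only sequence neighbours fall within 6 Å, so the contact matrix is the recurrence plot with dots on the first off-diagonals only.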
27Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
28Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
- Parallel and anti-parallel lines
- However, parallel lines are produced
- Both by α-helices and β-sheets
- Introduced a new parameter to separate between these
- Based on number of recurrences in a 10 Cα-atom stretch
29Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
- 3D lattice models of proteins
- Construct a 3D model
- 120 residues
- α-helix (A)
- Antiparallel β-sheet (B)
- Parallel β-sheet (C)
30Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
31Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
Trp Repressor
Tumor Necrosis Factor Alpha Protein
32Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
Chloramphenicol Acetyltransferase
33Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
34Protein Topological Architecture using RQA
35Protein Topological Architecture
THE PROBLEM
- Significant literature exists reporting a general scaling of different geometrical features of protein 3D structure with chain length
- Contact maps as shown before can be used to study this phenomenon
- REC exhibits a scaling law
- This measure allows detection of a general scaling of protein structures
- Two different topological modes
- Small proteins: < 180 aa residues
- Large proteins: > 320 aa residues
- Between these two, a zone with a rich repertoire of topological arrangements
36Protein Topological Architecture
RQA
- Dataset
- 91 monomeric proteins
- Well characterized structures
- Cover almost entire range of secondary structures
- RQA
- Applied to the 91 proteins
- Bend near 100 residues
37Protein Topological Architecture
RQA
- Ratio of Surface Volume (SV) and REC3D plotted against protein length
- Divergence beginning near 180 and ending at around 320
- Looking at theoretical REC3D, divergence appears near 274
38Protein Topological Architecture
RQA IMPLICATIONS
- Explanation of this based on geometric considerations of protein folding
- As chain length increases, a critical value of surface volume (SV) to compactness (REC3D) is achieved and a divergence occurs
- This divergence implies a structural instability
- Maybe due to the natural limit for domain length
- Singularity might
- Suggest a maximal preferred domain size
- Mark the transition between single domain and multi-domain proteins
39Protein Topological Architecture
RQA APPLICATIONS
- Used as a check of consistency of predicted structures
- Figure shows that almost all correct models fall in a tight band around the scaling curve
- If the compactness of the model is far from the reference line, then the model is erroneous
40Protein Topological Architecture
HYDROPHOBIC PATCHES
- Understanding of folding dynamics by studying hydrophobic patches
- What are minimal sized patches important for folding?
- Data
- 1977 single chain proteins
- RQA was applied
41Protein Topological Architecture
HYDROPHOBIC PATCHES
- Histogram of maximal length recurrence lines (MAXL) shows a peak at around 6 residues
- Plot of MAXL vs entropic contribution also shows a similar peak around 6
- Implies that 6-residue words contain maximal information for hydrophobicity nucleation in globular proteins
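For readers unfamiliar with MAXL, the sketch below computes the longest run of consecutive recurrent points along any diagonal parallel to the main diagonal of a recurrence plot. The function name `max_diagonal_line` and the small example matrix are hypothetical, and excluding the main diagonal is an assumed (though common) convention.

```python
import numpy as np

def max_diagonal_line(rp):
    """Longest run of recurrent points on any diagonal above the main diagonal."""
    rp = np.asarray(rp, dtype=bool)
    n = rp.shape[0]
    longest = 0
    for k in range(1, n):                       # main diagonal (k = 0) is excluded
        run = 0
        for value in np.diagonal(rp, offset=k):
            run = run + 1 if value else 0       # extend the current run or reset it
            longest = max(longest, run)
    return longest

# Made-up 6 x 6 recurrence plot with one diagonal line of length 3.
rp = np.eye(6, dtype=bool)
for i in range(3):
    rp[i, i + 2] = rp[i + 2, i] = True

maxl = max_diagonal_line(rp)
```

Because a recurrence plot is symmetric, scanning only the upper triangle is sufficient.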
42Protein Global Network Partitioning
43Protein Global Network Partitioning
THE PROBLEM
- Protein Folding
- Many models
- However, all in agreement that small regions of proteins tend to fold separately
- Stabilized by interactions between these distinct units
- Proteins can thus be considered as collections of small units
- Autonomous folding units/motifs/domains
- Domain
- Defined as regions containing maximum intra-cluster connectivity and minimum inter-cluster connectivity
44Protein Global Network Partitioning
ALGORITHM
45Protein Global Network Partitioning
ALGORITHM
Figure axes: Intra-module Connectivity (z) and Participation Coefficient (Inter-module Connectivity, P)
- R1 Ultra-peripheral nodes
- R2 Peripheral Nodes
- R3 Non-hub Connector Nodes
- R4 Non-hub Kinless nodes
- R5 Provincial Hubs
- R6 Connector Hubs
- R7 Kinless Hubs
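The seven roles come from the cartography of Guimera et al. (Nature, 2005), which places each node by its within-module degree z-score and its participation coefficient P. The sketch below follows the definitions and role boundaries published there; the function names are hypothetical, the toy graph is made up, and whether GANDivA uses exactly the same conventions is an assumption.

```python
import numpy as np

def z_and_p(adj, modules):
    """Within-module degree z-score and participation coefficient for each node."""
    adj = np.asarray(adj, dtype=float)
    modules = np.asarray(modules)
    labels = np.unique(modules)
    # links[i, s] = number of links from node i into module labels[s]
    links = np.stack([adj[:, modules == s].sum(axis=1) for s in labels], axis=1)
    degree = links.sum(axis=1)
    own = links[np.arange(len(modules)), np.searchsorted(labels, modules)]
    z = np.zeros(len(modules))
    for s in labels:
        members = modules == s
        sd = own[members].std()
        if sd > 0:                                   # z stays 0 when all members are alike
            z[members] = (own[members] - own[members].mean()) / sd
    safe_degree = np.where(degree > 0, degree, 1.0)  # avoid division by zero
    p = 1.0 - ((links / safe_degree[:, None]) ** 2).sum(axis=1)
    p[degree == 0] = 0.0
    return z, p

def role(z, p):
    """Role label using the boundaries given by Guimera et al. (hubs have z >= 2.5)."""
    if z < 2.5:
        if p <= 0.05:
            return "R1"
        if p <= 0.62:
            return "R2"
        if p <= 0.80:
            return "R3"
        return "R4"
    if p <= 0.30:
        return "R5"
    if p <= 0.75:
        return "R6"
    return "R7"

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1
z, p = z_and_p(adj, [0, 0, 0, 1, 1, 1])
roles = [role(zi, pi) for zi, pi in zip(z, p)]
```

In the toy graph only the two bridging nodes have links outside their own module, so they are the only ones with a nonzero participation coefficient.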
46Protein Global Network Partitioning
ALGORITHM
- Used the Guimera idea for partitioning Protein Networks into modules
- Algorithm based on Genetic Algorithm instead of on Simulated Annealing
- Algorithm is based on natural partitioning
- Independent of interaction strength etc.
- Data
- 1420 single chain proteins
- < 20% identity with each other
- Resolution < 2 Å
47Protein Global Network Partitioning
HIERARCHICAL NATURE
- Hierarchical nature of protein organization
- Module frequency greatest for a module size of around 12
- Maximal module size scales linearly with protein length
48Protein Global Network Partitioning
HIERARCHICAL NATURE
- Scaling of Modularity/Length with protein length
- Similar to scaling of REC3D with protein length
- The curve in this case occurs earlier, around 50 residues
- Maximal preferred folding unit size?
- Points again to a hierarchical (almost fractal-like) behavior
49Protein Global Network Partitioning
HYDROPHOBICITY AND RASA
- P-z space was partitioned into 8 x 10 bins
- The hydrophobicities and Relative Accessible Surface Area (RASA) of the residues falling within these bins were identified
- Standard deviation of the inter-module hydrophobicities and RASA shown
- Variance of both hyd and RASA decreases with increasing module size
- Implies that we are moving towards an average protein
- At low module sizes pure hyd or pure polar modules possible
- Not so at larger module sizes
- Variance flattens out at around 30 residues
50Protein Global Network Partitioning
HYDROPHOBICITY AND RASA
- Maximal connected patches of polar and hydrophobic residues identified in each and every module
- Frequency plots of intra-module patch size for hydrophobic and polar patches
- Surprisingly, they are almost exactly the same
- Maximal patch size peaks at around 12
- Implications
- Hydrophobic and polar residues start aggregating together at the beginning of the folding process in a like-meets-like behavior
- These mini-clusters must solve the communication problem
- Happens around 30 residues
- After this topological constraints come into play
- Our modules thus correspond to foldons, or early folding units
51Protein Global Network Partitioning
SECONDARY STRUCTURES
- At smaller module sizes, we observe propensity to have a single secondary structural feature
- As the module approaches the 30 residue barrier, we observe a breaking of this pattern
- Consistent with the mixing together of the all-hydrophobic and all-polar modules
52Protein Global Network Partitioning
INVARIANCE OF P/z
- Plot of protein P-z space
- Invariance of the P-z space
- Typical "dentist's chair" sort of behavior
- Notice that high P/z valued residues are significant
- Correspond to connector hubs
- We characterized well studied proteins with their P/z values.
- High P/z valued residues act as early folding residues
- Important for folding
53Protein Global Network Partitioning
INVARIANCE OF P/z
- Ubiquitin
- Show the partitioning of the proteins in modules as well as colored by high P/z valued residues
- Very few connector hubs
54Protein Global Network Partitioning
INVARIANCE OF P/z
- Show the top P/z valued hits for ubiquitin
- Most of the top hits are within two residues of a residue protected during folding (obtained from experimental studies)
- protected residue
- o within one residue of protected residue
- ? within two residues of protected residue
55Protein Global Network Partitioning
INVARIANCE OF P/z
- Distribution of RASA over the 8 x 10 bins of the P/z plot for all 1420 proteins
- Contour plot shown here
- Two regions of low RASA (red areas) corresponding to buried regions
- Surrounded by polar (blue) residues
- Analogous to the Ramachandran plot?
56Protein Global Network Partitioning
NATIVE vs DECOY DISCRIMINATION
- Native vs Decoy structures distribution
- Clear differences between the native (top) and decoy (bottom) distributions
- Can we use this to discriminate between Native and Decoy structures?
57Protein Global Network Partitioning
NATIVE vs DECOY DISCRIMINATION
- PCA carried out on the RASA distribution values
- Clear separation visible between native and decoy
structures
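A PCA of this kind can be sketched with a singular value decomposition. The sketch assumes each structure is described by its 8 x 10 binned RASA distribution flattened to 80 values; the function name `pca_scores` is hypothetical and the input below is random placeholder data, not real RASA values.

```python
import numpy as np

def pca_scores(data, n_components=2):
    """Project rows of `data` onto the leading principal components via SVD."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)                  # PCA works on mean-centered columns
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T        # coordinates along the components
    explained = (s ** 2) / (s ** 2).sum()          # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(1)
data = rng.random((12, 80))       # placeholder: 12 structures x 80 bin values
scores, explained = pca_scores(data)
```

Plotting the first column of `scores` against the second gives the kind of two-dimensional view in which native and decoy structures could be compared.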
58- Giuliani et al., Nonlinear Signal Analysis Methods in the Elucidation of Protein Sequence-Structure Relationships, Chem. Rev., 2002, 102, 1471-1491
- Giuliani et al., Mapping protein sequence spaces by recurrence quantification analysis: a case study on chimeric structures, Protein Eng., 2000, 13(10), 671-678
- Webber et al., Elucidating protein secondary structures using alpha carbon recurrence quantifications, Proteins Struct. Func. Bioinf., 2001, 44, 292-303
- Giuliani et al., Nonlinear methods in the analysis of protein sequences: a case study in Rubredoxins, Biophysical Journal, 2000, 78, 136-148
- Zbilut et al., Protein Aggregation/Folding: The role of deterministic singularities of sequence hydrophobicity as determined by nonlinear signal analysis of acylphosphatase and Aβ(1-40), Biophysical Journal, 2003, 85, 3544-3557
- Zbilut et al., A topologically related singularity suggests a maximum preferred domain size for protein domains, Proteins Struct. Func. Bioinf., 2006, 65
- Joseph P. Zbilut et al., Entropic criteria for protein folding derived from recurrences: Six residues patch as the basic protein word, FEBS Letters, 2006, 580, 4861-4864
- Guimera et al., Functional cartography of complex metabolic networks, Nature, 2005, 433, 895-899
- Krishnan et al., Network scaling invariants help to elucidate basic topological principles of proteins, under review, 2007
- Krishnan et al., A topology-energetics nexus informs native vs decoy protein distinctions, under review, 2007
59- This is the second of a two part assignment.
- For the selected protein for which you carried out the graph spectral analysis, employ the 1D and 3D RQA and report results
- Can you identify secondary structural features using 3D RQA?
- What does the 1D RQA using hydrophobicity tell you about the protein?
- For the same protein, partition into modules using the GANDivA algorithm and report results
- Look at how the modules are distributed
- Which are the high P/z valued residues?
- What does the P/z plot look like?
60- Download RQA.tar.gz from http://www.iab.keio.ac.jp/krishnan/downloads/Course_Materials/RQA.tar.gz
- Copy this to some dir, say RQA
- Then do
- tar zxf RQA.tar.gz
- make RQA1D
- make RQA3D
- This will create two executables RQA1D and RQA3D
- Run the program to see the USAGE
- Download and install GANDivA as told in the first
lecture (refer to the slides from the first
lecture)
61- Protein Modularity Detection
- Download from
- http://www.iab.keio.ac.jp/krishnan/downloads/Course_Materials/GANDivA.tar.gz
- Copy this to account on cacao.bioinfo.ttck.keio.ac.jp
- Installation Instructions
- Unpack the package using
- tar zxf GANDivA.tar.gz
- Change directory to the main GANDivA directory
- cd GANDivA_v1.0
- Run the install script
- perl install_gandiva.pl
- You will be asked a series of questions
62- Installation Instructions
- a) Do you want to install a parallel version of the program? Y/N
- Y
- b) Do you have MPI Installed? Y/N
- Y
- c) Where would you like to have PGA installed?
- Give full path of directory in which to install
- E.g. /home/krishnan/GANDivA_v1.0/PGA
- d) Please enter the path of the MPI library file
- /usr/local/lib/libmpich.a
- e) Please enter the path of the MPI include directory
- /usr/local/include
- This should automatically install the GANDivA binary in
- /path/to/GANDivA_v1.0/bin
63- ./GANDivA -a <Adjmat file> -n <# of Residues> -g 10000 -G 1000 -o <Output file>
- Output looks like this
Nodes Modules z(i) P(i) P/z abs(P/z) Score 0.524569
1 4 0.048422 0.000000 0.000000 0.000000
2 4 0.048422 0.493827 10.198457 10.198457
3 4 1.888448 0.165289 0.087526 0.087526
4 4 0.508428 0.512397 1.007805 1.007805
5 4 0.048422 0.444444 9.178612 9.178612
- Remember that this is a stochastic algorithm. Hence you have to run it a few times (at least 50 times) and take the result with the highest Modularity score (given on the first line of the output file; in the example here, it is 0.524569).
- For those of you on cacao, you should submit the different runs using a submit script to the job scheduler
- For those of you running it on your own machines, you can use a simple shell script to do it in series. Since your proteins are all small, you should be able to run this in about 20 minutes
- for I in $(seq 100); do ./GANDivA -a <Adjmat file> -n <# of Residues> -g 10000 -G 1000 -o R$I.out; done
- This will run the algorithm 100 times and the output files will be called R1.out, R2.out, ..., R100.out
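Picking the run with the highest Modularity score can be automated. The sketch below assumes the first line of each output file contains the word "Score" followed by the value, as in the example output; the helper names `parse_score` and `best_run` are hypothetical, and the second score in the example is made up.

```python
import re

# Assumption: the first line of an output file carries "Score <value>".
SCORE_RE = re.compile(r"Score\s+([-+]?\d*\.?\d+)")

def parse_score(first_line):
    """Extract the modularity score from the first line of an output file."""
    match = SCORE_RE.search(first_line)
    if match is None:
        raise ValueError("no modularity score found")
    return float(match.group(1))

def best_run(first_lines):
    """first_lines maps an output file name to the first line of that file."""
    return max(first_lines, key=lambda name: parse_score(first_lines[name]))

runs = {
    "R1.out": "Nodes Modules z(i) P(i) P/z abs(P/z) Score 0.524569",
    "R2.out": "Nodes Modules z(i) P(i) P/z abs(P/z) Score 0.498120",  # made-up value
}
best = best_run(runs)
```

For real runs, the dictionary could be filled by reading the first line of each of R1.out to R100.out before calling `best_run`.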