Graph Spectral Analysis presentation

Title: Graph Spectral Analysis

2

Graph Spectral Analysis
Active Site Analysis
Folding Clusters
Quaternary Association of Proteins
Domain Identification Domain Interface Residues
Recurrence Quantification Analysis
Structural Identification
Protein Aggregation/ Folding
Structural and Topological properties
Global Network Partitioning
Protein Structural Organization
Autonomous Folding Units
Folding-important residues
Native vs Decoy discrimination

RQA
Nonlinear technique
Transform original series into its embedding
matrix (EM) based on delays
Rows of embedding matrix correspond to windows of
length 4
Based on computation of Euclidean distance
between the rows of the EM
Looking for epochs close to each other

Recurrence
Point that repeats itself
Most basic of relations
Strictly local and independent of any
mathematical assumption about the system
Calculation of recurrence requires no
transformation of data
Can be used for both linear and nonlinear systems

5
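EMBEDDING DIMENSION 4, LAG 1
Series: 10 11 21 32 41 35 40 19
Delayed copies of the series (lags 0 to 3):
10 11 21 32 41 35 40 19
11 21 32 41 35 40 19
21 32 41 35 40 19
32 41 35 40 19
EMBEDDING MATRIX (each row is a window of length 4):
10 11 21 32
11 21 32 41
21 32 41 35
32 41 35 40
41 35 40 19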
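A minimal Python sketch of the delay embedding, using the series from this slide with embedding dimension 4 and lag 1. The function name `embed` is an illustrative choice and is not taken from the RQA package.

```python
def embed(series, dim, lag):
    """Build the embedding matrix: one row per window of `dim` points spaced `lag` apart."""
    n_rows = len(series) - (dim - 1) * lag
    return [[series[i + k * lag] for k in range(dim)] for i in range(n_rows)]


series = [10, 11, 21, 32, 41, 35, 40, 19]
em = embed(series, dim=4, lag=1)
for row in em:
    print(row)
```

With 8 points, dimension 4 and lag 1 this gives 5 rows, the first being 10 11 21 32 and the last 41 35 40 19.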
6

Pairwise distances between each pair of rows of
the EM are calculated
If d < r, then a dot is placed on the recurrence
plot
Application of this computation produces a
Recurrence Plot (RP)
Symmetrical M x M matrix
Point placed at (i,j) whenever ||Xi - Xj|| < r
Graphically represented by a dot
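The rule above can be sketched in Python as follows. This is only an illustration: the radius of 20 is an arbitrary value picked for the toy series, and `recurrence_matrix` is a hypothetical helper name, not a function of the RQA package.

```python
import math


def recurrence_matrix(rows, radius):
    """Return an M x M 0/1 matrix with 1 at (i, j) when rows i and j are closer than `radius`."""
    m = len(rows)
    rp = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if math.dist(rows[i], rows[j]) < radius:
                rp[i][j] = 1
    return rp


# Embedding matrix of the example series (dimension 4, lag 1).
em = [[10, 11, 21, 32], [11, 21, 32, 41], [21, 32, 41, 35],
      [32, 41, 35, 40], [41, 35, 40, 19]]
rp = recurrence_matrix(em, radius=20.0)  # radius chosen only for illustration
for line in rp:
    print("".join("#" if v else "." for v in line))
```

The main diagonal is always filled because every row is at distance 0 from itself, and the matrix is symmetric because the Euclidean distance is.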

7
Recurrence Quantification Analysis (RQA)
PROTEIN SIGNALS
8
Recurrence Quantification Analysis (RQA)
PROTEIN SIGNALS RECURRENCE PLOT
9
Thermophilic and Mesophilic Proteins
10
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY

Rubredoxins
Member of the family of redox metalloenzymes
Relatively short polypeptide chain (53 AA)
Ferrous/Ferric ion tetrahedrally bound to four
Cys residues
Huge difference in stability
Pyrococcus furiosus (Rubr Pyrfu)
Thermophilic
Clostridium pasteurianum (Rubr Clopa)
Most similar in terms of 3D structure
Mesophilic

11
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
Thermophilic Proteins
Mesophilic Proteins
12
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
13
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
14
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY

Kurtosis of recurrence distributions
Kurtosis: index of the relative peakedness of a
distribution
Higher kurtosis => more concentrated recurrence
patterns

15
Protein Aggregation
16
Protein Aggregation
THE PROBLEM

Protein aggregation is an important problem
Involvement of protein misfolding in Alzheimer's,
Huntington's and prion diseases
Possibility of forming aggregates is intrinsic to
proteins
How to distinguish between proteins that form
aggregates and proteins that don't?
In principle, a change of environmental conditions
could drive any protein to form multimeric
aggregates

17
Protein Aggregation
THE PROBLEM

Main driving force in protein folding
Need to be soluble in water
Proteins fold so as to
Hide hydrophobic residues in core
Expose polar residues to solvent
Protein-Protein interaction can be considered as
similar to protein folding
Mainly hydrophobic character of folding processes
led to the study of hydrophobicity patterning
along the sequence using RQA
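1D RQA needs a numeric series, so the sequence is first mapped to per-residue hydrophobicity values. A minimal Python sketch of that step follows. The Kyte-Doolittle scale is an assumption (the slides do not name the scale that was used), and the peptide is made up for illustration.

```python
# Kyte-Doolittle hydropathy values (an assumed scale; the slides do not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_series(sequence):
    """Map a one-letter amino acid sequence to a list of hydropathy values."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]


peptide = "MKVLAAGIVG"  # hypothetical peptide, for illustration only
profile = hydrophobicity_series(peptide)
print(profile)
```

The resulting series can then be embedded and compared row by row, exactly as for any other signal.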

18
Protein Aggregation
RQA

9 groups studied
A: almost pure natural α-helix
B: almost pure natural β-sheet
C: mainly α-helix polymerizing
D: aggregating systems, e.g. amyloids
E: natively unfolded proteins
F: proteins undergoing a lot of PPI, e.g. DNA repair
systems
G: synthetic α-helices
H: synthetic β-sheets
I ?-helices

19
Protein Aggregation
RQA

Self aggregating systems (D) similar to
DNA-processing proteins (F) and ?-helices (I)
Connection between D and F may be linked to the
capability of forming multimeric aggregates
Polymerizing proteins (C) have a distinct
patterning
G and H have clear-cut peaks

20
Protein Aggregation
RQA CDA

Canonical Discriminant Analysis (CDA)
Variant of linear model
Y = XB + e
Y: dependent variables
X: independent variables
B: regression coefficients
e: errors
Model used to calculate variables which can best
separate the data into classes

21
Protein Aggregation
RQA CDA

Canonical Variates plot
CV1 vs CV2
CV3 vs CV4
CV1 vs CV2
CV1 is a measure of shape
Synthetic α-helices (G) and natural β-sheets (B)
situated at extremes
Regular arrangement of secondary structures
H (artificial β-sheets) is also unique in that
it is most periodic
Natural Amyloid proteins are towards the center

22
Protein Aggregation
RQA CDA

CV3 vs CV4
CV3 models the opposition between D and G on one
hand and all other proteins on the other
CV4 represents a fine tuning of different spectra
D (amyloid) is near DNA repair (F) and also near
artificial β-sheets (H)

23
Protein Aggregation
RQA CLUSTERING

K-means clustering was carried out on CDA
variables
D group goes together with A (natural α-helices),
F (DNA repair) and I (?-helices)
Implications
β-sheets not a prerequisite for aggregation
Oligomerization has a pivotal role to play in
amyloid-forming systems
Formation of high polymers of collagen type (C)
follows a different mechanism

24
Protein Secondary Structures
25
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE

RQA shows that higher dimensional recurrences
could be captured by single variables
By creating time/space delayed versions of
the signal
Setting a radius
For multi-dimensional data, no need to do
embedding
For PDB structures
x, y, z co-ordinates available
No need for embedding matrix
Recurrence plots are mathematically equivalent to
simple contact (adjacency) matrices

26
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE

Distance matrices were built for proteins with
the Cα x,y,z co-ordinates
Euclidean distance between each pair of residues
was calculated
Radius = 6 Å
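A minimal Python sketch of this 3D recurrence (contact) matrix. The coordinates below are hypothetical points spaced 3.8 Å apart along a line, not a real structure, and reading the Cα atoms from a PDB file is left out. The strict "<" comparison is an assumption made to match the d < r rule used for recurrence plots.

```python
import math


def contact_map(ca_coords, radius=6.0):
    """Return an N x N 0/1 matrix with 1 where two C-alpha atoms are closer than `radius` (in Å)."""
    n = len(ca_coords)
    return [[1 if math.dist(ca_coords[i], ca_coords[j]) < radius else 0
             for j in range(n)]
            for i in range(n)]


# Hypothetical C-alpha coordinates: four residues on a straight line, 3.8 Å apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cm = contact_map(coords)
for row in cm:
    print(row)
```

In this toy chain only sequence neighbours are in contact, since residues two apart are already 7.6 Å from each other.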

27
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
28
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE

Parallel and anti-parallel lines
However, parallel lines are produced
both by α-helices and β-sheets
Introduced a new parameter to separate between
these
Based on number of recurrences in a 10 Cα-atom
stretch

29
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE

3D lattice models of proteins
Construct a 3D model
120 residues
α-helix (A)
Antiparallel β-sheet (B)
Parallel β-sheet (C)

30
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
31
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
Trp Repressor
Tumor Necrosis Factor Alpha Protein
32
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
Chloramphenicol Acetyltransferase
33
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE

Relationship with DSSP

34
Protein Topological Architecture using RQA
35
Protein Topological Architecture
THE PROBLEM

Significant literature exists reporting a general
scaling of different geometrical features of
protein 3D structure with chain length
Contact maps as shown before can be used to study
this phenomenon
REC exhibits a scaling law
This measure makes it possible to detect a general
scaling of protein structures
Two different topological modes
Small proteins: < 180 aa residues
Large proteins: > 320 aa residues
Between these two, a zone with a rich repertoire
of topological arrangements

36
Protein Topological Architecture
RQA

Dataset
91 monomeric proteins
Well characterized structures
Cover almost entire range of secondary structures
RQA
Applied to the 91 proteins
Bend near 100 residues

37
Protein Topological Architecture
RQA

Ratio of Surface Volume (SV) and REC3D plotted
against protein length
Divergence beginning near 180 and ending at
around 320
Looking at theoretical REC3D, divergence appears
near 274

38
Protein Topological Architecture
RQA IMPLICATIONS

Explanation of this based on geometric
considerations of protein folding
As chain length increases, a critical value of
surface volume (SV) to compactness (REC3D) is
achieved and a divergence occurs
This divergence implies a structural instability
May be due to the natural limit for domain length
Singularity might
Suggest a maximal preferred domain size
Mark the transition between single domain and
multi-domain proteins

39
Protein Topological Architecture
RQA APPLICATIONS

Used as a check of consistency of predicted
structures
Figure shows that almost all correct models fall
in a tight band around the scaling curve
If the compactness of the model is far from the
reference line, then the model is erroneous

40
Protein Topological Architecture
HYDROPHOBIC PATCHES

Understanding of folding dynamics by studying
hydrophobic patches
What are minimal sized patches important for
folding?
Data
1977 single chain proteins
RQA was applied

41
Protein Topological Architecture
HYDROPHOBIC PATCHES

Histogram of Maximal length recurrence lines
(MAXL) shows a peak at around 6 residues
Plot of MAXL vs Entropic contribution also shows
a similar peak around 6
Implies that 6-residue words contain maximal
information for hydrophobicity nucleation in
globular proteins
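In standard RQA usage, MAXL is the length of the longest diagonal line in the recurrence plot, with the main diagonal excluded. A minimal Python sketch under that assumption follows; the function name and the toy matrix are illustrative only.

```python
def max_diagonal_line(rp):
    """Length of the longest run of recurrent points along any diagonal above the main one."""
    m = len(rp)
    longest = 0
    for offset in range(1, m):  # offset 0 is the main diagonal (trivial self-recurrence)
        run = 0
        for i in range(m - offset):
            if rp[i][i + offset]:
                run += 1
                longest = max(longest, run)
            else:
                run = 0
    return longest


# Toy symmetric recurrence matrix with one diagonal line of length 3.
rp = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
maxl = max_diagonal_line(rp)
print(maxl)
```

Only the upper triangle is scanned because the recurrence plot is symmetric.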

42
Protein Global Network Partitioning
43
Protein Global Network Partitioning
THE PROBLEM

Protein Folding
Many models
However, all in agreement that small regions of
proteins tend to fold separately
Stabilized by interactions between these distinct
units
Proteins can thus be considered as collections of
small units
Autonomous folding units/motifs/domains
Domain
Defined as regions containing maximum
intra-cluster connectivity and minimum
inter-cluster connectivity

44
Protein Global Network Partitioning
ALGORITHM
45
Protein Global Network Partitioning
ALGORITHM
Intra-module Connectivity
Participation Coefficient (Inter-module
Connectivity)

R1 Ultra-peripheral nodes
R2 Peripheral Nodes
R3 Non-hub Connector Nodes
R4 Non-hub Kinless nodes
R5 Provincial Hubs
R6 Connector Hubs
R7 Kinless Hubs
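The two axes of this role plot can be sketched in Python as below, assuming the definitions from the Guimera et al. paper cited in the references: z(i) = (k_i - mean k of its module) / (standard deviation of k in its module), where k_i counts the links of node i inside its own module, and P(i) = 1 - sum over modules s of (k_is / k_i)^2. Whether GANDivA follows exactly these conventions (for example the population standard deviation used here) is an assumption, and the toy graph is made up.

```python
import math
from collections import defaultdict


def role_coordinates(adjacency, module_of):
    """adjacency: node -> set of neighbours; module_of: node -> module label.

    Returns (z, P): within-module degree z-score and participation coefficient per node.
    """
    within = {}
    participation = {}
    for node, nbrs in adjacency.items():
        links = defaultdict(int)  # number of links from `node` into each module
        for nb in nbrs:
            links[module_of[nb]] += 1
        within[node] = links[module_of[node]]
        degree = len(nbrs)
        participation[node] = (
            1.0 - sum((c / degree) ** 2 for c in links.values()) if degree else 0.0
        )

    members = defaultdict(list)
    for node, mod in module_of.items():
        members[mod].append(node)

    z = {}
    for nodes in members.values():
        ks = [within[n] for n in nodes]
        mean = sum(ks) / len(ks)
        std = math.sqrt(sum((k - mean) ** 2 for k in ks) / len(ks))
        for n in nodes:
            z[n] = (within[n] - mean) / std if std > 0 else 0.0
    return z, participation


# Toy graph: module "a" = {1, 2, 3, 7}, module "b" = {4, 5, 6}, bridged by the edge 3-4.
adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 7}, 7: {3},
             4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
module_of = {1: "a", 2: "a", 3: "a", 7: "a", 4: "b", 5: "b", 6: "b"}
z, p = role_coordinates(adjacency, module_of)
for node in sorted(adjacency):
    print(node, module_of[node], round(z[node], 6), round(p[node], 6))
```

Node 3 has the highest z in its module and a non-zero P because one of its four links leaves the module; nodes with all links inside their own module get P = 0.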

46
Protein Global Network Partitioning
ALGORITHM

Used the Guimera idea for partitioning Protein
Networks into modules
Algorithm based on Genetic Algorithm instead of
on Simulated Annealing
Algorithm is based on natural partitioning
Independent of interaction strength etc.
Data
1420 single chain proteins
< 20% identity with each other
Resolution < 2 Å
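The slides do not state which objective function the genetic algorithm maximizes. Assuming it is the modularity used in the Guimera et al. paper cited in the references, M = sum over modules s of [l_s / L - (d_s / 2L)^2], where L is the total number of links, l_s the number of links inside module s and d_s the sum of the degrees of its nodes, a candidate partition could be scored as in this Python sketch. The edge list and partition are a made-up toy example.

```python
def modularity(edges, module_of):
    """M = sum over modules of [l_s / L - (d_s / (2 L)) ** 2] for an undirected edge list."""
    total = len(edges)
    inside = {}      # l_s: links with both ends in module s
    degree_sum = {}  # d_s: sum of degrees of the nodes in module s
    for u, v in edges:
        mu, mv = module_of[u], module_of[v]
        degree_sum[mu] = degree_sum.get(mu, 0) + 1
        degree_sum[mv] = degree_sum.get(mv, 0) + 1
        if mu == mv:
            inside[mu] = inside.get(mu, 0) + 1
    return sum(inside.get(s, 0) / total - (d / (2 * total)) ** 2
               for s, d in degree_sum.items())


# Toy network: two triangles joined by the single edge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
partition = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
score = modularity(edges, partition)
print(round(score, 6))
```

Putting all six nodes into one module gives M = 0, so the two-triangle split scores higher, which is the behaviour a modularity-maximizing search relies on.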

47
Protein Global Network Partitioning
HIERARCHICAL NATURE

Hierarchical nature of protein organization
Module frequency greatest for a module size of
around 12
Maximal module size scales linearly with protein
length

48
Protein Global Network Partitioning
HIERARCHICAL NATURE

Scaling of Modularity/Length with protein length
Similar to scaling of REC3D with protein length
The curve in this case occurs earlier, around 50
residues
Maximal preferred folding unit size?
Points again to a hierarchical (almost
fractal-like) behavior

49
Protein Global Network Partitioning
HYDROPHOBICITY AND RASA

P-z space was partitioned into 8 x 10 bins
The hydrophobicities and Relative Accessible
Surface Area (RASA) of the residues falling
within these bins were identified
Standard deviation of the inter-module
hydrophobicities and RASA shown
Variance of both hyd and RASA decreases with
increasing module size
Implies that we are moving towards an average
protein
At low module sizes pure hyd or pure polar
modules possible
Not so at larger module sizes
Variance flattens out at around 30 residues

50
Protein Global Network Partitioning
HYDROPHOBICITY AND RASA

Maximal connected patches of polar and
hydrophobic residues identified in each and every
module
Frequency plots of intra-module patch size for
hydrophobic and polar patches
Surprisingly, they are almost exactly the same
Maximal patch size peaks at around 12
Implications
Hydrophobic and polar residues start aggregating
together at the beginning of the folding process
in a like-meets-like behavior
These mini-clusters must solve the communication
problem
Happens around 30 residues
After this topological constraints come into play
Our modules thus correspond to foldons, or
early folding units

51
Protein Global Network Partitioning
SECONDARY STRUCTURES

At smaller module sizes, we observe propensity to
have a single secondary structural feature
As the module approaches the 30 residue barrier,
we observe a breaking of this pattern
Consistent with the mixing together of the
all-hydrophobic and all-polar modules

52
Protein Global Network Partitioning
INVARIANCE OF P/z

Plot of protein P-z space
Invariance of the P-z space
Typical dentist's-chair sort of behavior
Notice that high P/z valued residues are
significant
Correspond to connector hubs
We characterized well studied proteins with their
P/z values.
High P/z valued residues act as early folding
residues
Important for folding

53
Protein Global Network Partitioning
INVARIANCE OF P/z

Ubiquitin
Show the partitioning of the proteins in modules
as well as colored by high P/z valued residues
Very few connector hubs

54
Protein Global Network Partitioning
INVARIANCE OF P/z

Show the top P/z valued hits for ubiquitin
Most of the top hits are within two residues of a
residue protected during folding (obtained from
experimental studies)
Figure legend: protected residue; within one
residue of a protected residue; within two
residues of a protected residue

55
Protein Global Network Partitioning
INVARIANCE OF P/z

Distribution of RASA over the 8 x 10 bins of the
P/z plot for all 1420 proteins
Contour plot shown here
Two regions of low RASA (red areas) corresponding
to buried regions
Surrounded by polar (blue) residues
Analogous to the Ramachandran plot?

56
Protein Global Network Partitioning
NATIVE vs DECOY DISCRIMINATION

Native vs Decoy structures distribution
Clear differences between the native (top) and
decoy (bottom) distributions
Can we use this to discriminate between Native
and Decoy structures?

57
Protein Global Network Partitioning
NATIVE vs DECOY DISCRIMINATION

PCA carried out on the RASA distribution values
Clear separation visible between native and decoy
structures

Giuliani et al., Nonlinear Signal Analysis
Methods in the Elucidation of Protein
Sequence-Structure Relationships, Chem. Rev.,
2002, 102, 1471-1491
Giuliani et al., Mapping protein sequence spaces
by recurrence quantification analysis: a case
study on chimeric structures, Protein Eng., 2000,
13(10), 671-678
Webber et al., Elucidating protein secondary
structures using alpha carbon recurrence
quantifications, Proteins Struct. Funct. Bioinf.,
2001, 44, 292-303
Giuliani et al., Nonlinear methods in the
analysis of protein sequences: a case study in
Rubredoxins, Biophysical Journal, 2000, 78,
136-148
Zbilut et al., Protein Aggregation/Folding: The
role of deterministic singularities of sequence
hydrophobicity as determined by nonlinear signal
analysis of acylphosphatase and Aβ(1-40),
Biophysical Journal, 2003, 85, 3544-3557
Zbilut et al., A topologically related
singularity suggests a maximum preferred domain
size for protein domains, Proteins Struct. Funct.
Bioinf., 2006, 65
Zbilut et al., Entropic criteria for
protein folding derived from recurrences: Six
residues patch as the basic protein word, FEBS
Letters, 2006, 580, 4861-4864
Guimera et al., Functional cartography of complex
metabolic networks, Nature, 2005, 433, 895-899
Krishnan et al., Network scaling invariants help
to elucidate basic topological principles of
proteins, under review, 2007
Krishnan et al., A topology-energetics nexus
informs native vs decoy protein distinctions,
under review, 2007

This is the second of a two part assignment.
For the selected protein for which you carried
out the graph spectral analysis, employ the 1D
and 3D RQA and report results
Can you identify secondary structural features
using 3D RQA?
What does the 1D RQA using hydrophobicity tell
you about the protein?
For the same protein, partition into modules
using the GANDivA algorithm and report results
Look at how the modules are distributed
Which are the high P/z valued residues?
What does the P/z plot look like?

Download RQA.tar.gz from
http://www.iab.keio.ac.jp/krishnan/downloads/Course_Materials/RQA.tar.gz
Copy this to some directory, say RQA
Then do
tar zxf RQA.tar.gz
make RQA1D
make RQA3D
This will create two executables RQA1D and RQA3D
Run the program to see the USAGE
Download and install GANDivA as told in the first
lecture (refer to the slides from the first
lecture)

Protein Modularity Detection
Download from
http://www.iab.keio.ac.jp/krishnan/downloads/Course_Materials/GANDivA.tar.gz
Copy this to your account on cacao.bioinfo.ttck.keio.ac.jp
Installation Instructions
Untar and unzip the package using
tar zxf GANDivA.tar.gz
Change directory to the main GANDivA directory
cd GANDivA_v1.0
Run the install script
perl install_gandiva.pl
You will be asked a series of questions

Installation Instructions
a) Do you want to install a parallel version of
the program? Y/N
Y
b) Do you have MPI Installed? Y/N
Y
c) Where would you like to have PGA installed?
Give full path of directory in which to install
E.g. /home/krishnan/GANDivA_v1.0/PGA
d) Please enter the path of the MPI library file
/usr/local/lib/libmpich.a
e) Please enter the path of the MPI include
directory
/usr/local/include
This should automatically install the GANDivA
binary in
/path/to/GANDivA_v1.0/bin

./GANDivA -a <Adjmat file> -n <# of Residues> -g 10000 -G 1000 -o <Output file>
Output looks like this
Nodes Modules z(i) P(i) P/z abs(P/z)
Score 0.524569
1 4 0.048422 0.000000 0.000000 0.000000
2 4 0.048422 0.493827 10.198457 10.198457
3 4 1.888448 0.165289 0.087526 0.087526
4 4 0.508428 0.512397 1.007805 1.007805
5 4 0.048422 0.444444 9.178612 9.178612
Remember that this is a stochastic algorithm.
Hence you have to run it a few times (at least 50
times) and take the result with the highest
Modularity score (given on the first line of the
output file). In the example here, it is 0.524569.
For those of you on cacao, you should submit the
different runs using a submit script to the job
scheduler
For those of you running it on your own machines
you can use a simple shell script to do it in
series. Since your proteins are all small, you
should be able to run this in about 20 minutes
for I in `seq 100`; do
./GANDivA -a <Adjmat file> -n <# of Residues> -g 10000 -G 1000 -o R$I.out
done
This will run the algorithm 100 times and the
output files will be called R1.out, R2.out, ...,
R100.out
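Once the runs have finished, the one with the highest Modularity score has to be picked out. A minimal Python sketch for that step is given below. It assumes each output file contains a line of the form "Score 0.524569", as in the example output; the helper names `read_score` and `best_run` are illustrative and are not part of GANDivA.

```python
import glob
import re


def read_score(path):
    """Return the Modularity score from one output file, or None if no 'Score' line is found."""
    with open(path) as handle:
        for line in handle:
            match = re.match(r"\s*Score\s+([-+0-9.eE]+)", line)
            if match:
                return float(match.group(1))
    return None


def best_run(pattern="R*.out"):
    """Return (score, filename) of the run with the highest score, or None if nothing matched."""
    scored = [(read_score(path), path) for path in glob.glob(pattern)]
    scored = [(score, path) for score, path in scored if score is not None]
    return max(scored) if scored else None


if __name__ == "__main__":
    print(best_run())
```

Run it in the directory that holds R1.out to R100.out; files without a Score line are skipped rather than treated as errors.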
