Graph Spectral Analysis

2
  • Graph Spectral Analysis
  • Active Site Analysis
  • Folding Clusters
  • Quaternary Association of Proteins
  • Domain Identification and Domain Interface Residues
  • Recurrence Quantification Analysis
  • Structural Identification
  • Protein Aggregation/ Folding
  • Structural and Topological properties
  • Global Network Partitioning
  • Protein Structural Organization
  • Autonomous Folding Units
  • Folding-important residues
  • Native vs Decoy discrimination

3
  • RQA (Recurrence Quantification Analysis)
  • A nonlinear technique
  • Transforms the original series into its embedding
    matrix (EM) based on delays
  • Rows of the embedding matrix correspond to windows
    of length 4 (the embedding dimension)
  • Based on computation of the Euclidean distance
    between the rows of the EM
  • Looks for epochs that are close to each other

4
  • Recurrence
  • Point that repeats itself
  • Most basic of relations
  • Strictly local and independent of any
    mathematical assumption about the system
  • Calculation of recurrence requires no
    transformation of data
  • Can be used for both linear and nonlinear systems

5
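EMBEDDING DIMENSION 4, LAG 1
Original series:
10 11 21 32 41 35 40 19
Lagged copies of the series (lags 0, 1, 2, 3):
10 11 21 32 41 35 40 19
11 21 32 41 35 40 19
21 32 41 35 40 19
32 41 35 40 19
EMBEDDING MATRIX: reading the lagged copies column by column gives the rows
(10 11 21 32), (11 21 32 41), (21 32 41 35), (32 41 35 40), (41 35 40 19)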
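
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the course; the function name `embedding_matrix` is hypothetical, and only the series, the embedding dimension (4) and the lag (1) come from the slide.

```python
def embedding_matrix(series, dim, lag=1):
    """Return the delay-embedding matrix of a 1D series as a list of rows.

    Row i is (series[i], series[i + lag], ..., series[i + (dim - 1) * lag]).
    """
    n_rows = len(series) - (dim - 1) * lag
    if n_rows <= 0:
        raise ValueError("series too short for this dimension and lag")
    return [[series[i + k * lag] for k in range(dim)] for i in range(n_rows)]


# Series, dimension and lag taken from the slide.
series = [10, 11, 21, 32, 41, 35, 40, 19]
em = embedding_matrix(series, dim=4, lag=1)
for row in em:
    print(row)
```

With 8 points, dimension 4 and lag 1, the matrix has 8 - 3 = 5 rows, each a window of 4 consecutive values.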
6
  • Pairwise distances between each pair of rows of
    the EM are calculated
  • If d < r, then a dot is placed on the recurrence
    plot
  • Application of this computation produces a
    Recurrence Plot (RP)
  • Symmetrical M x M matrix
  • Point placed at (i,j) whenever ||Xi - Xj|| < r,
    where Xi and Xj are rows of the EM
  • Graphically represented by a dot
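
As a rough sketch of the rule just described, the following Python builds the M x M recurrence matrix from the rows of an embedding matrix. The function name `recurrence_plot` and the radius value 20 are assumptions chosen for illustration; the slides do not give a radius for this example.

```python
import math


def recurrence_plot(rows, radius):
    """Return an M x M 0/1 matrix with 1 at (i, j) when ||Xi - Xj|| < radius."""
    m = len(rows)
    rp = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            # Euclidean distance between rows i and j of the embedding matrix
            if math.dist(rows[i], rows[j]) < radius:
                rp[i][j] = 1
    return rp


# Embedding matrix of the example series (dimension 4, lag 1).
em = [
    [10, 11, 21, 32],
    [11, 21, 32, 41],
    [21, 32, 41, 35],
    [32, 41, 35, 40],
    [41, 35, 40, 19],
]
rp = recurrence_plot(em, radius=20)  # radius is a hypothetical choice
for line in rp:
    print(" ".join("x" if v else "." for v in line))
```

The main diagonal is always recurrent because every row is at distance 0 from itself, and the matrix is symmetric because the distance is.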

7
Recurrence Quantification Analysis (RQA)
PROTEIN SIGNALS
HISTORY
KOENIGSBERG
8
Recurrence Quantification Analysis (RQA)
PROTEIN SIGNALS RECURRENCE PLOT
9
Thermophilic and Mesophilic Proteins
10
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
  • Rubredoxins
  • Member of the family of redox metalloenzymes
  • Relatively short polypeptide chain (53 AA)
  • Ferrous/Ferric ion tetrahedrally bound to four
    Cys residues
  • Huge difference in stability
  • Pyrococcus furiosus (Rubr_Pyrfu)
  • Thermophilic
  • Clostridium pasteurianum (Rubr_Clopa)
  • Most similar in terms of 3D structure
  • Mesophilic

11
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
Thermophilic Proteins
Mesophilic Proteins
12
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
13
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
14
Recurrence Quantification Analysis (RQA)
THERMAL STABILITY
  • Kurtosis of recurrence distributions
  • Kurtosis: index of the relative peakedness of a
    distribution
  • Higher kurtosis => more concentrated recurrence
    patterns
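
To make the kurtosis index concrete, here is a small Python sketch. It assumes the plain moment-based (Pearson) definition, the fourth central moment divided by the squared variance; the slides do not say which variant was used, and the two sample lists are made-up numbers, not data from the study.

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment over squared variance.

    A normal distribution gives 3 under this definition (assumed here;
    the slides do not specify the variant).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


# Hypothetical samples: one spread evenly, one concentrated around its mean.
flat = [1, 2, 3, 4, 5]
peaked = [3, 3, 3, 3, 3, 3, 3, 1, 5]
print(kurtosis(flat), kurtosis(peaked))
```

The concentrated sample gives the larger value, which matches the reading of higher kurtosis as more concentrated recurrence patterns.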

15
Protein Aggregation
16
Protein Aggregation
THE PROBLEM
  • Protein aggregation is an important problem
  • Involvement of protein misfolding in Alzheimer's,
    Huntington's and prion diseases
  • Possibility of forming aggregates is intrinsic to
    proteins
  • How to distinguish between proteins that form
    aggregates and proteins that don't?
  • In principle, a change in environmental conditions
    could drive any protein to form multimeric
    aggregates

17
Protein Aggregation
THE PROBLEM
  • Main driving force in protein folding
  • Need to be soluble in water
  • Proteins fold so as to
  • Hide hydrophobic residues in core
  • Expose polar residues to solvent
  • Protein-Protein interaction can be considered as
    similar to protein folding
  • Mainly hydrophobic character of folding processes
    led to the study of hydrophobicity patterning
    along the sequence using RQA

18
Protein Aggregation
RQA
  • 9 groups studied
  • A: almost pure natural α-helix
  • B: almost pure natural β-sheet
  • C: mainly α-helix, polymerizing
  • D: aggregating systems, e.g. amyloids
  • E: natively unfolded proteins
  • F: proteins undergoing a lot of PPI, e.g. DNA repair
    systems
  • G: synthetic α-helices
  • H: synthetic β-sheets
  • I: β-helices

19
Protein Aggregation
RQA
  • Self-aggregating systems (D) are similar to
    DNA-processing proteins (F) and β-helices (I)
  • Connection between D and F may be linked to the
    capability of forming multimeric aggregates
  • Polymerizing proteins (C) have a distinct
    patterning
  • G and H have clear-cut peaks

20
Protein Aggregation
RQA CDA
  • Canonical Discriminant Analysis (CDA)
  • Variant of the linear model
  • Y = XB + e
  • Y: dependent variables
  • X: independent variables
  • B: regression coefficients
  • e: errors
  • Model used to calculate variables which can best
    separate the data into classes

21
Protein Aggregation
RQA CDA
  • Canonical Variates plot
  • CV1 vs CV2
  • CV3 vs CV4
  • CV1 vs CV2
  • CV1 is a measure of shape
  • Synthetic α-helices (G) and natural β-sheets (B)
    situated at extremes
  • Regular arrangement of secondary structures
  • H (artificial β-sheets) is also unique in that
    it is most periodic
  • Natural Amyloid proteins are towards the center

22
Protein Aggregation
RQA CDA
  • CV3 vs CV4
  • CV3 models the opposition between D and G on one
    hand and all other proteins on the other
  • CV4 represents a fine tuning of different spectra
  • D (amyloid) is near DNA repair (F) and also near
    artificial β-sheets (H)

23
Protein Aggregation
RQA CLUSTERING
  • K-means clustering was carried out on CDA
    variables
  • D group goes together with A (natural α-helices),
    F (DNA repair) and I (β-helices)
  • Implications
  • β-sheets are not a prerequisite for aggregation
  • Oligomerization has a pivotal role to play in
    amyloid-forming systems
  • Formation of high polymers of collagen type (C)
    follows a different mechanism

24
Protein Secondary Structures
25
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
  • RQA shows that higher dimensional recurrences
    could be captured by single variables
  • By creating time/space delayed versions of
    the signal
  • Setting a radius
  • For multi-dimensional data, no need to do
    embedding
  • For PDB structures
  • x, y, z co-ordinates available
  • No need for embedding matrix
  • Recurrence plots are mathematically equivalent to
    simple contact (adjacency) matrices

26
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
  • Distance matrices were built for proteins from
    the Cα x, y, z coordinates
  • Euclidean distance between each pair of residues
    was calculated
  • Radius: 6 Å
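
A minimal Python sketch of this step, assuming the Cα coordinates are already available as (x, y, z) tuples in Å. The function name `contact_map` and the four toy coordinates (points 3.8 Å apart on a straight line) are hypothetical; only the 6 Å radius comes from the slide.

```python
import math


def contact_map(coords, radius=6.0):
    """Return an N x N 0/1 matrix: 1 when two C-alpha atoms are closer than radius."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < radius else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical C-alpha positions in Angstrom, not taken from a real PDB entry.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(ca_coords, radius=6.0)
for line in cmap:
    print(line)
```

Because the coordinates are already three-dimensional, no embedding step is needed: the thresholded distance matrix is the recurrence plot. Note that this sketch keeps the trivial self-contacts on the main diagonal.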

27
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
28
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
  • Parallel and anti-parallel lines
  • However, parallel lines are produced by both
    α-helices and β-sheets
  • Introduced a new parameter to separate these
  • Based on the number of recurrences in a 10 Cα-atom
    stretch

29
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
  • 3D lattice models of proteins
  • Construct a 3D model
  • 120 residues
  • α-helix (A)
  • Antiparallel β-sheet (B)
  • Parallel β-sheet (C)

30
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
31
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
Trp Repressor
Tumor Necrosis Factor Alpha Protein
32
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
Chloramphenicol Acetyltransferase
33
Recurrence Quantification Analysis (RQA)
SECONDARY STRUCTURE
  • Relationship with DSSP

34
Protein Topological Architecture using RQA
35
Protein Topological Architecture
THE PROBLEM
  • Significant literature exists reporting a general
    scaling of different geometrical features of
    protein 3D structure with chain length
  • Contact maps as shown before can be used to study
    this phenomenon
  • REC exhibits a scaling law
  • This measure makes it possible to detect a general
    scaling of protein structures
  • Two different topological modes
  • Small proteins: < 180 aa residues
  • Large proteins: > 320 aa residues
  • Between these two, a zone with a rich repertoire
    of topological arrangements

36
Protein Topological Architecture
RQA
  • Dataset
  • 91 monomeric proteins
  • Well characterized structures
  • Cover almost entire range of secondary structures
  • RQA
  • Applied to the 91 proteins
  • Bend near 100 residues

37
Protein Topological Architecture
RQA
  • Ratio of Surface Volume (SV) and REC3D plotted
    against protein length
  • Divergence beginning near 180 and ending at
    around 320
  • Looking at theoretical REC3D, divergence appears
    near 274

38
Protein Topological Architecture
RQA IMPLICATIONS
  • Explanation of this based on geometric
    considerations of protein folding
  • As chain length increases, a critical value of
    surface volume (SV) to compactness (REC3D) is
    achieved and a divergence occurs
  • This divergence implies a structural instability
  • Maybe due to the natural limit for domain length
  • Singularity might
  • Suggest a maximal preferred domain size
  • Mark the transition between single domain and
    multi-domain proteins

39
Protein Topological Architecture
RQA APPLICATIONS
  • Used as a check of consistency of predicted
    structures
  • Figure shows that almost all correct models fall
    in a tight band around the scaling curve
  • If the compactness of the model is far from the
    reference line, then the model is erroneous

40
Protein Topological Architecture
HYDROPHOBIC PATCHES
  • Understanding of folding dynamics by studying
    hydrophobic patches
  • What are minimal sized patches important for
    folding?
  • Data
  • 1977 single chain proteins
  • RQA was applied

41
Protein Topological Architecture
HYDROPHOBIC PATCHES
  • Histogram of Maximal length recurrence lines
    (MAXL) shows a peak at around 6 residues
  • Plot of MAXL vs Entropic contribution also shows
    a similar peak around 6
  • Implies that 6-residue words contain maximal
    information for hydrophobicity nucleation in
    globular proteins
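
For readers unfamiliar with MAXL, the sketch below shows one common way to compute it: the longest run of consecutive recurrent points along any diagonal parallel to the main diagonal, with the main diagonal itself excluded. Both that convention and the toy matrix are assumptions; the RQA1D program from the course may differ in detail.

```python
def max_diagonal_line(rp):
    """Longest run of 1s along any diagonal above the main diagonal of rp."""
    m = len(rp)
    best = 0
    for offset in range(1, m):  # offset 0 (the main diagonal) is skipped
        run = 0
        for i in range(m - offset):
            if rp[i][i + offset]:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


# Toy 5 x 5 recurrence matrix (hypothetical, upper triangle filled by hand).
rp = [[0] * 5 for _ in range(5)]
for i, j in [(0, 2), (1, 3), (2, 4), (0, 1), (2, 3)]:
    rp[i][j] = 1
print(max_diagonal_line(rp))
```

In the toy matrix the points (0,2), (1,3), (2,4) form a diagonal line of length 3, while the two points on the first off-diagonal are separated by a gap and count as lines of length 1.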

42
Protein Global Network Partitioning
43
Protein Global Network Partitioning
THE PROBLEM
  • Protein Folding
  • Many models
  • However, all in agreement that small regions of
    proteins tend to fold separately
  • Stabilized by interactions between these distinct
    units
  • Proteins can thus be considered as collections of
    small units
  • Autonomous folding units/motifs/domains
  • Domain
  • Defined as regions containing maximum
    intra-cluster connectivity and minimum
    inter-cluster connectivity

44
Protein Global Network Partitioning
ALGORITHM
45
Protein Global Network Partitioning
ALGORITHM
Intra-module Connectivity
Participation Coefficient (Inter-module Connectivity)
  • R1: Ultra-peripheral nodes
  • R2: Peripheral nodes
  • R3: Non-hub connector nodes
  • R4: Non-hub kinless nodes
  • R5: Provincial hubs
  • R6: Connector hubs
  • R7: Kinless hubs
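
The two quantities behind these roles can be sketched as follows, using the definitions from Guimerà et al.: the within-module degree z-score z(i) and the participation coefficient P(i) = 1 - sum over modules s of (k_is / k_i)^2. This is an illustrative Python sketch, not GANDivA's implementation; the function name, the use of the population standard deviation, and the toy graph are assumptions.

```python
import math


def node_roles(edges, modules):
    """Return (z, p): within-module degree z-score and participation coefficient.

    edges   : list of undirected (i, j) pairs
    modules : modules[i] is the module label of node i
    """
    n = len(modules)
    labels = sorted(set(modules))
    # links[i][s] = number of links from node i to nodes in module s
    links = [{s: 0 for s in labels} for _ in range(n)]
    for a, b in edges:
        links[a][modules[b]] += 1
        links[b][modules[a]] += 1

    z = [0.0] * n
    for s in labels:
        members = [i for i in range(n) if modules[i] == s]
        within = [links[i][s] for i in members]
        mean = sum(within) / len(within)
        # Population standard deviation (an assumption of this sketch)
        sd = math.sqrt(sum((k - mean) ** 2 for k in within) / len(within))
        for i, k in zip(members, within):
            z[i] = (k - mean) / sd if sd > 0 else 0.0

    p = [0.0] * n
    for i in range(n):
        k_total = sum(links[i].values())
        if k_total > 0:
            p[i] = 1.0 - sum((k / k_total) ** 2 for k in links[i].values())
    return z, p


# Toy network: a star (nodes 0-3) in module 0, a pair (4, 5) in module 1,
# and one link between the modules (3-4).
edges = [(0, 1), (0, 2), (0, 3), (4, 5), (3, 4)]
modules = [0, 0, 0, 0, 1, 1]
z, p = node_roles(edges, modules)
print(z)
print(p)
```

Node 0 has the highest z because it holds most of the links inside its module, while nodes 3 and 4 have P = 0.5 because their links are split evenly between the two modules.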

46
Protein Global Network Partitioning
ALGORITHM
  • Used the Guimera idea for partitioning Protein
    Networks into modules
  • Algorithm based on Genetic Algorithm instead of
    on Simulated Annealing
  • Algorithm is based on natural partitioning
  • Independent of interaction strength etc.
  • Data
  • 1420 single chain proteins
  • < 20% identity with each other
  • Resolution < 2 Å
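
For orientation, the modularity that this kind of partitioning maximizes can be written, following Guimerà et al., as Q = sum over modules s of [ l_s / L - (d_s / 2L)^2 ], where L is the total number of links, l_s the number of links inside module s, and d_s the sum of the degrees of its nodes. The slides do not state the exact score GANDivA reports, so the Python below is a sketch under that assumption, with a hypothetical toy network.

```python
def modularity(edges, modules):
    """Modularity Q = sum_s [ l_s / L - (d_s / (2 L))**2 ] of a partition.

    edges   : list of undirected (i, j) pairs
    modules : modules[i] is the module label of node i
    """
    total_links = len(edges)
    labels = set(modules)
    links_in = {s: 0 for s in labels}    # l_s: links with both ends in s
    degree_sum = {s: 0 for s in labels}  # d_s: sum of node degrees in s
    for a, b in edges:
        degree_sum[modules[a]] += 1
        degree_sum[modules[b]] += 1
        if modules[a] == modules[b]:
            links_in[modules[a]] += 1
    return sum(
        links_in[s] / total_links - (degree_sum[s] / (2 * total_links)) ** 2
        for s in labels
    )


# Toy network: two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity(edges, [0, 0, 0, 1, 1, 1]))  # one module per triangle
print(modularity(edges, [0, 0, 0, 0, 0, 0]))  # everything in one module
```

Splitting the toy network into its two triangles scores higher than lumping all nodes together, which is the behavior a search procedure such as a genetic algorithm exploits.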

47
Protein Global Network Partitioning
HIERARCHICAL NATURE
  • Hierarchical nature of protein organization
  • Module frequency greatest for a module size of
    around 12
  • Maximal module size scales linearly with protein
    length

48
Protein Global Network Partitioning
HIERARCHICAL NATURE
  • Scaling of Modularity/Length with protein length
  • Similar to scaling of REC3D with protein length
  • The curve in this case occurs earlier, around 50
    residues
  • Maximal preferred folding unit size?
  • Points again to a hierarchical (almost
    fractal-like) behavior

49
Protein Global Network Partitioning
HYDROPHOBICITY AND RASA
  • P-z space was partitioned into 8 x 10 bins
  • The hydrophobicities and Relative Accessible
    Surface Areas (RASA) of the residues falling
    within these bins were identified
  • Standard deviation of the inter-module
    hydrophobicities and RASA shown
  • Variance of both hydrophobicity and RASA decreases
    with increasing module size
  • Implies that we are moving towards an average
    protein
  • At low module sizes, purely hydrophobic or purely
    polar modules are possible
  • Not so at larger module sizes
  • Variance flattens out at around 30 residues

50
Protein Global Network Partitioning
HYDROPHOBICITY AND RASA
  • Maximal connected patches of polar and
    hydrophobic residues identified in each and every
    module
  • Frequency plots of intra-module patch size for
    hydrophobic and polar patches
  • Surprisingly, they are almost exactly the same
  • Maximal patch size peaks at around 12
  • Implications
  • Hydrophobic and polar residues start aggregating
    together at the beginning of the folding process
    in a like-meets-like behavior
  • These mini-clusters must solve the communication
    problem
  • Happens around 30 residues
  • After this topological constraints come into play
  • Our modules thus correspond to foldons, or
    early folding units

51
Protein Global Network Partitioning
SECONDARY STRUCTURES
  • At smaller module sizes, we observe propensity to
    have a single secondary structural feature
  • As the module approaches the 30 residue barrier,
    we observe a breaking of this pattern
  • Consistent with the mixing together of the
    all-hydrophobic and all-polar modules

52
Protein Global Network Partitioning
INVARIANCE OF P/z
  • Plot of protein P-z space
  • Invariance of the P-z space
  • Typical dentist's-chair sort of behavior
  • Notice that high P/z valued residues are
    significant
  • Correspond to connector hubs
  • We characterized well studied proteins with their
    P/z values.
  • High P/z valued residues act as early folding
    residues
  • Important for folding

53
Protein Global Network Partitioning
INVARIANCE OF P/z
  • Ubiquitin
  • Show the partitioning of the proteins in modules
    as well as colored by high P/z valued residues
  • Very few connector hubs

54
Protein Global Network Partitioning
INVARIANCE OF P/z
  • Show the top P/z valued hits for ubiquitin
  • Most of the top hits are within two residues of a
    residue protected during folding (obtained from
    experimental studies)
  • protected residue
  • o within one residue of protected residue
  • ? within two residues of protected residue

55
Protein Global Network Partitioning
INVARIANCE OF P/z
  • Distribution of RASA over the 8 x 10 bins of the
    P/z plot for all 1420 proteins
  • Contour plot shown here
  • Two regions of low RASA (red areas) corresponding
    to buried regions
  • Surrounded by polar (blue) residues
  • Analogous to the Ramachandran plot?

56
Protein Global Network Partitioning
NATIVE vs DECOY DISCRIMINATION
  • Native vs Decoy structures distribution
  • Clear differences between the native (top) and
    decoy (bottom) distributions
  • Can we use this to discriminate between Native
    and Decoy structures?

57
Protein Global Network Partitioning
NATIVE vs DECOY DISCRIMINATION
  • PCA carried out on the RASA distribution values
  • Clear separation visible between native and decoy
    structures

58
  • Giuliani et al., Nonlinear Signal Analysis
    Methods in the Elucidation of Protein
    Sequence-Structure Relationships, Chem. Rev.,
    2002, 102, 1471-1491
  • Giuliani et al., Mapping protein sequence spaces
    by recurrence quantification analysis: a case
    study on chimeric structures, Protein Eng., 2000,
    13(10), 671-678
  • Webber et al., Elucidating protein secondary
    structures using alpha carbon recurrence
    quantifications, Proteins Struct. Func. Bioinf.,
    2001, 44, 292-303
  • Giuliani et al., Nonlinear methods in the
    analysis of protein sequences: a case study in
    Rubredoxins, Biophysical Journal, 2000, 78,
    136-148
  • Zbilut et al., Protein Aggregation/Folding: The
    role of deterministic singularities of sequence
    hydrophobicity as determined by nonlinear signal
    analysis of acylphosphatase and Aβ(1-40),
    Biophysical Journal, 2003, 85, 3544-3557
  • Zbilut et al., A topologically related
    singularity suggests a maximum preferred domain
    size for protein domains, Proteins Struct. Func.
    Bioinf., 2006, 65
  • Zbilut et al., Entropic criteria for protein
    folding derived from recurrences: Six residues
    patch as the basic protein word, FEBS Letters,
    2006, 580, 4861-4864
  • Guimera et al., Functional cartography of complex
    metabolic networks, Nature, 2005, 433, 895-899
  • Krishnan et al., Network scaling invariants help
    to elucidate basic topological principles of
    proteins, under review, 2007
  • Krishnan et al., A topology-energetics nexus
    informs native vs decoy protein distinctions,
    under review, 2007

59
  • This is the second of a two part assignment.
  • For the selected protein for which you carried
    out the graph spectral analysis, employ the 1D
    and 3D RQA and report results
  • Can you identify secondary structural features
    using 3D RQA?
  • What does the 1D RQA using hydrophobicity tell
    you about the protein?
  • For the same protein, partition into modules
    using the GANDivA algorithm and report results
  • Look at how the modules are distributed
  • Which are the high P/z valued residues?
  • What does the P/z plot look like?

60
  • Download RQA.tar.gz from
    http://www.iab.keio.ac.jp/krishnan/downloads/Course_Materials/RQA.tar.gz
  • Copy this to some directory, say RQA
  • Then do
  • tar zxf RQA.tar.gz
  • make RQA1D
  • make RQA3D
  • This will create two executables RQA1D and RQA3D
  • Run the program to see the USAGE
  • Download and install GANDivA as told in the first
    lecture (refer to the slides from the first
    lecture)

61
  • Protein Modularity Detection
  • Download from
  • http://www.iab.keio.ac.jp/krishnan/downloads/Course_Materials/GANDivA.tar.gz
  • Copy this to your account on
    cacao.bioinfo.ttck.keio.ac.jp
  • Installation Instructions
  • Tar and unzip the package using
  • tar zxf GANDivA.tar.gz
  • Change directory to the main GANDivA directory
  • cd GANDivA_v1.0
  • Run the install script
  • perl install_gandiva.pl
  • You will be asked a series of questions

62
  • Installation Instructions
  • a) Do you want to install a parallel version of
    the program? Y/N
  • Y
  • b) Do you have MPI Installed? Y/N
  • Y
  • c) Where would you like to have PGA installed?
  • Give full path of directory in which to install
  • Eg /home/krishnan/GANDivA_v1.0/PGA
  • d) Please enter the path of the MPI library file
  • /usr/local/lib/libmpich.a
  • e) Please enter the path of the MPI include
    directory
  • /usr/local/include
  • This should automatically install the GANDivA
    binary in
  • /path/to/GANDivA_v1.0/bin

63
  • ./GANDivA -a <Adjmat file> -n <# of Residues> -g
    10000 -G 1000 -o <Output file>
  • Output looks like this
    Nodes  Modules  z(i)      P(i)      P/z        abs(P/z)   Score 0.524569
    1      4        0.048422  0.000000  0.000000   0.000000
    2      4        0.048422  0.493827  10.198457  10.198457
    3      4        1.888448  0.165289  0.087526   0.087526
    4      4        0.508428  0.512397  1.007805   1.007805
    5      4        0.048422  0.444444  9.178612   9.178612
  • Remember that this is a stochastic algorithm.
    Hence you have to run it several times (at least 50
    times) and take the result with the highest
    Modularity score (given on the first line of the
    output file; in the example here, it is 0.524569)
  • For those of you on cacao, you should submit the
    different runs using a submit script to the job
    scheduler
  • For those of you running it on your own machines
    you can use a simple shell script to do it in
    series. Since your proteins are all small, you
    should be able to run this in about 20 minutes
  • for I in $(seq 100); do ./GANDivA -a <Adjmat file>
    -n <# of Residues> -g 10000 -G 1000 -o R$I.out;
    done
  • This will run the algorithm 100 times and the
    output files will be called R1.out, R2.out, ...,
    R100.out
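
Once the runs have finished, picking the run with the highest modularity score can be automated. The Python sketch below assumes that the first line of each output file contains the word "Score" followed by the value, as in the example output above; the exact file layout is not confirmed by the slides, and the helper names `read_score` and `best_run` are hypothetical.

```python
import glob
import re

# Assumed layout: the first line contains "Score <value>" (possibly with ':' or '=').
SCORE_PATTERN = re.compile(r"Score\s*[:=]?\s*([-+]?\d*\.?\d+)")


def read_score(first_line):
    """Extract the modularity score from a header line, or None if absent."""
    match = SCORE_PATTERN.search(first_line)
    return float(match.group(1)) if match else None


def best_run(paths):
    """Return (path, score) of the run with the highest score, or (None, None)."""
    best_path, best_score = None, None
    for path in paths:
        with open(path) as handle:
            score = read_score(handle.readline())
        if score is not None and (best_score is None or score > best_score):
            best_path, best_score = path, score
    return best_path, best_score


if __name__ == "__main__":
    # Looks for R1.out ... R100.out in the current directory.
    print(best_run(sorted(glob.glob("R*.out"))))
```

Run it from the directory holding the R*.out files; if none are found it prints (None, None).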