Title: BIOLOGICAL NETWORKS
1BIOLOGICAL NETWORKS
2BIOLOGICAL NETWORKS
- Introduction
- Biological Networks
- Protein-Protein Interaction Networks
- Signaling and Metabolic Pathway Networks
- Expression Networks
- Biological Networks Properties
- Databases
- Discussion
- STM Clustering Model
3Introduction
4Bioinformatics
- Informatics
- Its carrier is a set of digital codes and a
language.
- In its manifestation in the space-time
continuum, it has utility (e.g. to decrease the
entropy of an open system).
- Bioinformatics
- The essence of life is information (i.e. from
digital code to emergent properties of
biosystems).
- Bioinformatics is the study of the information
content of life.
5Proteomics
6From the particular to the universal
A.-L. Barabási and Z. Oltvai, Science, 2002
7Genome Size
8Proteome Size (PDB)
9BIOLOGICAL NETWORK
- Networks are found in biological systems of
varying scales:
- 1. Evolutionary tree of life
- 2. Ecological networks
- 3. Expression networks
- 4. Regulatory networks
- genetic control networks of organisms
- 5. The protein interaction network in cells
- 6. The metabolic network in cells
- and more biological networks
10Why Study Networks?
- It is increasingly recognized that complex
systems cannot be described in a reductionist
view.
- Understanding the behavior of such systems starts
with understanding the topology of the
corresponding network.
- Topological information is fundamental in
constructing realistic models for the function of
the network.
11Biological Network Model
- Network
- A graph: a set of nodes interconnected by edges.
- Node
- Protein, peptide, or non-protein biomolecules.
- Edges
- Biological relationships, e.g., interactions,
regulation, reactions, transformations,
activation, inhibition.
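The node and edge definitions above can be sketched as a small labeled graph. This is only an illustrative data structure, not something taken from the slides: the class name `BioNetwork`, its methods, and the molecule names are all hypothetical.

```python
from collections import defaultdict


class BioNetwork:
    """Undirected labeled graph: nodes are biomolecules, edges are relationships."""

    def __init__(self):
        # node -> {neighbor: relationship label}
        self.adj = defaultdict(dict)

    def add_edge(self, a, b, relation):
        """Record a relationship (e.g. 'interaction', 'inhibition') between a and b."""
        self.adj[a][b] = relation
        self.adj[b][a] = relation

    def neighbors(self, node):
        """Sorted list of nodes directly linked to `node`."""
        return sorted(self.adj.get(node, {}))

    def degree(self, node):
        """Number of direct neighbors of `node`."""
        return len(self.adj.get(node, {}))


# Hypothetical molecules, used only to exercise the structure.
net = BioNetwork()
net.add_edge("ProteinA", "ProteinB", "interaction")
net.add_edge("ProteinA", "MetaboliteX", "reaction")
net.add_edge("ProteinB", "ProteinC", "inhibition")
```

An adjacency map keeps neighbor lookups cheap, which matters for the degree and path-length statistics discussed later in the deck.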
12Biological Network Model
- It is usually represented by a 2-D diagram with
characteristic symbols linking the protein and
non-protein entities.
- A circle indicates a protein or a non-protein
biomolecule.
- A symbol in between indicates the nature of the
molecule-molecule process (activation,
inhibition, association, dissociation, etc.)
13Protein Interaction Network
14Proteins in a cell
- There are thousands of different active proteins
in a cell acting as
- enzymes, catalysts of the chemical reactions of
metabolism
- components of cellular machinery (e.g. ribosomes)
- regulators of gene expression
- Certain proteins play specific roles in special
cellular compartments.
- Others move from one compartment to another as
signals.
15Protein Interactions
- Proteins perform a function as a complex rather
than as a single protein.
- Knowing whether two proteins interact can help us
discover unknown protein functions.
- If the function of one protein is known, the
function of its binding partners is likely to be
related: guilt by association.
- Thus, having a good method for detecting
interactions can allow us to use a small number
of proteins with known function to characterize
new proteins.
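A minimal sketch of guilt by association, assuming the simplest possible rule: an uncharacterized protein is assigned the function that is most common among its annotated binding partners. The function name `predict_function`, the majority-vote rule, and the toy protein identifiers are assumptions for illustration, not a method described in the slides.

```python
from collections import Counter


def predict_function(protein, interactions, known_functions):
    """Return the most common known function among the protein's partners, or None."""
    partners = set()
    for a, b in interactions:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    votes = Counter(known_functions[p] for p in partners if p in known_functions)
    if not votes:
        return None  # no annotated partner, so nothing to infer
    best = max(votes.values())
    # Ties are broken alphabetically so the result is deterministic.
    return min(f for f, n in votes.items() if n == best)


# Toy data (hypothetical identifiers and annotations).
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P5", "P6")]
known_functions = {"P2": "transport", "P3": "transport", "P4": "mitosis"}
prediction = predict_function("P1", interactions, known_functions)
```

Here `P1` has two partners annotated with transport and one with mitosis, so the vote returns transport; `P5` has no annotated partner and stays unpredicted.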
16Protein Interactions
P. Uetz et al., Nature, 2000; Ito et al., PNAS,
2001
17Yeast Protein Interaction Network
Nodes: proteins
Links: physical interactions (binding)
18Pathway Networks
19Signaling and Metabolic Pathway Network
- A pathway can be defined as a modular unit of
interacting molecules that fulfills a cellular
function.
- Signaling Pathway Networks
- In biology a signal or biopotential is an
electric quantity (voltage, current, or field
strength) caused by chemical reactions of
charged ions.
- Signaling refers to any process by which a cell
converts one kind of signal or stimulus into another.
- Another use of the term lies in describing the
transfer of information between and within cells,
as in signal transduction.
- Metabolic Pathway Networks
- A series of chemical reactions occurring within a
cell, catalyzed by enzymes, resulting in either
the formation of a metabolic product to be used
or stored by the cell, or the initiation of
another metabolic pathway.
20A Pathway Example
21A Pathway Example
22A Pathway Example
23Regulatory Network
- a collection of DNA segments (genes) in a cell
which interact with each other and with other
substances in the cell, thereby governing the
rates at which genes in the network are
transcribed into mRNA.
24Regulatory Network
25Expression Network
- A network representation of genomic data.
- Inferred from genomic data, e.g. microarray data.
26BIOLOGICAL NETWORK PROPERTY
- Interaction Network
- Pathway Network
- Regulatory Network
- Expression Network
27Biological Networks Properties
- Power-law degree distribution: rich get richer
- Small world: a small average path length
- Mean shortest node-to-node path
- Robustness: resilient and strongly resistant to
failure under random attacks, but vulnerable to
targeted attacks
- Hierarchical modularity: a large clustering
coefficient
- How many of a node's neighbors are connected to
each other
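Two of the properties listed above, the degree distribution and the average path length, can be measured directly on an adjacency map. The sketch below is illustrative only: the function names and the toy star graph are assumptions, and path lengths are computed by breadth-first search on an unweighted, undirected graph.

```python
from collections import Counter, deque


def degree_distribution(adj):
    """Map each degree k to the fraction of nodes that have exactly k neighbors."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: c / n for k, c in counts.items()}


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Toy star graph: hub "h" connected to four leaves.
star = {"h": {"a", "b", "c", "d"},
        "a": {"h"}, "b": {"h"}, "c": {"h"}, "d": {"h"}}
dist_k = degree_distribution(star)
apl = average_path_length(star)
```

In the star graph a single hub keeps every pair of nodes within two hops, which is the intuition behind hubs producing short paths.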
28Power Law Network
- Preferential attachment during growth: the
probability that a new vertex will be connected
to vertex i depends on the connectivity of that
vertex
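Growth with preferential attachment can be sketched as follows. This is a simplified generator in the spirit of the Barabási-Albert model, under stated assumptions: the starting core is a small complete graph, each new vertex attaches to `m` distinct existing vertices, and the function name and parameters are illustrative.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Grow an n-node graph; each new node attaches to m existing nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumption: start from a fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


edges = barabasi_albert(n=200, m=2)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
```

Because well-connected vertices are drawn more often, early vertices tend to accumulate links and become hubs, which is the "rich get richer" effect.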
29The Barabási-Albert BA model
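[Figure: ER model, WS model, actors, power grid, WWW; panels (a) Random Networks and (b) Power-law Networks]
Power Law Network (Scale Free)
- In a random network, the probability of finding a
highly connected node decreases exponentially with
k; in a power-law (scale-free) network it decays
only as a power of k, so hubs occur.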
30Small World Property
- A small average path length
- Any node can be reached within a small number of
edges, 4-5 hops.
31Power Law Network
- Power-law degree distributions and small-world
phenomena are also observed in
- communication networks
- web graphs
- research citation networks
- social networks
- Classical Erdős-Rényi type random graphs do not
exhibit these properties
- Links between pairs of a fixed set of nodes are
picked uniformly
- Maximum degree is logarithmic in network size
- No hubs to make short connections between nodes
32Attack Tolerance
- Complex systems maintain their basic functions
even under errors and failures
(cell → mutations; Internet → router breakdowns)
33Attack Tolerance
- Robust. For γ < 3, removing nodes does not break
the network into islands.
- Very resistant to random attacks, but attacks
targeting key nodes are more dangerous.
[Figure: plots of Max Cluster Size and Path Length]
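The contrast between random failure and targeted attack can be illustrated by measuring the largest connected component after node removal. The sketch below uses a deliberately extreme toy network (one hub, nine leaves); the function name and the network are assumptions for illustration, not data from the slides.

```python
def largest_component(adj, removed=()):
    """Size of the largest connected component after deleting `removed` nodes."""
    removed = set(removed)
    seen, best = set(), 0
    for start in adj:
        if start in removed or start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


# Toy hub-and-spoke network: hub 0 linked to nodes 1..9.
hub_net = {0: set(range(1, 10))}
for leaf in range(1, 10):
    hub_net[leaf] = {0}

intact = largest_component(hub_net)
after_random_failure = largest_component(hub_net, removed=[5])   # a leaf fails
after_targeted_attack = largest_component(hub_net, removed=[0])  # the hub is hit
```

Losing a leaf barely changes the largest component, while losing the hub shatters the network into isolated nodes.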
34Protein Interaction Network
H. Jeong, S.P. Mason, A.-L. Barabási, and Z.N.
Oltvai, Nature, 2001
35Protein Interaction Network
- The yeast protein interaction network seems to
reveal some basic graph theoretic properties:
- The frequency of proteins having interactions
with exactly k other proteins follows a power
law.
- The network exhibits the small-world phenomenon:
any node can be reached within a small number of
hops, usually 4 or 5.
- Robustness: resilient and strongly resistant
to failure under random attacks, but vulnerable to
targeted attacks.
36 Hierarchical Modularity
E. Ravasz et al., Science, 2002
37Hierarchical Modularity
Protein Networks
Metabolic Networks
E. Ravasz et al., Science, 2002
38Implications From Observations
- Biological complexity: number of states ~ 2^(number of genes).
- Protein hubs are critical for cells (45%).
- Infections will target highly connected nodes.
- Cascading node failures could cause a critical
problem.
- Developing drugs and treatments with novel
strategies, such as targeting effective nodes, is
indispensable.
39Databases
40Protein Databases
- Swiss-Prot (non-redundant database)
- Release 41.0, 11/4/2003: 124,464 entries.
- Release 41.5, 23/4/2002: 125,236 entries.
- TrEMBL (translations of EMBL nucleotide sequences
not yet integrated into Swiss-Prot)
- Release 23.7, 17/4/2003: 863,248 entries
- This number keeps growing rapidly, mainly due to
large-scale sequencing projects.
41Protein Interaction Databases
- Species-specific
- FlyNets - Gene networks in the fruit fly
- MIPS - Yeast Genome Database
- RegulonDB - A DataBase On Transcriptional
Regulation in E. coli
- SoyBase
- PIMdb - Drosophila Protein Interaction Map
database
- Function-specific
- Biocatalysis/Biodegradation Database
- BRITE - Biomolecular Relations in Information
Transmission and Expression
- COPE - Cytokines Online Pathfinder Encyclopaedia
- Dynamic Signaling Maps
- EMP - The Enzymology Database
- FIMM - A Database of Functional Molecular
Immunology
- CSNDB - Cell Signaling Networks Database
42Protein Interaction Databases
- Interaction type-specific
- DIP - Database of Interacting Proteins
- DPInteract - DNA-protein interactions
- Inter-Chain Beta-Sheets (ICBS) - A database of
protein-protein interactions mediated by
interchain beta-sheet formation
- Interact - A Protein-Protein Interaction database
- GeneNet (Gene networks)
- General
- BIND - Biomolecular Interaction Network Database
- BindingDB - The Binding Database
- MINT - a database of Molecular INTeractions
- PATIKA - Pathway Analysis Tool for Integration
and Knowledge Acquisition
- PFBP - Protein Function and Biochemical Pathways
Project
- PIM (Protein Interaction Map)
43Pathway Databases
- KEGG (Kyoto Encyclopedia of Genes and Genomes)
- http://www.genome.ad.jp/kegg/
- Institute for Chemical Research, Kyoto
University
- PathDB
- http://www.ncgr.org/pathdb/index.html
- National Center for Genomic Resources
- SPAD: Signaling PAthway Database
- Graduate School of Genetic Resources Technology,
Kyushu University
- Cytokine Signaling Pathway DB
- Dept. of Biochemistry, Kumamoto Univ.
- EcoCyc and MetaCyc
- Stanford Research Institute
- BIND (Biomolecular Interaction Network Database)
- UBC, Univ. of Toronto
44KEGG
- Pathway Database: computerize current knowledge
of molecular and cellular biology in terms of the
pathway of interacting molecules or genes.
- Genes Database: maintain gene catalogs of all
sequenced organisms and link each gene product to
a pathway component.
- Ligand Database: organize a database of all
chemical compounds in living cells and link each
compound to a pathway component.
- Pathway Tools: develop new bioinformatics
technologies for functional genomics, such as
pathway comparison, pathway reconstruction, and
pathway design.
49Discussion
- Problems
- Network Inference
- Microarrays, protein chips, and other
high-throughput assay methods
- Function prediction
- The function of 40-50% of the new proteins is
unknown
- Understanding biological function is important
for
- Study of fundamental biological processes
- Drug design
- Genetic engineering
- Functional module detection
- Cluster analysis
- Topological Analysis
- Descriptive and Structural
- Locality Analysis
- Essential Component Analysis
- Dynamics Analysis
- Signal Flow Analysis
- Metabolic Flux Analysis
- Steady State, Response, Fluctuation Analysis
50Signal Transduction Model Based Functional Module
Detection Algorithm for Protein-Protein
Interaction Networks
- Woochang Hwang1
- Young-Rae Cho1
- Aidong Zhang1
- Murali Ramanathan2
- 1Department of Computer Science and Engineering,
- State University of New York at Buffalo
- 2Department of Pharmaceutical Sciences,
- State University of New York at Buffalo
51Contents
- Introduction
- Protein Interaction Networks
- Functional Categories
- Functional Module Detection Algorithm
- Signal Transduction Model (STM)
- Experimental Results
- Discussion
- Future Works
52Introduction
- Cellular functions are carried out in a coordinated
way by groups of genes and gene products.
- Detection of such functional modules in a complex
molecular network is one of the most challenging
problems.
- Molecular networks: high data volume, high noise
level, sparse connectivity, etc.
- PPI data
- S. cerevisiae full PPI data in DIP: over 4900
proteins and 18000 interactions.
- PPI data give us a good opportunity to
analyze the underlying principles and the
structure of large living systems.
53Cluster Assessment
- Clustering Coefficient
$C(v) = \frac{2\,|\{(i, j) : i, j \in N(v),\ (i, j) \in E\}|}{d(v)\,(d(v) - 1)}$
- N(v) is the set of the direct neighbors of node v,
d(v) is the number of the direct neighbors of
node v, and E is the set of edges.
- Betweenness Centrality
$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$
- $\sigma_{st}$ is the number of shortest paths from node s
to t and $\sigma_{st}(v)$ the number of shortest paths
from s to t that pass through the node v.
- P-value
$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{n}{i}\binom{G-n}{C-i}}{\binom{G}{C}}$
- C is the size of the cluster containing k
proteins with a given function; G is the size of
the universal set of known proteins, which
contains n proteins with the function.
- The p-value is the probability that a cluster
would be enriched with proteins with a particular
function by chance alone.
- Density
$D(s) = \frac{2e}{n\,(n - 1)}$
- n is the number of proteins and e is the number
of interactions in a subgraph s of a PPI
network.
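Three of the assessment measures above can be sketched in a few lines. The formulas used here are the standard ones (local clustering coefficient, undirected graph density, and the hypergeometric tail for enrichment), which is an assumption about what the slide's missing equations showed; the function names and the toy graph are illustrative. Betweenness centrality is left out for brevity.

```python
from math import comb


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for i in nbrs for j in nbrs if i < j and j in adj[i])
    return 2 * links / (d * (d - 1))


def density(n, e):
    """Density of an undirected subgraph with n nodes and e edges."""
    return 2 * e / (n * (n - 1)) if n > 1 else 0.0


def enrichment_p_value(G, n, C, k):
    """Hypergeometric tail: chance that a cluster of size C drawn from G proteins
    (n of which have the function) contains at least k proteins with it."""
    below_k = sum(comb(n, i) * comb(G - n, C - i) for i in range(k))
    return 1 - below_k / comb(G, C)


# Toy graph: triangle a-b-c plus a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
cc_a = clustering_coefficient(toy, "a")
p = enrichment_p_value(G=10, n=5, C=2, k=2)
```

A small p-value means the cluster's functional homogeneity is unlikely to arise by chance, so lower is better when comparing clustering methods.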
54Protein-Protein Interaction (PPI) Data and MIPS
Functional Category Data
- DIP Yeast Protein Interaction core data
- 2521 proteins, 5949 interactions
- Average clustering coefficient: 0.069
- Average path length: 5.47
- MIPS Functional Category
- 457 Hierarchical Functional Categories
- Subgraphs of each functional category are
extracted from DIP core data.
- Average graph density: 0.0025
- Average diameter (longest shortest path in a
graph): 4.23
55MIPS functional modules in DIP Protein-Protein
Interaction (PPI) Network
Figure 1. (a) Mitochondrial transport: 19
singletons, diameter 6. (b) Mitosis: 20
singletons, diameter 3.
56Topological Properties of MIPS Functional Modules
in DIP Protein Interaction Data
- Sparse connectivity: low density, with isolated
subgraphs and singletons.
- Longish shape: high diameter
57Related works
- Distance Based Approaches
- Several distance metrics were introduced
- Use traditional clustering algorithms
- Graph Based Approaches
- Density based approaches: Maximal Cliques, Quasi
Cliques, RNSC, HCS, MCODE
- Statistical approaches: MCL, Samantha
58Related works
- Existing approaches are limited in how they
cluster.
- They identify only clusters with specific shapes,
e.g., balanced round shapes with high density.
- But actual functional modules are not as
densely connected as these methods expect.
- Some members of a functional category do not have
direct physical interactions with other members of
the category they belong to.
- Modules with longish shapes are frequently
observed.
- The incompleteness of clustering is another
distinct drawback of existing algorithms, which
produce many small clusters and singletons.
59Contribution
- Unexpected properties of functional categories
and sparse connectivity in PPI networks.
- A relative excess of emphasis on density in
existing methods favors clusters with relatively
balanced round shapes, leads to a high discarding
rate, and limits performance.
- STM Clustering Model
- Effective clustering should be able to detect
clusters with arbitrary shape and density if the
cluster members share biological and topological
similarities.
- To take those unexpected properties of PPI
networks and actual functional modules into
consideration and to overcome the drawbacks of
existing approaches, the STM clustering model uses
a statistical signal transduction model to find
modules whose members share common biological
features even though they are sparsely connected.
- The STM model also incorporates the network's
topological properties.
60STM Clustering Model
- Process 1: Simulation of dynamic statistical
signal transduction behavior in the network.
- The STM model simulates dynamic signal
transduction behavior to find, for each protein in
the PPI network, the proteins that influence it
most, biologically and topologically.
- Process 2: Selection of the putative cluster
representatives for each node.
- Process 3: Preliminary cluster formation.
- Preliminary clusters are formed by
accumulating each node toward its chosen
representatives.
- Process 4: Cluster merge.
- So far, STM has considered only the biological
features and topological connectivity of the
network and its components, not similarity among
preliminary clusters.
- Clusters that have significant interconnections
between them should have substantial similarity.
- In Process 4, STM merges the clusters that
have substantial similarity.
61Statistical Signal Transduction Model
- Signal transduction behavior of the network is
modeled by the Erlang distribution, a special
case of the Gamma distribution.
$F(c) = \frac{(x/b)^{c-1}\, e^{-x/b}}{b\,(c-1)!}$   (1)
- where c > 0 is the shape parameter, b > 0 is the
scale parameter, and x > 0 is the independent
variable, usually time.
- The Erlang distribution with x/b = 1 is used and
the value of c is set to the number of nodes
between the source protein node and the target
protein node.
- Setting the value of x/b to unity assesses the
perturbation at the target protein when the
perturbation reaches 1/e of its initial value at
the nearest neighbor of the source protein node.
62Statistical Signal Transduction Model
- Statistically, the Erlang distribution represents
the time required to carry out a sequence of c
tasks whose durations follow identical exponential
probability distributions.
- It represents the chance that the actual time to
accomplish c tasks will be less than or equal to
b.
Figure 2. The pharmacodynamic signal transduction
model whose bolus response is an Erlang
distribution. The b is the time constant for
signal transfer and c is the number of
compartments.
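The Erlang-based signal function can be evaluated numerically as a quick check. This sketch assumes the standard Erlang density with integer shape c and scale b, evaluated at x/b = 1, and assumes c >= 1 so that a direct neighbor of the source corresponds to c = 1; the function names are illustrative.

```python
from math import exp, factorial


def erlang_pdf(x, b, c):
    """Erlang density with scale b > 0 and integer shape c >= 1."""
    return (x / b) ** (c - 1) * exp(-x / b) / (b * factorial(c - 1))


def signal_strength(c, b=1.0):
    """F(c): the Erlang density evaluated at x/b = 1 (assumption: c >= 1)."""
    return erlang_pdf(b, b, c)


# Signal strength for targets one to five steps away, with b = 1.
strengths = [signal_strength(c) for c in range(1, 6)]
```

With b = 1 the value at c = 1 is 1/e, matching the statement that the perturbation has fallen to 1/e of its initial value at the nearest neighbor, and the values then fall off factorially as c grows.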
63Topologically Modified Signal Transduction Model
- The Erlang distribution was further weighted to
reflect network topology.
$S(v \to w) = \frac{d(v)}{\prod_{i \in P(v,w)} d(i)} \times F(c)$   (2)
- d(i) is the degree of node i, P(v,w) is the set
of all visited nodes on the shortest path from
node v to node w excluding the source node v and
target node w, and F(c) is the signal
transduction behavior function.
- The perturbation induced by the source protein
node was assumed to be proportional to its degree
and to follow the shortest path to the target
protein node.
- Our choice of the shortest path is motivated by
the finding that the majority of flux prefers the
path of least resistance in many physicochemical
and biological systems.
- During transduction to the target protein node,
the perturbation was assumed to be dissipated at
each intermediate node visited in proportion to
the reciprocal of the degree of each intermediate
node visited.
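A minimal sketch of the topologically weighted signal, following the description above: the signal starts proportional to the source degree, travels along a shortest path, is divided by the degree of each intermediate node, and is scaled by the Erlang term F(c). Several details are assumptions of this sketch rather than statements from the slides: c is taken as the number of hops from source to target, b is set to 1, ties between equally short paths are broken by sorted node order, and the function names are hypothetical.

```python
from collections import deque
from math import exp, factorial


def shortest_path(adj, source, target):
    """One shortest path from source to target via BFS, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj[u]):  # sorted only to make tie-breaking deterministic
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def signal(adj, v, w):
    """Signal from source v arriving at target w along a shortest path."""
    path = shortest_path(adj, v, w)
    if path is None or len(path) < 2:
        return 0.0
    c = len(path) - 1                   # assumption: c = hops from v to w
    f_c = exp(-1) / factorial(c - 1)    # Erlang density at x/b = 1 with b = 1
    dissipation = 1
    for i in path[1:-1]:                # intermediate nodes only
        dissipation *= len(adj[i])
    return len(adj[v]) / dissipation * f_c


# Toy network: chain A-B-C with an extra leaf D attached to A.
toy = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B"}, "D": {"A"}}
to_neighbor = signal(toy, "A", "B")
to_distant = signal(toy, "A", "C")
```

In the toy network the signal from A reaches its neighbor B at strength 2/e and the more distant C at 1/e, because it is halved at the intermediate node B of degree 2.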
64Process 1 Signal Transduction Simulation
Figure 3. Blue arrows are signals from node A
and red ones are signals from node H. Results for
other nodes are not shown.
68Process 2 Representatives Selection
Figure 4. A simple network. Each box contains the
numerical values obtained from Equation 2, from
source nodes A, F, G, and H to other target nodes
although signals should be propagated from every
node in the network. Results for other nodes are
not shown.
69Process 3 Preliminary Clusters Formation
- Figure 5. Three preliminary clusters, {A, B, C,
D, E, F}, {F, G, L, N}, and {G, H, I, J, K, M}, are
obtained after Process 3.
70Cluster Merge
- Similarity of two clusters i and j
$\mathrm{Similarity}(i, j) = \frac{\mathrm{interconnectivity}(i, j)}{\mathrm{minsize}(i, j)}$   (3)
- where interconnectivity(i, j) is the number of
connections between clusters i and j, and
minsize(i, j) is the size of the smaller cluster
among clusters i and j.
- The pair of clusters that has the highest
similarity is merged in each iteration, and the
merge process iterates until the highest
similarity of all cluster pairs is less than a
given threshold.
- When interconnectivity(i, j) > minsize(i, j),
clusters i and j have substantial
interconnections.
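The merge step can be sketched as a greedy loop: compute the similarity of every cluster pair, merge the best pair, and stop once the best similarity drops below the threshold. This is an illustrative sketch under stated assumptions: similarity is interconnectivity divided by the smaller cluster size, interconnectivity counts edges with one endpoint in each cluster, and the function names and toy data are hypothetical. It recomputes all pairs each round for clarity rather than using a priority queue.

```python
def interconnectivity(edges, ci, cj):
    """Edges with one endpoint in each cluster (an assumption of this sketch)."""
    return sum(1 for a, b in edges
               if (a in ci and b in cj) or (a in cj and b in ci))


def similarity(edges, ci, cj):
    """interconnectivity(i, j) / minsize(i, j)."""
    return interconnectivity(edges, ci, cj) / min(len(ci), len(cj))


def merge_clusters(edges, clusters, threshold):
    """Greedily merge the most similar pair until none reaches the threshold."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1:
        score, i, j = max(
            (similarity(edges, clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        if score < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Toy network with three preliminary clusters.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (4, 5), (6, 7)]
preliminary = [{1, 2, 3}, {4, 5}, {6, 7}]
merged_loose = merge_clusters(edges, preliminary, threshold=1.0)
merged_strict = merge_clusters(edges, preliminary, threshold=2.0)
```

With a threshold of 1.0 the first two toy clusters merge because two edges join them and the smaller has two members; with 2.0 nothing merges, which mirrors how a higher threshold leaves more clusters.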
71Process 4 Cluster Merge
Figure 6. Two clusters, {A, B, C, D, E, F, G, L,
N} and {G, H, I, J, K, M}, are obtained after the
merge process when 1.0 is used as the merge
threshold.
72Process 4 Cluster Merge
- Figure 7. Three clusters, {A, B, C, D, E, F}, {F,
G, L, N}, and {G, H, I, J, K, M}, are obtained after
Process 4 when 2.0 is used as the merge
threshold.
73Experimental Results
- Protein Interaction Data
- The core data of S. cerevisiae was obtained from
the DIP database.
- 2526 proteins and 5949 filtered reliable physical
interactions.
- S. cerevisiae provides an important test bed for
the study of PPI networks: it is a well-studied
organism for which most proteomics data are
available, by virtue of a defined and relatively
stable proteome, full genome clone libraries,
established molecular biology experimental
techniques, and an assortment of well-designed
genomics databases.
74Clustering Performance Analysis
60 clusters; average size 40.1; average density
0.2145; average P-value 13.7; average hit
51.7; average unknown 5.1
Table 1. All 60 clusters that have more than 4
proteins
75Comparative Analysis
Table 2. Performance analyses of the clusters
with more than 4 members.
- Other methods can only detect clusters of
small size.
- The relatively high P-scores of other methods
(e.g., Maximal Clique, Quasi Clique, Samantha)
should be read in light of their high discarding
rates:
- Due to the mass production of small clusters
with fewer than 5 members
- Due to the discarding of sparsely connected
proteins
- Due to high overlaps among many small clusters
that are highly enriched for the same function
76Computational Complexity
- Our signal transduction based model is
fundamentally built on an all-pairs shortest
path algorithm that measures the distance
between all pairs of nodes in O(V^2 log V + VE),
where V is the number of nodes and E is the
number of edges in a network.
- The time required to find the best cluster pair
that has the most interconnections is O(k^2 log k)
using a heap-based priority queue, where k is
the number of preliminary clusters.
- But k is much smaller than V in sparse networks
like the yeast PPI network.
- So the total time complexity of our algorithm is
bounded by the time consumed in measuring the
distance between all pairs of nodes, which is
O(V^2 log V + VE).
77Discussion
- In head-to-head comparisons, our algorithm
outperformed competing approaches and is capable
of effectively detecting both dense and sparsely
connected, biologically relevant functional
modules with fewer discards.
- The clusters identified had p-values that are 2.2
orders of magnitude or approximately 125-fold
lower than Quasi clique, the best performing
alternative clustering method, on biological
function.
- The incompleteness of clustering is another
distinct drawback of existing algorithms, which
produce many small clusters and singletons.
- Our method discarded only about 7.8% of proteins,
far fewer than the other approaches, which
discarded 59% on average.
- In conclusion, our method has strong
pharmacodynamics-based underpinnings and is an
effective, versatile approach for analyzing
protein-protein interactions.
78Thanks!