Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
Lecture 16 Intracellular Networks Graph
theory
2High-throughput Biological Data
- Enormous amounts of biological data are being
generated by high-throughput capabilities even
more are coming - genomic sequences
- gene expression data
- mass spectrometry data
- protein-protein interaction data
- protein structures
- ......
- Hidden in these data is information that reflects
- existence, organization, activity, functionality
of biological machineries at different levels
in living organisms
3Bio-Data Analysis andData Mining
- Existing/emerging bio-data analysis and mining
tools for - DNA sequence assembly
- Genetic map construction
- Sequence comparison and database search
- Gene finding
- .
- Gene expression data analysis
- Phylogenetic tree analysis to infer
horizontally-transferred genes - Mass spec. data analysis for protein complex
characterization -
- Current prevailing mode of work
Developing ad hoc tools for each individual
application
4Bio-Data Analysis and Data Mining
- As the amount and types of data and the needs to
establish connections across multi-data sources
increase rapidly, the number of analysis tools
needed will go up exponentially - blast, blastp, blastx, blastn, from BLAST
family of tools - gene finding tools for human, mouse, fly, rice,
cyanobacteria, .. - tools for finding various signals in genomic
sequences, protein-binding sites, splice junction
sites, translation start sites, ..
Many of these data analysis problems are
fundamentally the same problem(s) and can be
solved using the same set of tools
Developing ad hoc tools for each application
problem (by each group of individual researchers)
may soon become inadequate as bio-data production
capabilities further ramp up
5Data Clustering
- Many biological data analysis problems can be
formulated as clustering problems - microarray gene expression data analysis
- arrayCGH data (chromosomal gains and losses)
- identification of regulatory binding sites
(similarly, splice junction sites, translation
start sites, ......) - (yeast) two-hybrid data analysis (for inference
of protein complexes) - phylogenetic tree clustering (for inference of
horizontally transferred genes) - protein domain identification
- identification of structural motifs
- prediction reliability assessment of protein
structures - NMR peak assignments
- ......
6Data Clustering an example
- Regulatory binding-sites are short conserved
sequence fragments in promoter regions - Solving binding-site identification as a
clustering problem - Project all fragments into Euclidean space so
that similar fragments are projected to nearby
positions and dissimilar fragments to far
positions - Observation conserved fragments form clusters
in a noisy background
........acgtttataatggcg ...... ........ggctttatatt
cgtc ...... ........ccgaatataatcta .......
7Data Clustering Problems
- Clustering partition a data set into clusters so
that data points of the same cluster are
similar and points of different clusters are
dissimilar - Cluster identification -- identifying clusters
with significantly different features than the
background
8Multivariate statistics Cluster analysis
C1 C2 C3 C4 C5 C6 ..
1 2 3 4 5
Raw table
Any set of numbers per column
- Multi-dimensional problems
- Objects can be viewed as a cloud of points in a
multidimensional space - Need ways to group the data
9Multivariate statistics Cluster analysis
C1 C2 C3 C4 C5 C6 ..
1 2 3 4 5
Raw table
Any set of numbers per column
Similarity criterion
Similarity matrix
Scores
55
Cluster criterion
Dendrogram
10Cluster analysis data normalisation/weighting
C1 C2 C3 C4 C5 C6 ..
1 2 3 4 5
Raw table
Normalisation criterion
C1 C2 C3 C4 C5 C6 ..
1 2 3 4 5
Normalised table
Column normalisation x/max Column range
normalisation (x-min)/(max-min)
11Cluster analysis (dis)similarity matrix
C1 C2 C3 C4 C5 C6 ..
1 2 3 4 5
Raw table
Similarity criterion
Similarity matrix
Scores
55
Di,j (?k xik xjkr)1/r Minkowski
metrics r 2 Euclidean distance r 1 City
block distance
12Cluster analysis Clustering criteria
Similarity matrix
Scores
55
Cluster criterion
Dendrogram (tree)
Single linkage - Nearest neighbour Complete
linkage Furthest neighbour Group averaging
UPGMA (phylogeny) Ward Neighbour joining
global measure (phylogeny)
13Cluster analysis Clustering criteria
- Start with N clusters of 1 object each
- Apply clustering distance criterion iteratively
until you have 1 cluster of N objects - Most interesting clustering somewhere in between
distance
Dendrogram (tree)
N clusters
1 cluster
14Single linkage clustering (nearest neighbour)
Char 2
Char 1
15Single linkage clustering (nearest neighbour)
Char 2
Char 1
16Single linkage clustering (nearest neighbour)
Char 2
Char 1
17Single linkage clustering (nearest neighbour)
Char 2
Char 1
18Single linkage clustering (nearest neighbour)
Char 2
Char 1
19Single linkage clustering (nearest neighbour)
Char 2
Char 1
Distance from point to cluster is defined as the
smallest distance between that point and any
point in the cluster
20Single linkage clustering (nearest neighbour)
Let Ci and Cj be two disjoint clusters di,j
Min(dp,q), where p ? Ci and q ? Cj
Single linkage dendrograms typically show
chaining behaviour (i.e., all the time a single
object is added to existing cluster)
21Complete linkage clustering (furthest neighbour)
Char 2
Char 1
22Complete linkage clustering (furthest neighbour)
Char 2
Char 1
23Complete linkage clustering (furthest neighbour)
Char 2
Char 1
24Complete linkage clustering (furthest neighbour)
Char 2
Char 1
25Complete linkage clustering (furthest neighbour)
Char 2
Char 1
26Complete linkage clustering (furthest neighbour)
Char 2
Char 1
27Complete linkage clustering (furthest neighbour)
Char 2
Char 1
28Complete linkage clustering (furthest neighbour)
Char 2
Char 1
Distance from point to cluster is defined as the
largest distance between that point and any point
in the cluster
29Complete linkage clustering (furthest neighbour)
Let Ci and Cj be two disjoint clusters di,j
Max(dp,q), where p ? Ci and q ? Cj
More structured clusters than with single
linkage clustering
30Clustering algorithm
- Initialise (dis)similarity matrix
- Take two points with smallest distance as first
cluster - Merge corresponding rows/columns in
(dis)similarity matrix - Repeat steps 2. and 3.
- using appropriate cluster
- measure until last two clusters are merged
31Average linkage clustering (Unweighted Pair
Group Mean Averaging -UPGMA)
Char 2
Char 1
Distance from cluster to cluster is defined as
the average distance over all within-cluster
distances
32 UPGMA
Let Ci and Cj be two disjoint clusters
1 di,j ?p?q dp,q, where p ? Ci and q ?
Cj Ci Cj
Ci
Cj
In words calculate the average over all pairwise
inter-cluster distances
33Multivariate statistics Cluster analysis
C1 C2 C3 C4 C5 C6 ..
1 2 3 4 5
Data table
Similarity criterion
Similarity matrix
Scores
55
Cluster criterion
Phylogenetic tree
34Multivariate statistics Cluster analysis
C1 C2 C3 C4 C5 C6
1 2 3 4 5
Similarity criterion
Scores
66
Cluster criterion
Scores
55
Cluster criterion
Make two-way ordered table using dendrograms
35Multivariate statistics Two-way cluster analysis
C4 C3 C6 C1 C2 C5
1 4 2 5 3
Make two-way (rows, columns) ordered table using
dendrograms This shows blocks of numbers that
are similar
36Multivariate statistics Two-way cluster analysis
37Graph theory
The river Pregal in Königsberg the Königsberg
bridge problem and Eulers graph
Can you start at some land area (S1, S2, I1, I2)
and walk each bridge exactly once returning to
the starting land area?
38(No Transcript)
39Graphs - definition
- Digraphs Directed graphs
- Complete graphs have all possible edges
- Planar graphs can be presented in 2D and have no
crossing edges (e.g. chip design)
40Graphs - definition
0 1 1.5 2 5 6 7 9 1 0 2 1 6.5 6 8
8 1.5 2 0 1 4 4 6 5.5 . . .
Graph
Adjacency matrix
An undirected graph has a symmetric adjacency
matrix A digraph typically has a non-symmetric
adjacency matrix
41Example application OBSTRUCT creating
non-redundant datasets of protein structures
- Based on all-against-all global sequence
alignment - Create all-against-all sequence similarity matrix
- Filter matrix based on desired similarity range
(convert to 0 and 1 values) - Form maximal clique (largest complete subgraph)
by ordering rows and columns - This is an NP-complete problem (NP
non-polynomial) and thus problem scales
exponentially with number of vertices (proteins)
42Example application 1 OBSTRUCT creating
non-redundant datasets of protein structures
- Statistical research on protein structures
typically requires a database of a maximum number
of non-redundant (i.e. non-homologous) structures
- Often, two structures that have a sequence
identity of less than 25 are taken as
non-redundant - Given an initial set of N structures (with
corresponding sequences) and all-against-all
pair-wise alignments - Find the largest possible subset where each
sequence has lt25 sequence identity with any
other sequence
Heringa, J., Sommerfeldt, H., Higgins, D., and
Argos, P. (1992). Obstruct a program to obtain
largest cliques from a protein sequence set
according to structural resolution and sequence
similarity. Comp. Appl. Biosci. (CABIOS) 8,
599-600.
43Example application 1 OBSTRUCT creating
non-redundant datasets of protein structures
(Cnt.)
- The problem now can be formalised as follows
- Make a graph containing all sequences as vertices
(nodes) - Connect two nodes with an edge if their sequence
identity lt 25 - Make an adjacency matrix following the above
rules
Heringa, J., Sommerfeldt, H., Higgins, D., and
Argos, P. (1992). Obstruct a program to obtain
largest cliques from a protein sequence set
according to structural resolution and sequence
similarity. Comp. Appl. Biosci. (CABIOS) 8,
599-600.
44Example application 1 OBSTRUCT creating
non-redundant datasets of protein structures
(Cnt.)
- The algorithm
- Now try and reorder the rows (and columns in the
same way) such that we get a square only
consisting of 1s in the upper left corner - This corresponds to a complete graph (also called
clique) containing a set of non-redundant
proteins
Heringa, J., Sommerfeldt, H., Higgins, D., and
Argos, P. (1992). Obstruct a program to obtain
largest cliques from a protein sequence set
according to structural resolution and sequence
similarity. Comp. Appl. Biosci. (CABIOS) 8,
599-600.
45Example application 1 OBSTRUCT creating
non-redundant datasets of protein structures
(Cnt.)
- Order sum array and reorder rows and columns
accordingly - Estimate largest possible clique and take subset
of adj. matrix containing only rows with enough
1s - For a clique of size N, a subset of M rows (and
columns), where M ? N, with at least N 1s is
selected. - Go to step 1.
5 4 6 4 ..
1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 1 0
0 1 0 1 1 1 0 1 1 0 1 0 1 1 0 0 0
0 1 . . . . . . .
?
Adjacency matrix
Heringa, J., Sommerfeldt, H., Higgins, D., and
Argos, P. (1992). Obstruct a program to obtain
largest cliques from a protein sequence set
according to structural resolution and sequence
similarity. Comp. Appl. Biosci. (CABIOS) 8,
599-600.
46(No Transcript)
47Some books call graphs containing multiple edges
or loops a multigraph, and those without a graph.
Other books allow multiple edges or loops in a
graph, but then talk about a graph without
multiple edges and loops as a simple graph.
48Remarks A multigraph might have no multiple
edges or loops. Every (simple) graph is a
multigraph, but not every multigraph is a
(simple) graph. Every graph is
finite Sometimes even multigraph folks talk
about a simple graph to emphasize that there
are no multiple edges and loops.
49Further definitions
K3,3
50Further definitions
bipartite A
graph is bipartite if its vertices can be
partitioned into two disjoint subsets U and V
such that each edge connects a vertex from U to
one from V. A bipartite graph is a complete
bipartite graph if every vertex in U is connected
to every vertex in V. If U has n elements and V
has m, then we denote the resulting complete
bipartite graph by Kn,m.
K3,3
51The Stable Marriage Algorithm
- In mathematics, the stable marriage problem (SMP)
is the problem of finding a stable matching a
matching in which no element of the first matched
set prefers an element of the second matched set
that also prefers the first element. - It is commonly stated as
- Given n men and n women, where each person has
ranked all members of the opposite sex with a
unique number between 1 and n in order of
preference, marry the men and women off such that
there are no two people of opposite sex who would
both rather have each other than their current
partners. If there are no such people, all the
marriages are "stable". - In 1962, David Gale and Lloyd Shapley proved
that, for any equal number of men and women, it
is always possible to solve the SMP and make all
marriages stable.
52The Stable Marriage Algorithm
- Also called the Gale-Shapley algorithm (see
preceding slide) - Given two non-overlapping equally sized graphs of
men (A, B, C, ..) and women (a, b, c, ), where
each man and woman has a preference list about
persons of the opposite sex - A pairing denotes a 1-to-1 correspondence between
men and women (each man marries one woman) - A pairing is unstable if there are couples X-x
and Y-y such that X prefers y to x and y prefers
X to Y - if this happens, pair X-y is called unsatisfied
- A pairing in which there are no unsatisfied
couples is called a stable pairing or stable
marriage - The Stable Marriage Algorithm forms a bipartite
graph that is stable
53A abcd denotes the preferences of A (likes a the
most, then b, then c, while d is liked least)
54(No Transcript)
55(No Transcript)
56(No Transcript)
57The Stable Marriage Algorithm
- The Gale-Shapley pairing, in the form presented
here, is male-optimal and female-pessimal (it
would be the reverse, of course, if the roles of
"male" and "female" participants in the algorithm
were interchanged). - To see this, consider the definition of a
feasible marriage. We say that the marriage
between man A and woman B is feasible if there
exists a stable pairing in which A and B are
married. When we say a pairing is male-optimal,
we mean that every man is paired with his highest
ranked feasible partner. Similarly, a
female-pessimal pairing is one in which each
female is paired with her lowest ranked feasible
partner. - Thuis means that men pair with women higher in
their preference list than where these men appear
in the list of the women paired to them (on
average)
58Graphs - definition
0 1 1.5 2 5 6 7 9 1 0 2 1 6.5 6 8
8 1.5 2 0 1 4 4 6 5.5 . . .
Graph
Adjacency matrix
An undirected graph has a symmetric adjacency
matrix A digraph typically has a non-symmetric
adjacency matrix
59A Theoretical Framework
- Representation of a set of n-dimensional (n-D)
points as a graph - each data point represented as a node
- each pair of points represented as an edge with a
weight defined by the distance between the two
points
graph representation
distance matrix
n-D data points
60A Theoretical Framework
- Spanning tree a sub-graph that has all nodes
connected and has no cycles - Minimum spanning tree a spanning tree with the
minimum total distance
(a)
(c)
(b)
61Spanning tree
- Prims algorithm (graph, tree)
- step 1 select an arbitrary node as the current
tree - step 2 find an external node that is closest to
the tree, and add it with its corresponding edge
into tree - step 3 continue steps 1 and 2 till all nodes are
connected in tree.
(a)
62Spanning tree
- Kruskals algorithm
- step 1 consider edges in non-decreasing order
- step 2 if edge selected does not form cycle,
then add it into tree otherwise reject - step 3 continue steps 1 and 2 till all nodes are
connected in tree.
63A Theoretical Framework
- A formal definition of a cluster
- C forms a cluster in D only if for any partition
C C1 U C2, the closest point, from D-C1, to C1
is from C2, in other words, the closest
connection from the points outside C1 must come
from within C2 - Key results
For any data set D, any of its cluster is
represented by a sub-tree of its MST
64A Theoretical Framework
- We can use the result on the preceding slide for
clustering using PRIMs algorithm - The order of nodes as selected by PRIMs
algorithm defines a linear representation, L(D),
of a data set D - We can plot the node distances in PRIMs minimum
spanning tree against the order of the nodes as
they were added by PRIMs algorithm (see earlier
slide)
Any contiguous block in L(D) represents a cluster
if and only if its elements form a sub-tree of
the MST, plus some minor additional conditions
(each cluster forms a valley)
4
A
A
B
B
7
PRIMs
edge weight (distance)
E
E
5
3
C
C
D
D
A B E D C
65A Theoretical Framework
So, if we first calculate all pairwise distances
in the plots below, convert the dots to a
(complete) graph, use PRIMs algorithm to
calculate a MST, and then make the plot as on the
preceding slide, we get for real-size data
Valleys correspond to clusters (red bars)
A nice clustering algorithm.... That can also be
used for signal/noise filtering
66Take home messages
- Revise agglomerative clustering (e.g.
nearest/furthest neighbour, group averaging) - Learn about graphs (complete, (un)directed,
adjacency matrix, bipartite, etc.) - Understand the relationship between a graph and
its adjacency matrix - Understand the Stable Marriage Algorithm
- Understand Prims and Kruskals algorithm
- Make sure you understand the clustering method
based on the minimal spanning tree made by Prims
algorithm (preceding slide)