Introduction to Bioinformatics presentation

Title: Introduction to Bioinformatics

1
Introduction to Bioinformatics
Lecture 16: Intracellular Networks - Graph theory
2
High-throughput Biological Data

Enormous amounts of biological data are being
generated by high-throughput capabilities, and even
more are coming:
genomic sequences
gene expression data
mass spectrometry data
protein-protein interaction data
protein structures
......
Hidden in these data is information that reflects
existence, organization, activity, functionality
of biological machineries at different levels
in living organisms

3
Bio-Data Analysis and Data Mining

Existing/emerging bio-data analysis and mining
tools for
DNA sequence assembly
Genetic map construction
Sequence comparison and database search
Gene finding
...
Gene expression data analysis
Phylogenetic tree analysis to infer
horizontally-transferred genes
Mass spec. data analysis for protein complex
characterization
Current prevailing mode of work:
developing ad hoc tools for each individual
application
4
Bio-Data Analysis and Data Mining

As the amount and types of data and the needs to
establish connections across multi-data sources
increase rapidly, the number of analysis tools
needed will go up exponentially
blast, blastp, blastx, blastn, ... from the BLAST
family of tools
gene finding tools for human, mouse, fly, rice,
cyanobacteria, ...
tools for finding various signals in genomic
sequences: protein-binding sites, splice junction
sites, translation start sites, ...

Many of these data analysis problems are
fundamentally the same problem(s) and can be
solved using the same set of tools
Developing ad hoc tools for each application
problem (by each group of individual researchers)
may soon become inadequate as bio-data production
capabilities further ramp up
5
Data Clustering

Many biological data analysis problems can be
formulated as clustering problems
microarray gene expression data analysis
arrayCGH data (chromosomal gains and losses)
identification of regulatory binding sites
(similarly, splice junction sites, translation
start sites, ......)
(yeast) two-hybrid data analysis (for inference
of protein complexes)
phylogenetic tree clustering (for inference of
horizontally transferred genes)
protein domain identification
identification of structural motifs
prediction reliability assessment of protein
structures
NMR peak assignments
......

6
Data Clustering: an example

Regulatory binding-sites are short conserved
sequence fragments in promoter regions
Solving binding-site identification as a
clustering problem
Project all fragments into Euclidean space so
that similar fragments are projected to nearby
positions and dissimilar fragments to far
positions
Observation: conserved fragments form clusters
in a noisy background

........acgtttataatggcg......
........ggctttatattcgtc......
........ccgaatataatcta.......
7
Data Clustering Problems

Clustering: partition a data set into clusters so
that data points of the same cluster are
similar and points of different clusters are
dissimilar
Cluster identification -- identifying clusters
with significantly different features than the
background

8
Multivariate statistics: Cluster analysis
[Figure: raw table with rows 1-5 and columns C1-C6, ...; any set of numbers per column]

Multi-dimensional problems
Objects can be viewed as a cloud of points in a
multidimensional space
Need ways to group the data

9
Multivariate statistics: Cluster analysis
[Figure: raw table (rows 1-5, columns C1-C6, ...; any set of numbers per column) -> similarity criterion -> 5×5 similarity matrix of scores -> cluster criterion -> dendrogram]
10
Cluster analysis: data normalisation/weighting
[Figure: raw table (rows 1-5, columns C1-C6, ...) -> normalisation criterion -> normalised table (same layout)]
Column normalisation: x/max
Column range normalisation: (x-min)/(max-min)
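The two normalisation criteria can be sketched in a few lines. This is only an illustration: the function names and the small `raw` table are made up for the example, and it assumes every column has a non-zero maximum and is not constant.

```python
def column_normalise(table):
    """Column normalisation: divide every value by its column maximum (x/max)."""
    maxima = [max(col) for col in zip(*table)]
    return [[x / m for x, m in zip(row, maxima)] for row in table]


def range_normalise(table):
    """Column range normalisation: (x - min) / (max - min) per column."""
    columns = list(zip(*table))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
            for row in table]


# Hypothetical raw table: 3 objects (rows) x 2 characters (columns)
raw = [[2.0, 10.0],
       [4.0, 30.0],
       [8.0, 20.0]]

print(column_normalise(raw))
print(range_normalise(raw))
```

After range normalisation every column spans exactly 0 to 1, so no single column dominates the distance calculation that follows.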
11
Cluster analysis: (dis)similarity matrix
[Figure: raw table (rows 1-5, columns C1-C6, ...) -> similarity criterion -> 5×5 similarity matrix of scores]
Minkowski metrics: $D_{i,j} = \left( \sum_k |x_{ik} - x_{jk}|^r \right)^{1/r}$
r = 2: Euclidean distance
r = 1: City block distance
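As a small sketch of the Minkowski metric, the function below computes the distance between two rows of the table and builds the full distance matrix. The names `minkowski`, `distance_matrix` and the example rows are assumptions made for this illustration.

```python
def minkowski(x, y, r):
    """Minkowski distance between two rows x and y for exponent r."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def distance_matrix(rows, r=2):
    """All-against-all distances; symmetric with zeros on the diagonal."""
    return [[minkowski(p, q, r) for q in rows] for p in rows]


# Hypothetical objects described by two characters each
rows = [[0, 0], [3, 4], [6, 8]]

print(minkowski(rows[0], rows[1], 2))  # Euclidean distance
print(minkowski(rows[0], rows[1], 1))  # city block distance
print(distance_matrix(rows))
```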
12
Cluster analysis: Clustering criteria
[Figure: 5×5 similarity matrix of scores -> cluster criterion -> dendrogram (tree)]
Single linkage - Nearest neighbour
Complete linkage - Furthest neighbour
Group averaging - UPGMA (phylogeny)
Ward
Neighbour joining - global measure (phylogeny)
13
Cluster analysis: Clustering criteria

Start with N clusters of 1 object each
Apply clustering distance criterion iteratively
until you have 1 cluster of N objects
Most interesting clustering somewhere in between

[Figure: dendrogram (tree) with a distance axis, running from N clusters to 1 cluster]
14
Single linkage clustering (nearest neighbour)
[Figure: scatter plot of objects; axes: Char 1, Char 2]
19
Single linkage clustering (nearest neighbour)
[Figure: scatter plot of objects; axes: Char 1, Char 2]
Distance from point to cluster is defined as the
smallest distance between that point and any
point in the cluster
20
Single linkage clustering (nearest neighbour)
Let $C_i$ and $C_j$ be two disjoint clusters:
$d_{i,j} = \min(d_{p,q})$, where $p \in C_i$ and $q \in C_j$
Single linkage dendrograms typically show
chaining behaviour (i.e., again and again a single
object is added to an existing cluster)
21
Complete linkage clustering (furthest neighbour)
[Figure: scatter plot of objects; axes: Char 1, Char 2]
28
Complete linkage clustering (furthest neighbour)
[Figure: scatter plot of objects; axes: Char 1, Char 2]
Distance from point to cluster is defined as the
largest distance between that point and any point
in the cluster
29
Complete linkage clustering (furthest neighbour)
Let $C_i$ and $C_j$ be two disjoint clusters:
$d_{i,j} = \max(d_{p,q})$, where $p \in C_i$ and $q \in C_j$
More structured clusters than with single
linkage clustering
30
Clustering algorithm

1. Initialise the (dis)similarity matrix
2. Take the two points with the smallest distance as the first
cluster
3. Merge the corresponding rows/columns in the
(dis)similarity matrix
4. Repeat steps 2 and 3 using the appropriate cluster
measure until the last two clusters are merged
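The loop above can be sketched as follows. This is a deliberately naive version written for this transcript: instead of merging rows and columns of the matrix, it recomputes each cluster-to-cluster distance from the original point distances, which gives the same merges for single linkage (`linkage=min`) and complete linkage (`linkage=max`). The function name and the example matrix are assumptions.

```python
def agglomerate(d, linkage=min):
    """Naive agglomerative clustering on a distance matrix d.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(i,) for i in range(len(d))]  # start: N clusters of 1 object
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(d[p][q] for p in clusters[a] for q in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical distances between four points at positions 0, 1, 3 and 7
d = [[0, 1, 3, 7],
     [1, 0, 2, 6],
     [3, 2, 0, 4],
     [7, 6, 4, 0]]

for step in agglomerate(d, min):
    print("single  ", step)
for step in agglomerate(d, max):
    print("complete", step)
```

The merge distances are the heights at which branches join in the dendrogram; cutting the tree at a chosen height gives a clustering somewhere between N clusters and 1 cluster.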

31
Average linkage clustering (Unweighted Pair
Group Method with Arithmetic mean - UPGMA)
[Figure: scatter plot of objects; axes: Char 1, Char 2]
Distance from cluster to cluster is defined as
the average over all pairwise distances between
the members of the two clusters
32
UPGMA
Let $C_i$ and $C_j$ be two disjoint clusters:
$d_{i,j} = \frac{1}{|C_i| \, |C_j|} \sum_p \sum_q d_{p,q}$, where $p \in C_i$ and $q \in C_j$
In words: calculate the average over all pairwise
inter-cluster distances
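A direct transcription of the average-linkage distance, as a sketch; the function name and the example distance matrix are assumptions for illustration.

```python
def average_linkage(ci, cj, d):
    """Average of all pairwise distances d[p][q] with p in ci and q in cj."""
    total = sum(d[p][q] for p in ci for q in cj)
    return total / (len(ci) * len(cj))


# Hypothetical distances between four points at positions 0, 1, 3 and 7
d = [[0, 1, 3, 7],
     [1, 0, 2, 6],
     [3, 2, 0, 4],
     [7, 6, 4, 0]]

# Distance between cluster {0, 1} and cluster {2, 3}: (3 + 7 + 2 + 6) / 4
print(average_linkage([0, 1], [2, 3], d))
```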
33
Multivariate statistics: Cluster analysis
[Figure: data table (rows 1-5, columns C1-C6, ...) -> similarity criterion -> 5×5 similarity matrix of scores -> cluster criterion -> phylogenetic tree]
34
Multivariate statistics: Cluster analysis
[Figure: data table (rows 1-5, columns C1-C6); for the columns: similarity criterion -> 6×6 scores -> cluster criterion; for the rows: 5×5 scores -> cluster criterion]
Make two-way ordered table using dendrograms
35
Multivariate statistics: Two-way cluster analysis
[Figure: reordered table with column order C4 C3 C6 C1 C2 C5 and row order 1 4 2 5 3]
Make two-way (rows, columns) ordered table using
dendrograms. This shows blocks of numbers that
are similar
36
Multivariate statistics: Two-way cluster analysis
37
Graph theory
The river Pregel in Königsberg: the Königsberg
bridge problem and Euler's graph
Can you start at some land area (S1, S2, I1, I2)
and walk each bridge exactly once returning to
the starting land area?
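Euler's answer rests on vertex degrees: a connected (multi)graph has a closed walk that uses every edge exactly once only if every vertex has an even degree. The sketch below checks this for the seven historical bridges. Which island is called I1 and which I2 is an assumption here (I1 is taken to be the island with five bridges), as are the function names.

```python
from collections import Counter

# Seven bridges as edges of a multigraph; the labelling of I1/I2 is assumed
bridges = [("S1", "I1"), ("S1", "I1"),
           ("S2", "I1"), ("S2", "I1"),
           ("S1", "I2"), ("S2", "I2"),
           ("I1", "I2")]


def degrees(edges):
    """Number of edge ends (bridges) at each vertex (land area)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def has_euler_circuit(edges):
    """True if all degrees are even (the graph is assumed to be connected)."""
    return all(k % 2 == 0 for k in degrees(edges).values())


print(dict(degrees(bridges)))
print(has_euler_circuit(bridges))
```

All four land areas have an odd number of bridges, so the requested round walk does not exist.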
39
Graphs - definition

Digraphs: directed graphs
Complete graphs: have all possible edges
Planar graphs: can be drawn in 2D with no
crossing edges (e.g. chip design)

40
Graphs - definition
[Figure: a graph and its adjacency matrix]
Adjacency matrix:
0   1   1.5 2   5   6   7   9
1   0   2   1   6.5 6   8   8
1.5 2   0   1   4   4   6   5.5
. . .
An undirected graph has a symmetric adjacency
matrix. A digraph typically has a non-symmetric
adjacency matrix
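The link between a graph and its adjacency matrix can be illustrated with a small sketch. The function names and the three-vertex example are assumptions; edges are unweighted here, whereas the matrix on the slide carries weights.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from a list of (u, v) edges."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        if not directed:
            a[v][u] = 1  # an undirected edge is stored in both directions
    return a


def is_symmetric(a):
    """True if a[i][j] == a[j][i] for all i, j."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))


edges = [(0, 1), (1, 2)]
print(is_symmetric(adjacency_matrix(3, edges)))                 # undirected graph
print(is_symmetric(adjacency_matrix(3, edges, directed=True)))  # digraph
```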
41
Example application: OBSTRUCT - creating
non-redundant datasets of protein structures

Based on all-against-all global sequence
alignment
Create all-against-all sequence similarity matrix
Filter matrix based on desired similarity range
(convert to 0 and 1 values)
Find the maximum clique (largest complete subgraph)
by ordering rows and columns
This is an NP-hard problem (NP = nondeterministic
polynomial time): no polynomial-time algorithm is known,
and exact methods scale exponentially with the number
of vertices (proteins) in the worst case

42
Example application 1: OBSTRUCT - creating
non-redundant datasets of protein structures

Statistical research on protein structures
typically requires a database of a maximum number
of non-redundant (i.e. non-homologous) structures
Often, two structures that have a sequence
identity of less than 25% are taken as
non-redundant
Given an initial set of N structures (with
corresponding sequences) and all-against-all
pair-wise alignments
Find the largest possible subset where each
sequence has <25% sequence identity with any
other sequence

Heringa, J., Sommerfeldt, H., Higgins, D., and
Argos, P. (1992). OBSTRUCT: a program to obtain
largest cliques from a protein sequence set
according to structural resolution and sequence
similarity. Comp. Appl. Biosci. (CABIOS) 8,
599-600.
43
Example application 1: OBSTRUCT - creating
non-redundant datasets of protein structures
(cont.)

The problem can now be formalised as follows:
Make a graph containing all sequences as vertices
(nodes)
Connect two nodes with an edge if their sequence
identity is < 25%
Make an adjacency matrix following the above
rules
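The formalisation can be sketched as a thresholding step on the all-against-all identity matrix. The identity percentages below are invented for illustration, the function name is hypothetical, and putting 1s on the diagonal is an assumption.

```python
def nonredundancy_graph(identity, threshold=25.0):
    """0/1 adjacency matrix: 1 if two sequences share < threshold % identity.

    The diagonal is set to 1 (an assumption made for this sketch).
    """
    n = len(identity)
    return [[1 if i == j or identity[i][j] < threshold else 0
             for j in range(n)]
            for i in range(n)]


# Hypothetical pairwise sequence identities (%) for four proteins
identity = [[100, 20, 60, 15],
            [20, 100, 22, 30],
            [60, 22, 100, 10],
            [15, 30, 10, 100]]

for row in nonredundancy_graph(identity):
    print(row)
```

An edge therefore means "these two proteins may sit together in a non-redundant set", so a non-redundant set is a group of vertices that are all mutually connected.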

44
Example application 1: OBSTRUCT - creating
non-redundant datasets of protein structures
(cont.)

The algorithm:
Now try to reorder the rows (and the columns in the
same way) such that we get a square
consisting only of 1s in the upper left corner
This corresponds to a complete graph (also called a
clique) containing a set of non-redundant
proteins
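To make the target concrete, the sketch below finds a largest clique by brute force over all vertex subsets. This is not the OBSTRUCT row/column reordering heuristic; it is an exhaustive search that is only feasible for very small graphs, shown to illustrate what "a square consisting only of 1s" corresponds to. The function name and the example matrix are assumptions.

```python
from itertools import combinations


def largest_clique(adj):
    """Return one largest set of mutually connected vertices (exhaustive)."""
    n = len(adj)
    for size in range(n, 0, -1):          # try the biggest subsets first
        for subset in combinations(range(n), size):
            if all(adj[u][v] for u, v in combinations(subset, 2)):
                return list(subset)
    return []


# Hypothetical adjacency matrix (1 = the two proteins are non-redundant)
adj = [[1, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 1],
       [0, 0, 1, 1]]

print(largest_clique(adj))
```

Here vertices 0, 1 and 2 already occupy the upper left corner, so the 3×3 block of 1s is visible without any reordering.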

45
Example application 1: OBSTRUCT - creating
non-redundant datasets of protein structures
(cont.)

Order sum array and reorder rows and columns
accordingly
Estimate largest possible clique and take subset
of adjacency matrix containing only rows with enough
1s
For a clique of size N, a subset of M rows (and
columns), where M ≥ N, with at least N 1s is
selected.
Go to step 1.

Sum array (number of 1s per row): 5 4 6 4 ..
Adjacency matrix:
1 0 1 1 1 0 0 0 1
0 1 0 0 1 1 1 0 0
1 0 1 1 1 0 1 1 0
1 0 1 1 0 0 0 0 1
. . . . . . . . .
47
Some books call graphs containing multiple edges
or loops a multigraph, and those without a graph.
Other books allow multiple edges or loops in a
graph, but then talk about a graph without
multiple edges and loops as a simple graph.
48
Remarks: A multigraph might have no multiple
edges or loops. Every (simple) graph is a
multigraph, but not every multigraph is a
(simple) graph. Every graph is
finite. Sometimes even multigraph folks talk
about a simple graph to emphasize that there
are no multiple edges and loops.
49
Further definitions
K3,3
50
Further definitions
Bipartite: a
graph is bipartite if its vertices can be
partitioned into two disjoint subsets U and V
such that each edge connects a vertex from U to
one from V. A bipartite graph is a complete
bipartite graph if every vertex in U is connected
to every vertex in V. If U has n elements and V
has m, then we denote the resulting complete
bipartite graph by $K_{n,m}$.
[Figure: $K_{3,3}$]
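Both definitions can be checked in code. The sketch below builds the edges of a complete bipartite graph and tests bipartiteness by two-colouring with a breadth-first search; the function names and the vertex numbering (U first, then V) are assumptions made for this example.

```python
from collections import deque


def complete_bipartite(n, m):
    """Edges of K_{n,m} with U = 0..n-1 and V = n..n+m-1."""
    return [(u, n + v) for u in range(n) for v in range(m)]


def is_bipartite(num_vertices, edges):
    """Two-colour the graph with BFS; fails if an edge joins equal colours."""
    neighbours = [[] for _ in range(num_vertices)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    side = [None] * num_vertices
    for start in range(num_vertices):      # handle every connected component
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if side[v] is None:
                    side[v] = 1 - side[u]
                    queue.append(v)
                elif side[v] == side[u]:
                    return False
    return True


k33 = complete_bipartite(3, 3)
triangle = [(0, 1), (1, 2), (0, 2)]
print(len(k33), is_bipartite(6, k33), is_bipartite(3, triangle))
```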
51
The Stable Marriage Algorithm

In mathematics, the stable marriage problem (SMP)
is the problem of finding a stable matching: a
matching in which no two elements, one from each
matched set, both prefer each other to their
current partners.
It is commonly stated as:
Given n men and n women, where each person has
ranked all members of the opposite sex with a
unique number between 1 and n in order of
preference, marry the men and women off such that
there are no two people of opposite sex who would
both rather have each other than their current
partners. If there are no such people, all the
marriages are "stable".
In 1962, David Gale and Lloyd Shapley proved
that, for any equal number of men and women, it
is always possible to solve the SMP and make all
marriages stable.

52
The Stable Marriage Algorithm

Also called the Gale-Shapley algorithm (see
preceding slide)
Given two non-overlapping, equally sized sets of
men (A, B, C, ...) and women (a, b, c, ...), where
each man and woman has a preference list about
persons of the opposite sex
A pairing denotes a 1-to-1 correspondence between
men and women (each man marries one woman)
A pairing is unstable if there are couples X-x
and Y-y such that X prefers y to x and y prefers
X to Y
If this happens, the pair X-y is called unsatisfied
A pairing in which there are no unsatisfied
couples is called a stable pairing or stable
marriage
The Stable Marriage Algorithm forms a stable pairing,
i.e. a stable perfect matching in the bipartite
graph of men and women
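A compact sketch of the men-proposing Gale-Shapley procedure, together with a check for unsatisfied pairs as defined above. The function names, the dictionary layout of the preference lists and the two-couple example are assumptions for illustration.

```python
def gale_shapley(men_prefs, women_prefs):
    """Men propose in order of preference; returns a dict man -> woman."""
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # index of the next woman to ask
    engaged_to = {}                          # woman -> man
    free_men = list(men_prefs)
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = engaged_to.get(w)
        if current is None:
            engaged_to[w] = m                # w was free: accept
        elif rank[w][m] < rank[w][current]:
            engaged_to[w] = m                # w prefers m: swap partners
            free_men.append(current)
        else:
            free_men.append(m)               # w rejects m: he tries again
    return {m: w for w, m in engaged_to.items()}


def is_stable(matching, men_prefs, women_prefs):
    """True if there is no unsatisfied pair X-y."""
    husband = {w: m for m, w in matching.items()}
    for m, prefs in men_prefs.items():
        for w in prefs:
            if w == matching[m]:
                break                        # the rest are less preferred
            if women_prefs[w].index(m) < women_prefs[w].index(husband[w]):
                return False
    return True


men = {"A": ["a", "b"], "B": ["a", "b"]}
women = {"a": ["B", "A"], "b": ["A", "B"]}
pairing = gale_shapley(men, women)
print(pairing, is_stable(pairing, men, women))
```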

53
A: abcd denotes the preferences of A (likes a the
most, then b, then c, while d is liked least)
57
The Stable Marriage Algorithm

The Gale-Shapley pairing, in the form presented
here, is male-optimal and female-pessimal (it
would be the reverse, of course, if the roles of
"male" and "female" participants in the algorithm
were interchanged).
To see this, consider the definition of a
feasible marriage. We say that the marriage
between man A and woman B is feasible if there
exists a stable pairing in which A and B are
married. When we say a pairing is male-optimal,
we mean that every man is paired with his highest
ranked feasible partner. Similarly, a
female-pessimal pairing is one in which each
female is paired with her lowest ranked feasible
partner.
This means that, on average, men are paired with
women who rank higher on the men's preference lists
than these men rank on the lists of the women paired
with them

59
A Theoretical Framework

Representation of a set of n-dimensional (n-D)
points as a graph:
each data point represented as a node
each pair of points represented as an edge with a
weight defined by the distance between the two
points

[Figure: n-D data points -> distance matrix -> graph representation]
60
A Theoretical Framework

Spanning tree: a sub-graph that has all nodes
connected and has no cycles
Minimum spanning tree: a spanning tree with the
minimum total distance

[Figure: panels (a), (b), (c)]
61
Spanning tree

Prim's algorithm (graph, tree)
step 1: select an arbitrary node as the current
tree
step 2: find an external node that is closest to
the tree, and add it with its corresponding edge
to the tree
step 3: repeat step 2 until all nodes are
connected in the tree.
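The three steps can be sketched on a complete graph given as a distance matrix. The function name, the return format and the example matrix are assumptions made for this illustration.

```python
def prim(dist, start=0):
    """Prim's algorithm on a distance matrix.

    Returns (order, edges): the order in which nodes joined the tree and
    the tree edges as (parent, node, weight).
    """
    n = len(dist)
    in_tree = [False] * n
    in_tree[start] = True                      # step 1: arbitrary start node
    # best[v] = (distance from v to the tree, tree node that realises it)
    best = [(dist[start][v], start) for v in range(n)]
    order, edges = [start], []
    for _ in range(n - 1):                     # step 3: repeat step 2
        # step 2: the external node closest to the current tree
        v = min((u for u in range(n) if not in_tree[u]),
                key=lambda u: best[u][0])
        weight, parent = best[v]
        in_tree[v] = True
        order.append(v)
        edges.append((parent, v, weight))
        for u in range(n):                     # update distances to the tree
            if not in_tree[u] and dist[v][u] < best[u][0]:
                best[u] = (dist[v][u], v)
    return order, edges


# Hypothetical distances between four points at positions 0, 1, 3 and 7
dist = [[0, 1, 3, 7],
        [1, 0, 2, 6],
        [3, 2, 0, 4],
        [7, 6, 4, 0]]

order, edges = prim(dist)
print(order, edges)
```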

62
Spanning tree

Kruskal's algorithm
step 1: consider the edges in non-decreasing order of weight
step 2: if the selected edge does not form a cycle,
then add it to the tree; otherwise reject it
step 3: repeat step 2 with the next edge until all nodes are
connected in the tree.
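A sketch of these steps, using a union-find structure to test whether an edge would close a cycle. The union-find helper is an implementation choice for this example and is not named on the slide; the function name, the `(weight, u, v)` edge format and the example edges are also assumptions.

```python
def kruskal(n, edges):
    """Kruskal's algorithm; edges are (weight, u, v), nodes are 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative of x's component
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):        # step 1: non-decreasing order
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                  # step 2: no cycle, so accept
            parent[root_u] = root_v
            tree.append((u, v, weight))
            if len(tree) == n - 1:            # step 3: all nodes connected
                break
    return tree


# Hypothetical complete graph on four points at positions 0, 1, 3 and 7
edges = [(1, 0, 1), (3, 0, 2), (7, 0, 3),
         (2, 1, 2), (6, 1, 3), (4, 2, 3)]

print(kruskal(4, edges))
```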

63
A Theoretical Framework

A formal definition of a cluster:
C forms a cluster in D only if for any partition
C = C1 ∪ C2, the closest point, from D-C1, to C1
is from C2; in other words, the closest
connection from the points outside C1 must come
from within C2
Key result:

For any data set D, any of its clusters is
represented by a sub-tree of its MST
64
A Theoretical Framework

We can use the result on the preceding slide for
clustering using Prim's algorithm
The order of nodes as selected by Prim's
algorithm defines a linear representation, L(D),
of a data set D
We can plot the node distances in Prim's minimum
spanning tree against the order of the nodes as
they were added by Prim's algorithm (see earlier
slide)

Any contiguous block in L(D) represents a cluster
if and only if its elements form a sub-tree of
the MST, plus some minor additional conditions
(each cluster forms a valley)
[Figure: example graph with nodes A-E and its minimum spanning tree from Prim's algorithm (edge weights 4, 7, 5, 3), next to a plot of edge weight (distance) against the node order A B E D C]
65
A Theoretical Framework
So, if we first calculate all pairwise distances
in the plots below, convert the dots to a
(complete) graph, use Prim's algorithm to
calculate an MST, and then make the plot as on the
preceding slide, we get for real-size data:
Valleys correspond to clusters (red bars)
A nice clustering algorithm that can also be
used for signal/noise filtering
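The whole pipeline can be sketched in miniature: run Prim's algorithm, record the edge weight with which each node joined the tree, and split the resulting profile wherever the weight exceeds a cutoff. The fixed cutoff is a simplification introduced for this example and stands in for the valley conditions mentioned earlier; the function names and the 1-D data are also assumptions.

```python
def prim_profile(dist, start=0):
    """Return [(node, weight of the edge that attached it)] in Prim order."""
    n = len(dist)
    best = list(dist[start])           # distance from each node to the tree
    in_tree = [False] * n
    in_tree[start] = True
    profile = [(start, 0.0)]
    for _ in range(n - 1):
        v = min((u for u in range(n) if not in_tree[u]),
                key=lambda u: best[u])
        in_tree[v] = True
        profile.append((v, best[v]))
        for u in range(n):
            if not in_tree[u] and dist[v][u] < best[u]:
                best[u] = dist[v][u]
    return profile


def split_valleys(profile, cutoff):
    """Start a new cluster whenever the attaching edge is longer than cutoff."""
    clusters = [[profile[0][0]]]
    for node, weight in profile[1:]:
        if weight > cutoff:
            clusters.append([node])
        else:
            clusters[-1].append(node)
    return clusters


# Hypothetical 1-D data: two tight groups far apart
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in points] for p in points]

profile = prim_profile(dist)
print([w for _, w in profile])          # low values are valleys, the peak separates them
print(split_valleys(profile, cutoff=4.0))
```

Plotting the weights in `profile` against node order gives the kind of valley plot described above; the single high value between the two runs of low values marks the jump from one cluster to the next.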
66
Take home messages

Revise agglomerative clustering (e.g.
nearest/furthest neighbour, group averaging)
Learn about graphs (complete, (un)directed,
adjacency matrix, bipartite, etc.)
Understand the relationship between a graph and
its adjacency matrix
Understand the Stable Marriage Algorithm
Understand Prim's and Kruskal's algorithms
Make sure you understand the clustering method
based on the minimal spanning tree made by Prim's
algorithm (preceding slide)
