Title: Introduction to Algorithms in Bioinformatics
1 http://creativecommons.org/licenses/by-sa/2.0/
2 Introduction to Algorithms in Bioinformatics
Sanja Rogic, Computer Science Department, UBC
3 Outline
- What are algorithms?
- Algorithm design techniques
- Efficiency of algorithms
- Algorithms in Bioinformatics
- Sequence alignment from algorithmic perspective
4 Making chocolate mousse
- Ingredients
- 6 ounces of semisweet chocolate
- ¼ cup of powder sugar
- 6 separated eggs
- Yields
- 6 to 8 servings
5 Recipe
- melt the chocolate over simmering water
- stir in powder sugar
- remove from heat and beat egg yolks
- in separate bowl beat egg whites until foamy
- gently fold whites into chocolate mixture
- pour into individual serving dishes
- chill at least 4 hours
6 [diagram: the cooking analogy]
- input: ingredients
- algorithm (software): recipe
- hardware: utensils and cook
- output: chocolate mousse
7 Algorithm
- a sequence of instructions one must perform in order to solve a well-formulated problem
- problems are specified in terms of inputs and outputs
- examples of algorithms: dividing numbers, changing a flat tire, knitting a sweater, looking up a telephone number
8 Levels of detail: algorithm for the chocolate mousse
- melt the chocolate over simmering water
- stir in powder sugar
- remove from heat and beat egg yolks
- in separate bowl beat egg whites until foamy
- gently fold whites into chocolate mixture
- pour into individual serving dishes
- chill at least 4 hours
9 "stir in powder sugar"
- take a little powder sugar, pour it into the melted chocolate, stir it in, take a little more, pour, stir, ...
- take 2365 grains of powdered sugar, pour them into the melted chocolate, pick up a spoon and use circular movements to stir it in, ...
- move your arm towards the ingredients at an angle of 14º, at an approximate velocity of 18 inches per second, ...
- the right level of detail depends on the hardware and/or the comprehension level of the potential reader
10 Representation
- pseudocode (a Python version follows below):
  SumIntegers(n)
    sum ← 0
    for i ← 1 to n
      sum ← sum + i
    return sum
- flowchart: start → sum ← 0 → [for i = 1 to n: sum ← sum + i] → stop
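A direct Python transcription of the pseudocode above (a minimal sketch; the function name is chosen here to mirror the slide):

def sum_integers(n):
    """Sum the integers 1..n, mirroring the SumIntegers pseudocode."""
    total = 0                    # sum <- 0
    for i in range(1, n + 1):    # for i <- 1 to n
        total += i               # sum <- sum + i
    return total

print(sum_integers(10))  # 55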
11 Problem specification
- the recipe is tailored only for a specific set of ingredients
- in general, an algorithm should be able to deal with many different inputs
- an algorithmic problem consists of:
  - a specification of a legal, possibly infinite collection of potential input sets (e.g., n ∈ ℕ)
  - a specification of the desired outputs as a function of the inputs (e.g., the sum)
12 Algorithm design techniques
- most common algorithm design techniques
- exhaustive search
- branch-and-bound
- divide-and-conquer
- greedy
- dynamic programming
- machine learning
- probabilistic algorithms
13 Looking for a cordless phone
14 Exhaustive search
- also called brute force
- examines every possible alternative to find the solution
- in the example with the cordless phone: ignore the ringing sound and examine every square centimeter of your house (explore the entire search/solution space)
15 Exhaustive search
16 Exhaustive search
- the phone would probably stop ringing by the time you find it, but this method guarantees that you will eventually find the phone
- these algorithms are generally easy to design and implement but are too slow to be practical
17 Branch-and-bound algorithms
- start searching through the first floor, but realize that the ringing is coming from above
- rule out the basement and the first floor (pruning)
- faster than exhaustive search
18 Divide-and-conquer algorithms
- divide a problem into smaller subproblems, solve the subproblems independently and combine the solutions into a solution for the original problem
- usually done recursively
- merging of the solutions is a critical step and can take a long time
19 Greedy algorithms
- choose the most attractive alternative at each iteration, without regard for future consequences
- "nearsighted" algorithms
- may settle for a local instead of a global optimum
- these algorithms are easy to design and implement and are usually fast
20 Greedy algorithms
- walk in the direction of the phone's ringing
21 Dynamic programming
- similar to divide-and-conquer in the sense that it breaks a problem into smaller subproblems
- based on the principle of optimality: the optimal solution to a problem is a combination of optimal solutions to some of its subproblems
- it cleverly organizes the computation to avoid recomputing values that are already known
22 Machine learning algorithms
- collect statistics about where you leave your phone, learning where the phone ends up most of the time (kitchen 80%, bedroom 15%, bathroom 5%)
- use this data to devise a time-saving strategy (look in the kitchen first, ...)
- extensively used in Bioinformatics:
  - Hidden Markov Models (gene finding, sequence alignment)
  - Neural Networks (prediction of splice sites)
  - Support Vector Machines (analysis of microarray data)
23 Randomized algorithms
- use random choices to search the solution space: flip a coin to decide on the next step
- in the cordless phone example: flip a coin to decide where you want to start your search (heads: go to the second floor)
- once on the second floor, flip a coin to choose which room to go to
- a randomized search through the solution space, guided by a fitness function
24 Randomized algorithms
25 Efficiency of algorithms
- finding efficient algorithms for important algorithmic problems is one of the most common research topics in CS
- example: searching through a telephone book
26 First approach: linear search
- exhaustively search through the whole telephone book (n records)
- at every step, compare your query with the current list position
- how many comparisons do you have to make?
- this algorithm has a worst-case running time on the order of n: its time complexity is O(n) (a sketch follows)
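As a sketch, linear search over a list of (name, number) records might look like this in Python; the record format and the toy data are only an assumption for illustration:

def linear_search(phone_book, query_name):
    """Scan every record until the query is found: O(n) comparisons in the worst case."""
    for name, number in phone_book:    # up to n comparisons
        if name == query_name:
            return number
    return None                        # query is not in the book

book = [("Ada", "604-555-0101"), ("Grace", "604-555-0102"), ("Linus", "604-555-0103")]
print(linear_search(book, "Grace"))    # 604-555-0102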
27 Big-O notation
- the running time of an algorithm is often expressed as the approximate number of elementary instructions performed by the algorithm, using big-O notation
- the number of elementary instructions performed by the algorithm is most often a function of its input size
- if n is the input size and f(n) is the approximate number of elementary operations performed by the algorithm in the worst case, we say that the algorithm has O(f(n)) time complexity
- elementary operations: arithmetic operations, comparisons, ...
28 Complexity of algorithms
- boxes represent elementary instructions or blocks of elementary instructions; function calls are not elementary instructions!
- [flowcharts, sketched as code skeletons below:]
  - no loop (start → block → stop): O(c), constant time
  - a single loop over i = 1..n: O(n), linear algorithm
  - nested loops over i = 1..n and j = 1..m: O(n·m) ≈ O(n^2), quadratic algorithm
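In code, the three flowcharts correspond roughly to the following Python skeletons (a sketch; the function names are illustrative and the loop bodies stand in for blocks of elementary instructions):

def constant_time(x, y):
    """O(c): a fixed number of elementary operations, independent of input size."""
    return x + y

def linear_time(values):
    """O(n): one loop over the n inputs."""
    total = 0
    for v in values:
        total += v
    return total

def quadratic_time(a, b):
    """O(n*m) ~ O(n^2): two nested loops."""
    count = 0
    for x in a:          # n iterations
        for y in b:      # m iterations each
            if x == y:
                count += 1
    return count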
29 Big-O notation
- big-O notation doesn't care about constants
- even if an algorithm takes time n, 5n or 100n, we still say it runs in time O(n)
- this is always the worst-case running time; for many inputs the algorithm is going to run much faster
30 Second approach: binary search
- How do you really use a telephone book?
- BinarySearch(L, X)
    if (X = L[mid])
      return L[mid]
    else if (X < L[mid])
      BinarySearch(first half of L, X)
    else
      BinarySearch(second half of L, X)
31 Binary search
- what is the worst-case complexity of this algorithm?
- how many comparisons do we have to make?
- we need to keep dividing the list by two until there is only one element left: n / 2^k = 1, so k = log₂(n), where k is the number of times we need to divide/compare
- binary search has time complexity O(log n): this is a logarithmic algorithm (very fast!); a Python sketch follows
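A recursive Python sketch of the BinarySearch pseudocode from the previous slide, assuming the list L is sorted; the lo/hi index arithmetic is an implementation detail not spelled out on the slide:

def binary_search(L, x, lo=0, hi=None):
    """Return an index of x in the sorted list L, or None; O(log n) comparisons."""
    if hi is None:
        hi = len(L) - 1
    if lo > hi:                       # empty sublist: x is not in L
        return None
    mid = (lo + hi) // 2
    if x == L[mid]:                   # if (X = L[mid]) return L[mid]
        return mid
    elif x < L[mid]:                  # search the left half
        return binary_search(L, x, lo, mid - 1)
    else:                             # search the right half
        return binary_search(L, x, mid + 1, hi)

names = ["Ada", "Alan", "Donald", "Edsger", "Grace", "Linus"]
print(binary_search(names, "Grace"))  # 4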
32 Why do we care?
33 Growth rates of some functions
34 Running time of algorithms
35 Tractable vs. intractable problems
36 Classification of algorithms based on their complexity
- polynomial algorithms: all algorithms that have time complexity O(n^k), for fixed k
  - examples: O(n), O(n log n), O(n^3), O(n^150), ...
- logarithmic algorithms are a subclass of polynomial algorithms: they are faster than polynomials
- exponential algorithms: all algorithms that have time complexity O(k^n), for fixed k > 1
  - examples: O(2^n), O(n^n), O(n!), ... (a small growth-rate sketch follows)
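To make the gap concrete, a few lines of Python (an illustrative sketch, not from the slides) tabulate some of these functions for growing n:

import math

print(f"{'n':>4} {'n log n':>10} {'n^3':>12} {'2^n':>22} {'n!':>26}")
for n in (10, 20, 30, 40):
    print(f"{n:>4} {n * math.log2(n):>10.0f} {n**3:>12} {2**n:>22} {math.factorial(n):>26.2e}")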
37 Time complexity indicates an algorithm's speed
- polynomial algorithms: fast
- exponential algorithms: s-l-o-w
38 NP-complete problems
- a class of hard computational problems for which exhaustive search takes exponential time
- no polynomial-time algorithm has been found for any of these problems, but there is no proof that it is impossible to find one
- a famous NP-complete problem is the Traveling Salesman Problem (TSP)
39 Traveling Salesman Problem
- Input: a map of cities, the roads between cities and their distances
- Output: a shortest path that goes through each city exactly once
- many algorithmic problems are TSP in disguise
40 TSP example
- we are given an instance of the TSP problem:
- [graph: five cities a, b, c, d, e connected by roads with lengths 50, 75, 100 and 125]
41 Solution space: how many possible paths are there?
- starting from city a: 4 choices for the first city to visit
42 Solution space
- after choosing the second city: 4 × 3 possibilities
43 Solution space
- after choosing the third city: 4 × 3 × 2 possibilities
44 Solution space
- 4 × 3 × 2 × 1 = 24 possible paths for the 5 cities in the example
- in general, there are (n − 1)! possible paths for n cities
45 Graph representation of the search space
46 Time complexity of the TSP problem
- we need to calculate the lengths of all possible paths to find the minimum one
- there are (n − 1)! possible paths, and for each one it takes O(n) to calculate the length of the path (sum up n integers)
- the complexity of this algorithm is O(n!) (a brute-force sketch follows)
- for a TSP problem of size 20 it would take approximately 77 years to solve!
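A brute-force Python sketch that enumerates the (n − 1)! tours and sums n edge lengths for each, matching the O(n!) analysis above; it assumes the tour returns to the start city, and the 4-city distance matrix is toy data, not the example from slide 40:

from itertools import permutations

def brute_force_tsp(dist):
    """Try every tour starting and ending at city 0: (n-1)! tours, O(n) work per tour."""
    cities = range(1, len(dist))
    best_len, best_tour = float("inf"), None
    for perm in permutations(cities):          # (n-1)! orderings of the other cities
        tour = (0,) + perm + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))  # sum of n edge lengths
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

dist = [[0, 10, 15, 20],      # toy symmetric distance matrix
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
print(brute_force_tsp(dist))  # (80, (0, 1, 3, 2, 0))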
47 Applications of the TSP problem
- TSP is not a toy problem but has many important applications:
- arranging transportation routes
- drilling circuit boards
- analysis of crystal structures
- genome sequencing: construction of radiation hybrid maps
- genetic engineering research: design of universal DNA strings
48 How can we solve the TSP problem and other NP-complete problems?
- two general approaches:
- approximate algorithms (heuristics)
- probabilistic (randomized, stochastic) algorithms
49 Heuristic algorithms
- computationally cheaper algorithms that may find a good-quality solution
- there is no guarantee that the optimal solution will be found
- use intuitive rules to ignore some regions of the solution (search) space, as in the sketch below
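For TSP, one classic intuitive rule (not named on these slides) is the nearest-neighbour heuristic: always travel to the closest unvisited city. A minimal Python sketch of that idea; it is fast, but as stated above there is no optimality guarantee:

def nearest_neighbour_tsp(dist, start=0):
    """Greedy TSP heuristic: repeatedly move to the closest unvisited city. O(n^2)."""
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour, current = [start], start
    while unvisited:
        nxt = min(unvisited, key=lambda c: dist[current][c])  # closest remaining city
        tour.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    tour.append(start)                                        # return to the start city
    return tour, sum(dist[a][b] for a, b in zip(tour, tour[1:]))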
50 Randomized algorithms
- use random choices to search the solution space: flip a coin to decide on the next step
- each run of the algorithm results in a different search path through the solution space (different outputs)
- deterministic algorithms always produce the same output for a specific input
- fast algorithms that may find a good-quality solution
- not guaranteed to find the optimal solution
- many runs required (see the sketch below)
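A minimal randomized sketch in the same spirit (illustrative only): sample random TSP tours and keep the best one seen. Each run may return a different answer, and more samples improve the chance of a good tour, but optimality is not guaranteed:

import random

def random_sampling_tsp(dist, samples=1000, seed=None):
    """Randomly permute the cities `samples` times and keep the shortest tour found."""
    rng = random.Random(seed)
    cities = list(range(1, len(dist)))
    best_len, best_tour = float("inf"), None
    for _ in range(samples):
        rng.shuffle(cities)                   # a random choice of search path
        tour = [0] + cities + [0]
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_len, best_tour = length, list(tour)
    return best_len, best_tour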
51 TSP: Sweden tour
- optimal solution for 25,000 cities
- CPU time 85 years (8 years on a 96-processor cluster)
- using LKH heuristics
- http://www.tsp.gatech.edu/sweden/index.html
52 NP-complete problems in Bioinformatics
- multiple sequence alignment
- phylogenetic tree construction
- RNA secondary structure prediction with pseudoknots
- protein structure prediction
53 Recap: pairwise sequence alignment
- align two DNA sequences
- S1: TTCATA
- S2: TGCTCGTA
54 Dynamic programming recursion
- in general, aligning two sequences x1 x2 ... xn and y1 y2 ... ym with linear gap penalty −d:
- F(0,0) = 0, F(i,0) = −i·d, F(0,j) = −j·d
- F(i, j) = max of:
  - F(i−1, j−1) + s(xi, yj)   (match/mismatch)
  - F(i−1, j) − d             (gap in y)
  - F(i, j−1) − d             (gap in x)
55 Dynamic programming matrix
- x = TTCATA, y = TGCTCGTA; match = +5, mismatch = −2, gap = −6 (a fill sketch follows)
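A minimal Python sketch (not part of the original slides) of filling this matrix with the recursion from slide 54 and the scoring above; only the matrix fill is shown, not the traceback on the next slide:

def dp_matrix(x, y, match=5, mismatch=-2, gap=-6):
    """Fill the global-alignment DP matrix F of size (len(x)+1) x (len(y)+1)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                      # F(i,0) = -i*d
    for j in range(1, m + 1):
        F[0][j] = j * gap                      # F(0,j) = -j*d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match / mismatch
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F

F = dp_matrix("TTCATA", "TGCTCGTA")
print(F[-1][-1])   # optimal global alignment score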
56Tracing back
T _ _ T C A T A T G C T C G T A
global alignment
57 What is the time complexity of DP?
- we need to fill a matrix of size (m+1) × (n+1)
- for each cell of the matrix we need a constant number of elementary instructions
- time complexity: O(mn) (or O(n^2) when both sequences have length about n)
- space complexity: O(mn)
58 Dynamic programming in bioinformatics
- DP is widely used in bioinformatics
- sequence alignments
- gene prediction
- RNA secondary structure prediction
59 Is O(n^2) fast enough?
- we cannot use this approach to search the whole genome!
- use heuristic (approximate) approaches: FASTA, BLAST
60 Multiple sequence alignment
- generalizing the notion of pairwise alignment
- alignment of 3 sequences:
  A T _ G C G _
  A _ C G T _ A
  A T C A C _ A
61 DP for alignment of three sequences
- we can extend the logic used for pairwise alignments
- 2-D: 3 edges into each vertex of the square
- 3-D: 7 edges into each vertex of the cube
62 DP for alignment of three sequences
- the value of cell (i, j, k) depends on its 7 predecessor cells:
  (i−1, j−1, k−1), (i−1, j−1, k), (i−1, j, k−1), (i, j−1, k−1), (i−1, j, k), (i, j−1, k), (i, j, k−1)
63 DP for alignment of three sequences
- F(i, j, k) = max of:
  - F(i−1, j−1, k−1) + s(xi, yj, zk)   (cube diagonal: no indels)
  - F(i−1, j−1, k) + s(xi, yj, _)
  - F(i−1, j, k−1) + s(xi, _, zk)
  - F(i, j−1, k−1) + s(_, yj, zk)      (face diagonal: one indel)
  - F(i−1, j, k) + s(xi, _, _)
  - F(i, j−1, k) + s(_, yj, _)
  - F(i, j, k−1) + s(_, _, zk)         (edge diagonal: two indels)
- s(x, y, z) is an entry in the 3-D scoring matrix
64 Computational complexity of MSA
- the dynamic programming matrix for 3 sequences of length n is an n × n × n cube
- we have to calculate a value for each cell of this matrix
- the number of cells is n^3 for sequences of length n
- how many calculations for each cell? we need to calculate 7 terms in order to get the max
- for 3 sequences of length n, the running time is 7n^3 = O(n^3)
65 Computational complexity of MSA
- if we have k sequences of length n:
- the k-dimensional dynamic programming matrix will have n^k cells
- for each cell, 2^k − 1 calculations are needed (the number of vertices of a k-dimensional cube, minus 1)
- the running time of this algorithm is O(2^k · n^k) (a quick operation-count sketch follows)
- it is exponential in the number of sequences in the alignment!
- NP-complete problem
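As a sanity check on the O(2^k · n^k) bound, a few lines of Python (an illustrative sketch) show how quickly the cell count and per-cell work blow up as the number of sequences k grows, ignoring constants:

def msa_dp_operations(k, n):
    """Approximate operation count for k-sequence DP alignment: (2^k - 1) * n^k."""
    return (2 ** k - 1) * n ** k

for k in (2, 3, 4, 6, 10):
    print(k, msa_dp_operations(k, n=100))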
66 How can we solve the MSA problem?
- use heuristic approaches
- the most commonly used approach to multiple sequence alignment is progressive alignment
67 Progressive multiple alignment
- a greedy approach
- start from a strong pairwise alignment and iteratively add one string to the growing multiple alignment
- multiple sequence alignment of k sequences is reduced to k − 1 pairwise alignments
- in general it works well for closely related sequences, but there are no performance guarantees
68 Conclusions
- Computer Science has an essential role in Bioinformatics (beyond programming and building databases)
- without powerful algorithmic techniques, many problems in Bioinformatics would be unsolvable
- next time you run a tool, think about the amazing research that went into building its engine
69 Recommended books
- An Introduction to Bioinformatics Algorithms, N.C. Jones and P.A. Pevzner
- Biological Sequence Analysis, Durbin et al.
- Algorithmics: The Spirit of Computing, D. Harel
- Introduction to Algorithms, Cormen et al.