Title: Introduction to Algorithms in Bioinformatics
1 http://creativecommons.org/licenses/by-sa/2.0/
2 Introduction to Algorithms in Bioinformatics
Sanja Rogic, Computer Science Department, UBC
3 Outline
- What are algorithms?
- Algorithm design techniques
- Efficiency of algorithms
- Algorithms in Bioinformatics
- Sequence alignment from algorithmic perspective
4 Making chocolate mousse
- Ingredients
- 6 ounces of semisweet chocolate
- ¼ cup of powder sugar
- 6 separated eggs
- Yields
- 6 to 8 servings
5 Recipe
- melt the chocolate over simmering water
- stir in powder sugar
- remove from heat and beat egg yolks
- in separate bowl beat egg whites until foamy
- gently fold whites into chocolate mixture
- pour into individual serving dishes
- chill at least 4 hours
6 [diagram: the cooking analogy]
- input: ingredients
- algorithm (software): recipe
- hardware: utensils and cook
- output: chocolate mousse
7 Algorithm
- a sequence of instructions one must perform in order to solve a well-formulated problem
- problems are specified in terms of inputs and outputs
- examples of algorithms: dividing numbers, changing a flat tire, knitting a sweater, looking up a telephone number
8 Levels of detail: algorithm for the chocolate mousse
- melt the chocolate over simmering water
- stir in powder sugar
- remove from heat and beat egg yolks
- in separate bowl beat egg whites until foamy
- gently fold whites into chocolate mixture
- pour into individual serving dishes
- chill at least 4 hours
9 "stir in powder sugar"
- take a little powder sugar, pour it into the melted chocolate, stir it in, take a little more, pour, stir, ...
- take 2365 grains of powdered sugar, pour them into the melted chocolate, pick up a spoon and use circular movements to stir it in, ...
- move your arm towards the ingredients at an angle of 14º, at an approximate velocity of 18 inches per second, ...
- the right level of detail depends on the hardware and/or the comprehension level of the potential reader
10 Representation
- pseudocode (a Python version follows below):
  SumIntegers(n)
    sum ← 0
    for i ← 1 to n
      sum ← sum + i
    return sum
- flowchart: start → sum ← 0 → [for i = 1 to n: sum ← sum + i] → stop
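A direct Python transcription of the pseudocode above (a minimal sketch; the function name is chosen here to mirror the slide):

def sum_integers(n):
    """Sum the integers 1..n, mirroring the SumIntegers pseudocode."""
    total = 0                    # sum <- 0
    for i in range(1, n + 1):    # for i <- 1 to n
        total += i               # sum <- sum + i
    return total

print(sum_integers(10))  # 55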
11 Problem specification
- the recipe is tailored only for a specific set of ingredients
- in general, an algorithm should be able to deal with many different inputs
- an algorithmic problem consists of:
  - a specification of a legal, possibly infinite collection of potential input sets (e.g., n ∈ ℕ)
  - a specification of the desired outputs as a function of the inputs (e.g., the sum)
12 Algorithm design techniques
- most common algorithm design techniques
- exhaustive search
- branch-and-bound
- divide-and-conquer
- greedy
- dynamic programming
- machine learning
- probabilistic algorithms
13 Looking for a cordless phone
14 Exhaustive search
- also called brute force
- examines every possible alternative to find the solution
- in the example with the cordless phone: ignore the ringing sound and examine every square centimeter of your house (explore the entire search/solution space)
15 Exhaustive search
16 Exhaustive search
- the phone would probably stop ringing by the time you find it, but this method guarantees that you will eventually find the phone
- these algorithms are generally easy to design and implement but are too slow to be practical
17 Branch-and-bound algorithms
- start searching through the first floor, but realize that the ringing is coming from above
- rule out the basement and the first floor (pruning)
- faster than exhaustive search
18 Divide-and-conquer algorithms
- divide a problem into smaller subproblems, solve the subproblems independently and combine the solutions into a solution for the original problem
- usually done recursively
- merging of the solutions is a critical step and can take a long time
19 Greedy algorithms
- choose the most attractive alternative at each iteration, without regard for future consequences
- "nearsighted" algorithms
- may settle for a local instead of a global optimum
- these algorithms are easy to design and implement and are usually fast
20 Greedy algorithms
- walk in the direction of the phone's ringing
21 Dynamic programming
- similar to divide-and-conquer in the sense that it breaks a problem into smaller subproblems
- based on the principle of optimality: the optimal solution to a problem is a combination of optimal solutions to some of its subproblems
- it cleverly organizes the computation to avoid recomputing values that are already known
22 Machine learning algorithms
- collect statistics about where you leave your phone, learning where the phone ends up most of the time (kitchen 80%, bedroom 15%, bathroom 5%)
- use this data to devise a time-saving strategy (look in the kitchen first, ...)
- extensively used in Bioinformatics:
  - Hidden Markov Models (gene finding, sequence alignment)
  - Neural Networks (prediction of splice sites)
  - Support Vector Machines (analysis of microarray data)
23 Randomized algorithms
- use random choices to search the solution space: flip a coin to decide on the next step
- in the cordless phone example: flip a coin to decide where you want to start your search (heads: go to the second floor)
- once on the second floor, flip a coin to choose which room to go to
- a randomized search through the solution space, guided by a fitness function
24 Randomized algorithms
25 Efficiency of algorithms
- finding efficient algorithms for important algorithmic problems is one of the most common research topics in CS
- example: searching through a telephone book
26 First approach: linear search
- exhaustively search through the whole telephone book (n records)
- at every step, compare your query with the current list position
- how many comparisons do you have to make?
- this algorithm has a worst-case running time on the order of n: its time complexity is O(n) (a sketch follows)
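As a sketch, linear search over a list of (name, number) records might look like this in Python; the record format and the toy data are only an assumption for illustration:

def linear_search(phone_book, query_name):
    """Scan every record until the query is found: O(n) comparisons in the worst case."""
    for name, number in phone_book:    # up to n comparisons
        if name == query_name:
            return number
    return None                        # query is not in the book

book = [("Ada", "604-555-0101"), ("Grace", "604-555-0102"), ("Linus", "604-555-0103")]
print(linear_search(book, "Grace"))    # 604-555-0102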
27 Big-O notation
- the running time of an algorithm is often expressed as the approximate number of elementary instructions performed by the algorithm, using big-O notation
- the number of elementary instructions performed by the algorithm is most often a function of its input size
- if n is the input size and f(n) is the approximate number of elementary operations performed by the algorithm in the worst case, we say that the algorithm has O(f(n)) time complexity
- elementary operations: arithmetic operations, comparisons, ...
28 Complexity of algorithms
- boxes represent elementary instructions or blocks of elementary instructions; function calls are not elementary instructions!
- [flowcharts, sketched as code skeletons below:]
  - no loop (start → block → stop): O(c), constant time
  - a single loop over i = 1..n: O(n), linear algorithm
  - nested loops over i = 1..n and j = 1..m: O(n·m) ≈ O(n^2), quadratic algorithm
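In code, the three flowcharts correspond roughly to the following Python skeletons (a sketch; the function names are illustrative and the loop bodies stand in for blocks of elementary instructions):

def constant_time(x, y):
    """O(c): a fixed number of elementary operations, independent of input size."""
    return x + y

def linear_time(values):
    """O(n): one loop over the n inputs."""
    total = 0
    for v in values:
        total += v
    return total

def quadratic_time(a, b):
    """O(n*m) ~ O(n^2): two nested loops."""
    count = 0
    for x in a:          # n iterations
        for y in b:      # m iterations each
            if x == y:
                count += 1
    return count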
29 Big-O notation
- big-O notation doesn't care about constants
- even if an algorithm takes time n, 5n or 100n, we still say it runs in time O(n)
- this is always the worst-case running time; for many inputs the algorithm is going to run much faster
30 Second approach: binary search
- How do you really use a telephone book?
- BinarySearch(L, X)
    if (X = L[mid])
      return L[mid]
    else if (X < L[mid])
      BinarySearch(first half of L, X)
    else
      BinarySearch(second half of L, X)
31 Binary search
- what is the worst-case complexity of this algorithm?
- how many comparisons do we have to make?
- we need to keep dividing the list by two until there is only one element left: n / 2^k = 1, so k = log₂(n), where k is the number of times we need to divide/compare
- binary search has time complexity O(log n): this is a logarithmic algorithm (very fast!); a Python sketch follows
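A recursive Python sketch of the BinarySearch pseudocode from the previous slide, assuming the list L is sorted; the lo/hi index arithmetic is an implementation detail not spelled out on the slide:

def binary_search(L, x, lo=0, hi=None):
    """Return an index of x in the sorted list L, or None; O(log n) comparisons."""
    if hi is None:
        hi = len(L) - 1
    if lo > hi:                       # empty sublist: x is not in L
        return None
    mid = (lo + hi) // 2
    if x == L[mid]:                   # if (X = L[mid]) return L[mid]
        return mid
    elif x < L[mid]:                  # search the left half
        return binary_search(L, x, lo, mid - 1)
    else:                             # search the right half
        return binary_search(L, x, mid + 1, hi)

names = ["Ada", "Alan", "Donald", "Edsger", "Grace", "Linus"]
print(binary_search(names, "Grace"))  # 4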
32 Why do we care?
33 Growth rates of some functions
34 Running time of algorithms
35 Tractable vs. intractable problems
36 Classification of algorithms based on their complexity
- polynomial algorithms: all algorithms that have time complexity O(n^k), for fixed k
  - examples: O(n), O(n log n), O(n^3), O(n^150), ...
- logarithmic algorithms are a subclass of polynomial algorithms: they are faster than polynomials
- exponential algorithms: all algorithms that have time complexity O(k^n), for fixed k > 1
  - examples: O(2^n), O(n^n), O(n!), ... (a small growth-rate sketch follows)
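To make the gap concrete, a few lines of Python (an illustrative sketch, not from the slides) tabulate some of these functions for growing n:

import math

print(f"{'n':>4} {'n log n':>10} {'n^3':>12} {'2^n':>22} {'n!':>26}")
for n in (10, 20, 30, 40):
    print(f"{n:>4} {n * math.log2(n):>10.0f} {n**3:>12} {2**n:>22} {math.factorial(n):>26.2e}")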
37 Time complexity indicates an algorithm's speed
- polynomial algorithms: fast
- exponential algorithms: s-l-o-w
38 NP-complete problems
- a class of hard computational problems for which exhaustive search takes exponential time
- no polynomial-time algorithm has been found for any of these problems, but there is no proof that it is impossible to find one
- a famous NP-complete problem is the Traveling Salesman Problem (TSP)
39 Traveling Salesman Problem
- Input: a map of cities, the roads between cities and their distances
- Output: a shortest path that goes through each city exactly once
- many algorithmic problems are TSP in disguise
40 TSP example
- we are given an instance of the TSP problem:
- [graph: five cities a, b, c, d, e connected by roads with lengths 50, 75, 100 and 125]
41 Solution space: how many possible paths are there?
- starting from city a: 4 choices for the first city to visit
42 Solution space
- after choosing the second city: 4 × 3 possibilities
43 Solution space
- after choosing the third city: 4 × 3 × 2 possibilities
44 Solution space
- 4 × 3 × 2 × 1 = 24 possible paths for the 5 cities in the example
- in general, there are (n − 1)! possible paths for n cities
45 Graph representation of the search space
46 Time complexity of the TSP problem
- we need to calculate the lengths of all possible paths to find the minimum one
- there are (n − 1)! possible paths, and for each one it takes O(n) to calculate the length of the path (sum up n integers)
- the complexity of this algorithm is O(n!) (a brute-force sketch follows)
- for a TSP problem of size 20 it would take approximately 77 years to solve!
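A brute-force Python sketch that enumerates the (n − 1)! tours and sums n edge lengths for each, matching the O(n!) analysis above; it assumes the tour returns to the start city, and the 4-city distance matrix is toy data, not the example from slide 40:

from itertools import permutations

def brute_force_tsp(dist):
    """Try every tour starting and ending at city 0: (n-1)! tours, O(n) work per tour."""
    cities = range(1, len(dist))
    best_len, best_tour = float("inf"), None
    for perm in permutations(cities):          # (n-1)! orderings of the other cities
        tour = (0,) + perm + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))  # sum of n edge lengths
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

dist = [[0, 10, 15, 20],      # toy symmetric distance matrix
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
print(brute_force_tsp(dist))  # (80, (0, 1, 3, 2, 0))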
47 Applications of the TSP problem
- TSP is not a toy problem but has many important applications:
- arranging transportation routes
- drilling circuit boards
- analysis of crystal structures
- genome sequencing: construction of radiation hybrid maps
- genetic engineering research: design of universal DNA strings
48 How can we solve the TSP problem and other NP-complete problems?
- two general approaches:
- approximate algorithms (heuristics)
- probabilistic (randomized, stochastic) algorithms
49 Heuristic algorithms
- computationally cheaper algorithms that may find a good-quality solution
- there is no guarantee that the optimal solution will be found
- use intuitive rules to ignore some regions of the solution (search) space, as in the sketch below
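For TSP, one classic intuitive rule (not named on these slides) is the nearest-neighbour heuristic: always travel to the closest unvisited city. A minimal Python sketch of that idea; it is fast, but as stated above there is no optimality guarantee:

def nearest_neighbour_tsp(dist, start=0):
    """Greedy TSP heuristic: repeatedly move to the closest unvisited city. O(n^2)."""
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour, current = [start], start
    while unvisited:
        nxt = min(unvisited, key=lambda c: dist[current][c])  # closest remaining city
        tour.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    tour.append(start)                                        # return to the start city
    return tour, sum(dist[a][b] for a, b in zip(tour, tour[1:]))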
50 Randomized algorithms
- use random choices to search the solution space: flip a coin to decide on the next step
- each run of the algorithm results in a different search path through the solution space (different outputs)
- deterministic algorithms always produce the same output for a specific input
- fast algorithms that may find a good-quality solution
- not guaranteed to find the optimal solution
- many runs required (see the sketch below)
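A minimal randomized sketch in the same spirit (illustrative only): sample random TSP tours and keep the best one seen. Each run may return a different answer, and more samples improve the chance of a good tour, but optimality is not guaranteed:

import random

def random_sampling_tsp(dist, samples=1000, seed=None):
    """Randomly permute the cities `samples` times and keep the shortest tour found."""
    rng = random.Random(seed)
    cities = list(range(1, len(dist)))
    best_len, best_tour = float("inf"), None
    for _ in range(samples):
        rng.shuffle(cities)                   # a random choice of search path
        tour = [0] + cities + [0]
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_len, best_tour = length, list(tour)
    return best_len, best_tour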
51 TSP: Sweden tour
- optimal solution for 25,000 cities
- CPU time 85 years (8 years on a 96-processor cluster)
- using LKH heuristics
- http://www.tsp.gatech.edu/sweden/index.html
52 NP-complete problems in Bioinformatics
- multiple sequence alignment
- phylogenetic tree construction
- RNA secondary structure prediction with pseudoknots
- protein structure prediction
53 Recap: pairwise sequence alignment
- align two DNA sequences
- S1: TTCATA
- S2: TGCTCGTA
54 Dynamic programming recursion
- in general, aligning two sequences x1 x2 ... xn and y1 y2 ... ym with linear gap penalty −d:
- F(0,0) = 0, F(i,0) = −i·d, F(0,j) = −j·d
- F(i, j) = max of:
  - F(i−1, j−1) + s(xi, yj)   (match/mismatch)
  - F(i−1, j) − d             (gap in y)
  - F(i, j−1) − d             (gap in x)
55 Dynamic programming matrix
- x = TTCATA, y = TGCTCGTA; match = +5, mismatch = −2, gap = −6 (a fill sketch follows)
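A minimal Python sketch (not part of the original slides) of filling this matrix with the recursion from slide 54 and the scoring above; only the matrix fill is shown, not the traceback on the next slide:

def dp_matrix(x, y, match=5, mismatch=-2, gap=-6):
    """Fill the global-alignment DP matrix F of size (len(x)+1) x (len(y)+1)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                      # F(i,0) = -i*d
    for j in range(1, m + 1):
        F[0][j] = j * gap                      # F(0,j) = -j*d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match / mismatch
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F

F = dp_matrix("TTCATA", "TGCTCGTA")
print(F[-1][-1])   # optimal global alignment score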
56Tracing back
T _ _ T C A T A T G C T C G T A
global alignment
57 What is the time complexity of DP?
- we need to fill a matrix of size (m+1) × (n+1)
- for each cell of the matrix we need a constant number of elementary instructions
- time complexity: O(mn) (or O(n^2) when both sequences have length about n)
- space complexity: O(mn)
58 Dynamic programming in bioinformatics
- DP is widely used in bioinformatics
- sequence alignments
- gene prediction
- RNA secondary structure prediction
59 Is O(n^2) fast enough?
- we cannot use this approach to search the whole genome!
- use heuristic (approximate) approaches: FASTA, BLAST
60 Multiple sequence alignment
- generalizing the notion of pairwise alignment
- alignment of 3 sequences:
  A T _ G C G _
  A _ C G T _ A
  A T C A C _ A
61 DP for alignment of three sequences
- we can extend the logic used for pairwise alignments
- 2-D: 3 edges into each vertex of the square
- 3-D: 7 edges into each vertex of the cube
62 DP for alignment of three sequences
- the value of cell (i, j, k) depends on its 7 predecessor cells:
  (i−1, j−1, k−1), (i−1, j−1, k), (i−1, j, k−1), (i, j−1, k−1), (i−1, j, k), (i, j−1, k), (i, j, k−1)
63 DP for alignment of three sequences
- F(i, j, k) = max of:
  - F(i−1, j−1, k−1) + s(xi, yj, zk)   (cube diagonal: no indels)
  - F(i−1, j−1, k) + s(xi, yj, _)
  - F(i−1, j, k−1) + s(xi, _, zk)
  - F(i, j−1, k−1) + s(_, yj, zk)      (face diagonal: one indel)
  - F(i−1, j, k) + s(xi, _, _)
  - F(i, j−1, k) + s(_, yj, _)
  - F(i, j, k−1) + s(_, _, zk)         (edge diagonal: two indels)
- s(x, y, z) is an entry in the 3-D scoring matrix
64 Computational complexity of MSA
- the dynamic programming matrix for 3 sequences of length n is an n × n × n cube
- we have to calculate a value for each cell of this matrix
- the number of cells is n^3 for sequences of length n
- how many calculations for each cell? we need to calculate 7 terms in order to get the max
- for 3 sequences of length n, the running time is 7n^3 = O(n^3)
65 Computational complexity of MSA
- if we have k sequences of length n:
- the k-dimensional dynamic programming matrix will have n^k cells
- for each cell, 2^k − 1 calculations are needed (the number of vertices of a k-dimensional cube, minus 1)
- the running time of this algorithm is O(2^k · n^k) (a quick operation-count sketch follows)
- it is exponential in the number of sequences in the alignment!
- NP-complete problem
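As a sanity check on the O(2^k · n^k) bound, a few lines of Python (an illustrative sketch) show how quickly the cell count and per-cell work blow up as the number of sequences k grows, ignoring constants:

def msa_dp_operations(k, n):
    """Approximate operation count for k-sequence DP alignment: (2^k - 1) * n^k."""
    return (2 ** k - 1) * n ** k

for k in (2, 3, 4, 6, 10):
    print(k, msa_dp_operations(k, n=100))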
66 How can we solve the MSA problem?
- use heuristic approaches
- the most commonly used approach to multiple sequence alignment is progressive alignment
67 Progressive multiple alignment
- a greedy approach
- start from a strong pairwise alignment and iteratively add one string to the growing multiple alignment
- multiple sequence alignment of k sequences is reduced to k − 1 pairwise alignments
- in general it works well for closely related sequences, but there are no performance guarantees
68 Conclusions
- Computer Science has an essential role in Bioinformatics (beyond programming and building databases)
- without powerful algorithmic techniques, many problems in Bioinformatics would be unsolvable
- next time you run a tool, think about the amazing research that went into building its engine
69 Recommended books
- An Introduction to Bioinformatics Algorithms, N.C. Jones and P.A. Pevzner
- Biological Sequence Analysis, Durbin et al.
- Algorithmics: The Spirit of Computing, D. Harel
- Introduction to Algorithms, Cormen et al.