http://creativecommons.org/licenses/by-sa/2.0/ - PowerPoint PPT Presentation

1
http://creativecommons.org/licenses/by-sa/2.0/
2
Introduction to Algorithms in Bioinformatics
Sanja Rogic, Computer Science Department, UBC
3
Outline
  • What are algorithms?
  • Algorithm design techniques
  • Efficiency of algorithms
  • Algorithms in Bioinformatics
  • Sequence alignment from algorithmic perspective

4
Making chocolate mousse
  • Ingredients
  • 6 ounces of semisweet chocolate
  • ¼ cup of powdered sugar
  • 6 separated eggs
  • Yields
  • 6 to 8 servings

5
Recipe
  • melt the chocolate over simmering water
  • stir in powdered sugar
  • remove from heat and beat egg yolks
  • in separate bowl beat egg whites until foamy
  • gently fold whites into chocolate mixture
  • pour into individual serving dishes
  • chill at least 4 hours

6
[Figure: input (ingredients) → hardware (utensils, cook) following the algorithm/software (recipe) → output (chocolate mousse)]
7
Algorithm
  • a sequence of instructions one must perform in
    order to solve a well-formulated problem
  • problems specified in terms of inputs and
    outputs
  • examples of algorithms: dividing numbers,
    changing a flat tire, knitting a sweater,
    looking up a telephone number

8
Levels of detail: algorithm for the chocolate
mousse
  • melt the chocolate over simmering water
  • stir in powdered sugar
  • remove from heat and beat egg yolks
  • in separate bowl beat egg whites until foamy
  • gently fold whites into chocolate mixture
  • pour into individual serving dishes
  • chill at least 4 hours

9
stir in powdered sugar
  • take a little powdered sugar, pour it into the
    melted chocolate, stir it in, take a little more,
    pour, stir, ...
  • take 2365 grains of powdered sugar, pour them
    into the melted chocolate, pick up a spoon and
    use circular movements to stir it in, ...
  • move your arm towards the ingredients at an angle
    of 14º, at an approximate velocity of 18 inches
    per second, ...
  • the right level of detail depends on the hardware
    and/or the comprehension level of the potential reader

10
Representation
pseudocode:
    SumIntegers(n)
        sum ← 0
        for i ← 1 to n
            sum ← sum + i
        return sum
flowchart: start → sum ← 0 → loop i = 1, n: sum ← sum + i → stop
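As an illustration, the SumIntegers pseudocode can be written as a short Python function. This is only a sketch; the name sum_integers and the use of Python are choices made here, not part of the original slides.

```python
def sum_integers(n: int) -> int:
    """Return 1 + 2 + ... + n, following the SumIntegers pseudocode."""
    total = 0                      # sum <- 0
    for i in range(1, n + 1):      # for i <- 1 to n
        total += i                 # sum <- sum + i
    return total                   # return sum


if __name__ == "__main__":
    print(sum_integers(10))
```

The loop body runs n times, so the number of elementary instructions grows linearly with the input n.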
11
Problem specification
  • recipe tailored only for specific set of
    ingredients
  • in general, an algorithm should be able to deal
    with many different inputs
  • an algorithmic problem consists of:
  • a specification of a legal, possibly infinite
    collection of potential input sets (e.g., n ∈ N)
  • a specification of the desired outputs as a function of
    the inputs (e.g., sum)

12
Algorithm design techniques
  • most common algorithm design techniques
  • exhaustive search
  • branch-and-bound
  • divide-and-conquer
  • greedy
  • dynamic programming
  • machine learning
  • probabilistic algorithms

13
Looking for a cordless phone
14
Exhaustive search
  • also called brute force
  • examines every possible alternative to find the
    solution
  • in the example with the cordless phone: ignore
    the ringing sound and examine every square
    centimeter of your house (explore the entire
    search/solution space)

15
Exhaustive search
16
Exhaustive search
  • phone would probably stop ringing by the time you
    find it, but this method guarantees you that you
    will eventually find the phone
  • these algorithms are generally easy to design and
    implement but are too slow to be practical

17
Branch-and-bound algorithms
  • start searching through the first floor but
    realize that ringing is coming from above
  • rule out basement and first floor (pruning)
  • faster than exhaustive

18
Divide-and-conquer algorithms
  • divide a problem into smaller subproblems, solve
    the subproblems independently and combine the
    solutions into a solution for the original problem
  • usually done recursively
  • merging of the solutions is a critical step and
    can take a long time
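A classic instance of divide-and-conquer is merge sort. The slides do not name a specific algorithm here, so the choice of merge sort and the function names below are assumptions made for illustration.

```python
def merge(left, right):
    """Combine two sorted lists into one sorted list (the merging step)."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])    # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Sort a list by splitting it in half, sorting each half recursively,
    and merging the two sorted halves."""
    if len(items) <= 1:        # a list of 0 or 1 elements is already sorted
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


if __name__ == "__main__":
    print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))
```

The two halves are solved independently; all of the real work happens in merge, which is the combining step the slide calls critical.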

19
Greedy algorithms
  • choose most attractive alternative at each
    iteration, without regard for future consequences
  • nearsighted algorithms
  • settling for a local instead of a global optimum
  • these algorithms are easy to design and implement
    and are usually fast

20
Greedy algorithms
  • walk in the direction of the phone's ringing

21
Dynamic programming
  • similar to divide-and-conquer in the sense that it
    breaks a problem into smaller subproblems
  • it is based on the principle of optimality: the
    optimal solution to a problem is a combination of
    optimal solutions to some of its subproblems
  • it cleverly organizes computation to avoid
    recomputing values that are already known
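The benefit of not recomputing known values can be seen on a toy problem. The sketch below uses Fibonacci numbers, which the slides do not mention; it is chosen here only because the repeated subproblems are easy to count. Both function names are hypothetical.

```python
def fib_naive(n, counter):
    """Plain recursion: the same subproblems are solved again and again.
    counter is a one-element list used to count the calls."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_dp(n):
    """Dynamic programming: fill a table bottom-up, each value computed once."""
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


if __name__ == "__main__":
    calls = [0]
    print(fib_naive(20, calls), "computed with", calls[0], "recursive calls")
    print(fib_dp(20), "computed with a table of 21 entries")
```

The recursive version makes an exponentially growing number of calls, while the table-based version does a constant amount of work per entry.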

22
Machine learning algorithms
  • collect statistics about where you leave your
    phone, learning where the phone ends up most of
    the time (kitchen 80%, bedroom 15%, bathroom 5%)
  • use this data to devise a time-saving strategy
    (look in the kitchen first, ...)
  • extensively used in Bioinformatics
  • Hidden Markov Models (gene finding, sequence
    alignment)
  • Neural Networks (prediction of splice sites)
  • Support Vector Machines (analysis of microarray
    data)

23
Randomized algorithms
  • use random choices to search the solution space:
    flip a coin to decide on the next step
  • in the cordless phone example: flip a coin to
    decide where you want to start your search (heads:
    go to the second floor)
  • once on the second floor flip a coin to choose
    to which room you want to go
  • randomized search through the solution space
    guided by a fitness function

24
Randomized algorithms
25
Efficiency of algorithms
  • finding efficient algorithms for important
    algorithmic problems is one of the most common
    research topics in CS
  • example: searching through a telephone book

26
First approach: linear search
  • exhaustively search through the whole telephone
    book (n records)
  • at every step compare your query with the current
    list position
  • how many comparisons do you have to make?
  • in the worst case, n: this algorithm has a worst-case
    running time on the order of n, so its time complexity is O(n)
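A minimal sketch of the linear search described above, which also counts the comparisons it makes. The function name and the sample names in the list are made up for illustration.

```python
def linear_search(records, query):
    """Scan the records from the start; return (index, comparisons made).
    The index is -1 if the query is not in the list."""
    comparisons = 0
    for index, record in enumerate(records):
        comparisons += 1           # one comparison per list position
        if record == query:
            return index, comparisons
    return -1, comparisons


if __name__ == "__main__":
    phone_book = ["Adams", "Baker", "Clark", "Davis", "Evans"]
    print(linear_search(phone_book, "Evans"))   # found at the last position
    print(linear_search(phone_book, "Zed"))     # not found: all n records checked
```

In the worst case (the query is last or missing) all n records are compared, which is where the O(n) bound comes from.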

27
Big-O notation
  • running time of an algorithm is often expressed
    in approximate number of elementary instructions
    performed by the algorithm using big-O notation
  • the number of elementary instructions performed
    by the algorithm is most often a function of its
    input size
  • if n is the input size and f(n) is the
    approximate number of elementary operations
    performed by the algorithm in the worst case we
    say that the algorithm has O(f(n)) time
    complexity
  • elementary operations: arithmetic operations,
    comparison

28
Complexity of algorithms
Boxes represent elementary instructions or blocks
of elementary instructions. Function calls are
not elementary instructions!
[Figure: three flowcharts. No loop: O(c), constant time. One loop (i = 1, n): O(n), linear algorithm. Two nested loops (i = 1, n and j = 1, m): O(n^2), quadratic algorithm.]
29
Big-O notation
  • big-O notation doesn't care about constants
  • even if it takes time n, 5n, 100n we still say it
    runs in time O(n)
  • this is always the worst-case running time; for
    many inputs the algorithm is going to run much
    faster

30
Second approach: binary search
  • How do you really use a telephone book?
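  • BinarySearch (L, X)
  •   if (X = L[mid])
  •     return L[mid]
  •   else if (X < L[mid])
  •     BinarySearch (left half of L, X)
  •   else BinarySearch (right half of L, X)

[Figure: sorted list L with positions 1, 2, 3, 4, ..., M; the entries are shown as "?" until they are examined]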
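A sketch of binary search in Python. It uses a loop instead of the recursion in the slide's pseudocode, which is a choice made here; the behaviour is the same. It assumes the list is already sorted, and the function name and step counter are hypothetical additions.

```python
def binary_search(sorted_records, query):
    """Return (index, steps) for query in a sorted list; index is -1 if absent.
    Each step halves the part of the list that can still contain the query."""
    low, high = 0, len(sorted_records) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_records[mid] == query:
            return mid, steps
        if query < sorted_records[mid]:
            high = mid - 1         # continue in the left half
        else:
            low = mid + 1          # continue in the right half
    return -1, steps


if __name__ == "__main__":
    book = list(range(0, 2000, 2))           # 1000 sorted records
    print(binary_search(book, 1998))
    print(binary_search(book, 7))            # 7 is not in the list
```

For 1000 records the loop never needs more than 10 steps, compared with up to 1000 comparisons for a linear scan.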
31
Binary search
  • what is the worst-case complexity of this
    algorithm?
  • how many comparisons do we have to make?
  • we need to keep dividing the list by two until
    there is only one element left

n / 2^k = 1, so k = log2 n, where k is the number of times we need to divide/compare
  • binary search has time complexity of O(log n):
    this is a logarithmic algorithm (very fast!)

32
Why do we care?
33
Growth rates of some functions
34
Running time of algorithms
35
Tractable vs intractable problems
36
Classification of algorithms based on their
complexity
  • polynomial algorithms: all algorithms that have
    time complexity of O(n^k), for fixed k
  • examples: O(n), O(n log n), O(n^3), O(n^150), ...
  • logarithmic algorithms are a subclass of
    polynomial algorithms; they are faster than polynomials
  • exponential algorithms: all algorithms that have
    time complexity of O(k^n), for fixed k > 1
  • examples: O(2^n), O(n^n), O(n!), ...

37
Time complexity implies an algorithm's speed:
polynomial algorithms are fast, exponential
algorithms are s-l-o-w
38
NP-complete problems
  • class of hard computational problems for which
    exhaustive search takes exponential time
  • no polynomial time algorithm has been found for
    any of these problems, but there is no proof that
    it is impossible to find one
  • a famous NP-complete problem is the Traveling Salesman
    Problem (TSP)

39
Traveling Salesman Problem
  • Input: a map of cities, roads between cities and
    distances
  • Output: a shortest path that goes through each
    city exactly once
  • many algorithmic problems are TSP in disguise

40
TSP example
  • we are given an instance of the TSP

[Figure: graph of five cities a, b, c, d, e; the ten roads between them have lengths 50, 75, 100 or 125]
41
Solution space: how many possible paths are
there?
a: 4
42
Solution space
a → b: 4 × 3
43
Solution space
a → b → c: 4 × 3 × 2
44
Solution space
a → b → c → d: 4 × 3 × 2 × 1
  • in general there are (n - 1)! possible paths for
    n cities

45
Graph representation of search space
46
Time complexity of TSP problem
  • we need to calculate the lengths of all possible
    paths to find the minimum one
  • there are (n-1)! possible paths and for each one
    it takes O(n) to calculate the length of the path
    (sum up n integers)
  • the complexity of this algorithm is O(n!)
  • for TSP problem of size 20 it would take
    approximately 77 years to solve it!
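The exhaustive search described above can be sketched in a few lines. The distance matrix below is made up for illustration (it is not the five-city example from the slides), and the function name is hypothetical. Following the problem statement, the sketch looks for a shortest path from a fixed start city through all other cities, without returning to the start.

```python
from itertools import permutations


def brute_force_tsp(dist, start=0):
    """Try all (n-1)! orderings of the other cities and keep the shortest.
    dist[a][b] is the distance between cities a and b."""
    n = len(dist)
    others = [city for city in range(n) if city != start]
    best_length, best_path = float("inf"), None
    for ordering in permutations(others):          # (n-1)! candidate paths
        path = (start,) + ordering
        # Adding up the path length takes O(n) per candidate.
        length = sum(dist[path[i]][path[i + 1]] for i in range(n - 1))
        if length < best_length:
            best_length, best_path = length, path
    return best_length, best_path


if __name__ == "__main__":
    # Hypothetical symmetric distances between 4 cities.
    distances = [
        [0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0],
    ]
    print(brute_force_tsp(distances))
```

With 4 cities there are only 3! = 6 paths to check; the factorial growth is what makes the same loop hopeless for larger inputs.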

47
Applications of TSP problem
  • TSP problem is not a toy problem but has many
    important applications
  • arranging transportation routes
  • drilling circuit boards
  • analysis of structures of crystals
  • genome sequencing: construction of radiation
    hybrid maps
  • genetic engineering research project: design of
    a universal DNA string

48
How can we solve the TSP and other
NP-complete problems?
  • two general approaches
  • approximate algorithms (heuristics)
  • probabilistic (randomized, stochastic) algorithms

49
Heuristic algorithms
  • computationally cheaper algorithms that may find
    good quality solution
  • there is no guarantee that the optimal solution
    will be found
  • using intuitive rules to ignore some regions of
    solution (search) space

50
Randomized algorithms
  • use random choices to search the solution space:
    flip a coin to decide on the next step
  • each run of the algorithm results in a different
    search path through the solution space (different
    outputs)
  • deterministic algorithms always produce the same
    output for a specific input
  • fast algorithms that may find a good quality
    solution
  • not guaranteed to find optimal solution
  • many runs required

51
TSP Sweden tour
  • optimal solution for 25,000 cities
  • CPU time 85 years (8 years on 96-processor
    cluster)
  • using LKH heuristics

http://www.tsp.gatech.edu/sweden/index.html
52
NP-complete problems in Bioinformatics
  • multiple sequence alignment
  • phylogenetic tree construction
  • RNA secondary structure prediction with
    pseudoknots
  • protein structure prediction

53
Recap: pairwise sequence alignment
  • align two DNA sequences
  • S1 = TTCATA
  • S2 = TGCTCGTA

54
Dynamic programming recursion
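  • in general: aligning two sequences x_1 x_2 ... x_n and
    y_1 y_2 ... y_m with linear gap penalty -d
  • F(0,0) = 0, F(i,0) = -id, F(0,j) = -jd
  • F(i, j) is the maximum of three terms:

\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{match/mismatch} \\
F(i-1,\,j) - d & \text{gap in } y \\
F(i,\,j-1) - d & \text{gap in } x
\end{cases}
\]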
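The recursion can be sketched as a global alignment routine in Python. This is an illustrative implementation, not the presenter's code: the function name is hypothetical, the gap parameter is the (negative) score added per gap, i.e. -d, and the default scores of +5, -2 and -6 are the ones quoted on the dynamic programming matrix slide.

```python
def global_align(x, y, match=5, mismatch=-2, gap=-6):
    """Fill the (n+1) x (m+1) matrix F and trace back one optimal alignment.
    Returns (score, aligned_x, aligned_y), with '_' marking a gap."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                      # F(i,0) = -id
    for j in range(1, m + 1):
        F[0][j] = j * gap                      # F(0,j) = -jd
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match/mismatch
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x

    # Trace back from the bottom-right cell to the top-left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("_")
            i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, top, bottom = global_align("TTCATA", "TGCTCGTA")
    print(score)
    print(top)
    print(bottom)
```

Each cell needs only its three neighbours, so filling the matrix takes a constant amount of work per cell. When several alignments share the best score, the traceback returns just one of them.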
55
Dynamic programming matrix
x = TTCATA, y = TGCTCGTA; scores: +5 match, -2 mismatch
and -6 gap
56
Tracing back
T _ _ T C A T A
T G C T C G T A
global alignment
57
What is the time complexity of DP?
  • we need to fill a matrix of size (m+1) × (n+1)
  • for each cell of the matrix we need a constant
    number of elementary instructions
  • time complexity O(mn) (or O(n^2))
  • space complexity O(mn)

58
Dynamic programming in bioinformatics
  • DP is widely used in bioinformatics
  • sequence alignments
  • gene prediction
  • RNA secondary structure prediction

59
Is O(n^2) fast enough?
  • we cannot use this approach to search the whole
    genome!
  • use heuristic (approximate) approaches: FASTA,
    BLAST

60
Multiple sequence alignment
  • generalizing the notion of pairwise alignment
  • alignment of 3 sequences

A T _ G C G _
A _ C G T _ A
A T C A C _ A
61
DP for alignment of three sequences
  • we can extend logic used for pairwise alignments

3-D: 7 edges in each cube
2-D: 3 edges in each square
62
DP for alignment of three sequences
[Figure: one cube of the 3-D matrix: cell (i, j, k) and its seven predecessors (i-1, j-1, k-1), (i-1, j-1, k), (i-1, j, k-1), (i, j-1, k-1), (i-1, j, k), (i, j-1, k), (i, j, k-1)]
63
DP for alignment of three sequences
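\[
F(i,j,k) = \max
\begin{cases}
F(i-1,\,j-1,\,k-1) + s(x_i, y_j, z_k) & \text{cube diagonal: no indels} \\
F(i-1,\,j-1,\,k) + s(x_i, y_j, \_) & \text{face diagonal: one indel} \\
F(i-1,\,j,\,k-1) + s(x_i, \_, z_k) & \text{face diagonal: one indel} \\
F(i,\,j-1,\,k-1) + s(\_, y_j, z_k) & \text{face diagonal: one indel} \\
F(i-1,\,j,\,k) + s(x_i, \_, \_) & \text{edge diagonal: two indels} \\
F(i,\,j-1,\,k) + s(\_, y_j, \_) & \text{edge diagonal: two indels} \\
F(i,\,j,\,k-1) + s(\_, \_, z_k) & \text{edge diagonal: two indels}
\end{cases}
\]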
  • s(x, y, z) is an entry in the 3-D scoring matrix

64
Computational complexity of MSA
[Figure: n × n × n dynamic programming matrix for 3 sequences]
  • we have to calculate a value for each cell of this
    matrix
  • the number of cells is n^3 for sequences of
    length n
  • how many calculations for each cell? we need
    to calculate 7 terms in order to get the max
  • for 3 sequences of length n, the running time is 7n^3 =
    O(n^3)

65
Computational complexity of MSA
  • If we have k sequences of length n
  • the k-dimensional dynamic programming matrix will
    have n^k cells
  • for each cell 2^k - 1 (number of vertices in a k-dim
    cube minus 1) calculations are needed
  • run time of this algorithm is O(2^k n^k)
  • it is exponential in the number of sequences in
    the alignment!
  • NP-complete problem
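To get a feeling for these numbers, the cell and term counts can be computed directly. This is just arithmetic on the formulas above; the function name is hypothetical and the sequence length of 100 is an arbitrary example.

```python
def msa_dp_cost(n, k):
    """Return (cells, terms_per_cell, total_terms) for the exact DP
    alignment of k sequences of length n, using n^k cells and
    2^k - 1 terms per cell."""
    cells = n ** k
    terms_per_cell = 2 ** k - 1
    return cells, terms_per_cell, cells * terms_per_cell


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        cells, terms, total = msa_dp_cost(100, k)
        print(f"k={k}: {cells} cells x {terms} terms = {total} operations")
```

Each extra sequence multiplies the number of cells by n and roughly doubles the work per cell, so the exact method quickly becomes impractical.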

66
How can we solve MSA problem?
  • use heuristic approaches
  • most commonly used approach to multiple sequence
    alignment is progressive alignment

67
Progressive multiple alignment
  • greedy approach
  • start from a strong pairwise alignment and
    iteratively add one string to the growing
    multiple alignment
  • multiple sequence alignment of k sequences
    reduced to k-1 pairwise alignments
  • in general works well for close sequences, but
    there are no performance guarantees

68
Conclusions
  • Computer Science has an essential role in
    Bioinformatics (beyond programming and building
    databases)
  • without powerful algorithmic techniques many
    problems in Bioinformatics would be unsolvable
  • next time you run a tool, think about the amazing
    research that went into building its
    engine

69
Recommended books
  • An Introduction to Bioinformatics Algorithms,
    N.C. Jones and P.A. Pevzner
  • Biological Sequence Analysis, Durbin et al.
  • Algorithmics: The Spirit of Computing, D. Harel
  • Introduction to Algorithms, Cormen et al.