Some algorithmic background - PowerPoint PPT Presentation

Some algorithmic background Biology 162 Computational Genetics Todd Vision Fall 2004 26 Aug 2004 – PowerPoint PPT presentation

Provided by: Todd276
Learn more at: http://labs.bio.unc.edu

1
Some algorithmic background
  • Biology 162 Computational Genetics
  • Todd Vision, Fall 2004
  • 26 Aug 2004

2
Some algorithmic background
  • Algorithms
  • Analysis of time and memory requirements
  • NP completeness
  • Graphs
  • Travelling salesman problem
  • DNA computers
  • Strings and Sequences
  • Recursion

3
Algorithm
  • A finite set of rules that gives a sequence of
    operations for solving a problem suitable for
    implementation by a computer
  • A correct algorithm will solve all instances of a
    problem
  • An algorithm can be implemented in multiple ways,
    in different languages, and on different hardware
    architectures
  • The choice of algorithm is usually far more
    important to time/memory usage than implementation

4
Knuth's 5 features of an algorithm
  • Finiteness - guaranteed to terminate
  • Definiteness - each step precisely defined
  • Effectiveness - each step basic enough to be
    carried out exactly in finite time
  • Defined inputs
  • Defined outputs

5
Analysis of algorithms
  • Mathematical description of time and memory
    requirements
  • Algorithm efficiency
  • Time and memory are a function f(n) of the size n
    of the problem instance
  • Efficiency generally expressed in Big O notation
  • Assuming the instance is a worst-case scenario
  • Describes how time/memory scale as problem size
    grows asymptotically large

6
Big O notation
  • O(g(n)), or order g(n), where g(n) is the highest
    order term of the cost function for an instance of
    size n
  • For small instances, an O(n^2) algorithm may be
    faster than an O(n) algorithm
  • The notation does not account for constant
    factors, which may affect comparisons
  • The big O notation does not allow one to actually
    predict the running time or memory usage
  • Average running time may be much better than
    worst-case

7
Algorithm efficiency
  • An algorithm is efficient if the running time is
    bounded by a polynomial
  • O(n^4) yes
  • O(4^n) no
  • O(4^log(n)) gray area
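  • Problems are considered to be of class
  • P if a deterministic polynomial-time algorithm
    exists
  • NP if a nondeterministic polynomial-time algorithm
    exists, i.e. a proposed solution can be checked
    in polynomial time
  • NP-complete if the problem is in NP and every
    problem in NP can be reduced to it in polynomial
    time; no efficient algorithm has yet been found
    for any of these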

8
Are NP-complete problems in class P?
  • If any NP-complete problem is provably in class
    P, then all NP-complete problems must be!
  • Strictly, this applies only to decision problems
  • Corresponding optimization problems must be at
    least as hard, and are referred to as NP-hard
  • Many of the most interesting problems in
    computational biology are NP-complete or NP-hard

9
Algorithms without optimality guarantees
  • Approximation algorithm
  • For many NP-hard problems, polynomial-time
    algorithms exist that can provably give answers
    within some small factor ε of the optimal answer
  • Heuristic algorithm
  • An algorithm that may be sensible, and may work
    in practice, but is not necessarily efficient and
    has no guarantee of finding a solution within ε
    of the optimal one
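
As an illustration of a heuristic, here is a sketch of the nearest-neighbour rule for visiting every city once. The city names and distances in `DIST` are hypothetical, and the rule carries no guarantee about how close its answer is to the optimum.

```python
def nearest_neighbor_tour(dist, start):
    """Greedy heuristic: always move to the closest unvisited city."""
    unvisited = set(dist) - {start}
    tour, total = [start], 0
    while unvisited:
        here = tour[-1]
        # Ties are broken alphabetically so the result is reproducible.
        nxt = min(unvisited, key=lambda city: (dist[here][city], city))
        total += dist[here][nxt]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour, total


# Hypothetical symmetric distances between four cities.
DIST = {
    "A": {"B": 1, "C": 4, "D": 3},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"A": 3, "B": 5, "C": 1},
}

print(nearest_neighbor_tour(DIST, "A"))
```

The rule runs in polynomial time, but it commits to each step without looking ahead, which is why it may miss the best route.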

10
Travelling salesman problem
  • A salesman must visit each city on a list exactly
    once, covering the smallest number of miles in
    total
  • Classic NP-hard problem
  • Excellent approximate algorithms exist
  • Many computational biology problems are solved by
    casting them as instances of the TSP and then
    applying an existing algorithm
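
For contrast with approximate methods, a brute-force sketch that tries every ordering of the cities. It follows the slide's formulation (visit each city once, no return leg); the cities W, X, Y, Z and their distances are hypothetical. The running time grows as n!, so this is only usable for a handful of cities.

```python
from itertools import permutations


def brute_force_tsp(dist):
    """Return the shortest route that visits every city exactly once."""
    best_route, best_miles = None, float("inf")
    for route in permutations(sorted(dist)):
        # Sum the distances between consecutive cities on this route.
        miles = sum(dist[a][b] for a, b in zip(route, route[1:]))
        if miles < best_miles:
            best_route, best_miles = route, miles
    return best_route, best_miles


# Hypothetical symmetric distances.
DIST = {
    "W": {"X": 2, "Y": 9, "Z": 10},
    "X": {"W": 2, "Y": 6, "Z": 4},
    "Y": {"W": 9, "X": 6, "Z": 3},
    "Z": {"W": 10, "X": 4, "Y": 3},
}

print(brute_force_tsp(DIST))
```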

11
Travelling salesman problem
  [Figure: complete graph on five cities (New York,
  Chicago, Los Angeles, Dallas, Miami); the ten edges
  carry distances of 810, 1090, 1190, 1330, 1400, 1540,
  1610, 2050, 2720 and 2790]

12
Graph jargon
  • A graph G(V, E) is composed of a set of vertices
    (V) and edges (E)
  • Vertices are also known as nodes
  • The edges, and thus the graphs, may be directed,
    if edges have a head at one vertex and a tail at
    the other, or undirected otherwise
  • The degree of a vertex is the number of adjacent
    vertices
  • For directed graphs, vertices have an indegree
    and an outdegree
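
A small sketch of these definitions for a directed graph stored as a list of (tail, head) pairs. The representation and the example edges are assumptions made for illustration.

```python
from collections import defaultdict


def degrees(edges):
    """Map each vertex to its (indegree, outdegree) in a directed graph."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for tail, head in edges:
        outdeg[tail] += 1  # the edge leaves its tail
        indeg[head] += 1   # and enters its head
    vertices = set(indeg) | set(outdeg)
    return {v: (indeg[v], outdeg[v]) for v in sorted(vertices)}


# Hypothetical directed graph: a -> b, a -> c, b -> c.
print(degrees([("a", "b"), ("a", "c"), ("b", "c")]))
```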

13
Graph jargon
  • Weighted graphs have a cost or distance w(E_i) on
    each edge i (as in the TSP)
  • A path is a list of vertices (v_1, v_2, ..., v_k)
    where v_i and v_(i+1) are adjacent
  • The weight of a path is the sum of the weights on
    each edge
  • A cycle is a path that returns to its starting
    vertex
  • Acyclic graphs contain no cycles
  • Connected acyclic undirected graphs are trees
  • The phylogenetic trees that biologists know and
    love
  • Important data structures

14
Graph jargon
  • A connected component is a maximal set of
    vertices in which every pair is joined by a path
  • No vertex adjacent to a member is excluded
  • No proper subset is itself a connected component
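
One way to find connected components in an undirected graph is a depth-first search from each vertex not yet seen. This is a sketch under the assumption that the graph is given as a vertex list plus a list of edge pairs.

```python
def connected_components(vertices, edges):
    """Group the vertices of an undirected graph into connected components."""
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Collect everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacent[v] - component)
        seen |= component
        components.append(sorted(component))
    return components


print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```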

15
Eulerian graph
  • Contains a cycle in which each edge appears
    exactly once
  • An Eulerian path can be found with an algorithm
    that is O(n + m) in the number of vertices n and
    edges m
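
A sketch of Hierholzer's algorithm, one standard way to construct an Eulerian cycle. It assumes an undirected, connected graph in which every vertex has even degree, and it visits each edge a constant number of times. The edge-list input format is an assumption.

```python
def eulerian_cycle(edges):
    """Return an Eulerian cycle as a list of vertices (first == last)."""
    adjacent = {}
    for i, (u, v) in enumerate(edges):
        adjacent.setdefault(u, []).append((v, i))
        adjacent.setdefault(v, []).append((u, i))
    used = [False] * len(edges)
    stack, cycle = [edges[0][0]], []
    while stack:
        v = stack[-1]
        # Discard edges that were already walked from their other endpoint.
        while adjacent[v] and used[adjacent[v][-1][1]]:
            adjacent[v].pop()
        if adjacent[v]:
            w, i = adjacent[v].pop()
            used[i] = True
            stack.append(w)
        else:
            # Dead end: this vertex is finished, so emit it.
            cycle.append(stack.pop())
    return cycle


# Two triangles sharing vertex 3; every vertex has even degree.
EDGES = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 3)]
print(eulerian_cycle(EDGES))
```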

  [Figure: example graph with labels 1 through 8]

16
Hamiltonian graph
  • Contains a cycle in which each vertex appears
    exactly once
  • The objective of the TSP is to find a Hamiltonian
    path with minimal weight
  • Deciding whether a graph contains a Hamiltonian
    path or cycle is NP-complete; finding one of
    minimal weight is NP-hard

17
DNA computing
  • In 1994, Leonard Adleman implemented a DNA
    computer that could find a Hamiltonian path in a
    directed graph

18
DNA computing
  • Outline of algorithm
  • Generate all possible routes
  • Select itineraries that start with the proper
    city and end with the final city
  • Select itineraries with the correct number of
    cities
  • Select itineraries that contain each city only
    once
  • Each step corresponds to the application of a
    standard molecular biology reaction
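
The four steps can be mimicked in software as a generate-and-filter pipeline. This is only an analogue: the test tube produces itineraries at random and in parallel, whereas the sketch enumerates them exhaustively, and the route map in `ROUTES` is hypothetical because the slide's map is not reproduced in this transcript.

```python
def walks(routes, max_len):
    """All itineraries of 1..max_len cities that follow the directed routes."""
    cities = {c for route in routes for c in route}
    frontier = [(c,) for c in sorted(cities)]
    result = list(frontier)
    for _ in range(max_len - 1):
        frontier = [w + (b,) for w in frontier
                    for a, b in sorted(routes) if a == w[-1]]
        result.extend(frontier)
    return result


def hamiltonian_paths(routes, start, end):
    n = len({c for route in routes for c in route})
    candidates = walks(routes, n)                       # 1. generate routes
    candidates = [w for w in candidates
                  if w[0] == start and w[-1] == end]    # 2. proper endpoints
    candidates = [w for w in candidates if len(w) == n] # 3. correct length
    return [w for w in candidates if len(set(w)) == n]  # 4. each city once


# Hypothetical one-way routes between the five cities.
ROUTES = {
    ("LA", "Chicago"), ("LA", "Dallas"), ("Chicago", "Dallas"),
    ("Dallas", "Chicago"), ("Chicago", "Miami"), ("Dallas", "Miami"),
    ("Dallas", "NY"), ("Miami", "NY"),
}

for path in hamiltonian_paths(ROUTES, "LA", "NY"):
    print(" -> ".join(path))
```

With this route map, step 4 does real work: the itinerary LA, Dallas, Chicago, Dallas, NY has the right endpoints and length but repeats a city.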

19
DNA computing
  • Cities are encoded by oligonucleotides
  • Los Angeles GCTACG
  • Chicago CTAGTA
  • Dallas TCGTAC
  • Miami CTACGG
  • New York ATGCCG
  • The path (LA, Chicago, Dallas, Miami, New York)
    would be
  • GCTACG CTAGTA TCGTAC CTACGG ATGCCG

20
DNA computing
21
DNA computing
  • Random itineraries obtained by
  • Mixing oligonucleotides encoding both cities and
    routes in a test tube
  • Allowing complementary DNA strands to hybridize
  • Adding ligase to glue the pieces together
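
Hybridization depends on Watson-Crick complementarity. A minimal sketch of the reverse complement follows, assuming uppercase DNA strings and perfect full-length pairing, which is a simplification of what happens in the tube.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo):
    """Sequence of the strand that pairs with `oligo`, read 5' to 3'."""
    return oligo.translate(COMPLEMENT)[::-1]


def can_hybridize(strand_a, strand_b):
    """True if the two strands are exact reverse complements (simplified)."""
    return strand_b == reverse_complement(strand_a)


print(reverse_complement("GCTACG"))
```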

22
DNA computing
  • Select for paths that start in LA and end in NY
  • By performing the polymerase chain reaction with
    LA and NY specific primers

23
DNA computing
  • Select paths of the appropriate length (5 cities
    x 6 bases = 30 bases) by isolating the correct
    band from an electrophoretic gel

24
DNA computing
  • Select paths in which every city is represented,
    by affinity purification with probes complementary
    to each city in turn
  • A path of length 5 containing each city once must
    be a Hamiltonian Path

25
DNA computing
  • Is this practical?
  • No. A 200-city Hamiltonian path (HP) problem would
    require more DNA than the weight of the Earth
  • Is this useful?
  • Yes.
  • DNA operations are inherently massively parallel,
    making simultaneous evaluation of 10^15 molecules
    feasible
  • Silicon-chip computers perform only sequential
    operations and cannot deal with large
    combinatorial problems by exhaustive search

26
Stretching the analogy
  • Many biological operations can be thought of in
    algorithmic terms
  • Specific proteins act in defined sequences on a
    variable set of inputs to produce a definite
    output
  • Cell division
  • Neuronal firing
  • Protein secretion

27
Segue to sequence analysis
  • DNA and protein sequences will be the center of
    our attention for much of the course
  • We need to be able to precisely describe
    algorithms that have these molecules as inputs
    and outputs

28
Sequences and strings
  • Biologists and computer scientists use the words
    string and sequence differently
  • You will see sequence used in both ways in this
    class
  • In CS jargon
  • A string S is a contiguous ordered list of
    symbols
  • A sequence is an ordered list of symbols that need
    not be contiguous
  • If ABCDEFGH is a string
  • ACEG is a sequence (a subsequence of it) but not a
    substring
  • All strings are sequences, but not all sequences
    are strings
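
The distinction can be checked mechanically. A short sketch using the slide's own example follows; the function names are illustrative.

```python
def is_substring(candidate, text):
    """True if `candidate` occurs contiguously in `text`."""
    return candidate in text


def is_subsequence(candidate, text):
    """True if the symbols of `candidate` occur in `text` in order,
    not necessarily contiguously."""
    remaining = iter(text)
    # Each membership test consumes `remaining` up to the matched symbol.
    return all(symbol in remaining for symbol in candidate)


print(is_substring("ACEG", "ABCDEFGH"), is_subsequence("ACEG", "ABCDEFGH"))
```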

29
String jargon
  • W.r.t. some alphabet A
  • For DNA, A = {a, c, g, t}
  • For proteins, there are 20 symbols in the
    alphabet
  • A DNA string S = acgtgc
  • The length of a string is given by |S| = 6
  • Index the ith position in S by S[i]
  • An interval S[i..j] defines a substring of S
  • S is a superstring of all its component
    substrings
  • S[1..j] is a prefix and S[j..|S|] is a suffix
    of S
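
These definitions map directly onto string slicing. The sketch below assumes positions are counted from 1 and intervals include both endpoints, and converts to Python's 0-based, half-open slices; the helper names are illustrative.

```python
S = "acgtgc"


def substring(s, i, j):
    """Positions i through j of s, counted from 1 and inclusive."""
    return s[i - 1:j]


def prefix(s, j):
    """The first j symbols of s."""
    return substring(s, 1, j)


def suffix(s, j):
    """Everything from position j to the end of s."""
    return substring(s, j, len(s))


print(len(S), substring(S, 2, 4), prefix(S, 3), suffix(S, 4))
```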

30
Alignment as a string edit
  • We can define edit operations on S
  • Substitution
  • Insertion
  • Deletion
  • Objective functions
  • One way to formulate the sequence alignment
    problem is to transform S into S' with a minimal
    edit distance (i.e. fewest operations)
  • Equivalently, we can seek an alignment with a
    maximal score
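
A sketch of the minimal edit distance between two strings, assuming each substitution, insertion and deletion costs 1 (unit costs are an assumption; the slide does not fix them). It fills a table row by row, keeping only the previous row.

```python
def edit_distance(s, t):
    """Fewest substitutions, insertions and deletions turning s into t."""
    previous = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, a in enumerate(s, start=1):
        current = [i]                   # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute a by b, or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("acgtgc", "acttgc"))
```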

31
Pairwise alignment
  • Scores reflect a ratio of
  • Probability of alignment under evolutionary model
  • Probability of a chance alignment
  • Expressed as a log-odds (LOD) ratio
  • Total score is simply the sum of scores for each
    edit operation
  • A brute force algorithm
  • Enumerate all possible alignments and choose the
    one(s) with highest score

32
Combinatorial explosion!
  n     Number of alignments
  5     258
  10    187,126
  15    156,454,989
  20    1.4 x 10^11
  25    1.3 x 10^14
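
The growth in the table can be reproduced approximately. The values appear to follow the binomial coefficient C(2n, n) and its approximation 2^(2n) / sqrt(pi * n); that this is the formula behind the slide is an assumption, and the sketch only shows how quickly such a count grows.

```python
from math import comb, pi, sqrt


def exact_count(n):
    """Binomial coefficient C(2n, n)."""
    return comb(2 * n, n)


def approx_count(n):
    """Stirling-style approximation 2^(2n) / sqrt(pi * n)."""
    return 4 ** n / sqrt(pi * n)


for n in (5, 10, 15, 20, 25):
    print(n, exact_count(n), f"{approx_count(n):.3g}")
```
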
33
Dynamic programming
  • Efficient (i.e. polynomial-time) algorithm that
    guarantees finding an optimal pairwise alignment
  • O(n^2) where n is the length of the sequences
  • Comes in a few flavors
  • Global (Needleman-Wunsch)
  • Local (Smith-Waterman)
  • Multiple segments
  • Repeats, overlaps, etc.
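
A sketch of the global (Needleman-Wunsch) flavor, computing only the optimal score. The scoring values (match 1, mismatch -1, linear gap penalty -2) are hypothetical stand-ins for a real substitution matrix. The two nested loops over the table are where the O(n^2) cost comes from.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t with a linear gap penalty."""
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,       # s[i-1] opposite a gap
                          F[i][j - 1] + gap)       # t[j-1] opposite a gap
    return F[-1][-1]


print(needleman_wunsch_score("acgtgc", "acttgc"))
```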

34
Recursion
  • Principle of dynamic programming is that the
    solution to a large instance can be recursively
    found from solutions to smaller instances
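
The recursive principle can be written down directly: the best score for two prefixes is built from the best scores of shorter prefixes, and caching those results keeps the work polynomial. A sketch follows, with hypothetical scoring values (match 1, mismatch -1, gap -2); it is suitable for short strings only because of Python's recursion limit.

```python
from functools import lru_cache


def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score, computed top-down with memoization."""

    @lru_cache(maxsize=None)
    def best(i, j):
        """Best score for aligning the prefixes s[:i] and t[:j]."""
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        pair = match if s[i - 1] == t[j - 1] else mismatch
        return max(best(i - 1, j - 1) + pair,
                   best(i - 1, j) + gap,
                   best(i, j - 1) + gap)

    return best(len(s), len(t))


print(global_score("acgtgc", "acttgc"))
```

Without the cache the same subproblems would be recomputed exponentially often; with it, each (i, j) pair is solved once.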

35
Reading assignments
  • Gibson & Muse, Box 2.1 Pairwise sequence
    alignment, pp. 72-75.
  • Durbin R, Eddy S, Krogh A, Mitchison G (1998)
    Ch. 2 Pairwise alignment, pp. 12-31 in
    Biological sequence analysis, Cambridge Univ.
    Press.