Finding Consensus in a Family of DNA Sequences

1
Finding Consensus in a Family of DNA Sequences
  • Joshua W. Gilkerson

2
Agenda
  • Consensus Sequences
  • Combinatorial Optimization
  • Simulated Annealing
  • Genetic Algorithms
  • Application to Finding a Consensus Sequence

3
Consensus Sequence
  • A consensus sequence is one that captures the
    important features of a family of sequences.
  • Consensus sequences have many important
    applications in bioinformatics.
  • Used to define the real sequence of a feature
    that may appear several times in a genome,
    slightly modified.
  • Genes
  • Repeats

4
An Example
  • Form a consensus sequence from the following
    strings
  • THE LONGER SEQUENCE
  • A SEQUENCE
  • THIS TISAS
  • S RIS ANEQ

5
An Example
  • THE LONGER SEQUENCE
  • A SEQUENCE
  • THIS TISAS
  • S RIS ANEQ

6
Definition: Point Mutation
  • A modification to a string at a single location,
    changing a single character.
  • May be one of
  • Insert
  • Delete
  • Substitute

7
Definition: Levenshtein Distance
  • The minimum number of point mutations required to
    transform one string into another.
  • Also known as the edit distance.
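
As a minimal sketch of this definition, the edit distance can be computed with the standard dynamic program over prefixes of the two strings. The function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping only two rows of the table."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, free when characters match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The table has one cell per pair of prefixes, so the cost is proportional to the product of the two string lengths.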

8
Definition: Consensus Sequence
  • The most widely used definition of a consensus
    string is that it minimizes the sum of the edit
    distances to the strings in the family.
  • Also known as the Steiner Consensus String
  • Very similar to a median, which minimizes the sum
    of the distances to a family of numbers (a mean
    minimizes the sum of squared distances).
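
In symbols, the definition above can be written as follows. The notation is assumed here for illustration: $d$ is the edit distance, $s_1, \dots, s_n$ are the strings in the family, and $\Sigma^{*}$ is the set of all strings over the alphabet.

```latex
s^{*} = \arg\min_{s \in \Sigma^{*}} \sum_{i=1}^{n} d(s, s_i)
```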

9
Why Not an Exact Algorithm?
  • The determination of a consensus sequence has
    been shown to be NP-hard (Gusfield).
  • Instances from bioinformatics are large, making
    an exact solution intractable.
  • Therefore, an approximation technique is needed.
  • Simulated Annealing and Genetic Algorithms have
    been applied successfully to other NP-hard
    problems.

10
Combinatorial Optimization
  • The process of finding minimum (or maximum)
    values for a function whose domain space contains
    many independent variables.
  • Finding a consensus sequence is a type of
    combinatorial optimization, where the value to be
    minimized is the sum of the edit distances.

11
Simulated Annealing
  • is a process developed for other combinatorial
    optimization problems (Kirkpatrick), and has
    shown promise for finding consensus sequences
    (Keith).
  • approximates the process of annealing used to
    produce very dense, very stable masses.
  • Annealing involves heating a mass and then
    allowing it to cool very slowly. The product is
    an object whose molecules are arranged to have
    the lowest possible energy.

12
Simulated Annealing
  • Start with an initial approximation and an
    initial temperature.
  • Iteratively, choose a set of adjacent
    approximations.
  • Randomly choose the next approximation from this
    set plus the current approximation, weighted
    using the value being optimized.
  • Slowly lower the temperature.

13
Simulated Annealing
  • Sequences are weighted proportionally to the
    expression $e^{-\Delta E / (k_B T)}$
  • $k_B$ is the Boltzmann constant
  • $T$ is the current temperature
  • $\Delta E$ is the change in the target variable
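
The weighting can be sketched as below, assuming the usual Boltzmann form exp(-ΔE / (kB·T)). The function name `boltzmann_weights` and the default `k_b=1.0` are illustrative choices, not taken from the slides.

```python
import math


def boltzmann_weights(delta_es, temperature, k_b=1.0):
    """Normalised weights proportional to exp(-dE / (k_b * T)) for each candidate."""
    # Shifting every dE by the smallest one keeps the exponents <= 0, which
    # avoids overflow at low temperatures and leaves the proportions unchanged.
    lowest = min(delta_es)
    raw = [math.exp(-(de - lowest) / (k_b * temperature)) for de in delta_es]
    total = sum(raw)
    return [w / total for w in raw]


# Three candidates whose energy change is 0, 1 and 2.
print(boltzmann_weights([0.0, 1.0, 2.0], temperature=100.0))  # nearly uniform
print(boltzmann_weights([0.0, 1.0, 2.0], temperature=0.1))    # almost all weight on dE = 0
```

At a high temperature the candidates are close to equally likely. At a low temperature nearly all the weight falls on the candidate with the smallest ΔE.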

14
Simulated Annealing
  • As the temperature becomes lower, the odds of
    choosing a higher energy state also become lower.
  • This allows the approximation to move freely at
    first, but even at high temperatures, if a very
    low energy position is found, it is hard to leave.

15
An Example
  • This is a very simplified example where the
    domain is a single variable.
  • The value to be optimized is plotted to the left.

16
An Example
  • If moves to higher energy are not allowed,
    starting anywhere between 2 and 4 would yield the
    final answer of 3, which does not have minimum
    energy.

17
An Example
  • However, if moves to higher energy states are
    allowed, it is at least possible to find the
    optimal answer of 5, no matter what the initial
    approximation is.
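
The contrast between the two strategies can be sketched on a made-up landscape. The `ENERGY` values below are hypothetical, chosen only so that position 3 is a local minimum and position 5 the global minimum; they are not read from the plot. The cooling schedule and its parameters are likewise assumptions.

```python
import math
import random

# Hypothetical energy at positions 0..7: local minimum at 3, global minimum at 5.
ENERGY = [6.0, 4.0, 3.0, 2.0, 5.0, 0.0, 3.0, 6.0]


def neighbours(x):
    """Adjacent positions that lie inside the landscape."""
    return [n for n in (x - 1, x + 1) if 0 <= n < len(ENERGY)]


def greedy(start):
    """Only ever move to a strictly lower-energy neighbour."""
    x = start
    while True:
        best = min(neighbours(x), key=ENERGY.__getitem__)
        if ENERGY[best] >= ENERGY[x]:
            return x
        x = best


def anneal(start, t_start=5.0, t_end=0.05, cooling=0.99, seed=0):
    """Simulated annealing; returns (final position, best position visited)."""
    rng = random.Random(seed)
    x, best, t = start, start, t_start
    while t > t_end:
        options = [x] + neighbours(x)
        # Weight each option by exp(-dE / T), with k_B folded into T.
        weights = [math.exp(-(ENERGY[n] - ENERGY[x]) / t) for n in options]
        x = rng.choices(options, weights=weights, k=1)[0]
        if ENERGY[x] < ENERGY[best]:
            best = x
        t *= cooling  # geometric cooling (an assumption)
    return x, best


print(greedy(2))   # 3: stuck in the local minimum
print(anneal(2))   # (final position, best position visited)
```

In this sketch the greedy walk from position 2 stops at 3, because leaving it means climbing over position 4. The annealing run can cross that barrier while the temperature is still high.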

18
Application to Consensus Sequences
  • While many methods have been developed to find
    consensus sequences, most are expensive and don't
    scale well to larger numbers of sequences.
  • Most prior methods build multiple sequence
    alignments and derive a consensus from them;
    simulated annealing instead searches for the
    consensus directly.
  • The consensus sequence may then be used to
    quickly generate a multiple alignment if needed.

19
Application to Consensus Sequences
  • Sum of edit distances used as optimization value.
  • Sets of adjacent approximations are constructed
    by iteratively considering each position in the
    current approximation and making all possible
    point mutations at that position.
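
A minimal sketch of building this set for one position is given below. It assumes a DNA alphabet of A, C, G and T; the function name `point_mutations` is illustrative, and Keith et al. may organise the step differently.

```python
ALPHABET = "ACGT"  # assumed DNA alphabet


def point_mutations(seq, pos, alphabet=ALPHABET):
    """All strings reachable from seq by one insert, delete or substitute at pos.

    pos may equal len(seq), in which case only insertions (appends) apply.
    """
    result = set()
    for ch in alphabet:
        result.add(seq[:pos] + ch + seq[pos:])              # insert ch before pos
        if pos < len(seq) and ch != seq[pos]:
            result.add(seq[:pos] + ch + seq[pos + 1:])      # substitute seq[pos] with ch
    if pos < len(seq):
        result.add(seq[:pos] + seq[pos + 1:])               # delete seq[pos]
    return result


print(sorted(point_mutations("ACG", 1)))
```

With a four-letter alphabet each interior position yields four insertions, three substitutions and one deletion, so the candidate set per position stays small.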

20
Analysis
  • In the algorithm as outlined here and presented
    fully by Keith et al., the runtime
  • scales linearly with the number of sequences
  • scales quadratically with the length of the
    sequences
  • Empirical results have verified these theoretical
    predictions.

21
Genetic Algorithm
  • Is another process developed for solving
    combinatorial optimization problems.
  • Approximates the process of evolution by natural
    selection as seen in nature

22
Genetic Algorithm
  • Start with an initial population (either random
    or some domain specific approximation).
  • Generate offspring from the initial population.
  • Remove the less fit offspring.
  • Fitness is measured by the value being optimized.
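
The loop described above can be sketched generically as follows. The names `evolve`, `toy_fitness` and `toy_offspring`, the toy problem, and the choice to let parents compete with their offspring are all assumptions made for illustration. Lower fitness values are treated as better, to match a minimization problem.

```python
import random


def evolve(population, fitness, make_offspring, n_offspring, n_keep, generations, rng):
    """Repeatedly breed n_offspring candidates and keep the n_keep with lowest fitness."""
    population = list(population)
    for _ in range(generations):
        offspring = [make_offspring(population, rng) for _ in range(n_offspring)]
        # Assumption: parents stay in the pool, so the best candidate is never lost.
        pool = population + offspring
        population = sorted(pool, key=fitness)[:n_keep]
    return population


# Hypothetical toy problem: find an integer close to 42.
def toy_fitness(x):
    return abs(x - 42)


def toy_offspring(pop, rng):
    if len(pop) >= 2 and rng.random() < 0.5:
        a, b = rng.sample(pop, 2)
        child = (a + b) // 2          # "sexual": combine two parents
    else:
        child = rng.choice(pop)       # "asexual": copy one parent
    return child + rng.choice((-1, 0, 1))  # small random mutation


start = [0, 10, 90, 100]
final = evolve(start, toy_fitness, toy_offspring,
               n_offspring=20, n_keep=4, generations=50, rng=random.Random(1))
print(final)
```

Passing `fitness` and `make_offspring` in as functions keeps the loop independent of how candidates are represented.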

23
Genetic Algorithm
  • Generating offspring
  • Can be done sexually or asexually
  • Sexual reproduction takes part of the candidate
    solution from each of two parents already in the
    population.
  • Parts of the candidate may then be mutated.

24
Genetic Algorithm
  • Because of this process of producing offspring,
    problems whose solution space can be represented
    as a vector (or string) are most easily
    addressed.
  • Sexual reproduction allows for some large jumps
    within the solution space, but these are limited
    by the current population.

25
An Example
  • Same example as used earlier.

26
Application to Consensus Sequences
  • Initial population is the set of strings under
    consideration
  • Random chance of sexual or asexual reproduction.
  • If sexual, a prefix is selected from one sequence
    and a suffix from another. The lengths are drawn
    from a Poisson distribution to favor offspring
    of approximately the same length as the parents.
  • Each position of the sequence is mutated with
    some probability.
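
One way these reproduction steps might look is sketched below. The slides do not give the Poisson means, the mutation rate or the mix of mutation types, so everything here is an assumption: each mean is half the parent's length, mutations are an equal mix of substitute, insert and delete, and the names `poisson`, `crossover`, `mutate` and `make_offspring` are illustrative.

```python
import math
import random


def poisson(lam, rng):
    """Poisson sample by Knuth's method: multiply uniforms until below exp(-lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def crossover(a, b, rng):
    """Join a prefix of a to a suffix of b; each length is Poisson around half its parent."""
    prefix_len = min(len(a), poisson(len(a) / 2, rng))
    suffix_len = min(len(b), poisson(len(b) / 2, rng))
    return a[:prefix_len] + b[len(b) - suffix_len:]


def mutate(seq, rate, rng, alphabet="ACGT"):
    """Apply a random point mutation at each position with probability rate."""
    out = []
    for ch in seq:
        if rng.random() < rate:
            kind = rng.choice(("substitute", "insert", "delete"))
            if kind == "substitute":
                out.append(rng.choice(alphabet))   # may pick the same base again
            elif kind == "insert":
                out.append(rng.choice(alphabet))
                out.append(ch)
            # "delete": the character is simply dropped
        else:
            out.append(ch)
    return "".join(out)


def make_offspring(a, b, rng, p_sexual=0.5, rate=0.05):
    """Random choice of sexual or asexual reproduction, followed by mutation."""
    child = crossover(a, b, rng) if rng.random() < p_sexual else a
    return mutate(child, rate, rng)


rng = random.Random(0)
print(make_offspring("ACGTACGTAC", "ACGAACGTTC", rng))
```

Centering each piece on half of its parent's length keeps the expected offspring length near the average of the two parents.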

27
Application to Consensus Sequences
  • A number of offspring are produced, then some
    number of the best are kept.
  • The process is repeated some number of times.

28
Analysis
  • In the algorithm as outlined, the runtime
  • Scales linearly with the number of sequences
  • Scales quadratically with the length of the
    sequences
  • Empirical results have suggested some interesting
    improvements for setting the population sizes and
    the number of generations.
  • Speculatively, I expect a final version's runtime
    to
  • Scale quadratically with the number of sequences.
  • Scale cubically with their lengths.

29
Summary
  • Simulated annealing and genetic algorithms have
    been used successfully in many other
    combinatorial optimization problems including the
    traveling salesman problem.
  • Likewise, early results are promising for finding
    consensus sequences.
  • No guarantees of accuracy.
  • However, it appears that it might be possible to
    give some approximation of how accurate a given
    solution is.
  • Very fast.

30
References
  • Day, W. H., and F. R. McMorris. The computation
    of consensus patterns in DNA sequence.
    Mathematical and Computer Modelling.
    17(1993)49-52.
  • Keith, Jonathan M., Peter Adams, Darryn Bryant,
    Dirk P. Kroese, Keith R. Mitchelson, Duncan A.E.
    Cochran, and Gita H. Lala. A Simulated Annealing
    Algorithm for Finding Consensus Sequences.
    Bioinformatics. 18(2002)1495-9.
  • Kirkpatrick, S., C. D. Gelatt, Jr., M. P. Vecchi.
    Optimization by Simulated Annealing. Science.
    220(1983)671-80.