DNA Segmentation - PowerPoint PPT Presentation

Slides: 34
Provided by: diro8
Transcript and Presenter's Notes

Title: DNA Segmentation


1
DNA Segmentation
  • Presented by Ming-Te Cheng
  • IFT 6299 - Algorithmique de l'ADN
  • Autumn 2004
  • November 15, 2004

2
Overview
  • Introduction
  • Segmentation Models
  • Segmentation Methods
  • Discussion

3
Introduction
  • Statistical analysis of DNA sequences is
    motivated by three areas of exploration:
  • DNA sequence data offer an extremely fine view
    where traditional methods of variation analysis
    can be extended
  • DNA sequence data allow the study of the
    fine-tuning and organization of genetic processes
  • Comparison of sequences between species demands
    methods of determining similarities in evolution
    or function

4
Introduction
  • Large chunks of the genome are sequenced, but
    the function of many sequences is unknown
  • Scientists rely on homology (similarity) to
    analyze unknown sequences by comparison with
    previously well-studied small sequences
  • Need methods of describing and assessing
    sequences that provide useful characterizations
  • Ex.: Segmentation models

5
Segmentation Models
  • Assumption: sequences can be partitioned into a
    number of segments
  • Each segment has a certain degree of internal
    homogeneity (or similarity)
  • Ex.: Isochores
  • Large segments (> 300 kb) of DNA belonging to a
    number of classes defined by different GC
    levels and by fairly homogeneous base
    compositions

6
Segmentation Methods
  • Common techniques
  • Moving Window
  • Maximum Likelihood Estimation
  • Hidden Markov Models
  • Recursive Segmentation

7
Segmentation Methods: Moving Window
  • Most commonly used algorithm in the biology community
  • Straightforward implementation
  • Calculate density of a sequence feature of
    interest within a window
  • Move window along sequence
  • Recalculate density again
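The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function name `window_density`, the choice of GC content as the feature of interest, and the default window and step sizes are all assumptions.

```python
def window_density(seq, feature="GC", window=100, step=50):
    """Return (start, density) pairs, one per window position.

    `feature` is the set of bases counted as hits (GC content is an
    assumed example); `window` and `step` are the window size and the
    moving distance, both chosen by the user.
    """
    seq = seq.upper()
    profile = []
    # Slide the window along the sequence, `step` bases at a time.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        hits = sum(1 for base in chunk if base in feature)
        profile.append((start, hits / window))
    return profile


# Toy sequence: an AT-rich half followed by a GC-rich half.
profile = window_density("ATATATATGCGCGCGC", window=8, step=4)
print(profile)
```

The middle window straddles the boundary between the two halves and reports an intermediate density, which illustrates how the choice of window size and moving distance affects what the profile shows.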

8
Segmentation Methods: Moving Window
  • Drawbacks
  • Arbitrary choice of window size and moving
    distance
  • If window size is too large, local fluctuations
    that contain significant biological information
    may be averaged out
  • If moving distance is too long, one domain can be
    split between two windows and its distinctive
    feature may be lost

9
Segmentation Methods: Maximum-Likelihood Estimation
  • Algorithm that computes the maximum likelihood
    estimate for the number of changed segments

10
Segmentation Methods: Maximum-Likelihood Estimation
  • Let $X_1, \dots, X_n$ represent a sequence of
    independent random letters from an alphabet
  • Let every $X_i$ follow one of two known distributions,
    specified by the probabilities $p_0(x)$ and $p_1(x)$
  • Changed segment is a segment $[a,b]$ of indices
    where $X_i \sim p_1$ for all $i \in [a,b]$
  • Unchanged segment is a segment $[a,b]$ of indices
    where $X_i \sim p_0$ for all $i \in [a,b]$

11
Segmentation Methods: Maximum-Likelihood Estimation
  • Let $x_i$ be the observed value of $X_i$ in the
    sequence
  • Let $C$ be a set of non-intersecting hypothetical
    changed segments
  • Let $z = (z_1, \dots, z_n)$ be the indicator vector
    for $C$, i.e. $z_i = 1$ if position $i$ lies in a
    segment of $C$ and $z_i = 0$ otherwise

12
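Segmentation Methods: Maximum-Likelihood Estimation
  • Writing $p_0$ and $p_1$ for the unchanged and changed
    distributions, the likelihood function can be
    written as
    $L(z) = \prod_{i=1}^{n} p_0(x_i)^{1 - z_i} \, p_1(x_i)^{z_i}$
  • Log-likelihood can be written as
    $\log L(z) = \sum_{i=1}^{n} \log p_0(x_i) + \sum_{i=1}^{n} z_i \log \frac{p_1(x_i)}{p_0(x_i)}$
  • First term represents log-likelihood of null
    hypothesis that there are no changed segments
  • Second term represents log-likelihood ratio of
    the alternative hypothesis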
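The decomposition described above (a null-hypothesis term plus a log-likelihood-ratio term) can be illustrated with a small Python sketch. The names `p0`, `p1`, `log_likelihood` and `best_changed_segment` are assumptions introduced here, and the toy distributions are invented for illustration. The search shown is only the simplest case, a single changed segment found with a Kadane-style scan over per-position log-ratio scores; it is not the multi-segment algorithm from the references.

```python
import math


def log_likelihood(x, z, p0, p1):
    """Null-hypothesis term plus log-likelihood-ratio term for indicator z."""
    null_term = sum(math.log(p0[a]) for a in x)
    ratio_term = sum(zi * math.log(p1[a] / p0[a]) for a, zi in zip(x, z))
    return null_term + ratio_term


def best_changed_segment(x, p0, p1):
    """Return (a, b, score) for the single highest-scoring segment.

    Indices are 0-based and inclusive. The score is the sum of
    log(p1/p0) over the segment, i.e. the second term above when z
    marks exactly that segment. Returns (None, None, 0.0) if no
    segment has a positive score.
    """
    best = (None, None, 0.0)
    run_start, run_score = 0, 0.0
    for i, a in enumerate(x):
        s = math.log(p1[a] / p0[a])
        if run_score <= 0:
            # A non-positive running sum cannot help; restart here.
            run_start, run_score = i, s
        else:
            run_score += s
        if run_score > best[2]:
            best = (run_start, i, run_score)
    return best


# Assumed toy distributions: unchanged = AT-rich, changed = GC-rich.
p0 = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}
p1 = {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}
x = "ATATGCGCATAT"

a, b, score = best_changed_segment(x, p0, p1)
z = [1 if a <= i <= b else 0 for i in range(len(x))]
gain = log_likelihood(x, z, p0, p1) - log_likelihood(x, [0] * len(x), p0, p1)
print(a, b, score, gain)
```

The gain in log-likelihood over the null hypothesis equals the segment score, which is why maximizing the likelihood over z reduces to finding high-scoring segments.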

13
Segmentation Methods: Maximum-Likelihood Estimation
14
Segmentation Methods: Hidden Markov Models
  • Example of Markov Model

15
Segmentation Methods: Hidden Markov Models
  • Example of Hidden Markov Model

16
Segmentation Methods: Hidden Markov Models
  • Assumes that different segments can be classified
    into a finite set of states, where the nucleotide
    data in each state follow a probability
    distribution

17
Segmentation Methods: Hidden Markov Models
  • Let the finite number of r states underlying the
    observations be denoted by $S_i$
  • Let the states follow a Markov process with
    transition matrix
  • System of equations for the hidden chain can be
    written as

18
Segmentation Methods: Hidden Markov Models
  • Likewise, system of equations for the
    observations can be written as
  • where $y_i = (y_{i,1}, \dots, y_{i,m})$ represents the
    vector of m possible observed outcomes, and where
    each observation is associated with one of the states

19
Segmentation Methods: Hidden Markov Models
  • With the systems of equations for the hidden chain
    and the observations, the smoothing equations can
    be derived
  • and used to plot the homogeneous regions in
    the sequence
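One common way to obtain smoothed state probabilities is the forward-backward recursion. The sketch below is a generic version of that recursion, not necessarily the exact smoothing equations shown on the slides. The function name `smooth`, the two-state model, and all numeric parameters are assumptions chosen for illustration.

```python
def smooth(obs, init, trans, emit):
    """Return post[i][s] = P(state s at position i | whole sequence).

    init[s]     : initial state probabilities
    trans[s][t] : probability of moving from state s to state t
    emit[s][y]  : probability of observing symbol y in state s
    Each pass is renormalized per position to avoid underflow.
    """
    n, r = len(obs), len(init)

    # Forward pass: filtered probabilities given observations up to i.
    fwd, prev = [], None
    for i, y in enumerate(obs):
        if i == 0:
            row = [init[s] * emit[s][y] for s in range(r)]
        else:
            row = [emit[s][y] * sum(prev[t] * trans[t][s] for t in range(r))
                   for s in range(r)]
        total = sum(row)
        prev = [v / total for v in row]
        fwd.append(prev)

    # Backward pass: evidence from observations after position i.
    bwd = [[1.0] * r for _ in range(n)]
    for i in range(n - 2, -1, -1):
        row = [sum(trans[s][t] * emit[t][obs[i + 1]] * bwd[i + 1][t]
                   for t in range(r))
               for s in range(r)]
        total = sum(row)
        bwd[i] = [v / total for v in row]

    # Combine both passes and normalize at each position.
    post = []
    for f, b in zip(fwd, bwd):
        row = [fv * bv for fv, bv in zip(f, b)]
        total = sum(row)
        post.append([v / total for v in row])
    return post


# Assumed two-state model: state 0 is AT-rich, state 1 is GC-rich.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [{"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}]

post = smooth("ATATATGCGCGC", init, trans, emit)
for i, row in enumerate(post):
    print(i, [round(p, 3) for p in row])
```

Plotting the posterior probability of one state against position gives the kind of profile from which homogeneous regions can be read off.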

20
Segmentation Methods: Hidden Markov Models
21
Segmentation Methods: Recursive Segmentation
  • Assumes that sequences exhibit hierarchical
    patterns (possibility of subdomains)
  • It is possible to apply a filter to convert the
    original four-base DNA sequence into a k-symbol
    sequence
  • Ex.: S (strong) = {C, G} and W (weak) = {A, T}
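Such a filter amounts to a symbol-by-symbol mapping. A minimal sketch for the two-symbol strong/weak case follows; the names `SW_MAP` and `to_k_symbol` are hypothetical, and any other grouping of the four bases could be passed in as the mapping.

```python
# Strong bases (C, G) map to "S"; weak bases (A, T) map to "W".
SW_MAP = {"C": "S", "G": "S", "A": "W", "T": "W"}


def to_k_symbol(seq, mapping=SW_MAP):
    """Convert a four-base DNA string into a k-symbol string."""
    return "".join(mapping[base] for base in seq.upper())


converted = to_k_symbol("ACGTTG")
print(converted)
```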

22
Segmentation Methods: Recursive Segmentation
  • Divide-and-conquer approach is applied
  • For a k-symbol sequence of length N, calculate at
    each position i (0 < i < N) the entropy $H$ of the
    whole sequence, the entropy $H_l$ of the subsequence
    on the left side of the partition point, and the
    entropy $H_r$ of the subsequence on the right side

23
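Segmentation Methods: Recursive Segmentation
  • Entropy equations as defined by (Shannon 1948),
    for a partition at position i of a k-symbol
    sequence of length N
    $H = -\sum_{j=1}^{k} \frac{N_j}{N} \log \frac{N_j}{N}$
    $H_l = -\sum_{j=1}^{k} \frac{N_{j,l}}{i} \log \frac{N_{j,l}}{i}$
    $H_r = -\sum_{j=1}^{k} \frac{N_{j,r}}{N-i} \log \frac{N_{j,r}}{N-i}$
  • where $N_j$, $N_{j,l}$, and $N_{j,r}$ are the counts of
    symbol j in the whole, left, and right sequence,
    respectively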

24
Segmentation Methods: Recursive Segmentation
  • Maximized Jensen-Shannon divergence was chosen to
    measure the heterogeneity of the sequence
  • If divergence is large enough, the sequence is
    heterogeneous and should be segmented
  • Equation is recursively applied for both the left
    and the right subsequence, as long as the
    calculated divergence value stays above the given
    threshold (similar to constructing a binary tree)
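The procedure can be sketched as follows, using the standard length-weighted form of the Jensen-Shannon divergence for a split at position i, D(i) = H - (i/N) H_l - ((N-i)/N) H_r. That form, the use of natural logarithms, and the names `entropy`, `best_split` and `segment` are assumptions made here. The split search recomputes entropies from scratch at every position, which keeps the code short but is quadratic in the sequence length.

```python
import math
from collections import Counter


def entropy(seq):
    """Shannon entropy (natural log) of the symbol frequencies in seq."""
    n = len(seq)
    return -sum(c / n * math.log(c / n) for c in Counter(seq).values())


def best_split(seq):
    """Return (i, D) for the partition point with maximal divergence.

    Returns (None, 0.0) if no split has positive divergence.
    """
    n = len(seq)
    h = entropy(seq)
    best_i, best_d = None, 0.0
    for i in range(1, n):
        d = h - i / n * entropy(seq[:i]) - (n - i) / n * entropy(seq[i:])
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d


def segment(seq, threshold, offset=0):
    """Recursively split seq; return the sorted list of cut positions."""
    if len(seq) < 2:
        return []
    i, d = best_split(seq)
    if i is None or d <= threshold:
        return []  # homogeneous enough: stop recursing on this branch
    left = segment(seq[:i], threshold, offset)
    right = segment(seq[i:], threshold, offset + i)
    return left + [offset + i] + right


# Toy two-symbol sequence with one sharp composition change.
cuts = segment("W" * 20 + "S" * 20, threshold=0.3)
print(cuts)
```

Each recursive call corresponds to one node of the binary tree mentioned above, and the returned cut positions are its internal nodes read from left to right.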

25
Segmentation Methods: Recursive Segmentation
  • Alternate approach to determining stopping
    criterion involves finding a model at the border
    between underfitting models (those that do not
    fit the data well) and overfitting models (those
    that fit the data too well by using too many
    parameters)
  • Bayesian Information Criterion (BIC) was used to
    balance the goodness-of-fit of the model to the
    data against the number of parameters used
26
Segmentation Methods: Recursive Segmentation
  • The BIC is defined as
    $\mathrm{BIC} = -2 \log L + K \log N$
  • L is the likelihood of the model, K the number
    of free parameters, and N the sample size

27
Segmentation Methods: Recursive Segmentation
  • Two models can be compared
  • Modelling the sequence as one single random
    sequence
  • Modelling it as two random subsequences with
    different base compositions
  • In order for recursive segmentation to continue,
    the following must apply
  • where k is the number of different symbols in
    the sequence
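A condition of this kind can be derived by requiring that the two-subsequence model has the lower BIC. The sketch below makes several assumptions that are not taken from the slides: the usual definition BIC = -2 log L + K log N, natural logarithms, maximum-likelihood base frequencies in each model, the Jensen-Shannon divergence D_JS = H - (i/N) H_l - ((N-i)/N) H_r, and the partition point counted as one extra free parameter.

```latex
\begin{align*}
K_1 &= k - 1, \qquad K_2 = 2(k - 1) + 1 = 2k - 1, \\
\log L_1 &= -N H, \qquad \log L_2 = -i H_l - (N - i) H_r, \\
\log L_2 - \log L_1 &= N \, D_{JS}, \\
\mathrm{BIC}_2 < \mathrm{BIC}_1
  &\iff 2 (\log L_2 - \log L_1) > (K_2 - K_1) \log N \\
  &\iff 2 N \, D_{JS} > k \log N .
\end{align*}
```

Under these assumptions the penalty grows only logarithmically with the sequence length, so long sequences need a proportionally smaller divergence to justify a split.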

28
Segmentation Methods: Recursive Segmentation
  • The condition for continuing recursive
    segmentation can alternatively be used to define
    the segmentation strength s, i.e.
  • Recursive segmentation process can be continued
    as long as $s > s_0$, where $s_0$ is predefined by the
    user
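One plausible form of such a strength is the relative excess of 2 N D_JS over a BIC-style penalty k log N, so that s > 0 corresponds to passing a BIC comparison. This form is an assumption made for illustration, since the defining equation is not preserved here; the function name `segmentation_strength` is likewise hypothetical.

```python
import math


def segmentation_strength(n, d_js, k):
    """Relative excess of 2*N*D_JS over the penalty k*log(N).

    Assumed form, not taken from the slides:
        s = (2*N*D_JS - k*log(N)) / (k*log(N))
    n    : sequence length N
    d_js : Jensen-Shannon divergence at the best split (natural log)
    k    : number of different symbols in the sequence
    """
    penalty = k * math.log(n)
    return (2 * n * d_js - penalty) / penalty


# A perfectly separable two-symbol sequence of length 40 (D_JS = log 2).
s = segmentation_strength(40, math.log(2), k=2)
s0 = 0.0  # user-chosen threshold; 0 matches the BIC-style condition
print(s, s > s0)
```

Raising s0 above zero makes the procedure more conservative, leaving fewer and more strongly supported segment boundaries.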

29
Segmentation Methods: Recursive Segmentation
30
Segmentation Methods: Recursive Segmentation
31
Segmentation Methods: Recursive Segmentation
32
Discussion
  • DNA sequences can be assumed to have segments
    where each has a degree of homogeneity
  • A number of statistical methods can be used to
    identify and analyse these segments
  • Isochores
  • CpG islands
  • Replication origin and terminus
  • Complex patterns in telomeres
  • Coding-noncoding borders
  • Other statistical methods for analysing DNA
    segmentation do exist, each with varying degrees
    of success
  • Bayesian approach
  • Walking Markov
  • Change-point methods

33
References
  • Braun J.V., Müller H.-G. Statistical methods
    for DNA sequence segmentation, Statistical
    Science, 13:142-162, 1998.
  • Duda R.O., Hart P.E., Stork D.G. Pattern
    Classification, New York: John Wiley & Sons, Inc.,
    2001.
  • Churchill G.A. Stochastic models for
    heterogeneous DNA sequences, Bulletin of
    Mathematical Biology, 51:79-94, 1989.
  • Csürös M. Algorithms for finding
    maximal-scoring segment sets, Proc. WABI, 2004.
  • Li W., et al. Applications of recursive
    segmentation to the analysis of DNA sequences,
    Computers & Chemistry, 26:491-510, 2002.