Title: DNA Segmentation
1DNA Segmentation
- Presented by Ming-Te Cheng
- IFT 6299 - Algorithmique de lADN
- Autumn 2004
- November 15, 2004
2Overview
- Introduction
- Segmentation Models
- Segmentation Methods
- Discussion
3Introduction
- Statistical analysis of DNA sequences are
motivated by 3 areas of exploration - DNA sequence data offer an extremely fine view
where traditional methods of variation analysis
can be extended - DNA sequence data allow fine-tuning and
organization of genetic process - Comparison of sequences between species demands
methods of determining similarities in evolution
or function
4Introduction
- Large chunks of the genome are sequenced, where
the functionality of many sequences are unknown - Scientists rely on homology (similarity) to
analyze unknown sequences with previously well
studies small sequences - Need methods of describing and assessing
sequences that provide useful characterizations - Ex Segmentation models
5Segmentation Models
- Assumption Sequences can be partitioned into a
number of segments - Each segment has a certain degree of internal
homogeneity (or similarity) - Ex Isochores
- Large segments (gt 300 kb) of DNA belonging to a
number of classes defined by different GC
levels and by fairly homogeneous base
compositions
6Segmentation Methods
- Common techniques
- Moving Window
- Maximum Likelihood Estimation
- Hidden Markov Models
- Recursive Segmentation
7Segmentation Methods Moving Window
- Most commonly used algorithm in biology community
- Straightforward implementation
- Calculate density of a sequence feature of
interest within a window - Move window along sequence
- Recalculate density again
8Segmentation MethodsMoving Window
- Drawbacks
- Arbitrary choice of window size and moving
distance - If window size is too large, local fluctuations
that contain significant biological information
may be averaged out - If moving distance is too long, one domain can be
split between two windows and its distinctive
feature may be lost
9Segmentation MethodsMaximum-Likelihood Estimation
- Algorithm that computes the maximum likelihood
estimate for the number of changed segments
10Segmentation Methods Maximum-Likelihood
Estimation
- Let X1,,Xn represent a sequence of independent
random letters from an alphabet - Let every Xi be one of two known distributions,
specified by the probabilities and - Changed segment is a segment a,b of indices
where for all - Unchanged segment is a segment a,b of indices
where for all
11Segmentation Methods Maximum-Likelihood
Estimation
- Let xi be the observed values of Xi for sequence
- Let C be a non-intersecting set of hypothetical
changed segments - Let z (z1,,zn) be the indicator vector for C
12Segmentation Methods Maximum-Likelihood
Estimation
- Likelihood function is can be written as
- Log-likelihood can be written as
- First term represents log-likelihood of null
hypothesis that there are no changed segments - Second term represents log-likelihood ratio of
the alternative hypothesis
13Segmentation Methods Maximum-Likelihood
Estimation
14Segmentation MethodsHidden Markov Models
15Segmentation MethodsHidden Markov Models
- Example of Hidden Markov Model
16Segmentation MethodsHidden Markov Models
- Assumes that different segments can be classified
into a finite set of state, where the nucleotide
data in each state follows a probability
distribution
17Segmentation MethodsHidden Markov Models
- Let the finite number of r states underlying the
observations be denoted by Si - Let the states follows a Markov process with
transition matrix - System of equations for the hidden chain can be
written as
18Segmentation MethodsHidden Markov Models
- Likewise, system of equations for the
observations can be written as - where yi (yi,1,,yi,m) represent vector of m
possible observed outcomes, and where each
observation is associated with one of the states
19Segmentation MethodsHidden Markov Models
- With the system equations for hidden chain and
observations, the smoothing equations can be
derived - and be used to plot the homogeneous regions in
the sequence
20Segmentation MethodsHidden Markov Models
21Segmentation MethodsRecursive Segmentation
- Assumes that sequences exhibit hierarchical
patterns (possibility of subdomains) - It is possible to apply a filter to convert the
original four-base DNA sequence into k-symbol
sequence - Ex S(strong) C,G and W(weak) A,T
22Segmentation MethodsRecursive Segmentation
- Divide-and-conquer approach is applied
- For k-symbol sequence of length N, calculate each
position i (0 lt i lt N) the entropy H of the whole
sequence, entropy Hl of the subsequence on the
left side of the partition point, and entropy Hr
of the subsequence on the right side.
23Segmentation MethodsRecursive Segmentation
- Entropy equations as defined by (Shannon 1948)
- where Nj, Nj,l, and Nj,r are the counts of
symbol j in the whole, left, and right sequence,
respectively
24Segmentation MethodsRecursive Segmentation
- Maximized Jensen-Shannon divergence was chosen to
measure the heterogeneity of the sequence - If divergence is large enough, the sequence is
heterogeneous and should be segmented - Equation is recursively applied for both the left
and the right subsequence, as long as the
calculated divergence value stays above the given
threshold (similar to constructing a binary tree)
25Segmentation MethodsRecursive Segmentation
- Alternate approach to determining stopping
criterion involves finding a model at the border
between underfitting models (those that do not
fit the data well) and overfitting models (those
that fit the data too well by using too many
parameters) - Bayesian Information Criterion (BIC) was used to
balance goodness-of-fit of the model to data
26Segmentation MethodsRecursive Segmentation
- Alternate approach to determining stopping
criterion involves finding a model at the border
between underfitting models (those that do not
fit the data well) and overfitting models (those
that fit the data too well by using too many
parameters) - Bayesian Information Criterion (BIC) was used to
balance the goodness-of-fit of the model to
data - L is the likelihood of the model, K the number
of free parameters, and N the sample size
27Segmentation MethodsRecursive Segmentation
- Two models can be compared
- Modelling the sequence as one single random
sequence - Modelling it as two random subsequences with
different base compositions - In order for recursive segmentation to continue,
the following must apply - where k is the number of different symbols in
the sequence
28Segmentation MethodsRecursive Segmentation
- Alternate recursive segmentation algorithm
condition can be used to define the segmentation
strength s, i.e. - Recursive segmentation process can be continued
as long as s gt s0, where s0 is predefined by the
user
29Segmentation MethodsRecursive Segmentation
30Segmentation MethodsRecursive Segmentation
31Segmentation MethodsRecursive Segmentation
32Discussion
- DNA sequences can be assumed to have segments
where each has a degree of homogeneity - A number of statistical methods can be used to
identify and analyse these segments - Isochores
- CpG islands
- Replication origin and terminus
- Complex patterns in telomeres
- Coding-noncoding borders
- Other statistical methods for analysing DNA
segmentation do exist, each with varying degrees
of success - Bayesian approach
- Walking Markov
- Change-point methods
33References
- Braun J.V., Müller H.-G. Statistical methods
for DNA sequence segmentation, Statistical
Science, 13142-162, 1998. - Duda R.O., Hart P.E., Stork D.G. (2001) Pattern
Classification, New York John Wiley Sons, Inc. - Churchill, G.A. Stochastic models for
heterogeneous DNA sequences, Bulletin of
Mathematical Biology, 5179-94, 1989. - Csürös M. Algorithms for finding
maximal-scoring segment sets, Proc. WABI, 2004. - Li W., et al. Applications of recursive
segmentation to the analysis of DNA sequences,
Computational Chemistry, 26491-510, 2002.