DNA Segmentation - PowerPoint PPT Presentation

About This Presentation

Title:

DNA Segmentation

Slides: 34

Provided by: diro8

Transcript and Presenter's Notes

Title: DNA Segmentation

1
DNA Segmentation

Presented by Ming-Te Cheng
IFT 6299 - Algorithmique de l'ADN
Autumn 2004
November 15, 2004

2
Overview

Introduction
Segmentation Models
Segmentation Methods
Discussion

3
Introduction

Statistical analysis of DNA sequences is
motivated by three areas of exploration
DNA sequence data offer an extremely fine view
where traditional methods of variation analysis
can be extended
DNA sequence data allow the fine-tuning and
organization of genetic processes to be studied
Comparison of sequences between species demands
methods of determining similarities in evolution
or function

4
Introduction

Large chunks of the genome are sequenced, but
the function of many sequences is unknown
Scientists rely on homology (similarity) to
analyze unknown sequences against previously
well-studied short sequences
Need methods of describing and assessing
sequences that provide useful characterizations
Ex: Segmentation models

5
Segmentation Models

Assumption: Sequences can be partitioned into a
number of segments
Each segment has a certain degree of internal
homogeneity (or similarity)
Ex: Isochores
Large segments (> 300 kb) of DNA belonging to a
number of classes defined by different GC
levels and by fairly homogeneous base
compositions

6
Segmentation Methods

Common techniques
Moving Window
Maximum Likelihood Estimation
Hidden Markov Models
Recursive Segmentation

7
Segmentation Methods: Moving Window

Most commonly used algorithm in the biology community
Straightforward implementation
Calculate the density of a sequence feature of
interest within a window
Move the window along the sequence
Recalculate the density at each new position
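
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `window_density`, the choice of GC content as the feature, and the toy sequence are all assumptions made for the example.

```python
def window_density(seq, window, step, feature="GC"):
    """Return (start, density) pairs, one per window position.

    window: window size in bases; step: moving distance in bases.
    feature: the bases counted as the feature of interest (assumed: G and C).
    """
    seq = seq.upper()
    targets = set(feature)
    results = []
    # Slide the window along the sequence, recomputing the density each time.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        hits = sum(1 for base in chunk if base in targets)
        results.append((start, hits / window))
    return results


# Toy sequence: AT-rich flanks around a GC-rich core.
profile = window_density("ATATATGCGCGCGCATAT", window=6, step=3)
for start, density in profile:
    print(start, round(density, 2))
```

Both `window` and `step` are free parameters here, which is exactly the arbitrariness the next slide lists as a drawback.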

8
Segmentation Methods: Moving Window

Drawbacks
Arbitrary choice of window size and moving
distance
If window size is too large, local fluctuations
that contain significant biological information
may be averaged out
If moving distance is too long, one domain can be
split between two windows and its distinctive
feature may be lost

9
Segmentation Methods: Maximum-Likelihood Estimation

Algorithm that computes the maximum likelihood
estimate for the number of changed segments

10
Segmentation Methods: Maximum-Likelihood Estimation

Let $X_1, \ldots, X_n$ represent a sequence of independent
random letters from an alphabet
Let every $X_i$ follow one of two known distributions,
specified by the letter probabilities $p$ (unchanged) and $q$ (changed)
A changed segment is a segment $[a, b]$ of indices
where $X_i$ follows $q$ for all $i \in [a, b]$
An unchanged segment is a segment $[a, b]$ of indices
where $X_i$ follows $p$ for all $i \in [a, b]$

11
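Segmentation Methods: Maximum-Likelihood Estimation

Let $x_i$ be the observed value of $X_i$ in the sequence
Let $C$ be a set of non-intersecting hypothetical
changed segments
Let $z = (z_1, \ldots, z_n)$ be the indicator vector for $C$,
i.e. $z_i = 1$ if index $i$ lies in a segment of $C$ and $z_i = 0$ otherwise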
12
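Segmentation Methods: Maximum-Likelihood Estimation

The likelihood function can be written as
$$L(z) = \prod_{i=1}^{n} p(x_i)^{1 - z_i} \, q(x_i)^{z_i}$$
where $p$ and $q$ are the letter probabilities of the unchanged and changed distributions
The log-likelihood can be written as
$$\log L(z) = \sum_{i=1}^{n} \log p(x_i) + \sum_{i=1}^{n} z_i \log \frac{q(x_i)}{p(x_i)}$$
First term represents the log-likelihood of the null
hypothesis that there are no changed segments
Second term represents the log-likelihood ratio of
the alternative hypothesis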
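
As a rough illustration of the two-term decomposition described above, the sketch below evaluates the log-likelihood of a sequence for a given set of hypothetical changed segments. The names `log_likelihood`, `p`, and `q`, and the two example distributions, are assumptions introduced here. The sketch only scores a given segment set; it does not search for the maximising one.

```python
import math


def log_likelihood(x, segments, p, q):
    """Log-likelihood of x given changed segments (inclusive 0-based pairs).

    p, q: dicts of letter probabilities for the unchanged and changed
    distributions (assumed names, not taken from the slides).
    Returns (log-likelihood, indicator vector z).
    """
    # Build the indicator vector: 1 inside a changed segment, else 0.
    z = [0] * len(x)
    for a, b in segments:
        for i in range(a, b + 1):
            z[i] = 1
    # First term: log-likelihood under the null (no changed segments).
    null_term = sum(math.log(p[c]) for c in x)
    # Second term: log-likelihood ratio summed over changed positions.
    ratio_term = sum(zi * math.log(q[c] / p[c]) for zi, c in zip(z, x))
    return null_term + ratio_term, z


p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # unchanged: uniform
q = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}      # changed: GC-rich
ll, z = log_likelihood("ATGCGCAT", [(2, 5)], p, q)
print(z, ll)
```

Because the first term does not depend on the segments, maximising the likelihood amounts to choosing segments that maximise the summed log-ratio scores.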

13
Segmentation Methods: Maximum-Likelihood Estimation

14
Segmentation Methods: Hidden Markov Models

Example of Markov Model

15
Segmentation Methods: Hidden Markov Models

Example of Hidden Markov Model

16
Segmentation Methods: Hidden Markov Models

Assumes that different segments can be classified
into a finite set of states, where the nucleotide
data in each state follow a probability
distribution

17
Segmentation Methods: Hidden Markov Models

Let the finite number $r$ of states underlying the
observations be denoted by $S_i$
Let the states follow a Markov process with
transition matrix
System of equations for the hidden chain can be
written as

18
Segmentation Methods: Hidden Markov Models

Likewise, the system of equations for the
observations can be written as
where $y_i = (y_{i,1}, \ldots, y_{i,m})$ represents the vector of $m$
possible observed outcomes, and where each
observation is associated with one of the states

19
Segmentation Methods: Hidden Markov Models

With the system equations for the hidden chain and
the observations, the smoothing equations can be
derived
and used to plot the homogeneous regions in
the sequence
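
One common way to obtain smoothed state probabilities is the forward-backward recursion. The sketch below is a generic textbook version, not necessarily the formulation used on the slides. The two-state model (AT-rich vs GC-rich), its parameters, and the function name `smooth` are assumptions chosen for illustration.

```python
def smooth(obs, init, trans, emit):
    """Posterior P(state at position i | all observations), via forward-backward.

    init[s]: initial probability of state s
    trans[s][t]: probability of moving from state s to state t
    emit[s][sym]: probability that state s emits symbol sym
    """
    r, n = len(init), len(obs)
    # Forward pass, normalised at each position to avoid underflow.
    fwd, prev = [], None
    for i, sym in enumerate(obs):
        if i == 0:
            row = [init[s] * emit[s][sym] for s in range(r)]
        else:
            row = [emit[t][sym] * sum(prev[s] * trans[s][t] for s in range(r))
                   for t in range(r)]
        total = sum(row)
        prev = [v / total for v in row]
        fwd.append(prev)
    # Backward pass, also rescaled at each position.
    bwd = [[1.0] * r for _ in range(n)]
    for i in range(n - 2, -1, -1):
        sym = obs[i + 1]
        row = [sum(trans[s][t] * emit[t][sym] * bwd[i + 1][t] for t in range(r))
               for s in range(r)]
        total = sum(row)
        bwd[i] = [v / total for v in row]
    # Combine both passes and normalise to get the smoothed probabilities.
    post = []
    for f, b in zip(fwd, bwd):
        row = [fs * bs for fs, bs in zip(f, b)]
        total = sum(row)
        post.append([v / total for v in row])
    return post


# Hypothetical two-state model: state 0 is AT-rich, state 1 is GC-rich.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [{"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}]
posterior = smooth("AATTATGCGGCC", init, trans, emit)
for i, row in enumerate(posterior):
    print(i, [round(v, 3) for v in row])
```

Plotting the second column of `posterior` against position would show where the GC-rich state becomes the more probable one, which is the kind of homogeneous-region plot the slide refers to.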

20
Segmentation Methods: Hidden Markov Models

21
Segmentation Methods: Recursive Segmentation

Assumes that sequences exhibit hierarchical
patterns (possibility of subdomains)
It is possible to apply a filter to convert the
original four-base DNA sequence into a k-symbol
sequence
Ex: S (strong) = {C, G} and W (weak) = {A, T}
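
The strong/weak filter from the example above amounts to a simple symbol mapping. A minimal sketch, with `SW_MAP` and `to_two_symbols` as names made up for this illustration:

```python
# Two-symbol (k = 2) filter: strong bases C and G, weak bases A and T.
SW_MAP = {"C": "S", "G": "S", "A": "W", "T": "W"}


def to_two_symbols(seq, mapping=SW_MAP):
    """Convert a four-base DNA string into the reduced alphabet."""
    return "".join(mapping[base] for base in seq.upper())


filtered = to_two_symbols("ATGCCGTA")
print(filtered)  # WWSSSSWW
```

Passing a different `mapping` gives other k-symbol reductions of the same sequence.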

22
Segmentation Methods: Recursive Segmentation

A divide-and-conquer approach is applied
For a k-symbol sequence of length $N$, calculate at each
position $i$ ($0 < i < N$) the entropy $H$ of the whole
sequence, the entropy $H_l$ of the subsequence on the
left side of the partition point, and the entropy $H_r$
of the subsequence on the right side

23
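Segmentation Methods: Recursive Segmentation

Entropy equations as defined by (Shannon 1948)
$$H = -\sum_{j=1}^{k} \frac{N_j}{N} \log \frac{N_j}{N}, \quad H_l = -\sum_{j=1}^{k} \frac{N_{j,l}}{i} \log \frac{N_{j,l}}{i}, \quad H_r = -\sum_{j=1}^{k} \frac{N_{j,r}}{N-i} \log \frac{N_{j,r}}{N-i}$$
where $N_j$, $N_{j,l}$, and $N_{j,r}$ are the counts of
symbol $j$ in the whole, left, and right sequence,
respectively, and the left and right subsequences have lengths $i$ and $N - i$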

24
Segmentation Methods: Recursive Segmentation

The maximized Jensen-Shannon divergence was chosen to
measure the heterogeneity of the sequence
If the divergence is large enough, the sequence is
heterogeneous and should be segmented
The procedure is applied recursively to both the left
and the right subsequence, as long as the
calculated divergence value stays above the given
threshold (similar to constructing a binary tree)
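
The recursion described above can be sketched as follows. This is an illustrative version under stated assumptions: the divergence at position i is taken as H - (i/N) H_l - ((N-i)/N) H_r with natural logarithms, the threshold is a fixed user-chosen number, and the names `entropy`, `best_split`, and `segment` are invented for the example.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in nats) of the symbol counts."""
    return -sum(c / total * math.log(c / total)
                for c in counts.values() if c > 0)


def best_split(seq):
    """Return (position, divergence) maximising the Jensen-Shannon divergence."""
    n = len(seq)
    whole = Counter(seq)
    h = entropy(whole, n)
    left, right = Counter(), Counter(whole)
    best_i, best_d = None, 0.0
    for i in range(1, n):
        # Move one symbol from the right part to the left part.
        sym = seq[i - 1]
        left[sym] += 1
        right[sym] -= 1
        d = h - (i / n) * entropy(left, i) - ((n - i) / n) * entropy(right, n - i)
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d


def segment(seq, threshold, offset=0):
    """Recursively split seq; return cut positions in the original sequence."""
    if len(seq) < 2:
        return []
    i, d = best_split(seq)
    if i is None or d <= threshold:
        return []  # homogeneous enough: stop recursing on this branch
    return (segment(seq[:i], threshold, offset)
            + [offset + i]
            + segment(seq[i:], threshold, offset + i))


cuts = segment("W" * 20 + "S" * 20, threshold=0.3)
print(cuts)  # [20]
```

Each successful split creates two child calls, which is the binary-tree structure mentioned on the slide. A fixed threshold on the raw divergence is the simplest stopping rule and ignores sequence length.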

26
Segmentation Methods: Recursive Segmentation

An alternate approach to determining the stopping
criterion involves finding a model at the border
between underfitting models (those that do not
fit the data well) and overfitting models (those
that fit the data too well by using too many
parameters)
The Bayesian Information Criterion (BIC) was used to
balance the goodness-of-fit of the model to the
data against the number of parameters
$$\mathrm{BIC} = -2 \log L + K \log N$$
where $L$ is the likelihood of the model, $K$ the number
of free parameters, and $N$ the sample size

27
Segmentation Methods: Recursive Segmentation

Two models can be compared
Modelling the sequence as one single random
sequence
Modelling it as two random subsequences with
different base compositions
In order for recursive segmentation to continue,
the following must apply
where k is the number of different symbols in
the sequence

28
Segmentation Methods: Recursive Segmentation

Alternatively, the recursive segmentation
condition can be used to define the segmentation
strength $s$, i.e.
The recursive segmentation process can be continued
as long as $s > s_0$, where $s_0$ is predefined by the
user
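
For reference, one way to write out the BIC-based continuation condition and the segmentation strength is given below. This is a reconstruction under assumptions, not a copy of the slide formulas. It assumes natural logarithms, the divergence $\hat{D}_{JS}$ evaluated at its maximising position, and free-parameter counts $K_1 = k - 1$ for the single-sequence model and $K_2 = 2(k - 1) + 1$ for the two-subsequence model (the extra parameter being the partition point).

```latex
\begin{align}
\hat{D}_{JS} &= \max_{0 < i < N} \left[ H - \frac{i}{N} H_l - \frac{N - i}{N} H_r \right] \\
2 N \hat{D}_{JS} &> (K_2 - K_1) \log N = k \log N \\
s &= \frac{2 N \hat{D}_{JS} - k \log N}{k \log N}
\end{align}
```

Under these assumptions $2N\hat{D}_{JS}$ is twice the log-likelihood gain of the two-subsequence model, so the second line states that the split lowers the BIC. It is equivalent to $s > 0$, and choosing $s_0 > 0$ makes the stopping rule stricter.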

29
Segmentation Methods: Recursive Segmentation

30
Segmentation Methods: Recursive Segmentation

31
Segmentation Methods: Recursive Segmentation

32
Discussion

DNA sequences can be assumed to have segments
where each has a degree of homogeneity
A number of statistical methods can be used to
identify and analyse these segments
Isochores
CpG islands
Replication origin and terminus
Complex patterns in telomeres
Coding-noncoding borders
Other statistical methods for analysing DNA
segmentation do exist, each with varying degrees
of success
Bayesian approach
Walking Markov
Change-point methods

33
References

Braun J.V., Müller H.-G. Statistical methods
for DNA sequence segmentation, Statistical
Science, 13:142-162, 1998.
Duda R.O., Hart P.E., Stork D.G. Pattern
Classification, New York: John Wiley & Sons, Inc., 2001.
Churchill G.A. Stochastic models for
heterogeneous DNA sequences, Bulletin of
Mathematical Biology, 51:79-94, 1989.
Csürös M. Algorithms for finding
maximal-scoring segment sets, Proc. WABI, 2004.
Li W., et al. Applications of recursive
segmentation to the analysis of DNA sequences,
Computers & Chemistry, 26:491-510, 2002.