A new sequence distance measure for phylogenetic tree construction presentation

1
A new sequence distance measure for phylogenetic
tree construction

Bioinformatics, Vol. 19, Nov 2003
Hasan H. Otu and Khalid Sayood
HMS, Beth Israel Deaconess Med.

2
Abstract

Most existing approaches for phylogenetic
inference use a multiple sequence alignment (MSA)
and assume some sort of evolutionary model.
MSA does not work for all types of data, e.g.
whole genome phylogeny, and the evolutionary model
may not always be correct.
The paper proposes a new distance measure based on
the relative information between sequences,
computed using Lempel-Ziv complexity.

3
Outline

Retrospect to phylogeny
Introduction
LZ 77
LZ 78
LZW
Proposed algorithm and distance measure
Result
Discussion

4
Types of data used in phylogenetic inference

Character-based methods
Use the aligned characters, such as DNA or protein
sequences, directly during tree inference.

Taxa       Characters
Species A  ATGGCTATTCTTATAGTACG
Species B  ATCGCTAGTCTTATATTACA
Species C  TTCACTAGACCTGTGGTCCA
Species D  TTGACCAGACCTGTGGTCCG
Species E  TTGACCAGTTCTCTAGTTCG

Distance-based methods
Transform the sequence data into pairwise
distances (dissimilarities), and then use the
matrix during tree building.

           A     B     C     D     E
Species A  ----  0.20  0.50  0.45  0.40
Species B  0.23  ----  0.40  0.55  0.50
Species C  0.87  0.59  ----  0.15  0.40
Species D  0.73  1.12  0.17  ----  0.25
Species E  0.59  0.89  0.61  0.31  ----
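
As a minimal sketch of the distance-based route, the snippet below turns the five aligned sequences above into a pairwise matrix using the uncorrected p-distance (the fraction of sites that differ). The choice of p-distance and the names `p_distance` and `distance_matrix` are assumptions for illustration; the slide does not say how its matrix was computed, although its A-B (0.20) and C-D (0.15) entries agree with this measure.

```python
from itertools import combinations

# Aligned sequences copied from the slide.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(x, y):
    """Fraction of aligned sites at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def distance_matrix(alignment):
    """Symmetric dict-of-dicts of pairwise p-distances."""
    names = sorted(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


MATRIX = distance_matrix(ALIGNMENT)
```

A tree-building method such as neighbor joining would then take `MATRIX` as its input.
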
5
Introduction

Some of the approaches in the first category
(distance-based methods) utilize various distance
measures (Jukes and Cantor, 1969; Kimura, 1980;
Barry and Hartigan, 1987; Kishino and Hasegawa,
1989; Lake, 1994), which use different models of
nucleotide substitution or amino acid replacement.
The second category (character-based methods) can
be further divided into two groups based on the
optimality criterion used in tree evaluation:
parsimony and maximum likelihood methods.

6
(No Transcript)
7
Introduction (Cont)

All of these methods require a multiple sequence
alignment and some sort of evolutionary model, so
they become insufficient for phylogenies using
complete genomes.
The reasons include gene rearrangements,
inversions, transpositions and translocations,
and unequal sequence lengths.

8
LZ 77
9
The encoding algorithm

Set the coding position to the beginning of the
input stream.
Find the longest match in the window for the
lookahead buffer.
Output the pair (P,C), which gives the following
instruction to the decoder: "Go back P characters
in the window and copy C characters to the
output".
Move the coding position past the match and
repeat until the input stream is exhausted.
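
The steps above can be sketched as follows. This is an illustrative greedy encoder, not a reference implementation: the window size, the minimum match length of 2, and the choice to emit unmatched characters as plain literals are all assumptions, since the slide only describes the (P,C) pair.

```python
def lz77_encode(data, window=16):
    """Greedy LZ77-style encoder.

    Emits (P, C) pairs meaning "go back P characters and copy C characters",
    or a single literal character when no useful match exists (assumption).
    """
    out = []
    pos = 0
    while pos < len(data):
        start = max(0, pos - window)
        best_len, best_off = 0, 0
        for cand in range(start, pos):
            length = 0
            # The match may run past pos (overlapping copy).
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if best_len >= 2:  # assumed threshold for emitting a pair
            out.append((best_off, best_len))
            pos += best_len
        else:
            out.append(data[pos])
            pos += 1
    return out


def lz77_decode(tokens):
    """Inverse of lz77_encode: replay literals and (P, C) copy instructions."""
    buf = []
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            for _ in range(length):
                buf.append(buf[-offset])
        else:
            buf.append(tok)
    return "".join(buf)
```

Decoding the encoder's output should give back the input, which is a quick way to check the sketch on any string.
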

10
LZ78
11
The encoding algorithm

1. At the start, the dictionary and P are empty.
2. C := next character in the charstream.
3. Is the string PC present in the dictionary?
   a. If it is, P := PC (extend P with C).
   b. If not:
      - output these two objects to the codestream:
        the code word corresponding to P (if P is
        empty, output a zero), and C, in the same
        form as input from the charstream;
      - add the string PC to the dictionary;
      - P := empty.
4. Are there more characters in the charstream?
   - If yes, return to step 2.
   - If not: if P is not empty, output the code
     word corresponding to P. END.
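
A minimal sketch of this procedure is given below. The output representation is an assumption: each token is a `(code word, character)` tuple, code words are assigned from 1 upward in order of insertion, and a trailing non-empty P is emitted with `None` in place of the character.

```python
def lz78_encode(charstream):
    """LZ78 encoder returning a list of (code word, character) tuples."""
    dictionary = {}  # phrase -> code word; 0 is reserved for the empty P
    codestream = []
    p = ""
    for c in charstream:
        if p + c in dictionary:
            p += c  # extend P with C
        else:
            codestream.append((dictionary.get(p, 0), c))
            dictionary[p + c] = len(dictionary) + 1
            p = ""
    if p:
        # Input ended while P still matched a dictionary entry.
        codestream.append((dictionary[p], None))
    return codestream
```

For the input `"AAB"` this produces `[(0, "A"), (1, "B")]`: the first `A` is new, and `AB` is encoded as the existing phrase `A` (code word 1) followed by `B`.
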

12
LZW
13
Proposed distance

Given two sequences Q and S, consider the sequence
SQ and its exhaustive history.
The number of components needed to build Q
when appended to S is c(SQ) - c(S); this number
will be less than or equal to c(Q).
How much smaller c(SQ) - c(S) is than c(Q) depends
on the degree of similarity of S and Q.

14
Example

S = AACGTACCATTG
HE(S) = A, AC, G, T, ACC, AT, TG
c(S) = 7
R = CTAGGGACTTAT
HE(R) = C, T, A, G, GG, AC, TT, AT
c(R) = 8
Q = ACGGTCACCAA
HE(Q) = A, C, G, GT, CA, CC, AA
c(Q) = 7
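
The three histories above can be reproduced with a short parser. The sketch below assumes a specific parsing rule, chosen because it matches the slide's components: each component is the shortest prefix of the remaining text that does not occur anywhere in the part already parsed. The function names are illustrative.

```python
def exhaustive_history(seq):
    """Split seq into components, left to right.

    Assumed rule: a component is the shortest prefix of the unparsed text
    that is not a substring of the already parsed text. The final component
    may be cut short by the end of the sequence.
    """
    components = []
    start = 0
    while start < len(seq):
        end = start + 1
        # Keep extending while the candidate still occurs in the parsed part.
        while end < len(seq) and seq[start:end] in seq[:start]:
            end += 1
        components.append(seq[start:end])
        start = end
    return components


def lz_complexity(seq):
    """c(seq): the number of components in the exhaustive history."""
    return len(exhaustive_history(seq))
```

With this rule, `exhaustive_history("AACGTACCATTG")` yields the seven components listed for S, and `lz_complexity` returns 8 for R and 7 for Q.
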

15
Example (Cont)

SQ = AACGTACCATTGACGGTCACCAA
HE(SQ) = A, AC, G, T, ACC, AT, TG, ACGG, TC, ACCAA
c(SQ) = 10, Q took 3 steps
RQ = CTAGGGACTTATACGGTCACCAA
HE(RQ) = C, T, A, G, GG, AC, TT, AT, ACG, GT, CA, CC, AA
c(RQ) = 13, Q took 5 steps
RQ took more steps than SQ because Q is closer to
S than to R.

16
Proposed distance (Cont)

The number of steps it takes to generate a
sequence Q from a sequence S is formulated as
c(SQ) - c(S).
If S is closer to Q than R is, then we would
expect c(SQ) - c(S) < c(RQ) - c(R).
10 - 7 < 13 - 8, TRUE!
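
The comparison can be checked mechanically. The sketch below assumes the same parsing rule as the example histories (a component is the shortest prefix of the unparsed text that does not occur in the parsed part); `steps_to_build` is a hypothetical helper name for the quantity c(SQ) - c(S).

```python
def lz_complexity(seq):
    """Number of components in the exhaustive history (assumed parsing rule)."""
    count = 0
    start = 0
    while start < len(seq):
        end = start + 1
        while end < len(seq) and seq[start:end] in seq[:start]:
            end += 1
        count += 1
        start = end
    return count


def steps_to_build(q, s):
    """c(SQ) - c(S): components needed to build q once s is already known."""
    return lz_complexity(s + q) - lz_complexity(s)


S = "AACGTACCATTG"
R = "CTAGGGACTTAT"
Q = "ACGGTCACCAA"

# Q should be cheaper to build from S than from R if S is the closer sequence.
Q_IS_CLOSER_TO_S = steps_to_build(Q, S) < steps_to_build(Q, R)
```
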

17
Distance Measure

Given sequences S and Q, the distance function is
defined as
dm1
Normalized form (eliminates the effect of length)
dm2

18
Distance Measure (Cont)

The same idea of building sequence Q using S
leads to the sum distance d1.
Similarly, the normalized version of d1 is as follows
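
As a rough sketch only: the code below assumes the sum distance adds the two build costs, (c(SQ) - c(S)) + (c(QS) - c(Q)), and assumes one possible length normalization, dividing by the average of c(SQ) and c(QS). Both forms, the function names, and the parsing rule inside `lz_complexity` are assumptions made for illustration, not formulas taken from the slides.

```python
def lz_complexity(seq):
    """Number of components in the exhaustive history (assumed parsing rule)."""
    count = 0
    start = 0
    while start < len(seq):
        end = start + 1
        while end < len(seq) and seq[start:end] in seq[:start]:
            end += 1
        count += 1
        start = end
    return count


def sum_distance(s, q):
    """Assumed form: cost of building q from s plus cost of building s from q."""
    build_q = lz_complexity(s + q) - lz_complexity(s)
    build_s = lz_complexity(q + s) - lz_complexity(q)
    return build_q + build_s


def normalized_sum_distance(s, q):
    """Assumed normalization: divide by the mean of c(SQ) and c(QS)."""
    mean_joint = 0.5 * (lz_complexity(s + q) + lz_complexity(q + s))
    return sum_distance(s, q) / mean_joint
```

Both functions are symmetric in their arguments by construction, which is one of the properties a distance measure is expected to satisfy.
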

19
Satisfaction conditions

D(S,Q) ≥ 0, where the equality is satisfied iff
S = Q (identity).
D(S,Q) = D(Q,S) (symmetry).
D(S,Q) ≤ D(S,T) + D(T,Q) (triangle inequality).
D(Q,R) + D(S,T) ≤
max{D(Q,S) + D(R,T), D(Q,T) + D(S,R)}
(additivity).
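
These conditions can be tested on any finite distance matrix. The sketch below assumes the matrix is a dict-of-dicts indexed by labels and that distinct labels stand for distinct sequences; the function names and the tolerance `eps` are illustrative choices.

```python
from itertools import permutations


def is_metric(d, points, eps=1e-12):
    """Check identity, symmetry and the triangle inequality on all points."""
    for x in points:
        if abs(d[x][x]) > eps:
            return False
    for x, y in permutations(points, 2):
        # Distinct points must be at positive, symmetric distance.
        if d[x][y] <= 0 or abs(d[x][y] - d[y][x]) > eps:
            return False
    for x, y, z in permutations(points, 3):
        if d[x][y] > d[x][z] + d[z][y] + eps:
            return False
    return True


def is_additive(d, points, eps=1e-12):
    """Four-point condition:
    D(q,r) + D(s,t) <= max(D(q,s) + D(r,t), D(q,t) + D(s,r)).
    """
    for q, r, s, t in permutations(points, 4):
        lhs = d[q][r] + d[s][t]
        rhs = max(d[q][s] + d[r][t], d[q][t] + d[s][r])
        if lhs > rhs + eps:
            return False
    return True
```

A matrix that passes `is_additive` can be represented exactly by a tree with non-negative branch lengths, which is why additivity matters for tree construction.
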

20
Results and Discussion

Trees are built by neighbor joining with the
PHYLIP package; the parsimony and maximum
likelihood methods use ClustalW alignments.
A 1000bp sequence was evolved into two sequences
A and B using point mutations(10) and
segment-based modifications(10).
Then A was evolved into A1 and A2, B into B1 and
B2, A to A and B to B.
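
A toy version of such a simulation is sketched below. Everything specific in it is an assumption: the mutation rate of 0.10, the segment length of 100, the use of a single segment move as the "segment-based modification", and the function names. The transcript gives only "(10)" without units, so these values are placeholders.

```python
import random


def point_mutate(seq, rate, rng):
    """Replace each base, with probability `rate`, by a different base."""
    bases = "ACGT"
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


def move_segment(seq, seg_len, rng):
    """Cut one segment of length seg_len and reinsert it at a random position."""
    start = rng.randrange(len(seq) - seg_len + 1)
    segment = seq[start:start + seg_len]
    rest = seq[:start] + seq[start + seg_len:]
    target = rng.randrange(len(rest) + 1)
    return rest[:target] + segment + rest[target:]


rng = random.Random(0)  # fixed seed so the run is repeatable
root = "".join(rng.choice("ACGT") for _ in range(1000))

# Two descendants of the same root, each with substitutions and one moved segment.
seq_a = move_segment(point_mutate(root, 0.10, rng), 100, rng)
seq_b = move_segment(point_mutate(root, 0.10, rng), 100, rng)
```

Segment moves keep the base composition but break positional correspondence, which is the situation where alignment-based methods struggle.
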

21
Results (Cont)
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
Conclusion

A new distance measure and its variations are
proposed.
The proposed method does not require multiple
sequence alignment and is fully automatic.
It performs comparisons at the whole genome level.
Unequal sequence lengths or relatively different
positioning are not a problem.
The distance measures do not use any evolutionary
model.
