Title: A new sequence distance measure for phylogenetic tree construction
1A new sequence distance measure for phylogenetic
tree construction
- Bioinformatics, Vol. 19, Nov 2003
- Hasan H. Otu and Khalid Sayood
- HMS, Beth Israel Deaconess Med.
2Abstract
- Most existing approaches for phylogenetic inference use multiple sequence alignment (MSA) and assume some sort of evolutionary model.
- MSA does not work for all types of data, e.g. whole genome phylogeny, and the evolutionary model may not always be correct.
- A new distance measure is proposed, based on the relative information between sequences using Lempel-Ziv complexity.
3Outline
- Retrospect to phylogeny
- Introduction
- LZ 77
- LZ 78
- LZW
- Proposed algorithm and distance measure
- Result
- Discussion
4Types of data used in phylogenetic inference
- Character-based methods: use the aligned characters, such as DNA or protein sequences, directly during tree inference.

  Taxa        Characters
  Species A   ATGGCTATTCTTATAGTACG
  Species B   ATCGCTAGTCTTATATTACA
  Species C   TTCACTAGACCTGTGGTCCA
  Species D   TTGACCAGACCTGTGGTCCG
  Species E   TTGACCAGTTCTCTAGTTCG

- Distance-based methods: transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

              A      B      C      D      E
  Species A   ----   0.20   0.50   0.45   0.40
  Species B   0.23   ----   0.40   0.55   0.50
  Species C   0.87   0.59   ----   0.15   0.40
  Species D   0.73   1.12   0.17   ----   0.25
  Species E   0.59   0.89   0.61   0.31   ----
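
As a rough illustration of the distance-based approach, the sketch below turns the five example sequences into a pairwise matrix. It assumes the simplest possible dissimilarity, the proportion of differing sites (often called the p-distance); the slide does not say which measure produced its matrix, although values such as 0.20 for A and B appear to coincide with its upper triangle. The names `p_distance` and `distance_matrix` are illustrative only.

```python
from itertools import combinations

# The five aligned example sequences from the slide.
sequences = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(x, y):
    """Proportion of aligned sites at which x and y differ (assumed measure)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances as a dict of dicts."""
    names = list(seqs)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(seqs[a], seqs[b])
    return matrix


matrix = distance_matrix(sequences)
for name, row in matrix.items():
    print("Species", name, " ".join(f"{v:.2f}" for v in row.values()))
```

A tree-building method such as neighbor joining would then take this matrix as its only input.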
5Introduction
- Some of the approaches in the first category utilize various distance measures (Jukes and Cantor, 1969; Kimura, 1980; Barry and Hartigan, 1987; Kishino and Hasegawa, 1989; Lake, 1994), which use different models of nucleotide substitution or amino acid replacement.
- The second category can further be divided into two groups based on the optimality criterion used in tree evaluation: parsimony and maximum likelihood methods.
6(No Transcript)
7Introduction (Cont)
- All of these methods require a multiple sequence alignment and some sort of evolutionary model, so they become insufficient for phylogenies using complete genomes.
- This is due to gene rearrangements, inversion, transposition and translocation, and unequal sequence lengths.
8LZ 77
9The encoding algorithm
- Set the coding position to the beginning of the input stream.
- Find the longest match in the window for the lookahead buffer.
- Output the pair (P,C), which gives the following instruction to the decoder: "Go back P characters in the window and copy C characters to the output."
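
The steps above can be sketched as follows. This is a minimal illustration, not a reference LZ77 implementation: the slide does not say how unmatched characters are emitted, so the sketch assumes they are written as plain literals, that matches shorter than two characters are not worth a pair, and that the lookahead buffer is unbounded. The function names and the `window_size` parameter are hypothetical.

```python
def lz77_encode(data, window_size=16):
    """Encode data as a list of literals and (P, C) pairs.

    (P, C) means "go back P characters and copy C characters".
    Literal handling and the minimum match length are assumptions.
    """
    tokens = []
    pos = 0  # coding position
    while pos < len(data):
        window_start = max(0, pos - window_size)
        best_len, best_back = 0, 0
        # Find the longest match that starts inside the window.
        for cand in range(window_start, pos):
            length = 0
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_back = length, pos - cand
        if best_len >= 2:
            tokens.append((best_back, best_len))
            pos += best_len
        else:
            tokens.append(data[pos])  # assumed literal token
            pos += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text by following each (P, C) instruction."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            back, count = token
            for _ in range(count):
                out.append(out[-back])  # copy one character at a time
        else:
            out.append(token)
    return "".join(out)


text = "AACGTACCATTGACGGTCACCAA"
encoded = lz77_encode(text)
print(encoded)
print(lz77_decode(encoded) == text)
```

Copying one character at a time in the decoder lets a match overlap the text it is producing, which is how a run such as AAAA can be encoded as one literal plus one pair.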
10LZ78
11The encoding algorithm
- At the start, the dictionary and P are empty.
- C := next character in the charstream.
- Is the string PC present in the dictionary?
  - If it is, P := PC (extend P with C).
  - If not:
    - output these two objects to the codestream: the code word corresponding to P (if P is empty, output a zero), and C, in the same form as input from the charstream;
    - add the string PC to the dictionary;
    - P := empty.
- Are there more characters in the charstream?
  - If yes, return to step 2.
  - If not: if P is not empty, output the code word corresponding to P; END.
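
The encoding steps above translate fairly directly into code. The sketch below is one possible reading: code words are assumed to be 1-based dictionary indices (with 0 standing for the empty P), and a leftover P at the end of the input is emitted with an empty character. The decoder and both function names are added here for illustration.

```python
def lz78_encode(charstream):
    """Return a list of (code word, character) pairs."""
    dictionary = {}  # phrase -> code word, starting at 1
    codestream = []
    p = ""
    for c in charstream:
        if p + c in dictionary:
            p += c  # extend P with C
        else:
            # Code word of P (0 if P is empty), followed by C itself.
            codestream.append((dictionary.get(p, 0), c))
            dictionary[p + c] = len(dictionary) + 1
            p = ""
    if p:
        # Input ended in the middle of a known phrase.
        codestream.append((dictionary[p], ""))
    return codestream


def lz78_decode(codestream):
    """Invert lz78_encode by rebuilding the same dictionary."""
    phrases = [""]  # index 0 is the empty phrase
    out = []
    for code, c in codestream:
        phrase = phrases[code] + c
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_encode("ABBCBCABA")
print(pairs)  # [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A')]
print(lz78_decode(pairs))  # ABBCBCABA
```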
12LZW
13Proposed distance
- Given two sequences Q and S, consider the sequence SQ and its exhaustive history.
- The number of components needed to build Q when it is appended to S is c(SQ)-c(S); this number will be less than or equal to c(Q).
- How much smaller c(SQ)-c(S) is than c(Q) depends on the degree of similarity between S and Q.
14Example
- S = AACGTACCATTG
- HE(S) = A, AC, G, T, ACC, AT, TG
- c(S) = 7
- R = CTAGGGACTTAT
- HE(R) = C, T, A, G, GG, AC, TT, AT
- c(R) = 8
- Q = ACGGTCACCAA
- HE(Q) = A, C, G, GT, CA, CC, AA
- c(Q) = 7
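
The histories of S, R and Q can be reproduced with a short parser. The slides do not state the parsing rule, so the sketch assumes one that matches these three examples: each component is the shortest substring, starting at the current position, that does not occur anywhere in the text before that position (the final component may be cut short by the end of the sequence). The function names are illustrative.

```python
def exhaustive_history(seq):
    """Split seq into components under the assumed parsing rule."""
    components = []
    start = 0
    while start < len(seq):
        end = start + 1
        # Grow the candidate while it already occurs in the prefix
        # that precedes the current component.
        while end < len(seq) and seq[start:end] in seq[:start]:
            end += 1
        components.append(seq[start:end])
        start = end
    return components


def c(seq):
    """Number of components in the exhaustive history."""
    return len(exhaustive_history(seq))


for name, seq in [("S", "AACGTACCATTG"),
                  ("R", "CTAGGGACTTAT"),
                  ("Q", "ACGGTCACCAA")]:
    print(name, ",".join(exhaustive_history(seq)), c(seq))
# S A,AC,G,T,ACC,AT,TG 7
# R C,T,A,G,GG,AC,TT,AT 8
# Q A,C,G,GT,CA,CC,AA 7
```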
15Example (Cont)
- SQ = AACGTACCATTGACGGTCACCAA
- HE(SQ) = A, AC, G, T, ACC, AT, TG, ACGG, TC, ACCAA
- c(SQ) = 10, Q took 3 steps
- RQ = CTAGGGACTTATACGGTCACCAA
- HE(RQ) = C, T, A, G, GG, AC, TT, AT, ACG, GT, CA, CC, AA
- c(RQ) = 13, Q took 5 steps
- RQ took more steps than SQ because Q is closer to S than to R.
16Proposed distance (Cont)
- Formulate the number of steps it takes to generate a sequence Q from a sequence S as c(SQ)-c(S).
- If S is closer to Q than R is, then we would expect c(SQ)-c(S) < c(RQ)-c(R).
- 10-7 < 13-8, TRUE!
17Distance Measure
- Given sequences S and Q, define the distance function as
- dm1
- Normalized form (eliminates the effect of length)
- dm2
18Distance Measure (Cont)
- The idea of building sequence Q using S also leads to the sum distance, d1.
- Similarly, the normalized version of d1 is as follows
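
The formulas on these two slides did not survive transcription. The sketch below is therefore an assumption based on the surrounding text rather than a transcription: it takes the basic distance to be the larger of the two building costs, c(SQ)-c(S) and c(QS)-c(Q), normalizes it by the larger of c(S) and c(Q), and takes the sum distance to be the two costs added together, normalized by c(SQ). The parsing rule inside `c` is also assumed (shortest substring not seen before the current component), and the names `steps`, `d`, `d_star`, `d1` and `d1_star` are hypothetical labels.

```python
def c(seq):
    """Number of exhaustive-history components (parsing rule assumed)."""
    count, start = 0, 0
    while start < len(seq):
        end = start + 1
        while end < len(seq) and seq[start:end] in seq[:start]:
            end += 1
        count += 1
        start = end
    return count


def steps(s, q):
    """Components needed to build q once s is known: c(sq) - c(s)."""
    return c(s + q) - c(s)


def d(s, q):
    """Assumed basic distance: the larger of the two building costs."""
    return max(steps(s, q), steps(q, s))


def d_star(s, q):
    """Assumed normalized form of d, removing the effect of length."""
    return d(s, q) / max(c(s), c(q))


def d1(s, q):
    """Assumed sum distance: both building costs added together."""
    return steps(s, q) + steps(q, s)


def d1_star(s, q):
    """Assumed normalized form of d1."""
    return d1(s, q) / c(s + q)


S, R, Q = "AACGTACCATTG", "CTAGGGACTTAT", "ACGGTCACCAA"
for name, other in [("S", S), ("R", R)]:
    print(name, "Q", steps(other, Q), d(other, Q),
          round(d_star(other, Q), 3), d1(other, Q),
          round(d1_star(other, Q), 3))
```

Both `d` and `d1` are symmetric by construction, since each combines the cost of building Q from S with the cost of building S from Q.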
19Satisfaction conditions
- D(S,Q) ≥ 0, where equality is satisfied iff S = Q (identity).
- D(S,Q) = D(Q,S) (symmetry).
- D(S,Q) ≤ D(S,T) + D(T,Q) (triangle inequality).
- D(Q,R) + D(S,T) ≤ max{D(Q,S) + D(R,T), D(Q,T) + D(S,R)} (additivity).
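
These conditions can be checked numerically on any finite distance matrix. The sketch below is a generic illustration, independent of how the distances were computed: the matrix is assumed to be a dict of dicts, the additivity test uses the equivalent form of the four-point condition (of the three pairings of any four taxa, the two largest sums must be equal), and the tolerance `tol` and all function names are hypothetical.

```python
from itertools import combinations, permutations


def is_symmetric(D, tol=1e-9):
    return all(abs(D[a][b] - D[b][a]) <= tol for a, b in combinations(D, 2))


def satisfies_triangle(D, tol=1e-9):
    return all(D[a][b] <= D[a][t] + D[t][b] + tol
               for a, b, t in permutations(D, 3))


def satisfies_four_point(D, tol=1e-9):
    """Additivity: for every four taxa the two largest pair sums agree."""
    for q, r, s, t in combinations(D, 4):
        sums = sorted([D[q][r] + D[s][t],
                       D[q][s] + D[r][t],
                       D[q][t] + D[r][s]])
        if sums[2] - sums[1] > tol:
            return False
    return True


def from_pairs(pairs):
    """Build a symmetric dict-of-dicts matrix from {(x, y): distance}."""
    names = sorted({n for pair in pairs for n in pair})
    D = {a: {b: 0.0 for b in names} for a in names}
    for (a, b), value in pairs.items():
        D[a][b] = D[b][a] = value
    return D


# Distances read off a small made-up tree: a and b hang off one inner
# node (branch lengths 1 and 2), c and d off another (3 and 4), and the
# inner branch has length 5.
tree_like = from_pairs({("a", "b"): 3, ("c", "d"): 7, ("a", "c"): 9,
                        ("a", "d"): 10, ("b", "c"): 10, ("b", "d"): 11})
print(is_symmetric(tree_like), satisfies_triangle(tree_like),
      satisfies_four_point(tree_like))
```

Changing a single entry, for example setting the distance between a and d to 12, breaks additivity while leaving symmetry and the triangle inequality intact.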
20Results and Discussion
- Neighbor joining is run with the PHYLIP package; the parsimony and maximum likelihood methods use ClustalW.
- A 1000bp sequence was evolved into two sequences, A and B, using point mutations(10) and segment-based modifications(10).
- Evolved A into A1 and A2, B into B1 and B2, A to A and B to B.
21Results (Cont)
22(No Transcript)
23(Figure; only the numeric labels 1 to 20 were transcribed)
24(Figure; only the numeric labels 2, 1, 3, 19 and 20 were transcribed)
25Conclusion
- Proposed a new distance measure and its variations.
- The proposed measure does not require multiple sequence alignment and is fully automatic.
- It allows comparisons at the whole genome level.
- Unequal sequence lengths and differences in relative positioning are not a problem.
- The distance measures do not use any evolutionary model.