A new sequence distance measure for phylogenetic tree construction


1
A new sequence distance measure for phylogenetic
tree construction
  • Bioinformatics, Vol. 19, Nov. 2003
  • Hasan H. Otu and Khalid Sayood
  • HMS, Beth Israel Deaconess Med.

2
Abstract
  • Most existing approaches to phylogenetic
    inference use a multiple sequence alignment (MSA)
    and assume some sort of evolutionary model.
  • MSA does not work for all types of data (e.g.
    whole-genome phylogeny), and the evolutionary
    model may not always be correct.
  • The paper proposes a new distance measure based
    on the relative information between sequences,
    computed using Lempel-Ziv complexity.

3
Outline
  • Review of phylogeny
  • Introduction
  • LZ 77
  • LZ 78
  • LZW
  • Proposed algorithm and distance measure
  • Results
  • Discussion

4
Types of data used in phylogenetic inference

Character-based methods: use the aligned
characters, such as DNA or protein sequences,
directly during tree inference.

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG

Distance-based methods: transform the sequence
data into pairwise distances (dissimilarities),
and then use the matrix during tree building.

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----
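
To make the distance-based idea concrete, here is a minimal Python sketch that turns the five aligned sequences from this slide into pairwise distances. It assumes the simplest possible measure, the uncorrected proportion of differing sites (often called the p-distance); the slide does not say which measures produced its matrix, so `p_distance` and the `taxa` dictionary are illustrative names only.

```python
from itertools import combinations

# Aligned sequences copied from the slide.
taxa = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# One entry per unordered pair of taxa (assumed symmetric measure).
matrix = {
    (m, n): p_distance(taxa[m], taxa[n])
    for m, n in combinations(taxa, 2)
}

for (m, n), dist in matrix.items():
    print(f"{m}-{n}: {dist:.2f}")
```

For Species A and B the sketch gives 0.20, and for A and C it gives 0.50. A measure like this needs aligned sequences of equal length, which is exactly the requirement the proposed method sets out to avoid.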
5
Introduction
  • Some of the approaches in the first category
    (distance-based methods) utilize various distance
    measures (Jukes and Cantor, 1969; Kimura, 1980;
    Barry and Hartigan, 1987; Kishino and Hasegawa,
    1989; Lake, 1994), which use different models of
    nucleotide substitution or amino acid
    replacement.
  • The second category (character-based methods) can
    be further divided into two groups based on the
    optimality criterion used in tree evaluation:
    parsimony and maximum likelihood methods.

7
Introduction (Cont)
  • All of these methods require a multiple sequence
    alignment and some sort of evolutionary model, so
    they become insufficient for phylogenies based on
    complete genomes.
  • Alignment breaks down at that scale because of
    gene rearrangements, inversions, transpositions
    and translocations, and because the sequences
    have unequal lengths.

8
LZ 77
9
The encoding algorithm
  • Set the coding position to the beginning of the
    input stream.
  • Find the longest match in the window for the
    lookahead buffer.
  • Output the pair (P,C), which gives the following
    instruction to the decoder: "Go back P characters
    in the window and copy C characters to the
    output."
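
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference LZ77 implementation: the slide does not say what is emitted when no match is found, so the sketch assumes a bare literal character is output in that case, and the `window` and `min_match` parameters are hypothetical names chosen for the example.

```python
def lz77_encode(data, window=16, min_match=2):
    """Greedy encoder emitting (P, C) pairs or literal characters."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        # Search the window for the longest match with the lookahead.
        for cand in range(max(0, pos - window), pos):
            length = 0
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= min_match:
            # "Go back P characters and copy C characters."
            tokens.append((best_dist, best_len))
            pos += best_len
        else:
            # Assumption: unmatched characters are sent as literals.
            tokens.append(data[pos])
            pos += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text by replaying literals and (P, C) copies."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, count = tok
            for _ in range(count):
                # Copy one character at a time so overlapping
                # copies (count > dist) work correctly.
                out.append(out[-dist])
        else:
            out.append(tok)
    return "".join(out)


print(lz77_encode("abcabcabc"))  # ['a', 'b', 'c', (3, 6)]
```

The copy in the last token overlaps the text it is producing (6 characters copied from only 3 back), which is why the decoder copies character by character.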

10
LZ78
11
The encoding algorithm
  • 1. At the start, the dictionary and P are empty.
  • 2. C := next character in the charstream.
  • 3. Is the string PC present in the dictionary?
      • If it is, P := PC (extend P with C).
      • If not:
          • output these two objects to the
            codestream: the code word corresponding to
            P (if P is empty, output a zero), and C,
            in the same form as input from the
            charstream;
          • add the string PC to the dictionary;
          • P := empty.
  • 4. Are there more characters in the charstream?
      • If yes, return to step 2.
      • If not: if P is not empty, output the code
        word corresponding to P.
  • 5. END.
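
A minimal Python sketch of the LZ78 encoding loop described above, with a matching decoder to show the output is reversible. Two representation choices are assumptions made for the example: code words are integers starting at 1 (0 stands for the empty P), and a leftover P at the end of the stream is emitted as a pair with an empty character.

```python
def lz78_encode(text):
    """Return a list of (code word of P, character C) pairs."""
    dictionary = {}          # phrase -> code word (1-based)
    output = []
    p = ""
    for c in text:
        if p + c in dictionary:
            p += c           # extend P with C
        else:
            # Code word 0 means "P is empty".
            output.append((dictionary.get(p, 0), c))
            dictionary[p + c] = len(dictionary) + 1
            p = ""
    if p:
        # Leftover P: emit its code word with no character (assumption).
        output.append((dictionary[p], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the text from the (code word, character) pairs."""
    phrases = [""]           # index 0 is the empty phrase
    out = []
    for code, c in pairs:
        phrase = phrases[code] + c
        out.append(phrase)
        if c:
            phrases.append(phrase)
    return "".join(out)


print(lz78_encode("ABABAB"))
# [(0, 'A'), (0, 'B'), (1, 'B'), (3, '')]
```

Unlike the windowed scheme, the dictionary here grows with every emitted pair, so each new phrase is an earlier phrase plus one character.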

12
LZW
13
Proposed distance
  • Given two sequences Q and S, consider the
    sequence SQ and its exhaustive history.
  • The number of components needed to build Q when
    it is appended to S is c(SQ) - c(S); this number
    is less than or equal to c(Q).
  • How much smaller c(SQ) - c(S) is than c(Q)
    depends on the degree of similarity between S
    and Q.

14
Example
  • S = AACGTACCATTG
  • H_E(S) = A·AC·G·T·ACC·AT·TG
  • c(S) = 7
  • R = CTAGGGACTTAT
  • H_E(R) = C·T·A·G·GG·AC·TT·AT
  • c(R) = 8
  • Q = ACGGTCACCAA
  • H_E(Q) = A·C·G·GT·CA·CC·AA
  • c(Q) = 7

15
Example (Cont)
  • SQ = AACGTACCATTGACGGTCACCAA
  • H_E(SQ) = A·AC·G·T·ACC·AT·TG·ACGG·TC·ACCAA
  • c(SQ) = 10, so Q took 3 steps
  • RQ = CTAGGGACTTATACGGTCACCAA
  • H_E(RQ) = C·T·A·G·GG·AC·TT·AT·ACG·GT·CA·CC·AA
  • c(RQ) = 13, so Q took 5 steps
  • RQ took more steps than SQ because Q is closer
    to S than to R.
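
The parsing used in this example can be sketched in Python. The rule assumed here is that each component is the shortest prefix of the remaining text that does not occur in the part already parsed; it was chosen because it reproduces the histories listed for S, R and Q, and the formal definition in the paper may be stated differently. The function name `exhaustive_history` is illustrative.

```python
def exhaustive_history(seq):
    """Split seq into components under the parsing rule assumed above."""
    components = []
    start, n = 0, len(seq)
    while start < n:
        end = start + 1
        # Grow the candidate while it can be copied from the parsed prefix.
        while end < n and seq[start:end] in seq[:start]:
            end += 1
        components.append(seq[start:end])
        start = end
    return components


def c(seq):
    """Number of components in the history of seq."""
    return len(exhaustive_history(seq))


S = "AACGTACCATTG"
R = "CTAGGGACTTAT"
Q = "ACGGTCACCAA"

print(exhaustive_history(S))  # ['A', 'AC', 'G', 'T', 'ACC', 'AT', 'TG']
print(c(S), c(R), c(Q))       # 7 8 7

# Steps needed to build Q after S, and after R.
print(c(S + Q) - c(S), c(R + Q) - c(R))
```

The last line is the quantity the proposed distance is built on: the fewer new components Q needs after a given sequence, the more of Q can be copied from that sequence.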

16
Proposed distance (Cont)
  • Formulate the number of steps it takes to
    generate a sequence Q from a sequence S as
    c(SQ) - c(S).
  • If S is closer to Q than R is, then we would
    expect c(SQ) - c(S) < c(RQ) - c(R).
  • 10 - 7 < 13 - 8, TRUE!

17
Distance Measure
  • Given sequences S and Q, define the distance
    function as


  • dm1
  • Normalized form (eliminates the effect of
    sequence length)
  • dm2

18
Distance Measure (Cont)
  • The same idea of building sequence Q from S
    gives the sum distance d1.
  • Similarly, a normalized version of d1 is defined
    as follows
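
A minimal Python sketch of a sum-style distance. Both definitions are assumptions made for illustration: d1 is taken to be the sum of the two directed step counts, c(SQ) - c(S) and c(QS) - c(Q), and the normalized version divides by the mean of c(SQ) and c(QS), which is one plausible normalizer among several. The component count `c` uses the same assumed parsing rule as the worked example.

```python
def c(seq):
    """Number of components under the assumed parsing rule."""
    count, start, n = 0, 0, len(seq)
    while start < n:
        end = start + 1
        while end < n and seq[start:end] in seq[:start]:
            end += 1
        count += 1
        start = end
    return count


def d1(s, q):
    """Assumed form: sum of the two directed step counts."""
    return (c(s + q) - c(s)) + (c(q + s) - c(q))


def d1_normalized(s, q):
    """Assumed normalizer: mean of c(SQ) and c(QS)."""
    return d1(s, q) / ((c(s + q) + c(q + s)) / 2)


S = "AACGTACCATTG"
Q = "ACGGTCACCAA"

print(d1(S, Q))
print(round(d1_normalized(S, Q), 3))
```

Because the sum already counts both directions, d1 is symmetric by construction; the symmetric normalizer keeps that property.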

19
Satisfaction conditions
  • D(S,Q) ≥ 0, where equality holds iff S = Q
    (identity).
  • D(S,Q) = D(Q,S) (symmetry).
  • D(S,Q) ≤ D(S,T) + D(T,Q) (triangle inequality).
  • D(Q,R) + D(S,T) ≤
    max{D(Q,S) + D(R,T), D(Q,T) + D(S,R)}
    (additivity).
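
These four conditions can be checked mechanically on any distance matrix. The Python sketch below is illustrative only: `check_conditions` is a hypothetical helper, and the 4x4 matrix is a toy example built from an invented tree (leaf branches 1, 2, 3, 4 and an internal branch of 5), not data from the paper.

```python
from itertools import permutations


def check_conditions(D, tol=1e-9):
    """Check the four conditions on a square matrix D[i][j] = D(i, j)."""
    idx = range(len(D))
    identity = all(
        D[i][j] >= -tol and (abs(D[i][j]) <= tol) == (i == j)
        for i in idx for j in idx
    )
    symmetry = all(
        abs(D[i][j] - D[j][i]) <= tol for i in idx for j in idx
    )
    triangle = all(
        D[s][q] <= D[s][t] + D[t][q] + tol
        for s, q, t in permutations(idx, 3)
    )
    additivity = all(
        D[q][r] + D[s][t]
        <= max(D[q][s] + D[r][t], D[q][t] + D[s][r]) + tol
        for q, r, s, t in permutations(idx, 4)
    )
    return {
        "identity": identity,
        "symmetry": symmetry,
        "triangle": triangle,
        "additivity": additivity,
    }


# Path lengths between the four leaves of the toy tree.
tree_like = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(check_conditions(tree_like))

# Changing one pair breaks additivity but not the triangle inequality.
perturbed = [row[:] for row in tree_like]
perturbed[0][2] = perturbed[2][0] = 12
print(check_conditions(perturbed))
```

Additivity is the strictest of the four: distances that satisfy it can be fitted exactly by branch lengths on a tree, which is why it matters for tree construction.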

20
Results and Discussion
  • Trees are built with neighbor joining from the
    PHYLIP package; the parsimony and maximum
    likelihood methods use ClustalW alignments.
  • A 1000 bp sequence was evolved into two
    sequences, A and B, using point mutations (10)
    and segment-based modifications (10).
  • A was then evolved into A1 and A2, and B into B1
    and B2, A to A and B to B.

21
Results (Cont)
25
Conclusion
  • The paper proposes a new distance measure and
    its variations.
  • The proposed method does not require multiple
    sequence alignment and is fully automatic.
  • It can perform comparisons at the whole-genome
    level.
  • Unequal sequence lengths and differences in the
    relative positioning of segments are not a
    problem.
  • The distance measures do not use any
    evolutionary model.