Repeats!

Local multiple alignment of interspersed DNA repeats

Provided by: ToddT152
Learn more at: https://www.cbcb.umd.edu

1
Repeats!
2
Introduction
  • A repeat family is a collection of repeats which
    appear multiple times in a genome.
  • Our objective is to identify all families of
    interspersed repeats within a single genome

3
Challenges when identifying repeat families
  • Challenges
  • Regions containing repeat occurrences are not
    known a priori
  • Repeat boundaries are not known a priori
  • Many repeat occurrences appear as partial copies

4
Why are repeats important?
  • Repeats have been implicated in
  • Genome rearrangements (Kazazian, 2004; Achaz et
    al., 2003)
  • Accelerated loss of gene order (Rocha et al.,
    2003)
  • Creation of novel biological functions (Lynch et
    al., 2002)
  • Increased rate of evolution under stress (Capy et
    al., 2000)

5
Identifying repeats de novo
  • Given a new genome about which we know nothing,
    we can
  • Use a database of known repeats
    (RepeatMasker/RepBase)
  • Novel repeat elements may not be in the database
  • Repetitive gene families are never in the
    database
  • Identify repeats de novo using sequence analysis

6
Existing methods for detection of repeat families
  • Nearly all existing algorithms for de novo
    identification of repeat families rely on a set
    of pairwise similarities
  • REPuter (Kurtz et al., 2000)
  • RepeatFinder (Volfovsky et al., 2001)
  • RECON (Bao and Eddy, 2002)
  • RepeatGluer (Pevzner et al., 2004)
  • PILER (Edgar and Myers, 2005)
  • RepeatScout (Price et al., 2005)

7
Mutational forces at play
  • Over time, indels and substitutions will affect
    copies of repeat families
  • AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTT
    TAGCCTATT
  • AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTT
    TAGCCTATT
  • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTT
    AGCCTATT
  • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTT
    AGCDTATT
  • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTT
    AGCTATT
  • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTT
    AGCTATT
  • Alignments (with gaps) are required to attempt to
    reconstruct the true repeat boundaries

8
de novo repeat detection
  • One approach: self-search with a pairwise
    local-alignment tool such as BLAST
  • Number of pairwise alignments grows as O(r^2) in
    the copy number r of the repeat
  • Inherent difficulty defining repeat boundaries
    among collections of pairwise alignments

9
Alternative methods?
  • Local multiple alignment

A single local multiple alignment uses O(N) space
for a genome of length N
10
Local multiple alignment
  • Local multiple alignment has the inherent
    potential to avoid pitfalls associated with
    pairwise alignment.
  • But multiple alignment under the sum-of-pairs (SP)
    objective function remains intractable
  • Progressive alignment heuristics offer excellent
    speed and accuracy (e.g. MUSCLE).
  • So why not directly construct a multiple
    alignment?
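For readers unfamiliar with the SP objective, the following minimal sketch scores a given multiple alignment by summing the scores of all pairs of rows. The match, mismatch and gap values are hypothetical placeholders, not parameters from the presentation. Scoring a given alignment is easy; finding the alignment that maximizes this score is the intractable part.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: add up the column scores of every pair of rows.

    The scoring values are illustrative assumptions.
    """
    assert len({len(row) for row in alignment}) == 1, "rows must have equal length"
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        for a, b in zip(row_a, row_b):
            if a == "-" and b == "-":
                continue  # a gap aligned to a gap contributes nothing
            elif a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


print(sp_score(["ACGT", "AC-T", "AGGT"]))
```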

11
(No Transcript)
12
Steps 1-3: Chaining seeds from the input sequence
  • The method incorporated three novel ideas:
  • (1) palindromic spaced seed patterns to match
    both DNA strands simultaneously
  • (2) seed extension (chaining) in order of
    decreasing multiplicity, and
  • (3) procrastination when low multiplicity matches
    are encountered.
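Ideas (1) and (2) can be sketched as follows. Because a palindromic pattern reads the same in both directions, masking a window and masking its reverse complement select the same positions, so a single key can represent both strands. The pattern 1101011, the function names and the toy genome are assumptions made for illustration; this is not the authors' implementation, and procrastination (idea 3) is not modeled.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def masked(window: str, pattern: str) -> str:
    """Keep bases at '1' positions of the pattern; replace the rest with '*'."""
    return "".join(b if p == "1" else "*" for b, p in zip(window, pattern))


def seed_key(window: str, pattern: str) -> str:
    """Strand-independent key for a window under a palindromic pattern."""
    assert pattern == pattern[::-1], "pattern must be palindromic"
    assert len(window) == len(pattern)
    return min(masked(window, pattern),
               masked(reverse_complement(window), pattern))


def seed_matches_by_multiplicity(genome: str, pattern: str):
    """Group window positions by seed key, highest multiplicity first."""
    groups = defaultdict(list)
    k = len(pattern)
    for i in range(len(genome) - k + 1):
        groups[seed_key(genome[i:i + k], pattern)].append(i)
    repeated = [pos for pos in groups.values() if len(pos) > 1]
    return sorted(repeated, key=len, reverse=True)


pattern = "1101011"  # hypothetical palindromic spaced seed
genome = "ACGTTGACCCCTCAACGT"  # toy sequence; position 11 is the reverse complement of position 0
print(seed_matches_by_multiplicity(genome, pattern))
```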

13
Step 4: Gapped extension
  • After chaining a seed match, we must perform
    gapped extension to approximate the true repeat
    boundaries
  • This is an essential step to consider, assuming
    we would like to improve repeat boundary
    predictions
  • But how can this be done efficiently?

14
Our approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC
GACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGG
TCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAA
CGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGAC
AGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTC
CTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACG
TTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGA
TCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAG
CCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCC
GCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCC
GACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACG
TTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
15
HMM approach to gapped extension
(Alignment figure unchanged from slide 14)
Dynamically calculate the extension window length:
l = 70 e^(-0.01 Mi); for Mi = 200, l is roughly 10
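The window formula can be evaluated directly. This sketch assumes that Mi denotes the multiplicity of the match being extended and that l is measured in nucleotides; both readings are assumptions, since the slide text does not define the symbols. The function name is illustrative.

```python
import math


def extension_window_length(multiplicity: int) -> float:
    """Evaluate l = 70 * exp(-0.01 * Mi) for a match of the given multiplicity.

    Interpreting Mi as match multiplicity is an assumption.
    """
    return 70.0 * math.exp(-0.01 * multiplicity)


# Higher-multiplicity matches get shorter extension windows.
for m in (2, 10, 50, 100, 200):
    print(m, round(extension_window_length(m), 1))
```

Shrinking the window as multiplicity grows keeps each multiple-alignment call small when many copies must be aligned at once.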
16
HMM approach to gapped extension
(Alignment figure unchanged from slide 14)
Use MUSCLE to perform alignment of extension
window
17
HMM approach to gapped extension
ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTC
CGACTAGCAGCCA TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CT
GGTCCTTTCCTTTAATTTGACATGA TTCATCCCCCCTGAGG-ATCTCTT
TAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGA
CAGGATGGACG AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGG
TCCTTTCCTTAAAAAAATTAAAA AACCCGCCCCCTGAGGCA-CTCTTTA
ACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTTTGCCCCCTGA
GG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GG
AAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAG
CGCCCGCGG ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTC
CTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-CGAGGCATCTC
TTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
Use HMM to detect and unalign unrelated sequence
18
HMM approach to gapped extension
(Alignment figure unchanged from slide 17)
Extension successful, continue extending
19
HMM approach to gapped extension
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC
GACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGG
TCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAA
CGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGAC
AGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTC
CTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACG
TTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-A
TCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAG
CCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCC
GCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCC
GACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACG
TTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
20
HMM approach to gapped extension
(Alignment figure unchanged from slide 19)
Use HMM to detect and unalign unrelated sequence
21
HMM approach to gapped extension
(Alignment figure unchanged from slide 19)
Finished leftward extension, now to the right
22
HMM approach to gapped extension
(Alignment figure unchanged from slide 19)
23
HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC
GA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGG
TCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAA
CGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT
AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAG
ACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTC
C---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACG
TTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-A
TCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAG
CCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCG
CCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TT
TCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACG
TTTA-CTGGTCC---TTTCGTTTCCCCCCGGC
Perform MUSCLE alignment on window
24
HMM approach to gapped extension
(Alignment figure unchanged from slide 23)
Use HMM to detect and unalign unrelated sequence
25
HMM approach to gapped extension
(Alignment figure unchanged from slide 23)
Extension successful, continue extending
26
HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC
GAGCAGCCACCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGG
TCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAA
CGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT
AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAG
ACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTC
C---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACG
TTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-A
TCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAG
CCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCG
CCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TT
TCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACG
TTTA-CTGGTCC---TTTCGTTTCCCCCCGGC
Use MUSCLE to perform alignment of extension
window
27
HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC
---GAGCAGCCAC- TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CT
GGTCC---TTTCCTTTAATTTGA---- TTCATGCCCCTGAGG-ATCTC
TTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC
AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG--
-AGACTAGGAT- CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGG
TCC---TTTCCTTAAAAAAAT---- AACCCGCCCCTGAGGCA-CTCTTT
AACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT- TTTTTGCCCCT
GAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTAT
A GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC
---AAAGAGCGCC- CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CT
GGTCC---TTTCC----GACCGAATTA . . . -TTCGGCCCC-GAGGC
ATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG
Use HMM to detect and unalign unrelated sequence
28
HMM approach to gapped extension
(Alignment figure unchanged from slide 27)
Extension failed, stop extending
29
Wait a moment...
  • The MUSCLE alignment software reports the highest
    scoring global multiple alignment of the input
    sequences, regardless of common ancestry.
  • As a result, it is likely that this method
    forcibly aligns unrelated sequence.
  • We use HMMs to detect alignments of unrelated
    sequence.
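How a two-state HMM separates homologous from unrelated alignment columns can be illustrated with a small Viterbi decoder. To stay short, the sketch reduces each alignment column to one of three symbols (match, mismatch, gap), and every probability below is invented for illustration; none of them are the values used in the presentation.

```python
import math

STATES = ("H", "U")  # Homologous, Unrelated

# Hypothetical parameters, chosen only to make the example readable.
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.95, "U": 0.05}, "U": {"H": 0.05, "U": 0.95}}
EMIT = {
    "H": {"match": 0.85, "mismatch": 0.10, "gap": 0.05},
    "U": {"match": 0.25, "mismatch": 0.50, "gap": 0.25},
}


def viterbi(columns, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for the observed columns."""
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(start[s]) + math.log(emit[s][columns[0]]) for s in STATES}
    back = []
    for col in columns[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(trans[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][col])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


columns = ["mismatch"] * 8 + ["match"] * 12
print(viterbi(columns))
```

Columns decoded as U would then be removed from the alignment, which gives a natural stopping rule for extension: stop once the newly aligned window is decoded as unrelated.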

30
Step 5: Detecting unrelated sequence
  • The HMM consists of two hidden states, Homologous
    and Unrelated.
  • The observable states are the pairwise alignment
    columns, which are all possible pairs in
    {A,G,C,T,-} with strand and species symmetry
  • i.e. AG = GA = TC = CT.
  • The emission probabilities for each possible pair
    of aligned nucleotides were extracted from the
    HOXD substitution matrix presented by Chiaromonte
    et al.
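The strand and species symmetry of the observed columns can be sketched as a function that maps every pairwise column to a single representative. Choosing the lexicographically smallest member of each class is an arbitrary convention adopted here for illustration, and treating the gap character as its own complement is likewise an assumption.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}


def canonical_pair(a: str, b: str) -> str:
    """Collapse a pairwise column to one representative under both symmetries."""
    candidates = {
        a + b,                          # as observed
        b + a,                          # species symmetry: swap the two rows
        COMPLEMENT[a] + COMPLEMENT[b],  # strand symmetry: complement both
        COMPLEMENT[b] + COMPLEMENT[a],  # both symmetries combined
    }
    return min(candidates)


# AG, GA, TC and CT all collapse to the same observable symbol.
print({canonical_pair(a, b) for a, b in ("AG", "GA", "TC", "CT")})
```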

31
(Figure: HMM diagram with states H and U, a 0.5 label, and state path UUUU)
  • Compute emission frequencies for the Unrelated
    state of our HMM using the background frequencies
    of G/C and A/T, assuming strand and species
    symmetry
  • U_AA = U_AT = U_TA = U_TT = (f_AT/2) × (f_AT/2)
  • U_CC = U_CG = U_GC = U_GG = (f_GC/2) × (f_GC/2)
  • U_AC = U_AG = U_TC = U_TG = (f_AT/2) × (f_GC/2)
  • U_CA = U_CT = U_GA = U_GT = (f_GC/2) × (f_AT/2)
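The emission frequencies for the Unrelated state follow from treating the two aligned nucleotides as independent draws from the background composition. A minimal sketch, assuming f_AT + f_GC = 1 and ignoring gap columns; the function name and the 48% GC example value are illustrative.

```python
def unrelated_emissions(f_at: float, f_gc: float) -> dict:
    """Emission probability of each nucleotide pair in the Unrelated state."""
    assert abs(f_at + f_gc - 1.0) < 1e-9, "background frequencies must sum to 1"
    # Strand symmetry: A and T share f_AT equally; C and G share f_GC equally.
    base_freq = {"A": f_at / 2, "T": f_at / 2, "C": f_gc / 2, "G": f_gc / 2}
    # Unrelated positions are independent, so the pair probability is a product.
    return {x + y: base_freq[x] * base_freq[y] for x in "ACGT" for y in "ACGT"}


emissions = unrelated_emissions(f_at=0.52, f_gc=0.48)
print(emissions["AA"], emissions["CG"], emissions["AC"])
```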

32
(Figure: HMM diagram with states H and U, a 0.5 label, and state path UUUUUU)
  • To empirically estimate gap-open and gap-extend
    values for the unrelated state, align a 10-kb,
    48% GC content region taken from E. coli CFT073
    (Accession AF447814.1, coordinates 37,300-38,300)
    with an unrelated sequence.

33
(Figure: HMM diagram with states H and U, a 0.5 label, and state path UUUUUUUUUUUU)
  • We aligned the unrelated sequences with MUSCLE
    and counted the number of gap-open and gap-extend
    columns in the resulting alignment.
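Counting gap-open and gap-extend columns in a pairwise alignment can be sketched as below. The convention used here is an assumption: a gap column counts as gap-open when the previous column did not have a gap in the same row, and as gap-extend otherwise. Dividing each count by the number of alignment columns gives an empirical frequency.

```python
def count_gap_columns(row_a: str, row_b: str):
    """Return (gap_open, gap_extend) column counts for a pairwise alignment."""
    assert len(row_a) == len(row_b), "aligned rows must have equal length"
    gap_open = gap_extend = 0
    previous = None  # which row held a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b != "-":
            current = "a"
        elif b == "-" and a != "-":
            current = "b"
        else:
            current = None
        if current is not None:
            if current == previous:
                gap_extend += 1
            else:
                gap_open += 1
        previous = current
    return gap_open, gap_extend


opens, extends = count_gap_columns("AC--GT-A", "ACTTG-CA")
print(opens, extends)
```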

34
(Figure: HMM diagram with states H and U, a 0.5 label, and state path UUUUUUUUUUUUUUUUUUHHHHHHHHHHHHHHHHHHHHHHHHHHHHH)
  • Gap-open and gap-extend frequencies for the
    homologous state were estimated by constructing
    an alignment of 10 kb of orthologous sequence
    shared between a pair of divergent organisms.

35
(Figure: HMM diagram with states H and U, a 0.5 label, and state path UUUUUUUUUUUUUUUUUUHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH)