Title: Repeats!
1Repeats!
2Introduction
- A repeat family is a collection of repeats which
appear multiple times in a genome.
- Our objective is to identify all families of
interspersed repeats within a single genome
3Challenges when identifying repeat families
- Challenges
- Regions containing repeat occurrences are not known a priori
- Repeat boundaries are not known a priori
- Many repeat occurrences appear as partial copies
4Why are repeats important
- Repeats have been implicated in:
- Genome rearrangements (Kazazian, 2004; Achaz et al., 2003)
- Accelerated loss of gene order (Rocha et al., 2003)
- Creation of novel biological functions (Lynch et al., 2002)
- Increased rate of evolution under stress (Capy et al., 2000)
5Identifying repeats de novo
- Assume we get a new genome and know nothing about it. We can:
- Use a database of known repeats (RepeatMasker/RepBase), but novel repeat elements may not be in the database, and repetitive gene families are never in the database
- Identify repeats de novo using sequence analysis
6Existing methods for detection of repeat families
- Nearly all existing algorithms for de novo
identification of repeat families rely on a set
of pairwise similarities
- REPuter (Kurtz et al., 2000)
- RepeatFinder (Volfovsky et al., 2001)
- RECON (Bao and Eddy, 2002)
- RepeatGluer (Pevzner et al., 2004)
- PILER (Edgar and Myers, 2005)
- RepeatScout (Price et al., 2005)
7Mutational forces at play
- Over time, indels and substitutions will affect copies of repeat families
- AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
- AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
- AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT
- AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT
- AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT
- Alignments (with gaps) are required to attempt to reconstruct true repeat boundaries
8de novo repeat detection
- One approach: self-search with a pairwise local-alignment tool such as BLAST
- The number of pairwise alignments grows as O(r^2) in the copy number r of the repeat
- Inherent difficulty defining repeat boundaries among collections of pairwise alignments
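To make the quadratic growth concrete: if every copy of a repeat aligns to every other copy, r copies yield r(r-1)/2 unordered pairs. The helper below is only an illustration; the function name is hypothetical and it assumes every pair of copies produces exactly one alignment.

```python
def pairwise_alignment_count(copy_number: int) -> int:
    """Number of unordered pairs among `copy_number` repeat copies.

    Assumes each pair of copies produces exactly one pairwise alignment.
    """
    return copy_number * (copy_number - 1) // 2


if __name__ == "__main__":
    for r in (2, 10, 100, 1000):
        print(r, pairwise_alignment_count(r))
```

Under this assumption, going from 10 to 1000 copies raises the count from 45 to 499500 pairs.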
9Alternative methods?
A single local multiple alignment uses O(N) space
for a genome of length N
10Local multiple alignment
- Local multiple alignment has the inherent potential to avoid pitfalls associated with pairwise alignment.
- But multiple alignment under the SP (sum-of-pairs) objective function remains intractable
- Progressive alignment heuristics offer excellent speed and accuracy (e.g. MUSCLE).
- So why not directly construct a multiple alignment?
12Steps 1-3 Chaining seeds from the Input Sequence
- The method incorporated three novel ideas:
- (1) palindromic spaced seed patterns to match both DNA strands simultaneously,
- (2) seed extension (chaining) in order of decreasing multiplicity, and
- (3) procrastination when low multiplicity matches are encountered.
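Ideas (1) and (2) can be sketched in a few lines. A palindromic pattern reads the same in both directions, so the word it extracts from the reverse strand is the reverse complement of the word it extracts from the forward strand. Storing each word under the smaller of the two forms therefore groups matches from both strands together. This is a minimal illustration only, not the authors' implementation: the pattern `1101011`, the function names, and the plain dictionary index are all assumptions, and procrastination (3) is not modelled.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(word: str) -> str:
    """Reverse complement of a DNA string."""
    return word.translate(COMPLEMENT)[::-1]


def seed_matches(seq: str, pattern: str = "1101011"):
    """Group positions by spaced-seed word, both strands at once.

    `pattern` is a hypothetical palindromic seed: '1' = position must match,
    '0' = don't care. Returns groups of (position, strand) with at least two
    occurrences, sorted by decreasing multiplicity.
    """
    if pattern != pattern[::-1]:
        raise ValueError("seed pattern must be palindromic")
    care = [j for j, c in enumerate(pattern) if c == "1"]
    groups = defaultdict(list)
    for i in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[i + j] for j in care)
        rc = revcomp(word)
        # Canonical key: the lexicographically smaller of word and its
        # reverse complement, so both strands land in the same group.
        if word <= rc:
            groups[word].append((i, "+"))
        else:
            groups[rc].append((i, "-"))
    matches = [occ for occ in groups.values() if len(occ) >= 2]
    matches.sort(key=len, reverse=True)  # highest multiplicity first
    return matches
```

For example, `seed_matches("AACGTT", "111")` groups position 0 (forward) with position 3 (reverse strand), because `GTT` is the reverse complement of `AAC`.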
13Step 4 Gapped Extension
- After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries
- This is an essential step to consider, assuming we would like to improve repeat boundary predictions
- But how can this be done efficiently?
14Our approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
. . .
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
15HMM approach to gapped extension
[Alignment unchanged from slide 14]
Dynamically calculate extension window: $70e^{-0.01 M_i}$ ($M_i$ 200, $l$ 10)
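The window expression on this slide decays exponentially in M_i. The sketch below evaluates it under two assumptions that the slide text does not confirm: that M_i is a per-match quantity such as the match multiplicity, and that the trailing "l 10" denotes a lower bound of 10 on the window length l. The function name and keyword parameters are hypothetical.

```python
import math


def extension_window(m_i: float, scale: float = 70.0, decay: float = 0.01,
                     min_window: int = 10) -> int:
    """Window length 70 * exp(-0.01 * m_i), rounded to an integer.

    ASSUMPTION: the result is clamped from below at `min_window`; the
    slide lists "l 10" without a surviving relational operator.
    """
    return max(min_window, round(scale * math.exp(-decay * m_i)))
```

With these assumptions the window starts near 70 columns for small M_i and shrinks toward the assumed floor as M_i grows.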
16HMM approach to gapped extension
[Alignment unchanged from slide 14]
Use MUSCLE to perform alignment of extension window
17HMM approach to gapped extension
ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
. . .
ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
Use the HMM to detect and unalign unrelated sequence
18HMM approach to gapped extension
[Alignment unchanged from slide 17]
Extension successful, continue extending
19HMM approach to gapped extension
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
. . .
ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
20HMM approach to gapped extension
[Alignment unchanged from slide 19]
Use the HMM to detect and unalign unrelated sequence
21HMM approach to gapped extension
[Alignment unchanged from slide 19]
Finished leftward extension, now to the right
22HMM approach to gapped extension
[Alignment unchanged from slide 19]
23HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA
TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA
TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT
AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG
CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA
AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT
TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG
CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT
. . .
-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC
Perform MUSCLE alignment on window
24HMM approach to gapped extension
[Alignment unchanged from slide 23]
Use the HMM to detect and unalign unrelated sequence
25HMM approach to gapped extension
[Alignment unchanged from slide 23]
Extension successful, continue extending
26HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGAGCAGCCACCA
TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA
TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT
AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG
CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA
AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT
TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG
CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT
. . .
-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC
Use MUSCLE to perform alignment of extension window
27HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC-
TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA----
TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC
AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT-
CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT----
AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT-
TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC-
CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA
. . .
-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG
Use the HMM to detect and unalign unrelated sequence
28HMM approach to gapped extension
[Alignment unchanged from slide 27]
Extension failed, stop extending
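The walk-through on slides 15-28 follows a simple loop: take the next window, align it, test it, and stop at the first failure. The skeleton below sketches only that control flow for one direction. Everything in it is an assumption for illustration: the function name, the fixed window size, and the `window_is_homologous` callback, which stands in for the MUSCLE alignment plus HMM check described on the slides.

```python
from typing import Callable


def extend_boundary(start: int, limit: int, window_size: int,
                    window_is_homologous: Callable[[int, int], bool]) -> int:
    """Advance a repeat boundary window by window until a window fails.

    `window_is_homologous(begin, end)` is a hypothetical callback that
    aligns the window [begin, end) and reports whether it looks homologous.
    Returns the last boundary that passed the check.
    """
    boundary = start
    while boundary < limit:
        end = min(boundary + window_size, limit)
        if not window_is_homologous(boundary, end):
            break  # "Extension failed, stop extending"
        boundary = end  # "Extension successful, continue extending"
    return boundary
```

For instance, with a check that accepts windows ending at or before column 35, `extend_boundary(0, 100, 10, lambda b, e: e <= 35)` stops at 30. The same loop would be run once leftward and once rightward from the chained seed.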
29Wait a moment..
- The MUSCLE alignment software reports the highest scoring global multiple alignment of the input sequences, regardless of common ancestry.
- As a result, it is likely that this method forcibly aligns unrelated sequence.
- We use HMMs to detect alignments of unrelated sequence.
30Step 5 detecting unrelated sequence
- The HMM consists of two hidden states, Homologous and Unrelated.
- The observable states are the pairwise alignment columns, which are all possible pairs in {A,G,C,T,-} with strand and species symmetry, i.e. AG = GA = TC = CT.
- The emission probabilities for each possible pair
of aligned nucleotides were extracted from the
HOXD substitution matrix presented by Chiaromonte
et al.
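The symmetry rule can be made concrete by enumerating the column pairs and merging those that are equivalent. In the sketch below, species symmetry is modelled as swapping the two rows and strand symmetry as complementing both characters; that reading of the slide is an assumption, as are the function names. Including the gap/gap column is also an assumption.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}


def canonical_pair(x: str, y: str) -> tuple:
    """Representative of an alignment column under both symmetries."""
    variants = [
        (x, y),                          # as observed
        (y, x),                          # species symmetry: swap rows
        (COMPLEMENT[x], COMPLEMENT[y]),  # strand symmetry: other strand
        (COMPLEMENT[y], COMPLEMENT[x]),  # both at once
    ]
    return min(variants)


def symbol_classes(alphabet: str = "ACGT-") -> dict:
    """Map each canonical pair to the list of columns it represents."""
    classes = {}
    for x, y in product(alphabet, repeat=2):
        classes.setdefault(canonical_pair(x, y), []).append(x + y)
    return classes
```

Under these assumptions the 25 raw columns collapse into 9 classes, and the class of `AG` contains exactly `AG`, `GA`, `TC`, and `CT`, matching the example on the slide.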
31
[Figure: HMM state diagram (labels 0.5, U, H); state path so far: UUUU]
- Compute emission frequencies for the Unrelated state of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry:
- $U_{AA} = U_{AT} = U_{TA} = U_{TT} = (f_{AT}/2)(f_{AT}/2)$
- $U_{CC} = U_{CG} = U_{GC} = U_{GG} = (f_{GC}/2)(f_{GC}/2)$
- $U_{AC} = U_{AG} = U_{TC} = U_{TG} = (f_{AT}/2)(f_{GC}/2)$
- $U_{CA} = U_{CT} = U_{GA} = U_{GT} = (f_{GC}/2)(f_{AT}/2)$
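As a worked example of these background emissions, the sketch below treats the two aligned nucleotides as independent draws, with A and T each at frequency f_AT/2 and G and C each at f_GC/2. The function name is hypothetical and gap columns are left out.

```python
def unrelated_emissions(gc_content: float) -> dict:
    """Emission probability of each nucleotide pair in the Unrelated state.

    Each pair probability is the product of the two background
    frequencies, with f_AT = 1 - f_GC split evenly between A and T
    and f_GC split evenly between G and C.
    """
    f_gc = gc_content
    f_at = 1.0 - gc_content
    base_freq = {"A": f_at / 2, "T": f_at / 2, "G": f_gc / 2, "C": f_gc / 2}
    return {x + y: base_freq[x] * base_freq[y] for x in "ACGT" for y in "ACGT"}
```

The 16 values sum to 1 for any GC content, which is a quick sanity check on the table above.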
32
[Figure: HMM state diagram (labels 0.5, H, UU); state path so far: UUUUUU]
- To empirically estimate gap-open and gap-extend values for the unrelated state, align a 10-kb, 48% GC content region taken from E. coli CFT073 (Accession AF447814.1, coordinates 37,300-38,300) with an unrelated sequence.
33
[Figure: HMM state diagram (labels 0.5, H, UU); state path so far: UUUUUUUUUUUU]
- We aligned the unrelated sequences with MUSCLE and counted the number of gap-open and gap-extend columns in the resulting alignment.
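The counting step can be sketched for a single pair of aligned rows. Here a gap-open column is taken to be the first column of a run of gaps in one row, and a gap-extend column is any later column of the same run. Those definitions and the function name are assumptions; columns where both rows hold a gap are ignored and end any current run.

```python
def count_gap_columns(row_a: str, row_b: str) -> tuple:
    """Return (gap_open, gap_extend) column counts for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    opens = extends = 0
    prev = None  # row that carried a gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y != "-":
            cur = "a"
        elif y == "-" and x != "-":
            cur = "b"
        else:
            cur = None
        if cur is not None:
            if cur == prev:
                extends += 1  # same row still gapped: extension
            else:
                opens += 1    # new run of gaps starts here
        prev = cur
    return opens, extends
```

Dividing each count by the number of alignment columns gives empirical gap-open and gap-extend frequencies that could serve as emission estimates.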
34
[Figure: HMM state diagram (labels 0.5, H, UU); state path so far: UUUUUUUUUUUUUUUUUUHHHHHHHHHHHHHHHHHHHHHHHHHHHHH]
- Gap-open and gap-extend frequencies for the homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared between a pair of divergent organisms.
35
[Figure: HMM state diagram (labels 0.5, H, UU); state path so far: UUUUUUUUUUUUUUUUUUHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH]
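A state path of U and H labels like the ones on slides 31-35 is what Viterbi decoding of a two-state HMM produces. The sketch below is a generic Viterbi decoder run over a simplified observation alphabet (match, mismatch, gap) instead of the full pair symbols. All numbers in it are placeholders chosen for illustration, not the values estimated in the talk, and the function and variable names are hypothetical.

```python
import math

STATES = ("H", "U")
# ASSUMED parameters, for illustration only.
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.9, "U": 0.1}, "U": {"H": 0.1, "U": 0.9}}
EMIT = {
    "H": {"=": 0.80, "x": 0.15, "-": 0.05},
    "U": {"=": 0.25, "x": 0.55, "-": 0.20},
}


def column_symbols(row_a: str, row_b: str) -> list:
    """Reduce each pairwise column to '=' (match), 'x' (mismatch) or '-' (gap)."""
    return ["-" if "-" in (a, b) else "=" if a == b else "x"
            for a, b in zip(row_a, row_b)]


def viterbi(observations: list) -> str:
    """Most probable H/U state path for the observations (log space)."""
    if not observations:
        return ""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
              for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            new_scores[s] = (scores[best] + math.log(TRANS[best][s])
                             + math.log(EMIT[s][obs]))
        backpointers.append(pointers)
        scores = new_scores
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Columns labelled U at the edge of an extension window would then be the candidates for unalignment, while a window decoded mostly as H would count as a successful extension.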