Title: Transposable Elements (TE) in genomic sequence
1Transposable Elements (TE) in genomic sequence
2Contents
- Definition
- De novo identification of repeat families in large genomes (RepeatScout)
- - Alkes L. Price, Neil C. Jones and Pavel A. Pevzner
- Combined Evidence Annotation of Transposable Elements in Genome Sequences
- - Hadi Quesneville, Casey M. Bergman, Olivier Andrieu, Delphine Autard, Danielle Nouaud, Michael Ashburner, Dominique Anxolabehere
3Mobile element/Transposable element
- Transposon
- - a segment of DNA that can move to different positions in the genome of a single cell.
- - cut out of its location and inserted into a new location.
- - consisting of DNA.
- Retrotransposon
- - copied and pasted into a new location.
- - the copy is made of RNA and transcribed back into DNA by reverse transcriptase.
- - LTR retrotransposons have long terminal repeats (LTRs) at their ends.
- => TEs are expected to provide information on evolution, mutation, and changes in the amount of DNA in the genome.
6RepeatScout
7Definition
- Repeat family: a collection of similar sequences which appear many times in a genome.
- - the Alu repeat family has over 1 million approximate occurrences in the human genome
- - repeats make up about 50% of the human genome
- l-mer: a substring of length l
8Background
- Current methods for identifying repeat families
- Given an existing library of repeat families
- - RepeatMasker
- De novo identification
- - REPuter (Kurtz et al., 2000)
- - RepeatFinder (Volfovsky et al., 2001)
- - RECON (Bao and Eddy, 2002)
- - RepeatGluer (Pevzner et al., 2004)
- - PILER (Edgar and Myers, 2005)
- - RepeatScout
9Overview of RepeatScout
- Method
- Builds a table of high frequency l-mers as seeds
- Extends each seed to a longer consensus sequence
- Main advantage
- an efficient method of similarity search which
enables a rigorous definition of repeat
boundaries.
10How to create l-mer table
[Figure: the sequence is scanned position by position (i, i+1, i+2, ..., j, k); each l-mer (l-mer1 to l-mer6) is entered into a hash table that stores its frequency and the position of its last occurrence.]
11Output of l-mer table
- Columns: l-mer, frequency, position of last occurrence
- AAAAAAAAAAAGATA 8 2920943
- AAAAAAAGGAAAGAA 5 2468525
- AGGCTTGAACAATGG 3 1425014
- AAAAAAAAGAAAGAA 62 3009663
- GTTGGTTTCAAAGAA 7 2855871
- AAAAAAAATTTTTTT 22 2992836
- ATTCAAGTTAAATGG 4 1473342
- ATTCAATGTAACCAC 3 1463008
- ATGCATGCAATGCAT 9 1788944
- ATGCATTTAAAAGAA 3 1464381
- AAAAAACTCACTCCA 5 1489159
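A table of this shape can be sketched in a few lines of Python. This is only an illustration of the idea, not RepeatScout's own code: the function name `build_lmer_table`, the `LmerEntry` record, and the `min_frequency` cutoff of 3 are assumptions made for the example, and reverse-complement handling is left out.

```python
from collections import namedtuple

# One row of the table: how often the l-mer occurs and where it was last seen.
LmerEntry = namedtuple("LmerEntry", ["frequency", "last_position"])


def build_lmer_table(sequence, l, min_frequency=3):
    """Count every l-mer in `sequence` and remember its last start position.

    `min_frequency` is a hypothetical cutoff: l-mers seen fewer times
    are dropped from the returned table.
    """
    table = {}
    for pos in range(len(sequence) - l + 1):
        lmer = sequence[pos:pos + l]
        if "N" in lmer:
            # Skip windows that contain ambiguous bases.
            continue
        previous = table.get(lmer)
        frequency = previous.frequency + 1 if previous else 1
        table[lmer] = LmerEntry(frequency, pos)
    return {lmer: entry for lmer, entry in table.items()
            if entry.frequency >= min_frequency}


# Print the table in the same three-column layout as the slide.
for lmer, entry in sorted(build_lmer_table("ACGTACGTACGT", 4).items()):
    print(lmer, entry.frequency, entry.last_position)
```

A plain dictionary plays the role of the hash table here; for a whole genome a more compact encoding of the l-mers would be needed.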
12How to build all positions of repeats
[Figure: sequence positions (i, i+1, i+2, ..., j, k) and the hash table of l-mers (l-mer1 to l-mer6); each l-mer entry points to the positions where it occurs (e.g. j, i, k, i+2).]
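Collecting all positions can be sketched as a second pass over the sequence that records a position only for l-mers already known to be frequent. The function name `index_lmer_positions` and the `frequent_lmers` argument are hypothetical names chosen for this example.

```python
from collections import defaultdict


def index_lmer_positions(sequence, l, frequent_lmers):
    """Return {l-mer: [start positions]} for the l-mers in `frequent_lmers`."""
    positions = defaultdict(list)
    for pos in range(len(sequence) - l + 1):
        lmer = sequence[pos:pos + l]
        if lmer in frequent_lmers:
            # Positions are appended in scan order, so each list is sorted.
            positions[lmer].append(pos)
    return dict(positions)


print(index_lmer_positions("ACGTACGTACGT", 4, {"ACGT"}))
```

Restricting the index to frequent l-mers keeps memory proportional to the number of seed occurrences rather than to the genome length.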
13Query sequence (with l-mer1)
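[Figure: the occurrences S1, S2, S3, S4, S5 of the seed l-mer1 in the genome, aligned to the query Q.]
- Q is extended one nucleotide at a time so as to maximize the objective function.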
14Objective Function
- |Q|: the length of Q
- C: minimum threshold on the number of repeat elements
- a(Q, Sk): a pairwise fit-preferred alignment score
- p: incomplete-fit penalty
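The formula itself did not survive on this slide. Assuming it takes the form given in the RepeatScout paper, A(Q) = sum over k of max(a(Q, Sk), 0) - C|Q|, the objective can be sketched as below. The pairwise scores a(Q, Sk) are taken as already computed; the fit-preferred alignment and the penalty p are not implemented here, and the function name `objective` is hypothetical.

```python
def objective(pairwise_scores, query_length, c):
    """Sum of the positive alignment scores minus a length penalty.

    pairwise_scores: the values a(Q, Sk), one per occurrence Sk (assumed given).
    query_length:    |Q|, the current length of the consensus Q.
    c:               the threshold parameter C from the slide.
    """
    # Occurrences that align badly contribute 0 instead of a negative score.
    gain = sum(max(score, 0) for score in pairwise_scores)
    return gain - c * query_length


print(objective([10, -4, 7], query_length=5, c=2))
```

Under this form, extending Q by one nucleotide pays off only if the total score gained across the occurrences exceeds C, which is why C acts as a threshold on the number of repeat elements.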
15Output of optimized Q
- >R0
- GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACTTGAGGTCAGGAGTTC
- GAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTG
- TAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCG
- CCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAA
- >R1
- AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTTTGAAGAGAGTAGTGGTTCTCCCAGCACGCA
- GCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACCCCCGAGTAGCCTAACTGGGAGGCACCCCCC
- AGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAAACTTCCAGAGGAACAATCAGGCAGCAACAT
- TTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCAGGCAAACAGGGTCTGGAGTGGACCTCCAGC
- AAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACTAACAAACAGAAAGGACATCCACACCAAAAA
- CCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAAAGATGGGGAAAAAACAGAGCAGAAAAACTG
- GAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCCTCACCAGCAACGGAACAAAGCTGGACGGAG
- AATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTCCAAGCTAAAGGAGGAAATTCAAACCCATGG
- CAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGAGC
- TGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAA
- AGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAAGTTTAGAGAAAAAAGAATAAAAAGAAATGA
- >R2
- TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTCTCAAACTCCTGGGCTCAAGTGATCCTCCCA
16Parameter setting and post processing
- Parameter setting
- - Recommended smallest l = 15
- - For the arbitrary length L,
- - The length of Q is extended up to 10,000 bp on each side
- - Remove repeat families with |Q| < 50
- Postprocessing
- - Tandem Repeat Finder, Nseg
- - - Remove repeat families with >50% of their length annotated as low-complexity and tandem repeats
- - RepeatMasker
- - - Mask the repeat families based on the library
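The two removal rules above (consensus shorter than 50, and more than 50% of the length annotated as low-complexity or tandem repeat) can be sketched as a single predicate. The function name `keep_family` and its arguments are hypothetical; the count of annotated bases is assumed to come from a separate run of Tandem Repeat Finder and Nseg.

```python
def keep_family(consensus_length, masked_bases,
                min_length=50, max_masked_fraction=0.5):
    """Return True if a repeat family survives both removal rules.

    consensus_length: |Q|, the length of the family consensus.
    masked_bases:     bases of Q annotated as low-complexity or tandem
                      repeat (assumed to be computed elsewhere).
    """
    if consensus_length < min_length:
        # Rule 1: remove families with |Q| < 50.
        return False
    # Rule 2: remove families with more than 50% of their length masked.
    return masked_bases / consensus_length <= max_masked_fraction


families = {"R0": (309, 27), "R2": (80, 15), "short": (40, 0)}
kept = [name for name, (length, masked) in families.items()
        if keep_family(length, masked)]
print(kept)
```

The family lengths and masked counts in the usage lines are made-up illustration values, not measurements from the slides.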
17Benchmark
- C. briggsae genome (108 Mb)
- 7h on a single 0.5 GHz DEC Alpha processor
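The l = 15 on the parameter slide is consistent with this genome size if l is chosen by the rule l = ceil(log4(L) + 1), where L is the sequence length. That rule is an assumption here, recalled from the RepeatScout documentation rather than stated on the slides. The sketch below uses integer arithmetic to avoid rounding problems near exact powers of 4; the function name `default_lmer_length` is hypothetical.

```python
def default_lmer_length(genome_length):
    """Smallest l with 4**(l - 1) >= genome_length.

    For genome_length >= 1 this equals ceil(log4(genome_length) + 1),
    the assumed default rule for the l-mer length.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be at least 1")
    l = 1
    while 4 ** (l - 1) < genome_length:
        l += 1
    return l


# 108 Mb, the C. briggsae genome size quoted above.
print(default_lmer_length(108_000_000))
```

The intuition behind such a rule is that a random l-mer of this length is expected to occur less than once in a sequence of length L, so repeated l-mers are unlikely to arise by chance.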
18Combined evidence model of TE
19Overview
- Query sequences: Drosophila melanogaster (fruit fly) Release 3, 4
- Combined evidence model: a pipeline of RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, and TE-HMM
- - Methods for the annotation of known TE families
- - Methods for the annotation of anonymous TE families
- Benchmark: FlyBase Release 3.1 annotation
- - Sensitivity and specificity, characteristics of boundaries
20Tools
- BLASTER
- - Compares query sequences against a subject databank.
- - Launches one of the BLAST programs (BLASTN, TBLASTN, BLASTX, TBLASTX).
- - Cuts long sequences before launching BLAST and reassembles the results.
- MATCHER
- - Maps match results onto query sequences by filtering overlapping hits.
- - Keeps the match results with E-value < 10^-10 and length > 20
- - Chains the remaining matches by dynamic programming.
- GROUPER
- - Gathers similar sequences into groups
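The threshold step of MATCHER (E-value below 10^-10 and length above 20) can be sketched as a simple filter. This is not MATCHER's code: the `Match` record and the function `filter_matches` are hypothetical, and the overlap filtering and dynamic-programming chaining are not shown.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """A hit on the query, with 1-based inclusive coordinates (assumed)."""
    query_start: int
    query_end: int
    e_value: float

    @property
    def length(self):
        return self.query_end - self.query_start + 1


def filter_matches(matches, max_e_value=1e-10, min_length=20):
    """Keep matches with E-value < max_e_value and length > min_length."""
    return [m for m in matches
            if m.e_value < max_e_value and m.length > min_length]


hits = [
    Match(1, 100, 1e-30),  # long and significant: kept
    Match(1, 15, 1e-30),   # too short: dropped
    Match(1, 100, 1e-5),   # E-value too high: dropped
]
print(filter_matches(hits))
```

Both comparisons are strict, mirroring the "<" and ">" on the slide, so a match of exactly 20 positions is dropped.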
21Measures
- For each nucleotide,
- - TP: correctly annotated as belonging to a TE
- - FP: falsely predicted as belonging to a TE
- - TN: correctly annotated as not belonging to a TE
- - FN: falsely predicted as not belonging to a TE
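These four counts can be turned into sensitivity and specificity. The slides do not give the formulas, so the sketch assumes the standard definitions Sn = TP / (TP + FN) and Sp = TN / (TN + FP); the function names and the per-nucleotide boolean encoding are hypothetical.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, TN, FN over two equal-length per-nucleotide masks.

    Each mask holds one truthy/falsy value per nucleotide:
    truthy means "belongs to a TE".
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    tp = fp = tn = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and r:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of reference TE nucleotides that were predicted."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-TE nucleotides that were left unannotated."""
    return tn / (tn + fp)


tp, fp, tn, fn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
```

The two ratio functions raise ZeroDivisionError when their denominator is zero, for example when the reference contains no TE nucleotides at all.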
25Method for the Annotation of known TE families
- BLASTER using BLASTN and MATCHER (BLRn)
- RepeatMasker (RM)
- RepeatMasker with MATCHER (RMm)
- RepeatMasker-BLASTER (RMBLR): combines the hits from both BLRn and RM and gives them to MATCHER
26Method for the Annotation of anonymous TE families
- All-by-all comparison with BLASTER using BLASTN, MATCHER, and GROUPER
- RECON
- BLASTER using TBLASTX and MATCHER
- HMM
27What they (we) learned
- Overall, BLRn outperforms RM with respect to the precise determination of TE boundaries.
- RM is more sensitive for the detection of small and divergent TEs.
- The differences between BLRn and RM make them complementary for TE annotation.
- A combined-evidence framework can improve the quality and confidence of TE annotation.
28Pipeline structure
- TE detection software: BLASTER, RepeatMasker, TE-HMM, and RECON
- Tandem repeat detection software: RepeatMasker, Tandem Repeat Finder (TRF), Mreps
- Database: MySQL
- Open Portable Batch System
- The whole genomic sequence was segmented into chunks of 200 kb overlapping by 10 kb.
- The results from the different tools were stored in the database.
- An XML file is generated from the stored results and loaded into the Apollo genome annotation tool.
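The segmentation step (200 kb chunks overlapping by 10 kb) can be sketched as follows. The function name `chunk_coordinates` and the 0-based half-open coordinates are assumptions made for this example; the pipeline's own conventions are not given on the slides.

```python
def chunk_coordinates(sequence_length, chunk_size=200_000, overlap=10_000):
    """Return (start, end) pairs covering the sequence with overlapping chunks.

    Coordinates are 0-based and half-open; consecutive chunks share
    `overlap` bases, and the last chunk may be shorter than `chunk_size`.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < sequence_length:
        end = min(start + chunk_size, sequence_length)
        chunks.append((start, end))
        if end == sequence_length:
            break
        start += step
    return chunks


print(chunk_coordinates(450_000))
```

The overlap gives a TE that straddles a chunk boundary a chance to be seen whole in at least one chunk, provided it is shorter than the overlap; hits then have to be mapped back to whole-sequence coordinates by adding each chunk's start.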
29The Annotation Pipeline