Transposable Elements (TE) in genomic sequence


Slides: 31

Provided by: dong98

Learn more at: http://darwin.informatics.indiana.edu


Transcript and Presenter's Notes

1
Transposable Elements (TE) in genomic sequence

Mina Rho

2
Contents

Definition
De novo identification of repeat families in
large genomes (RepeatScout)
Alkes L. Price, Neil C. Jones and Pavel A.
Pevzner
Combined Evidence Annotation of Transposable
Elements in Genome Sequences
Hadi Quesneville, Casey M. Bergman, Olivier
Andrieu, Delphine Autard, Danielle Nouaud,
Michael Ashburner, Dominique Anxolabehere

3
Mobile element/Transposable element

Transposon
- a segment of DNA that can move to different
positions in the genome of a single cell.
- cut out of its location and inserted into a
new location ("cut and paste").
- the mobile intermediate consists of DNA.
Retrotransposon
- copied and pasted into a new location.
- the copy is made of RNA and reverse-transcribed
back into DNA by reverse transcriptase.
- LTR retrotransposons carry long terminal repeats
(LTRs) at their ends.
=> TEs are expected to provide information about
evolution, mutation, and changes in the amount of
DNA in the genome.

4
(No Transcript)
5
(No Transcript)
6
RepeatScout
7
Definition

Repeat family: a collection of similar sequences
that appear many times in a genome.
- the Alu repeat family has over 1 million
approximate occurrences in the human genome
- repeats make up about 50% of the human genome
l-mer: a substring of length l

8
Background

Current methods for identifying repeat families
Given an existing library of repeat families:
- RepeatMasker
De novo identification:
- REPuter (Kurtz et al., 2000)
- RepeatFinder (Volfovsky et al., 2001)
- RECON (Bao and Eddy, 2002)
- RepeatGluer (Pevzner et al., 2004)
- PILER (Edgar and Myers, 2005)
- RepeatScout

9
Overview of RepeatScout

Method
- Builds a table of high-frequency l-mers as seeds
- Extends each seed to a longer consensus sequence
Main advantage
- an efficient similarity-search method that
enables a rigorous definition of repeat
boundaries.

10
How to create l-mer table

[Figure: l-mers starting at sequence positions i, i+1, i+2, j, k are entered into a hash table (l-mer1 ... l-mer6). Each entry stores the l-mer's frequency and the position of its last occurrence.]
11
Output of l-mer table

AAAAAAAAAAAGATA 8 2920943
AAAAAAAGGAAAGAA 5 2468525
AGGCTTGAACAATGG 3 1425014
AAAAAAAAGAAAGAA 62 3009663
GTTGGTTTCAAAGAA 7 2855871
AAAAAAAATTTTTTT 22 2992836
ATTCAAGTTAAATGG 4 1473342
ATTCAATGTAACCAC 3 1463008
ATGCATGCAATGCAT 9 1788944
ATGCATTTAAAAGAA 3 1464381
AAAAAACTCACTCCA 5 1489159
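
The table above (l-mer, frequency, position of last occurrence) can be illustrated with a minimal Python sketch. This is not RepeatScout's code: the function names `build_lmer_table` and `high_frequency_lmers` are hypothetical, and the sketch assumes overlapping occurrences are counted on a single strand with 0-based positions, which may differ from how the real tool counts.

```python
def build_lmer_table(sequence, l):
    """Map each l-mer to [frequency, position of last occurrence].

    Assumption: overlapping occurrences are counted, one strand only,
    positions are 0-based.
    """
    table = {}
    for pos in range(len(sequence) - l + 1):
        lmer = sequence[pos:pos + l]
        entry = table.get(lmer)
        if entry is None:
            table[lmer] = [1, pos]      # first time this l-mer is seen
        else:
            entry[0] += 1               # one more occurrence
            entry[1] = pos              # remember the latest position
    return table


def high_frequency_lmers(table, min_count):
    """Return (l-mer, frequency, last position) rows, most frequent first."""
    rows = [(lmer, freq, last)
            for lmer, (freq, last) in table.items()
            if freq >= min_count]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


if __name__ == "__main__":
    demo = "ACGTACGTTTACGT"
    for lmer, freq, last in high_frequency_lmers(build_lmer_table(demo, 4), 2):
        print(lmer, freq, last)
```

A dictionary keyed by the l-mer string keeps the sketch short; for a whole genome a more compact encoding of l-mers would likely be needed.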

12
How to build all positions of repeats

[Figure: each l-mer entry in the hash table (l-mer1 ... l-mer6) is linked to the sequence positions at which it occurs (for example j, i, k, i+2).]
13
[Figure: query sequence Q (containing l-mer1) aligned against the occurrences S1, S2, S3, S4, S5.]
Q is extended one nucleotide at a time so as to
maximize the objective function.
14
Objective Function

|Q|: the length of Q
C: minimum threshold on the number of repeat
elements
a(Q, Sk): a pairwise fit-preferred alignment
score
p: incomplete-fit penalty
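
The formula itself did not survive in this transcript. A form consistent with the symbols listed on this slide, and with the objective as recalled from the RepeatScout paper (an assumption that should be checked against the paper), is:

```latex
A(Q;\, S_1, \ldots, S_n) \;=\; \sum_{k=1}^{n} \max\{\, a(Q, S_k),\ 0 \,\} \;-\; C\,|Q|
```

Under this reading, an occurrence Sk that aligns poorly with Q contributes 0 rather than a negative score, and the term C|Q| means that extending Q only pays off while at least about C occurrences still align with the newly added positions. The penalty p enters through the alignment score a(Q, Sk).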
15
Output of optimized Q

>R0
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGA
GGCGGGCGGATCACTTGAGGTCAGGAGTTC
GAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAA
AAATTAGCCGGGCGTGGTGGCGCGCGCCTG
TAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGG
AGGCGGAGGTTGCAGTGAGCCGAGATCGCG
CCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAA
AAAAAAAAAAAAAAAAAAA
>R1
AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTT
TGAAGAGAGTAGTGGTTCTCCCAGCACGCA
GCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACC
CCCGAGTAGCCTAACTGGGAGGCACCCCCC
AGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAA
ACTTCCAGAGGAACAATCAGGCAGCAACAT
TTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCA
GGCAAACAGGGTCTGGAGTGGACCTCCAGC
AAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACT
AACAAACAGAAAGGACATCCACACCAAAAA
CCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAA
AGATGGGGAAAAAACAGAGCAGAAAAACTG
GAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCC
TCACCAGCAACGGAACAAAGCTGGACGGAG
AATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTC
CAAGCTAAAGGAGGAAATTCAAACCCATGG
CAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAA
TAACCAATGCAGAGAAGTCCTTAAAGGAGC
TGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGC
CTCAGGAGCCGATGCGATCAACTGGAAGAA
AGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAA
GTTTAGAGAAAAAAGAATAAAAAGAAATGA
>R2
TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTC
TCAAACTCCTGGGCTCAAGTGATCCTCCCA

16
Parameter setting and post processing

Parameter setting
- Recommended smallest l: 15
- For a sequence of arbitrary length L, l is set
as a function of L
- Q is extended up to 10,000 bp on each side
- Remove repeat families with |Q| < 50
Postprocessing
- Tandem Repeats Finder, NSEG
Remove repeat families with > 50% of their length
annotated as low-complexity or tandem repeats
- RepeatMasker
Mask the genome using the resulting repeat
family library
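
The two removal rules on this slide can be written as a small filter. This is a sketch under stated assumptions: `keep_family` and its parameters are hypothetical names, and `masked_bases` stands for the number of consensus bases that a tool such as Tandem Repeats Finder or NSEG annotated as tandem or low-complexity. How that count is obtained is not shown.

```python
def keep_family(consensus_length, masked_bases,
                min_length=50, max_masked_fraction=0.5):
    """Decide whether a repeat family survives postprocessing.

    consensus_length: |Q|, the length of the family consensus in bp.
    masked_bases: bases of Q annotated as low-complexity or tandem
                  repeat (hypothetical input from an external tool).
    """
    if consensus_length < min_length:
        return False                      # rule 1: |Q| < 50 is removed
    masked_fraction = masked_bases / consensus_length
    # rule 2: more than 50% low-complexity/tandem is removed
    return masked_fraction <= max_masked_fraction


if __name__ == "__main__":
    families = {"R0": (312, 40), "R1": (45, 0), "R2": (100, 80)}
    kept = [name for name, (length, masked) in families.items()
            if keep_family(length, masked)]
    print(kept)
```

The example family names and numbers are made up for illustration only.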

17
Benchmark

C. briggsae genome (108 Mb):
7 h on a single 0.5 GHz DEC Alpha processor

18
Combined evidence model of TE
19
Overview

Query sequences: Drosophila melanogaster (fruit
fly) Releases 3 and 4
Combined-evidence pipeline built from
RepeatMasker, BLASTER, TBLASTX, all-by-all
BLASTN, RECON, and TE-HMM
- Methods for the annotation of known TE
families
- Methods for the annotation of anonymous TE
families
Benchmark: FlyBase Release 3.1 annotation
Evaluation: sensitivity and specificity,
characteristics of boundaries

20
Tools

BLASTER
- Compares query sequences against a subject
databank.
- Launches one of the BLAST programs (BLASTN,
TBLASTN, BLASTX, TBLASTX).
- Cuts long sequences before launching BLAST and
reassembles the results.
MATCHER
- Maps match results onto query sequences by
filtering overlapping hits.
- Keeps the match results with E-value < 10^-10
and length > 20.
- Chains the remaining matches by dynamic
programming.
GROUPER
- Gathers similar sequences into groups.
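
The MATCHER steps listed above (threshold filtering, then chaining by dynamic programming) can be sketched as follows. This is not MATCHER's implementation: the `Hit` record, `filter_hits`, and `best_chain` are hypothetical names, hit length is assumed to be measured on the query, and the chaining is a simplified version that only joins hits that are strictly colinear and non-overlapping on both query and subject, with no gap penalty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    q_start: int    # start on the query (inclusive)
    q_end: int      # end on the query (inclusive)
    s_start: int    # start on the subject (inclusive)
    s_end: int      # end on the subject (inclusive)
    score: float
    evalue: float


def filter_hits(hits, max_evalue=1e-10, min_length=20):
    """Keep hits with E-value < max_evalue and query length > min_length."""
    return [h for h in hits
            if h.evalue < max_evalue and (h.q_end - h.q_start + 1) > min_length]


def best_chain(hits):
    """Return the highest-scoring chain of colinear, non-overlapping hits."""
    ordered = sorted(hits, key=lambda h: (h.q_start, h.s_start))
    n = len(ordered)
    if n == 0:
        return []
    best = [h.score for h in ordered]   # best chain score ending at hit i
    prev = [-1] * n                     # predecessor of hit i in that chain
    for i in range(n):
        for j in range(i):
            a, b = ordered[j], ordered[i]
            # a may precede b only if it ends before b starts on both axes
            if a.q_end < b.q_start and a.s_end < b.s_start:
                if best[j] + b.score > best[i]:
                    best[i] = best[j] + b.score
                    prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:                      # trace back from the best end point
        chain.append(ordered[i])
        i = prev[i]
    chain.reverse()
    return chain
```

Sorting by query start guarantees that every valid predecessor of a hit appears earlier in the list, so a single left-to-right pass is enough. The quadratic loop is fine for a sketch but would need a faster structure for large hit sets.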

21
Measures

For each nucleotide:
TP: correctly annotated as belonging to a TE
FP: falsely predicted as belonging to a TE
TN: correctly annotated as not belonging to a TE
FN: falsely predicted as not belonging to a TE
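
The per-nucleotide counts defined above feed the sensitivity and specificity measures used in the benchmark. The sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the slides do not state the formulas, and the function names and the boolean-mask representation are hypothetical.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, TN, FN from two per-nucleotide boolean masks.

    predicted[i] is True if nucleotide i is predicted to lie in a TE,
    reference[i] is True if the reference annotation places it in a TE.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    tp = fp = tn = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and not r:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of reference TE nucleotides that were predicted."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of non-TE nucleotides that were left unannotated."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


if __name__ == "__main__":
    predicted = [True, True, False, False, True]
    reference = [True, False, False, True, True]
    tp, fp, tn, fn = confusion_counts(predicted, reference)
    print(tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
```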

22
(No Transcript)
23
(No Transcript)
25
Method for the Annotation of known TE families

BLASTER using BLASTN and MATCHER (BLRn)
RepeatMasker (RM)
RepeatMasker with MATCHER (RMm)
RepeatMasker-BLASTER (RMBLR): combines hits from
both BLRn and RM and passes them to MATCHER

26
Method for the Annotation of anonymous TE
families

all-by-all comparison with BLASTER using BLASTN,
MATCHER, and GROUPER
RECON
BLASTER using TBLASTX and MATCHER
TE-HMM

27
What they (we) learned

Overall, BLRn outperforms RM in the precise
determination of TE boundaries.
RM is more sensitive for the detection of small
and divergent TEs.
The differences between BLRn and RM make them
complementary for TE annotation.
A combined-evidence framework can improve the
quality and confidence of TE annotation.

28
Pipeline structure

TE detection software: BLASTER, RepeatMasker,
TE-HMM, and RECON
Tandem repeat detection software: RepeatMasker,
Tandem Repeats Finder (TRF), mreps
Database: MySQL
Batch system: Open Portable Batch System
The whole genomic sequence was segmented into
chunks of 200 kb overlapping by 10 kb.
The results from the different tools were stored
in the database.
An XML file is generated from the stored results
and loaded into the Apollo genome annotation tool.