Transposable Elements (TE) in genomic sequence

Provided by: dong98

1
Transposable Elements (TE) in genomic sequence
  • Mina Rho

2
Contents
  • Definition
  • De novo identification of repeat families in
    large genomes (RepeatScout)
  • Alkes L. Price, Neil C. Jones and Pavel A.
    Pevzner
  • Combined Evidence Annotation of Transposable
    Elements in Genome Sequences
  • Hadi Quesneville, Casey M. Bergman, Olivier
    Andrieu, Delphine Autard, Danielle Nouaud,
    Michael Ashburner, Dominique Anxolabehere

3
Mobile element/Transposable element
  • Transposon
  • - a segment of DNA that can move to different
    positions in the genome of a single cell.
  • - cut and paste: excised from its location and
    inserted into a new location.
  • - the mobile intermediate is DNA.
  • Retrotransposon
  • - copy and paste: a copy is inserted into a new
    location.
  • - the copy is first made as RNA and then
    reverse-transcribed into DNA by reverse
    transcriptase.
  • - LTR retrotransposons carry long terminal
    repeats (LTRs) at their ends.
  • => Studying TEs is expected to yield information
    about evolution, mutation, and changes in the
    amount of DNA in the genome.

6
RepeatScout
7
Definition
  • Repeat family: a collection of similar sequences
    that appear many times in a genome.
  • Example: the Alu repeat family has over 1 million
    approximate occurrences in the human genome.
  • Repeats make up about 50% of the human genome.
  • l-mer: a substring of length l.

8
Background
  • Current methods for identifying repeat families
  • Library-based (given an existing library of
    repeat families)
  • - RepeatMasker
  • De novo identification
  • - REPuter (Kurtz et al., 2000)
  • - RepeatFinder (Volfovsky et al., 2001)
  • - RECON (Bao and Eddy, 2002)
  • - RepeatGluer (Pevzner et al., 2004)
  • - PILER (Edgar and Myers, 2005)
  • - RepeatScout

9
Overview of RepeatScout
  • Method
  • Builds a table of high frequency l-mers as seeds
  • Extends each seed to a longer consensus sequence
  • Main advantage
  • an efficient method of similarity search which
    enables a rigorous definition of repeat
    boundaries.

10
How to create l-mer table
[Figure: a window slides along the sequence (positions i, i+1, i+2, ..., j, k). Each l-mer (l-mer1 ... l-mer6) is entered in a hash table that stores its frequency and the position of its last occurrence.]
11
Output of l-mer table (columns: l-mer, frequency, position of last occurrence)
  • AAAAAAAAAAAGATA 8 2920943
  • AAAAAAAGGAAAGAA 5 2468525
  • AGGCTTGAACAATGG 3 1425014
  • AAAAAAAAGAAAGAA 62 3009663
  • GTTGGTTTCAAAGAA 7 2855871
  • AAAAAAAATTTTTTT 22 2992836
  • ATTCAAGTTAAATGG 4 1473342
  • ATTCAATGTAACCAC 3 1463008
  • ATGCATGCAATGCAT 9 1788944
  • ATGCATTTAAAAGAA 3 1464381
  • AAAAAACTCACTCCA 5 1489159
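
A table like the one above can be sketched in a few lines of Python. This is a minimal illustration, not RepeatScout's implementation: it counts every overlapping occurrence on one strand only, and the function names `build_lmer_table` and `frequent_lmers` are made up for this example.

```python
def build_lmer_table(sequence, l):
    """Map each l-mer to [frequency, position of last occurrence].

    Simplification: every overlapping occurrence on the forward strand
    is counted; reverse complements are ignored.
    """
    table = {}
    for i in range(len(sequence) - l + 1):
        lmer = sequence[i:i + l]
        if lmer in table:
            table[lmer][0] += 1   # one more occurrence
            table[lmer][1] = i    # remember where it was last seen
        else:
            table[lmer] = [1, i]
    return table


def frequent_lmers(table, min_count):
    """Return (l-mer, frequency, last position) rows for frequent l-mers,
    most frequent first (ties broken alphabetically)."""
    rows = [(lmer, freq, pos) for lmer, (freq, pos) in table.items()
            if freq >= min_count]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    for lmer, freq, pos in frequent_lmers(build_lmer_table("ACGTACGTAC", 4), 2):
        print(lmer, freq, pos)
```

Each printed row has the same three columns as the slide: the l-mer, its frequency, and the position of its last occurrence.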

12
How to build all positions of repeats
[Figure: the hash table now maps each l-mer (l-mer1 ... l-mer6) to the list of all positions (i, i+2, j, k, ...) at which it occurs in the sequence.]
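
The position lists on this slide can be sketched as a second pass over the sequence. The helper name `index_lmer_positions` and the `min_count` cutoff are assumptions made for illustration. The slide does not say how RepeatScout stores the positions.

```python
from collections import defaultdict


def index_lmer_positions(sequence, l, min_count=2):
    """Map each l-mer that occurs at least min_count times to the sorted
    list of its start positions (forward strand only, overlaps allowed)."""
    positions = defaultdict(list)
    for i in range(len(sequence) - l + 1):
        positions[sequence[i:i + l]].append(i)
    # Keep only l-mers frequent enough to serve as seeds.
    return {lmer: pos for lmer, pos in positions.items()
            if len(pos) >= min_count}


if __name__ == "__main__":
    print(index_lmer_positions("ACGTACGTAC", 4))
```
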
13
Query sequence Q (seeded with l-mer1)
[Figure: Q aligned against its occurrences S1, S2, S3, S4, S5 in the genome.]
Q is extended one nucleotide at a time so as to
maximize the objective function.
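
The idea of growing Q one nucleotide at a time can be illustrated with a deliberately simplified sketch. It extends to the right only and scores occurrences without gaps, whereas RepeatScout extends in both directions and uses gapped fit-preferred alignment. The function name `extend_right`, the stopping rule (`patience`), and all numeric defaults are assumptions for this example, not RepeatScout's values.

```python
def extend_right(sequence, starts, l, c=2, p=3, match=1, mismatch=-1,
                 max_ext=100, patience=10):
    """Greedy, ungapped right extension of an l-mer seed.

    starts: start positions of the seed's occurrences.
    Each occurrence contributes max(running score, best score - p, 0),
    so an occurrence that stopped matching keeps its best score minus
    the incomplete-fit penalty p. The objective subtracts c per base of Q.
    """
    q = sequence[starts[0]:starts[0] + l]
    running = [l * match] * len(starts)   # all occurrences match the seed
    best = list(running)

    def objective(run, bst, q_len):
        return sum(max(r, b - p, 0) for r, b in zip(run, bst)) - c * q_len

    best_q, best_val, since_best = q, objective(running, best, len(q)), 0
    for step in range(max_ext):
        offset = l + step
        nxt = [sequence[s + offset] if s + offset < len(sequence) else None
               for s in starts]
        if all(b is None for b in nxt):
            break                          # every occurrence ran off the end
        choice = None
        for base in "ACGT":                # try each possible next base of Q
            run = [r + (match if b == base else mismatch)
                   for r, b in zip(running, nxt)]
            bst = [max(b, r) for b, r in zip(best, run)]
            val = objective(run, bst, len(q) + 1)
            if choice is None or val > choice[0]:
                choice = (val, base, run, bst)
        val, base, running, best = choice
        q += base
        if val > best_val:
            best_q, best_val, since_best = q, val, 0
        else:
            since_best += 1
            if since_best >= patience:     # objective stopped improving
                break
    return best_q                          # Q at the best objective value


if __name__ == "__main__":
    repeat = "ACGTTGCAAG"
    genome = repeat + "TTTTT" + repeat + "CCCCC" + repeat + "GGGGG"
    print(extend_right(genome, [0, 15, 30], 4))
```

In the toy genome the extension stops growing once the three copies diverge, so the returned Q ends at the repeat boundary.
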
14
Objective Function
  • |Q|: the length of Q
  • c: minimum threshold on the number of repeat
    elements
  • a(Q, Sk): the pairwise fit-preferred alignment
    score of Q and Sk
  • p: incomplete-fit penalty
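
The formula itself did not survive in the transcript. With the symbols listed on this slide, the objective given in the RepeatScout paper appears to have the following form (stated from memory of the paper, so treat it as an assumption to check against the original):

```latex
A(Q;\, S_1, \dots, S_n) \;=\; \sum_{k=1}^{n} \max\{\, a(Q, S_k),\; 0 \,\} \;-\; c\,|Q|
```

The penalty p does not appear explicitly because it is part of the fit-preferred score a(Q, Sk): it is charged when Sk aligns to only part of Q. The term c|Q| means an extra position of Q pays off only if enough occurrences align to it.
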
15
Output of optimized Q
  • >R0
  • GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGA
    GGCGGGCGGATCACTTGAGGTCAGGAGTTC
  • GAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAA
    AAATTAGCCGGGCGTGGTGGCGCGCGCCTG
  • TAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGG
    AGGCGGAGGTTGCAGTGAGCCGAGATCGCG
  • CCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAA
    AAAAAAAAAAAAAAAAAAA
  • >R1
  • AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTT
    TGAAGAGAGTAGTGGTTCTCCCAGCACGCA
  • GCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACC
    CCCGAGTAGCCTAACTGGGAGGCACCCCCC
  • AGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAA
    ACTTCCAGAGGAACAATCAGGCAGCAACAT
  • TTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCA
    GGCAAACAGGGTCTGGAGTGGACCTCCAGC
  • AAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACT
    AACAAACAGAAAGGACATCCACACCAAAAA
  • CCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAA
    AGATGGGGAAAAAACAGAGCAGAAAAACTG
  • GAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCC
    TCACCAGCAACGGAACAAAGCTGGACGGAG
  • AATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTC
    CAAGCTAAAGGAGGAAATTCAAACCCATGG
  • CAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAA
    TAACCAATGCAGAGAAGTCCTTAAAGGAGC
  • TGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGC
    CTCAGGAGCCGATGCGATCAACTGGAAGAA
  • AGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAA
    GTTTAGAGAAAAAAGAATAAAAAGAAATGA
  • >R2
  • TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTC
    TCAAACTCCTGGGCTCAAGTGATCCTCCCA

16
Parameter setting and post processing
  • Parameter setting
  • - Recommended smallest l: 15
  • - For an arbitrary sequence length L, l is set as
    a function of L (formula not in transcript)
  • - Q is extended up to 10,000 bp on each side
  • - Remove repeat families with |Q| < 50 bp
  • Postprocessing
  • - Tandem Repeats Finder, NSEG
  • - Remove repeat families with > 50% of their
    length annotated as low-complexity or tandem
    repeats
  • RepeatMasker
  • - Mask the genome using the resulting library of
    repeat families
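
The two removal rules above amount to a simple filter. The sketch below assumes the number of low-complexity or tandem-repeat bases per family has already been obtained from external tools. The function name `filter_families` and the tuple layout are hypothetical.

```python
def filter_families(families, min_length=50, max_masked_fraction=0.5):
    """Keep repeat families that pass both post-processing rules.

    families: iterable of (name, consensus, masked_bases), where
    masked_bases is the number of consensus bases annotated as
    low-complexity or tandem repeat (computed elsewhere).
    """
    kept = []
    for name, consensus, masked_bases in families:
        if len(consensus) < min_length:
            continue                      # rule 1: |Q| < 50 bp
        if masked_bases / len(consensus) > max_masked_fraction:
            continue                      # rule 2: > 50% low-complexity/tandem
        kept.append(name)
    return kept


if __name__ == "__main__":
    demo = [("R0", "ACGT" * 25, 10),      # 100 bp, 10% masked: kept
            ("R1", "ACGT" * 10, 0),       # 40 bp: too short
            ("R2", "A" * 80, 60)]         # 75% masked: removed
    print(filter_families(demo))
```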

17
Benchmark
  • C. briggsae genome (108 Mb)
  • Running time: 7 h on a single 0.5 GHz DEC Alpha
    processor

18
Combined evidence model of TE
19
Overview
  • Query sequences: Drosophila melanogaster (fruit
    fly) genome, Releases 3 and 4
  • Combined evidence model: a pipeline of
    RepeatMasker, BLASTER, TBLASTX, all-by-all
    BLASTN, RECON, and TE-HMM
  • - Methods for the annotation of known TE
    families
  • - Methods for the annotation of anonymous TE
    families
  • Benchmark: FlyBase Release 3.1 annotation
  • Evaluation: sensitivity, specificity, and
    characteristics of the predicted TE boundaries

20
Tools
  • BLASTER
  • - Compares query sequences against a subject
    databank.
  • - Launches one of the BLAST programs (BLASTN,
    TBLASTN, BLASTX, TBLASTX).
  • - Cuts long sequences before launching BLAST and
    reassembles the results.
  • MATCHER
  • - Maps match results onto query sequences by
    filtering overlapping hits.
  • - Keeps the match results with E-value < 1e-10
    and length > 20.
  • - Chains the remaining matches by dynamic
    programming.
  • GROUPER
  • - Gathers similar sequences into groups.
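
The MATCHER thresholds quoted above can be illustrated with a small filter. This covers only the E-value and length cutoffs, not the overlap filtering or the dynamic-programming chaining. The `Hit` record and `filter_hits` function are hypothetical and do not reflect MATCHER's actual data structures.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A match on the query; coordinates are half-open [start, end)."""
    query_start: int
    query_end: int
    subject: str
    e_value: float


def filter_hits(hits, max_e_value=1e-10, min_length=20):
    """Keep hits with E-value < max_e_value and length > min_length."""
    return [h for h in hits
            if h.e_value < max_e_value
            and (h.query_end - h.query_start) > min_length]


if __name__ == "__main__":
    hits = [Hit(0, 500, "TE_A", 1e-50),    # kept
            Hit(600, 615, "TE_B", 1e-30),  # too short
            Hit(700, 900, "TE_C", 1e-5)]   # E-value too high
    for hit in filter_hits(hits):
        print(hit)
```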

21
Measures
  • For each nucleotide:
  • - TP: correctly annotated as belonging to a TE
  • - FP: falsely predicted as belonging to a TE
  • - TN: correctly annotated as not belonging to a TE
  • - FN: falsely predicted as not belonging to a TE
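
The per-nucleotide counts can be turned into sensitivity and specificity as sketched below. The slide does not give the formulas, so the common definitions TP/(TP+FN) and TN/(TN+FP) are assumed here and may differ from those used in the paper. The function names are illustrative.

```python
def confusion_counts(reference, predicted):
    """Count TP, FP, TN, FN over two equal-length per-nucleotide masks
    (truthy = nucleotide belongs to a TE)."""
    if len(reference) != len(predicted):
        raise ValueError("masks must have the same length")
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if pred and ref:
            tp += 1
        elif pred and not ref:
            fp += 1
        elif not pred and not ref:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true TE nucleotides that were predicted (assumed definition)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-TE nucleotides left unannotated (assumed definition)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


if __name__ == "__main__":
    ref = [1, 1, 1, 0, 0, 0, 0, 0]
    pred = [1, 1, 0, 1, 0, 0, 0, 0]
    tp, fp, tn, fn = confusion_counts(ref, pred)
    print(tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
```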

25
Method for the Annotation of known TE families
  • BLASTER using BLASTN and MATCHER (BLRn)
  • RepeatMasker (RM)
  • RepeatMasker with MATCHER (RMm)
  • RepeatMasker-BLASTER (RMBLR): combines the hits
    from both BLRn and RM and passes them to MATCHER

26
Method for the Annotation of anonymous TE
families
  • all-by-all comparison with BLASTER using BLASTN,
    MATCHER, and GROUPER
  • RECON
  • BLASTER using TBLASTX and MATCHER
  • HMM

27
What they (we) learned
  • Overall, BLRn outperforms RM with respect to the
    precise determination of TE boundaries.
  • RM is more sensitive for the detection of small
    and divergent TEs.
  • The differences between BLRn and RM make them
    complementary for TE annotation.
  • A combined-evidence framework can improve the
    quality and confidence of TE annotation.

28
Pipeline structure
  • TE detection software: BLASTER, RepeatMasker,
    TE-HMM, and RECON
  • Tandem repeat detection software: RepeatMasker,
    Tandem Repeats Finder (TRF), Mreps
  • Database: MySQL
  • Job scheduling: Open Portable Batch System
  • The whole genomic sequence was segmented into
    chunks of 200 kb overlapping by 10 kb.
  • The results from the different tools were stored
    in the database.
  • An XML file is generated from the stored results
    and loaded into the Apollo genome annotation tool.
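
The segmentation step can be sketched as follows. Only the 200 kb chunk size and 10 kb overlap come from the slide. The function name `chunk_coordinates` and the half-open coordinate convention are assumptions.

```python
def chunk_coordinates(seq_length, chunk_size=200_000, overlap=10_000):
    """Return half-open (start, end) coordinates of chunks that cover a
    sequence of seq_length bases, each overlapping the next by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap           # distance between chunk starts
    chunks = []
    start = 0
    while start < seq_length:
        end = min(start + chunk_size, seq_length)
        chunks.append((start, end))
        if end == seq_length:
            break                         # last chunk reaches the end
        start += step
    return chunks


if __name__ == "__main__":
    for start, end in chunk_coordinates(500_000):
        print(start, end)
```

Because consecutive chunks share 10 kb, a TE that straddles a chunk boundary is still seen whole in at least one chunk as long as it is shorter than the overlap.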

29
The Annotation Pipeline