Genomic Repeat Visualisation Using Suffix Arrays

1
Genomic Repeat Visualisation Using Suffix Arrays
  • Nava Whiteford
  • Department of Chemistry
  • University of Southampton
  • new_at_soton.ac.uk

2
Repeat Visualisation Using Suffix Arrays
  • The Analysis
  • Artificial Sequences
  • Genomic Sequences
  • The Algorithm
  • Larger Sequences
  • Non-genomic sequences

3
The repeatscore plot
  • A sliding window is run over the entire sequence
    to divide it into all substrings of a given
    length (in this case, 2).
  • ATGCATATA
  • AT TG GC CA AT TA AT TA

14
The repeatscore plot
  • A sliding window is run over the entire sequence
    to divide it into all substrings of a given
    length (in this case, 2).
  • ATGCATATA
  • AT TG GC CA AT TA AT TA

AT occurs 3 times
TG occurs 1 time
GC occurs 1 time
CA occurs 1 time
TA occurs 2 times
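The windowing and counting step above can be sketched in a few lines of Python. This is a minimal illustration and not the presenter's code; the function names `sliding_windows` and `count_windows` are made up for this example.

```python
from collections import Counter

def sliding_windows(sequence, k):
    """Return every length-k substring, in order of starting position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def count_windows(sequence, k):
    """Count how many times each length-k substring occurs."""
    return Counter(sliding_windows(sequence, k))

windows = sliding_windows("ATGCATATA", 2)
counts = count_windows("ATGCATATA", 2)

print(" ".join(windows))
for kmer, n in counts.items():
    print(f"{kmer} occurs {n} time(s)")
```

For the slide's example sequence this gives the same eight windows and the same five counts listed above.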
16
The repeat-score plot
17
The repeat-score plot
  • The resulting matrix is then plotted as an image
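The transcript does not say how the matrix is laid out. The sketch below assumes one row per window length and one column per start position, with each cell holding the number of times the substring starting at that position occurs in the whole sequence. That layout, and the name `repeat_score_matrix`, are assumptions made for illustration only.

```python
from collections import Counter

def repeat_score_matrix(sequence, max_k):
    """Build a matrix of repeat counts (assumed layout, see lead-in).

    Row k-1 holds, for each start position, how often the length-k
    substring starting there occurs in the sequence. Positions where a
    full window no longer fits are filled with 0.
    """
    n = len(sequence)
    matrix = []
    for k in range(1, max_k + 1):
        counts = Counter(sequence[i:i + k] for i in range(n - k + 1))
        row = [counts[sequence[i:i + k]] if i + k <= n else 0
               for i in range(n)]
        matrix.append(row)
    return matrix

matrix = repeat_score_matrix("ATGCATATA", 3)
for row in matrix:
    print(" ".join(str(value) for value in row))
```

A matrix like this could then be handed to an image-plotting routine (for example matplotlib's `imshow`) so that higher counts appear as brighter pixels.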

18
Repeatscore plots of Artificial Sequences
  • Small repeats

Reverse strand is also included
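The slide only states that the reverse strand is included. The sketch below assumes this means the reverse complement of the sequence and that its windows are simply added to the same counts. Both points are assumptions, as are the function names.

```python
from collections import Counter

# Watson-Crick base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]

def count_both_strands(sequence, k):
    """Count length-k substrings on the forward and reverse strands together."""
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        counts.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    return counts

print(reverse_complement("ATGCATATA"))
print(count_both_strands("ATGCATATA", 2))
```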
19
Random Sequences
20
DNA Sequences
  • The language of life
  • Composed of four different bases A, T,
    G and C
  • Sequences range in size from 2000 bp to 670
    billion bp.

21
Small Genomic Sequences
Lambda Phage
22
Small Genomic Sequences
Lambda Phage
Random Sequence
23
E. coli
24
E. coli
25
E. coli
Sequences coding for rRNA
Known intergenic repeat elements
26
E. coli
27
Repeats in Genomic Sequences
28
A linear-time algorithm
  • The plots shown would take hours to construct
    using traditional methods.
  • The algorithms those methods use do not scale
    linearly with sequence length.
  • It is not feasible to create these plots for large
    sequences unless more advanced algorithms are
    used.

29
The suffix array
  • Original string: banana

All suffixes
  • banana
  • anana
  • nana
  • ana
  • na
  • a
30
The suffix array
  • Original string: banana

All suffixes
  • banana
  • anana
  • nana
  • ana
  • na
  • a

In sorted order
  • a
  • ana
  • anana
  • banana
  • na
  • nana
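A suffix array stores the starting positions of the sorted suffixes rather than the suffixes themselves. The sketch below builds one by plain sorting, which is easy to read but is not the linear-time construction the talk refers to; the function name `suffix_array` is chosen for this example.

```python
def suffix_array(text):
    """Starting indices of all suffixes, in lexicographic order.

    Naive construction: sorts by comparing whole suffixes, so it is far
    slower than linear time on long inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

text = "banana"
sa = suffix_array(text)

print(sa)
print([text[i:] for i in sa])
```

For "banana" the positions come out as 5, 3, 1, 0, 4, 2, which correspond to the sorted suffixes a, ana, anana, banana, na, nana shown on the slide.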
31
Generating the repeatscore plot
a ana anana banana na nana
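Once the suffixes are sorted, every occurrence of a given substring sits in one contiguous run, so counts can be read off in a single pass. The sketch below shows that idea under the assumption that this is roughly how the sorted list is used; the transcript does not give the actual procedure, and `repeat_runs` is a hypothetical name.

```python
def repeat_runs(text, k):
    """Group start positions by their length-k substring using sorted suffixes.

    Suffixes sharing the same first k characters are adjacent in sorted
    order, so one pass over the suffix array finds every repeat.
    """
    # Naive suffix array; a real tool would use a faster construction.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    runs = []
    for start in sa:
        kmer = text[start:start + k]
        if len(kmer) < k:
            continue  # suffix shorter than the window, no full substring here
        if runs and runs[-1][0] == kmer:
            runs[-1][1].append(start)
        else:
            runs.append((kmer, [start]))
    return runs

for kmer, positions in repeat_runs("banana", 2):
    print(kmer, "occurs", len(positions), "time(s) at", sorted(positions))
```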
33
Whole human genome
34
Whole human genome
35
Whole human genome
36
Human Chromosome 18
37
Arabidopsis thaliana chromosome 1, coding region
38
Fibonacci derived sequences
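The transcript does not say how these sequences were derived. One common construction is the Fibonacci word, where each word is the previous word followed by the one before it. The sketch below assumes that construction and an arbitrary choice of the letters A and T; neither is confirmed by the slides.

```python
def fibonacci_word(n, first="A", second="AT"):
    """Return the n-th Fibonacci word (n = 0 gives `first`).

    Each word is the previous word followed by the word before it, so
    the lengths follow the Fibonacci numbers. The starting letters are
    placeholders, not taken from the talk.
    """
    prev, curr = first, second
    for _ in range(n):
        prev, curr = curr, curr + prev
    return prev

for n in range(5):
    print(n, fibonacci_word(n))
```

Such a sequence is highly repetitive without being periodic, which makes it a useful artificial test input for a repeat plot.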
39
Gallus gallus chromosome 20
40
Application to other sequences
  • Analysing writing styles
  • Finding plagiarised text
  • Any sequence that may contain motif based,
    language like structure.

41
Shakespeare
42
Text document containing the text "The quick
brown fox jumped over the lazy dog" 16 times.
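The same window counting works on ordinary text by treating each character as a symbol. The sketch below rebuilds a test document like the one described, assuming the 16 copies are separated by single spaces (the slide does not say), and counts windows the length of the phrase.

```python
from collections import Counter

phrase = "The quick brown fox jumped over the lazy dog"
# Assumed layout: 16 copies joined by single spaces.
document = " ".join([phrase] * 16)

k = len(phrase)
counts = Counter(document[i:i + k] for i in range(len(document) - k + 1))

print(counts[phrase])
```

Windows that straddle two copies of the phrase also repeat (15 times each here), which is why a repeated passage would be expected to stand out across a whole range of positions rather than at a single point.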
43
On the Economy of Machinery and Manufactures
by Charles Babbage with an artificial repeat
inserted 16 times.
45
Conclusion
  • This new visualisation technique can highlight
    repeat structure in sequences.
  • In genomic sequences this may be useful in
    generating annotation.
  • There are applications in other areas worth
    pursuing.
  • Our next step is to allow the repeatscore plot to
    be easily interrogated by a user in order to
    better understand the repeat structure.