Title: Genomic Repeat Visualisation Using Suffix Arrays
1Genomic Repeat Visualisation Using Suffix Arrays
- Nava Whiteford
- Department of Chemistry
- University of Southampton
- new_at_soton.ac.uk
2Repeat Visualisation Using Suffix Arrays
- The Analysis
- Artificial Sequences
- Genomic Sequences
- The Algorithm
- Larger Sequences
- Non-genomic sequences
3The repeatscore plot
- A sliding window is run over the entire sequence
to divide it into all substrings of a given
length (in this case 2).
- ATGCATATA
- AT TG GC CA AT TA AT TA
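The windowing step above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the plots; the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(sequence, length):
    """Return every substring of the given length, in order of start position."""
    # The last valid start position is len(sequence) - length.
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


print(sliding_windows("ATGCATATA", 2))
# ['AT', 'TG', 'GC', 'CA', 'AT', 'TA', 'AT', 'TA']
```

A sequence of n symbols yields n - length + 1 windows, so the 9-base example gives 8 substrings of length 2.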
14The repeatscore plot
- The number of times each substring occurs is then counted.
- AT TG GC CA AT TA AT TA
- AT occurs 3 times
- TG occurs 1 time
- GC occurs 1 time
- CA occurs 1 time
- TA occurs 2 times
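The counting step can be sketched with Python's standard `collections.Counter`. This is a simple direct count for illustration, not the suffix array method; `substring_counts` is a hypothetical name.

```python
from collections import Counter


def substring_counts(sequence, length):
    """Count how often each substring of the given length occurs."""
    windows = (sequence[i:i + length] for i in range(len(sequence) - length + 1))
    return Counter(windows)


counts = substring_counts("ATGCATATA", 2)
for substring, n in counts.items():
    print(f"{substring} occurs {n} time(s)")
```

For ATGCATATA this reproduces the counts on the slide: AT 3, TG 1, GC 1, CA 1, TA 2.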
16The repeat-score plot
17The repeat-score plot
- The resulting matrix is then plotted as an image
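One possible way to assemble such a matrix is sketched below. The layout is an assumption, since the slides do not spell it out: one row per window length, one column per start position, and each cell holding the number of times the substring starting there occurs in the whole sequence. The name `repeat_score_matrix` is hypothetical.

```python
from collections import Counter


def repeat_score_matrix(sequence, max_length):
    """Build one row per window length (1..max_length), one column per position.

    ASSUMPTION: each cell holds the occurrence count of the substring that
    starts at that position; positions where the window does not fit get 0.
    """
    matrix = []
    for length in range(1, max_length + 1):
        last_start = len(sequence) - length + 1
        counts = Counter(sequence[i:i + length] for i in range(last_start))
        row = [counts[sequence[i:i + length]] if i < last_start else 0
               for i in range(len(sequence))]
        matrix.append(row)
    return matrix


for row in repeat_score_matrix("ATGCATATA", 3):
    print(row)
```

The nested list could then be handed to any image-plotting routine (for example matplotlib's `imshow`) to colour each cell by its count.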
18Repeatscore plots of Artificial Sequences
Reverse strand is also included
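Including the reverse strand can be sketched as follows. This assumes that "reverse strand" means the reverse complement of the DNA sequence and that counts from both strands are simply added; the function names are hypothetical.

```python
# Map each base to its complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse strand, read in its own 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def count_occurrences(sequence, pattern):
    """Count occurrences of pattern, allowing overlaps."""
    k = len(pattern)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == pattern)


def count_both_strands(sequence, pattern):
    """ASSUMPTION: the score is the forward count plus the reverse-strand count."""
    return (count_occurrences(sequence, pattern)
            + count_occurrences(reverse_complement(sequence), pattern))


print(reverse_complement("ATGCATATA"))
print(count_both_strands("ATGCATATA", "AT"))
```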
19Random Sequences
20DNA Sequences
- The language of life
- Composed of four different bases: A, T, G and C.
- Sequences range in size from 2000 bp to 670 billion bp.
21Small Genomic Sequences
Lambda Phage
22Small Genomic Sequences
Lambda Phage
Random Sequence
23E. coli
24E. coli
25E. coli
Sequences coding for rRNA
Known inter-genic repeat elements
26E. coli
27Repeats in Genomic Sequences
28A Linear time algorithm
- The plots shown would take hours to construct using traditional methods.
- The algorithms used would not scale linearly.
- It is not feasible to create these plots on large sequences unless more advanced algorithms are used.
29The suffix array
All suffixes
- banana
- anana
- nana
- ana
- na
- a
30The suffix array
The same suffixes in sorted order
- a
- ana
- anana
- banana
- na
- nana
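A suffix array stores the start positions of the suffixes in sorted order. The sketch below builds one by sorting whole suffixes, which is easy to follow but not linear time; it illustrates the data structure only and is not the construction used for the plots. The name `suffix_array` is hypothetical.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Sorting whole suffixes is simple but slow on long inputs; linear-time
    constructions exist and would be needed for genome-sized sequences.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana"
for start in suffix_array(text):
    print(start, text[start:])
```

For "banana" the positions come out as 5, 3, 1, 0, 4, 2, matching the sorted list a, ana, anana, banana, na, nana.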
31Generating the repeatscore plot
a ana anana banana na nana
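In a sorted suffix list, all suffixes that begin with the same substring sit next to each other, so substring counts can be read off as run lengths. The sketch below is one plausible reading of this step, not the exact procedure behind the plots; the suffix array is built naively and `counts_from_suffix_array` is a hypothetical name.

```python
from itertools import groupby


def counts_from_suffix_array(text, length):
    """Count every substring of the given length using sorted suffixes.

    ASSUMPTION: grouping adjacent suffixes by their prefix is one way to get
    the counts; the slides do not show the exact procedure.
    """
    # Naive suffix array: start positions sorted by the suffix they begin.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # Suffixes shorter than the window cannot start a full substring.
    prefixes = [text[i:i + length] for i in order if i + length <= len(text)]
    # Equal prefixes are adjacent, so each run length is an occurrence count.
    return {prefix: len(list(run)) for prefix, run in groupby(prefixes)}


print(counts_from_suffix_array("banana", 2))
# {'an': 2, 'ba': 1, 'na': 2}
```

Once the array is built, a single pass over it gives the counts for a given window length.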
33Whole human genome
34Whole human genome
35Whole human genome
36Human Chromosome 18
37Arabidopsis thaliana chromosome 1, coding region
38Fibonacci derived sequences
39Gallus gallus chromosome 20
40Application to other sequences
- Analysing writing styles
- Finding plagiarised text
- Any sequence that may contain motif based,
language like structure.
41Shakespeare
42Text document containing the text "The quick
brown fox jumped over the lazy dog" 16 times.
43On the Economy of Machinery and Manufactures
by Charles Babbage with artificial repeat
inserted 16 times.
45Conclusion
- This new visualisation technique can highlight repeat structure in sequences.
- In genomic sequences this may be useful in generating annotation.
- There are applications in other areas worth pursuing.
- Our next step is to allow the repeatscore plot to be easily interrogated by a user in order to better understand the repeat structure.