Title: Investigating effective and novel techniques to conduct pairwise alignment of very large genomic seq
1 Investigating effective and novel techniques to
conduct pairwise alignment of very large genomic
sequences. William D. Kenworthy Semester 2, 2003
2 Overview
- Aims of the project
- Background to genomic sequence alignment process
- Literature Review
- Proposed new approach
- Feature Based Sequence Alignment algorithm (FBSA)
- Results and discussion
- Conclusions and future directions
3 Aims
- The aims of this project:
- Investigate the application of existing
algorithms to very large genomic sequence
alignment
- Propose a novel technique for aligning very large
sequences
- Evaluate the performance of the proposed system
4 Background
Simple Representation of the Human Genome
- Human genome (chromosomes 1 to 22, X and Y): 3000 million nucleotides
- Chromosome 6: 170 million nucleotides
- Chromosome 6p21: 19,894,959 nucleotides
- 3,246 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
- 1,082 amino acids
MTDDKDVLRDVWFGRIPTCFTLYQDEITEREAE...
- 1 or more proteins
5 Background
Genomic Features
- Various features exist in genomic sequences
- Features such as genes are often fragmented and
scattered across a local area of the sequence
One gene
19,894,959 nucleotides
3,246 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
- Nucleotides, the building blocks of DNA
(deoxyribonucleic acid), are each represented by
one of the letters A, C, G or T
6 Background
Alignment
- Sequence alignment involves comparing two or more
sequences alongside each other in order to
examine them for differences and similarities.
Individual 1
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
TCAGGACCGCGACATGCCCGC--AGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
Individual 2
- Common algorithmic approaches are concerned with
attempting to insert gaps in order to align the
symbols into their correct evolutionary position
7 Literature Review
- A number of algorithms are currently in use for
aligning small to medium genomic sequences (up to
a few thousand nucleotides in size)
- Smith and Waterman is considered the most accurate
- Improvements by Gotoh, Miller and Webb have
further refined the basic algorithm
- BLAST (Basic Local Alignment Search Tool) is
a fast, heuristic local alignment algorithm
- FASTA is another heuristic local alignment
algorithm
- These are all local alignment algorithms and are
concerned with finding the best local matches
between areas of high similarity
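To make the idea of a local alignment concrete, here is a minimal Python sketch of the basic Smith and Waterman recurrence with a linear gap penalty. The function name and the scoring values (match 2, mismatch -1, gap -1) are assumptions chosen for illustration, and the sketch omits the refinements (such as affine gaps) credited to later authors.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGACGTCC", "TTACGTAA"))
```

The matrix has (n+1) x (m+1) cells, which is why this approach is practical for sequences of a few thousand nucleotides but not for very large ones.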
8 Many possible alignments
- It has been calculated that there is a very large
number of possible alignments between any two
sequences of length n
- Problem: long sequences (> 1000 symbols)
can take a very long time and a large amount of
resources to align (accurate or not!).
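As a rough illustration of the scale involved, the sketch below uses one common textbook convention that counts the alignments of two sequences of length n as the binomial coefficient C(2n, n). The slide does not say which count was meant, so this choice is an assumption. The second helper (a hypothetical name) gives the number of cells a full dynamic programming matrix would need. `math.comb` requires Python 3.8 or later.

```python
import math


def alignment_count(n):
    """Alignments of two length-n sequences, counted as C(2n, n).

    This is one common convention; other definitions give other counts.
    """
    return math.comb(2 * n, n)


def dp_cells(n, m):
    """Cells in a full dynamic programming matrix for lengths n and m."""
    return (n + 1) * (m + 1)


for n in (2, 10, 100):
    print(n, alignment_count(n))

# Two sequences of 19,894,959 nucleotides each (the size shown on the
# background slides) would need roughly 4 x 10^14 matrix cells.
print(dp_cells(19_894_959, 19_894_959))
```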
9 Literature Review
Aligning very large sequences
- proposals based on patterns or motifs
- a group of heuristic algorithms has been
developed.
- SSAHA, MUMmer, PipMaker/BLASTZ, ...
- These involve various methods of calculating
unique (often overlapping) motifs to characterise
the sequences
- These motifs are then assembled to create the
final alignment
- In some cases, a secondary alignment is done to
improve the accuracy.
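The general idea of characterising sequences by unique motifs can be sketched as follows. This is a generic illustration that uses fixed-length substrings (k-mers) occurring exactly once in each sequence as anchors. It is not the method of any of the tools named above, and the function names are hypothetical.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def shared_anchors(a, b, k):
    """Return (pos_in_a, pos_in_b, kmer) for k-mers unique in both sequences."""
    ua, ub = unique_kmers(a, k), unique_kmers(b, k)
    return sorted((ua[m], ub[m], m) for m in ua.keys() & ub.keys())


print(shared_anchors("ACGTTGCA", "TTACGTAA", 4))
```

A real tool would then chain compatible anchors and align the regions between them. Note that the anchors here are chosen purely by string content, with no knowledge of biological features.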
10 Literature Review
Assemble the motifs
Individual 1
19,894,959 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
19,894,959 nucleotides
Individual 2
11 Literature Review
Disadvantages of motif-based approaches
- although fast and efficient in resource usage
- but
- misalignment can and does occur
- motifs are often small and overlapping, so gains
are sometimes small
- Some are very heuristic in their approach:
accuracy suffers so much that a second stage of
alignment is required.
- Not aware of biological features in a sequence, so
can distort them (don't ignore features)
12 Proposed new approach to aligning very large
sequences
- Genomic sequences already contain features such
as repetitive elements, genes, low complexity
regions etc.
- This is the basis of the new approach, the "Feature
Based Sequence Alignment process" or "FBSA"
- These features characterise the sequence and,
intuitively, can be seen to be conserved across
two related sequences.
- Instead of discarding them as the existing
alignment process dictates, we include them in
the process!
13 FBSA
Genomic sequences already contain natural motifs
...
- Genes and other features are already present, and
can be used to guide the alignment process
Individual 1
19,894,959 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
19,894,959 nucleotides
Individual 2
14 Hypothesis
The hypothesis for this project is that "it is
possible to devise an approach for aligning very
large genetic sequences in a fast, efficient and
intuitive manner by making use of existing
information which is held within the genetic
structure evidenced by the sequences under
examination, to guide the alignment process."
15 The Plan
- Investigate the features that best characterise
the physical structure of a large genetic
sequence.
- Investigate methods to extract and use the most
suitable features to guide the alignment process.
- Design and test new algorithms based on features
- Examine the results and report on findings
16 Why repetitive elements?
- It became apparent that the class of features
known as "repetitive elements" was an ideal
subject.
- Very common in many genomes (Human, Mouse, Rice)
- Size is generally a few hundred base pairs
- Class includes "Low Complexity Repeats"
- Software to identify these elements is part of
any standard alignment process (which normally
results in the removal of the identified
elements)
17 Low Complexity Regions
- Areas of simple repetitive patterns create
difficulties for algorithmic approaches to
alignment.
- AAAAAAATAAAAAAATAAAAAAAT
- Problem: common algorithms require such regions
to be removed before attempting alignment to
improve performance and accuracy.
- FBSA can make use of this information
18 FBSA algorithm
Identify features (e.g., using RepeatMasker)
in both original sequences
- RepeatMasker is a software program used to
identify features by using a stored lookup table
- The output files are used by most standard
algorithms to "mask" out and remove repetitive
elements and low complexity regions
19 FBSA algorithm
Create a symbol table from the identified
features
- A symbol table is created to map symbols to
different types of feature
- Computers manipulate symbols, not molecules!
- The symbols used are integers
- easy and simple to manipulate
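A minimal sketch of such a symbol table, assuming each feature is identified by a name string and assigned the next unused integer on first appearance. The function name and the example feature names are illustrative assumptions, not the project's actual data.

```python
def build_symbol_table(feature_names):
    """Assign a distinct integer symbol to each distinct feature name.

    Symbols start at 1 and follow order of first appearance (an assumption).
    """
    table = {}
    for name in feature_names:
        if name not in table:
            table[name] = len(table) + 1
    return table


# Feature names drawn from both sequences, so they share one table.
names = ["AluSx", "L1", "AluSx", "(TA)n"]
print(build_symbol_table(names))
```

Building one table from the features of both sequences ensures the same feature type receives the same integer in each.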
20 FBSA algorithm
Create feature sequence files for repeats
identified
- A vector is created for each sequence with
symbols representing each of the features
identified in the original sequences, in the
correct order of appearance.
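The feature vector described here can be sketched as below, assuming each identified feature is available as a hypothetical (start, end, name) record and that a symbol table mapping names to integers already exists. The record layout and the example values are assumptions for illustration.

```python
def feature_vector(features, symbol_table):
    """Return the integer symbols of the features in order of appearance.

    features: iterable of (start, end, name) records (assumed layout).
    symbol_table: dict mapping feature name to integer symbol.
    """
    ordered = sorted(features, key=lambda record: record[0])
    return [symbol_table[name] for _start, _end, name in ordered]


table = {"AluSx": 1, "L1": 2, "(TA)n": 3}
features = [(900, 1200, "L1"), (100, 400, "AluSx"), (1500, 1530, "(TA)n")]
print(feature_vector(features, table))
```

A sequence of millions of nucleotides is thereby reduced to a vector with one entry per feature.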
21 FBSA algorithm
Align feature files using an adapted Smith and
Waterman algorithm.
- The core of the new algorithm
- The Smith and Waterman algorithm has been adapted
to use integer symbols instead of alpha characters
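One way such an adaptation might look is sketched below: the same local alignment recurrence, applied to lists of integers and returning the index pairs of matched features. The scoring values and the function name are assumptions, and the project's actual adaptation may differ.

```python
def align_features(x, y, match=2, mismatch=-1, gap=-1):
    """Locally align two integer feature vectors.

    Returns (score, pairs), where pairs lists (index_in_x, index_in_y)
    for every position where identical symbols were aligned.
    """
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back, recording only the positions where symbols match.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            if x[i - 1] == y[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Feature 5 appears only in the second vector and is skipped with a gap.
print(align_features([1, 2, 3, 4], [1, 2, 5, 3, 4]))
```

Because the inputs are feature vectors rather than nucleotide strings, the matrix stays small even for very large sequences.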
22 FBSA algorithm
Subdivide both original input sequences on
identified feature boundaries
- Each matched pair of features, AND each pair of
corresponding areas between features, is generated
as an individual pair of sub-sequences.
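The subdivision step can be sketched as follows, under simplifying assumptions: features are given as non-overlapping half-open (start, end) intervals, and every feature in one sequence has a matched partner in the other, in the same order. How unmatched features are handled is not described on the slides, so it is left out. All names are hypothetical.

```python
def split_on_features(seq, features):
    """Cut seq into alternating 'between' and 'feature' segments.

    features: non-overlapping half-open (start, end) intervals.
    Empty 'between' segments are kept so both sequences stay in step.
    """
    segments = []
    cursor = 0
    for start, end in sorted(features):
        if start < cursor or end < start or end > len(seq):
            raise ValueError("features must be non-overlapping and in range")
        segments.append(("between", seq[cursor:start]))
        segments.append(("feature", seq[start:end]))
        cursor = end
    segments.append(("between", seq[cursor:]))
    return segments


def pair_segments(seq_a, feats_a, seq_b, feats_b):
    """Pair up corresponding segments of two sequences."""
    segs_a = split_on_features(seq_a, feats_a)
    segs_b = split_on_features(seq_b, feats_b)
    if len(segs_a) != len(segs_b):
        raise ValueError("sequences must have the same number of features")
    return [(kind, a, b) for (kind, a), (_kind, b) in zip(segs_a, segs_b)]


for pair in pair_segments("AAAACCCCGG", [(4, 8)], "AACCCCGGGG", [(2, 6)]):
    print(pair)
```

Each resulting pair is small enough to hand to a standard alignment program.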
23 FBSA algorithm
Align sub-sequences (e.g., using BLAST, FASTA,
SSearch, or ClustalW)
- Each individual pair of sub-sequences is aligned
using a standard algorithm (ClustalW was used in
this project)
24 FBSA algorithm
Identify features (e.g., using RepeatMasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
Subdivide both original input sequences on
identified feature boundaries
Align sub-sequences (e.g., using BLAST, FASTA,
SSearch, or ClustalW)
Assemble results and prepare for viewing or
further processing
- Using information from the symbol table and
RepeatMasker lists, reassemble the individual,
and now aligned, sub-sequences into a
super-sequence in one of the standard formats
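The final assembly step amounts to concatenating the aligned sub-sequence pairs in their original order. A minimal sketch, assuming each pair has already been aligned so that both members have equal length; the function name and the example data are illustrative assumptions.

```python
def assemble(aligned_pairs):
    """Concatenate aligned sub-sequence pairs into one super-alignment.

    aligned_pairs: list of (aligned_a, aligned_b) strings in original
    order; each pair must already be padded with gaps to equal length.
    """
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("each aligned pair must have equal length")
    super_a = "".join(a for a, _b in aligned_pairs)
    super_b = "".join(b for _a, b in aligned_pairs)
    return super_a, super_b


pairs = [("AAAA", "AA--"), ("CCCC", "CCCC"), ("GG--", "GGGG")]
top, bottom = assemble(pairs)
print(top)
print(bottom)
```

Writing the two resulting rows out in a standard alignment format is then a separate formatting step.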
25 FBSA
Nucleotides are now pre-aligned
- Gaps have been inserted at natural boundaries to
put orthologous features in the correct
relationship
Individual 1
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
Individual 2
26 Characteristics
- The resulting super-alignment has a number of
desirable properties
- major insertions and deletions are made at a high
level, and at a biologically significant boundary,
not in the middle of important features as
machine generated motif based methods may do.
- The pre-aligned status of each sub-sequence allows
one of the standard dynamic programming algorithms
to be used closer to its area of most efficient
operation.
- Assembly of the sub-sequences is accurate because
it is under the control of the original
information.
27 Performance Evaluation
Quality
- Measurement of alignment quality is usually a
combination of
- Number of gaps
- Number of matches
- Percent identity
- Measured by EMBOSS "infoalign"
- Time to produce the final alignment
- How does FBSA rate?
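The three quality measures can be computed from a pairwise alignment as sketched below. The definitions used here (gap columns, identical non-gap columns, and matches as a percentage of alignment length) are assumptions for illustration and are not necessarily the exact definitions EMBOSS infoalign applies. The example strings are a fragment of the two-individual alignment shown on the Alignment slide.

```python
def alignment_stats(a, b):
    """Return gaps, matches and percent identity for two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    columns = list(zip(a, b))
    gaps = sum(1 for x, y in columns if x == "-" or y == "-")
    matches = sum(1 for x, y in columns if x == y and x != "-")
    identity = 100.0 * matches / len(columns) if columns else 0.0
    return {"gaps": gaps, "matches": matches, "percent_identity": identity}


row_1 = "GACAAGCCAGCCCAGAT"
row_2 = "GACATGCCCGC--AGAT"
print(alignment_stats(row_1, row_2))
```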
28 Time
- Timings are for the "pre-alignment" stage
- The results reflect the fact that FBSA only needs
to handle hundreds of features instead of
thousands of bases
- Comparison with ClustalW: AL022723 vs AF055066
- FBSA: 18 minutes (1000 pairs of sub-sequences)
- possibility to parallelise and obtain further
gains
- ClustalW: > 11 hours (1 pair)
29 Feature Density
- General performance is dependent on the number of
features present
- Some areas of the genome are characterised as
feature rich, and others as feature poor
- The feature density index was relatively constant
for the test sequences used.
- New statistical methods for assessing sequence
quality are required when large genetic sequences
are involved
30 Future paths
- Many avenues for further research became
apparent
- weighted substitutions based on feature homology
- cross-species performance
- mostly tested on human sequences
- some Arabidopsis and rice sequences used.
- tuning of the type of features used
- is it appropriate to include LCRs?
31 Future paths
- Generating meaningful views of large sequence
data
- Parallelisation and job management
- Upper and lower size limits
- Investigate feature fragmentation
32 Publication
Bellgard, M. and Kenworthy, W. (to be published
2003). FBSA: feature-based sequence alignment
technique for very large sequences. Accepted for
publication by Applied Bioinformatics.
33 I would like to acknowledge the contributions
of
- A number of people who were involved with the
project, including
- Professor M. Bellgard for the original idea, and
for his supervision of the project
- David Shibeci for his help in provision of
computational resources, and the programming of
some of the software modules used in this
research
- David Dunn for his advice and patient
explanations on matters biological
- Adam Hunter for data preparation when using the
CMAP utility
34 Finally
- The FBSA algorithm developed during the course of
this research improves on the performance of
current algorithms when applied to very large
genomic sequences.
- It has opened up many avenues for further
research
- Publication implies that the concept is
considered viable in the wider community of
computational biology
35 Questions?