Investigating effective and novel techniques to conduct pairwise alignment of very large genomic sequences

Provided by: itMurd

Transcript and Presenter's Notes

1
Investigating effective and novel techniques to
conduct pairwise alignment of very large genomic
sequences.
William D. Kenworthy
Semester 2, 2003
2
Overview

Aims of the project
Background to genomic sequence alignment process
Literature Review
Proposed new approach
Feature Based Sequence Alignment algorithm (FBSA)
Results and discussion
Conclusions and future directions

3
Aims

The aims of this project:
Investigate the application of existing
algorithms to very large genomic sequence
alignment
Propose a novel technique for aligning very large
sequences
Evaluate the performance of the proposed system

4
Background
Simple Representation of the Human Genome
Human genome (chromosomes 1 to 22, X and Y): 3000 million nucleotides
Chromosome 6: 170 million nucleotides
Chromosome 6p21: 19,894,959 nucleotides
One gene: 3,246 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
1,082 amino acids
MTDDKDVLRDVWFGRIPTCFTLYQDEITEREAE...
1 or more proteins
5
Background
Genomic Features

Various features exist in genomic sequences
Features such as genes are often fragmented and
scattered across a local area of the sequence

[Figure: one gene (3,246 nucleotides) within a region of 19,894,959 nucleotides]
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...

Each nucleotide in a DNA (deoxyribonucleic acid)
molecule is represented by one of the
letters A, C, G or T

6
Background
Alignment

Sequence alignment involves comparing two or more
sequences alongside each other in order to
examine them for differences and similarities.

Individual 1: TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
Individual 2: TCAGGACCGCGACATGCCCGC--AGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...

Common algorithmic approaches insert gaps in
order to bring the symbols into their correct
evolutionary positions

7
Literature Review

A number of algorithms are currently in use for
aligning small to medium genomic sequences (up to
a few thousand nucleotides in size)
Smith and Waterman is considered the most accurate
Refinements by Gotoh, Miller and Webb have
further improved the basic algorithm
BLAST (Basic Local Alignment Search Tool) is
a fast, heuristic local alignment algorithm
FASTA is another heuristic local alignment
algorithm
These are all local alignment algorithms and are
concerned with finding the best local matches
between areas of high similarity

8
Many possible alignments

It has been calculated that there are a very large
number of possible alignments between any two
sequences of length n
Problem: long sequences (more than 1000 symbols)
can take a very long time and a large amount of
resources to align (accurately or not!).
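
To give a feel for "a large number", here is a minimal sketch that counts alignments under one common convention. The convention is an assumption, since the slide does not say which count was meant: every alignment column is either a paired column, a gap in the first sequence, or a gap in the second. Under that convention the count for two sequences of length n is the central Delannoy number.

```python
def count_alignments(n: int, m: int) -> int:
    """Count alignments of two sequences of length n and m.

    Assumed convention: each column is a paired column, a gap in the
    first sequence, or a gap in the second (Delannoy numbers).
    """
    # table[i][j] = number of alignments of prefixes of length i and j.
    # Aligning against an empty prefix can only be done one way.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j]        # last column: gap in second sequence
                + table[i][j - 1]      # last column: gap in first sequence
                + table[i - 1][j - 1]  # last column: paired symbols
            )
    return table[n][m]


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 100):
        print(n, count_alignments(n, n))
```

The count grows exponentially with n, which is why exhaustive enumeration is never attempted and dynamic programming or heuristics are used instead.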

9
Literature Review
Aligning very large sequences

Proposals based on patterns or motifs:
A group of heuristic algorithms has been
developed.
SSAHA, MUMmer, PipMaker/BLASTZ, ...
These involve various methods of calculating
unique (often overlapping) motifs to characterise
the sequences
These motifs are then assembled to create the
final alignment
In some cases, a secondary alignment is done to
improve the accuracy.
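
The motif idea can be illustrated with a simplified sketch. This is not the method of any of the tools named above: it assumes fixed-length motifs (k-mers) that occur exactly once in each sequence, and it assembles them by keeping the longest chain of anchors that appear in the same order in both sequences. The function names and the choice of k are hypothetical.

```python
from collections import defaultdict


def unique_kmer_positions(seq: str, k: int) -> dict[str, int]:
    """Map each k-mer that occurs exactly once in seq to its start position."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}


def shared_unique_anchors(a: str, b: str, k: int) -> list[tuple[int, int]]:
    """Return sorted (pos_in_a, pos_in_b) for k-mers unique in both sequences."""
    unique_a = unique_kmer_positions(a, k)
    unique_b = unique_kmer_positions(b, k)
    return sorted((unique_a[m], unique_b[m]) for m in unique_a if m in unique_b)


def collinear_chain(anchors: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Longest chain of anchors increasing in both coordinates (O(n^2) DP)."""
    if not anchors:
        return []
    best = [1] * len(anchors)   # best chain length ending at each anchor
    prev = [-1] * len(anchors)  # predecessor index for traceback
    for i, (ai, bi) in enumerate(anchors):
        for j in range(i):
            aj, bj = anchors[j]
            if aj < ai and bj < bi and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]
```

The regions between consecutive anchors in the chain are what a secondary alignment stage would then fill in.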

10
Literature Review
Assemble the motifs
Individual 1 (19,894,959 nucleotides): TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
Individual 2 (19,894,959 nucleotides): TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
11
Literature Review
Disadvantages of motif-based approaches

Fast and efficient in resource usage, but:
misalignment can and does occur
motifs are often small and overlapping, so gains
are sometimes small
Some are very heuristic in their approach:
accuracy suffers so much that a second stage of
alignment is required.
Not aware of biological features in a sequence, so
can distort them (don't ignore features)

12
Proposed new approach to aligning very large
sequences

Genomic sequences already contain features such
as repetitive elements, genes, low complexity
regions etc.
They are the basis of the new approach: the "Feature
Based Sequence Alignment" process, or "FBSA"
These features characterise the sequence, and
intuitively, can be seen to be conserved across
two related sequences.
Instead of discarding them as the existing
alignment process dictates, we include them in
the process!

13
FBSA
Genomic sequences already contain natural motifs
...

Genes and other features are already present, and
can be used to guide the alignment process

Individual 1 (19,894,959 nucleotides): TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
Individual 2 (19,894,959 nucleotides): TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
14
Hypothesis
The hypothesis for this project is that "it is
possible to devise an approach for aligning very
large genetic sequences in a fast, efficient and
intuitive manner by making use of existing
information which is held within the genetic
structure evidenced by the sequences under
examination, to guide the alignment process."
15
The Plan

Investigate the features that best characterise
the physical structure of a large genetic
sequence.
Investigate methods to extract and use the most
suitable features to guide the alignment process.
Design and test new algorithms based on features
Examine the results and report on findings

16
Why repetitive elements?

It became apparent that the class of features
known as "repetitive elements" was an ideal
subject.
Very common in many genomes (Human, Mouse, Rice)
Size is generally a few hundred base pairs
Class includes "Low Complexity Repeats"
Software to identify these elements is part of
any standard alignment process (which normally
results in the removal of the identified
elements)

17
Low Complexity Regions

Areas of simple repetitive patterns create
difficulties for algorithmic approaches to
alignment.
AAAAAAATAAAAAAATAAAAAAAT
Problem: common algorithms require such regions
to be removed before attempting alignment, to
improve performance and accuracy.
FBSA can make use of this information

18
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences

repeatmasker is a software program used to
identify features by using a stored lookup table
The output files are used by most standard
algorithms to "mask" out and remove repetitive
elements and low complexity regions

19
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features

A symbol table is created to map symbols to
different types of feature
Computers manipulate symbols - not molecules!
The symbols used are integers
easy and simple to manipulate

20
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified

A vector is created for each sequence with
symbols representing each of the features
identified in the original sequences, in the
correct order of appearance.
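
The symbol table and feature vector steps can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the identified features have already been read into (start, end, name) tuples, and the feature names and coordinates in the usage note are made up for the example.

```python
def build_symbol_table(*feature_lists):
    """Assign a distinct positive integer to every feature name seen.

    Each feature list holds hypothetical (start, end, name) tuples for
    one sequence; the same table is shared by both sequences.
    """
    table = {}
    for features in feature_lists:
        for _start, _end, name in features:
            if name not in table:
                table[name] = len(table) + 1
    return table


def feature_vector(features, table):
    """Integer symbols for one sequence, in order of appearance (by start)."""
    return [table[name] for _start, _end, name in sorted(features)]
```

For example, with features_1 = [(100, 400, "AluY"), (900, 950, "(CA)n"), (2000, 2600, "L1")] and features_2 = [(120, 420, "AluY"), (1500, 2100, "L1")], the shared table maps the three names to 1, 2 and 3, and the two vectors are [1, 2, 3] and [1, 3]. Two sequences of thousands of bases are thereby reduced to a handful of integers each.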

21
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.

The core of the new algorithm
The Smith and Waterman algorithm has been adapted
to use integer symbols instead of alphabetic characters
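
A minimal sketch of Smith and Waterman local alignment over integer symbols is given below. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration; the slides do not state the scoring scheme used in the project.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two sequences of integer symbols.

    Returns (best_score, pairs), where pairs lists the (i, j) index
    pairs placed in the same column (matches or substitutions) by the
    best-scoring local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are clamped at zero so an alignment can start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # symbol of a aligned against a gap
        else:
            j -= 1  # symbol of b aligned against a gap
    pairs.reverse()
    return best, pairs
```

Because the only operation applied to symbols is an equality test, the same code works unchanged for integers, which is what makes the adaptation straightforward. Aligning [1, 2, 3] with [1, 3] pairs the first and last features and leaves the middle feature of the first sequence opposite a gap.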

22
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
Subdivide both original input sequences on
identified feature boundaries

Each matched pair of features, and each
corresponding pair of areas between features, is
generated as an individual pair of sub-sequences.
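
The subdivision step might look like the sketch below. It assumes half-open (start, end) coordinates for each feature and a list of matched feature index pairs that increases in both sequences; these data shapes are hypothetical and chosen only to make the example self-contained.

```python
def subdivide(seq_a, seq_b, feats_a, feats_b, pairs):
    """Cut both sequences at the boundaries of matched features.

    feats_a / feats_b: (start, end) half-open coordinates, sorted by start.
    pairs: (i, j) indices of matched features, increasing in both i and j.
    Returns a list of (sub_a, sub_b) pairs that together cover both
    sequences: between-feature regions alternate with matched features.
    """
    pieces = []
    pos_a = pos_b = 0
    for i, j in pairs:
        start_a, end_a = feats_a[i]
        start_b, end_b = feats_b[j]
        # Region between the previous matched feature and this one
        # (any unmatched features stay inside this region).
        pieces.append((seq_a[pos_a:start_a], seq_b[pos_b:start_b]))
        # The matched features themselves.
        pieces.append((seq_a[start_a:end_a], seq_b[start_b:end_b]))
        pos_a, pos_b = end_a, end_b
    # Whatever follows the last matched feature.
    pieces.append((seq_a[pos_a:], seq_b[pos_b:]))
    return pieces
```

Each returned pair is small compared with the original sequences and can be handed to a standard aligner independently of the others.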

23
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
Subdivide both original input sequences on
identified feature boundaries
Align sub-sequences (e.g., using Blast, Fasta,
SSearch, or ClustalW)

Each individual pair of sub-sequences is aligned
using a standard algorithm (ClustalW was used in
this project)

24
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
Subdivide both original input sequences on
identified feature boundaries
Align sub-sequences (e.g., using Blast, Fasta,
SSearch, or ClustalW)
Assemble results and prepare for viewing or
further processing

Using information from the symbol table and
repeatmasker lists, reassemble the individual,
and now aligned, sub-sequences into a
super-sequence in one of the standard formats

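
The assembly step can be sketched as a simple concatenation. The sketch assumes each aligned pair comes back from the aligner as two gapped strings of equal length, and it uses FASTA as the example of a standard output format; both function names are hypothetical.

```python
def assemble(aligned_pieces):
    """Concatenate aligned sub-sequence pairs into one super-alignment."""
    row_a, row_b = [], []
    for piece_a, piece_b in aligned_pieces:
        if len(piece_a) != len(piece_b):
            raise ValueError("each aligned pair must have equal length")
        row_a.append(piece_a)
        row_b.append(piece_b)
    return "".join(row_a), "".join(row_b)


def to_fasta(records, width=60):
    """Format (name, sequence) records as FASTA text with wrapped lines."""
    lines = []
    for name, row in records:
        lines.append(f">{name}")
        lines.extend(row[i:i + width] for i in range(0, len(row), width))
    return "\n".join(lines) + "\n"
```

Because the pieces are kept in their original order, the concatenated rows are a valid alignment of the two full sequences as long as every piece is itself a valid alignment.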
25
FBSA
Nucleotides are now pre-aligned

Gaps have been inserted at natural boundaries to
put corresponding features in the correct
relationship

Individual 1: TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
Individual 2: TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCGCCAAAGCGGT...
26
Characteristics

The resulting super-alignment has a number of
desirable properties:
Major insertions and deletions are made at a high
level, and at a biologically significant boundary,
not in the middle of important features as
machine-generated, motif-based methods may do.
The pre-aligned status of each sub-sequence allows
one of the standard dynamic programming algorithms
to be used closer to its area of most efficient
operation.
Assembly of the sub-sequences is accurate because
it is under the control of the original
information.

27
Performance Evaluation
Quality

Measurement of alignment quality is usually a
combination of:
Number of gaps
Number of matches
Percent identity
(all measured by EMBOSS "infoalign")
Time to produce the final alignment
How does FBSA rate?
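
The three quality counts can be computed directly from a finished pairwise alignment. The sketch below is an illustration only and does not claim to reproduce the definitions used by "infoalign": it assumes "-" marks a gap and reports percent identity as matches divided by alignment length, which is one of several conventions in use.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Count matches, mismatches and gap columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    # Assumed convention: identity is relative to the alignment length.
    identity = 100.0 * matches / len(row_a) if row_a else 0.0
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "percent_identity": identity,
    }
```

Applied to the first 50 columns of the two-individual example shown on the earlier Alignment slide, this gives 46 matches, 2 mismatches, 2 gap columns and 92% identity.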

28
Time

Timings are for the "pre-alignment" stage
The results reflect the fact that FBSA only needs
to handle hundreds of features instead of
thousands of bases
Comparison with ClustalW (AL022723 vs AF055066):
FBSA: 18 minutes (1000 pairs of sub-sequences),
with the possibility to parallelise and obtain
further gains
ClustalW: more than 11 hours (1 pair)

29
Feature Density

General performance is dependent on the number of
features present
Some areas of the genome are characterised as
feature rich, and others as feature poor
The feature density index was relatively constant
for the test sequences used.
New statistical methods for assessing sequence
quality are required when large genetic sequences
are involved

30
Future paths

Many avenues for further research became
apparent:
Weighted substitutions based on feature homology
Cross-species performance
mostly tested on human sequences
some Arabidopsis and rice sequences used.
Tuning of the type of features used
is it appropriate to include LCRs (low complexity regions)?

31
Future paths

Generating meaningful views of large sequence
data
Parallelisation and job management
Upper and lower size limits
Investigate feature fragmentation

32
Publication
Bellgard, M. and Kenworthy, W. (to be published
2003). FBSA: feature-based sequence alignment
technique for very large sequences. Accepted for
publication by Applied Bioinformatics.
33
I would like to acknowledge the contributions
of a number of people who were involved with the
project, including:
Professor M. Bellgard for the original idea, and
for his supervision of the project
David Shibeci for his help in provision of
computational resources, and the programming of
some of the software modules used in this
research
David Dunn for his advice and patient
explanations on matters biological
Adam Hunter for data preparation when using the
CMAP utility

34
Finally

The FBSA algorithm developed during the course of
this research improves on the performance of
current algorithms when applied to very large
genomic sequences.
It has opened up many avenues for further
research
Publication implies that the concept is
considered viable in the wider community of
computational biology

35
Questions?
