Investigating effective and novel techniques to conduct pairwise alignment of very large genomic seq - PowerPoint PPT Presentation

1
Investigating effective and novel techniques to
conduct pairwise alignment of very large genomic
sequences
William D. Kenworthy
Semester 2, 2003
2
Overview
  • Aims of the project
  • Background to genomic sequence alignment process
  • Literature Review
  • Proposed new approach
  • Feature Based Sequence Alignment algorithm (FBSA)
  • Results and discussion
  • Conclusions and future directions

3
Aims
  • The aims of this project
  • Investigate the application of existing
    algorithms to very large genomic sequence
    alignment
  • Propose a novel technique for aligning very large
    sequences
  • Evaluate the performance of the proposed system

4
Background
Simple Representation of the Human Genome
  • Human genome (chromosomes 1 to 22, X and Y):
    3000 million nucleotides
  • Chromosome 6: 170 million nucleotides
  • Chromosome 6p21: 19,894,959 nucleotides
  • 3,246 nucleotides:
    TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
    CCAAAGCGGT...
  • 1,082 amino acids:
    MTDDKDVLRDVWFGRIPTCFTLYQDEITEREAE...
  • 1 or more proteins
5
Background
Genomic Features
  • Various features exist in genomic sequences
  • Features such as genes are often fragmented and
    scattered across a local area of the sequence

One gene
19,894,959 nucleotides
3,246 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
  • Each nucleotide in a DNA (deoxyribonucleic acid)
    molecule is represented by one of the
    letters A, C, G or T

6
Background
Alignment
  • Sequence alignment involves comparing two or more
    sequences alongside each other in order to
    examine them for differences and similarities.

Individual 1
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
Individual 2
TCAGGACCGCGACATGCCCGC--AGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
  • Common algorithmic approaches are concerned with
    attempting to insert gaps in order to align the
    symbols into their correct evolutionary position

7
Literature Review
  • A number of algorithms are currently in use for
    aligning small to medium genomic sequences (up to
    a few thousand nucleotides in size)
  • The Smith and Waterman algorithm is considered
    the most accurate
  • Refinements by Gotoh, Miller and Webb have
    further improved the basic algorithm
  • BLAST (Basic Local Alignment Search Tool) is
    a fast, heuristic local alignment algorithm
  • FASTA, another heuristic local alignment
    algorithm
  • These are all local alignment algorithms and are
    concerned with finding the best local matches
    between areas of high similarity

8
Many possible alignments
  • It has been calculated that there are a large
    number of possible alignments between any two
    sequences of length n
  • Problem: long sequences (more than 1000 symbols)
    can take a very long time and a large amount of
    resources to align (accurate or not!).
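
The transcript does not preserve the count itself. As a rough illustration only, one commonly quoted estimate for the number of global alignments of two sequences of length n is the binomial coefficient C(2n, n); whether this is the figure the slide referred to is an assumption. The function name below is hypothetical. A minimal sketch of how quickly the count grows:

```python
import math

def alignment_count(n: int) -> int:
    """Assumed estimate: ways to interleave two length-n sequences, C(2n, n)."""
    return math.comb(2 * n, n)

# Report only the order of magnitude, since the exact values get very long.
for n in (1, 2, 10, 100, 1000):
    digits = len(str(alignment_count(n)))
    print(f"n={n}: a {digits}-digit number of alignments")
```

Even under this simple estimate the count grows exponentially in n, which is why exhaustive comparison is out of reach for long sequences.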

9
Literature Review
Aligning very large sequences
  • Proposals are based on patterns or motifs
  • A group of heuristic algorithms has been
    developed:
  • SSAHA, MUMmer, PipMaker/BLASTZ, ...
  • These involve various methods of calculating
    unique (often overlapping) motifs to characterise
    the sequences
  • These motifs are then assembled to create the
    final alignment
  • In some cases, a secondary alignment is done to
    improve the accuracy.
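
The general idea of motif-based anchoring can be sketched as follows. This is not the method used by any of the tools named above; it is a simplified illustration that assumes motifs are fixed-length words (k-mers) occurring exactly once in each sequence. The function names are hypothetical.

```python
from collections import Counter

def unique_kmers(seq: str, k: int) -> dict[str, int]:
    """Map each k-mer that occurs exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

def shared_anchors(a: str, b: str, k: int) -> list[tuple[int, int]]:
    """Start positions (in a, in b) of k-mers unique in both, sorted by a."""
    unique_a, unique_b = unique_kmers(a, k), unique_kmers(b, k)
    return sorted((unique_a[m], unique_b[m])
                  for m in unique_a.keys() & unique_b.keys())

a = "TCAGGACCGCGACAAGCC"
b = "TCAGGACCGCGACATGCC"
print(shared_anchors(a, b, 8))
```

The anchors returned here overlap heavily, which mirrors the point made above that motifs are often overlapping.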

10
Literature Review
Assemble the motifs
Individual 1
19,894,959 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
19,894,959 nucleotides
Individual 2
11
Literature Review
Disadvantages of motif-based approaches
  • Motif-based approaches are fast and efficient in
    resource usage, but
  • Misalignment can and does occur
  • Motifs are often small and overlapping, so gains
    are sometimes small
  • Some are very heuristic in their approach:
    accuracy suffers so much that a second stage of
    alignment is required.
  • They are not aware of biological features in a
    sequence, so they can distort them (features
    should not be ignored)

12
Proposed new approach to aligning very large
sequences
  • Genomic sequences already contain features such
    as repetitive elements, genes, low complexity
    regions etc.
  • This is the basis of the new approach: the "Feature
    Based Sequence Alignment process" or "FBSA"
  • These features characterise the sequence, and
    intuitively, can be seen to be conserved across
    two related sequences.
  • Instead of discarding them as the existing
    alignment process dictates, we include them in
    the process!

13
FBSA
Genomic sequences already contain natural motifs
...
  • Genes and other features are already present, and
    can be used to guide the alignment process

Individual 1
19,894,959 nucleotides
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
19,894,959 nucleotides
Individual 2
14
Hypothesis
The hypothesis for this project is that "it is
possible to devise an approach for aligning very
large genetic sequences in a fast, efficient and
intuitive manner by making use of existing
information which is held within the genetic
structure evidenced by the sequences under
examination, to guide the alignment process."
15
The Plan
  • Investigate the features that best characterise
    the physical structure of a large genetic
    sequence.
  • Investigate methods to extract and use the most
    suitable features to guide the alignment process.
  • Design and test new algorithms based on features
  • Examine the results and report on findings

16
Why repetitive elements?
  • It became apparent that the class of features
    known as "repetitive elements" was an ideal
    subject.
  • Very common in many genomes (Human, Mouse, Rice)
  • Size is generally a few hundred base pairs
  • Class includes "Low Complexity Repeats"
  • Software to identify these elements is part of
    any standard alignment process (which normally
    results in the removal of the identified
    elements)

17
Low Complexity Regions
  • Areas of simple repetitive patterns create
    difficulties for algorithmic approaches to
    alignment.
  • AAAAAAATAAAAAAATAAAAAAAT
  • Problem: common algorithms require such regions
    to be removed before attempting alignment to
    improve performance and accuracy.
  • FBSA can make use of this information

18
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
  • repeatmasker is a software program used to
    identify features by using a stored lookup table
  • The output files are used by most standard
    algorithms to "mask" out and remove repetitive
    elements and low complexity regions
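
A minimal sketch of reading such an annotation table is shown below. The column layout (score, three percentages, query name, begin, end, remaining bases, strand, repeat name, class/family, ...) is an assumption about the tabular output, and the sample rows are invented for illustration; the `Feature` type and function name are hypothetical.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    start: int    # start position in the query sequence
    end: int      # end position in the query sequence
    name: str     # matching repeat, e.g. "AluSx"
    family: str   # repeat class/family, e.g. "SINE/Alu"

def parse_repeat_table(text: str) -> list[Feature]:
    """Parse whitespace-separated annotation rows, skipping header lines."""
    features = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric score; headers and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        features.append(Feature(int(fields[5]), int(fields[6]),
                                fields[9], fields[10]))
    return features

# Invented sample in the assumed layout.
SAMPLE = """\
   SW  perc perc perc  query  position in query   matching repeat
score  div. del. ins.  sequence begin end (left)  repeat class/family

 1306  15.6  6.2  0.0  seq1   120  410 (9590) +  AluSx   SINE/Alu  1 300 (12) 1
   25   0.0  0.0  0.0  seq1   800  823 (9177) +  AT_rich Low_complexity 1 24 (0) 2
"""

for feature in parse_repeat_table(SAMPLE):
    print(feature)
```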

19
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
  • A symbol table is created to map symbols to
    different types of feature
  • Computers manipulate symbols - not molecules!
  • The symbols used are integers
  • easy and simple to manipulate
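
One way such a symbol table could be built is sketched here. The function name and the choice to number symbols from 1 (leaving 0 free, for example for a gap) are assumptions, not details taken from the presentation.

```python
def build_symbol_table(*feature_lists: list[str]) -> dict[str, int]:
    """Assign a distinct integer to each feature type, in order of first use."""
    table: dict[str, int] = {}
    for names in feature_lists:
        for name in names:
            # setdefault leaves an existing entry untouched.
            table.setdefault(name, len(table) + 1)
    return table

# Feature types found in two sequences (illustrative names only).
table = build_symbol_table(["AluSx", "L1", "AluSx"], ["L1", "AT_rich"])
print(table)  # {'AluSx': 1, 'L1': 2, 'AT_rich': 3}
```

Building one table from both sequences ensures the same feature type receives the same integer in each.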

20
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
  • A vector is created for each sequence with
    symbols representing each of the features
    identified in the original sequences, in the
    correct order of appearance.
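
As a sketch of this step, the vector can be produced by sorting the identified features by position and looking each one up in the symbol table. The function name, the (start, name) tuple layout and the sample values are assumptions for illustration.

```python
def feature_vector(features: list[tuple[int, str]],
                   table: dict[str, int]) -> list[int]:
    """Return the integer symbols of the features in order of appearance."""
    ordered = sorted(features, key=lambda item: item[0])  # sort by start position
    return [table[name] for _, name in ordered]

table = {"AluSx": 1, "L1": 2, "AT_rich": 3}
seq1_features = [(800, "AT_rich"), (120, "AluSx"), (2300, "L1")]
print(feature_vector(seq1_features, table))  # [1, 3, 2]
```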

21
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
  • The core of the new algorithm
  • The Smith and Waterman algorithm has been adapted
    to use integer symbols instead of alpha characters
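
The following is a minimal sketch of a Smith and Waterman style local alignment over integer symbols. It is not the project's adapted implementation: the scoring values (match 2, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration.

```python
def smith_waterman(a: list[int], b: list[int],
                   match: int = 2, mismatch: int = -1, gap: int = -1):
    """Local alignment of two integer sequences.

    Returns (best score, list of (i, j) index pairs where a[i] == b[j]
    on the best local alignment path).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]

print(smith_waterman([1, 2, 3, 4], [1, 2, 4]))  # (5, [(0, 0), (1, 1), (3, 2)])
```

In the example, feature 3 has no partner in the second vector, so it is skipped and the remaining features are paired in order.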

22
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
Subdivide both original input sequences on
identified feature boundaries
  • Each matched pair of features AND the
    corresponding areas between features are generated
    as individual pairs of sub-sequences.
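
A sketch of the subdivision, under stated assumptions: only the matched features of each sequence are passed in, as sorted, non-overlapping, half-open (start, end) intervals, so unmatched features simply remain inside the regions between them. The function name and the short sample sequences are hypothetical.

```python
def split_on_features(seq: str, features: list[tuple[int, int]]) -> list[str]:
    """Cut seq into between-region, feature, between-region, ..., tail."""
    pieces, pos = [], 0
    for start, end in features:
        pieces.append(seq[pos:start])  # region before the feature (may be empty)
        pieces.append(seq[start:end])  # the feature itself
        pos = end
    pieces.append(seq[pos:])           # region after the last feature
    return pieces

seq1 = "AAACCCGGGTTT"
seq2 = "AACCCGGGGTTT"
pieces1 = split_on_features(seq1, [(3, 6), (9, 12)])
pieces2 = split_on_features(seq2, [(2, 5), (9, 12)])
# Equal numbers of matched features give equal numbers of pieces,
# so the pieces pair up one to one.
for sub1, sub2 in zip(pieces1, pieces2):
    print(repr(sub1), repr(sub2))
```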

23
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
Subdivide both original input sequences on
identified feature boundaries
Align sub-sequences (e.g., using Blast, Fasta,
SSearch, or ClustalW)
  • Each individual pair of sub-sequences is aligned
    using a standard algorithm (ClustalW was used in
    this project)

24
FBSA algorithm
Identify features (e.g., using repeatmasker)
in both original sequences
Create a symbol table from the identified
features
Create feature sequence files for repeats
identified
Align feature files using an adapted Smith and
Waterman algorithm.
Subdivide both original input sequences on
identified feature boundaries
Align sub-sequences (e.g., using Blast, Fasta,
SSearch, or ClustalW)
Assemble results and prepare for viewing or
further processing
  • Using information from the symbol table and
    repeatmasker lists, reassemble the individual,
    now aligned sub-sequences into a
    super-sequence in one of the standard formats
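
The assembly step can be sketched as a concatenation of the aligned sub-sequence pairs in their original order. FASTA is used here as an assumed choice of output format, and the function name and sequence names are hypothetical.

```python
def assemble(aligned_pairs: list[tuple[str, str]],
             names: tuple[str, str] = ("seq1", "seq2")) -> str:
    """Join aligned sub-sequence pairs into one two-row FASTA alignment."""
    top = "".join(first for first, _ in aligned_pairs)
    bottom = "".join(second for _, second in aligned_pairs)
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return f">{names[0]}\n{top}\n>{names[1]}\n{bottom}\n"

print(assemble([("AC-", "ACG"), ("TT", "T-")]))
```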
25
FBSA
Nucleotides are now pre-aligned
  • Gaps have been inserted at natural boundaries to
    put orthologous features in the correct
    relationship

Individual 1
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
TCAGGACCGCGACAAGCCAGCCCAGATCCGCTTCAGCAACATTTCCGCCG
CCAAAGCGGT...
Individual 2
26
Characteristics
  • The resulting super-alignment has a number of
    desirable properties
  • Major insertions and deletions are made at a high
    level, and at a biologically significant boundary,
    not in the middle of important features as
    machine-generated, motif-based methods may do.
  • The pre-aligned status of each sub-sequence allows
    one of the standard dynamic programming algorithms
    to be used closer to its area of most efficient
    operation.
  • Assembly of the sub-sequences is accurate because
    it is under the control of the original
    information.

27
Performance Evaluation
Quality
  • Measurement of alignment quality is usually a
    combination of
  • Number of gaps
  • Number of matches
  • Percent identity
  • Measured by EMBOSS "infoalign"
  • Time to produce the final alignment
  • How does FBSA rate?
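
The quality measures listed above can be computed directly from two aligned rows. This is a simple sketch, not a reimplementation of the EMBOSS tool, whose exact definitions may differ; here percent identity is assumed to be matches divided by alignment length, and the function name is hypothetical.

```python
def alignment_stats(row1: str, row2: str) -> dict[str, float]:
    """Count gaps and matches in a pairwise alignment and derive identity."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    gaps = row1.count("-") + row2.count("-")
    matches = sum(x == y and x != "-" for x, y in zip(row1, row2))
    identity = 100.0 * matches / len(row1) if row1 else 0.0
    return {"gaps": gaps, "matches": matches, "identity": identity}

# A fragment of the two-individual example shown earlier in the deck.
stats = alignment_stats("GACAAGCCAGCCCAG", "GACATGCCCGC--AG")
print(stats["gaps"], stats["matches"], round(stats["identity"], 1))  # 2 11 73.3
```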

28
Time
  • Timings are for the "pre-alignment" stage
  • The results reflect the fact that FBSA only needs
    to handle hundreds of features instead of
    thousands of bases
  • Comparison with ClustalW: AL022723 vs AF055066
  • FBSA: 18 minutes (1000 pairs of sub-sequences)
  • Possibility to parallelise and obtain further
    gains
  • ClustalW: > 11 hours (1 pair)
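
Because each pair of sub-sequences is aligned independently, the work can in principle be spread over several workers. The sketch below uses a thread pool with a placeholder aligner; `align_pair` is hypothetical and only pads with gaps, standing in for a call to a real alignment program. Threads are assumed to be adequate because, in practice, each task would mostly wait on an external process.

```python
from concurrent.futures import ThreadPoolExecutor

def align_pair(pair: tuple[str, str]) -> tuple[str, str]:
    """Placeholder aligner (hypothetical): pad the shorter sequence with gaps."""
    a, b = pair
    width = max(len(a), len(b))
    return a.ljust(width, "-"), b.ljust(width, "-")

def align_all(pairs: list[tuple[str, str]],
              workers: int = 4) -> list[tuple[str, str]]:
    """Align every pair concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(align_pair, pairs))

print(align_all([("ACGT", "AC"), ("G", "GGA")]))
```

Keeping the results in input order matters, since the aligned pieces are later reassembled in their original sequence order.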

29
Feature Density
  • General performance is dependent on the number of
    features present
  • Some areas of the genome are characterised as
    feature rich, and others as feature poor
  • The feature density index was relatively constant
    for the test sequences used.
  • New statistical methods for assessing sequence
    quality are required when large genetic sequences
    are involved

30
Future paths
  • Many avenues for further research became
    apparent
  • weighted substitutions based on feature homology
  • cross-species performance
  • mostly tested on human sequences
  • some Arabidopsis and rice sequences used.
  • tuning of the type of features used
  • is it appropriate to include LCRs?

31
Future paths
  • Generating meaningful views of large sequence
    data
  • Parallelisation and job management
  • Upper and lower size limits
  • Investigate feature fragmentation

32
Publication
Bellgard, M. and Kenworthy, W. (to be published
2003) FBSA: feature-based sequence alignment
technique for very large sequences. Accepted for
publication by Applied Bioinformatics.
33
I would like to acknowledge the contributions
of
  • A number of people who were involved with the
    project, including
  • Professor M. Bellgard for the original idea, and
    for his supervision of the project
  • David Shibeci for his help in provision of
    computational resources, and the programming of
    some of the software modules used in this
    research
  • David Dunn for his advice and patient
    explanations on matters biological
  • Adam Hunter for data preparation when using the
    CMAP utility

34
Finally
  • The FBSA algorithm developed during the course of
    this research improves on the performance of
    current algorithms when applied to very large
    genomic sequences.
  • It has opened up many avenues for further
    research
  • Publication implies that the concept is
    considered viable in the wider community of
    computational biology

35
Questions?