Title: Honours Research Project
1Honours Research Project
- Pair-Wise Feature Based Sequence Alignment of
Large Genomic Sequences between Same and Different
Species
- Allison Speed
- Supervisor: Matthew Bellgard
2Sequence
- A string of genetic characters
- Contains the genetic information of an organism
- Genome projects are concerned with the sequencing
of different species
- Example: Human Genome Project (HGP)
- Sequences are allocated a unique identifier, the
accession number, upon being submitted to a
genetic database
- Sequence length is measured by the number of
characters or base-pairs (bp)
- Two types of sequences
- Nucleotide sequence and
- Protein sequence
3Nucleotide Sequence
- Deoxyribonucleic acid (DNA) is the basic building
block of life
- DNA is made up of molecular chemicals called
nucleotide bases
- Adenine (A)
- Guanine (G)
- Cytosine (C)
- Thymine (T)
- These bases are paired together (A-T) (G-C) to
form a DNA strand
Nucleotide sequence: AGTCGCGATCGTGATCGA
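The pairing rule on this slide (A-T, G-C) can be illustrated with a short Python sketch. It is only an illustration: the names `PAIR`, `complement` and `reverse_complement` are assumptions made for this example and are not part of the project.

```python
# Base pairing as stated on the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(PAIR[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]


example = "AGTCGCGATCGTGATCGA"  # the nucleotide sequence shown on the slide
print(len(example), "bp")       # sequence length in base-pairs
print(complement(example))
```

Applying `complement` twice returns the original sequence, which is a quick check that the pairing table is symmetric.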
4Protein Sequence
Protein Molecules
- A triplet of nucleotide bases makes 1 amino acid
- Amino acids are represented by 20 different
characters
- Compared to the 4 nucleotide bases
- Amino acids make proteins
Protein sequence: GHILMNPNRSTYWHGHHN
Proteins are required for the structure,
function and regulation of cells, tissues and
organs (Attwood & Parry-Smith 1999, p.207)
5Comparative Genomic Analysis
- Similarities between different sequences exist
- Sequences are graphed using a dot plot to view
this similarity
- Genetic sequences from the same and different
species are compared
- Assists in understanding the functionality and
evolutionary history of DNA
- Can infer the functionality of one sequence based
on the known function of another, similar sequence
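A dot plot can be sketched in a few lines of Python. This is a minimal illustration that assumes exact matching of fixed-length windows; the function names `dot_plot` and `render` and the `window` parameter are made up for this example and do not describe the tool used in the project.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 1) -> list[list[bool]]:
    """Build a dot-plot grid.

    grid[i][j] is True when the window starting at seq_y[i] is identical
    to the window starting at seq_x[j]. seq_x runs along the horizontal
    axis and seq_y along the vertical axis.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [seq_y[i:i + window] == seq_x[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(grid: list[list[bool]]) -> str:
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


print(render(dot_plot("ATGGTGAGG", "ATGGTGAGG", window=3)))
```

Similar sequences show up as diagonal runs of matches. A larger `window` removes much of the background noise from chance single-base matches.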
6Comparative Genomics
- By comparing the human genome with the genomes
of different organisms, researchers can better
understand the structure and function of human
genes and thereby develop new strategies in the
battle against human disease
- (Spencer 2002)
7Sequence Alignment
- To ensure correct comparison, sequences must
first be aligned
- Pair-wise sequence alignment is the matching of
genetic characters between 2 sequences
ATGGTGAGGATTGCCTTTG
|||||||||||||||||||
ATGGTGAGGATTGCCTTTG
8Large Sequence Alignment
- Genome projects have generated, and continue to
generate, vast amounts of sequence data
- There is a need to analyse the data
- Resulting in a strong demand for quality
alignment tools
- Alignment of large sequences (> 1000 bp) is a
difficult task
- An accurate alignment takes time and much
processing power to produce
- The need for effective large sequence alignment
methods is the problem of interest for this
research project
9Features
[Figure: a sequence with two features, Gene A and Gene B]
- A feature is a segment within a sequence that has
structure
- Features have biological relevance and provide useful
information about a sequence
- Example: genes
- Features are known prior to sequence alignment
- Some features interfere with alignment algorithms
- So normally such features are removed to create a
more accurate alignment
- If features are known prior to alignment, can
they be used to assist the alignment process?
10FBSA
- Feature based sequence alignment (FBSA)
- A new concept in sequence alignment
- Proposed by Bellgard & Kenworthy (2003)
- Use biological features to anchor an alignment
between two large sequences
- Features which cause problems in other alignment
methods assist the alignment process of FBSA
11FBSA Process
[Figure: Sequence 1 with Gene A, Gene B, Gene C and Gene D; Sequence 2 with Gene A, Gene B and Gene C]
- Identify features
- Compare sequences and match shared features
- Align at feature based level
- Align at nucleotide level
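The four steps above can be sketched roughly in Python. This is not the FBSA program itself. It assumes features are already identified as (name, start, end) records, and it uses a longest-common-subsequence match over feature names as a simple stand-in for the feature-level alignment. The `Feature` type, both function names and all coordinates are hypothetical; only the gene names come from the slide.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 0-based, inclusive (hypothetical coordinates)
    end: int    # exclusive


def match_features(a: list[Feature], b: list[Feature]) -> list[tuple[Feature, Feature]]:
    """Pair shared features, in order, via a longest-common-subsequence table."""
    n, m = len(a), len(b)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i].name == b[j].name:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i].name == b[j].name:
            pairs.append((a[i], b[j]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def segments_between(pairs):
    """Coordinate ranges lying between consecutive anchors in each sequence.

    These are the stretches left for nucleotide-level alignment.
    """
    return [
        ((a1.end, a2.start), (b1.end, b2.start))
        for (a1, b1), (a2, b2) in zip(pairs, pairs[1:])
    ]


seq1 = [Feature("Gene A", 100, 400), Feature("Gene B", 900, 1500),
        Feature("Gene C", 2000, 2600), Feature("Gene D", 3000, 3300)]
seq2 = [Feature("Gene A", 50, 350), Feature("Gene B", 700, 1300),
        Feature("Gene C", 1900, 2500)]

anchors = match_features(seq1, seq2)
print([f1.name for f1, _ in anchors])
print(segments_between(anchors))
```

Gene D has no partner in Sequence 2, so it is left out of the anchors. Each range returned by `segments_between` can then be aligned at nucleotide level independently of the others.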
12Advantages to FBSA
- At the feature-based level, sequences are much
shorter, thus sequence alignment is faster
- Dependent on feature density
- The Smith-Waterman algorithm is used to align the
features
- Produces accurate alignments
- Enables parallel processing
- Features can break sequences up into natural
partitions for individual analysis and processing
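For readers unfamiliar with Smith-Waterman, a compact Python sketch of the local alignment follows. It assumes a linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are illustrative choices, not those of the FBSA program. Inputs may be strings of nucleotides or lists of feature labels; the labels in the usage line are examples only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    a and b may be strings or lists; None marks a gap in the output.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, floored at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None)
            i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


print(smith_waterman(["Alu", "L1", "MIR", "Alu"], ["L1", "MIR"]))
```

The matrix has one cell per pair of items, so a list of a few dozen features is far cheaper to align than the underlying tens of thousands of bases. That is the speed advantage the first bullet describes.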
13FBSA Research
- Kenworthy (2003) demonstrated the pair-wise FBSA
of two large sequences from the same species
- Developed an FBSA program
14Honours Project
- Research Question
- Can FBSA be further developed to align two large
sequences between different species?
15Scope of Research Project
- 3 aims
- FBSA of large sequences from the same species,
such as human
- Further develop FBSA to align large sequences
between different species, such as mouse and
human
- Develop a prototype of a visual FBSA tool
16Aim 1
- Two human sequences from chromosome 6
- Accession numbers
- AC004213.1 (41,617 bp)
- AL022723.4 (148,834 bp)
- Suitable for FBSA because of their high level of
similarity
- Depicted by dot plot
Dot plot: horizontal axis AL022723.4, vertical
axis AC004213.1
17Method Taken for Aim 1
[Flow diagram: Sequence 1 and Sequence 2 -> RepeatMasker -> list of repetitive elements -> FBSA program -> output]
Output:
- Feature based alignment of Sequence 1 and Sequence 2
- Feature plot of Sequence 1 and Sequence 2
18Aim 1 Results
- Feature-based alignment analysed
- Feature matches were followed to verify a correct
alignment
- A feature based alignment is considerably easier
to analyse than a nucleotide or protein alignment
- Features are sizeable chunks of sequence data
that are more human readable
- Following an alignment between features is both
easier and less time consuming
- Aim 1 successfully completed
- 1 feature was sufficient to assist the alignment
process
19Aim 2: FBSA of Different Species
- Mouse sequence from chromosome 19
- Human Sequence from chromosome 10
- Both sequences 100,000bp in length
- Regions of high similarity
20Method Aim 2
- Repeated FBSA method used in aim 1
- -> Failed
- Not enough features in common
- More features needed to be identified
- Specific areas of similarity were selected for
additional feature investigation
- 9 regions identified and extracted for individual
processing
21[Figure: Region 1 to Region 9 labelled]
22Further Feature Investigation
- Search for additional features categorized into 3
stages
- Stage 1: repetitive elements and predicted genes
- Stage 2: expressed sequence tags (ESTs) and
proteins
- Stage 3: nucleotide matches
23Stage 1: Repetitive Elements and Predicted Genes
- The 9 regions were processed for
- repetitive elements using RepeatMasker and
- predicted genes using GenScan
- A feature map for each region was created from
the output
- The maps were analysed and any feature matches
highlighted
- Although several features were found to match, it
was clear that further feature investigation was
needed
[Figure: Region 1, mouse and human]
24Stage 2: ESTs and Proteins
- Region 9 was selected as the initial region for
additional feature processing
- The largest of the regions is highly conserved
- Region 9 was searched for
- ESTs using blastn against both the mouse and human EST
databases
- proteins using blastx
- Search results needed to be interpreted
- Top ten results added to the feature map of
region 9
- Highly successful
- Remaining 8 regions processed for ESTs and
proteins
- Search results of poor quality
- Only 1 EST identified in region 6 of the mouse
sequence
25Stage 3: Nucleotide Matches
- First 8 regions required more features to be
identified
- Another method was devised
- Sequences in each region were aligned
- Areas of alignment were extracted to create
sub-regions
- Each sub-region searched for matches from the
nucleotide database using blastn
26Stage 3 (2)
- From the 8 regions, 16 sub-regions were extracted
and processed for nucleotide matches
- Search results included in feature maps
- Repeated for region 9
- 24 sub-regions extracted and processed
Region 2: 6 alignments -> 6 sub-regions
27Feature Map of Region 9
28Aim 2 Results (1)
- Sequence conservation shown in dot plot was not
reflected by the number of
- repetitive elements, proteins, ESTs and predicted
proteins
- But nucleotide matches in the areas of
similarity provided the features needed
- Nucleotide matches are not typically, in a
biological sense, a feature
- Have been treated as a feature to assist
alignment
- Better to align nucleotide matches rather than to
force an alignment between features that do not
match
29Aim 2 Results (2)
- Many features were not shared between the two
species
- Feature density does not indicate the
appropriateness of a sequence for FBSA
- Two sequences may be rich in features, but at the
same time share few features
- The number of features shared between 2 sequences
is a more meaningful calculation
- Aim 2 successfully completed
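The difference between feature density and shared features can be made concrete with a small Python sketch. The two measures below are one possible way to define such a calculation, not one defined by the project. The function names are assumptions, and the feature lists are invented for illustration, not results from the mouse and human sequences.

```python
from collections import Counter


def shared_feature_count(names_a, names_b) -> int:
    """Number of features present in both lists (multiset intersection)."""
    return sum((Counter(names_a) & Counter(names_b)).values())


def shared_fraction(names_a, names_b) -> float:
    """Shared features as a fraction of the smaller feature list."""
    smaller = min(len(names_a), len(names_b))
    return shared_feature_count(names_a, names_b) / smaller if smaller else 0.0


# Invented example: both sequences are rich in features but share only one.
mouse = ["B1", "B2", "B1", "ID", "GeneX"]
human = ["Alu", "Alu", "L1", "MIR", "GeneX"]

print(len(mouse), len(human))              # feature counts look healthy
print(shared_feature_count(mouse, human))  # but few features are shared
print(shared_fraction(mouse, human))
```

A low shared fraction despite high feature counts is the situation described above, where additional features such as nucleotide matches had to be found.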
30Aim 3: Develop a Visual FBSA Tool
- Program design began while working on aim 2
- Challenges in aim 2 prevented further work on aim
3
- Became out of scope for the research
- The program would need to be highly
user-interactive
- Process of additional feature investigation is
human dependent
- Overcome challenges of feature matching
- An ideal research project in the future
- The development of such a program would benefit
FBSA
31Future Research
- Automate FBSA process
- With use of additional features
- Investigate feature calculations
- How much of the sequence needs to be made up of
shared features for FBSA to be worthwhile?
- Develop cut-off values to assist user
- FBSA of sequences from other species
- Types of features needed for FBSA may vary
32Conclusion
- FBSA is a new concept in the pair-wise alignment
of large genomic sequences with much future
possibility
- Same-species and between-species FBSA have been
successfully demonstrated
- Additional features have been explored and used
- Future research would be most beneficial to its
development
33Thank you
34Definitions
- Chromosome: Structural carrier of hereditary
characteristics. A certain number of
chromosomes is characteristic of each species of
plant and animal, e.g. the potato has 48
chromosomes.
- Gene: The fundamental physical and functional unit
of heredity. A gene is an ordered sequence of
nucleotides located in a particular position on a
particular chromosome that encodes a specific
functional product (Attwood & Parry-Smith 1999)
- Homology: A similar component in two organisms
(e.g. genes with strongly similar sequences) that
can be attributed to a common ancestor of the two
organisms during evolution (Mount, 2001)
- Repetitive DNA: Sequences of varying lengths that
occur in multiple copies in the genome; it
represents much of the human genome. (DOE
Genomics)