Honours Research Project - PowerPoint PPT Presentation

Slides: 35

Provided by: cbbcg

Transcript and Presenter's Notes

Title: Honours Research Project

1
Honours Research Project

Pair-Wise Feature Based Sequence Alignment of
Large Genomic Sequences between Same & Different
Species
Allison Speed
Supervisor: Matthew Bellgard

2
Sequence

A string of genetic characters
Contain genetic information of an organism
Genome projects are concerned with the sequencing
of different species
Example: Human Genome Project (HGP)
Sequences are allocated a unique identifier, an
accession number, upon being submitted to a
genetic database
Sequence length is measured by the number of
characters or base-pairs (bp)
Two types of sequences
Nucleotide sequence and
Protein sequence

3
Nucleotide Sequence

Deoxyribonucleic acid (DNA) is the basic building
block of life
DNA is made up of molecular chemicals called
nucleotide bases
Adenine (A)
Guanine (G)
Cytosine (C)
Thymine (T)
These bases are paired together (A-T) (G-C) to
form a DNA strand
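The pairing rule above (A-T, G-C) can be sketched in a few lines of code. This is only an illustration and is not taken from the presentation; the names `PAIRS`, `complement` and `reverse_complement` are hypothetical.

```python
# Pairing rule from the slide: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq: str) -> str:
    """Return the pairing partner of every base in seq, in the same order."""
    return "".join(PAIRS[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Return the partner strand read in the opposite direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    # Example nucleotide sequence from this slide.
    print(complement("AGTCGCGATCGTGATCGA"))
```

Applying `complement` twice returns the original sequence, which is a quick sanity check on the pairing table.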

Nucleotide sequence: AGTCGCGATCGTGATCGA
4
Protein Sequence
Protein Molecules

A triplet of nucleotide bases codes for 1 amino acid
Amino acids are represented by 20 different
characters
Compared to the 4 nucleotide bases
Amino acids make proteins
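A minimal sketch of the triplet-to-amino-acid step, assuming the standard genetic code and a reading frame that starts at the first base (neither is stated on the slide). The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed by codon with the bases in T, C, A, G order.
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA triplets to one-letter amino acid codes.

    Assumes the reading frame starts at the first base; trailing bases
    that do not fill a whole triplet are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("AGTCGCGATCGTGATCGA"))
```

The table has 64 codons but only 20 distinct amino acid letters (plus the stop marker), which matches the 20 characters mentioned above.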

Protein sequence: GHILMNPNRSTYWHGHHN
Proteins are required for the structure,
function and regulation of cells, tissues and
organs (Attwood & Parry-Smith 1999, p.207)
5
Comparative Genomic Analysis

Similarities between different sequences exist
Sequences are graphed using a dot plot to view
this similarity
Genetic sequences from the same and different
species are compared
Assists in understanding the functionality &
evolutionary history of DNA
Can infer the functionality of one sequence based
on the known function of another, similar sequence
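A dot plot can be sketched as a simple grid: one sequence along each axis, with a mark wherever the two sequences match. The version below is a text-only illustration with an assumed exact-match window; the function name `dot_plot` and the `window` parameter are hypothetical, and real dot plot tools usually add scoring and filtering.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 1) -> list[str]:
    """Return the dot plot as text rows.

    Columns follow seq_x and rows follow seq_y. A cell is "*" when the
    window-long substrings starting at that column and row are identical,
    otherwise ".".
    """
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = ""
        for i in range(len(seq_x) - window + 1):
            same = seq_x[i:i + window] == seq_y[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    for line in dot_plot("ATGGTGAGG", "ATGGTGAGG", window=3):
        print(line)
```

Similar sequences show up as a diagonal line of marks; a larger window removes much of the background noise from chance single-base matches.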

6
Comparative Genomics

By comparing the human genome with the genomes
of different organisms, researchers can better
understand the structure and function of human
genes and thereby develop new strategies in the
battle against human disease
(Spencer 2002)

7
Sequence Alignment

To ensure correct comparison, sequences must
first be aligned
Pair-wise sequence alignment is the matching of
genetic characters between 2 sequences
ATGGTGAGGATTGCCTTTG
ATGGTGAGGATTGCCTTTG

8
Large Sequence Alignment

Genome projects have generated, & continue to
generate, vast amounts of sequence data.
There is a need to analyse the data
Resulting in a strong demand for quality
alignment tools
Alignment of large sequences (> 1000 bp) is a
difficult task
An accurate alignment takes time & much
processing power to produce
The need for effective large sequence alignment
methods is the problem of interest for this
research project
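To give a feel for why large alignments are costly, the sketch below counts the cells of a full dynamic-programming matrix, which grows with the product of the two sequence lengths. It assumes a full-matrix algorithm and 4 bytes per cell; both are illustrative assumptions, and the names `dp_cells` and `dp_memory_gib` are hypothetical.

```python
def dp_cells(len_a: int, len_b: int) -> int:
    """Number of cells in a full dynamic-programming alignment matrix."""
    return (len_a + 1) * (len_b + 1)


def dp_memory_gib(len_a: int, len_b: int, bytes_per_cell: int = 4) -> float:
    """Memory for the full matrix in GiB (bytes_per_cell is an assumption)."""
    return dp_cells(len_a, len_b) * bytes_per_cell / 2**30


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, dp_cells(n, n), round(dp_memory_gib(n, n), 2))
```

Two sequences of 100,000 bp each already need on the order of 10^10 cells under these assumptions.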

9
Features
[Figure: a sequence with two features marked, Gene A and Gene B]

A feature is a segment within a sequence that has
structure
Features have biological relevance and provide useful
information about a sequence
Example: genes
Features are known prior to sequence alignment
Some features interfere with alignment algorithms
So normally such features are removed to create a
more accurate alignment
If features are known prior to alignment, can
they be used to assist the alignment process?

10
FBSA

Feature based sequence alignment (FBSA)
A new concept to sequence alignment
Proposed by Bellgard & Kenworthy (2003)
Use biological features to anchor an alignment
between two large sequences
Features which cause problems in other alignment
methods, assist the alignment process of FBSA

11
FBSA Process
Sequence 1: Gene A, Gene B, Gene C, Gene D
Sequence 2: Gene A, Gene B, Gene C

Identify features
Compare sequences and match shared features
Align at feature based level
Align at nucleotide level
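The four steps above can be sketched as follows. This is not the FBSA program itself: it is a simplified illustration that matches features by name with a longest-common-subsequence pass (a swapped-in technique), then lists the stretches between matched features that would be left for nucleotide-level alignment. The `Feature` type, the function names and all coordinates are hypothetical, and features are assumed to be sorted and non-overlapping.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def match_features(feats_1, feats_2):
    """Pair up shared features in order (longest common subsequence by name)."""
    n, m = len(feats_1), len(feats_2)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if feats_1[i].name == feats_2[j].name:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    anchors, i, j = [], 0, 0
    while i < n and j < m:
        if feats_1[i].name == feats_2[j].name:
            anchors.append((feats_1[i], feats_2[j]))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return anchors


def segments_between(anchors, len_1, len_2):
    """Coordinate pairs of the stretches between consecutive anchors."""
    out, prev_1, prev_2 = [], 0, 0
    for f1, f2 in anchors:
        out.append(((prev_1, f1.start), (prev_2, f2.start)))
        prev_1, prev_2 = f1.end, f2.end
    out.append(((prev_1, len_1), (prev_2, len_2)))
    return out


if __name__ == "__main__":
    seq_1 = [Feature("Gene A", 10, 20), Feature("Gene B", 30, 40),
             Feature("Gene C", 50, 60), Feature("Gene D", 70, 80)]
    seq_2 = [Feature("Gene A", 5, 15), Feature("Gene B", 25, 35),
             Feature("Gene C", 45, 55)]
    anchors = match_features(seq_1, seq_2)
    print([f1.name for f1, _ in anchors])
    print(segments_between(anchors, 100, 60))
```

Each pair of segments between two anchors is much shorter than the whole sequences, so the nucleotide-level step works on small pieces.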

12
Advantages to FBSA

At the feature-based level, sequences are much
shorter, thus sequence alignment is faster
Dependent on feature density
Smith-Waterman algorithm is used to align the
features
Produces accurate alignments
Enables parallel processing
Features can break sequences up into natural
partitions for individual analysis and processing
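A minimal sketch of Smith-Waterman local alignment applied to lists of feature names rather than single bases. The scoring values (match 2, mismatch -1, gap -1) and the feature names are assumptions made for illustration; the slides do not state the scoring used by the FBSA program.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two token sequences (strings or lists).

    Returns the best local score and the aligned index pairs (i, j),
    where a[i] is aligned with b[j]. Scoring values are assumptions.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    # Hypothetical feature lists, one entry per feature in sequence order.
    features_1 = ["Alu", "L1", "MIR", "Alu", "L2"]
    features_2 = ["L1", "MIR", "Alu"]
    print(smith_waterman(features_1, features_2))
```

Because each list has one entry per feature instead of one per base, the matrix is tiny compared with a nucleotide-level alignment of the same sequences.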

13
FBSA Research

Kenworthy (2003) demonstrated the pair-wise FBSA
of two large sequences from the same species
Developed FBSA program

14
Honours Project

Research Question
Can FBSA be further developed to align two large
sequences between
different species?

15
Scope of Research Project

3 aims
FBSA of large sequences from the same species,
such as human
Further develop FBSA to align large sequences
between different species, such as mouse and
human
Develop a prototype of a visual FBSA tool

16
Aim 1

Two human sequences from chromosome 6
Accession numbers
AC004213.1 (41,617 bp)
AL022723.4 (148,834 bp)
Suitable for FBSA because of high level of
similarity
Depicted by dot plot

[Figure: dot plot. Horizontal axis: AL022723.4. Vertical axis: AC004213.1]
17
Method Taken for Aim 1
[Flowchart: Sequence 1 & Sequence 2 -> RepeatMasker -> List of repetitive elements -> FBSA Program -> output]
Output:
Feature based alignment of Sequence 1 & Sequence 2
Feature plot of Sequence 1 & Sequence 2
18
Aim 1 Results

Feature-based alignment analysed
Feature matches were followed to verify a correct
alignment
A feature based alignment is considerably easier
to analyse than a nucleotide or protein alignment
Features are sizeable chunks of sequence data
that are more human readable
Following an alignment between features is both
easier and less time consuming
Aim 1 successfully completed
1 feature was sufficient to assist the alignment
process

19
Aim 2: FBSA Different Species

Mouse sequence from chromosome 19
Human Sequence from chromosome 10
Both sequences 100,000bp in length
Regions of high similarity

20
Method Aim 2

Repeated FBSA method used in aim 1
-> Failed
Not enough features in common
More features needed to be identified
Specific areas of similarity were selected for
additional feature investigation
9 regions identified and extracted for individual
processing

21
[Figure: the 9 extracted regions, labelled Region 1 to Region 9]
22
Further Feature Investigation

Search for additional features categorized into 3
stages
Stage 1: repetitive elements and predicted genes
Stage 2: expressed sequence tags (ESTs) and
proteins
Stage 3: nucleotide matches

23
Stage 1: Repetitive Elements & Predicted Genes

The 9 regions were processed for
repetitive elements using RepeatMasker and
predicted genes using GenScan
A feature map for each region was created from
the output
The maps were analysed and any feature matches
highlighted
Although several features were found to match, it
was clear that further feature investigation was
needed

[Figure: feature maps of Region 1, mouse and human]
24
Stage 2: ESTs and Proteins

Region 9 was selected as the initial region for
additional feature processing
The largest of the regions is highly conserved
Region 9 was searched for
ESTs, using blastn against both the mouse and human EST
databases
proteins, using blastx
Search results needed to be interpreted
Top ten results added to the feature map of
region 9
Highly successful
Remaining 8 regions processed for ESTs and
proteins
Search results of poor quality
Only 1 EST identified in region 6 of the mouse
sequence

25
Stage 3: Nucleotide Matches

First 8 regions required more features to be
identified
Another method was devised

Sequences in each region were aligned
Areas of alignment were extracted to create
sub-regions
Each sub-region searched for matches from the
nucleotide database using blastn

26
Stage 3 (2)

From the 8 regions, 16 sub-regions were extracted
& processed for nucleotide matches
Search results included in feature maps
Repeated for region 9
24 sub-regions extracted & processed

Region 2: 6 alignments -> 6 sub-regions
27
Feature Map of Region 9
28
Aim 2 Results (1)

Sequence conservation shown in dot plot was not
reflected by the number of
repetitive elements, proteins, ESTs and predicted
genes
But, nucleotide matches in the areas of
similarity provided the features needed.
Nucleotide matches are not typically, in a
biological sense, features
They have been treated as features to assist
alignment
Better to align nucleotide matches rather than to
force an alignment between features that do not
match

29
Aim 2 Results (2)

Many features were not shared between the two
species
Feature density does not indicate the
appropriateness of a sequence for FBSA
Two sequences may be rich in features, but at the
same time share few features.
The number of features shared between 2 sequences
is a more meaningful calculation
Aim 2 successfully completed
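The difference between feature density and shared features can be shown with a small calculation. The sketch below is illustrative only: features are compared by name, density is expressed per 1,000 bp, and the function names and example data are hypothetical.

```python
from collections import Counter


def feature_density(feature_names, seq_len, per=1000):
    """Number of features per `per` base pairs of sequence."""
    return len(feature_names) * per / seq_len


def shared_feature_count(names_1, names_2):
    """Number of features present in both lists, counting repeats once per pair."""
    common = Counter(names_1) & Counter(names_2)  # keeps the minimum count
    return sum(common.values())


if __name__ == "__main__":
    # Hypothetical feature names: both lists are feature-rich but share little.
    mouse = ["B1", "B2", "B4", "ID", "B1", "EST-m1"]
    human = ["Alu", "Alu", "L1", "MIR", "L2", "EST-m1"]
    print(feature_density(mouse, 100_000), feature_density(human, 100_000))
    print(shared_feature_count(mouse, human))
```

In this made-up example both sequences have the same density, yet only one feature is shared, so density alone says little about suitability for FBSA.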

30
Aim 3: Develop a Visual FBSA Tool

Program design began while working on aim 2
Challenges in aim 2 prevented further work on aim
3
Became out of scope for the research
The program would need to be highly
user-interactive
Process of additional feature investigation is
human dependent
Overcome challenges of feature matching
An ideal research project in the future
The development of such a program would benefit
FBSA

31
Future Research

Automate FBSA process
With use of additional features
Investigate feature calculations
How much of the sequence needs to be made up of
shared features for FBSA to be worthwhile?
Develop cut-off values to assist user
FBSA of sequences from other species
Types of features needed for FBSA may vary
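One possible form for such a feature calculation is the fraction of a sequence covered by shared features, compared against a cut-off. The sketch below is hypothetical: the interval format (0-based, half-open), the function names and the 0.2 cut-off are placeholders, since suitable cut-off values are exactly what the future research would need to determine.

```python
def shared_coverage(shared_intervals, seq_len):
    """Fraction of the sequence covered by shared features.

    Intervals are (start, end) pairs, 0-based and half-open.
    Overlapping or touching intervals are merged before counting.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(shared_intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / seq_len


def worth_running_fbsa(coverage, cutoff=0.2):
    """Compare coverage with a cut-off (0.2 is an arbitrary placeholder)."""
    return coverage >= cutoff


if __name__ == "__main__":
    cov = shared_coverage([(0, 100), (50, 150), (300, 400)], 1000)
    print(cov, worth_running_fbsa(cov))
```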

32
Conclusion

FBSA is a new concept in the pair-wise alignment
of large genomic sequences with much future
possibility
Same-species and between-species FBSA have been
successfully demonstrated
Additional features have been explored and used
Future research would be most beneficial to its
development

33
Thank you

Questions?

34
Definitions

Chromosome: Structural carrier of hereditary
characteristics. A certain number of
chromosomes is characteristic of each species of
plant & animal, e.g. the potato has 48
chromosomes
Gene: The fundamental physical & functional unit
of heredity. A gene is an ordered sequence of
nucleotides located in a particular position on a
particular chromosome that encodes a specific
functional product (Attwood & Parry-Smith 1999)
Homology: A similar component in two organisms
(e.g. genes with strongly similar sequences) that
can be attributed to a common ancestor of the two
organisms during evolution (Mount, 2001)
Repetitive DNA: Sequences of varying lengths that
occur in multiple copies in the genome; it
represents much of the human genome. (Doe
Genomics)