1
COMPARATIVE GENOMICS: GENOME-WIDE ANALYSIS IN METAZOAN EUKARYOTES

Mao-Feng Ger
02/08/2006

Completely sequenced genomes can be used for large-scale comparative analysis
Effective methods for handling this enormous amount of data are the objective
Main areas in comparative genomics:
Whole-genome alignment
Gene prediction
Regulatory-region prediction

Introduction
Whole-genome alignments
Gene prediction
Finding regulatory regions
Conclusion

5
Sequenced genomes

In the NCBI Genome Project database, up to 02/05, the numbers of genome projects were:
Archaea: 51 (complete, draft assembly, in progress)
Bacteria: 821 (complete, draft assembly, in progress)
Eukaryota: 391 (complete, draft assembly, in progress, organelles)
Of the eukaryotic projects, 174 are metazoan.

6
Comparative genomics

Presumption: two genomes derive from a common ancestor, so every base pair is the product of the ancestral genome and the action of evolution
Evolution = mutation + selection
Can be represented by a rate matrix
Selection
Negative selection
Neutral selection
Positive selection

7
BLOSUM and PAM rate matrices

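PAM (Percent Accepted Mutation)
Derived from a set of proteins that are at least 85% identical
The numeric suffix is the number of times the base matrix is multiplied by itself (e.g. PAM250 is PAM1 raised to the 250th power)
BLOSUM (BLOcks SUbstitution Matrix)
More empirical, and derived from a larger dataset
Constructed by extracting ungapped segments (blocks) from a set of aligned protein families
The numeric suffix x means that sequences at least x% identical within the blocks were clustered together (e.g. BLOSUM62 uses a 62% threshold)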
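
The PAM suffix rule above can be illustrated with a small sketch. The 3-letter alphabet and the values in `TOY_PAM1` are hypothetical, chosen only to keep the arithmetic visible; real PAM matrices cover 20 amino acids, and the published scoring matrices are log-odds transforms of such probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical one-step matrix over a toy 3-letter alphabet (an assumption,
# not real PAM1 data): each letter is kept with probability 0.99 and changes
# to each of the other two letters with probability 0.005.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

# "PAM250" for the toy alphabet: the one-step matrix multiplied by itself 250 times.
toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after the multiplication, while the diagonal shrinks toward the uniform value as the suffix grows, which is why high-numbered PAM matrices suit more distant comparisons.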

8
Difficulties in aligning genomes

We know so little about evolutionary processes that we had better focus on functional sequences
Because of differences in genome size and in how complete each genome is, whole-genome alignment is difficult
More and more programs now handle large-scale comparisons, and biologists need to understand these approaches

10
Precomputed alignments

Several groups have made large cross-species comparisons available:
UC Santa Cruz/Penn State (translated BLAT or BLASTZ)
Berkeley Genome Pipeline (BLAT/AVID)
Ensembl (Phusion/BLASTN)

11
Whole-genome alignment
12
Which genome to align

Sufficient similarity between genomes enables easy identification of homologous regions
Example: DNA alignment between human and mouse led to the discovery of new genes and gene regulatory regions
Alignment between human and pufferfish, though harder, is still feasible

13
Comparing genomes at protein level

Distantly related genomes can be difficult to align at the nucleotide level
Comparing at the protein level may lose information that helps find new genes and regulatory sequences
It is better to start with closely related genomes

14
Alignment strategy

Dynamic programming makes alignment tractable as long as you follow a few rules
Needleman-Wunsch aligns sequences globally

Smith-Waterman aligns sequences locally
Cell scores are never negative (minimum 0)
Traceback stops at a cell with score 0
However, there are limitations:
Cannot handle rearrangements such as inversion, duplication, or translocation
For long sequences (>10,000 bp), very expensive in time and memory
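
The difference between the two dynamic-programming schemes can be sketched in a few lines. This is a minimal score-only version under assumed scoring values (match +1, mismatch -1, linear gap -2); it omits traceback and is not how any of the tools in this talk are implemented.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the best global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # local scores are clamped at 0
                best = max(best, cell)   # the best local score can be anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                       # global score
print(alignment_score("TTTACGTTTT", "GGACGGG", local=True)) # local score
```

The full `rows x cols` table is what makes the method expensive in time and memory for long sequences: two 10,000 bp inputs already need about 10^8 cells.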

16
Seeding strategy

A correct alignment usually contains stretches of ungapped matches
So first find a set of ungapped matches (seeds)
Then extend a gapped alignment from where the seeds occur
Some loss in sensitivity, rewarded by savings in time and memory
Seed models: consecutive and weighted spaced
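
The consecutive seed model can be sketched as a k-mer index lookup. The function names and the example sequences are illustrative assumptions; real aligners use far more compact index structures.

```python
from collections import defaultdict


def build_kmer_index(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, target, k=11):
    """Return (query_pos, target_pos) pairs where k consecutive bases match exactly."""
    index = build_kmer_index(target, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            seeds.append((q, t))
    return seeds


target = "GGGACGTACGTTAGCCC"
query = "TTACGTACGTTAGAA"
print(find_seeds(query, target, k=11))  # [(2, 3)]
```

Only positions that share an exact 11-mer are reported, so a homologous region with a mismatch every few bases produces no seed at all. That is the loss in sensitivity that buys the speed.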

Simply put:
Seeding
Seeds are used as nucleation points for extension
Dynamic programming produces the gapped alignments
This review focuses on four whole-genome alignment methods:
BLASTZ
BLAT/AVID
BLAT/LAGAN
WABA

18
BLASTZ

Local aligners, like BLASTZ and BLAT, are highly sensitive but less specific
BLASTZ applies several methods to increase sensitivity and specificity
Seeding: instead of 11 consecutive matches, newer versions of BLASTZ use a weighted spaced seed (12 matching positions out of 19, tolerating a transition at one of the 12)
Extend the seeds without gaps
Extend with gaps; low-complexity matches are down-weighted first

20

For the mouse-human alignment, BLASTZ uses a scoring matrix derived from known mouse-human homologous regions
A post-processing step is needed to sort out the most significant orthologues among multiple matches
Overall, BLASTZ covered 98% of the coding regions in the mouse and human genomes, indicating that it is highly sensitive for identifying well-conserved regions

21
BLAT

A local aligner
Untranslated mode: designed to align cDNA to genomic sequence; less effective below 90% identity
Translated mode: more effective for genome comparison. With repeats and low-complexity regions masked, the run is faster and the output cleaner
Produces a set of ungapped alignments: fast, at the expense of overall sensitivity
Used in the human-pufferfish genome comparison

22
Global alignment

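3 steps:
1. Find the maximal repeated regions (maximal matches shared by the two sequences)
2. Anchor on clean matches first, then on repeat matches
Apply steps 1 and 2 recursively to the regions between anchors
3. For remaining regions <4 kb, use the Needleman-Wunsch algorithm; for regions >4 kb, report no significant alignment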
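
The recursive anchoring idea in the steps above can be sketched with a simplified stand-in. This version picks the single longest exact match as the anchor (found by a quadratic table, not the match-finding structure a real tool would use), ignores the clean/repeat distinction, and simply stops where a real aligner would fall back to Needleman-Wunsch. The `min_anchor` threshold is a hypothetical parameter.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest exact match between a and b."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def anchor_chain(a, b, min_anchor=4, off_a=0, off_b=0):
    """Recursively choose exact-match anchors; returns them in left-to-right order."""
    start_a, start_b, length = longest_common_substring(a, b)
    if length < min_anchor:
        # No usable anchor: a real aligner would align this short gap directly
        # or report no significant alignment.
        return []
    left = anchor_chain(a[:start_a], b[:start_b], min_anchor, off_a, off_b)
    right = anchor_chain(a[start_a + length:], b[start_b + length:], min_anchor,
                         off_a + start_a + length, off_b + start_b + length)
    return left + [(off_a + start_a, off_b + start_b, length)] + right


a = "ACGTACGTACTTTGGATCC"
b = "ACGTACGTACAAGGATCC"
print(anchor_chain(a, b))  # [(0, 0, 10), (13, 12, 6)]
```

Because each recursive call only looks between existing anchors, the chain is always collinear, which is also why this style of global alignment cannot represent inversions or translocations.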

24
AVID

Assumption: the sequences are strictly homologous, with no gene duplication, inversion, or translocation
When applied to a whole genome, it needs a pre-processing step to identify syntenic regions

25
LAGAN

The advantage of LAGAN over AVID is that it can
align larger sequences
Lower memory requirement
Different matching algorithm in step 1 (exact matches are not required)
In conjunction with BLAT, it has been applied to
rat-human and rat-mouse comparisons

26
MLAGAN

An extension of LAGAN
Performs multiple alignment
Aligns closely related genomes first, then incorporates the others in order of phylogenetic distance

27
WABA

Takes genetic code degeneracy into account
The seeding step works on nucleotides and uses a weighted spaced 6-of-8 rule, which allows the third codon position to mismatch
No extension step; instead, compatible seeds are grouped to define homologous regions
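
A 6-of-8 seed that ignores the third codon position can be sketched with a mask. The exact mask layout `11011011` and the helper names are assumptions for illustration; the slide only gives the 6-of-8 rule and the tolerated third position.

```python
# Assumed mask layout: '1' positions must match, '0' positions (every third
# base, the wobble position) are ignored.
WOBBLE_MASK = "11011011"


def masked_key(window, mask=WOBBLE_MASK):
    """Keep only the bases at positions where the mask is '1'."""
    return "".join(base for base, keep in zip(window, mask) if keep == "1")


def seeds_match(window_a, window_b, mask=WOBBLE_MASK):
    """True if two 8-base windows agree at all six unmasked positions."""
    return masked_key(window_a, mask) == masked_key(window_b, mask)


# GCT/GCC both encode Ala and GAA/GAG both encode Glu: the two windows differ
# only at third codon positions, so they still seed under the mask.
print(seeds_match("GCTGAAGC", "GCCGAGGC"))  # True
print(seeds_match("GCTGAAGC", "GATGAAGC"))  # False: second position differs
```

Under a consecutive 8-mer rule the first pair would not seed, even though the encoded amino acids are identical.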

28
Biological correctness

There is no single best way to do alignments
We do not understand evolution well enough to say which method is superior
Different algorithms are tuned to different genome comparisons (e.g. BLASTZ for human-mouse, WABA for C. elegans-C. briggsae)
Purposes differ:
Align as much as possible, regardless of selection (e.g. AVID, LAGAN, BLASTZ)
Identify conserved regions that are under selection (e.g. BLAT, WABA)

Most programs aim to maximize the number of homologous base pairs, while biologists are interested in regions conserved for a function
For example, in the mouse-human alignment 40% of the genome is alignable, but only 6% is under selection
To make things worse, the substitution rate varies across the genome

31
Defining gene structure

Still a challenge because of the poor signal-to-noise ratio
Comparison between closely related genomes can provide additional information (dual-genome gene predictors)
Different programs make different assumptions, so users need to know their strengths and limitations

32
Dual genome gene predictor

Models a specific type of negative selection
Assumption: in alignments, most differences are neutral, and regions with few mutations are conserved
Combines other information, such as splicing signals and the wobble effect, to get a better model

Can be subdivided into 3 classes:
Pair-HMM: takes a mathematical approach to determine gene structure and alignment jointly
Informant approaches: fix the alignment and use it to improve gene prediction
Exon-finding approaches: try to demarcate exons without splicing them together

34
Pair-HMM

An HMM can be used to predict gene structure in a single genome
A pair-HMM finds the most likely path to have generated both sequences, providing the alignment as well as the gene prediction
Contains two sets of orthologous genes
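
The single-genome case in the first point can be illustrated with a toy two-state HMM decoded by the Viterbi algorithm. The states, probabilities, and the idea of separating regions by GC content are all hypothetical simplifications; real gene finders model exons, introns, and splice signals, and a pair-HMM additionally emits aligned pairs of bases from two sequences.

```python
import math

# Hypothetical toy model: all numbers below are assumptions for illustration.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
}


def viterbi(seq):
    """Return the most likely state path for a non-empty DNA string (log space)."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


seq = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
path = viterbi(seq)
print("".join("C" if s == "coding" else "n" for s in path))
```

The decoder labels the GC-rich block as one state and the AT-rich flanks as the other because ten favourable emissions outweigh the cost of two state switches.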

Two pair-HMM approaches:
SLAM
DoubleScan
Both need to optimize parameters for a specific species and for better efficiency
SLAM uses AVID for a rough alignment, while DoubleScan uses BLAST

36
Informant approaches

Use only one sequence to predict gene structure; the other sequence just provides additional information through its alignment
Can take not only assembled genome sequence but also other inputs, such as unassembled reads
3 methods are available: TwinScan, SGP-2, GenomeScan
Need precise parameters, so each has its own alignment method (often BLAST)

37
Exon prediction

A carefully parameterized TBLASTX method designed to provide specific exon predictions from Tetraodon
Sacrifices a certain amount of sensitivity for high specificity

38
Which method to use

A particularly successful strategy combined informant methods with some simple criteria
It produces strong predictions in mammalian genomes

40
Finding regulatory regions

Called phylogenetic footprinting (by analogy with DNase footprinting)
Functionally important regions accumulate fewer mutations
These cis-regulatory motifs can be determined by:
Finding common motifs in orthologous sequences
Aligning orthologous sequences first, then identifying common regions
Previously known motifs might help
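
The align-then-scan route can be sketched as a sliding-window identity scan over a pairwise alignment. The window size, the identity threshold, and the convention that `-` marks a gap are assumptions for illustration.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.8):
    """Return (start, identity) for alignment windows at or above min_identity.

    Both inputs are rows of one pairwise alignment, so they must have the same
    length; '-' is treated as a gap and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aligned_a) - window + 1):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


a = "ACGTACGTAC" + "T" * 10
b = "ACGTACGTAC" + "G" * 10
print(conserved_windows(a, b))  # [(0, 1.0), (1, 0.9), (2, 0.8)]
```

Overlapping hits would normally be merged into a single candidate footprint; a fixed identity cut-off is also only a rough proxy, since the background substitution rate varies across the genome.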

41
Which region to use

5' and 3' flanking regions as well as intronic sequences
Difficulties in finding regulatory regions:
The 5' end is often the least well-defined, so we need experimental evidence of promoters
Enhancers could be several kilobases away
In addition to experimental evidence, guesswork and systematic comparison are needed to find potential cis-regulatory regions

42
Evolutionary issues

Two orthologous genes might have very different cis-regulatory elements, as paralogous genes do
How evolution affects cis-regulatory motifs is still poorly understood
Comparisons within mammals show a large amount of non-functional conservation, whereas across vertebrates conservation is hard to detect

Neutral drift can destroy or create cis-regulatory sites at a certain rate
However, the expression pattern may change little or not at all
This raises the possibility of compensatory mutations
Recently, some researchers have tried to distinguish regulatory regions from neutrally evolving DNA using genome sequence alignments

44
Motif overrepresentation

Motif finding programs do not consider the
phylogenetic relationship between homologous
sequences
This can be overcome by identifying DNA motifs evolving at a slower rate than the surrounding sequences
All motif-finding techniques work better with increasing amounts of sequence
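
Why more sequences help can be shown with a deliberately naive motif finder that reports the k-mers present in every input sequence. The function name and the example sequences are hypothetical, and this sketch ignores phylogeny and mismatches entirely, which is the weakness the first point describes.

```python
def shared_kmers(sequences, k):
    """Return the set of k-mers that occur in every input sequence."""
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in sequences]
    return set.intersection(*kmer_sets) if kmer_sets else set()


seqs = ["AATATAAGGC", "CATATAAGTT", "GTATAAGACC"]

# Two sequences leave two candidate 6-mers ...
print(sorted(shared_kmers(seqs[:2], 6)))  # ['ATATAA', 'TATAAG']
# ... and adding a third narrows the list to one.
print(sorted(shared_kmers(seqs, 6)))      # ['TATAAG']
```

Each additional sequence can only shrink the candidate set, so chance matches are filtered out as more orthologous regions are included.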

45
Alignment for finding regulatory region

Aligning regions of homology in the non-coding regions near orthologous genes
More and more studies show that cis-regulatory elements are in non-conserved regions

47
Conclusion

With more genomes being sequenced, we can investigate the effects of evolution on specific regions across entire genomes
With precomputed data, users can focus on the biological questions
3 advances still need to be made:
More genomes are needed to improve power
Power can also be improved by knowing how negative selection works under different functional constraints
We need to know more about positive selection