Multiple-Alignment

1
Multiple-Alignment


2
Reference

Michael Brudno, Chuong B. Do, Gregory M. Cooper, Michael F. Kim, Eugene Davydov, NISC Comparative Sequencing Program, Eric D. Green, Arend Sidow, and Serafim Batzoglou. LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Res., Apr 2003; 13: 721-731. doi:10.1101/gr.926603
Michael Brudno, Alexander Poliakov, Asaf Salamov, Gregory M. Cooper, Arend Sidow, Edward M. Rubin, Victor Solovyev, Serafim Batzoglou, and Inna Dubchak. Automated Whole-Genome Multiple Alignment of Rat, Mouse, and Human. Genome Res., Apr 2004; 14: 685-692. doi:10.1101/gr.2067704

3
??

Multiple Sequence Alignment

AAA----CTGCAC----AG
A--CTG-CT--ACTG---G
---CTGACTGC----TTA-
NP-complete
4
LAGAN toolkit

http://lagan.stanford.edu/
LAGAN
Multi-LAGAN
Shuffle-LAGAN

5
LAGAN (overview)

Global alignment of two sequences:
Find local alignments. (seeds)
Compute a rough global map.
Restricted DP.

6
Multi-LAGAN (??)

??? K ? sequence
?? K-Clustering
??????? (phylogenetic-tree )
???????? sequence ????,??? K-1 ? LAGAN?

7
Phylogenetic Tree
8
????

????
ROSETTA Set
129 genes
???
?? 10 Kbp
?? 12 ???? CFTR Region
?? 1Mbp
????? annotated exon ? align ?? 11
???,??????????????

9
ROSETTA
Aligner   100% exons   90% exons   70% exons   Time (sec)
DIALIGN   89           96          98          388
MUMmer    0            1           3           17
GLASS     91           97          98          154
AVID      90           95          97          19
BlastZ    94           97          98          17
LAGAN     94           97          98          48
10
CFTR
11
MLAGAN
12
?? misaligned ???
13
Discussion

Multiple Alignments ??????????? Pairwise
Alignments ????
Local Alignment v.s. Global Alignment
MLAGAN ?????,???????
LAGAN / MLAGAN ?????????,????????????????

14
LAGAN
15
Three main steps

1. Generation of local alignments.
2. Construction of a rough global map.
3. Computation of the final global alignment.

16
1. Generation of Local Alignment

LAGAN uses CHAOS to find local homologies between two sequences.
Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou, and Burkhard Morgenstern. Fast and sensitive multiple alignment of long genomic sequences. BMC Bioinformatics, 4:66, 2003.
CHAOS works by chaining short words, the seeds, which match between the two sequences.
Anchor: a chain of seeds, i.e. a local alignment.

17
1. Generation of Local Alignment

k: word length; c: degeneracy.
A (k, c)-seed is a pair of k-long words that match with at most c differences between the two sequences.
d: maximum distance; s: maximum shift.
Suppose two seeds are x letters apart in one sequence and y letters apart in the other. They can be chained together if
$x < d$ and $y < d$
$|x - y| < s$
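
The chaining rule can be written as a small predicate. This is only a sketch: the `Seed` type, the field names, and the requirement that a chain moves forward in both sequences are assumptions made for illustration, and the shift test is read as an absolute difference.

```python
from typing import NamedTuple


class Seed(NamedTuple):
    pos1: int  # start of the k-word in sequence 1 (hypothetical field name)
    pos2: int  # start of the k-word in sequence 2 (hypothetical field name)


def can_chain(prev: Seed, cur: Seed, d: int, s: int) -> bool:
    """Return True if `cur` may be chained after `prev`."""
    x = cur.pos1 - prev.pos1  # distance apart in sequence 1
    y = cur.pos2 - prev.pos2  # distance apart in sequence 2
    if x <= 0 or y <= 0:      # assumption: chains only move forward
        return False
    # maximum distance d in each sequence, maximum shift s between diagonals
    return x < d and y < d and abs(x - y) < s
```

With d = 20 and s = 5, seeds at (10, 12) and (20, 25) can be chained (x = 10, y = 13, shift 3), while seeds at (10, 12) and (20, 40) cannot because y = 28 exceeds d.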

18
1. Generation of Local Alignment

Find seeds at the current location in seq1.
Find the previous seeds that fall into the search box.
Do a range query: seeds are indexed by their diagonal.
Pick the previous seed that maximizes the score of the chain.

Time: O(n log n), where n is the number of seeds.
19
1. Generation of Local Alignment

Scoring of Chains
Match and mismatch penalties for each pair of chained seeds.
Gap penalties proportional to $|x - y|$ for each pair of chained seeds.
Chains are discarded if they score under a threshold t.
Rapid rescoring
Applies to the chains that score above t.
Rescore them by performing ungapped extensions in both directions from each seed, finding the optimal location to insert exactly one gap of size $|x - y|$.

20
2. Construction of a Rough Global Map

$(b, e, b', e', s)$ represents a local alignment (anchor).
It runs from $(b, b')$ to $(e, e')$.
s is the score of the alignment.
$A_1 < A_2$ iff $e_1 < b_2$ and $e'_1 < b'_2$, where
$A_1 = (b_1, e_1, b'_1, e'_1, s_1)$
$A_2 = (b_2, e_2, b'_2, e'_2, s_2)$
A chain of local alignments $A_1 < A_2 < \dots < A_k$ has score $s_1 + s_2 + \dots + s_k$.
The optimal rough global map is the highest-scoring chain.
It is computed using sparse dynamic programming (LIS) in time O(n log n), where n is the total number of local alignments.
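
The highest-scoring chain can be illustrated with a plain quadratic dynamic program. This is a minimal sketch, not the sparse O(n log n) method named on the slide; the `Anchor` type and function names are assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple


class Anchor(NamedTuple):
    b1: int       # begin in sequence 1
    e1: int       # end in sequence 1
    b2: int       # begin in sequence 2
    e2: int       # end in sequence 2
    score: float


def precedes(a: Anchor, b: Anchor) -> bool:
    """a < b: a ends before b begins in both sequences."""
    return a.e1 < b.b1 and a.e2 < b.b2


def best_chain(anchors: List[Anchor]) -> Tuple[float, List[Anchor]]:
    """Return (total score, chain) of the highest-scoring chain, in O(n^2)."""
    order = sorted(anchors, key=lambda a: (a.b1, a.b2))
    if not order:
        return 0.0, []
    best = [a.score for a in order]   # best chain score ending at each anchor
    prev = [-1] * len(order)          # back-pointers for the traceback
    for j, aj in enumerate(order):
        for i in range(j):
            if precedes(order[i], aj) and best[i] + aj.score > best[j]:
                best[j] = best[i] + aj.score
                prev[j] = i
    j = max(range(len(order)), key=best.__getitem__)
    total = best[j]
    chain = []
    while j != -1:
        chain.append(order[j])
        j = prev[j]
    return total, chain[::-1]
```

Sorting by start position guarantees that any anchor that can precede another has already been processed, which is what lets a single left-to-right pass suffice.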

21
2. Construction of a Rough Global Map

Recursive anchoring
The choice of parameters k (length of seeds), c (maximum degeneracy of seeds), and t (score threshold) is a tradeoff between speed and sensitivity.
Speed: higher k, lower c.
Sensitivity: lower k, higher c.
To achieve a combination of speed and sensitivity, LAGAN first calls CHAOS with a restrictive set of parameters, then calls it again with more sensitive parameters in the regions between the anchors (local alignments) of the global map.

22
3. Computation of Global Alignment

Limits the search area around each anchor, which runs from $(i, j)$ to $(i', j')$:
The rectangle $(0, 0)$ to $(i+r, j+r)$.
The rectangle $(i'-r, j'-r)$ to $(M, N)$.
The band enclosed by the two diagonals
$(i-r, j+r)$ to $(i'-r, j'+r)$
$(i+r, j-r)$ to $(i'+r, j'-r)$
r is a parameter, typically 15.

23
3. Computation of Global Alignment

Apply Needleman-Wunsch dynamic programming to this limited area.
In this sense the anchors in LAGAN are more flexible than the anchors in MUMmer, AVID, and GLASS.
LAGAN's anchors provide only approximate locations through which the alignment should pass.
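
Restricting Needleman-Wunsch to a limited area can be sketched as follows. This is a deliberately simplified illustration: the allowed area is a single band of radius r around the main diagonal (as if one anchor spanned both sequences), gap costs are linear, and the scoring values are placeholders, none of which is taken from LAGAN itself.

```python
import math


def banded_nw_score(a: str, b: str, r: int,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> float:
    """Global alignment score using only cells with |i - j| <= r."""
    n, m = len(a), len(b)
    neg = -math.inf                       # marks cells outside the band
    prev = [neg] * (m + 1)
    for i in range(n + 1):
        cur = [neg] * (m + 1)
        for j in range(max(0, i - r), min(m, i + r) + 1):
            if i == 0 and j == 0:
                cur[j] = 0
                continue
            best = neg
            if i > 0 and j > 0:           # align a[i-1] with b[j-1]
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = prev[j - 1] + sub
            if i > 0:                     # gap in b
                best = max(best, prev[j] + gap)
            if j > 0:                     # gap in a
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev[m]                        # -inf if the end cell is outside the band
```

Only about (2r + 1) cells per row are evaluated, which is why the cost of a neck grows linearly with its length; the rows are kept full width here purely for readability.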

24
Memory-efficient Implementation

LAGAN performs the entire computation with memory
proportional to the size of the largest
rectangle.
LAGAN achieves this memory efficiency as follows:
1. Allocate working memory for one rectangle and the neck that follows it, and compute the Needleman-Wunsch matrix.
2. Trace back all optimal alignments ending in the cells at the rightmost column of the neck. In practice these soon converge upon a single optimal alignment.
3. Deallocate all working memory, except the memory necessary to keep the traced-back alignments.
4. Repeat steps 1 to 3 for the next rectangle and neck.

25
LAGAN Running Time Analysis

The running time of LAGAN is dominated by the rectangles.
The running time of the necks is $O(r(M+N))$, which is linear in the sequence lengths.
Suppose there are n anchors, and let $(x_0, y_0), \dots, (x_n, y_n)$ be the dimensions of the n+1 rectangles. Let $L_1 = \sum_i x_i$ and $L_2 = \sum_i y_i$ denote the total length of the inter-anchor segments in each sequence. We can assume the anchors will be aligned in linear time and therefore ignore their length.
26
LAGAN Running Time Analysis

The total number of cells in these rectangles is

The first term depends only on the effective
lengths of the sequences and the total number of
anchors.
If we assume a lower bound on acceptable anchor density, then $L_1 L_2 / n$ behaves linearly in sequence length, because $L_1/n$ and $L_2/n$ are O(1).

27
LAGAN Running Time Analysis

The total number of cells in these rectangles is

The second term is at most $n\,\sigma_x \sigma_y$, where $\sigma$ denotes the standard deviation.
Assume constant anchor density (a reasonable assumption for a fixed pair of organisms). The total is then linear in sequence length, provided the standard deviations are constant.
If the anchors are spaced evenly, and with a
constant density, the running time will be linear
in sequence length.
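
The two terms discussed on these slides follow from a standard identity. As a sketch only (the symbols $N$, $\bar{x}$, $\bar{y}$ are notation assumed here, and the slides write $n$ where this derivation uses $N = n + 1$, the number of rectangles):

```latex
% Rectangles (x_i, y_i), i = 0..n, with N = n + 1,
% L_1 = \sum_i x_i, L_2 = \sum_i y_i, \bar{x} = L_1 / N, \bar{y} = L_2 / N.
\begin{align*}
\sum_{i=0}^{n} x_i y_i
  &= N \bar{x}\bar{y} + \sum_{i=0}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
  &= \frac{L_1 L_2}{N} + \sum_{i=0}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
  &\le \frac{L_1 L_2}{N} + N \sigma_x \sigma_y
  \qquad \text{(Cauchy--Schwarz)}
\end{align*}
```

Here $\sigma_x$ and $\sigma_y$ are the population standard deviations of the rectangle widths and heights, so evenly spaced anchors make the second term vanish.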

28
References

LAGAN online
http://genome.lbl.gov/cgi-bin/VistaInput?align_pgm=lagan&num_seqs=2
http://ai.stanford.edu/~serafim/CS262_2005/index.html
LAGAN
http://lagan.stanford.edu/lagan_web/citing.shtml
Michael Brudno, Department of Computer Science, Stanford University. Algorithms for Alignment of Genomic Sequences. PGA Workshop, 07/16/2004.

29
LAGAN and Multi-LAGAN
Efficient Tools for Large-Scale Multiple
Alignment of Genomic DNA
30
Outline

LAGAN
Multi-LAGAN
Performance Evaluation

31
Multi-LAGAN
32
Multiple Alignment

A natural extension of 2-sequence comparisons
More difficult than pairwise
the running time scales as the product of the
lengths of all sequences
NP-complete problem
(need heuristic approaches)

33
Progressive Alignment

the most widely used heuristic approach
Successive applications of a pairwise alignment
algorithm
CLUSTALW (best-known) and MLAGAN

34
MLAGAN (Multi-LAGAN)

A multiple aligner based on progressive alignment
with LAGAN
2 main phases
(1) Progressive alignment with LAGAN
(2) (optional) Iterative improvement
1. successively remove each sequence
2. realign it

35
Algorithm MLAGAN

Input
K sequences $X_1, \dots, X_K$
A phylogenetic binary tree between them

36
Algorithm MLAGAN (cont.)

3 main steps
(1) Generation of rough global maps.
Find the rough global map between
each pair of sequences.
(step 1, 2 of LAGAN)

37
Algorithm MLAGAN (cont.)

(2) Progressive multiple alignment with
anchors.
2.1 Perform a global alignment
between the 2 closest sequences
according to the phylogenetic tree
using step 3 of LAGAN.

38
Algorithm MLAGAN (cont.)

2.2 Find the rough global maps of the new
multi-sequence to all other multi-
sequences.
(the scoring metric is detailed later)
2.3 Iterate steps 2.1, 2.2 (K-1 times).
Repeat until left with a multiple
alignment of all sequences.

39
Algorithm MLAGAN (cont.)

(3) (Optional) Iterative refinement
with anchors.
For each sequence Xi in the multiple
alignment
3.1 Find anchors between Xi and the other
sequences that align better than a
given cutoff.

40
Algorithm MLAGAN (cont.)

3.2 Align Xi to the multiple
alignment of the other
sequences with LAGAN.
(details later)

41
Align 2 Multi-sequences

In the order of the given phylogenetic tree.
E.g. 1. (human, chimpanzee)
2. (mouse, rat)
3. (human/chimpanzee, mouse/rat)
4. (human/chimpanzee/mouse/rat,
chicken)
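
The merge order in this example is a post-order walk of the phylogenetic tree. A minimal sketch, assuming the tree is given as nested 2-tuples of species names (a representation chosen here for illustration, not MLAGAN's input format):

```python
def merge_order(tree):
    """List the (left, right) pairs in the order a progressive aligner merges them."""
    steps = []

    def visit(node):
        if isinstance(node, str):        # leaf: a single sequence
            return node
        left, right = node               # internal node: align both subtrees first
        l_name, r_name = visit(left), visit(right)
        steps.append((l_name, r_name))
        return f"{l_name}/{r_name}"      # name of the resulting multi-sequence

    visit(tree)
    return steps


tree = ((("human", "chimpanzee"), ("mouse", "rat")), "chicken")
for step, pair in enumerate(merge_order(tree), start=1):
    print(step, pair)
```

For K = 5 leaves this produces the K-1 = 4 pairwise alignments listed above, in the same order.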

42
Align 2 Multi-sequences (cont.)

Step 2.2 of MLAGAN
E.g. compute the rough global map of the 2-sequence X/Y and the 1-sequence Z.
(1) Take the anchors in the rough global maps between X and Z, and between Y and Z.
(2) Reweigh overlapping anchors: $(s_1 + s_2) \cdot I / U$

43
Align 2 Multi-sequences (cont.)

I: length of intersection
U: length of union
(3) Take the highest-weight chain, by LIS.
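
The reweighing rule can be sketched on one coordinate axis. The interval representation and the choice to measure I and U along Z only are assumptions made for illustration; the slides do not say how the two dimensions of an anchor are combined.

```python
def reweigh(a, b):
    """Combine two overlapping anchors given as (start, end, score), half-open on Z.

    Returns (s1 + s2) * I / U, or None when the anchors do not overlap.
    """
    (s1_start, s1_end, s1), (s2_start, s2_end, s2) = a, b
    inter = max(0, min(s1_end, s2_end) - max(s1_start, s2_start))  # I
    if inter == 0:
        return None
    union = (s1_end - s1_start) + (s2_end - s2_start) - inter      # U
    return (s1 + s2) * inter / union
```

Two anchors that coincide exactly keep their full combined score (I = U), while a small overlap is down-weighted in proportion to how little of the union is shared.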

44
Scoring with Affine Gaps

An open research area
(T-COFFEE)
2 classical models
(1) sum-of-pairs model
(2) consensus model

45
Scoring with Affine Gaps (cont.)

sum-of-pairs model Sum of scores of all
pairwise alignments
consensus model
Create a consensus string by a majority vote at
each position.
Sum of pairwise scores between the consensus and
each individual sequence
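
The two models can be compared on a single alignment column. This is a sketch with placeholder match/mismatch values; gaps score 0 here, and ties in the majority vote are broken by first occurrence, both of which are assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_score(x: str, y: str, match: int = 1, mismatch: int = -1) -> int:
    """Substitution score of two characters; a gap on either side scores 0."""
    if x == "-" or y == "-":
        return 0
    return match if x == y else mismatch


def sum_of_pairs(column: str) -> int:
    """Sum of the scores of all pairs of characters in the column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def consensus_score(column: str) -> int:
    """Score of each character against the majority character of the column."""
    consensus = Counter(column).most_common(1)[0][0]
    return sum(pair_score(consensus, x) for x in column)
```

For the column `AAAC`, sum-of-pairs gives 0 (three matching and three mismatching pairs) while the consensus model gives 2, which shows how the two models can rank the same column differently.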

46
Scoring with Affine Gaps (cont.)

Each scoring scheme has advantages and disadvantages.
E.g. consensus
We use a combination of both:
sum-of-pairs for substitutions.
consensus for gaps.
P.S. Most similar to CLUSTALW: heuristically weighted per-sequence penalties for gaps.

47
Scoring with Affine Gaps (cont.)

Stacking effect (consensus with affine gaps)
Because gap-open penalties are large compared to match/mismatch scores, it is often favorable to artificially open additional gaps in order to stack the gap openings.
Solution: use a gap-end penalty (= gap-open penalty).

48
Scoring with Affine Gaps (cont.)

consensus string
ATCTGT---CAG

49
Scoring with Affine Gaps (cont.)

Define
$(A_{ij})$: $K \times L$ alignment matrix
$A_{ij} \in \{A, C, G, T, -\}$
$(B_{ij})$: $K \times L$ alignment matrix
$B_{ij} \in \{N, O, G, C\}$

50
Scoring with Affine Gaps (cont.)

Bij
O (gap-open): the ones opening a gap.
G (gap-continue): $A_{ij} = -$ except gap-open.
C (gap-close): the ones closing a gap.
N (nucleotide): $A_{ij} \ne -$ except gap-close.
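
One row of the label matrix can be derived from one aligned row as follows. The sketch assumes that "closing a gap" means the first nucleotide after a run of gap characters, which is how the definitions above read but is not stated explicitly.

```python
def label_row(row: str) -> str:
    """Map an aligned row over {A, C, G, T, -} to labels over {N, O, G, C}."""
    labels = []
    in_gap = False
    for ch in row:
        if ch == "-":
            labels.append("G" if in_gap else "O")  # first '-' opens, the rest continue
            in_gap = True
        else:
            labels.append("C" if in_gap else "N")  # first letter after a gap closes it
            in_gap = False
    return "".join(labels)


print(label_row("ATCTGT---CAG"))
```

Every gap therefore contributes exactly one O, and one C unless it runs to the end of the row.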

51
Scoring with Affine Gaps (cont.)

Let m, d, g, c be the match, mismatch, gap-open, and gap-continue penalties, respectively.
Define the function S(x, y) as follows:
$S(x, y) = 0$, if $x = -$ or $y = -$
$S(x, y) = m$, if $x = y$
$S(x, y) = d$, otherwise (i.e. $x \ne y$)

52
Scoring with Affine Gaps (cont.)

Let $N_j$, $O_j$, $G_j$, $C_j$ be the number of Ns, Os, Gs, Cs in the jth column of $(B_{ij})$, respectively.
Define the function
$T(j) = \min(O_j, K - O_j)\,(g + c) + \min(G_j, K - G_j)\,c + \min(C_j, K - C_j)\,g$
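
The per-column gap term can be computed directly from a column of labels. This sketch reads the garbled factor on the first line as (g + c), which is an assumption, and treats penalties as positive numbers for readability.

```python
from collections import Counter


def column_gap_cost(labels: str, gap_open: float, gap_continue: float) -> float:
    """Gap cost of one column of N/O/G/C labels, one label per sequence."""
    k = len(labels)                      # K, the number of sequences
    n = Counter(labels)                  # missing labels count as 0
    return (min(n["O"], k - n["O"]) * (gap_open + gap_continue)
            + min(n["G"], k - n["G"]) * gap_continue
            + min(n["C"], k - n["C"]) * gap_open)
```

With g = 5 and c = 1, the column `ONNN` costs 6, and `OOOC` costs 11: the three gap openings are charged as one, because the minority side of the column is what gets counted.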

53
Scoring with Affine Gaps (cont.)

The multiple alignment score is then

54
Iterative Refinement With Anchors

A shortcoming of progressive alignment
The initial pairwise alignments are fixed, and early errors can't be corrected later.

55
Iterative Refinement With Anchors (cont.)

Solution Iterative refinement.
2 versions
(1) standard
(2) limited-area

56
Iterative Refinement With Anchors (cont.)

(1) standard (e.g. CLUSTALW)
Repeat for a number of iterations, or until a local maximum is reached.
(2) limited-area
Constraint: stay within radius r.
Improves the alignment locally, but does not allow large-scale changes.

57
Iterative Refinement With Anchors (cont.)

MLAGAN
Uses the limited-area version (2), but allows larger-scale adjustments (by anchors).
Cumulative contribution:
$s_i = \max\{0,\ (\text{score of position } i) + s_{i-1}\}$
When $s_i \ge T$ (a threshold), reset $s_i = 0$ and create an anchor at position i.
Re-align the removed sequence using LAGAN with the anchors.
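
The cumulative-contribution rule is a single pass over per-position scores. A minimal sketch, assuming the scores are already available as a list (how MLAGAN derives them is not shown on the slide):

```python
def find_anchor_positions(scores, threshold):
    """Return the positions where the running score reaches the threshold."""
    anchors = []
    s = 0.0
    for i, score in enumerate(scores):
        s = max(0.0, score + s)   # s_i = max(0, score_i + s_{i-1})
        if s >= threshold:
            anchors.append(i)     # create an anchor at position i
            s = 0.0               # and reset the running score
    return anchors
```

For scores `[1, 1, -5, 2, 2, 2, -1, 3]` and a threshold of 4, the running score is reset by the poorly aligned position 2, and anchors are created at positions 4 and 7.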

58
Thanks!
59
Introduction(1/3)

Multiple sequence alignments provide
identification and characterization of functional
elements.
Similarity of evolutionary distances is detected
by multiple alignment of homologous sequences.
There have been several schemes of multiple
alignments for bacterial and yeast genomes.
There are several strategies for pairwise alignment of the human and mouse genomes, based on local or local/global techniques.

60
Introduction(2/3)

Comparing more than two large and structurally
complex genomes presents several challenges
Obtaining a consistent map between several
genomes
Performing large-scale multiple alignment
Visualizing and interpreting the results
The multiple alignment method of this paper
expands on the local/global approach of Couronne
and colleagues.

61
Introduction(3/3)

The technique is fully automated and efficient
It does not require a prebuilt synteny map.
It is able to align the three mammalian genomes in <1 day on a 24-node computer cluster.
The Multi-VISTA browser is developed with a user-friendly visualization approach for exploring conserved regions among multiple genomes.

62
Overview of Strategy for Multiple alignment(1/3)
Figure: General scheme of the method. White boxes show original and intermediate data; green boxes, mapping/alignment steps; and yellow and grey boxes, resulting data.
The mouse and rat genomes are aligned by BLAT, followed by LAGAN global alignment of selected regions.
The multi-contigs are aligned to human, using the union of all available BLAT local alignments from mouse to human and from rat to human; mouse or rat sequences that could not be aligned to the other rodent are also aligned to human.
63
Overview of Strategy for Multiple alignment(2/3)

This strategy allows more accurate prediction of the ortholog of each multi-contig in the human genome.
Only 0.8% (2 Mbp) of the rat genome and 7% of the rat contigs were mapped to multiple areas in the human genome,
compared with 4.4% of the genome and 32% of the contigs in the original technique.
The substitution matrices, derived specifically for the human, mouse, and rat genomes, are modified from LAGAN.

64
Overview of Strategy for Multiple alignment(3/3)

It generated 11235 areas of three-way alignments, 74 of which are longer than 200 Kbp in the human sequence.
Verifying the quality of the alignments:
Determining the percentage of whole genomes and protein-coding exons covered by high-scoring subalignments.
Comparing our alignments with a syntenic map generated independently, based on gene predictions.

65
Method

Part 1: Progressive Alignment Strategy
Part 2: Finding Genomic Synteny by Exons

66
Progressive Alignment Strategy

Genomes
April 2003 Human
February 2003 Mouse
June 2003 Rat
RepeatMasker
RefSeq

67
Work Flow
Global Align Genomes of Mouse and Rat
Aligned Sequences
Unaligned Sequences
Aligned to Human Genome
68
Mouse/Rat Alignment
1. The rat genome is divided into regions roughly 250 kb in size.
2. Use BLAT to map these regions to the mouse genome (local alignment).
3. Group and extend.
4. Filtering.
5. Use LAGAN to find the global alignment.
(Figure labels: Rat Genome, Mouse Genome, <L)
69
Mouse/Rat Alignment (cont.)

Filtering
Group score < 70,000
< 70% of the maximum
Looser threshold -> higher sensitivity?
LAGAN
Empirically derived gap penalties (-500 for mouse/rat and -800 for human/rodent)
fastreject option
Abandon the alignment if the homology looks weak. Currently tuned for human/mouse distance, or closer.
70
Human/Rodent Alignment
Rat/Mouse
Aligned and Unaligned seq. of rat (and mouse) are
aligned to Human genome. Similar procedure as
mouse/rat alignment.
Unaligned sequence
Aligned sequence
71
Human/Rodent Alignment (Cont.)

Use scorealign tool to clip aligned sequences of
mouse and rat.
scorealign
Hidden Markov Model
Given a pairwise alignment and a conservation cutoff k, it returns regions that are more likely to have resulted from a k% conservation model than from the background (25%) conservation model.

72
Finding Genomic Synteny by Exons

Get Predicted exons from genome
EST_MAP
Based on gene prediction
Fgenesh
Use Chains of Coding Exons to find Synteny
Initially build human/mouse and human/rat
pairwise map, and then resolve them into a single
three way map for human.

73
Get Exons from Genome (Cont.)

1. RefSeq mRNAs are mapped onto the genomes by the EST_MAP program.
2. Ab initio Fgenesh prediction is run on the rest of the genome.
3. Use BLAST to search for protein homologs of the genome regions from step 2.
4. Run Fgenesh+ on the products of step 3 (more accurate).
5. A second run of ab initio gene prediction covers regions without predictions in steps 1 to 4.
6. Fgenesh gene predictions are run in large introns of known and predicted genes.

74
Get Predicted Exons from Genome
Exons from running Fgenesh on large introns
Exons from second run of ab initio Fgenesh and Fgenesh+
Chromosome
Exons from ab initio Fgenesh and Fgenesh+
Exons from mapping RefSeq mRNAs (real exons!)
75
Use Chains of Coding Exons to Find Synteny

Compile a set of nonredundant, nonoverlapping
exons with at least 10 amino acids
In ascending order along each chromosome.

MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFG
TQRFFD
TAEEKAAVTSLW
AGGEALGRLLV
76
Use Chains of Coding Exons to Find Synteny (Cont.)

Use BLASTP to align each exon to an exon set from
the chromosome of the other organism.
Homologous exon chain: at least 10 exons.
Syntenic segment: shares at least 5 pairs of exons with bidirectional hits.
Syntenic segments are extended into Syntenic
blocks and used to create synteny map.

77
Exon-Based Map of Conserved Synteny(1/3)

Because most gene-prediction programs demonstrate higher accuracy in predicting exons than in predicting entire genes, a three-way synteny map is built based on chains of Fgenesh-predicted exons, rather than whole genes.
Built human/mouse and human/rat pairwise maps,
then resolved them into a single three-way map
for human, mouse, and rat.

78
Exon-Based Map of Conserved Synteny(2/3)

During the construction of pairwise maps, chains of exons are defined as follows:
the exons are in the same order in each of the two genomes, and at least 70% of the exons have homologs in the other genome.
Pairwise syntenic maps are merged into a
three-way synteny map by selecting a single
genome as a base and merging overlapping parts of
the pairwise maps.
The resulting map has a total of 4497 three-way
synteny segments.

79
Exon-Based Map of Conserved Synteny(3/3)

Among the 4497 segments, the mouse segment is absent in 191 cases (4.2%), and the rat segment in 315 cases (7%).
The total length of three-way synteny segments in
the human genome was 674 Mb, with average segment
length of 150 kb.
These segments are further extended into larger
blocks by merging those that are within 5 Mb of
each other in every genome.
This yields 494 synteny blocks shared among all three genomes.
The total length of three-way synteny blocks was
2351 Mb, with an average block length of 4.76Mb.
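
Merging nearby segments into blocks is an interval-merging pass. The sketch below is simplified to a single genome's coordinates, whereas the slide requires segments to be within 5 Mb of each other in every genome; the function name and tuple representation are assumptions.

```python
def merge_segments(segments, max_gap=5_000_000):
    """Merge (start, end) segments whose gap to the previous block is <= max_gap."""
    blocks = []
    for start, end in sorted(segments):
        if blocks and start - blocks[-1][1] <= max_gap:
            blocks[-1][1] = max(blocks[-1][1], end)   # extend the current block
        else:
            blocks.append([start, end])               # start a new block
    return [tuple(b) for b in blocks]
```

A three-genome version would apply the same distance test to each genome's coordinates and merge only when all three pass.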

80
Exon-based map of conserved synteny between the
rat, human, and mouse genomes. Each rat
chromosome (presented along the x-axis) contains
two columns, colored according to conserved
synteny with chromosomes of the human and mouse
genomes. Chromosome color scheme is shown at the
bottom.
81
Agreement Between Alignments and the Exon-Based
Map(1/2)

The multiple alignment of the three genomes and
the predicted exon-based synteny map produce
complementary, independent data sets that can be
used to evaluate the accuracy of both methods.
The alignments generated by the automatic alignment pipeline are compared with the exon-based synteny map.
A syntenic block and an alignment were considered
matching if they overlapped.

82
Agreement Between Alignments and the Exon-Based
Map(2/2)
>100 kb in human

The longer alignments exhibited greater than 97% agreement between the two maps.
But for very short alignments, the agreement dropped to 13%.

110 kb
83
Genome Coverage by Three-Way Alignments(1/2)

One way to evaluate alignment sensitivity is to
compute the percentage of the base pairs of all
genomes that are reliably aligned.
The scoring techniques developed for comparison of the human and mouse genomes are used to compute overall coverage, as well as coverage of RefSeq exons.

84
Genome Coverage by Three-Way Alignments(2/2)
85
Multi-VISTA Browser

To visualize the results of comparative sequence
analysis of multiple genomes in the VISTA format.
It can be accessed at http://pipeline.lbl.gov.
The Multi-VISTA Browser displays human-mouse-rat multiple alignments on the scale of whole chromosomes, along with annotations.
The user may select any of the three genomes as
the reference and display the level of
conservation between this reference and the
sequences of the other two species in a
particular interval.

86
Multi-VISTA Browser
87
Discussion

By comparing the alignment to an independently generated map of protein synteny between the genomes, the paper concludes that 97% of alignments with a human sequence >100 kbp, and 87% of all alignments, agree with the map.
The difference between these numbers can be
explained by the lower accuracy of both alignment
and synteny map generation when dealing with very
short regions of conserved synteny.

88
Discussion

Only 3.4% of the human base pairs in the whole genome alignment are within such nonmatching regions.
One drawback of global alignments is their inability to deal with small rearrangement events.
A previous study has suggested that as much as 2% of the gene-coding regions of the human genome may have undergone some local rearrangement events since the divergence between human and the rodents, and the local/global approach often is not able to cope with these events.

89
Discussion

Additional genomes will help to verify the
quality of existing alignments and provide the
biologists with additional comparative
information with which to judge the evolutionary
importance of a region.
Adding several other mammalian genomes will
possibly allow us to locate constraints at the
individual base pair level.
The availability of these genomes would make
possible the use of comparative sequence analysis
in new areas, such as the determination of
individual binding sites.

90
??-syteny

?????????
??????????????????????????????????????????????????
??????????????????,??????????????????????,????????
??????????????????????,??????????????????????(inve
rsion, translocation, duplication)????????????????
????????????????,?????????????????????,???????????
????????????synteny?

http://vschool.scu.edu.tw/biology/content/genetics/ge082.htm
91
??-syteny

????,?????????????????????????,????????(syntenic)?
????????????

http://vschool.scu.edu.tw/biology/content/genetics/ge082.htm
92
??- RefSeq

RefSeq (Reference Sequences): reference records such as DNA contigs and mRNAs. Accession numbers consist of 2 letters followed by 6 digits, e.g. NT_123456, NM_123456, NC_123456, NG_123456, XM_123456, XR_123456 (see http://www.ncbi.nlm.nih.gov/RefSeq/).

http://www.ascc.sinica.edu.tw/nl/92/1922/02.txt
93
??-orthologs

orthologs/orthologous (????)
?????????(??????????)??????????????

http://pastime.cgu.edu.tw/petang/Bioinfomatics/Lecture/93-1/93-1_11/CGU_Bioinformatics_20041203.files/frame.htm#slide0040.htm
94
Reference

http://www.cf.ac.uk/biosi/research/biosoft/Help/Topics/evolutionaryDistance.html
http://vschool.scu.edu.tw/biology/content/genetics/ge082.htm
http://pastime.cgu.edu.tw/petang/Bioinfomatics/Lecture/93-1/93-1_11/CGU_Bioinformatics_20041203.files/frame.htm#slide0040.htm
http://www.ascc.sinica.edu.tw/nl/92/1922/02.txt
http://en.wikipedia.org/wiki/Binding_site
http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=76933989&c=chr7&g=softberryGene