Title: Multiple-Alignment
1Multiple-Alignment
2Reference
- Michael Brudno, Chuong B. Do, Gregory M. Cooper,
Michael F. Kim, Eugene Davydov, NISC Comparative
Sequencing Program, Eric D. Green, Arend Sidow,
and Serafim BatzoglouLAGAN and Multi-LAGAN
Efficient Tools for Large-Scale Multiple
Alignment of Genomic DNAGenome Res., Apr 2003
13 721 - 731 doi10.1101/gr.926603 - Michael Brudno, Alexander Poliakov, Asaf Salamov,
Gregory M. Cooper, Arend Sidow, Edward M. Rubin,
Victor Solovyev, Serafim Batzoglou, and Inna
DubchakAutomated Whole-Genome Multiple Alignment
of Rat, Mouse, and HumanGenome Res., Apr 2004
14 685 - 692 doi10.1101/gr.2067704
3??
- Multiple Sequence Alignment
AAA----CTGCAC----AG A--CTG-CT--ACTG---G ---CTGACTG
C----TTA-
NP-Complete
4LAGAN toolkit
- http//lagan.stanford.edu/
- LAGAN
- Multi-LAGAN
- Suffle-LAGAN
5LAGAN (??)
- ????? sequence ? global alignment
- Find local alignments. (seeds)
- Compute a rough global map.
- Restricted DP.
6Multi-LAGAN (??)
- ??? K ? sequence
- ?? K-Clustering
- ??????? (phylogenetic-tree )
- ???????? sequence ????,??? K-1 ? LAGAN?
7Phylogenetic Tree
8????
- ????
- ROSETTA Set
- 129 genes
- ???
- ?? 10 Kbp
- ?? 12 ???? CFTR Region
- ?? 1Mbp
- ????? annotated exon ? align ?? 11
???,??????????????
9ROSETTA
Aligner 100 exons 90 exons 70 exons Time (sec)
DIALIGN 89 96 98 388
MUMmer 0 1 3 17
GLASS 91 97 98 154
AVID 90 95 97 19
BlastZ 94 97 98 17
LAGAN 94 97 98 48
10CFTR
11MLAGAN
12?? misaligned ???
13Discussion
- Multiple Alignments ??????????? Pairwise
Alignments ???? - Local Alignment v.s. Global Alignment
- MLAGAN ?????,???????
- LAGAN / MLAGAN ?????????,????????????????
14LAGAN
15Three main steps
- 1.Generation of local alignments.
- 2. Construction of a rough global map.
- 3. Computation of the final global alignment.
161. Generation of Local Alignment
- LAGAN uses CHAOS to find local homologies between
two sequence. - Michael Brudno, Michael Chapman, Berthold
Gottgens, Serafim Batzoglou, and Burkhard
Morgenstern Fast and sensitive multiple
alignment of long genomic sequences. BMC
Bioinformatics, 466 2003. - CHAOS works by chaining short words, the seeds,
which match between the two sequence. - Anchor chain of seeds, local alignment.
171. Generation of Local Alignment
- k word length, c degeneracy
- A (k, c)-seed is a pair of k-long words that
match with at most c differences between the two
sequence. - d maximum distance , s maximum shift.
- Two seeds are x-letters and y-letters apart. They
can be chained together if - x lt d and y lt d
- x - y lt s
181. Generation of Local Alignment
- Find seeds at current location in seq1
- Find the previous seeds that fall into the
search box - Do a range query seeds are indexed by their
diagonal. - Pick a previous seed that maximizes the score
of chain -
Time O(n log n), where n is number of seeds.
191. Generation of Local Alignment
- Scoring of Chains
- I love SWEET COW (oo)
- Match and mismatch penalties for each pair of
chained seed. - Gap penalties proportional to x y for each
pair of chained seed. - Chains are threw away if they score under a
threshold t. - Rapid rescoring
- For the chains that score above t.
- Rescore them by performing ungapped extensions in
both directions from each seed. Finding the
optimal location to insert exactly one gap of
size x y
202. Construction of a Rough Global Map
- (b, e, b, e, s) represent a local alignment
(anchor). - From (b, b) to (e, e)
- s is the score of the alignment
- A1 lt A2 iff e1 lt b2 and e1 lt b2
- A1 (b1, e1, b1, e1, s1)
- A2 (b2, e2, b2, e2, s2)
- A chain of local alignment A1 lt A2 lt lt Ak, has
score s1 s2 sk. - The optimal rough global map is the
highest-scoring chain. - Computed using Sparse Dynamic Programming LIS,
in time O(nlogn), n is the total number of local
alignment.
212. Construction of a Rough Global Map
- Recursive anchoring
- The choice of parameter k (length of seeds), d
(maximum degeneracy of seeds), and t (score
threshold) is a tradeoff between speed and
sensitivity. - Speed higher k, lower c.
- Sensitivity lower k, higher c.
- To achieve combination of speed and sensitivity,
LAGAN calls CHAOS with a restrictive set of
parameters in the regions between each anchor
(local alignment) of the global map.
223. Computation of Global Alignment
- Limits the area for each anchor
- The rectangle (0, 0) to (ir, i-r).
- The rectangle (i-r, j-r) to (M, N).
- The band enclosed by the two diagonals
- (i-r, jr) to (i-r, jr)
- (ir, j-r) to (Ir, j-r)
- r is a parameter, typically 15.
233. Computation of Global Alignment
- Do dynamic programming method Needleman-Wunsch to
this limited area. - In this sense the anchors in LAGAN are more
flexible than the anchors in MUMer, AVID, and
GLASS. - LAGAN provide only approximate locations by which
the alignment should pass.
24Memory-efficient Implementation
- LAGAN performs the entire computation with memory
proportional to the size of the largest
rectangle. - LAGAN achieves this memory efficiency as follow
- Allocates working memory for one rectangle and
the neck that follows it. Compute
Needleman-Wunsch matrix. - Traces back all optimal alignments ending in the
cells at the rightmost column of the neck. - Soon converge upon a single optimal alignment in
practice. - Deallocates all working memory, except the memory
necessary to keep the traced-back alignments. - Repeat step 1 to step 3 for the next rectangle
and neck.
25LAGAN Running Time Analysis
- The running time of LAGAN is dominated by the
rectangles. - The running time of necks is Or(MN), which
is linear in the sequence lengths. - Suppose there are n anchors, let (x0, y0),,(xn,
yn) be dimension of the n1 rectangles. Let - denote the total length of the inter-anchor
segments in each sequence. We can asume the
anchors will be aligned in linear time and
therefore ignore their length.
and
26LAGAN Running Time Analysis
- The total number of cells in these rectangles is
- The first term depends only on the effective
lengths of the sequences and the total number of
anchors. - If we assume a lower bound on acceptable anchor
density, then L1L2/n behaves linearly in sequence
length, because L1/n and L2/n areO(1).
27LAGAN Running Time Analysis
- The total number of cells in these rectangles is
- The second term is at most nsx sy where s
denotes the standard deviation. - Assuming constant anchor density. (reasonable
assumption for a fixed pair of organism.) Thus,
linear in sequence length provided the standard
deviations are constant. - If the anchors are spaced evenly, and with a
constant density, the running time will be linear
in sequence length.
28References
- LAGAN online
- http//genome.lbl.gov/cgi-bin/VistaInput?align_pgm
lagannum_seqs2 - http//ai.stanford.edu/serafim/CS262_2005/index.h
tml - LAGAN
- http//lagan.stanford.edu/lagan_web/citing.shtml
- Algorithms for Alignment of Genomic,
SequencesMichael Brudno, Department of Computer
Science, Stanford UniversityPGA Workshop
07/16/2004
29LAGAN and Multi-LAGAN
Efficient Tools for Large-Scale Multiple
Alignment of Genomic DNA
30Outline
- LAGAN
- Multi-LAGAN
- Performance Evaluation
31Multi-LAGAN
32Multiple Alignment
- A natural extension of 2-sequence comparisons
- More difficult than pairwise
- the running time scales as the product of the
lengths of all sequences - NP-complete problem
- (need heuristic approaches)
33Progressive Alignment
- the most widely used heuristic approach
- Successive applications of a pairwise alignment
algorithm - CLUSTALW (best-known) and MLAGAN
34MLAGAN (Multi-LAGAN)
- A multiple aligner based on progressive alignment
with LAGAN - 2 main phases
- (1) Progressive alignment with LAGAN
- (2) (optional) Iterative improvement
- 1. successively remove each sequence
- 2. realign it
35Algorithm MLAGAN
- Input
- K sequences X1,,XK
- A phylogenetic binary tree between them
-
36Algorithm MLAGAN (cont.)
- 3 main steps
- (1) Generation of rough global maps.
- Find the rough global map between
- each pair of sequences.
- (step 1, 2 of LAGAN)
37Algorithm MLAGAN (cont.)
- (2) Progressive multiple alignment with
- anchors.
- 2.1 Perform a global alignment
- between the 2 closest sequences
- according to the phylogenetic tree
- using step 3 of LAGAN.
38Algorithm MLAGAN (cont.)
- 2.2 Find the rough global maps of the new
- multi-sequence to all other multi-
- sequences.
- (details scoring metric in later)
- 2.3 Iterate steps 2.1, 2.2 (K-1 times).
- Repeat until left with a multiple
- alignment of all sequences.
39Algorithm MLAGAN (cont.)
- (3) (Optional) Iterative refinement
- with anchors.
- For each sequence Xi in the multiple
- alignment
- 3.1 Find anchors between Xi the other
- sequences that align better than a
- given cutoff.
40Algorithm MLAGAN (cont.)
- 3.2 Align Xi to the multiple
- alignment of the other
- sequences with LAGAN.
- (details in later)
41Align 2 Multi-sequences
- In the order of the given phylogenetic tree.
- E.g. 1. (human, chimpanzee)
- 2. (mouse, rat)
- 3. (human/chimpanzee, mouse/rat)
- 4. (human/chimpanzee/mouse/rat,
- chicken)
42Align 2 Multi-sequences (cont.)
- Step 2.2 of MLAGAN
- E.g. Compute the rough global map of
- 2-sequence X/Y and 1-sequence Z
- (1) Anchors in the rough global maps
- between X Z, Y Z.
- (2) Reweigh overlapped anchors
- (s1s2)I/U
43Align 2 Multi-sequences (cont.)
- I length of intersection
- U length of union
- (3) The highest weight chain, by LIS.
44Scoring with Affine Gaps
- An open research area
- (T-COFFEE)
- 2 classical models
- (1) sum-of-pairs model
- (2) consensus model
-
45Scoring with Affine Gaps (cont.)
- sum-of-pairs model Sum of scores of all
pairwise alignments - consensus model
- Create a consensus string by a majority vote at
each position. - Sum of pairwise scores between the consensus and
each individual sequence
46Scoring with Affine Gaps (cont.)
- Each scoring scheme has advantages
disadvantages. - E.g. consensus
- We use a combination of both
- sum-of-pairs gt substitutions.
- consensus gt gaps.
- p.s. Most similar to CLUSTALW
- ? heuristically weighted per-sequence
penalties gt gaps
47Scoring with Affine Gaps (cont.)
- Stacking effect (consensus affine-gap)
- Because gap-open penalties are large compared
to match mismatch scores, often it is favorable
to artificially open additional gaps in order to
stack the gap openings. - Solution use gap-end penalty
- ( gap-open penalty)
48Scoring with Affine Gaps (cont.)
- consensus string
- ATCTGT---CAG
49Scoring with Affine Gaps (cont.)
- Define
- (Aij) K L alignment matrix
- Aij belongs to A, C, G, T, -
- (Bij) K L alignment matrix
- Bij belongs to N, O, G, C
50Scoring with Affine Gaps (cont.)
- Bij
- O (gap-open) the ones opening a gap.
- G (gap-continue) Aij- except gap-open.
- C (gap-close) the ones closing a gap.
- N (nucleotide) Aij?- except gap-close.
51Scoring with Affine Gaps (cont.)
- Let m, d, g, c be the match, mismatch, gap-open,
gap-continue penalties, respectively. - Define the function S(x, y) as follows
- 0, if x- or y-
- m, if xy
- d, otherwise. (i.e. x?y)
52Scoring with Affine Gaps (cont.)
- Let Nj, Oj, Gj, Cj be the number of Ns, Os,
Gs, Cs in the jth column of (Bij),
respectively. - Define the function
- T(i) min(Oj, K-Oj) (gc)
- min(Gj, K-Gj) c
- min(Cj, K-Cj) g
53Scoring with Affine Gaps (cont.)
- The multiple alignment score is then
-
-
54Iterative Refinement With Anchors
- A shortcoming of progressive alignment
- The initial pairwise alignments are fixed, and
early errors cant be corrected later.
55Iterative Refinement With Anchors (cont.)
- Solution Iterative refinement.
- 2 versions
- (1) standard
- (2) limited-area
56Iterative Refinement With Anchors (cont.)
- (1) standard (e.g. CLUSTALW)
- Repeat it for a of iterations, or until
a - local maximum is reached.
- (2) limited-area
- Constraint (stay within radius r).
- Improve the alignment locally, not allow
- large-scale changes.
57Iterative Refinement With Anchors (cont.)
- MLAGAN
- Introduce (2) limited-area version, but of
allowing larger-scale adjustments (by anchors). - Cumulative contribution
- si max 0, (the score of position i) si-1
- When si ?threshold T, reset si 0 and an
anchor is created at position i. - Re-align the removed sequence using LAGAN with
the anchors.
58 Thanks! ?
59Introduction(1/3)
- Multiple sequence alignments provide
identification and characterization of functional
elements. - Similarity of evolutionary distances is detected
by multiple alignment of homologous sequences. - There have been several schemes of multiple
alignments for bacterial and yeast genomes. - Several strategies for pairwise genome alignments
of human and mouse genomes based on local or
local/global technique.
60Introduction(2/3)
- Comparing more than two large and structurally
complex genomes presents several challenges - Obtaining a consistent map between several
genomes - Performing large-scale multiple alignment
- Visualizing
- Interpreting the results
- The multiple alignment method of this paper
expands on the local/global approach of Couronne
and colleagues.
61Introduction(3/3)
- The technique is fully automated and efficient
- It dose not require a prebuilt synteny map
- It is able to align the three mammalian genomes
in lt1 day on 24-node computer cluster. - The Multi-VISTA browser is developed with an
user-friendly visualization approach for
exploring conserved regions among multiple
genomes.
62Overview of Strategy for Multiple alignment(1/3)
General scheme of the method. White boxes show
original and intermediate data green boxes,
mapping/alignment steps and yellow and grey
boxes, resulting data.
The multi-contigs are aligned to human, using
the union of all available BLAT local
alignments from mouse to human and from rat to
human mouse or rat sequences that could not be
aligned to the other rodent are also aligned to
human.
The mouse and rat genomes are aligned by BLAT,
followed by LAGAN global alignment of selected
regions.
Multi-contigs
63Overview of Strategy for Multiple alignment(2/3)
- This strategy allows to predict more accurately
the ortholog of each multi-contig in the human
genome - Only 0.8(2Mbp) of the rat genome and 7 of the
rat contigs were mapped to multiple areas in the
human genome. - Compared with 4.4 of the genome and 32 of the
contigs in the original technique. - The substitution matrices derived specifically
for the human, mouse, and rat genomes is modified
from LAGAN.
64Overview of Strategy for Multiple alignment(3/3)
- It generated 11235 areas of three-way alignments,
74 of which are longer then 200 Kbp in the human
sequence. - Verifing the quality of the alignments
- Determining the percentage of whole genomes and
protein-coding exons covered by high-scoring
subalignments. - Comparing our alignments with a syntenic map
generated independently, based on gene
predictions.
65Method
- Part 1 Progressive Alignment Strategy
- Part 2 Finding Genomic Synteny by Exons
66Progressive Alignment Strategy
- Genomes
- April 2003 Human
- February 2003 Mouse
- June 2003 Rat
- RepeatMasker
- RefSeq
67Work Flow
Global Align Genomes of Mouse and Rat
Aligned Sequences
Unaligned Sequences
Aligned to Human Genome
68Mouse/Rat Alignment
4. Filtering
3.Group and extend.
1. Rat genome is divided into regions roughly
250kb in size.
Mouse Genome
Rat Genome
ltL
5. Use LAGAN to find global alignment.
Mouse Genome
2. Use BLAT to map these region with mouse.
(Local alignment)
Mouse Genome
ltL
ltL
Mouse Genome
69Mouse/Rat Alignment (cont.)
- Filtering
- Group score lt70,000
- lt70 of the maximum
- Looser threshold ? higher sensitivity?
- LAGAN
- Empirically derived gap penalties (-500 for
mouse/rat and -800 for human/rodent) - fastreject option
- Abandon the alignment if the homology looks
weak. Currently tuned for human/mouse distance,
or closer.
!
70Human/Rodent Alignment
Rat/Mouse
Aligned and Unaligned seq. of rat (and mouse) are
aligned to Human genome. Similar procedure as
mouse/rat alignment.
Unaligned sequence
Aligned sequence
71Human/Rodent Alignment (Cont.)
- Use scorealign tool to clip aligned sequences of
mouse and rat. - scorealign
- Hidden Markov Model
- Given a pairwise alignment and conservation
cutoff k, it returns regions that are more
likely to have resulted from a k conservation
model rather than the background (25)
conservation model.
72Finding Genomic Synteny by Exons
- Get Predicted exons from genome
- EST_MAP
- Based on gene prediction
- Fgenesh
- Use Chains of Coding Exons to find Synteny
- Initially build human/mouse and human/rat
pairwise map, and then resolve them into a single
three way map for human.
73Get Exons from Genome (Cont.)
- RefRNAs are mapped onto genomes by the EST_MAP
program. - Ab intio Fgenesh prediction is run on the rest of
genome. - Use BLAST to search protein homologs of genome
region of 2. - Run Fgenesh to production 3. (more accurate)
- Second run of ab intio gene prediction to regions
without prediction in 1 to 4. - Fgenesh gene predictions are run in large introns
of known and predicted genes.
74Get Predicted Exons from Genome
Exon from running Fgenesh on large intron
Exons from second run of ab inio Fgenesh and
Fgenesh
Chromosome
Exons from ab intio Fgenesh and Fgenesh
Exons from mapping RefSeq mRNAs (Real exons!)
75Use Chains of Coding Exons to Find Synteny
- Compile a set of nonredundant, nonoverlapping
exons with at least 10 amino acids - In ascending order along each chromosome.
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFG
TQRFFD
TAEEKAAVTSLW
AGGEALGRLLV
76Use Chains of Coding Exons to Find Synteny (Cont.)
- Use BLASTP to align each exon to an exon set from
the chromosome of the other organism. - Homologous exon chain 10 exons at least
- Syntenic segment
- Share at least 5 pairs of exons with
bidirectional hits. - Syntenic segments are extended into Syntenic
blocks and used to create synteny map.
70
77Exon-Based Map of Conserved Synteny(1/3)
- Because most gene-prediction programs demonstrate
higher accuracy in predicting exons than in
predicting entire genes. - A three-way synteny map is built based on chains
of Fgenesh predicted exons, rather than whole
genes. - Built human/mouse and human/rat pairwise maps,
then resolved them into a single three-way map
for human, mouse, and rat.
78Exon-Based Map of Conserved Synteny(2/3)
- During the construction of pairwise maps, chains
of exons are defined . - The same order in each of the two genomes is at
least 70 of the exons have homologs in the other
genome. - Pairwise syntenic maps are merged into a
three-way synteny map by selecting a single
genome as a base and merging overlapping parts of
the pairwise maps. - The resulting map has a total of 4497 three-way
synteny segments.
79Exon-Based Map of Conserved Synteny(3/3)
- Among the 4497 segments, the mouse segment is
absent in 191 cases (4.2), and the rat segment
in 315 cases (7). - The total length of three-way synteny segments in
the human genome was 674 Mb, with average segment
length of 150 kb. - These segments are further extended into larger
blocks by merging those that are within 5 Mb of
each other in every genome. - Finding 494 synteny blocks is shared among all
three genomes. - The total length of three-way synteny blocks was
2351 Mb, with an average block length of 4.76Mb.
80Exon-based map of conserved synteny between the
rat, human, and mouse genomes. Each rat
chromosome ( presented along the x-axis) contains
two columns, colored according to conserved
synteny with chromosomes of the human and mouse
genomes. Chromosome color scheme is shown at the
bottom.
81Agreement Between Alignments and the Exon-Based
Map(1/2)
- The multiple alignment of the three genomes and
the predicted exon-based synteny map produce
complementary, independent data sets that can be
used to evaluate the accuracy of both methods. - The alignments generated by the automatic
alignment pipeline with the exon-based synteny
map are compared. - A syntenic block and an alignment were considered
matching if they overlapped.
82Agreement Between Alignments and the Exon-Based
Map(2/2)
gt100 kb in human
- The longer alignments exhibited greater than 97
agreement between the two maps. - But for very short alignments, the correlation is
dropped to 13.
110 kb
83Genome Coverage by Three-Way Alignments(1/2)
- One way to evaluate alignment sensitivity is to
compute the percentage of the base pairs of all
genomes that are reliably aligned. - The scoring techniques developed for comparison
of the human and mouse genomes is used to
computed overall coverage, as well as coverage of
RefSeq exons.
84Genome Coverage by Three-Way Alignments(2/2)
85Multi-VISTA Browser
- To visualize the results of comparative sequence
analysis of multiple genomes in the VISTA format. - It can be accessed at http//pipeline.lbl.gov.
- The Multi-VISTA Browser displays humanmouserat
multiple alignments on the scale of whole
chromosomes, along with annotations. - The user may select any of the three genomes as
the reference and display the level of
conservation between this reference and the
sequences of the other two species in a
particular interval.
86Multi-VISTA Browser
87Discuss
- By comparing the alignment to an independently
generated map of protein synteny between the
genomes, it concludes that 97 of alignments with
a human sequence gt100 kbp, and 87 of all
alignments, agree with the map. - The difference between these numbers can be
explained by the lower accuracy of both alignment
and synteny map generation when dealing with very
short regions of conserved synteny.
88Discuss
- Only 3.4 of the human base pairs in the whole
genome alignment are within such nonmatching
regions. - One drawback of global alignments is their
inability to deal with small rearrangement
events. - A previous study has suggested that as much as 2
of the gene-coding regions of the human genome
may have undergone some local rearrangement
events. - Since the divergence between human and the
rodents, and the local/global approach often is
not able to cope with these events.
89Discuss
- Additional genomes will help to verify the
quality of existing alignments and provide the
biologists with additional comparative
information with which to judge the evolutionary
importance of a region. - Adding several other mammalian genomes will
possibly allow us to locate constraints at the
individual base pair level. - The availability of these genomes would make
possible the use of comparative sequence analysis
in new areas, such as the determination of
individual binding sites.
90??-syteny
- ?????????
- ??????????????????????????????????????????????????
??????????????????,??????????????????????,????????
??????????????????????,??????????????????????(inve
rsion, translocation, duplication)????????????????
????????????????,?????????????????????,???????????
????????????synteny?
http//vschool.scu.edu.tw/biology/content/genetics
/ge082.htm
91??-syteny
- ????,?????????????????????????,????????(syntenic)?
????????????
http//vschool.scu.edu.tw/biology/content/genetics
/ge082.htm
92??- RefSeq
- RefSeq?Reference Sequences???-????????DNA
contigs??????mRNAs??????????mRNAs???
???????????????Accession numbers????
??2??????6???,??NT_123456?NM_123456?
NC_123456?NG_123456?XM_123456?XR_123456(??
http//www.ncbi.nlm.nih.gov/RefSeq/)?
http//www.ascc.sinica.edu.tw/nl/92/1922/02.txt
93??-orthologs
- orthologs/orthologous (????)
- ?????????(??????????)??????????????
http//pastime.cgu.edu.tw/petang/Bioinfomatics/Lec
ture/93-1/93-1_11/CGU_Bioinformatics_20041203.file
s/frame.htmslide0040.htm
94Reference
- http//www.cf.ac.uk/biosi/research/biosoft/Help/To
pics/evolutionaryDistance.html - http//vschool.scu.edu.tw/biology/content/genetics
/ge082.htm - http//pastime.cgu.edu.tw/petang/Bioinfomatics/Lec
ture/93-1/93-1_11/CGU_Bioinformatics_20041203.file
s/frame.htmslide0040.htm - http//www.ascc.sinica.edu.tw/nl/92/1922/02.txt
- http//en.wikipedia.org/wiki/Binding_site
- http//genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid769
33989cchr7gsoftberryGene