Title: Design and creation of multiple sequence alignments Unit 15
1Design and creation of multiple sequence alignments, Unit 15
- BIOL221T Advanced Bioinformatics for
Biotechnology
Irene Gabashvili, PhD
2IPA 6.0 license
- Need a list of e-mails to create accounts
- Will have a 6-week license (instead of 2 weeks)
- Problem Set 3 is Pathway Analysis; the Lab of March 19 will also use IPA
3Problem Set 2 Review
- Sensitivity and Specificity
- Parameters for Multiple Alignment (Databases,
Search Terms, Scores) - Transfac
- Dotplots
4Gene prediction flowchart
Fig 5.15 Baxevanis Ouellette 2005
5Evaluation of Splice Site Prediction
What do measures really mean?
Fig 5.11 Baxevanis Ouellette 2005
Note typo in BO
6ROC curves (plots of Sn vs (1-Sp))
- A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied.
- The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test; they also depend on the definition of what constitutes an abnormal test.
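A minimal Python sketch of how the points of a ROC curve arise as the threshold is varied. The function name `roc_points` and the toy (score, is_true_site) data are illustrative assumptions, not part of the lecture material.

```python
def roc_points(scored):
    """Return (1 - Sp, Sn) pairs, one per threshold, for (score, label) data.

    A site is predicted positive when its score >= threshold.
    Assumes at least one true and one false example.
    """
    pos = sum(1 for _, label in scored if label)
    neg = len(scored) - pos
    points = []
    for threshold in sorted({s for s, _ in scored}, reverse=True):
        tp = sum(1 for s, label in scored if s >= threshold and label)
        fp = sum(1 for s, label in scored if s >= threshold and not label)
        sn = tp / pos            # sensitivity = TP / (TP + FN)
        sp = (neg - fp) / neg    # specificity = TN / (TN + FP)
        points.append((1 - sp, sn))
    return points


# Hypothetical splice-site scores: (score, is a real site?)
toy = [(0.9, True), (0.8, True), (0.7, False),
       (0.6, True), (0.4, False), (0.2, False)]
for fpr, sn in roc_points(toy):
    print(f"1-Sp={fpr:.2f}  Sn={sn:.2f}")
```

Lowering the threshold moves along the curve from the lower left (few predictions, high Sp) to the upper right (everything predicted, Sn = 1).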
7Evaluation of Splice Site Prediction
8Careful: different definitions for "Specificity"
Brendel definitions
cf. Guigó definitions
- Sn (Sensitivity) = TP/(TP+FN)
- Sp (Specificity) = TN/(TN+FP) = Sp-
- AC (Approximate Coefficient) = 0.5 x (TP/(TP+FN) + TP/(TP+FP) + TN/(TN+FP) + TN/(TN+FN)) - 1
Other measures? Predictive Values, Correlation Coefficient
9Best measures for comparing different methods?
- ROC curves (Receiver Operating Characteristic?!!)
- http://www.anaesthetist.com/mnm/stats/roc/
- "The Magnificent ROC" - has fun applets & quotes
- "There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"
- Correlation Coefficient
- Matthews correlation coefficient (MCC)
- MCC = 1 for a perfect prediction
- MCC = 0 for a completely random assignment
- MCC = -1 for a "perfectly incorrect" prediction
Just FYI
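The measures above can be sketched in Python from the four confusion counts. This is an illustrative helper (the name `prediction_measures` is an assumption), using Sp = TN/(TN+FP) as on the slide and the standard MCC formula.

```python
import math


def prediction_measures(tp, fp, tn, fn):
    """Sn, Sp, AC and MCC from confusion counts.

    All four denominators (TP+FN, TN+FP, TP+FP, TN+FN) must be non-zero.
    """
    sn = tp / (tp + fn)          # sensitivity
    sp = tn / (tn + fp)          # specificity (TN-based definition)
    ppv = tp / (tp + fp)         # positive predictive value
    npv = tn / (tn + fn)         # negative predictive value
    ac = 0.5 * (sn + ppv + sp + npv) - 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "AC": ac, "MCC": mcc}


print(prediction_measures(tp=40, fp=10, tn=45, fn=5))
```

Note that TP/(TP+FP), the second term of AC, is the quantity that some gene-prediction papers call "specificity", which is the source of the confusion flagged on slide 8.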
10Promoters: What signals are there? Simple ones in prokaryotes
11Prokaryotic promoters
- RNA polymerase complex recognizes promoter sequences located very close to, and on the 5' side (upstream) of, the initiation site
- RNA polymerase complex binds directly to these, with no requirement for transcription factors
- Prokaryotic promoter sequences are highly conserved
- -10 region
- -35 region
12Simpler view of complex promoters in eukaryotes
Fig 5.12 Baxevanis Ouellette 2005
13Eukaryotic genes are transcribed by 3 different
RNA polymerases
Recognize different types of promoters & enhancers
14Eukaryotic promoters & enhancers
- Promoters located relatively close to initiation site
- (but can be located within gene, rather than upstream!)
- Enhancers also required for regulated transcription
- (these control expression in specific cell types, developmental stages, in response to environment)
- RNA polymerase complexes do not specifically recognize promoter sequences directly
- Transcription factors bind first and serve as landmarks for recognition by RNA polymerase complexes
15Eukaryotic transcription factors
- Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription
- TFs contain characteristic DNA binding motifs
- http://www.ncbi.nlm.nih.gov/books/bv.fcgi?ridgenomes.table.7039
- TFs recognize specific short DNA sequence motifs: "transcription factor binding sites"
- Several databases for these, e.g. TRANSFAC
- http://www.generegulation.com/cgibin/pub/databases/transfac
16Zinc finger-containing transcription factors
- Common in eukaryotic proteins
- Estimated 1% of mammalian genes encode zinc-finger proteins
- In C. elegans, there are 500!
- Can be used as highly specific DNA binding modules
- Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy
17Promoter prediction: Eukaryotes vs prokaryotes
Promoter prediction is easier in microbial genomes
Why?
- Highly conserved
- Simpler gene structures
- More sequenced genomes! (for comparative approaches)
Methods?
- Previously mostly HMM-based
- Now similarity-based comparative methods, because so many genomes are available
18Predicting promoters: Steps & Strategies
- Closely related to gene prediction!
- Obtain genomic sequence
- Use sequence-similarity based comparison
- (BLAST, MSA) to find related genes
- But "regulatory" regions are much less well-conserved than coding regions
- Locate ORFs
- Identify TSS (if possible!)
- Use promoter prediction programs
- Analyze motifs, etc. in sequence (TRANSFAC)
19Predicting promoters: Steps & Strategies
- Identify TSS - if possible?
- One of biggest problems is determining exact TSS!
- Not very many full-length cDNAs!
- Good starting point? (human & vertebrate genes)
- Use FirstEF
- found within UCSC Genome Browser
- or submit to FirstEF web server
Fig 5.10 Baxevanis Ouellette 2005
20Automated promoter prediction strategies
- Pattern-driven algorithms
- Sequence-driven algorithms
- Combined "evidence-based"
- BEST RESULTS? Combined, sequential
21Promoter Prediction: Pattern-driven algorithms
- Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)
- Tend to produce huge numbers of FPs
- Why?
- Binding sites (BS) for specific TFs often variable
- Binding sites are short (typically 5-15 bp)
- Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
- One binding site often recognized by multiple TFs
- Biology is complex: promoters often specific to organism/cell/stage/environmental condition
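A back-of-the-envelope Python sketch of why short binding sites produce so many false positives. It assumes random DNA with independent, equiprobable bases, which real genomes are not; the function name and the 1 Mb sequence length are illustrative.

```python
def expected_chance_hits(site_length, sequence_length, both_strands=True):
    """Expected number of exact matches to one fixed site in random DNA.

    Assumes independent bases with P(A) = P(C) = P(G) = P(T) = 0.25.
    """
    positions = sequence_length - site_length + 1
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** site_length


# Sites of 5, 10 and 15 bp scanned against 1 Mb of sequence
for k in (5, 10, 15):
    print(k, expected_chance_hits(k, 1_000_000))
```

Under these assumptions a 5 bp site is expected by chance roughly once per kilobase per strand, so most matches cannot be functional; allowing variable positions in the site only raises the count.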
22Promoter Prediction: Pattern-driven algorithms
- Solutions to problem of too many FP predictions?
- Take sequence context/biology into account
- Eukaryotes: clusters of TFBSs are common
- Prokaryotes: knowledge of σ factors helps
- Probability of "real" binding site increases if annotated transcription start site (TSS) nearby
- But: What about enhancers? (no TSS nearby!)
- Only a small fraction of TSSs have been experimentally mapped
- Do the wet lab experiments!
- But: "Promoter-bashing" is tedious
23Promoter Prediction: Sequence-driven algorithms
- Assumption: common functionality can be deduced from sequence conservation
- Alignments of co-regulated genes should highlight elements involved in regulation
- Careful: How to determine co-regulation?
- Orthologous genes from different species
- Genes experimentally determined to be co-regulated (using microarrays??)
- Comparative promoter prediction
- "Phylogenetic footprinting" - more later.
24Promoter Prediction: Sequence-driven algorithms
- Problems
- Need sets of co-regulated genes
- For comparative (phylogenetic) methods
- Must choose appropriate species
- Different genomes evolve at different rates
- Classical alignment methods have trouble with translocations, inversions in order of functional elements
- If background conservation of the entire region is high, comparison is useless
- Not enough data (Prokaryotes >>> Eukaryotes)
- Biology is complex: many (most?) regulatory elements are not conserved across species!
25Examples of promoter prediction/characterization
software
Lab used: MATCH, MatInspector, TRANSFAC, MEME, MAST, BLAST, etc.
Others?
- FirstEF
- Dragon Promoter Finder (also see Dragon Genome Explorer, which has specialized promoter software for GC-rich DNA, finding CpG islands, etc.)
- JASPAR
26TRANSFAC matrix entry for TATA box
- Fields
- Accession ID
- Brief description
- TFs associated with this entry
- Weight matrix
- Number of sites used to build (How many here?)
- Other info
Fig 5.13 Baxevanis Ouellette 2005
27Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)
Fig 5.14 Baxevanis Ouellette 2005
28GenBank IDs and Accessions
- http://www.ncbi.nlm.nih.gov/RefSeq/key.htmlaccessions (Accession Formats, RefSeq)
- http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html (GenBank Sample Record)
29Why do we do multiple alignments?
- Help prediction of the secondary and tertiary structures of new sequences
- Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.
30An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
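One thing a multiple alignment makes visible is which columns are conserved. A minimal Python sketch, using the eight aligned sequences from this slide; the `column_identity` helper is an illustrative assumption, not a tool from the lecture.

```python
from collections import Counter

alignment = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def column_identity(seqs):
    """Per column: fraction of all sequences carrying the most common residue.

    Gap characters never count as the shared residue.
    """
    fractions = []
    for column in zip(*seqs):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        fractions.append(top / len(seqs))
    return fractions


identity = column_identity(alignment)
conserved = [i + 1 for i, frac in enumerate(identity) if frac == 1.0]
print(conserved)  # 1-based positions of fully conserved columns
```

For this alignment the fully conserved columns are the cysteine at position 5 and the tryptophan at position 21.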
31Visualization example
32Other multiple alignment programs
ClustalW / ClustalX, pileup, multalign, multal, saga, hmmt, DIALIGN, SBpima, MLpima, T-Coffee, ...
34ClustalW- for multiple alignment
- ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
- Alignment can be done by 2 methods
- slow/accurate
- fast/approximate
35Running ClustalW
clustalw
CLUSTAL W (1.7) Multiple Sequence Alignments
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice:
36Running ClustalW
The input file for ClustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
37Using ClustalW
MULTIPLE ALIGNMENT MENU
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = OFF
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press RETURN to go back to main menu
Your choice:
38Output of ClustalW
CLUSTAL W (1.7) multiple sequence alignment

HSTNFR     GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP  GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA     -------------------------------------------TGTCCAG------ACAG
CATTNFAA   GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM    AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA    AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1    GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR    GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA    GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683   GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
39ClustalW options
Your choice: 5
PAIRWISE ALIGNMENT PARAMETERS
Slow/Accurate alignments:
1. Gap Open Penalty: 15.00
2. Gap Extension Penalty: 6.66
3. Protein weight matrix: BLOSUM30
4. DNA weight matrix: IUB
Fast/Approximate alignments:
5. Gap penalty: 5
6. K-tuple (word) size: 2
7. No. of top diagonals: 4
8. Window size: 4
9. Toggle Slow/Fast pairwise alignments = SLOW
H. HELP
Enter number (or RETURN to exit)
40ClustalW options
Your choice: 6
MULTIPLE ALIGNMENT PARAMETERS
1. Gap Opening Penalty: 15.00
2. Gap Extension Penalty: 6.66
3. Delay divergent sequences: 40
4. DNA Transitions Weight: 0.50
5. Protein weight matrix: BLOSUM series
6. DNA weight matrix: IUB
7. Use negative matrix: OFF
8. Protein Gap Parameters
H. HELP
Enter number (or RETURN to exit)
41Blocks database and tools
- Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
- The Blocks web server tools are Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.
- They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.
42The BLOCKS web server
- At URL http://blocks.fhcrc.org/
- The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks.
- The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.
43The Blocks Searcher tool
- For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.
- This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
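The sliding procedure described above can be sketched in Python. This is a simplified illustration: the profile here is a plain residue count per column, a crude stand-in for the calibrated position-specific scores of the real BLOCKS database, and the function names and toy block are assumptions.

```python
from collections import Counter


def block_profile(segments):
    """One Counter per block column: residue -> count (toy scoring scheme)."""
    return [Counter(column) for column in zip(*segments)]


def best_block_hit(sequence, profile):
    """Slide the block along the sequence, summing column scores at each offset.

    Returns (best_score, 1-based start), or None if the sequence is shorter
    than the block.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best


# Hypothetical ungapped block of four aligned segments
toy_block = ["WYQQLPG", "WYQQLPG", "WVRQAPG", "WVRQPPG"]
profile = block_profile(toy_block)
print(best_block_hit("MKTAYWVRQAPGKLE", profile))
```

A database search repeats `best_block_hit` for every block and keeps the highest-scoring alignments.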
44The Blocks Searcher tool
- Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and is further strengthened if a third block also scores highly, and so on.
45The BLOCKS Database
- The blocks for the BLOCKS database are made
automatically by looking for the most highly
conserved regions in groups of proteins
represented in the PROSITE database. These blocks
are then calibrated against the SWISS-PROT
database to obtain a measure of the chance
distribution of matches. It is these calibrated
blocks that make up the BLOCKS database.
46The Block Maker Tool
- Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.
- Input file must contain at least 2 sequences.
- Input sequences must be in FastA format.
- Results are returned by e-mail.
47Progressive Approaches
- CLUSTALW
- Perform pairwise alignments
- Construct a tree, joining most similar sequences first (guide tree)
- Align sequences sequentially, using the phylogenetic tree
- PILEUP
- Similar to CLUSTALW
- Uses UPGMA to produce tree (chapter 6)
48Clustal method
- Higgins and Sharp 1988
- ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237-244.
- Progressive alignment method
- An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one
49First step
A B C D
Compute the pairwise alignments for all against all (6 pairwise alignments); the similarities are stored in a table
D C B A
A
11 B
1 3 C
10 2 2 D
50Second step
D C B A
A
11 B
1 3 C
10 2 2 D
- Cluster the sequences to create a tree (guide tree)
- Represents the order in which pairs of sequences are to be aligned
- Highly similar sequences are neighbors in the tree
- Highly distant sequences are distant from each other in the tree
51Third step
Align most similar pairs
Align the alignments as if each of them was a
single sequence (with the use of a consensus
sequence or a profile)
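A rough Python sketch of the second step, turning the similarity table into a join order. It assumes the table is read as a lower triangle (A-B 11, A-C 1, B-C 3, A-D 10, B-D 2, C-D 2) and uses greedy average-linkage clustering on similarities, a simplification rather than the tree-building method ClustalW itself uses. The function name is illustrative.

```python
from itertools import combinations


def guide_tree(names, similarity):
    """Greedy average-linkage clustering; most similar clusters are joined first.

    Returns a nested tuple describing the order of joins.
    """
    clusters = {name: (name,) for name in names}  # label -> member sequences
    trees = {name: name for name in names}        # label -> nested tuple

    def avg_sim(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda ab: avg_sim(*ab))
        merged = a + "+" + b
        clusters[merged] = clusters.pop(a) + clusters.pop(b)
        trees[merged] = (trees.pop(a), trees.pop(b))
    return next(iter(trees.values()))


# Assumed reading of the slide's table
similarity = {
    frozenset("AB"): 11, frozenset("AC"): 1, frozenset("BC"): 3,
    frozenset("AD"): 10, frozenset("BD"): 2, frozenset("CD"): 2,
}
print(guide_tree("ABCD", similarity))
```

With these numbers A and B are aligned first, D is then added to the A/B alignment, and C joins last.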
52Clustal programs
- ClustalV
- ClustalW
- Thompson et al., 1994
- Uses sequence weighting, position-specific gap penalties and weight matrix choice
- W stands for weighted sequences
- ClustalX - window-based implementation
53ClustalW method rules (1) sequence weighting
- Each sequence is weighted according to how different it is from the other sequences.
- Useful for the case where one specific subfamily is overrepresented in the data
54ClustalW method rules (2) weight matrix choice
- The substitution matrix used for each alignment
step depends on the similarity of the sequences.
55ClustalW method rules (3) position-specific gap penalties
- Gaps found in initial alignments remain fixed through the process (ends gap)
- Hydrophobic residues have higher gap penalties than hydrophilic
- they are more likely to be in the hydrophobic core, where gaps should not occur.
56ClustalW method shortcomings
- (1) Sequences that are similar only in sub-regions
- ClustalW forces a global alignment, not a local one.
- (2) A sequence that contains a large insertion/deletion compared to the rest will strongly affect the alignment
- (again global not local).
57ClustalW method shortcomings
- (3) A sequence that contains a repetitive element (such as a domain), whereas all other sequences only contain one copy.
58Comments
- Pairwise alignment is an optimal algorithm
- Multiple alignment is not an optimal algorithm, only a heuristic. Better alignments may exist!
- The algorithm yields a possible alignment, but not necessarily the best one.
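To illustrate what "optimal" means for the pairwise case, here is a minimal Python sketch of global alignment scoring by dynamic programming (Needleman-Wunsch). The scoring values (match 1, mismatch -1, linear gap -1) are illustrative assumptions; ClustalW itself uses substitution matrices and separate gap opening and extension penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best possible global alignment score under a linear gap penalty."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,                # align a[i-1] with b[j-1]
                               previous[j] + gap,       # gap in b
                               current[j - 1] + gap))   # gap in a
        previous = current
    return previous[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

The table guarantees the best score for two sequences in time proportional to the product of their lengths. Extending the same table to many sequences at once grows exponentially with their number, which is why progressive heuristics are used instead.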
59ClustalW in the web server
- Global multiple sequence alignment program for DNA or proteins
- Available from a number of sites
- EMBL-EBI
60Results
61Results
62Alignment with colors
identity
similarity
63CLUSTAL format
CLUSTAL W(1.82) multiple sequence alignment

YPK1        SQLSWKRLLMKGYIPPYKPAVSD-Q--NSMDTSNFDEEFTR--SEKPIDSVVDEYLSESV
YPK2        KDISWKKLLLKGYIPPYKPIVKDTQ--SEIDTANFDQEFTK---EKPIDSVVDEYLSASI
KPCA_HUMAN  RRIDWEKLENREIQPPFKPKVC------GKGAENFDKFFTR---GQPVLTPPDQLVIANI
KPCZ_HUMAN  RSIDWDLLEKKQALPPFQPQIT---M-DDYGLDNFDTQFTS---EPVQLTPDDEDAIKRI
KAPA        KEVVWEKLLSRNIETPYEPPIQ----QGQGDTSQFDKYPE----EDINYGVQGEDPYADL
KAPC        NEVIWEKLLARYIETPYEPPIQ----QGQGDTSQFDRYPE-EVDEEFNYGIQGEDPYMDL
KAPB        SEVVWERLLAKDIETPYEPPIT----SGIGDTSLFDQYPE-DV-EQLDYGIQGDDPYAEY
KS6_HUMAN   RHINWEELLARKVEPPFKPLLQ-----SEEDVSQFDSKFTR-V-QTPVDSP-DDSTLSES
            .

YPK1        -----MQKQF
YPK2        ----N-QKQF
KPCA_HUMAN  D--O--QSDF
KPCZ_HUMAN  D-----QSEF
KAPA        -D----FRDF
64ClustalW at EMBL - Jalview
Jalview is a multiple alignment editor
conservation
65Jalview
- color menu
- Taylor colors (each amino acid is colored differently)
- Zappo colors (amino acids are colored according to their physico-chemical properties)
- Hydrophobicity colors (colors amino acids according to a certain score scale that represents hydrophobicity)
- Coloring residues above a percentage identity threshold
- User defined color schemes
66Example - Zappo colors
- physico-chemical properties color-code
67Guide Tree
68ClustalX
- ClustalX provides a window-based user interface to the ClustalW program.
- It uses a user interface library developed by the NCBI as part of their NCBI Software Development Toolkit.
69T-coffee
- Another MSA program
- Protein & nucleotide MSA program
- Uses principles similar to ClustalW
- More accurate but longer running times
- Limits the number of sequences it can align (100)
- T-coffee at EMBnet
71T-coffee results
72Phylip format
5 99
Cabd_199509   PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKIIGGI
JCSB1_199401  PQITLWQRPIVTIKIGGQLKEALLDTGAD---LEEM-NPGRWKPKIIGGI
JCSB2_199401  PQITLWQRPVT-IK-GG-QLEALLDTGADDTL-EEI-LPGRW-PKMIGGI
JCSB4_199401  PQITLWQRPVT--K-GG-LKEALLDTGADDTE-----DPGRWKPKMIGGI
JCSB5_199401  PQITLWQRPIVTIKVGGQLKEALLDTGADDTVL-EMNLPGRWKPKMIGGI

              GGFIKVRQYDQVPIEICGHKAIGTVLVGPTPSNIIGRNLLTQLGCTLNF
              GGFVKVRQYDQIPIDICGHKVIGTVL-GPTPANVIGRNLLTQIGCTLNF
              GGFVKVR-YDQVPIEICGH--IGTVLVGPTPANIIGRNLMTQLGCTLNF
              GGFLKVRQYDQIPVEICGHKAIGTVL-GPTPANIIGRNLLTQIG-TLNF
              GGFVKVRQYDQIPIEICGHKAIGTVLVGPTPANIVGRNLLTQIGCTLNF
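A small Python sketch of reading an interleaved file laid out like the example above: a header with the number of sequences and columns, a first block with names, then further blocks of residues in the same order. It assumes names are separated from residues by whitespace (strict PHYLIP instead reserves the first 10 characters for the name), and the function name and toy input are illustrative.

```python
def read_interleaved_phylip(text):
    """Parse whitespace-delimited interleaved PHYLIP text into {name: sequence}."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    n_seqs, n_cols = map(int, lines[0].split())
    names, chunks = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_seqs:
            # First block: name, then residues
            name, residues = line.split(None, 1)
            names.append(name)
            chunks.append(residues.replace(" ", ""))
        else:
            # Later blocks: residues only, in the same sequence order
            chunks[i % n_seqs] += line.replace(" ", "")
    records = dict(zip(names, chunks))
    if any(len(seq) != n_cols for seq in records.values()):
        raise ValueError("sequence length does not match header")
    return records


toy = """
2 12
seqA  ACGTAC
seqB  AC-TAC

GGTTAA
GG-TAA
"""
print(read_interleaved_phylip(toy))
```

The length check against the header is a cheap way to catch a file whose blocks were truncated or wrapped incorrectly.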
73The Biology WorkBench
- http://workbench.sdsc.edu/
- http://www.ngbw.org/
- Nucleic Acid Sequence Tools, including BLAST,
CLUSTALW, MFOLD, PRIMER3
74Muscle
- Protein & nucleotide MSA program
- Improvements in both accuracy and speed
- exploiting a range of existing and new algorithmic techniques
- combination of progressive and iterative alignment strategies
- details of the method
- web server
- downloads: Windows, Linux, Mac
75Muscle web server
76Editing MSA
- There are a variety of tools that can be used to modify a multiple alignment (SeaView, BioEdit, JalView)
- These programs can be very useful in formatting and annotating an alignment for publication.
- An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.
77MSA approaches
- Progressive approach: CLUSTALW (CLUSTALX), PileUp, T-COFFEE, MAFFT, MUSCLE
- Iterative approach: repeatedly realign subsets of sequences. MultAlin, DIALIGN, MAFFT, MUSCLE, ProbCons
- Genetic algorithm
- SAGA
- Graph algorithm
- POA
78Conclusion
- There is no single method that always generates the best alignment
- It may thus be wise to use more than one method
- Alignment editors can be used to correct the alignments