Provided by: ElliotLe6

1
Multiple Sequence Analysis
2
Conserved functional domains
  • Sequences required for common function
  • Conserved among different species

3
Variable domains
  • Sequences under selection
  • Antibody escape mutants

4
Peptide motifs
  • Active sites
  • Binding motifs
  • Protein modification motifs

5
Evolutionary relationships
  • Origin of new strains
  • Epidemiology
  • Disease spread
  • Disease origins
  • Common ancestry

6
Multiple Sequence Alignments
7
The "Best" Alignment
  • Optimal Alignments
  • BestFit and Gap algorithms
  • "approximate" Alignments
  • When the optimal alignment would exceed computer
    capabilities
  • PileUp
  • In either case, the final alignment will always
    be dependent on the chosen variables

8
Program Variables
  • Symbol Comparison Table
  • Gap Weight
  • Gap Length Weight
  • Algorithm
  • Other

9
Determining the Best Alignment
  • Optimize
  • Percent Identity/Similarity
  • Quality
  • Statistical measures
  • Make up a number

10
The Final Analysis
  • Your own eyes
  • Human knowledge
  • Biology
  • Evolution

11
PileUp
  • Creates multiple sequence alignments from a group
    of related sequences by performing pairwise
    alignments among all of the sequences in the group

12
PileUp Initial Comparison
  • Compares each sequence to every other sequence
  • Uses the GAP global alignment algorithm
  • Creates a table of similarities between every
    sequence
  • The table is plotted as a dendrogram to a .figure
    file
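
The initial comparison step above can be sketched as follows. This is only an illustration: it scores each pair with a plain Needleman-Wunsch global alignment and a linear gap penalty, which is a simplified stand-in for the GAP algorithm and its separate gap creation and extension penalties. The scoring values and the names `global_score` and `similarity_table` are assumptions, not part of GCG.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity_table(seqs):
    """Score every sequence against every other sequence."""
    return {(x, y): global_score(seqs[x], seqs[y])
            for x, y in combinations(sorted(seqs), 2)}


# Toy input; the sequence names and residues are made up for illustration.
seqs = {"s1": "GATTACA", "s2": "GATTTACA", "s3": "CCGG"}
table = similarity_table(seqs)
most_similar = max(table, key=table.get)
```

The pair with the highest score in `table` is the natural starting point for the clustering described on the next slides.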

13
PileUp Alignment
  • Align the two most similar sequences to each other
  • Forms cluster number one
  • Align cluster one to the next most similar
    sequence
  • Gaps introduced into cluster one are introduced
    into both sequences
  • Forms cluster two (group of three)

14
Completion of the Alignment
  • Repeat the alignment by gapping each new cluster
    to the next most similar sequence
  • Writes the final alignment to a Multiple Sequence
    Format (MSF) file
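
The order in which sequences join the alignment, as described on the last two slides, might be sketched like this. It is a greedy reading of the slides (start with the most similar pair, then keep adding the sequence most similar to the growing cluster); the use of average similarity to the cluster members, and the function and variable names, are assumptions for illustration and may differ from what PileUp does internally.

```python
def alignment_order(names, sim):
    """Return the order in which sequences would be added to the alignment.

    `sim` maps a pair of names (in either order) to a similarity score.
    """
    def s(a, b):
        return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

    # Cluster one: the two most similar sequences.
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    cluster = list(max(pairs, key=lambda p: s(*p)))
    remaining = [n for n in names if n not in cluster]

    # Repeatedly add the sequence with the highest average similarity
    # to the sequences already in the cluster.
    while remaining:
        nxt = max(remaining,
                  key=lambda n: sum(s(n, m) for m in cluster) / len(cluster))
        cluster.append(nxt)
        remaining.remove(nxt)
    return cluster


# Hypothetical similarity table for four sequences.
names = ["A", "B", "C", "D"]
sim = {("A", "B"): 5.2, ("A", "C"): 3.4, ("A", "D"): 1.4,
       ("B", "C"): 3.1, ("B", "D"): 1.2, ("C", "D"): 2.0}
order = alignment_order(names, sim)
```

With this table, A and B form cluster one, C joins next, and the most distant sequence, D, is added last.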

15
MSF file
  • Can be read by other programs
  • Individually gapped sequences can be utilized on
    their own or in groups that are subsets of the
    whole alignment

16
Dendrogram
  • Original pairwise relationships among all of the
    sequences used to determine cluster alignment
    order
  • The dendrogram does not predict phylogenetic
    relationships
  • The final alignment was not used to determine the
    sequence relationships

18
Alignment Order
  • Alignments begin with the two most similar
    sequences and end with the most distant sequence
  • The final alignment may be influenced by the
    alignment order
  • This order cannot be changed in the present
    version of PileUp

19
Similar Sequences
  • PileUp does not allow differential weighting of
    the input sequences
  • All input sequences are weighted equally
  • Several very similar sequences will contribute
    equally to the alignment
  • Several very similar sequences may bias the final
    alignment

20
Different PileUp Runs
  • Run PileUp with different sets of input sequences
  • Use all members or only one member of a group of
    very similar sequences
  • Run PileUp using previously determined consensus
    sequences for sequence groups

21
Unrelated Sequences
  • PileUp includes all sequences in the final
    alignment
  • An unrelated sequence appears in the alignment
    even if it has no similarity to any of the other
    sequences
  • Unrelated sequences may greatly alter the final
    sequence alignment by the introduction of many
    additional gaps

22
Restrictions
  • 500 sequences
  • 5,000 symbols per sequence
  • 2,000 new gaps per sequence
  • 7,000 final alignment length

23
Restrictions
  • Surface of comparison between any two sequences
    cannot exceed 2,250,000
  • The surface of comparison is the product of the
    two sequence lengths
  • If the surface of comparison does exceed the
    limit, the program will attempt an alignment by
    limiting the total number of gaps introduced
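
As a quick illustration of the surface-of-comparison limit, the check below multiplies two sequence lengths and compares the product with the 2,250,000 ceiling quoted above. The function names are made up for this sketch.

```python
MAX_SURFACE = 2_250_000  # limit quoted on the slide


def surface_of_comparison(len_a, len_b):
    """Product of the two sequence lengths."""
    return len_a * len_b


def within_limit(len_a, len_b, limit=MAX_SURFACE):
    """True if the comparison fits without restricting gaps."""
    return surface_of_comparison(len_a, len_b) <= limit


fits = within_limit(1500, 1500)      # 2,250,000: exactly at the limit
too_big = within_limit(5000, 5000)   # 25,000,000: exceeds the limit
```

Note that two sequences at the 5,000-symbol maximum exceed the surface limit, so the limited-gap fallback applies well before the per-sequence length limit is reached.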

24
analyze pileup -check @pol.list

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

Minimal Syntax: pileup -INfile=@Hsp70.List -Default

Prompted Parameters:
  -GAPweight=12          gap creation penalty
  -LENgthweight=4        gap extension penalty
  -DENsity=20.0          number of sequences per 100 pu in the dendrogram
  -OUTfile1=hsp70.msf    output file for multiple sequence alignment

Local Data Files:
  -MATRix=blosum62.cmp   scoring matrix for peptides
  -MATRix=pileupdna.cmp  scoring matrix for nucleic acids
25
Optional Parameters:
  -BEGin=1        sets beginning position for every sequence to be aligned
  -END=100        sets ending position for every sequence to be aligned
  -REVerse        uses the reverse strand for each input sequence
  -ENDWeight      penalizes end gaps like other gaps
  -INSitu         realigns a portion of an existing alignment
  -HIGhroad       selects "top" alignment path for equally optimal gaps
  -LOWroad        selects "bottom" alignment path for equally optimal gaps
  -MAXSeg=5000    sets maximum segment length for every input sequence
  -MAXGap=2000    sets maximum combined length of all gaps added to a sequence
  -NOSORt         presents output sequences in the same order as input
  -LINesize=50    sets the number of sequence symbols per line
  -BLOcksize=10   sets the number of sequence symbols per block
  -DEGap          removes gap characters ('.' and '~') from the input sequences
  -NOPLOt         suppresses plot of clustering relationships
  -NOMONitor      suppresses screen trace of each alignment
  -NOSUMmary      suppresses screen summary at the end of the program
  -BATch          submits program to the batch queue

Add what to the command line ?
26
   1  POLG_POL1M  461 aa
   2  POLH_POL1M  461 aa
  ........
  48  POLN_SOUV3  266 aa
  49  POLN_FCVF4  175 aa

What is the gap creation penalty ( 12 ) ?  5
What is the gap extension penalty ( 4 ) ?  1

This program can display the clustering relationships graphically. Do you want to
  A) Plot to a FIGURE file called "pileup.figure"
  B) Plot graphics on COLORWORKSTATION attached to GCG_Graphics
  C) Suppress the plot
Please choose one ( A )  c

What should I call the output file name ( pol.msf ) ?  pol-a.msf
27
Determining pairwise similarity scores...
   1 x  2   5.24
   1 x  3   5.22
  47 x 49   3.43
  48 x 49   1.42

Aligning...
   1 ........-.
   2 ........-.
     ........-.
  47 .............-.
  48 .....................-...

Total sequences: 49
Alignment length: 495
CPU time: 0207.25
Output file: /export/home/lefkowit/temp/pol-a.msf
32
Specifying MSF Sequences
  • Picorna.MSF{*}: all sequences
  • Picorna.MSF{Polg_pol*}: all polio sequences
  • Picorna.MSF{Polg_Pol1m}: only the pol1 Mahoney strain

33
The SeqLab Editor
  • Alignment Refinement
  • Color-coded symbol groups

34
Pretty
  • Display multiple sequence alignments
  • Does not create the alignment
  • Calculates a consensus sequence
  • Allows control over the output

35
Pretty Output
  • Show all symbols (default)
  • -Consensus: show all symbols and a consensus
    sequence
  • -Case: show symbols agreeing with the consensus in
    upper case

36
Pretty Output
  • -Differences: only show symbols differing from the
    consensus
  • -Identity: only show consensus symbols which are
    identical in all of the aligned sequences

37
Pretty Sequence file Output
  • -Ugly: write individual sequences into separate
    files
  • Includes the consensus sequence

38
Consensus Calculation
  • Find the symbol at a particular position which
    has the greatest number of "votes"
  • A vote is determined by the sum of the symbol
    comparison values between that symbol and all
    other symbols at that position

39
Consensus Calculation
  • If the total vote for the highest scoring symbol
    is greater than the threshold value, that symbol
    appears in the consensus
  • If no vote is higher than the threshold, no
    symbol appears at that position in the consensus

40
Vote Weight
  • All sequences participate equally in the
    consensus calculation by having a vote weight of
    1.0
  • The vote weight can be changed for any individual
    sequence
  • The symbols in that sequence are then weighted
    when the consensus is calculated according to
    their vote weight
  • This keeps the votes of very similar sequences
    from overly influencing the consensus calculation
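
The consensus calculation and vote weights described on the last three slides can be sketched roughly as follows. Several things here are assumptions made for illustration: a simple identity table (1 for a match, 0 for a mismatch) stands in for a real symbol comparison table, gap characters are not allowed to win a position, ties are broken alphabetically, and the function names and the example rows are invented.

```python
def identity(a, b):
    """Toy symbol comparison table: 1 for identical symbols, else 0."""
    return 1.0 if a == b else 0.0


def consensus(rows, weights=None, threshold=2.0, compare=identity,
              gap=".", blank="-"):
    """Weighted-vote consensus over pre-aligned rows of equal length."""
    if weights is None:
        weights = [1.0] * len(rows)  # every sequence votes equally
    out = []
    for column in zip(*rows):
        candidates = sorted(set(column) - {gap})
        # A symbol's vote is the weighted sum of its comparison values
        # against the symbols at this position.
        votes = {c: sum(w * compare(c, sym) for sym, w in zip(column, weights))
                 for c in candidates}
        best = max(candidates, key=votes.get, default=None)
        # The winning symbol appears only if its vote exceeds the threshold.
        out.append(best if best is not None and votes[best] > threshold else blank)
    return "".join(out)


rows = ["ACGT", "ACGA", "ACTA", "AGTA"]
equal = consensus(rows)                              # all weights 1.0
downweighted = consensus(rows, weights=[1, 1, 0.5, 0.5])
```

With equal weights the last column reaches a consensus; once the last two sequences are down-weighted to 0.5 each, their votes no longer carry that position over the threshold.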

41
Specifying Vote Weights
  • In an MSF file change the number under the weight
    column
  • In a file of sequence names, add a number
    following the sequence name for a vote weight
    other than one

42
analyze pretty -check pol-a.msf

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

Minimal Syntax: pretty -INfile=@Pretty.List -Default

Prompted Parameters:
  -BEGin=1 -END=349       range of interest
  -OUTfile=pretty.pretty  output file

Local Data Files:
  -MATRix=prettydna.cmp   consensus scoring matrix for nucleotides
  -MATRix=blosum62.cmp    consensus scoring matrix for peptides
43
Optional Parameters:
  -CONsensus        generates (displays) a consensus sequence
  -IDEntity         only shows positions of unanimous agreement in the consensus
  -DIFferences="-"  only shows positions disagreeing with the calculated consensus
  -CASe             shows positions agreeing with the calculated consensus in upper case
  -THReshold=1      sets minimum comparison value for symbol to vote in consensus
  -PLUrality=2.0    defines the minimum number of votes for a consensus to exist
  -LINesize=50      sets the number of residues per line
  -WEIGHT=1.0       sets the weight for all input sequences
  -BLOcksize=10     sets the number of residues per block
  -UGLy             writes the individual sequences into new files
44
Add what to the command line ?

pol-a.msf{POLG_HPAV8}  len 495  wgt 1.00
pol-a.msf{POLG_HPAV4}  len 495  wgt 1.00
pol-a.msf{POLN_SMSV1}  len 495  wgt 1.00
pol-a.msf{POLN_SOUV3}  len 495  wgt 1.00

Begin ( 1 ) ?
End ( 495 ) ?

What should I call the output file ( pretty.pretty ) ?  pol-a.pretty
45
pretty
46
pretty -case
47
pretty -Dif
48
pretty -Ide
49
Picorna.MSF - Weighted
50
Picorna.MSF - Weighted
51
Pretty Consensus with Altered Vote Weights
52
Pretty Consensus with Altered Vote Weights
54
PrettyBox
  • Shaded representation of a multiple sequence
    alignment
  • Creates a PostScript file
  • Create a PDF file using Adobe Distiller
  • Edit using Adobe Illustrator

55
analyze prettybox -check dnaa.msf

PrettyBox displays multiple sequence alignments in PostScript format, using shading to represent regions that agree with a calculated consensus sequence. The program does not create the alignment; it simply displays it.

Minimal Syntax: prettybox -INfile=@pretty.list -Default

Prompted Parameters:
  -BEGin=1 -END=349         sets the range of interest
  -ORIentation=l            specifies the direction for printing as Landscape (L) or Portrait (P)
  -NUMbering=r              sets printing of sequence numbering to Right side (R), Top (T), or None
  -CONsensus                generates a consensus sequence
  -OUTfile=prettybox.ps     writes to PostScript output file

Local Data Files:
  -MATRix=prettyboxdna.cmp  assigns the scoring matrix for nucleotides
  -MATRix=blosum62.cmp      assigns the scoring matrix for proteins
  -MARk=pretty.mrk          defines regions to be shaded
56
Optional Parameters:
  -PAIr=x,2,1             sets thresholds for identical (x), very similar, and weakly similar comparisons to the consensus, respectively. Protein defaults are x, 2, 1. Nucleic acid defaults are 1, 1, 1.
  -THReshold=1            sets minimum comparison value for symbol to vote in the consensus
  -PLUrality=2.0          defines the minimum number of votes for a consensus to exist
  -IDEntity               restricts shading and consensus determination to positions of unanimous agreement
  -CASe                   shows positions agreeing with the calculated consensus in uppercase
  -SIMPlify=simplify.txt  simplifies sequences; works like the Simplify program
  -SIMIlar=a              considers similarity in generating a consensus. If 'O' is used, then only identical matches are considered.
  -NOOFFset               prevents printing the consensus line offset from the other sequences
  -NOHEAder               suppresses printing a header
  -SEQName=p              sets sequence names to be Partial (P), Full (F), or None (N)
57
  -ASKstart        asks about the starting numbers for each sequence
  -WIDth=50        sets the number of residues per line
  -BLOcksize=10    sets the number of residues per block
  -SPAcing=1       sets the number of spaces between blocks
  -BLAnklines=2    sets the number of blank lines between each group of sequence lines
  -FONtsize=10     sets the font size in terms of PostScript numbers
  -XMArgin=20      sets the left and right margins in PostScript units
  -YMArgin=20      sets the top and bottom margins in PostScript units
  -FAT             uses fat (bold) lettering
  -COLor=b,L,P,W   sets the colors (shading intensities) to use for identical, similar, somewhat-similar, and non-similar comparisons to the consensus, respectively. The available colors, by decreasing order of intensity, are Black (B), Dark (D), Light (L), Pale (P), and White (W).
  -DENsity=f       sets the density of printing to be either Rough (R) or Fine (F). Rough may photocopy better. Density only works with the colors Dark, Light, and Pale.

Add what to the command line ?
58
dnaa.msf{DNAA_CAUCR}, len 749
dnaa.msf{DNAA_RHIME}, len 749
dnaa.msf{DNAA_MYCCA}, len 749
dnaa.msf{DNAA_MYCMY}, len 749
dnaa.msf{DNAA_SPIAP}, len 749
dnaa.msf{DNAA_SPICI}, len 749
dnaa.msf{DNAA_BORBU}, len 749
dnaa.msf{DNAA_TREPA}, len 749
dnaa.msf{DNAA_RICPR}, len 749
dnaa.msf{DNAA_WOLSP}, len 749
dnaa.msf{DNAA_BUCAI}, len 749
dnaa.msf{DNAA_BUCAP}, len 749
dnaa.msf{DNAA_ECOLI}, len 749
dnaa.msf{DNAA_SALTY}, len 749
dnaa.msf{DNAA_SERMA}, len 749
dnaa.msf{DNAA_PROMI}, len 749
dnaa.msf{DNAA_VIBHA}, len 749
dnaa.msf{DNAA_PSEPU}, len 749
dnaa.msf{DNAA_HAEIN}, len 749
dnaa.msf{DNAA_MYCBO}, len 749
dnaa.msf{DNAA_MYCPA}, len 749
dnaa.msf{DNAA_MYCAV}, len 749
dnaa.msf{DNAA_MYCTU}, len 749
dnaa.msf{DNAA_MYCLE}, len 749
dnaa.msf{DNAA_MYCSM}, len 749
dnaa.msf{DNAA_STRCO}, len 749
dnaa.msf{DNAA_STRRE}, len 749
dnaa.msf{DNAA_STRCH}, len 749
dnaa.msf{DNAA_MICLU}, len 749
dnaa.msf{DNAA_BACSU}, len 749
dnaa.msf{DNAA_STAAU}, len 749
dnaa.msf{DNAA_PROMA}, len 749
dnaa.msf{DNAA_SYNY3}, len 749
dnaa.msf{DNAA_STRPN}, len 749
dnaa.msf{DNAA_THEMA}, len 749
dnaa.msf{DNAA_MYCGE}, len 749
dnaa.msf{DNAA_MYCPN}, len 749
dnaa.msf{DNAA_UREPA}, len 749

Begin ( 1 ) ?
End ( 749 ) ?
59
Print in which orientation
  l)andscape
  p)ortrait
Please select ( L )

Display a consensus ( No ) ?
Find consensus to what minimum plurality ( 2.00 ) ?

Where should numbers be placed
  r)ight side
  t)op
  n)one
Please select ( R )

What should I call the output PostScript file ( prettybox.ps ) ?  dnaa.ps

analyze
60
PrettyBox Output
  • pdf file

61
PlotSimilarity
  • Plots the similarity among sequences in a
    multiple sequence alignment

62
Similarity Statistic
  • The similarity statistic is the average of all
    symbol comparison scores when all symbols at any
    one position are compared with each other
  • The similarity statistic is averaged over a
    window size of 10 (default) and plotted along the
    length of the sequence

63
PlotSimilarity -IDEntity
  • A measure of symbol identity along the sequence
  • Instead of using a symbol comparison table for
    the calculation, all matches receive a value of
    1, and mismatches a value of 0
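
The similarity statistic and its windowed average can be sketched as below, using the identity scoring just described (1 for a match, 0 for a mismatch). How the real program treats the ends of the alignment and gap characters is not stated on the slides, so this sketch simply averages over full windows and treats every character alike; the function names are invented for illustration.

```python
from itertools import combinations


def identity(a, b):
    """Identity scoring: matches receive 1, mismatches 0."""
    return 1.0 if a == b else 0.0


def position_similarity(rows, compare=identity):
    """Average comparison score over all symbol pairs at each position."""
    scores = []
    for column in zip(*rows):
        pairs = list(combinations(column, 2))
        scores.append(sum(compare(a, b) for a, b in pairs) / len(pairs))
    return scores


def running_average(values, window=10):
    """Average over each full window along the alignment."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


rows = ["ACGT", "ACGA", "ACTA"]          # toy alignment of three sequences
scores = position_similarity(rows)
smoothed = running_average(scores, window=2)
```

Passing a different `compare` function (for example, a lookup into a symbol comparison table) gives the default similarity plot rather than the identity plot.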

64
analyze plotsimilarity -check pol-a.msf

PlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.

Minimal Syntax: plotsimilarity -INfile1=hsp70.msf -Default

Prompted Parameters:
  -WINdow=10              comparison window size
  -DENsity=624.3          the number of bases per 100 platen units

Prompted Parameters for comparing 2 sequences only:
  -INfile2=ggamma.gap     second input sequence
  -BEGin1=1 -END1=1700    the range of interest for sequence 1
  -BEGin2=1 -END2=1700    the range of interest for sequence 2
  -REVerse1 -REVerse2     strand of each sequence

Local Data Files:
  -MATRix=blosum62.cmp    scoring matrix for peptides
  -MATRix=plotsimdna.cmp  scoring matrix for nucleic acids
65
Optional Parameters:
  -BEGin1=1 -END1=718     the range of interest in the alignment
  -OUTfile=Hsp70.plotsim  writes the similarity values to a file
  -WEIGHT=1               sets the weight for all input sequences
  -IDEntity               plots the level of identity among the sequences
  -BARgraph               plots a bar graph (rather than a continuous curve)
  -PROFile                plots positional conservation in a profile
  -MINScale=0             sets the bottom of the similarity score scale
  -MAXScale=2             sets the top of the similarity score scale
  -EXPand                 scales plot between observed min and max similarity scores
  -NOAVErage              suppresses the plot of overall similarity
  -NOPLOt                 suppresses the plot
  -CMASK=filename         creates a SeqLab colormask file with grayscale values for levels of similarity

Add what to the command line ?
66
Process set to plot with COLORWORKSTATION attached to GCG_Graphics using the xwindows graphic interface.

pol-a.msf{POLG_HPAV8}
pol-a.msf{POLG_HPAV4}
pol-a.msf{POLN_SMSV1}
pol-a.msf{POLN_SOUV3}

What window to average ( 10 ) ?

The minimum density for this plot is 430.4 residues/100 platen units.
What density do you want ( 430.4 ) ?

xwindows instructions for a COLORWORKSTATION are now being sent to GCG_Graphics. Press <Return>
71
MACAW
  • Multiple Alignment Construction and Analysis
    Workbench
  • Locate, analyze, edit, and combine blocks of
    aligned sequence segments
  • Gregory D. Schuler, Stephen F. Altschul, and
    David J. Lipman
  • FTP to ncbi.nlm.nih.gov

72
MACAW Blocks
  • Blocks
  • Ungapped regions of similarity between two or
    more sequences
  • Identifies the best local regions of similarity
    between two or more sequences
  • BestFit-like search
  • Will identify multiple blocks of similarity
    between two or more sequences

73
MACAW Sensitivity
  • Multiple sequence patterns are located in more
    than 2 sequences at a time
  • The significance and sensitivity of a match is
    greater when a similar pattern is located in more
    than two sequences

74
Detection and Alignment of Blocks
  • Multiple algorithms available
  • The statistical significance of blocks of
    similarity is evaluated
  • Candidate blocks may be visually evaluated for
    potential inclusion in a multiple alignment
  • Each block can be edited by moving its boundaries
    or by eliminating particular segments
  • Blocks may be linked to form a composite multiple
    alignment

75
MACAW Scoring
  • SP: Sum of Pairs
  • The sum of all pairwise amino acid scores in a
    block
  • MP: Mean Pairwise Score
  • SP / number of pairs
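
The two block scores can be illustrated with a short sketch. The slide does not say which pairs are counted in the denominator, so this example assumes the number of sequence pairs in the block; the toy scoring function (2 for a match, -1 for a mismatch) is also an assumption standing in for a real amino acid scoring matrix, and the function names are invented.

```python
from itertools import combinations


def toy_score(a, b):
    """Stand-in for an amino acid scoring matrix lookup."""
    return 2 if a == b else -1


def sum_of_pairs(block, score=toy_score):
    """SP: sum of all pairwise residue scores over every column of the block."""
    return sum(score(a, b)
               for column in zip(*block)
               for a, b in combinations(column, 2))


def mean_pairwise(block, score=toy_score):
    """MP: SP divided by the number of sequence pairs (an assumption)."""
    n_pairs = len(block) * (len(block) - 1) // 2
    return sum_of_pairs(block, score) / n_pairs


block = ["ACD", "ACE", "GCD"]   # ungapped block of three segments
sp = sum_of_pairs(block)
mp = mean_pairwise(block)
```

Dividing by the number of pairs makes scores comparable between blocks that contain different numbers of sequences.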

76
MACAW Output
  • Export alignment to a text file
  • Includes aligned and unaligned regions
  • Not in any format that GCG or ReadSeq can read
  • Must "recreate" the alignment by hand in SeqLab

77
MACAW Demo
78
Next Up - Profile Analysis