Title: Calculations using GCG programs
1Calculations using GCG programs Sequence
Alignments
Lecture 9 BINF 5230 Spring 2005
2Sequences A ACGCTG, B CATGT
3(No Transcript)
4(No Transcript)
5(No Transcript)
6(No Transcript)
7(No Transcript)
8(No Transcript)
9(No Transcript)
10(No Transcript)
11(No Transcript)
12(No Transcript)
13(No Transcript)
14(No Transcript)
15(No Transcript)
16The GCG dynamic programming alignment methods
(GAP and BESTFIT) may find more than one
alignment having the same or almost the same
score.
17Dynamic programming can provide
Global and Local sequence
alignment Global alignments of two proteins by
the Needleman-Wunsch algorithm. Local sequence
alignment by the Smith-Waterman algorithm.
18a locally optimal alignment. The Smith-Waterman
algorithm determines the longest/best
sub-sequence pair to maximize similarity between
the two sequences. Idea Ignore badly aligning
regions
The key features is that all negative scores
calculated in the dynamic programming changed to
0, to
avoid extending poorly scoring alignments
19Thus the major difference between
the
Needleman-Wunsch and the Smith-Waterman
scoring matrix is that
there are no negative scores in matrix.
(when scoring value becomes negative, that value
is set to zero) Why Do we need this
correction? The effect of this change is that an
alignment can . begin anywhere
without receiving a negative penalty from a
previously low- scoring alignment.
stops when negative alignment scores or the
introduction of gaps reduces the alignment scores
to 0 Result only a portion of each sequence
that was in this high- scoring region will be
reported.
20The trace-back matrix of the Smith-Waterman
scoring matrix.
To find the optimal local alignment, the
highest-scoring position in the scoring matrix
follow the trace-back from this position to a
zero in the matrix. Find two best local
alignments ?
The best sequence alignment sequence 1 S D R
T sequence 2 S D R T
score 2 4 6 3 15
The lower-scoring alignment sequence 1 M N A
sequence 2 M G S
score 6
0 1 7
21The Smith-Waterman algorithm Termination
1. If we want the best local alignment
FOPT max i,j F(i, j) 2. If
we want all local alignments scoring gt t
For all i, j find F(i, j) gt t, and trace
back
22The quality of the alignment
An Important question is our alignment good
enough to provide evidence of homology?
For database searches , some programs (PSI-BLAST)
report
E-values
is the expected number of random sequences from
the database, that give the same Z-score or
better. 0 E- value number of
sequences in the database
If E value 0.02 sequence probably homologous
If E value 1 homology cannot be
ruled out If E value gt
1 a match just by chance
23What program to use for searching?
- 1) FASTA works best in GCG
- integrated with GCG
- precise choice of databases
- more sensitive for DNA-DNA comparisons
- 2) BLAST is fastest and easily accessed on the
Web - limited sets of databases
- nice translation tools (BLASTX, TBLASTN)
- 3) Smith-Waterman is slower, but more sensitive
- known as a rigorous or exhaustive search
- SSEARCH in GCG and standalone FASTA
24 FASTA
(1988) Pearson and Lipman FASTA
make a local alignment between a new sequence and
any sequence in a database. FASTA is based on
the logic of the dot plots.
What is the main idea of dotplots?
In a dotplot, regions of similarity between two
sequences show up as diagonals.
25FASTA essentially calculates the sum of the dots
along each diagonal. What idea makes FASTA
faster than routine dynamic programming method of
calculating the diagonal sums? FASTA uses a
"word" based method Preparation for analysis 1)
Each sequence is broken down into short
words DNA words are usually 6 bases protein
words are 2 amino acids 2) These words are
organized into a table with an address where a
given words are in the sequence.
26FASTA Algorithm
- The first step FASTA matches identical words from
each list, and then creates diagonals by joining
adjacent matches, but only non-overlapping
words.
The first step find the largest number of short
perfect matches (words) for each comparison.
27FASTA Algorithm
b) The second step FASTA re-scores the highest
scoring regions using a replacement matrix such
as the PAM250
The second step the "best" regions are re-scored
. Why?
use a
scoring matrix that allows conservative
replacements
28FASTA Algorithm c) The third step after all
diagonals found, tries to join diagonals by
adding gaps
29FASTA Algorithm d) The fourth step Construct
an optimal alignment of the query sequence an
(Smith-Waterman algorithm).
the search set sequences with the highest scores
are aligned to the query sequence for display
30 FASTA Results
FASTA calculates
an E()-value (expectation of significance).
The best scores are init1
initn opt z-sc E(1018780).. SWPPI1_HUMAN
Begin 1 End 269 ! Q00169 homo sapiens
(human). phosph... 1854 1854 1854 2249.3
1.8e-117 SWPPI1_RABIT Begin 1 End 269 !
P48738 oryctolagus cuniculus (rabbi... 1840 1840
1840 2232.4 1.6e-116 SWPPI1_RAT Begin 1
End 270 ! P16446 rattus norvegicus (rat). pho...
1543 1543 1837 2228.7 2.5e-116 SWPPI1_MOUSE
Begin 1 End 270 ! P53810 mus musculus
(mouse). phosph... 1542 1542 1836 2227.5
2.9e-116 SWPPI2_HUMAN Begin 1 End 270 !
P48739 homo sapiens (human). phosph... 1533 1533
1533 1861.0 7.7e-96 SPTREMBL_NEWBAC25830
Begin 1 End 270 ! Bac25830 mus musculus
(mouse). 10, ... 1488 1488 1522 1847.6
4.2e-95 SP_TREMBLQ8N5W1 Begin 1 End 268 !
Q8n5w1 homo sapiens (human). simila... 1477 1477
1522 1847.6 4.3e-95 SWPPI2_RAT Begin 1
End 269 ! P53812 rattus norvegicus (rat). pho...
1482 1482 1516 1840.4 1.1e-94
31FASTA Results - Alignment
initn 1565 init1 1515 opt 1687 Z-score
1158.1 expect() 2.3e-58
66.2 identity in 875 nt overlap
60 70 80 90 100
110 u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCG
GGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
DMU09374 AGGCGGACATAAATCCTCGACATGGGT
GACAACGAACAGAAGGCGCTCCAACTGATGGCC
130 140 150 160 170
180 120 130
140 150 160 170 u39412.gb_pr
GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTT
TGGAGGCTCA
DMU09374
GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTT
CGGAGGGTCC 190 200
210 220 230 240
180 190 200 210 220
230 u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATC
TACGCCAGAGCAGCAAACATGTTCAAAATGGCC
DMU09374 AACAAGGTGGAGGACGCCATCGAGTGC
TACCAGCGGGCGGGCAACATGTTTAAGATGTCC
250 260 270 280 290
300 240 250
260 270 280 290 u39412.gb_pr
AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCA
CCTGCAGCTC
DMU09374
AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACA
CGCGCGGGCT 310 320
330 340 350 360
32FASTA Format
- simple format used by almost all programs
- gtheader line with a return at end
- Sequence (no specific requirements for line
length, characters, etc)
gtURO1 uro1.seq Length 2018 November 9, 2000
1150 Type N Check 3854 .. CGCAGAAAGAGGAGGCGC
TTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA ACTCAAC
TGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTT
GTT GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCA
ACACAGCCTCTACC CACTGCTTGAAGCCACCGACAACGATGACATCTAT
GGGGCTGCCTGGATCGGCATATTTG TGGGCATCTGCCTCTTCTGCCTGT
CTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA GGAAAATTCTTCT
GGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT CT
TGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCT
GAAGCAGA TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGAT
GACCAGTGGAAAAACAATG GAGTCACCAAAACCTGGGACAGGCTCATGC
TCCAGGACAATTGCTGTGGCGTAAATGGTC CATCAGACTGGCAAAAATA
CACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC CCTGGCCT
CGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGC
TT
33BLAST (Basic Local Alignment Search Tool)
is a similarity
search program It was developed at
NCBI/GenBank.
It is available as a free service
over the internet which provides very fast,
accurate, and sensitive database searching. The
easiest way to use BLAST is through the Web.
(http//www.ncbi.nlm.nih.gov).
Journal of Molecular Biology Volume 215, Issue 3
, 5 October 1990, Pages 403-410 Basic Local
Alignment Search Tool Stephen F. Altschul,
Warren Gish, Webb Millerb, Eugene W. Meyers and
David J. Lipman
34BLAST Algorithm
The
three algorithmic steps Step1. Compiling a list
of high-scoring words in query sequence
(typically a length of the word w 3
amino acids or 11 nucleotides) GCVEDT - query
sequence (length L) GCV CVE VED - words
ACV CVD LED
List of the high-scoring aa words
GCL CLE VDD at each
position in query sequence.
C IE
35 The words, which score exceeds 12 will be used
for the search
36BLAST Algorithm
The list consists of all words (w-mers) that
score at least T Typically on the order of fifty
words in the list for every residue in the query
sequence, e.g. 12500 words for a sequence of
length 250.
37BLAST scores words matches unlike FASTA,
Construct all possible words which can
be matched with the given word Scores the
matches of all these words using all the values
in a similarity matrix (not just the ones of
identity)
38Step 2
Scanning the database for hits.
BLAST scans a protein database at approximately
500,000 residues/second.
39---
---
---
a possible k-mer word is PQG. All of its
neighboring words are generated by trying all
possible replacements of each amino acid in the
word at a time. After all words are tested, a
set of HSPs (High-scoring Segment Pairs) are
chosen for that database sequence.
40Step 3 Extend hits one base at a time If similar
words are found, BLAST tries to expand the
alignment to the adjacent words (Extending a hit
to find a locally maximal segment pair but it
does not allow gaps). Any such hit is extended to
a segment pair whose score is greater than or
equal to S.
41HSPs are Aligned Regions
- The results of the word matching and attempts to
extend the alignment are segments - - called HSPs (High-scoring Segment Pairs)
- BLAST often produces several short HSPs rather
than a single aligned region
42Step 4
After a set of HSPs (High-scoring
Segment Pairs) are chosen for that database
sequence. Several short, non-overlapping HSPs may
be combined to create a larger, more significant
match. The fundamental unit of BLAST algorithm
output is the High- scoring Segment Pair (HSP)
43BLAST Considerations BLAST is fast because it
initially throws away all database sequences that
do not have a significant match without gaps to
the query sequence. If two sequences are
generally similar over a long region, but do not
have a single highly conserved region, BLAST
might miss the similarity. Does not handle
gaps well
New gapped BLAST (BLAST 2) is better
44 BLAST works much better for proteins than for
nucleotides The estimate is that BLAST can find
protein sequences that have diverged by as much
as 250 substitutions per 100 amino acids, but
only 50 substitutions for 100 nucleotides.
BLAST and FASTA produce similar output files.
45Sequence Display
The first column contains a sequence identifier (
gi 66668 ). Following the sequence identifier
is the truncated title line of the sequence. It
is sometimes possible to identify the function of
the database from the title line. The second
column contains the "High Score" for each
database match. This is the score of the highest
scoring HSP found within that database sequence.
46Clicking on the sequence identifier takes you to
the alignment data for that HSP.
47Column 4, the "N" column displays the number of
HSPs in the set which was assigned the lowest
P-value. If a sequence matches to the query in a
number of regions with large gaps interspersed
then N will indicate the number of regions.
Column 3, the "P(N)" column, contains the lowest
P-value assigned to a set of HSPs for each
database sequence. The P-value represents the
probablity (in the range of 0-1) of a given
sequence occurring by chance. It is less accurate
than the E-value and N-dependent.
48Interpretation of output
E() or P-value to be the most reliable index for
the quality of a match. ? P-values smaller than
e-100 (these are negative exponents) are exact
matches (same gene, same species). ? P-values
in the range of e-50 to e-100 are nearly
identical genes (alleles, mutations, related
species).
49 ? P-values in the range of e-10 to e-50 are
interesting closely related sequences. ?
P-values between 0.1 and e-5 are still
interesting but usually represent distant
relationships
50Borderline similarity
- What to do with matches with E() values in the
0.5 -1.0 range? - this is the Twilight Zone
- retest these sequences and look for related hits
(not just your original query sequence) - similarity is transitive
- if AB and BC, then AC
51Instructions for Lab I Exercise B -
BINF5230 Fast Methods for Searching Sequence
Databases for Homologs FAST and BLAST Exercise
1b - Topics Exercise 1b focuses on use of
databases or sequence Libraries to find
similarities to a nucleic acid or protein
sequence. This is usually the first type of
analysis one performs with a new DNA sequence.
The algorithms are largely based on the concepts
described in this lecture. The Main objectives
in Exercise 1b are Try out the two main fast
database searching methods FASTA and BLAST
Compare the results of searches with a protein
sequence and the DNA sequence Get a feel for the
effects of filtering on searching with BLAST
Compare features of the original BLAST with
those of Gapped-BLAST Compare fast methods to
"rigorous" dynamic programming alignments
52Main Specific Tasks to Perform in Exercise B A.
Searching with FASTA 1.Obtain sequences 2.Run
FASTA on your protein sequence using ktup of
2 3.Repeat this search with a ktup of 1 4.
Perform a search using the cognate DNA
sequence B. Searching with BLAST 1.Use the DNA
sequence with BLASTN and DNA nr database 2.Use
the DNA sequences with BLASTX and BLOSUM62 matrix
and protein nr database 3.Use the protein
sequence with BLASTP and protein nr
database 4.Use the protein sequence with
Gapped-BLASTP and protein nr database 5.Do 3
BLASTP runs using the distance matrices PAM40,
PAM120, and PAM250 C. Filtering Low Complexity
Sequences in BLAST Searches 1.Retrieve the MTG8
protein sequence corresponding to Genbank
accession number, 2498595 2.Do a BLASTP run using
the BLO62 matrix and no FILTER options 3.Do a
BLASTP run using the BLO62 matrix and filter,
with both FILTER options D. Comparison with
dynamic programming alignments 1.Find a matching
SWISS_PROT sequence from the FASTA or BLAST
protein sequence searches. 2.Align this sequence
to your query using the GCG BESTFIT program
3.Evaluate the significance of the alignment
using the randomization option of BESTFIT.
53In Exercise B, you are to obtain cognate DNA and
protein sequences using sites on the Web, e.g.
NCBI and/or EXPASY, and then will use these
sequences to search databases for similar
sequences, with the objective of finding
homologs. (You can use the DNA/protein sequences
used in Exercise A, if you like). You will use
the two most frequently used fast database
searching programs FASTA and BLAST, as well as
the recent Gapped-BLAST.
Several Web based versions of FASTA are available
see this lecture (be sure to bookmark it). You
can find others if you don't like this site by
doing a search using the query. At this point in
the course, you have had sufficient experience
with both the Web and the GCG programs via the
SeqLab interface that I believe the Exercises
need not be as verbose as the initial Exercises.
54- A. Searching with FASTA
- Obtain sequences to use.Find a protein sequence
and the corresponding DNA sequence. This is
easiest to do using Entrez to locate the protein
sequence and following a nucleotide link to get
the DNA. These sequences will be used for most of
the database search tasks in this Exercise. It is
best to use a DNA sequence that encodes one and
only one gene, and one that isn't too long (2000
bases or less). It will be most convenient to
store these sequences on your local computer in
FASTA format. - Use sequences from Exercise 1a to test FASTA
programs - 2. Run FASTA on your protein sequence using a
ktup of 2.Examine the results and decide where
the division between homologous and
non-homologous sequences is. Include a sufficient
amount of the summary scores to justify your
conclusion. Please do not include hundreds of
alignments. Are there homologs that score lower
than non-homologs? What was the "cutoff" value
used in this search?
553. Repeat this search with a ktup of 1.Summarize
the differences between this search and the one
above, including relevant parts of the output. 4.
Perform a search using the cognate DNA
sequence.Summarize the differences between this
search and the protein searches above, including
relevant parts of the output. Which search, DNA
or protein, was more successful at finding
homologous sequences?
56B. Searching with BLAST 1. Use the DNA sequence
with BLASTN and DNA nr database.Choose an
appropriate web site for running the BLAST
programs, e.g. the BLAST server at NCBI (Entrez)
or GCG package. Use the DNA sequence you obtained
in Part A above for a BLASTN run, using the nr
(NonRedundant) DNA database. If you use the NCBI
BLAST server, the results will be displayed
directly on the NetScape window. This means you
need to "wait" until the BLASTN run is completed.
You can open a new NetScape window with the "New
Web Browser" command in the Netscape "File" menu,
to continue work on the Web. When the BLAST
output comes back, use COPY-PASTE to include
representative parts of the BLASTN output in your
log file.
572. Use the DNA sequences with BLASTX and BLO62
matrix and Protein nr databaseRepeat the above
BLASTN run but now using BLASTX and the Protein
nr database. Again, include representative parts
of the BLASTX output in your notebook. 3. Use the
protein sequence with BLASTP and Protein nr
databaseNow use the cognate Protein Sequence for
a BLAST run using the BLASTP program and the same
nr database. Again, include representative parts
of the BLASTP output in your Notebook. In all 3
cases, include the information about the BLAST
run found at the very end of the BLAST
output. Examine the output from these three BLAST
runs. Compare the hits found by looking at the
description lines. Which search seems to be the
most useful? Why? How do the results found in
your BLAST runs compare with those in your FASTA
runs? Which is most useful? Why?
584. Use the protein sequence with Gapped-BLASTP
and Protein nr databaseNow use the same cognate
Protein Sequence for a Gapped-BLAST run using the
BLASTP program and the same nr database. Again,
include representative parts of the Gapped-BLASTP
output in your Notebook. Again, include the
information about the Gapped-BLAST run found at
the very end of the Gapped-BLAST output.
Examine the output and compare the hits found
by BLASTP vs. those found by Gapped-BLASTP by
looking at both the description lines and the
alignment lines. What differences in the
description lines do you see? What is the E value
in Gapped-BLASTP, and how does it compare with
the P(N) used by BLASTP? How do the alignments
compare with each other? Pick an alignment with N
gt 3 or so. Describe how the alignments from
BLASTP and from Gapped-BLASTP compare with each
other. Which search seems to be the most useful?
Why?
595. Do 3 BLASTP runs using the distance matrices
PAM40, PAM120, and PAM250.Use your protein
sequence to do three runs using the BLASTP
program against the Protein nr database, but
using the PAM matrices PAM40, PAM120, PAM250.
Compare the Hits found by looking at the
description lines, and include the 3 sets of
Description lines (the main hits, at least) in
your Notebook. Compare also the hits found with
the PAM matrices to those found with the BLO62
matrix (the NCBI default protein distance matrix)
used in Part B3 above. Which distance matrix
works best to detect homologous sequences?
60C. BLASTP run, using FILTER and the MTG8 protein
sequence In this part, you will use the MTG8
Protein Sequence and the BLASTP program,
comparing the effects of the FILTER operation, to
"mask" low information sequences and direct
repeats. 1. Retrieve the MTG8 protein sequence
corresponding to Genbank accession number D14820.
in FASTA format.This should be easy for you to
do at NCBI Entrez using the Accession Number
2498595 or SWISS-PROT ACC Q06455. 2. Do a BLASTP
run using the BLO62 matrix and no FILTER
options.Include representative data from your
BLASTP output in your notebook. 3. Do a BLASTP
run using the BLO62 matrix and FILTER, with both
FILTER options. Do your BLASTP run by choosing
both the SEG and XNU filter options. Compare your
BLASTP output visually with that obtained with no
FILTER options, and include data from your BLASTP
output that is particularly relevant to
differences found with and without use of the
FILTER option. What differences do you find? Why
are these sequences filtered out?