Title: Why is pairwise sequence alignment different
1Lecture 5
- Why is pairwise sequence alignment different for proteins and for nucleic acids?
- General protein introduction.
- Scoring systems and matrices for protein data.
- 3. Wet experience for pairwise sequence alignment (for proteins, more options).
- 4. Special Blast pages.
- 5. Why is multiple alignment better?
- 6. Wet experience for MSA (for proteins).
2BLASTP 2.2.6 [Apr-09-2003]
RID: 1068830459-16741-16367211346.BLASTQ3
Query: gi|33875338|gb|AAH00349.2| GBA protein [Homo sapiens] (359 letters)
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
1,541,613 sequences; 503,866,891 total letters
Taxonomy reports
Distribution of 87 Blast Hits on the Query Sequence
Graphic Representation
3From the BlastP Page Go To Taxonomy Report
Organism Report (fields labelled on the screenshot): scientific name, common name, Blast (family) name, score, E-value.
- TaxBLAST hits are sorted according to the species containing the target sequence.
- All hits of the same organism are listed together.
- Within each species, TaxBLAST hits are sorted by score and E-value.
4PSI-BLAST - Position Specific Iterated BLAST
- A fast heuristic method that searches a database with a profile, using iterations. The profile built from the hits of one iteration is used as the query in the next iteration.
- Advantages of PSI-BLAST
- Identifies weak homologies (more distant relatives of a protein, not found directly by FASTA or BLAST).
- An important tool for predicting biochemical function.
Information: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
5BLAST vs PSI-BLAST
BLAST: DNA or protein. Use for close homologies.
PSI-BLAST: Proteins only. Finds distant homologies. Predicts biochemical activity and function.
http://www.rubic.rdg.ac.uk/andrew/bioinf.org/talks/LocalBlast/img0.htm
6Outline of the PSI-BLAST Algorithm
First, ordinary BLAST is used to find close homologues.
Rather than making a real multiple alignment, the close homologues are all aligned to the query sequence.
A profile is constructed using a very simple empirical weighting scheme.
Ignoring the positional variation of indels, the profile is then searched against the database again.
7What is a Conserved Position?
A conserved position is an MSA column in which a single amino-acid type occurs with high frequency.
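The definition above can be made concrete with a small sketch. The toy alignment, the function name `column_conservation`, and the 0.8 cutoff for calling a column "conserved" are all assumptions chosen for illustration, not values taken from PSI-BLAST.

```python
from collections import Counter


def column_conservation(msa):
    """For each MSA column, return (most frequent residue, its frequency).

    Gap characters ('-') are ignored when computing frequencies.
    """
    length = len(msa[0])
    result = []
    for col in range(length):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result


# Toy alignment (hypothetical sequences, equal length).
msa = ["MKVLA", "MKILA", "MRVLG", "MKVLS"]

CUTOFF = 0.8  # assumed cutoff for illustration only
for position, (residue, freq) in enumerate(column_conservation(msa), start=1):
    label = "conserved" if freq >= CUTOFF else "variable"
    print(position, residue, f"{freq:.2f}", label)
```

In this toy alignment, columns 1 and 4 contain a single residue type in every sequence, so they would count as conserved under the assumed cutoff, while column 5 would not.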
8How Does PSI-BLAST Work?
- 1. Run a gapped-BLAST search with the query sequence.
- 2. Collect all output sequences aligned to the query with E-value below a threshold (default is 0.005). Call the collection M.
- 3. Construct a profile from M. The profile is a matrix (position specific score matrix - PSSM). The matrix has 20 rows, one per AA.
- 4. Iterate steps 1 to 3 with the profile as the query. The iterative search results in increased sensitivity and detection of weak homologies.
- 5. Stop iterating when no new, significant sequences are found ("convergence").
- Note: A highly conserved position will receive a high count frequency, so it will be more significant in the next iteration than weakly conserved positions (that receive low count frequencies).
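The control flow of steps 1 to 5 can be sketched as a loop. This is only an outline under stated assumptions: `search` and `build_profile` are hypothetical callables standing in for the database search and the profile construction, and the toy versions below return made-up hits so that the loop can run on its own.

```python
def iterate_until_convergence(query, search, build_profile,
                              threshold=0.005, max_iterations=10):
    """Repeat search -> collect -> rebuild profile until no new hits appear.

    `search(profile)` must return a list of (sequence_id, e_value) pairs.
    `build_profile(query, included_ids)` must return the next profile.
    Both are placeholders supplied by the caller.
    """
    included = set()
    profile = build_profile(query, included)
    for iteration in range(1, max_iterations + 1):
        hits = search(profile)
        significant = {sid for sid, e_value in hits if e_value < threshold}
        new_hits = significant - included
        if not new_hits:
            return included, iteration  # convergence: nothing new was found
        included |= new_hits
        profile = build_profile(query, included)
    return included, max_iterations


# Toy stand-ins (made-up data): a richer profile "finds" more distant hits.
def toy_build_profile(query, included_ids):
    return frozenset(included_ids)


def toy_search(profile):
    hits = [("A", 1e-30), ("B", 1e-10), ("X", 2.0)]
    if len(profile) >= 2:
        hits.append(("C", 1e-4))
    if len(profile) >= 3:
        hits.append(("D", 0.003))
    return hits


found, rounds = iterate_until_convergence("query", toy_search, toy_build_profile)
print(sorted(found), rounds)
```

The hit "X" has an E-value above the threshold in every round, so it never enters the collection M and never influences the profile.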
9PSSM (Position Specific Scoring Matrix)
(Figure: PSSM with columns numbered by alignment position.)
- Notes
- The PSSM is generated by calculating position specific scores for each position in the alignment (conserved positions receive a high score, weakly conserved positions receive a low score).
- The profile is produced internally but is not available on the NCBI server.
- Only the first 15 positions of the profile are shown here for lack of space.
http://www.rubic.rdg.ac.uk/andrew/bioinf.org/talks/LocalBlast/img0.htm
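To illustrate how position specific scores can arise from an alignment, here is a simplified sketch. It is not the scheme PSI-BLAST actually uses: the uniform background frequency (1/20), the flat pseudocount of 1, and the absence of sequence weighting are all assumptions made to keep the example short.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return a dict mapping each amino acid to a list of log-odds scores,
    one score per alignment column (20 rows, one column per position)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    length = len(msa[0])
    pssm = {aa: [0.0] * length for aa in AMINO_ACIDS}
    for col in range(length):
        # Gaps and non-standard letters are skipped.
        column = [seq[col] for seq in msa if seq[col] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            pssm[aa][col] = math.log2(freq / background)
    return pssm


def score(pssm, sequence):
    """Sum the position specific scores of an ungapped sequence."""
    return sum(pssm[aa][i] for i, aa in enumerate(sequence))


msa = ["MKVLA", "MKILA", "MRVLG", "MKVLS"]  # toy alignment
pssm = build_pssm(msa)
print(round(pssm["M"][0], 2), round(pssm["W"][0], 2))
print(score(pssm, "MKVLA") > score(pssm, "WWWWW"))
```

A residue that dominates its column gets a positive score at that position, and a residue never seen there gets a negative one, which mirrors the note that conserved positions receive high scores.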
10http://www.idi.ntnu.no/grupper/KS-grp/microarray/slides/drablos/Fold_recognition/sld004.htm
http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html#psiblast
11PSI-BLAST - Output
Hits that are better than the E-value threshold are listed first. These hits are used in forming the profile that will be used in the next PSI-BLAST iteration.
Hits with E-values worse than the threshold that nonetheless have an E-value better than 10 (the default selected on the query page) are listed further down the page.
Any of the sequences in the list "Sequences with E-value worse than threshold" (> 0.005) can be manually added (by clicking) to the sequences used for generating the PSI-BLAST profile.
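The two-part layout of the output page can be mimicked with a short sketch. The hit identifiers and E-values are made up, the function name `partition_hits` is hypothetical, and treating both cutoffs as strict inequalities is an assumption.

```python
def partition_hits(hits, inclusion_threshold=0.005, expect=10.0):
    """Split (sequence_id, e_value) hits into the two lists on the page.

    Returns (included, listed_only), each sorted from best to worst E-value.
    Hits at or above `expect` are not reported at all.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])
    included = [hit for hit in ranked if hit[1] < inclusion_threshold]
    listed_only = [hit for hit in ranked
                   if inclusion_threshold <= hit[1] < expect]
    return included, listed_only


# Made-up hits for illustration.
hits = [("P1", 3e-40), ("P2", 0.02), ("P3", 0.001), ("P4", 15.0), ("P5", 4.2)]
included, listed_only = partition_hits(hits)
print("Used for the next profile:", included)
print("Shown but not used:", listed_only)
```

Manually adding a sequence on the web page corresponds to moving an entry from the second list into the first before the next iteration.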
12To run PSI-BLAST, start with the BLAST page
http://www.ncbi.nlm.nih.gov/BLAST/
13PSI-BLAST
14Running PSI-BLAST
- NOTES
- Use the SwissProt database and the BLOSUM62 scoring matrix.
- Default EXPECT value in BLAST is 10.
- Default threshold value for PSI-BLAST is 0.005.
- The user can see all BLASTP hits up to E-value 10, but only sequences with E-value below the threshold (< 0.005) affect the profile.
15PSI-BLAST Output - First Run - Query Human
Prosaposin.
16PSI-BLAST Output - First Run - Query Human
Prosaposin.
16 Sequences with E-value BETTER than threshold (< 0.005)
17PSI-BLAST Output, Second Run, query profile
(up from 54)
18PSI-BLAST Output, Second Run, query profile
Sequences with E-value BETTER than threshold
19PSI-BLAST Output, Third Run, query 2nd
iteration profile
20PSI-BLAST Tutorial
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
http://www.cmbi.kun.nl/bioinf/tools/psiblast.shtml
Help: http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSAHLP/npsahlp_simsearchpsiblast.html
Other Servers for PSI-Blast
http://xylian.igh.cnrs.fr/blast/psi_blast2.html
http://www.vge.ac.uk/blast/psiblast.html
http://www.cmbi.kun.nl/bioinf/tools/psiblast.shtml
21http://www.ebi.ac.uk/fasta3/