Title: Multiple Sequence Alignments and PSI-BLAST
1Multiple Sequence Alignments and PSI-BLAST
2- Reading assignments
- Xiong, Chapters 5 and 6
3Topics
- Overview of MSA
- MSA methods
- Practical aspects
- MSA to Profiles
- PSI-BLAST
4What are MSAs used for?
- Identification of protein families
- More sensitive identification of remote homologues
- Identification of conserved, functionally important sites
- Starting point for phylogenetic studies
- Defining protein family sets for entire genomes
- Prediction of protein secondary structure
- Identification of regulatory regions
5Overview of MSA
- Alignment of 3 or more sequences to bring as many similar characters into register as possible
- Hypothetical model of mutations (substitutions, insertions, and deletions)
- Best represents the most likely evolutionary scenario
- Cannot be unambiguously established
6MSA methods
- Hierarchical
  - Most common and accurate
  - ClustalW is most popular MSA program
  - ClustalW advanced development
  - Can propagate errors made early on
- Non-hierarchical
  - Can align sequences of different lengths
  - May fail on larger sequence sets
  - T-Coffee
7Overview of hierarchical method
- Do a pairwise comparison of all sequences
- Create a guide tree of the most to least similar
- Align 2 most similar, then next 2 most similar
- Add sequences progressively in decreasing order of similarity
- Gaps added are never removed
8Clustal W
- CLUSTAL = Cluster alignment
- The underlying concept is that groups of sequences are phylogenetically related.
- If they can be aligned, then one can construct a tree.
Reference: Thompson et al. (1994) Nucleic Acids Res. 22, 4673-4680
9Step 1-pairwise alignments
Compare each sequence with every other sequence and calculate a distance matrix.
     A     B     C
A    -
B    .87   -
C    .59   .60   -
(Rows and columns are the different sequences A, B, C.)
Each number represents the number of exact matches divided by the sequence length (ignoring gaps). Thus, the higher the number, the more closely related the two sequences are.
In this distance matrix, sequence A is 87% identical to sequence B.
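The identity score described above can be sketched in Python. This is a minimal illustration, not ClustalW's own code. It assumes each pair is already aligned (equal-length strings with '-' for gaps) and reads "sequence length (ignoring gaps)" as the number of columns where neither sequence has a gap. The function names and the toy sequences are hypothetical; they are not the sequences behind the .87/.59/.60 values.

```python
from itertools import combinations


def fraction_identity(a, b):
    """Exact matches divided by the number of gap-free columns.

    Both strings must come from the same alignment (equal length).
    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return matches / len(pairs)


def identity_matrix(aligned):
    """Return {(name1, name2): fraction identity} for every pair."""
    return {
        (p, q): fraction_identity(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {
    "A": "MKVLA-GHW",
    "B": "MKVLSAGHW",
    "C": "MRVIA-GQW",
}
scores = identity_matrix(toy)
for pair, value in scores.items():
    print(pair, value)
```

In this toy case A and B score highest, so they would be the first pair joined in the guide tree.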
10Step 2-Create Guide Tree
Use the distance matrix to create a guide tree that determines the order in which the sequences are aligned.
[Figure: the distance matrix for A, B, C from Step 1 shown next to the resulting guide tree. A and B join first, labeled 0.87 (0.13); C joins next, labeled 0.60 (0.40).]
Branch length is proportional to the estimated divergence between A and B (0.13).
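The clustering order can be sketched with a simple average-linkage (UPGMA-style) procedure applied to the three values from the matrix. This is a stand-in for illustration: ClustalW itself builds its guide tree by neighbor joining, and the helper name `upgma_order` is hypothetical. Distance is assumed to be 1 minus the similarity score.

```python
def upgma_order(names, similarity):
    """Return merges as (cluster_a, cluster_b, distance), closest first.

    `similarity` maps frozenset({x, y}) to a fraction-identity score.
    Distance is 1 - similarity; the distance between two clusters is the
    average over all pairs of their members (average linkage).
    """
    clusters = [(name,) for name in names]

    def dist(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(1.0 - similarity[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Find the two clusters with the smallest average distance.
        d, i, j = min(
            (dist(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], round(d, 3)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Similarity values taken from the Step 1 matrix.
sim = {
    frozenset("AB"): 0.87,
    frozenset("AC"): 0.59,
    frozenset("BC"): 0.60,
}
merges = upgma_order(["A", "B", "C"], sim)
for left, right, d in merges:
    print(left, right, d)
```

A and B merge first at a distance of 0.13, matching the figure. Average linkage then places C at 0.405 (the mean of 0.41 and 0.40), whereas the figure labels that node 0.40; the small difference comes from the linkage rule assumed here.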
11Step 3-Progressive Alignment
First, align A and B. Then add sequence C to the previous alignment. In closely related sequences, gaps are given a heavier weight than in more divergent sequences.
[Figure: guide tree]
12Amino acid weight matrices
- Series of scoring matrices that one can use depending on the relatedness of the proteins aligned.
- As the alignment proceeds in CLUSTALW, the AA weight matrices are changed to more divergent scoring matrices.
- Length of the branch is used to determine which matrix to use and contributes to the alignment score.
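As a rough illustration of switching matrices as divergence grows, the sketch below maps a branch divergence to a BLOSUM matrix name. The cut-offs are assumptions chosen for illustration (they follow ranges commonly quoted for ClustalW's BLOSUM series, but check the program's documentation before relying on them), and `pick_matrix` is a hypothetical helper, not a ClustalW function.

```python
# Assumed, illustrative cut-offs: (minimum percent identity, matrix name).
ASSUMED_CUTOFFS = [
    (80.0, "BLOSUM80"),
    (60.0, "BLOSUM62"),
    (30.0, "BLOSUM45"),
    (0.0, "BLOSUM30"),
]


def pick_matrix(branch_divergence):
    """Map a guide-tree divergence (0..1) to a scoring-matrix name."""
    if not 0.0 <= branch_divergence <= 1.0:
        raise ValueError("divergence must be between 0 and 1")
    identity = (1.0 - branch_divergence) * 100.0
    for floor, name in ASSUMED_CUTOFFS:
        if identity >= floor:
            return name
    return ASSUMED_CUTOFFS[-1][1]


# The A-B branch from the guide-tree example (divergence 0.13).
print(pick_matrix(0.13))
# A much more divergent pair.
print(pick_matrix(0.90))
```

The point of the design is that closely related sequences are scored with a matrix built from similar sequences, and the matrix is relaxed as more distant sequences are added.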
13Globin alignment
- Start with a group of 7 globin-related sequences from different species
- Do pairwise alignments between all 7 sequences
- Calculate the similarity between each pair; a higher score indicates greater similarity
14- Cluster the sequences by similarity to create a guide tree
- Branch length is proportional to estimated divergence between the two sequences
18Example of Sequence Alignment using Clustal W
Consensus symbols: * identity, : high similarity, . low similarity; - marks a gap in a sequence
Amino acids are often color-coded based on physicochemical properties
19Globin alignment
20Guide tree phylogram
21ClustalW programs
- Locally by itself
  - Download and install ClustalX on any platform
  - Graphic interface
- Locally, as part of another package
  - BioEdit: free, runs on Windows
  - Geneious: basic version free (Mac, Windows)
- Web servers
  - EBI ClustalW server: www.ebi.ac.uk/clustalw
22Practical aspects
- Identify and download sequences in the correct format
- Sequences should meet the criteria for MSA
  - Closely related (E < 1)
  - Similar length and number of domains
  - Same domain order
- If necessary, extract regions of similar length
- Name them appropriately
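A quick pre-alignment check on sequence lengths can be sketched as follows. The FASTA parser is deliberately minimal, the 20% tolerance is an arbitrary assumption, and the sequences and names are hypothetical.

```python
from statistics import median


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}.

    The name is the first word after '>' on each header line.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def length_outliers(records, tolerance=0.2):
    """Names whose length differs from the median by more than `tolerance`."""
    mid = median(len(seq) for seq in records.values())
    return [
        name for name, seq in records.items()
        if abs(len(seq) - mid) > tolerance * mid
    ]


# Hypothetical input: seq3 is much longer than the others.
toy_fasta = """>seq1 toy example
MKVLAAGHWT
>seq2
MKVLSAGHWS
>seq3
MKVLAAGHWTMKVLAAGHWTMKVL
"""
records = parse_fasta(toy_fasta)
print(length_outliers(records))
```

Sequences flagged this way are candidates for trimming to the shared region before alignment.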
23Alignment viewers
- Edit and prepare for publication
- Different coloring schemes
- Jalview -- Java based interactive viewer (free)
25MSAs to Profiles
- MSAs can be used to find remote homologs or remote members of a protein family
- PSI-BLAST
  - Automated, available as part of NCBI BLAST
- Hidden Markov Models (HMMs)
  - More sensitive
  - Less automated
  - Basis of the Pfam database
26Why?
- Database searches using a profile or position-specific scoring matrix (PSSM) are much more sensitive for detecting weak or distant relationships than database searches using a single sequence as the query
- Information content is higher in a PSSM
27Pairwise alignment
28What is a PSSM?
Position Specific Scoring Matrix
29MSAs to PSSM
POS   123456
Seq1  ATGTCG
Seq2  AAGACT
Seq3  TACTCA
Seq4  CGGAGG
Seq5  AACCTG
30ATGTCG AAGACT TACTCA CGGAGG AACCTG
Convert MSA to raw frequency table
31Normalize by dividing by overall frequencies
32Convert the values to log to the base of 2
PSSM
33Match AACTCG in the matrix
SUM = 1.0 + 1.0 + 0.8 + 1.0 + 1.38 + 1.15 = 6.33
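The three steps above (raw frequencies, normalization by overall frequencies, log base 2) can be sketched in Python using the five sequences from the example. Assumptions: overall frequencies are computed from the alignment itself, no pseudocounts are added, and a residue never seen at a position gets a score of negative infinity. Function names are illustrative.

```python
import math
from collections import Counter

MSA = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]
ALPHABET = "ACGT"


def build_pssm(msa):
    """Return a list of {residue: log2 score} dicts, one per column."""
    n_seqs = len(msa)
    length = len(msa[0])
    # Overall frequency of each residue across the whole alignment.
    counts = Counter("".join(msa))
    overall = {r: counts[r] / (n_seqs * length) for r in ALPHABET}
    pssm = []
    for pos in range(length):
        column = Counter(seq[pos] for seq in msa)
        scores = {}
        for r in ALPHABET:
            raw = column[r] / n_seqs        # step 1: raw frequency
            ratio = raw / overall[r]        # step 2: normalize
            # Step 3: log base 2; unobserved residues score -infinity.
            scores[r] = math.log2(ratio) if ratio > 0 else float("-inf")
        pssm.append(scores)
    return pssm


def score(pssm, query):
    """Sum the per-position scores of `query` against the matrix."""
    return sum(column[res] for column, res in zip(pssm, query))


pssm = build_pssm(MSA)
match_score = score(pssm, "AACTCG")
print(round(match_score, 2))
```

With unrounded overall frequencies the AACTCG score comes out at about 6.31; the 6.33 above is consistent with rounding the overall frequencies to two decimals before taking logs.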
34PSI-BLAST
- Position-Specific Iterated BLAST
- What is it and how does it relate to MSA?
- How is it related to BLAST?
- What can I do with it?
35Steps in PSI-BLAST
- Single protein sequence compared to database using BLASTP
- Construct a multiple alignment and profile (PSSM) from any significant local alignments
- Profile or PSSM is compared to database, making local alignments
- Estimate statistical significance of local alignments
- Iterate an arbitrary number of times or until convergence (no new sequences added)
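The control flow of these steps (search, rebuild the profile, search again, stop at convergence) can be sketched as a generic loop. This is not PSI-BLAST itself: `search` and `build_profile` are caller-supplied stand-ins, and the toy "database" below is hypothetical.

```python
def iterate_profile_search(query, search, build_profile, max_iterations=5):
    """PSI-BLAST-style loop over caller-supplied stand-in functions.

    search(query_or_profile) -> set of significant hit identifiers
    build_profile(hits)      -> profile built from those hits
    Returns (included hits, rounds run, converged flag).
    """
    included = set(search(query))          # round 1: single-sequence search
    rounds = 1
    while rounds < max_iterations:
        profile = build_profile(included)  # alignment + PSSM in the real tool
        hits = set(search(profile))
        rounds += 1
        new_hits = hits - included
        if not new_hits:                   # convergence: no new sequences
            return included, rounds, True
        included |= new_hits
    return included, rounds, False


# Toy stand-ins: each sequence "detects" only its listed neighbours.
NEIGHBOURS = {"q": {"a"}, "a": {"b"}, "b": {"c"}, "c": set()}


def toy_search(profile):
    members = {profile} if isinstance(profile, str) else set(profile)
    found = set()
    for member in members:
        found |= NEIGHBOURS.get(member, set())
    return found | (members - {"q"})


def toy_profile(hits):
    return frozenset(set(hits) | {"q"})


result = iterate_profile_search("q", toy_search, toy_profile)
print(result)
```

In the toy run each round reaches one more distant member ("a", then "b", then "c"), which mirrors how a growing profile picks up more remote homologues until nothing new is found.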
36Practical uses of PSI-BLAST
- Can create a PSSM using PSI-BLAST against one database, then use the same PSSM against another database for a more sensitive search
- Does not have to run to convergence to create a PSSM useful for finding remote homologues; usually 2 or 3 iterations are sufficient
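With the standalone NCBI BLAST+ `psiblast` program, this two-step use might look like the sketch below. It only builds the command lines and does not run them. The file and database names are hypothetical placeholders, and the options shown (`-num_iterations`, `-inclusion_ethresh`, `-out_pssm`, `-in_pssm`, `-evalue`) should be checked against `psiblast -help` for the installed version.

```python
import shlex


def build_pssm_cmd(query_fasta, db, pssm_out, iterations=3, inclusion_e=0.005):
    """Command to run a few PSI-BLAST rounds and save the PSSM checkpoint."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_e),
        "-out_pssm", pssm_out,
    ]


def reuse_pssm_cmd(pssm_in, db, evalue=10):
    """Command to search a second database with the saved PSSM."""
    return [
        "psiblast",
        "-in_pssm", pssm_in,
        "-db", db,
        "-evalue", str(evalue),
    ]


# Hypothetical file and database names, for illustration only.
first = build_pssm_cmd("query.fasta", "first_db", "query.pssm")
second = reuse_pssm_cmd("query.pssm", "second_db")
for cmd in (first, second):
    print(" ".join(shlex.quote(part) for part in cmd))
```

To execute a command, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and the databases are installed.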
37Sma4 protein from C. elegans
- Sma4 protein, 570 aa long
- Protein domains
38BLASTP against RefSeq DB
. . .
39PSI-BLAST iteration 1
- Default threshold is E = 0.005
E = 0.023
40PSI-BLAST iteration 2
E = 2e-29
41PSI-BLAST iteration 3
E = 2e-39
. . .
42Homologs?
- 30-50% identical over short stretches
Sma4 protein
XP_001668763
43Homologs?
- Sma4 protein (570 aa)
- XP_001668763 (531 aa)
44Finding homologs in other species
- Sma4 BLASTP against RefSeq limited to Gallus gallus (chicken)
- 17 hits with E-value < 10
45Use PSSM to search other DB
- PSSM from the 3rd iteration with Sma4 used in a PSI-BLAST search of RefSeq limited to Gallus gallus (chicken)
- 127 matches with E-value < 10
46Computer lab
- PSI-BLAST to find remote homologues
- MSA of proteins for genotyping
- MSA to determine homology