Title: Why is pairwise sequence alignment different
1Lecture 5
- Why is pairwise sequence alignment different for proteins and for nucleic acids?
- 1. General protein introduction.
- 2. Scoring systems and matrices for protein data.
- 3. Wet experience for pairwise sequence alignment (for proteins, more options).
- 4. Special BLAST pages.
- 5. Why is multiple alignment better?
- 6. Wet experience for MSA (for proteins).
2Scoring Systems
- Identity: Count the number of identical matches and divide by the length of the aligned region (in %).
- Similarity: A less well-defined measure of how close 2 sequences are.
- Chemical similarities among amino acids: http://www.imb-jena.de/IMAGE_AA.html
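The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names are made up for this example, and the grouping of chemically similar amino acids is only one possible assumption (published groupings differ in detail). Gap positions count toward the aligned length but never as matches.

```python
# Hypothetical grouping of chemically similar amino acids (an assumption for
# illustration; other groupings are equally defensible).
SIMILAR_GROUPS = [
    set("AVLIM"),  # hydrophobic, aliphatic
    set("FWY"),    # aromatic
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("STNQ"),   # polar, uncharged
]
GAP = "-"


def _check(a, b):
    """Both rows of the alignment must be non-empty and equally long."""
    if not a or len(a) != len(b):
        raise ValueError("aligned sequences must be non-empty and of equal length")


def percent_identity(a, b):
    """Identical matches divided by the length of the aligned region, in %."""
    _check(a, b)
    matches = sum(1 for x, y in zip(a, b) if x == y and x != GAP)
    return 100.0 * matches / len(a)


def is_similar(x, y):
    """True for identical residues or residues in the same assumed group."""
    if GAP in (x, y):
        return False
    return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)


def percent_similarity(a, b):
    """Identical or chemically similar pairs divided by aligned length, in %."""
    _check(a, b)
    return 100.0 * sum(is_similar(x, y) for x, y in zip(a, b)) / len(a)


query = "MKVLA-DE"
subject = "MRVIAGDD"
print(percent_identity(query, subject))    # 50.0
print(percent_similarity(query, subject))  # 87.5
```

Identity is unambiguous, while the similarity value depends entirely on the grouping chosen, which is why it is the less well-defined measure.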
3Related Amino Acids
http://www.imb-jena.de/IMAGE_AA.html
4Protein Scoring Matrices
- Family of matrices listing the likelihood of changes from one amino acid to another during evolution.
- The two most popular matrices are the PAM and the BLOSUM matrices.
5PAM Matrix - Point Accepted Mutations
- PAM matrices are based on alignments of closely related sequences.
- In these related proteins, the function was not significantly changed.
The changes are accepted by natural selection (the mutations survived during evolution).
6PAM Scoring Matrices
PAM units measure evolutionary distance.
PAM 1 matrix: Substitution scores arising from sequences where one percent of amino acid pairs are different.
Note: PAM 1 is a small change -> the sequences will be almost identical.
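Matrices for larger PAM distances are extrapolated from PAM 1 by applying it repeatedly (matrix multiplication). The sketch below illustrates this on a toy 3-letter alphabet. The probabilities in TOY_PAM1 are invented for illustration and are not Dayhoff's values; the helper names are likewise hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Apply the mutation matrix n times (n PAM units)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def fraction_unchanged(m):
    """Average diagonal entry, assuming all letters are equally frequent."""
    return sum(m[i][i] for i in range(len(m))) / len(m)


# Toy "PAM 1": each letter stays the same with probability 0.99,
# so one percent of positions are expected to differ (made-up numbers).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for n in (1, 40, 120, 250):
    print(n, round(fraction_unchanged(mat_pow(TOY_PAM1, n)), 3))
```

In this toy model a sizeable fraction of positions is still unchanged after 250 steps, because later mutations can hit positions that already changed or can revert them. PAM units therefore measure evolutionary distance, not the percentage of observed differences.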
7PAM Scoring Matrices
- In general:
- Low PAM numbers are used for aligning short sequences with strong local similarities.
- High PAM numbers are used for aligning long sequences with weak similarities.
- When there is no information about evolutionary distance, 3 matrices are recommended for sequence comparison: PAM 40, PAM 120 and PAM 250.
8PAM Family of Matrices (Dayhoff, 78)
(log odds)
Values > 0 in the log-odds PAM matrix indicate likely mutations; values of 0 are neutral; values < 0 indicate unlikely mutations.
Note: Numbers along the diagonal are not all equal. The diagonal indicates how conserved a residue tends to be (W is VERY conserved).
Calculate a PAM matrix: enter the desired PAM value (must be greater than 1 and less than 512) at http://www.cmbi.kun.nl/bioinf/tools/pam.shtml
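The sign convention of a log-odds matrix follows directly from how a log-odds score is formed: the observed frequency of an aligned pair is compared with the frequency expected by chance. The sketch below shows the general idea only. The function name, the frequencies and the scaling factor of 10 with base-10 logarithms are assumptions for illustration, not the exact procedure behind any published matrix.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, scale=10.0):
    """Rounded log-odds score for one amino acid pair.

    observed_pair_freq: how often the pair is seen aligned in related sequences
    freq_a, freq_b:     background frequencies of the two amino acids
    scale:              assumed scaling factor (hypothetical choice)
    """
    expected_by_chance = freq_a * freq_b
    return round(scale * math.log10(observed_pair_freq / expected_by_chance))


# Made-up frequencies: both amino acids have a background frequency of 0.1.
print(log_odds(0.04, 0.1, 0.1))    # pair seen 4x more often than chance -> positive
print(log_odds(0.01, 0.1, 0.1))    # pair seen as often as chance        -> 0
print(log_odds(0.0025, 0.1, 0.1))  # pair seen 4x less often than chance -> negative
```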
9THE BLOSUM Family of Matrices
Blocks Substitution Matrices (BLOSUM matrices are based on a much larger dataset than PAM).
- Blocks are short conserved patterns, 3-60 aa long.
- Proteins can be divided into families by common blocks.
- Different BLOSUM matrices emerge by looking at sequences with different identity percentages.
Example: BLOSUM62 is derived from block alignments in which sequences that share at least 62% identity are clustered and counted as one.
(Figure: blocks A, B, C, D.)
10THE BLOSUM Family of Matrices
Blocks Substitution Matrices
(log odds)
11PAM vs. BLOSUM Matrices
Widely used
- Tips for protein similarity search:
- Start with BLOSUM 62 or PAM 120, default gap penalties.
- If no significant results are found, use BLOSUM 45 or PAM 250 and lower gap penalties, to find more divergent results.
- Examine results above E-value 0.05 for divergent sequences.
- Use PSI-BLAST to discover weak but biologically significant sequence similarities.
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
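The E-value tip can be sketched as a simple triage of search hits: hits at or below the threshold are treated as significant, and hits above it are set aside for manual examination as possible divergent relatives. The Hit type, the function name and the toy hits are hypothetical; real BLAST output would have to be parsed first.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    subject: str   # hypothetical identifier of the database sequence
    evalue: float  # E-value reported for the hit


def triage_hits(hits: List[Hit], threshold: float = 0.05) -> Tuple[List[Hit], List[Hit]]:
    """Split hits into (significant, to-examine), each sorted by E-value."""
    ranked = sorted(hits, key=lambda h: h.evalue)
    significant = [h for h in ranked if h.evalue <= threshold]
    to_examine = [h for h in ranked if h.evalue > threshold]
    return significant, to_examine


# Invented hits, for illustration only.
toy_hits = [Hit("seq_A", 3e-40), Hit("seq_B", 0.8), Hit("seq_C", 0.002), Hit("seq_D", 0.2)]
significant, to_examine = triage_hits(toy_hits)
print([h.subject for h in significant])  # ['seq_A', 'seq_C']
print([h.subject for h in to_examine])   # ['seq_D', 'seq_B']
```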
12Lecture 5
- Why is pairwise sequence alignment different for proteins and for nucleic acids?
- 1. General protein introduction.
- 2. Scoring systems and matrices for protein data.
- 3. Wet experience for pairwise sequence alignment (for proteins, more options).
- 4. Special BLAST pages.
- 5. Why is multiple alignment better?
- 6. Wet experience for MSA (for proteins).
13http://www.ncbi.nlm.nih.gov/BLAST/
14http://www.ncbi.nlm.nih.gov/BLAST/
15http://www.ebi.ac.uk/swissprot/
16Protein Query
17Options for Advanced
19Examples of Alignment Formats http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/multi_formats.html
20Pairwise Alignment in BLAST Output
low complexity sequence filtered
Positives
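In a BLAST protein alignment, the middle line shows the residue letter for an identical pair and "+" for a non-identical pair with a positive matrix score; "Positives" counts both kinds. The sketch below reproduces that bookkeeping for an ungapped toy alignment. The sequences and function names are invented, and BLOSUM62_SUBSET holds only the few BLOSUM62 entries needed for this example, so any other pair raises a KeyError.

```python
# A handful of BLOSUM62 entries, keyed by the alphabetically sorted pair.
BLOSUM62_SUBSET = {
    ("M", "M"): 5,
    ("V", "V"): 4,
    ("W", "W"): 11,
    ("K", "R"): 2,
    ("I", "L"): 2,
    ("A", "G"): 0,
}


def pair_score(x, y):
    """Look up the score of a residue pair, ignoring the order of the pair."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def summarize(query, subject):
    """Return (midline, identities, positives, raw score) for an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    midline, identities, positives, score = [], 0, 0, 0
    for x, y in zip(query, subject):
        s = pair_score(x, y)
        score += s
        if x == y:
            identities += 1
            positives += 1
            midline.append(x)    # identical residue: show the letter
        elif s > 0:
            positives += 1
            midline.append("+")  # conservative substitution
        else:
            midline.append(" ")  # zero or negative score
    return "".join(midline), identities, positives, score


query = "MKVLAW"
subject = "MRVIGW"
midline, identities, positives, score = summarize(query, subject)
print("Query  " + query)
print("       " + midline)
print("Sbjct  " + subject)
print(f"Identities = {identities}/{len(query)}, Positives = {positives}/{len(query)}, raw score = {score}")
```

The A/G pair scores 0 in BLOSUM62, so it is neither an identity nor a positive and leaves a blank in the middle line.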