Title: Similarity: a guide
1. Similarity: a guide
2. DNA
A C G T
Four nucleotides
Any one nucleotide has a 25% chance of matching
at a given position. Therefore two random
nucleic acid sequences will have about 25%
nucleotide identity.
3. A good match
When you search a database of size X bases, what
are the chances of getting a good match?
- A sequence of 4 bases is present every 4^4 = 256 bases
- A sequence of 6 bases is present every 4^6 = 4096 (about 4000) bases
- A perfect match of 16 bases would be expected by chance in a cDNA
  library
The bigger the database, the better the
chance of getting a long match purely by chance.
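The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming all four bases are equally likely and independent of their neighbours; the function names are hypothetical and do not come from any search tool.

```python
def chance_spacing(match_length, alphabet_size=4):
    """Average number of bases between chance occurrences of one specific word.

    Assumes every letter is equally likely and independent of its neighbours.
    """
    return alphabet_size ** match_length


def expected_chance_matches(database_size, match_length, alphabet_size=4):
    """Expected number of chance occurrences of one specific word in a database."""
    return database_size / chance_spacing(match_length, alphabet_size)


if __name__ == "__main__":
    # Spacing for the word lengths used on the slide.
    for k in (4, 6, 16):
        print(k, chance_spacing(k))
    # A larger database gives proportionally more chance hits of a given length.
    for size in (10**6, 10**9):
        print(size, expected_chance_matches(size, 16))
```

Because the expected count scales linearly with database size, the same 16-base match that is rare in a small database becomes unremarkable in a very large one.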
4. How good is a good match?
- Database search engines look at the size of the
  database, the size and identity of the match,
  and calculate the likelihood (P) of such a match
  occurring by chance.
- 1: expected to occur by random chance
- 0.01 (10^-2) is my minimum cutoff
- DNA and protein sequences are not truly random, so the
  statistics are skewed
- Different databases calculate P differently.
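One common way to turn an expected number of chance hits into a probability is to assume the hits follow a Poisson distribution. The sketch below makes that assumption explicitly; it is not the formula of any particular search engine (they calculate P differently), and the function names are hypothetical.

```python
import math


def probability_of_chance_match(expected_hits):
    """P(at least one chance hit), ASSUMING hits are Poisson with this mean."""
    # 1 - exp(-E), written with expm1 to stay accurate for very small E.
    return -math.expm1(-expected_hits)


def passes_cutoff(p_value, cutoff=0.01):
    """True if the match is at least as significant as the chosen cutoff."""
    return p_value <= cutoff


if __name__ == "__main__":
    for expected in (1.0, 0.01, 1e-6):
        p = probability_of_chance_match(expected)
        print(expected, p, passes_cutoff(p))
```

For small values the probability and the expected count are nearly identical, which is why the two are often quoted interchangeably for strong matches.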
5. Protein
- 20 amino acids
- Any one amino acid has a 1 in 20 chance of
  matching randomly
- Two random protein sequences will have 5%
  sequence identity
- Searching a database with a protein sequence
  gives many fewer background hits, and is much
  more sensitive on statistical grounds
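The 25% and 5% background identities can be checked with a small simulation. This is only a sketch: it assumes every letter is equally frequent, which real sequences are not, and the function name is hypothetical.

```python
import random

DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 amino acids


def random_identity(alphabet, length=100_000, seed=0):
    """Fraction of matching positions between two random, ungapped sequences."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    first = rng.choices(alphabet, k=length)
    second = rng.choices(alphabet, k=length)
    matches = sum(x == y for x, y in zip(first, second))
    return matches / length


if __name__ == "__main__":
    print("DNA    ", random_identity(DNA))
    print("Protein", random_identity(PROTEIN))
```

The lower background for the 20-letter alphabet is the statistical reason a protein search produces fewer chance hits than a DNA search.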
6. DNA or protein searches?

                    DNA         Protein
Sensitivity         Poor        Much better
Size of database    Excellent   Poor
7. Significance of match
STHREBEERPMSILAGERPWYTCHERRYBRANDYSRED
QREBNNBEERALWHQLAGERNGCNRCHERRYBRANDYQQE
8. Significance of match
STHREBEERPMSILAGERPWYTCHERRYBRANDYSRED
QREBNNBEERALWHQLAGERNGCNRCHERRYBRANDYQQE
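The shared words in the two example sequences can be found programmatically. The sketch below is a simple dynamic-programming scan for exact, ungapped shared segments; it is an illustration rather than the method a database search uses, and the function name and the minimum length of 4 are assumptions.

```python
def shared_segments(a, b, min_len=4):
    """Return maximal exact substrings shared by a and b, at least min_len long."""
    found = []
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                # The run is maximal if it cannot be extended by one more letter.
                run_ends = i == len(a) or j == len(b) or a[i] != b[j]
                if run_ends and curr[j] >= min_len:
                    found.append(a[i - curr[j]:i])
        prev = curr
    return found


if __name__ == "__main__":
    seq1 = "STHREBEERPMSILAGERPWYTCHERRYBRANDYSRED"
    seq2 = "QREBNNBEERALWHQLAGERNGCNRCHERRYBRANDYQQE"
    for segment in shared_segments(seq1, seq2):
        print(len(segment), segment)
```

For the two sequences above it reports BEER, LAGER and CHERRYBRANDY. The longer a shared segment is, the less likely it is to have arisen by chance.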
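9. Similarity
DNA
Transitions (purine <-> purine, pyrimidine <-> pyrimidine)
A <-> G
C <-> T
Transversions (purine <-> pyrimidine)
C <-> G
A <-> T
(also A <-> C and G <-> T)
Transitions and transversions occur at different
rates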
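The distinction can be made concrete with a short classifier. This is a minimal sketch for ungapped, equal-length DNA sequences; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Classify a pair of aligned bases as identical, transition or transversion."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "identical"
    # Both purines or both pyrimidines: the chemical class is unchanged.
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_substitutions(seq_a, seq_b):
    """Tally substitution types between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    counts = {"identical": 0, "transition": 0, "transversion": 0}
    for x, y in zip(seq_a, seq_b):
        counts[classify_substitution(x, y)] += 1
    return counts


if __name__ == "__main__":
    print(count_substitutions("ACGT", "GCTT"))
```

A scoring scheme that reflects the different rates would penalise the two classes differently; the tally above is the input such a scheme needs.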
10. Similarity: proteins
Dayhoff matrix
Based on proteins with > 85% identity
11. Similarity: proteins
- BLOSUM matrix
  - based on proteins with low similarity
  - much more sensitive for detecting related
    proteins
- Matrices based on structural models
  - yet to be well described
12. Multiple Sequence Alignment
- Analysis of related proteins can demonstrate
  conserved residues, structural features of the
  protein, and identify novel facets of protein
  biology
- Prerequisite for phylogenetic analysis of
  proteins and DNA
13. MSA
- When aligning two sequences of 1000 aa, there are
  (10^3)^2 = 10^6 possibilities
- When aligning three sequences, 10^9
- For nine sequences, there are 10^27 possible
  alignments
- Multiple sequence alignments rely on cheats (heuristics) to
  make the alignment
  - these alignments are not guaranteed to be optimal
- Check manually!
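The growth in the figures above follows from raising the sequence length to the number of sequences. A minimal sketch, using the same simplification as the slide (all sequences the same length; the function name is hypothetical):

```python
def alignment_search_space(sequence_length, n_sequences):
    """Size of the search space, counted as length ** number of sequences."""
    return sequence_length ** n_sequences


if __name__ == "__main__":
    for n in (2, 3, 9):
        size = alignment_search_space(1000, n)
        print(f"{n} sequences of 1000 aa: {size:.0e}")
```

Each extra sequence multiplies the work by another factor of 1000, which is why exhaustive alignment is dropped in favour of shortcuts beyond a handful of sequences.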
14. MSA
- Align: use ClustalX
- Check your alignment manually!
- Visualise your alignment with Genedoc or Boxshade
  - allows the detection of conserved residues
- Phylogeny: go on Paul Sharp's course!
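Detecting conserved residues in a finished alignment amounts to checking each column. The sketch below is a bare-bones illustration and is not how Genedoc or Boxshade work internally; the toy aligned sequences and the function name are made up for the example.

```python
def conserved_columns(alignment, gap="-"):
    """Return (column index, residue) for columns identical in every sequence."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residues = set(column)
        # A column counts only if it holds a single residue and no gap.
        if len(residues) == 1 and gap not in residues:
            conserved.append((index, column[0]))
    return conserved


if __name__ == "__main__":
    # Hypothetical aligned sequences, for illustration only.
    toy_alignment = ["MKV-LA", "MRVHLA", "MKVQLG"]
    for index, residue in conserved_columns(toy_alignment):
        print(index, residue)
```

A check like this only reports what the alignment already contains, so a misaligned column will hide a conserved residue; this is one more reason to inspect the alignment by hand first.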