Similarity: a guide - PowerPoint PPT Presentation

Provided by: david282
Transcript and Presenter's Notes

Title: Similarity: a guide


1
Similarity: a guide
2
DNA
A C G T
Four nucleotides
Any one nucleotide has a 25% chance of matching
at a given position. Therefore two random
sequences of nucleic acid will have about 25%
nucleotide identity (assuming all four bases are
equally frequent).
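The 25% figure can be checked with a short calculation. This is a minimal sketch, assuming each position is drawn independently from fixed base frequencies; the function name and the AT-rich frequencies are made up for illustration.

```python
def expected_identity(freqs):
    """Expected fraction of matching positions between two random sequences
    drawn independently from the same residue frequencies."""
    # Two positions match on residue r with probability p_r * p_r.
    return sum(p * p for p in freqs.values())


uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical AT-rich composition, for illustration only.
at_rich_dna = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}

print(expected_identity(uniform_dna))
print(round(expected_identity(at_rich_dna), 4))
```

With equal base frequencies the result is exactly 0.25; any skew in composition pushes the background identity slightly above that.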
3
A good match
When you search a database of size X bases, what
are the chances of getting a good match? A given
sequence of 4 bases is present on average every
4^4 = 256 bases. A sequence of 6 bases is present
every 4^6 = 4096 (about 4000) bases. A perfect
match of 16 bases would be expected by chance in
a cDNA library. The bigger the database, the
better the chance of getting a long match purely
by chance.
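The arithmetic on this slide can be sketched as follows. This assumes equally frequent bases, a single strand, and ignores edge effects; the function names and the database sizes are illustrative, not taken from any search tool.

```python
def expected_spacing(word_length, alphabet_size=4):
    """Average number of positions between chance occurrences of one
    specific word, assuming equally frequent, independent residues."""
    return alphabet_size ** word_length


def expected_chance_hits(database_size, word_length, alphabet_size=4):
    """Approximate number of chance occurrences of one specific word in a
    database of the given size (in residues)."""
    return database_size / expected_spacing(word_length, alphabet_size)


for k in (4, 6, 16):
    print(k, expected_spacing(k))

# Hypothetical database sizes: the expected count grows with the database.
for size in (10**6, 10**9):
    print(size, expected_chance_hits(size, 16))
```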
4
How good is a good match ?
  • Database search engines look at the size of the
    database, the size and identity of the match,
    and calculate the likelihood (P) of such a match
    occurring by chance.
  • P = 1: the match is expected to occur by random chance
  • P = 0.01 (10^-2) is my minimum cutoff
  • DNA and protein sequences are not truly random, so
    the statistics are skewed
  • Different databases calculate P differently.
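One common way to turn an expected number of chance hits into a probability is a Poisson model. The sketch below is an assumption about how such a calculation could look, not the formula of any particular database; as the slide notes, different databases calculate P differently. The function names are hypothetical, and the 0.01 default is the cutoff from the slide.

```python
import math


def p_from_expected(expected_hits):
    """Probability of at least one chance hit, assuming the number of
    chance hits is Poisson distributed: P = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) and stays accurate for very small E.
    return -math.expm1(-expected_hits)


def is_significant(p_value, cutoff=0.01):
    """Apply the slide's minimum cutoff of 0.01 (10^-2)."""
    return p_value <= cutoff


for expected in (1.0, 0.005):
    p = p_from_expected(expected)
    print(expected, p, is_significant(p))
```

When the expected count is small, P and the expected count are nearly equal, so the choice between them matters mostly for weak matches.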

5
Protein
  • 20 amino acids
  • Any one amino acid has a 1 in 20 chance of
    matching randomly
  • Two random protein sequences will have about 5%
    sequence identity
  • Searching a database with a protein sequence
    gives many fewer background hits, and is much
    more sensitive on statistical grounds
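The statistical advantage of the larger alphabet can be illustrated by asking how long an exact match must be before fewer than one chance occurrence is expected. This is a rough sketch assuming equally frequent residues; the function name and the database size of 10^9 residues are hypothetical.

```python
def min_word_length(database_size, alphabet_size):
    """Shortest exact-match length for which fewer than one chance
    occurrence is expected, i.e. alphabet_size ** k > database_size."""
    k = 1
    while alphabet_size ** k <= database_size:
        k += 1
    return k


database_size = 10**9  # hypothetical database size, in residues
print("DNA:", min_word_length(database_size, 4))
print("Protein:", min_word_length(database_size, 20))
```

Under these assumptions a DNA match needs roughly twice as many identical positions as a protein match to stand out from the background.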

6
DNA or protein searches ?
                    DNA         Protein
Sensitivity         Poor        Much better
Size of database    Excellent   Poor
7
Significance of match
STHREBEERPMSILAGERPWYTCHERRYBRANDYSRED
QREBNNBEERALWHQLAGERNGCNRCHERRYBRANDYQQE
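The shared "words" in the two example sequences can be picked out programmatically. This is a minimal sketch that reports every maximal exact common substring of at least four letters; it is a plain dynamic-programming scan, not the method of any particular search tool, and the function name and threshold are illustrative.

```python
def shared_words(a, b, min_len=4):
    """Return maximal exact common substrings of a and b with length of at
    least min_len, in the order in which they end in a."""
    found = []
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Length of the common run ending at a[i-1] and b[j-1].
                curr[j] = prev[j - 1] + 1
                # The run is maximal if it cannot be extended to the right.
                run_ends = i == len(a) or j == len(b) or a[i] != b[j]
                if run_ends and curr[j] >= min_len:
                    found.append(a[i - curr[j]:i])
        prev = curr
    return found


seq1 = "STHREBEERPMSILAGERPWYTCHERRYBRANDYSRED"
seq2 = "QREBNNBEERALWHQLAGERNGCNRCHERRYBRANDYQQE"
print(shared_words(seq1, seq2))
```

The longer the shared word, the less likely it is to have arisen by chance, which is the intuition behind the significance of a match.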
8
Significance of match
STHREBEERPMSILAGERPWYTCHERRYBRANDYSRED
QREBNNBEERALWHQLAGERNGCNRCHERRYBRANDYQQE
9
Similarity
DNA
Transitions (purine ↔ purine, pyrimidine ↔ pyrimidine)
A ↔ G
C ↔ T
Transversions (purine ↔ pyrimidine)
C ↔ G
A ↔ T
(also A ↔ C and G ↔ T)
Transitions and transversions occur at different
rates
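The distinction can be made concrete by classifying each difference between two aligned DNA sequences. This is a small sketch assuming uppercase A, C, G, T only and sequences of equal length with no gaps; the function names and example sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Classify a pair of aligned bases (uppercase A, C, G or T)."""
    if x == y:
        return "identical"
    # A change within purines or within pyrimidines is a transition.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(seq1, seq2):
        kind = classify(x, y)
        if kind in counts:
            counts[kind] += 1
    return counts


print(count_substitutions("AGCT", "GACA"))
```

Because the two classes occur at different rates, a DNA similarity score can reasonably weight them differently rather than treating every mismatch alike.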
10
Similarity- proteins
Dayhoff Matrix
Based on proteins with > 85% identity
11
Similarity- proteins
  • BLOSUM matrix
  • based on proteins with low similarity
  • much more sensitive for detecting related
    proteins
  • Matrices based on structural models
  • yet to be well described

12
Multiple Sequence Alignment
  • Analysis of related proteins can demonstrate
    conserved residues, structural features of the
    protein, and identify novel facets of protein
    biology
  • Prerequisite for phylogenetic analysis of
    proteins and DNA

13
MSA
  • When aligning two sequences of 1000 aa, there are
    (10^3)^2 = 10^6 possibilities
  • When aligning three sequences, 10^9
  • For nine sequences, there are 10^27 possible
    alignments
  • Multiple sequence alignments rely on 'cheats' to
    make the alignment
  • These alignments are not guaranteed to be optimal
  • Check manually!
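The growth in the numbers above follows from multiplying the sequence lengths together. The sketch below simply reproduces the slide's counting, which treats each added sequence as another factor of its length; the function name is hypothetical.

```python
def search_space(sequence_length, n_sequences):
    """Size of the search space as counted on the slide: one factor of the
    sequence length for every sequence being aligned."""
    return sequence_length ** n_sequences


for n in (2, 3, 9):
    print(n, search_space(1000, n))
```

An exhaustive search becomes impractical after only a handful of sequences, which is why multiple alignment programs fall back on shortcuts and why the result should be checked by eye.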

14
MSA
  • Align: use ClustalX
  • Check your alignment manually!
  • Visualise your alignment with Genedoc or Boxshade
  • This allows the detection of conserved residues
  • Phylogeny: go on Paul Sharp's course!