Data Searches and Sequence Alignments - PowerPoint PPT Presentation

Provided by: cclearn

Transcript and Presenter's Notes


1
Data Searches and Sequence Alignments
  • Assessing pairwise sequence similarity: BLAST
  • Creation and analysis of protein multiple
    sequence alignment

2
Why do we want to perform an alignment?
  • Assumption: evolutionary (phylogenetic)
    relationship
  • Functional implication
  • Build phylogenetic tree
  • Sequences that can be used for alignment:
    nucleotide sequences or protein sequences

3
  • Homology: a YES or NO question
  • yes: the sequences share a common ancestor
  • no: they are not related
  • Similarity can be described as a fractional
    value
  • AAATACGCGGTAATAGCATGCATTAGTGGT
  • AATTACGCCGTAATTGCAAGCATTAGTGGT
  • 26/30 = 87% identity
  • Too short to determine if they are homologous
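
The identity fraction for the two nucleotide sequences above can be checked with a few lines of Python. This is a minimal sketch that assumes the sequences are already aligned position by position with no gaps; the function name `percent_identity` is illustrative and not part of the original slides.

```python
def percent_identity(a: str, b: str) -> tuple[int, int, float]:
    """Count identical positions in two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, len(a), 100.0 * matches / len(a)


# The two nucleotide sequences from the slide
seq1 = "AAATACGCGGTAATAGCATGCATTAGTGGT"
seq2 = "AATTACGCCGTAATTGCAAGCATTAGTGGT"

matches, length, pct = percent_identity(seq1, seq2)
print(f"{matches}/{length} = {pct:.0f}% identity")  # 26/30 = 87% identity
```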

4
  • For amino acid sequences
  • MATPGAGGRDKLIVASCYPVLIFIIAWQMQEP
  • MHSPGAAGKERLLVASCYPVIGFILAWNSQDP
  • Identity: 21/32 = 66%
  • Similarity: 28/32 = 87.5%
  • In general, protein sequences that share over 30%
    similarity are likely to be homologues.
  • Usually protein sequences are longer than 50
    residues (minimal length for a domain)

5
Evaluation of two sequences
  • Dot plot
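
A dot plot places one sequence along each axis and marks every position where the two agree, so shared regions show up as diagonal runs. The sketch below is a text-only illustration; the function name `dot_plot` and the `window` parameter (the number of consecutive residues that must match before a dot is drawn) are assumptions chosen for this example, and the test strings are the two names from the gap slide with the gap characters removed.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 1) -> list[str]:
    """Return the rows of a text dot plot.

    Row i, column j holds '*' when the `window` residues starting at
    seq_y[i] exactly match the `window` residues starting at seq_x[j].
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


plot = dot_plot("DOROTHYCROWFOOTHODKIN", "DOROTHYHODKIN", window=3)
for line in plot:
    print(line)
```

With a window of 3 the plot shows two diagonal segments offset by eight columns, which is the signature of a gap (an insertion or deletion) between the two matching regions.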

6
DOROTHYCROWFOOTHODKIN
DOROTHY--------HODKIN
GAP
7
Aligned local sequence
8
Tandem repeat
9
Low complexity
10
Identify exon-intron
11
Inverted repeat for terminator
12
Aligned with frame 1
frameshift
Aligned with frame 3
13
Simple alignment
  • Match score and penalty

14
Gaps in alignment
15
Origin of gaps
  • Insertion vs. deletion (indel) events
  • One-step event in evolution
  • Origination penalty (gap-open penalty) → higher
    penalty value
  • Length penalty (gap-extension penalty) → smaller
    penalty value
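
The two-part gap penalty can be illustrated by scoring an alignment that is already given. This is a minimal sketch: the function name `score_alignment` and all numeric values (match +1, mismatch -1, gap open -5, gap extend -1) are assumptions chosen for illustration, not values from the slides.

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score a pre-aligned pair ('-' marks a gap) with affine gap penalties.

    The first position of each gap run costs `gap_open`; every further
    position in the same run costs `gap_extend`.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in = None  # which sequence the current gap run sits in, if any
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            which = "a" if x == "-" else "b"
            score += gap_extend if gap_in == which else gap_open
            gap_in = which
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


one_long_gap = score_alignment("ACACTGATCG", "ACACTG----")
four_short_gaps = score_alignment("ACACTGATCG", "AC-C-G-T-G")
print(one_long_gap, four_short_gaps)  # -2 -14
```

Both alignments contain six matches and four gap positions, but the single four-residue gap scores much better than four separate one-residue gaps, which reflects the idea that an indel is a one-step event.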

16
Scoring matrices -- nucleotide
Conservative substitutions are more likely to be
preserved in evolution.
Not every mismatch should give the same penalty →
weighted score
17
Scoring matrices -- amino acid residues
  • PAM (point accepted mutation) matrix: the scores
    are computed from the substitutions that occur in
    alignments between highly similar sequences
    (relative mutability)
  • PAM unit -- the amount of evolutionary time
    required for an average of one substitution per
    100 residues to be observed
  • Lower numbered PAM matrices are more suitable for
    comparing closely related proteins
  • Usually PAM250 is used
  • BLOSUM matrix: the substitution rates are
    calculated by statistical clustering methods from
    un-gapped alignments of related protein
    sequences
  • Higher numbered BLOSUM matrices are more
    suitable for comparing closely related proteins
  • Usually BLOSUM62 is used (derived from aligned
    blocks in which sequences more than 62% identical
    are clustered together)

18
Similarity: Physico-Chemical Properties of Amino
Acids
http://swift.embl-heidelberg.de/course/BSchap-4.html#prin
21
Algorithms for searching the best alignment
  • Exhaustive search is intractable
  • Dynamic programming: breaking a problem into
    sub-problems and using partial results to compute
    the final answer
  • S. Needleman and C. Wunsch algorithm
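
The dynamic programming idea can be sketched as a small Needleman-Wunsch global aligner. This is an illustrative version under simplifying assumptions: a single linear gap penalty instead of separate open and extension penalties, and arbitrary scores (match +1, mismatch -1, gap -2) that are not taken from the slides.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[int, str, str]:
    """Global alignment by dynamic programming (linear gap penalty)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for a[i-1] against b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),  # align two residues
                          F[i - 1][j] + gap,            # gap in b
                          F[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right corner to recover one best alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACACTGATCG", "ACACTG")
print(score)
print(top)
print(bottom)
```

Each cell of the table is filled once from its three neighbours, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.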

22
GLOBAL AND LOCAL ALIGNMENT
  • Global alignment: compares two sequences over
    their entire length
  • Gap penalties are assessed regardless of where
    the gaps are located
  • Semi-global alignment -- terminal gaps are not
    penalized

23
ACACTGATCG
ACACTG----
24
GLOBAL AND LOCAL ALIGNMENT
  • LOCAL ALIGNMENT
  • Functional units in a sequence are more conserved
    than flexible regions
  • T. F. Smith and M. Waterman algorithm
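
Local alignment uses the same dynamic programming table as global alignment with two changes: a cell is never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes a linear gap penalty and illustrative scores (match +2, mismatch -1, gap -2) that do not come from the slides; the test sequences are made up.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple[int, str, str]:
    """Best local alignment by dynamic programming (linear gap penalty)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("TTTTGATTACATTTT", "CCCCGATTACACCCC")
print(result)  # (14, 'GATTACA', 'GATTACA')
```

The unrelated flanking regions are ignored and only the conserved core is reported, which is the behaviour wanted when a functional unit is embedded in otherwise divergent sequence.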

25
  • Why do you want to do a databank search?
  • Early discovery of protein purification artifacts
  • Identification of new/unknown proteins
  • Look for possible functions of an unknown protein
  • Collect sequences for other studies
  • phylogenetic analysis
  • primer design
  • looking for new motifs

26
Database Search
Six frame translation
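
A six-frame translation reads a nucleotide sequence in three frames on the forward strand and three on the reverse complement, so that a protein query can be compared with DNA whose coding frame is unknown. The sketch below assumes the standard genetic code and an input containing only A, C, G and T; the function names and the example sequence are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict[str, str]:
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
for name, protein in frames.items():
    print(name, protein)
```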
28
Word size 4
29
Score is related to the length of match
30
Significance of the BLAST result
  • $E = k\,m\,N\,e^{-\lambda S}$

m: length of the query
N: total number of letters in the target database
λ: constant, dependent on the scoring matrix
S: score of the HSP
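
The expect value on this slide can be evaluated directly once the constants are known. In the sketch below the function name `expect_value` is illustrative, and the values passed for k and λ are placeholders chosen only to show the shape of the calculation; real values depend on the scoring matrix and gap penalties and are reported by BLAST itself.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 k: float, lam: float) -> float:
    """Expected number of chance hits scoring at least `score`:
    E = k * m * N * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder constants (assumed for illustration, not from the slides)
K_ASSUMED = 0.041
LAMBDA_ASSUMED = 0.267

for s in (40, 60, 80, 100):
    e = expect_value(s, query_len=250, db_len=10**9,
                     k=K_ASSUMED, lam=LAMBDA_ASSUMED)
    print(f"S = {s:3d}  E = {e:.3g}")
```

E grows linearly with the query length and the database size but falls exponentially with the score, so a modest gain in score outweighs a large increase in database size.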
31
Score of two random sequences of the same length
Low score with a lot of hits → no significance
Z = (score - mean) / std dev (cutoff: 5)
Score high enough → not just a random hit
If the score of the observed alignment is no
better than expected from a random permutation of
the sequence, then it is likely to have arisen by
chance.
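
The permutation test described here can be sketched as follows: score the real pair, score the first sequence against many shuffled copies of the second, and express the real score as a Z value relative to the shuffled scores. Everything in this sketch is an assumption made for illustration: the simple match/mismatch/gap scores stand in for a real substitution matrix, the number of shuffles and the random seed are arbitrary, and the two protein fragments are the ones from the amino acid identity slide.

```python
import random
import statistics


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def shuffle_z_score(a: str, b: str, trials: int = 200, seed: int = 0) -> float:
    """Z = (observed score - mean of shuffled scores) / their std deviation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = local_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, random order
        shuffled_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (observed - mean) / sd if sd > 0 else float("inf")


z = shuffle_z_score("MATPGAGGRDKLIVASCYPVLIFIIAWQMQEP",
                    "MHSPGAAGKERLLVASCYPVIGFILAWNSQDP")
print(f"Z = {z:.1f}")
```

Shuffling keeps the length and composition of the sequence but destroys the order of residues, so the shuffled scores estimate what chance alone produces for sequences of that kind.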
37
BLAST cutoffs
  • Nucleotide
  • E < 10^-6
  • Identity > 70%
  • Protein
  • E < 10^-3
  • Identity > 25%
  • Be careful around the twilight zone: use
    reciprocal BLAST, or BLAST a randomized sequence

38
megablast
  • Variation of blastn
  • Aligns long or highly similar nucleotide sequences
  • Locates the sequence in a contig
  • Word size 28, for almost exact matches
  • No gap-open penalty, for a fast search; tends to
    give matches with more small gaps

39
PSI-BLAST
  • Position-Specific Iterated BLAST
  • Identifies distantly related proteins
  • Uses a position-specific scoring matrix (PSSM) to
    train the search; each iteration adds the newly
    found residues to the profile (PSSM) and uses the
    new profile for the next iteration (search).
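
How a profile captures position-specific preferences can be shown with a toy PSSM built from a handful of ungapped, equal-length hits. This is a simplified sketch and not PSI-BLAST's actual procedure: it assumes a uniform background frequency for all 20 amino acids and a fixed pseudocount, and the function names and example hits are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits: list[str],
               pseudocount: float = 1.0) -> list[dict[str, float]]:
    """One column of log2-odds scores per alignment position."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned_hits[0])):
        counts = Counter(seq[pos] for seq in aligned_hits)
        column = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                  for aa in AMINO_ACIDS}
        pssm.append(column)
    return pssm


def score_with_pssm(pssm: list[dict[str, float]], segment: str) -> float:
    """Sum the position-specific scores of `segment` against the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


hits = ["VASCYPV", "LASCYPV", "VASCFPV"]  # hypothetical first-round hits
pssm = build_pssm(hits)
print(round(score_with_pssm(pssm, "VASCYPV"), 2))
print(round(score_with_pssm(pssm, "WWWWWWW"), 2))
```

Residues seen often at a position score positively there and unseen residues score negatively, so adding hits from each round changes the scores used in the next search.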

40
Start from BLAST → use the high-scoring hits to make
a PSSM → use the PSSM for a second BLAST →
repeat for a 3rd
41
Check the hits to be used to make the new PSSM