ALIGNMENT - PowerPoint PPT Presentation

Provided by: csta2
1
ALIGNMENT
  • How do we tell whether two sequences are similar?

Prev. reading: Ch 1, Ch 3. Assigned reading: Ch 11.
BIO520 Bioinformatics Jim Lund
2
Alignments
  • DNA-DNA
  • polypeptide-polypeptide

THE BASIC Sequence Analysis Operation
3
Alignments
  • Pairwise sequence alignments
  • One-to-One
  • One-to-Database
  • Multiple sequence alignments
  • Many-to-Many

4
Origins of Sequence Similarity
  • Homology
  • common evolutionary descent
  • Similarity in function
  • Convergence (very rare)
  • Chance
  • Short similar segments are very common.

5
Visual sequence comparison: Dotplot
6
Visual sequence comparison: Filtered dotplot
4 bp window, 75% identity cutoff
7
Visual sequence comparison: Dotplot
4 bp window, 75% identity cutoff
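The filtered dotplot can be sketched in a few lines. This is a minimal illustration, not the program used to make the slides: the function names `dotplot` and `render` are hypothetical, and the defaults mirror the 4 bp window and 75% identity cutoff quoted above.

```python
def dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return the set of (i, j) window start positions that pass the filter."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches / window >= min_identity:
                hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the plot as text: '*' where a window passes, '.' elsewhere."""
    rows = []
    for i in range(len(seq_a)):
        rows.append("".join("*" if (i, j) in hits else "." for j in range(len(seq_b))))
    return "\n".join(rows)


if __name__ == "__main__":
    a = b = "GAACAAT"
    print(render(a, b, dotplot(a, b)))
```

Raising the window size or the cutoff removes more of the short chance matches, at the cost of hiding weak but real similarity.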
8
Similarity
GAACAAT
GAACAAT    7/7, or 100%
Which is BETTER? How do we SCORE?
GAACAAT
GAACAAT    1/7, or 14% (same sequences, offset)
9
Similarity
GAACAAT
GAACAAT    7/7, or 100%
GAACAAT
GAATAAT    6/7, or 86%
10
Mismatches
GAACAAT
GAATAAT    6/7, or 86%
Same??
GAACAAT
GAAGAAT    6/7, or 86%
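Counting identities, as on the slides above, treats every mismatch alike. A minimal sketch of that count for two ungapped sequences of equal length follows; the function name `percent_identity` is hypothetical.

```python
def percent_identity(a, b):
    """Return (matches, percent identity) for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, 100.0 * matches / len(a)


if __name__ == "__main__":
    for other in ("GAACAAT", "GAATAAT", "GAAGAAT"):
        matches, pct = percent_identity("GAACAAT", other)
        print(other, f"{matches}/7", f"{pct:.0f}%")
```

Both the C-to-T and the C-to-G substitution give 6/7, so percent identity alone cannot say whether one mismatch is "worse" than the other; that distinction needs a scoring scheme.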
11
Terminal Mismatch
     GAACAATttttt
aaaccGAATAAT        6/7, or 86%
12
INDELS
GAAgCAAT
GAA-CAAT    7/7, or 100%
13
Indels, contd
GAAgCAAT
GAA-CAAT
14
Similarity Scoring
  • Common Method
  • Terminal mismatches (0)
  • Match score (5)
  • Mismatch penalty (-4)
  • Gap penalty (-5)
  • Gap extension penalty (-3)

DNA Defaults
15
DNA Scoring
GGGGGGGGGG
GGGGGAAAAA         5(5) + 5(-4) = 5

GGGGG-----GGGGG
GGGGGAAAAAGGGGG    10(5) + (-5) + 5(-3) = 30
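The two scores on this slide can be checked with a short scorer. This is a sketch under stated assumptions: the function name `score_alignment` is hypothetical, gaps are written as "-", a gap of length n costs the gap penalty once plus the extension penalty n times (which is how the slide's arithmetic works out), and terminal gaps get no special treatment.

```python
def score_alignment(top, bottom, match=5, mismatch=-4, gap_open=-5, gap_extend=-3):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            if not in_gap:
                score += gap_open      # charged once per gap
            score += gap_extend        # charged for every gap position
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


if __name__ == "__main__":
    print(score_alignment("GGGGGGGGGG", "GGGGGAAAAA"))
    print(score_alignment("GGGGG-----GGGGG", "GGGGGAAAAAGGGGG"))
```

One simplification to be aware of: a gap in the top row immediately followed by a gap in the bottom row is counted as a single gap here.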
16
Absurdity of Low Gap Penalty
GATCGCTACGCTCAGC
A.C.C..C..T
Perfect similarity, Every time!
17
Sequence alignment algorithms
  • Local alignment
  • Smith-Waterman
  • Global alignment
  • Needleman-Wunsch

18
Alignment Programs
  • Local alignment (Smith-Waterman)
  • BLAST (simplified Smith-Waterman)
  • FASTA (simplified Smith-Waterman)
  • BESTFIT (GCG program)
  • Global alignment (Needleman-Wunsch)
  • GAP

19
Local vs. global alignment
10 gaggc 15
 3 gaggc 7
Local alignment: alignment of regions of substantial similarity
1 gggggaaaaaggggccccc 19
1 gggggttttttttggggtttcc 22
Global alignment: alignment of the full length of the sequences
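The difference between the two strategies shows up clearly in the dynamic-programming recurrence. The sketch below computes only the optimal score, not the alignment itself, and assumes a single linear gap penalty for brevity (the slides use separate gap and gap extension penalties). The function name `align_score` and its parameters are hypothetical.

```python
def align_score(a, b, match=5, mismatch=-4, gap=-5, local=False):
    """Best alignment score: Needleman-Wunsch (global) or Smith-Waterman (local)."""
    cols = len(b) + 1
    # Global: leading gaps are penalized. Local: an alignment may start anywhere.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)        # never carry a negative prefix
                best = max(best, cell)     # the best cell can be anywhere
            cur[j] = cell
        prev = cur
    # Global alignments must end in the bottom-right corner.
    return best if local else prev[-1]


if __name__ == "__main__":
    print(align_score("GAAGCAAT", "GAACAAT"))
    print(align_score("AAAGAGGCAAA", "TTTGAGGCTTT", local=True))
```

The only changes between the two modes are the zero floor, the zero-initialized borders, and where the final score is read from.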
20
BLAST Algorithm
  • Look for a local alignment, a High-scoring Segment
    Pair (HSP)
  • Find a word (W) in query and subject. Score >
    T.
  • Extend the local alignment until the score falls
    X below its maximum.
  • Keep High-scoring Segment Pairs (HSPs) with
    scores > S.
  • Find multiple HSPs per query if present
  • Expectation value (E value) using Karlin-Altschul
    statistics
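The seed-and-extend idea in the list above can be sketched as follows. This is a simplified illustration and not BLAST's actual implementation: it uses exact word matches as seeds (so the threshold T is not modeled), extends without gaps, and the names `extend`, `find_hsps`, `w`, `x_drop`, and `s_min` are hypothetical stand-ins for W, X, and S.

```python
def extend(query, subject, qi, si, step, x_drop, match, mismatch):
    """Ungapped extension from (qi, si); returns (best score gain, length at best)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell more than X below the running maximum
        qi += step
        si += step
    return best, best_len


def find_hsps(query, subject, w=4, x_drop=10, s_min=20, match=5, mismatch=-4):
    """Return sorted (score, query_start, subject_start, length) tuples."""
    # Index every word of length w in the subject.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)

    hsps = set()  # a set, so seeds that extend to the same segment collapse
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            right, right_len = extend(query, subject, i + w, j + w, 1, x_drop, match, mismatch)
            left, left_len = extend(query, subject, i - 1, j - 1, -1, x_drop, match, mismatch)
            score = w * match + left + right
            if score > s_min:
                hsps.add((score, i - left_len, j - left_len, left_len + w + right_len))
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    print(find_hsps("CCCCGATTACACCCC", "TTTTGATTACATTTT"))
```

Indexing words first is what makes the search fast: only positions that share a word are ever extended, instead of filling a full dynamic-programming matrix.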

21
BLAST statistical significance: assessing the
likelihood a match occurs by chance
Karlin-Altschul statistic: $E = k\,m\,N\,e^{-\lambda S}$
  m = size of query sequence
  N = size of database
  k = search space scaling parameter
  $\lambda$ (lambda) = scoring scaling parameter
  S = BLAST HSP score
Low E -> good match
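A direct transcription of the Karlin-Altschul formula makes the dependence on each term easy to explore. The function name `e_value` is hypothetical, and the k and lambda values in the usage lines are made-up placeholders: real values depend on the scoring system and are reported by BLAST itself.

```python
import math


def e_value(score, m, n, k, lam):
    """Expected number of chance HSPs scoring at least `score`: E = k*m*N*exp(-lambda*S)."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameters, for illustration only.
    k, lam = 0.1, 0.3
    for s in (20, 40, 60):
        print(s, e_value(s, m=500, n=1_000_000, k=k, lam=lam))
```

E grows linearly with query and database size but shrinks exponentially with score, which is why a modest increase in S can turn a chance-level hit into a convincing one.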
22
BLAST statistical significance
  • Rule of thumb for a good match
  • Nucleotide match
  • E < 1e-6
  • Identity > 70%
  • Protein match
  • E < 1e-3
  • Identity > 25%

23
Protein Similarity
  • Identity-Easy
  • WEAK Alignments
  • Chemical Similarity
  • L vs I, K vs R
  • Evolutionary Similarity
  • How do proteins evolve?
  • How do we infer similarities?

24
Single-base evolution changes the encoded AA
  • CAU (H)
  • CAC (H), CGU (R), UAU (Y)
  • CAA (Q), CCU (P), GAU (D)
  • CAG (Q), CUU (L), AAU (N)

Selection Drift...etc
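The nine single-base neighbors of a codon can be enumerated from the standard genetic code. A minimal sketch follows; the names `CODE` and `single_base_neighbors` are hypothetical, and the table is the standard code written in UCAG order with "*" for stop codons.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_base_neighbors(codon):
    """Return (mutant codon, amino acid) for every one-base change."""
    out = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                out.append((mutant, CODE[mutant]))
    return out


if __name__ == "__main__":
    print("CAU", CODE["CAU"])
    for mutant, aa in single_base_neighbors("CAU"):
        print(mutant, aa)
```

For CAU (His), only the change to CAC is silent; the other eight changes alter the amino acid, which is the raw material that selection and drift then act on.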
25
Substitution Matrices
  • Two main classes
  • PAM-Dayhoff
  • BLOSUM-Henikoff

26
PAM-Dayhoff
  • Built from closely related proteins; substitutions
    constrained by evolution and function
  • "accepted" by evolution (Point Accepted
    Mutation = PAM)
  • 1 PAM = 1% divergence
  • PAM120: closely related proteins
  • PAM250: divergent proteins

27
BLOSUM-Henikoff & Henikoff
  • Built from ungapped alignments in proteins
    (BLOCKS)
  • Merge sequences in a block at a given % similarity
    into one sequence
  • Calculate target frequencies
  • BLOSUM62: 62% similar blocks
  • good general purpose
  • BLOSUM30
  • Detects weak similarities, used for distantly
    related proteins

28
BLOSUM62
29
Gapped alignments
  • No general theory for significance of matches!!
  • Gap penalty = G + L(n): gap penalty G plus
    extension penalty L times gap length n
  • indel mutations rare
  • variation in gap length easy, G > L

30
Real Alignments
Protein-Protein
Close-Distant
DNA-DNA
31
Phylogeny
Myoglobin
32
Cow-to-Pig
88% identical
33
Cow-to-Pig cDNA
80% identity (88% at aa!)
34
DNA similarity reflects polypeptide similarity
35
Coding vs Non-coding Regions
90% in coding, 74% in non-coding
36
Third Base of Codon Hypervariable
28 third base, 11 second, 8 first
37
Cow-to-Fish Protein
42% identity, 51% similarity
38
Cow-to-Fish DNA
48% similarity: significant
30%: NOT significant
39
Protein vs DNA Alignments
  • Polypeptide similarity > DNA
  • Coding DNA > Non-coding
  • 3rd base of codon hypervariable
  • Moderate Distance ?
  • poor DNA similarity

40
Rules of Thumb
  • DNA-DNA similarities
  • 50% significant if long
  • E < 1e-6, 70% identity
  • Protein-protein similarities
  • 80% end-to-end: same structure, same function
  • 30% over domain: similar function, structure
    overall similar
  • 15-30%: twilight zone
  • Short, strong match: could be a motif

41
Finding similar sequences: Database searches
  • DNA
  • Polypeptide
  • FASTA
  • BLAST

42
Basic BLAST Family
  • BLASTN
  • DNA to DNA database
  • BLASTP
  • protein to protein database
  • TBLASTN
  • protein to DNA database (translated)
  • BLASTX
  • DNA (translated) to protein database
  • TBLASTX
  • DNA (translated) to DNA database (translated)

43
DNA Databases
  • nr (non-redundantish merge of Genbank, EMBL,
    etc)
  • EXCLUDES EST, STS, GSS
  • est (expressed sequence tags)
  • htgs (high throughput genome seq.)
  • gss (genome survey sequence)
  • vector, yeast, ecoli, mito
  • chromosome (complete genomes)
  • And more

44
Protein Databases
  • nr (non-redundant Swiss-Prot, PIR, PRF, PDB,
    Genbank CDS)
  • swissprot
  • ecoli, yeast, fly
  • month
  • And more

45
BLAST Input
  • Program
  • Database
  • Options-see more
  • Sequence
  • FASTA
  • gi or accession

>one line
gggtcgagtac
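FASTA input is a ">" header line followed by one or more sequence lines. A minimal parser sketch is shown below; the function name `parse_fasta` is hypothetical, and in practice a library such as Biopython would normally be used instead.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    print(parse_fasta(">one line\ngggtcgagtac\n"))
```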
46
BLAST Options
  • Algorithm and output options
  • descriptions, alignments returned
  • probability cutoff
  • Strand
  • Alignment parameters
  • Scoring Matrix
  • PAM10, PAM40, PAM120, PAM250, BLOSUM62
  • Filter (low complexity) PPPPP -> XXXXX
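The effect of the low-complexity filter can be imitated with a crude run masker. This is only an illustration of the PPPPP to XXXXX idea: BLAST's own filters (SEG for proteins, DUST for nucleotides) use compositional complexity measures rather than simple repeats, and the function name `mask_runs` and its `min_run` parameter are hypothetical.

```python
import re


def mask_runs(seq, min_run=5, mask_char="X"):
    """Replace any run of min_run or more identical letters with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


if __name__ == "__main__":
    print(mask_runs("MKPPPPPLV"))
```

Masked positions cannot seed a hit, which keeps proline-rich or poly-A stretches from producing long lists of biologically meaningless matches.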

47
Extended BLAST Family
  • Gapped Blast (default)
  • PSI-Blast (Position-specific iterated blast)
  • self generated scoring matrix
  • PHI BLAST (motif plus BLAST)
  • BLAST2 client (align two seqs)
  • megablast (genomic sequence)
  • rpsblast (search for domains)