Title: ALIGNMENT
1ALIGNMENT
- How do we tell whether two sequences are similar?
Prev. reading: Ch 1, Ch 3. Assigned reading: Ch 11
BIO520 Bioinformatics Jim Lund
2Alignments
- DNA-DNA
- polypeptide-polypeptide
THE BASIC Sequence Analysis Operation
3Alignments
- Pairwise sequence alignments
- One-to-One
- One-to-Database
- Multiple sequence alignments
- Many-to-Many
4Origins of Sequence Similarity
- Homology
- common evolutionary descent
- Similarity in function
- Convergence (very rare)
- Chance
- Short similar segments are very common.
5Visual sequence comparison Dotplot
6Visual sequence comparison Filtered dotplot
4 bp window, 75% identity cutoff
7Visual sequence comparison Dotplot
4 bp window, 75% identity cutoff
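A filtered dotplot of this kind can be sketched in a few lines. This is a minimal illustration, assuming a 4 bp window and a 75% identity cutoff as on the slides; the function names `dotplot` and `render` are hypothetical, not taken from any dotplot program.

```python
def dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return the set of (i, j) where the windows starting at seq_a[i] and
    seq_b[j] match at or above the identity cutoff."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            pairs = zip(seq_a[i:i + window], seq_b[j:j + window])
            matches = sum(a == b for a, b in pairs)
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, base in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(base + " " + row)
    return "\n".join(lines)


a, b = "GAACAATGGC", "GAATAATGGC"
print(render(a, b, dotplot(a, b)))
```

Similar regions show up as diagonal runs of `*`. Raising `window` or `min_identity` removes more of the short chance matches.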
8Similarity
GAACAAT
GAACAAT   7/7 or 100%
Which is BETTER? How do we SCORE?
GAACAAT
GAACAAT   1/7 or 14% (same sequences, shifted alignment)
9Similarity
GAACAAT
GAACAAT   7/7 or 100%
GAACAAT
GAATAAT   6/7 or 86%
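The identity fractions on these slides can be checked with a short helper. This is a sketch that assumes the two rows are already aligned and of equal length; `percent_identity` is a hypothetical name.

```python
def percent_identity(row_a, row_b):
    """Return (matches, columns, percent) for two aligned rows of equal length.
    Comparison ignores case; a column containing a gap ('-') never matches."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(
        a == b and a != "-"
        for a, b in zip(row_a.upper(), row_b.upper())
    )
    return matches, len(row_a), 100.0 * matches / len(row_a)


print(percent_identity("GAACAAT", "GAACAAT"))
print(percent_identity("GAACAAT", "GAATAAT"))
```

Percent identity alone does not say which mismatch is worse or how to treat gaps, which is why a scoring scheme is needed.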
10Mismatches
GAACAAT
GAATAAT   6/7 or 86%
Same??
GAACAAT
GAAGAAT   6/7 or 86%
11Terminal Mismatch
     GAACAATttttt
aaaccGAATAAT        6/7 or 86%
12INDELS
GAAgCAAT
GAA-CAAT   7/7 or 100%
13Indels, contd
GAAgCAAT
GAACAAT
14Similarity Scoring
- Common Method
- Terminal mismatches (0)
- Match score (5)
- Mismatch penalty (-4)
- Gap penalty (-5)
- Gap extension penalty (-3)
DNA Defaults
15DNA Scoring
GGGGGGGGGG
GGGGGAAAAA        5(5) + 5(-4) = 5

GGGGG-----GGGGG
GGGGGAAAAAGGGGG   10(5) + (-5) + 5(-3) = 30
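The two scores above can be reproduced with a small scorer. This is a minimal sketch that assumes the DNA defaults from the previous slide (match 5, mismatch -4, gap -5, gap extension -3) and assumes every gapped column is charged the extension penalty on top of a one-time opening penalty. It does not implement the free terminal mismatches, and `alignment_score` is a hypothetical name.

```python
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 5, -4, -5, -3


def alignment_score(top, bottom):
    """Score two already-aligned rows of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top.upper(), bottom.upper()):
        if a == "-" or b == "-":
            if not in_gap:
                score += GAP_OPEN      # charged once per run of gap columns
                in_gap = True
            score += GAP_EXTEND        # charged for every gapped column
        else:
            in_gap = False
            score += MATCH if a == b else MISMATCH
    return score


print(alignment_score("GGGGGGGGGG", "GGGGGAAAAA"))
print(alignment_score("GGGGG-----GGGGG", "GGGGGAAAAAGGGGG"))
```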
16Absurdity of Low Gap Penalty
GATCGCTACGCTCAGC
 A.C.C..C..T
Perfect similarity, Every time!
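When gaps cost nothing, a query "matches perfectly" whenever its letters occur in the same order anywhere in the subject. The check below is a sketch of that degenerate case; `is_subsequence` is a hypothetical helper.

```python
def is_subsequence(query, subject):
    """True if query can be laid onto subject using only free gaps."""
    remaining = iter(subject)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so the letters must be found in order.
    return all(ch in remaining for ch in query)


print(is_subsequence("ACCCT", "GATCGCTACGCTCAGC"))
```

Almost any short query passes this test against a long enough subject, so the resulting score carries no information.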
17Sequence alignment algorithms
- Local alignment
- Smith-Waterman
- Global alignment
- Needleman-Wunsch
18Alignment Programs
- Local alignment (Smith-Waterman)
- BLAST (simplified Smith-Waterman)
- FASTA (simplified Smith-Waterman)
- BESTFIT (GCG program)
- Global alignment (Needleman-Wunsch)
- GAP
19Local vs. global alignment
10 gaggc 15
 3 gaggc 7
Local alignment: alignment of regions of substantial similarity

1 gggggaaaaaggggccccc 19
1 gggggttttttttggggtttcc 22
Global alignment: alignment of the full length of the sequences
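The difference between the two approaches shows up in the dynamic-programming recurrences. The sketch below computes scores only (no traceback) and, for brevity, assumes a single linear gap penalty rather than separate opening and extension penalties. The function names and default scores are illustrative assumptions.

```python
def global_score(a, b, match=5, mismatch=-4, gap=-5):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=5, mismatch=-4, gap=-5):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart instead of going negative.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(global_score("GAACAAT", "GAATAAT"))
print(local_score("TTTTGAGGCAAAA", "CCGAGGCCC"))
```

The only structural differences are the initial row and column (gap penalties versus zeros), the floor at zero, and where the final score is read from (last cell versus best cell).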
20BLAST Algorithm
- Look for local alignment, a High Scoring Pair (HSP)
- Find word (W) in query and subject with score > T.
- Extend local alignment until score falls to maximum - X.
- Keep High Scoring Segment Pairs (HSPs) with scores > S.
- Find multiple HSPs per query if present
- Expectation value (E value) using Karlin-Altschul stats
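The seed-and-extend idea can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it assumes exact word matches rather than neighborhood words scored against T, and ungapped extension only. All function names, and the parameters `w`, `x_drop` and `min_score` (standing in for W, X and S), are hypothetical.

```python
def find_seeds(query, subject, w=4):
    """Return (query_pos, subject_pos) for every shared exact word of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_right(query, subject, i, j, x_drop, match=5, mismatch=-4):
    """Extend without gaps from (i, j); stop once the running score has
    dropped more than x_drop below the best seen. Return (best, length)."""
    score = best = best_len = k = 0
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


def hsps(query, subject, w=4, x_drop=10, min_score=20):
    """Return sorted (score, query_start, subject_start, length) tuples."""
    found = set()
    for i, j in find_seeds(query, subject, w):
        right, rlen = extend_right(query, subject, i, j, x_drop)
        # Extend leftwards by running the same routine on reversed prefixes.
        left, llen = extend_right(query[:i][::-1], subject[:j][::-1], 0, 0, x_drop)
        if left + right >= min_score:
            found.add((left + right, i - llen, j - llen, llen + rlen))
    return sorted(found, reverse=True)


print(hsps("TTTTGAGGCAAAA", "CCGAGGCCC"))
```

Overlapping seeds on the same diagonal extend to the same segment, so collecting results in a set removes the duplicates.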
21BLAST statistical significance: assessing the likelihood a match occurs by chance
Karlin-Altschul statistic: $E = k\,m\,N\,e^{-\lambda S}$
- m: size of query sequence
- N: size of database
- k: search space scaling parameter
- $\lambda$: scoring scaling parameter
- S: BLAST HSP score
Low E -> good match
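The formula is easy to evaluate directly. In the sketch below the values given for `k` and `lam` are placeholders chosen only for illustration; real values depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """Expected number of chance HSPs scoring at least `score`:
    E = k * m * N * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters (assumptions, not values from any real matrix).
for s in (20, 40, 80):
    print(s, expect_value(s, query_len=300, db_len=1_000_000, k=0.1, lam=0.3))
```

E grows linearly with the query and database sizes and shrinks exponentially with the score, so the same HSP is less significant in a larger database.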
22BLAST statistical significance
- Rule of thumb for a good match
- Nucleotide match
- E < 1e-6
- Identity > 70%
- Protein match
- E < 1e-3
- Identity > 25%
23Protein Similarity
- Identity-Easy
- WEAK Alignments
- Chemical Similarity
- L vs I, K vs R
- Evolutionary Similarity
- How do proteins evolve?
- How do we infer similarities?
24Single-base evolution changes the encoded AA
- CAU = H
- CAC = H, CGU = R, UAU = Y
- CAA = Q, CCU = P, GAU = D
- CAG = Q, CUU = L, AAU = N
Selection Drift...etc
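The nine single-base neighbors of CAU can be enumerated mechanically. This sketch builds the standard genetic code from the usual compact 64-letter string (bases ordered U, C, A, G); `single_base_neighbors` is a hypothetical helper name.

```python
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_base_neighbors(codon):
    """Map every codon one substitution away from `codon` to its amino acid."""
    result = {}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                result[mutant] = CODON_TABLE[mutant]
    return result


print(CODON_TABLE["CAU"], single_base_neighbors("CAU"))
```

For CAU (His), one of the nine changes is silent (CAC) and the other eight change the amino acid, which is one reason some substitutions are seen far more often than others.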
25Substitution Matrices
- Two main classes
- PAM-Dayhoff
- BLOSUM-Henikoff
26PAM-Dayhoff
- Built from closely related proteins; substitutions constrained by evolution and function
- "Accepted" by evolution (Point Accepted Mutation = PAM)
- 1 PAM = 1% divergence
- PAM120 = closely related proteins
- PAM250 = divergent proteins
27BLOSUM-Henikoff & Henikoff
- Built from ungapped alignments in proteins: BLOCKS
- Merge sequences in a block that are at least a given % similar into one sequence
- Calculate target frequencies
- BLOSUM62 = 62% similar blocks
- good general purpose
- BLOSUM30
- Detects weak similarities, used for distantly
related proteins
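Once target frequencies are in hand, each matrix entry is a log-odds score: the observed frequency of a residue pair divided by the frequency expected from the residue composition alone, in half-bit units. The sketch below works on toy alignment columns and omits the merging step entirely; `log_odds_scores` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(columns):
    """columns: strings, each one column of an ungapped block.
    Returns {(a, b): score} for every residue pair that was observed."""
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    # Target frequency q_ij of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}
    # Residue frequency p_i implied by the pairs.
    p = Counter()
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


print(log_odds_scores(["AA", "AA", "LL", "LI"]))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative ones. Pairs never observed in the toy data simply have no entry here.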
28BLOSUM62
29Gapped alignments
- No general theory for significance of matches!!
- Gap penalty: G + L(n)
- indel mutations rare
- variation in gap length easy, G > L
30Real Alignments
Protein-Protein
Close-Distant
DNA-DNA
31Phylogeny
Myoglobin
32Cow-to-Pig
88% identical
33Cow-to-Pig cDNA
80% identity (88% at aa!)
34DNA similarity reflects polypeptide similarity
35Coding vs Non-coding Regions
90% in coding, 74% in non-coding
36Third Base of Codon Hypervariable
28 third base, 11 second, 8 first
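A tally like this can be produced by counting mismatches by codon position. The sketch assumes two coding sequences that are already aligned, in frame, and free of gaps; `differences_by_codon_position` is a hypothetical name and the example sequences are made up.

```python
def differences_by_codon_position(cds_a, cds_b):
    """Return [first, second, third]: mismatch counts at each codon position."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("need equal-length, in-frame coding sequences")
    counts = [0, 0, 0]
    for idx, (x, y) in enumerate(zip(cds_a.upper(), cds_b.upper())):
        if x != y:
            counts[idx % 3] += 1
    return counts


print(differences_by_codon_position("ATGGCCAAA", "ATGGCTAAG"))
```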
37Cow-to-Fish Protein
42% identity, 51% similarity
38Cow-to-Fish DNA
48% similarity: Significant
30%: NOT significant
39Protein vs DNA Alignments
- Polypeptide similarity > DNA
- Coding DNA > Non-coding
- 3rd base of codon hypervariable
- Moderate Distance ?
- poor DNA similarity
40Rules of Thumb
- DNA-DNA similarities
- 50% significant if long
- E < 1e-6, 70% identity
- Protein-protein similarities
- 80% end-to-end: same structure, same function
- 30% over domain: similar function, structure overall similar
- 15-30%: twilight zone
- Short, strong match: could be a motif
41Finding similar sequences: Database searches
42Basic BLAST Family
- BLASTN
- DNA to DNA database
- BLASTP
- protein to protein database
- TBLASTN
- protein to DNA database (translated)
- BLASTX
- DNA (translated) to protein database
- TBLASTX
- DNA (translated) to DNA database (translated)
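The translated members of the family (BLASTX, TBLASTN, TBLASTX) compare sequences at the protein level by translating the nucleotide side in all six reading frames. A minimal sketch of that translation step, assuming the standard genetic code and plain A/C/G/T input; the function names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


print(six_frames("ATGGCCAAATAG"))
```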
43DNA Databases
- nr (non-redundant-ish merge of Genbank, EMBL, etc)
- EXCLUDES EST, STS, GSS
- est (expressed sequence tags)
- htgs (high throughput genome seq.)
- gss (genome survey sequence)
- vector, yeast, ecoli, mito
- chromosome (complete genomes)
- And more
44Protein Databases
- nr (non-redundant: Swiss-prot, PIR, PRF, PDB, Genbank CDS)
- swissprot
- ecoli, yeast, fly
- month
- And more
45BLAST Input
- Program
- Database
- Options-see more
- Sequence
- FASTA
- gi or accession
>one line
gggtcgagtac
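FASTA input is a header line starting with `>` followed by one or more lines of sequence. A small reader for that layout might look like this; `parse_fasta` is a hypothetical helper, not part of BLAST.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)   # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


print(parse_fasta(">one line\ngggtcgagtac\n"))
```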
46BLAST Options
- Algorithm and output options
- descriptions, alignments returned
- probability cutoff
- Strand
- Alignment parameters
- Scoring Matrix
- PAM10, PAM40, PAM120, PAM250, BLOSUM62
- Filter (low complexity) PPPPP -> XXXXX
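The effect of the low-complexity filter can be imitated with a much cruder rule than BLAST uses. BLAST relies on dedicated programs (SEG for protein, DUST for nucleotide); the sketch below only masks runs of a single repeated residue, and `mask_runs` with its `min_run` threshold is an assumption for illustration.

```python
import re


def mask_runs(seq, min_run=5, mask_char="X"):
    """Replace every run of min_run or more identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_runs("MKPPPPPLV"))
```

Masked positions cannot seed or extend a match, which keeps biased regions from producing many high-scoring but meaningless hits.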
47Extended BLAST Family
- Gapped Blast (default)
- PSI-Blast (Position-specific iterated blast)
- self generated scoring matrix
- PHI BLAST (motif plus BLAST)
- BLAST2 client (align two seqs)
- megablast (genomic sequence)
- rpsblast (search for domains)