Title: Sequence Alignments
1Sequence Alignments
2Outline
- How sequences are aligned
- How alignments are scored
- Understanding BLAST
3Reading assignment
- Required
- Chapters 3 & 4, Xiong
- Optional
- The BLAST help tab
- http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
- There is a link from the exercise 3 homepage
4Sequence alignment
- Determine if two sequences are related
- Sequence assembly
- Sequence annotation
- Identify shared protein domains or motifs
- Analysis of genomes
- Phylogeny and evolution
5Definitions
- Homologous: share a common ancestor
- Homology cannot be measured
- Measure similarity, infer homology
- Orthologs: separated by speciation
- Paralogs: separated by duplication
6Sequence alignment
- Determine whether two (or more) sequences are of
sufficient similarity such that an inference of
homology is justified
7How to align 2 sequences
- Choose 2 sequences
- Select an algorithm
- Scoring method that reflects degree of similarity
- Allow for gaps (insertions and deletions)
- Estimate probability that alignment occurred by
chance
8Dotplots: visual alignment
10Dotplots: self alignment
Alignments show up as diagonal lines on the plot. Gaps are evidenced by vertical or horizontal shifts.
11Can find repeats
- Align sequence to itself
- Repeats are shorter diagonals off the main
diagonal
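A minimal sketch of how a self-alignment dotplot can be computed. The function names, the toy sequence, and the `window` parameter are illustrative assumptions, not part of the lecture.

```python
def dotplot(seq_a, seq_b, window=1):
    """Return the set of (i, j) where seq_a[i:i+window] equals seq_b[j:j+window]."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        word = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            if seq_b[j:j + window] == word:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a runs across the columns, seq_b down the rows."""
    lines = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


seq = "ABCDABCD"  # toy sequence containing one exact repeat of ABCD
dots = dotplot(seq, seq, window=2)
print(render(seq, seq, dots))
```

In the printed plot the main diagonal is the sequence matching itself, and the shorter off-diagonal runs mark the repeat.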
12Low complexity regions
- Mucin: 40 exact tandem repeats of 20 amino acids
13Compare sequences
- HMG1 and SRY are somewhat similar to each other
- But dotplots do not give a quantitative measure
of the degree of similarity
14blast2seq output
15Percent identity and similarity
- % identity: percentage of aligned residues that are identical
- % similarity: percentage of aligned residues that have similar chemical/physical properties
- Amino acid alignments only
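The two percentages can be sketched as follows. The residue groups are a hypothetical, simplified grouping chosen for illustration (BLAST's "Positives" are instead the pairs with a positive substitution-matrix score), and the denominator is assumed to be the full alignment length.

```python
# Hypothetical grouping of amino acids by chemical/physical properties (illustration only).
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def percent_identity_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) for two equal-length aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if "-" in (a, b):
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned_a)  # assumption: divide by alignment length, gaps included
    return 100.0 * identical / length, 100.0 * similar / length


print(percent_identity_similarity("KLSD", "KISE"))
```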
16Scoring schemes
- Biologically significant way of scoring matches, mismatches, and gaps
- Nucleotide alignments
- Identity only
- Positive score for matches, negative scores for mismatches
17Amino acid substitution matrices
- Method to score matches and mismatches
- Based on observed frequencies of amino acid distributions and substitutions
- Must model conservative nature of substitutions
- Implicitly represent evolutionary patterns
- Scores are grounded in information theory
18Scoring amino acid substitutions
- Amino acids are NOT distributed evenly
- Amino acids share similarity based on chemical
and physical properties
20PAM scoring matrices
- Margaret O. Dayhoff developed the first protein substitution matrices, based on observed frequencies of amino acids as well as observed substitutions in aligned proteins (1978)
- PAM = Point Accepted Mutations
- Observed variations in groups of sequences ≥ 85% similar
- Estimated substitutions in a group of evolving proteins
- Represent substitutions that do not significantly alter protein structure/function, so they are "accepted" by natural selection
21PAM Scoring Matrices
- Based on mutational model of evolution
- Assume changes occur independently
- Changes are a prediction of the first changes that occur as proteins diverge from a common ancestor
- Matrices for more distantly related protein sequences are extrapolated from short-term changes
- All amino acid positions in related sequences were scored
22PAM Scoring matrices
$S$ = score for amino acid pairing in the alignment
$q_{ij}$ = observed pairing frequency of amino acids i and j
$p_i$ and $p_j$ = expected frequencies of amino acids i and j
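The transcript does not preserve the slide's formula. Assuming the standard log-odds form, $S_{ij} = \log \frac{q_{ij}}{p_i p_j}$, the score can be sketched as below. The log base and the `scale` factor are assumptions (published matrices scale and round their values), and the frequencies in the usage lines are made up.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=1.0, base=2.0):
    """Log-odds score: log of observed pair frequency over the frequency expected by chance."""
    return scale * math.log(q_ij / (p_i * p_j), base)


# Made-up frequencies: a pair seen four times more often than chance scores +2 bits.
print(log_odds_score(0.04, 0.1, 0.1))
# A pair seen less often than chance gets a negative score.
print(log_odds_score(0.005, 0.1, 0.1))
```

A positive score means the pairing is observed more often than expected by chance, and a negative score means less often.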
23PAM 250 Matrix
24PAM 250 Matrix
25BLOSUM matrices
- BLOSUM matrices are based on local alignments
- BLOSUM = BLOcks SUbstitution Matrix
- Sequences within segments clustered into blocks based on identity
- Contributions of the sequences within a cluster were averaged.
- BLOSUM62 is a matrix calculated from comparisons of sequences < 62% identical.
- BLOSUM40 ≈ PAM250
26BLOSUM matrices
- All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
- The BLOCKS database contains thousands of groups of multiple sequence alignments.
- BLOSUM62 is the default matrix in BLAST 2.0.
27BLOSUM62 Matrix
28BLOSUM62 Matrix
29BLOSUM90
More positive and more negative scores than BLOSUM62
30Choosing a matrix
31Gaps
- Insertions can lead to gaps of varying lengths
- Use 2 gap penalties
- higher penalty for opening a gap
- lower penalty for extension of a gap
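A minimal sketch of a two-part (affine) gap penalty. The default values 5 and 2 are the ones used in the raw-score example later in these slides. The convention that every gap character pays the extension penalty on top of a single opening penalty is an assumption taken from that same example.

```python
def affine_gap_cost(length, gap_open=5, gap_extend=2):
    """Cost of one gap of `length` characters: one opening penalty plus an extension per character."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_cost(3)        # a single gap of 3 characters
three_short_gaps = 3 * affine_gap_cost(1)  # three separate gaps of 1 character
print(one_long_gap, three_short_gaps)
```

Because opening is the expensive step, one long gap costs less than several short ones, which fits the idea that a single insertion or deletion event can span several residues.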
32BLAST
Build word list from query sequence
Find hits in database sequence
Extend the hits to form HSPs
Calculate statistical significance of matches
- Build a list of words from query sequence
- Word length: 3 for proteins, 11 for DNA
- Evaluate each word for match using scoring matrix and discard all below threshold
- Generally 50 matches per word
- T value is the threshold; it determines sensitivity and speed of search
33Query sequence
Query sequence: PSATPVLICWAAG
Word list: PSA ATP VLI CWA
Threshold score (T): 11
Matches to PSA   Score
PSA              15
PST              9
PDA              11
WSA              4
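A sketch of the word-list step, assuming overlapping words (the slide lists only some of them) and reusing the slide's example scores as given rather than recomputing them from a matrix. Keeping words that score at least T follows the "discard all below threshold" wording. The function names are illustrative.

```python
def word_list(query, word_size=3):
    """All overlapping words of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def keep_above_threshold(scores, threshold):
    """Discard candidate words whose score falls below the threshold T."""
    return {word: score for word, score in scores.items() if score >= threshold}


words = word_list("PSATPVLICWAAG")
print(words)

# Scores for candidate matches to the word PSA, copied from the slide.
slide_scores = {"PSA": 15, "PST": 9, "PDA": 11, "WSA": 4}
print(keep_above_threshold(slide_scores, threshold=11))
```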
34BLAST
- Find match for each word in database
- Database is indexed so all possible words in all sequences are known
- This search is very fast (500K words/sec)
- Matches > T are used as seeds for alignments
35BLAST
- Extend alignment from each word in both directions so long as score increases
- These alignments are the high-scoring segment pairs (HSPs)
- Keep HSPs if score is above a given threshold
36Extending the hit
Score of new alignment = score of previous alignment + score of new aligned pair
(1) PSA/PSA (15) + C/C (9) = PSAC/PSAC (24)
(2) PSAC/PSAC (24) + Y/Y (7) = PSACY/PSACY (31)
(3) Repeat adding aligned pairs until the score goes down or the end of the sequence is reached.
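A sketch of the extension rule, extending to the right only. The identity scores are BLOSUM62 diagonal values, which reproduce the totals on the slide. The flat mismatch score, the placeholder for unlisted residues, and the function names are assumptions. Real BLAST tolerates a limited drop in score before stopping, whereas this follows the slide's simpler rule.

```python
IDENTITY_SCORES = {"P": 7, "S": 4, "A": 4, "C": 9, "Y": 7}  # BLOSUM62 diagonal values
MISMATCH_SCORE = -2  # assumption: one flat penalty instead of a full matrix


def pair_score(a, b):
    if a == b:
        return IDENTITY_SCORES.get(a, 4)  # 4 is a placeholder for residues not listed
    return MISMATCH_SCORE


def extend_right(query, subject, q_start, s_start, seed_len):
    """Extend an ungapped seed to the right while the score keeps increasing."""
    score = sum(pair_score(query[q_start + k], subject[s_start + k]) for k in range(seed_len))
    q, s = q_start + seed_len, s_start + seed_len
    while q < len(query) and s < len(subject):
        new_score = score + pair_score(query[q], subject[s])
        if new_score <= score:  # score went down (or stalled): stop extending
            break
        score = new_score
        q += 1
        s += 1
    return score, q - q_start  # final score and length of the extended segment


print(extend_right("PSACYW", "PSACYK", 0, 0, 3))
```

Starting from the seed PSA (15), the segment grows to PSAC (24) and PSACY (31), then stops at the W/K mismatch.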
37BLAST
Combine HSPs into a gapped alignment
- Highest scoring HSPs extended in both directions using dynamic programming
- Continues as long as score > threshold
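For orientation, a textbook Smith-Waterman local alignment score is sketched below. It is not BLAST's own gapped-extension routine, which restricts the search around the HSP. The scoring values and the single linear gap penalty are simplifying assumptions.

```python
def local_alignment_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best local alignment score by dynamic programming (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Take the best of: start over (0), align the pair, or open a gap in either sequence.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))
print(local_alignment_score("AAAATTTT", "AAAACTTTT", gap=-1))
```

In the second call the best alignment skips the extra C with one gap, so eight matches minus one gap penalty gives 7.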
38Positives = 200/310 (64%)
Identities = 135/310 (43%)
Expect = 2e-73
39BLAST statistics
- BLASTN match with E-value 0.004
atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
atgctatggccacgggtcttgtggatccca---tgatgtgtgcacctgcgata
How is this calculated and what does it mean?
40Significance
- Significance of hit is measured by E-value, or expect value
- Each alignment has a bit score (S)
- E-value is the number of alignments with bit score ≥ S that you expect to find by chance
- $E = mn \cdot 2^{-S}$
- m = effective length of query
- n = effective length (total number of bases) of database
41BLAST statistics
- bit score
- larger S, smaller E-value
- length of query
- longer queries usually generate larger E-values
- size of database
- larger database results in larger E-values
42Calculation of raw score
- Raw score calculated from number of identities, mismatches, gaps and "-" characters in the alignment
- $R = aI + bX - cO - dG$
- I = number of identities; a = reward for each identity
- X = number of mismatches; b = reward for each mismatch (a negative value)
- O = number of gaps; c = gap-opening penalty
- G = number of "-" characters; d = gap-extension penalty
43Calculation of raw score
atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
atgctatggccacgggtcttgtggatccca---tgatgtgtgcacctgcgata
There are 46 identities, 4 mismatches, 1 gap, and 3 "-" characters
$R = (1)(46) + (-3)(4) - (5)(1) - (2)(3) = 23$
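The same arithmetic as a small function. The counts are the ones stated on the slide, and the default parameters (a = 1, b = -3, c = 5, d = 2) are read off the worked example. The function name is illustrative.

```python
def raw_score(identities, mismatches, gaps, gap_chars, a=1, b=-3, c=5, d=2):
    """R = a*I + b*X - c*O - d*G, with b given as a negative reward."""
    return a * identities + b * mismatches - c * gaps - d * gap_chars


# Counts as stated on the slide: 46 identities, 4 mismatches, 1 gap, 3 "-" characters.
print(raw_score(identities=46, mismatches=4, gaps=1, gap_chars=3))
```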
44Calculation of bit score
- Bit score is obtained from raw score by
$S = \frac{\lambda R - \ln K}{\ln 2}$
- λ and K are normalizing parameters
- dependent on the scoring matrix
45- λ = 1.37
- K = 0.711
- H = 1.31
- Effective query length = 34
- Effective database length = 7,806,816,630
46Calculation of bit score
- Bit score is obtained from raw score by
$S = \frac{\lambda R - \ln K}{\ln 2}$
$S = \frac{(1.37)(23) - \ln(0.711)}{\ln 2} \approx 46$
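A short numerical check of the conversion, using the λ and K values quoted for this example. The function and parameter names are illustrative.

```python
import math


def bit_score(raw, lam=1.37, k=0.711):
    """S = (lambda * R - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


s = bit_score(23)
print(s, round(s))
```

The unrounded value is a little under 46, which rounds to the bit score used in the E-value calculation.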
47Calculation of E-value
- $E = mn \cdot 2^{-S}$
- m = effective length of query
- n = effective length of the database
In this example, S = 46, m = 34 and n = 7,806,816,630
E = 0.004
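The E-value for this example can be reproduced directly from the formula. The function name is illustrative, and the second calculation is an added demonstration that E grows in proportion to the database length n.

```python
def expect_value(bit_score, m, n):
    """E = m * n * 2**(-S): alignments of at least this score expected by chance."""
    return m * n * 2.0 ** (-bit_score)


e = expect_value(46, m=34, n=7_806_816_630)
print(e, round(e, 3))

# E is proportional to n: a database 100 times larger gives an E-value 100 times larger.
ratio = expect_value(46, 34, 900_000) / expect_value(46, 34, 9_000)
print(ratio)
```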
48Significance of alignment
P: probability that the observed match could have happened by chance
E (Expect value): number of matches as good as the observed one that would be expected to appear by chance in a database of the size probed
49Significance of alignments
- P values are between 0 and 1
- E = P × size of the database
- E values range from 0 to the size of the database
50E values
- Strongly dependent on the size of the database
- E-value from search of 9000 protein db is 100x
smaller than E-value for exact same alignment in
a search of a 900,000 protein db
51Caveats
- Repetitive sequence
- Regions of low complexity
- Repeated motifs
- Unusually high number of low-abundance amino acids (e.g., cysteines)
52Filtering LCR
>ASH1_HUMAN Achaete-scute homolog 1 (HASH1)
MESSAKMESGGAGQQPQPQPQQPFLPPAACFFATAAAAAAA
AAAAAAQSAQQQQQQQQQQQQAPQLRPAADGQPSGGGHKSAPKQVKRQRS
SSPELMRCKRRLNFSGFGYSLPQQQPAAVARRNERERNRVKLVNLGFATL
REHVPNGAANKKMSKVETLRSAVEYIRALQQLLDEHDAVSAAFQAGVLSP
TISPNYSNDLNSMAGSPVSSYSSDEGSYDPLSPEEQELLDFTNWF
After low-complexity filtering, masked residues are shown as X:
>ASH1_HUMAN Achaete-scute homolog 1 (HASH1)
MESSAKMESGGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXQPSGGGHKSAPKQVKRQRSSSPELMRCK
RRLNFSGFGYSLPQQQPAAXXXXXXXXXXXXXXVNLGFATLREHVPNGAA
NKKMSKVETLRSAVEYIRALQQLLDEHDAVSAAFQAGVLSPTISPNYSND
LNSMAGSPVSSYSSDEGSYDPLSPEEQELLDFTNWF
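To illustrate the idea of masking, here is a simplified sliding-window filter based on Shannon entropy. It is not the filter BLAST actually applies (SEG for proteins, DUST for nucleotides). The window size, the entropy cutoff, and the toy sequence are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue that falls in a low-entropy window with X."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


toy = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"  # a poly-Q run between diverse flanks
result = mask_low_complexity(toy)
print(result)
```

Windows that overlap the poly-Q run are masked as well, so a few flanking residues next to the run also become X.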
53Alignment without filtering. Note the low E-value (E = 1e-13) alignment in region 120-133.
54Alignment with filtering turned on. Note the higher E-value (4e-7) and the Xs in region 120-133 as a consequence of the filtering.
55Flavors of BLAST
- BLASTP
- Protein query, protein DB
- BLASTN
- Megablast, discontinuous megablast, blastn
- Nucleotide query, nucleotide DB
- BLASTX
- Nucleotide query translated in 6 frames, protein DB
- TBLASTN
- Protein query, nucleotide DB translated in 6 frames
- TBLASTX
- Nucleotide query, nucleotide DB, both translated in 6 frames
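A sketch of reading-frame translation as used conceptually by the translated searches: three frames on the given strand plus three on its reverse complement. It assumes the standard genetic code and an uppercase-convertible DNA string. The function names and the frame labels (+1 to -3) are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
print(frames)
```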