Sequence Alignments - PowerPoint PPT Presentation

About This Presentation

Title:

Sequence Alignments

Slides: 56

Provided by: maureen119

Transcript and Presenter's Notes

Title: Sequence Alignments

1
Sequence Alignments

June 2, 2009

2
Outline

How sequences are aligned
How alignments are scored
Understanding BLAST

3
Reading assignment

Required
Chapters 3 and 4, Xiong
Optional
The BLAST help tab
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
There is a link from the exercise 3 homepage

4
Sequence alignment

Determine if two sequences are related
Sequence assembly
Sequence annotation
Identify shared protein domains or motifs
Analysis of genomes
Phylogeny and evolution

5
Definitions

Homologous: share a common ancestor
Homology cannot be measured
Measure similarity, infer homology
Orthologs: separated by speciation
Paralogs: separated by duplication

6
Sequence alignment

Determine whether two (or more) sequences are of
sufficient similarity such that an inference of
homology is justified

7
How to align 2 sequences

Choose 2 sequences
Select an algorithm
Scoring method that reflects degree of similarity
Allow for gaps (insertions and deletions)
Estimate probability that alignment occurred by
chance

8
Dotplots: visual alignment
10
Dotplots: self alignment
Alignments show up as diagonal lines on the plot.
Gaps are evidenced by vertical or horizontal shifts.
11
Can find repeats

Align sequence to itself
Repeats are shorter diagonals off the main
diagonal
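
The self-alignment idea can be sketched in a few lines of Python. This is only a minimal illustration: it marks every identical residue pair without the window or threshold filtering that real dotplot tools usually apply, and the toy sequence and function names are made up for the example.

```python
def dotplot(seq_a, seq_b):
    """Return a grid where grid[i][j] is True when seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for a mismatch."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in grid
    )


# Hypothetical toy sequence containing one tandem repeat (ABCD twice).
seq = "ABCDABCD"
grid = dotplot(seq, seq)
print(render(grid))
```

In the printed grid the main diagonal is the sequence matching itself, and the repeat appears as shorter diagonals offset by four positions on either side.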

12
Low complexity regions

Mucin: 40 exact tandem repeats of 20 amino acids

13
Compare sequences

HMG1 and SRY are somewhat similar to each other
But dotplots do not give a quantitative measure
of the degree of similarity

14
blast2seq output
15
Identity and similarity

Identity: percentage of aligned residues that
are identical
Similarity: percentage of aligned residues that
have similar chemical/physical properties
Amino acid alignments only
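
As a rough sketch, both percentages can be computed from a pair of already-aligned strings. Two things here are assumptions rather than part of the slides: the denominator is the total number of alignment columns (gap columns included), and the residue groups in `GROUPS` are a simplified illustrative grouping, not the positive-score rule a real substitution matrix would give.

```python
# Illustrative groups of chemically similar residues (an assumption for this
# sketch; BLAST instead counts pairs with a positive matrix score).
GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def is_identical(a, b):
    return a == b


def is_similar(a, b):
    return a == b or any(a in group and b in group for group in GROUPS)


def percent(aligned_a, aligned_b, predicate):
    """Percentage of alignment columns whose residue pair satisfies predicate."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal length")
    hits = sum(
        1
        for a, b in zip(aligned_a, aligned_b)
        if a != "-" and b != "-" and predicate(a, b)
    )
    return 100.0 * hits / len(aligned_a)


print(percent("ACDE", "ACEE", is_identical))
print(percent("ACDE", "ACEE", is_similar))
```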

16
Scoring schemes

Biologically significant way of scoring matches,
mismatches, and gaps
Nucleotide alignments
Identity only
Positive score for matches; negative scores for
mismatches

17
Amino acid substitution matrices

Method to score matches and mismatches
Based on observed frequencies of amino acid
distributions and substitutions
Must model conservative nature of substitutions
Implicitly represent evolutionary patterns
Scores are grounded in information theory

18
Scoring amino acid substitutions

Amino acids are NOT distributed evenly
Amino acids share similarity based on chemical
and physical properties

20
PAM scoring matrices

Margaret O. Dayhoff developed the first protein
substitution matrices, based on observed
frequencies of amino acids as well as observed
substitutions in aligned proteins (1978)
PAM: Point Accepted Mutation
Observed variations in groups of sequences at least 85%
similar
Estimated substitutions in a group of evolving
proteins
Represent substitutions that do not significantly
alter protein structure/function, and so are accepted
by natural selection

21
PAM Scoring Matrices

Based on mutational model of evolution
Assume changes occur independently
Changes are a prediction of first changes that
occur as proteins diverge from common ancestor
Matrices for more distantly related protein
sequences extrapolated from short-term changes
All amino acid positions in related sequences
were scored

22
PAM Scoring matrices
$S$: score for an amino acid pairing in the alignment
$q_{ij}$ is the observed pairing frequency of amino
acids i and j.
$p_i$ and $p_j$ are the expected frequencies for amino
acids i and j.
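
The equation itself did not survive in this transcript. Substitution-matrix scores of this kind are conventionally log-odds ratios, $S_{ij} = \log\big(q_{ij} / (p_i p_j)\big)$, so the sketch below assumes that form. The frequencies are invented for illustration, and the base-2 logarithm is a choice made here; published matrices additionally scale and round the values.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log-odds score: observed pair frequency over the frequency expected by chance."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies, not taken from any real matrix.
print(log_odds(0.02, 0.1, 0.1))   # pair seen more often than chance: positive
print(log_odds(0.005, 0.1, 0.1))  # pair seen less often than chance: negative
```

A positive score means the pairing is observed more often than independent chance would predict, which is how conservative substitutions end up rewarded.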
23
PAM 250 Matrix
24
PAM 250 Matrix
25
BLOSUM matrices

BLOSUM matrices are based on local alignments
BLOSUM: BLOcks SUbstitution Matrix
Sequences within segments clustered into blocks
based on identity
Contributions of the sequences within a cluster
were averaged.
BLOSUM62 is a matrix calculated from comparisons
of sequences < 62% identical.
BLOSUM40 ≈ PAM250

26
BLOSUM matrices

All BLOSUM matrices are based on observed
alignments; they are not extrapolated from
comparisons of closely related proteins.
The BLOCKS database contains thousands of groups
of multiple sequence alignments.
BLOSUM62 is the default matrix in BLAST 2.0.

27
BLOSUM62 Matrix
28
BLOSUM62 Matrix
29
BLOSUM90
Scores are more strongly positive and more strongly negative than in BLOSUM62
30
Choosing a matrix
31
Gaps

Insertions can lead to gaps of varying lengths
Use 2 gap penalties
higher penalty for opening a gap
lower penalty for extension of a gap
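
A two-penalty (affine) gap cost can be sketched as below. The default values 5 and 2 are borrowed from the BLASTN raw-score example later in this deck, and charging the extension penalty for every gap character (including the first) follows the convention of that example; other tools count the first position differently, so treat the convention as an assumption.

```python
def affine_gap_penalty(length, gap_open=5, gap_extend=2):
    """Cost of one gap: a single opening charge plus a charge per gap character."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

One gap of three characters costs less than three separate single-character gaps, which is the intended effect: a single insertion or deletion event is penalized less than several independent ones.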

32
BLAST
Calculate statistical significance of matches
Build word list from query sequence
Find hits in database sequence
Extend the hits to form HSPs

Build a list of words from the query sequence
(word length 3 for proteins, 11 for DNA)
Evaluate each word for matches using the scoring matrix
and discard all below the threshold
Generally about 50 matches per word
The T value is the threshold; it determines the sensitivity and
speed of the search

33
Query sequence: PSATPVLICWAAG
Word list: PSA ATP VLI CWA
Threshold score (T): 11
Matches to PSA, with scores:
PSA 15
PST 9
PDA 11
WSA 4
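
The word-list step can be sketched as follows. The function names are hypothetical, the candidate scores are copied from the slide rather than recomputed from a matrix, and keeping words that score exactly T is an assumption of this sketch.

```python
def word_list(query, k=3):
    """All overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def above_threshold(scored_words, threshold):
    """Keep candidate words scoring at or above the threshold (ties kept: an assumption)."""
    return [word for word, score in scored_words.items() if score >= threshold]


words = word_list("PSATPVLICWAAG")
print(words)

# Scores for candidate matches to the word PSA, as listed on the slide.
psa_candidates = {"PSA": 15, "PST": 9, "PDA": 11, "WSA": 4}
print(above_threshold(psa_candidates, threshold=11))
```

The slide shows only four of the query's words; sliding the window one residue at a time yields eleven.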
34
BLAST
Calculate statistical significance of matches
Build word list from query sequence
Extend the hits to form HSPs
Find hits in database sequence

Find matches for each word in the database
The database is indexed, so all possible words in all
sequences are known
This search is very fast (500K words/sec)
Matches > T are used as seeds for alignments
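
Why an indexed database makes the lookup fast can be illustrated with a dictionary from words to positions. This is a toy sketch, not how BLAST stores its index; the two database sequences and all names are invented.

```python
from collections import defaultdict


def build_index(database, k=3):
    """Map every word of length k to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_hits(words, index):
    """Look up each query word; each lookup is one dictionary access."""
    return {word: index[word] for word in words if word in index}


# Hypothetical two-sequence database.
database = {"db1": "MKPSAC", "db2": "TTPDAQ"}
index = build_index(database)
print(find_hits(["PSA", "PDA", "WSA"], index))
```

The index is built once, so each search only pays for the lookups, not for rescanning every database sequence.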

35
BLAST
Calculate statistical significance of matches
Build word list from query sequence
Find hits in database sequence
Extend the hits to form HSPs

Extend alignment from each word in both
directions so long as score increases
These alignments are the HSPs
Keep HSPs if score is above a given threshold

36
Extending the hit
Score of new alignment = score of previous alignment + score of new aligned pair

(1)
Previous alignment (A): PSA / PSA, score 15
New aligned pair: C / C, score 9
New alignment (B): PSAC / PSAC, score 15 + 9 = 24

(2)
Previous alignment (B): PSAC / PSAC, score 24
New aligned pair: Y / Y, score 7
New alignment (C): PSACY / PSACY, score 24 + 7 = 31

(3)
Repeat adding aligned pairs until score goes down
or reach end of sequence.
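
The extension rule can be sketched for the rightward direction only. The match scores are the per-residue values implied by the example (they reproduce its totals of 15, 24 and 31); the flat mismatch score is a placeholder assumption, not a matrix value, and the function handles only the residues listed in `MATCH`.

```python
# Per-residue match scores consistent with the worked example.
MATCH = {"P": 7, "S": 4, "A": 4, "C": 9, "Y": 7}
MISMATCH = -1  # placeholder assumption for any non-identical pair


def pair_score(a, b):
    return MATCH[a] if a == b else MISMATCH


def extend_right(query, subject, q_start, s_start, word_len=3):
    """Score the seed word, then add aligned pairs while each one raises the score."""
    score = sum(
        pair_score(query[q_start + i], subject[s_start + i])
        for i in range(word_len)
    )
    q, s = q_start + word_len, s_start + word_len
    while q < len(query) and s < len(subject):
        step = pair_score(query[q], subject[s])
        if step <= 0:  # the score would stop increasing, so stop extending
            break
        score += step
        q += 1
        s += 1
    return score, q - q_start  # total score and aligned length


print(extend_right("PSACYP", "PSACYS", 0, 0))
```

A full implementation would also extend to the left of the seed and apply the same stopping rule there.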
37
BLAST
Combine HSPs into a gapped alignment
Build word list from query sequence
Find hits in database sequence
Extend the hits to form HSPs

Highest scoring HSPs extended in both directions
using dynamic programming
Continues as long as score > threshold

38
Positives = 200/310 (64%)
Identities = 135/310 (43%)
Expect = 2e-73
39
BLAST statistics

BLASTN match with E-value 0.004

atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
atgctatggccacgggtcttgtggatccca---tgatgtgtgcacctgcgata
How is this calculated and what does it mean?
40
Significance

Significance of a hit is measured by the E-value, or
expect value
Each alignment has a bit score (S)
E-value is the number of alignments with bit score ≥
S that you expect to find by chance
$E = m n \, 2^{-S}$
m = effective length of the query
n = effective length (total number of bases) of the
database

41
BLAST statistics

bit score
larger S, smaller E-value
length of query
longer queries usually generate larger E-values
size of database
larger database results in larger E-values

42
Calculation of raw score

Raw score is calculated from the number of identities,
mismatches, gaps and "-" characters in the
alignment
$R = aI + bX - cO - dG$
I = number of identities; a = reward for each
identity
X = number of mismatches; b = score for each
mismatch (a negative value)
O = number of gaps; c = gap-opening penalty
G = number of "-" characters; d = gap-extension penalty

43
Calculation of raw score
atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
atgctatggccacgggtcttgtggatccca---tgatgtgtgcacctgcgata
There are 46 identities, 4 mismatches, 1 gap, and
3 "-" characters
$R = 46 + (-3)(4) - (5)(1) - (2)(3) = 23$
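
The raw-score arithmetic can be checked with a short function. The default parameter values are read off the worked numbers above (a match reward of 1 is inferred from the bare "46" term, which is an assumption), and the counts are passed in as stated on the slide rather than recomputed from the sequences.

```python
def raw_score(identities, mismatches, gaps, gap_chars,
              match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """R = a*I + b*X - c*O - d*G, with b given as a negative score."""
    return (match * identities
            + mismatch * mismatches
            - gap_open * gaps
            - gap_extend * gap_chars)


print(raw_score(identities=46, mismatches=4, gaps=1, gap_chars=3))
```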
44
Calculation of bit score

Bit score is obtained from raw score by

$S = \frac{\lambda R - \ln K}{\ln 2}$

$\lambda$ and K are normalizing parameters
dependent on the scoring matrix

$\lambda$ = 1.37
K = 0.711
H = 1.31
Effective query length = 34
Effective database length = 7,806,816,630

46
Calculation of bit score

Bit score is obtained from raw score by

$S = \frac{\lambda R - \ln K}{\ln 2}$

$S = \frac{(1.37)(23) - \ln(0.711)}{\ln 2} \approx 46$
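
The conversion from raw score to bit score can be reproduced numerically. This sketch assumes the formula $S = (\lambda R - \ln K)/\ln 2$ with the parameter values listed for this search; the function name is made up for the example.

```python
import math


def bit_score(raw, lam, k):
    """Normalize a raw score to bits: (lambda * R - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


s = bit_score(raw=23, lam=1.37, k=0.711)
print(round(s, 2), round(s))
```

The unrounded value is just under 46, which is the rounded figure carried into the E-value calculation.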
47
Calculation of E-value

$E = m n \, 2^{-S}$
m = effective length of the query
n = effective length of the database

In this example, S = 46, m = 34 and n =
7,806,816,630
E = 0.004
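
The E-value for this example can be recomputed directly, assuming the formula $E = m n \, 2^{-S}$ and the values given for S, m and n. The function name is hypothetical.

```python
def expect_value(bit_score, m, n):
    """Expected number of chance alignments scoring at least bit_score."""
    return m * n * 2.0 ** (-bit_score)


n = 7_806_816_630
e = expect_value(bit_score=46, m=34, n=n)
print(round(e, 3))

# E grows linearly with the search space: doubling n doubles E.
print(expect_value(46, 34, 2 * n) / e)
```

Because S sits in the exponent, each additional bit of score halves E, while a longer query or a larger database raises it proportionally.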
48
Significance of alignment
P: probability that the observed match could have
happened by chance
E (Expect value): number of matches as good as the
observed one that would be expected to appear by
chance in a database of the size probed
49
Significance of alignments

P values range between 0 and 1
E = P × size of the database
E values range from 0 to the size of the database

50
E values

Strongly dependent on the size of the database
The E-value from a search of a 9,000-protein database is 100x
smaller than the E-value for the exact same alignment in
a search of a 900,000-protein database

51
Caveats

Repetitive sequence
Regions of low complexity
Repeated motifs
Unusually high number of low-abundance amino acids
(e.g., cysteines)

52
Filtering LCR
>ASH1_HUMAN Achaete-scute homolog 1 (HASH1)
MESSAKMESGGAGQQPQPQPQQPFLPPAACFFATAAAAAAA
AAAAAAQSAQQQQQQQQQQQQAPQLRPAADGQPSGGGHKSAPKQVKRQRS
SSPELMRCKRRLNFSGFGYSLPQQQPAAVARRNERERNRVKLVNLGFATL
REHVPNGAANKKMSKVETLRSAVEYIRALQQLLDEHDAVSAAFQAGVLSP
TISPNYSNDLNSMAGSPVSSYSSDEGSYDPLSPEEQELLDFTNWF

>ASH1_HUMAN Achaete-scute homolog 1 (HASH1)
MESSAKMESGGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXQPSGGGHKSAPKQVKRQRSSSPELMRCK
RRLNFSGFGYSLPQQQPAAXXXXXXXXXXXXXXVNLGFATLREHVPNGAA
NKKMSKVETLRSAVEYIRALQQLLDEHDAVSAAFQAGVLSPTISPNYSND
LNSMAGSPVSSYSSDEGSYDPLSPEEQELLDFTNWF
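
The effect of masking can be imitated with a simple entropy filter. This is a simplified stand-in, not the filter BLAST actually runs: it replaces every window whose Shannon entropy falls at or below a cutoff with X. The window size and cutoff are arbitrary values chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, max_entropy=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= max_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


print(mask_low_complexity("AAAAAAAAAAAA"))
print(mask_low_complexity("ACDEFGHIKLMN"))
```

A run of a single residue has zero entropy and is fully masked, while a window of twelve different residues is left untouched.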

53
Alignment without filtering. Note the low E-value (E =
1e-13) alignment in region 120-133.
54
Alignment with filtering turned on. Note the higher
E-value (4e-7) and the Xs in region 120-133 as a
consequence of the filtering.
55
Flavors of BLAST