Sequence Alignments - PowerPoint PPT Presentation

Sequence Alignments

Slides: 56
Provided by: maureen119


1
Sequence Alignments
  • June 2, 2009

2
Outline
  • How sequences are aligned
  • How alignments are scored
  • Understanding BLAST

3
Reading assignment
  • Required
  • Chapters 3 and 4, Xiong
  • Optional
  • The BLAST help tab
  • http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs
  • There is a link from the exercise 3 homepage

4
Sequence alignment
  • Determine if two sequences are related
  • Sequence assembly
  • Sequence annotation
  • Identify shared protein domains or motifs
  • Analysis of genomes
  • Phylogeny and evolution

5
Definitions
  • Homologous: sharing a common ancestor
  • Homology cannot be measured directly
  • Measure similarity, then infer homology
  • Orthologs: homologs separated by speciation
  • Paralogs: homologs separated by duplication

6
Sequence alignment
  • Determine whether two (or more) sequences are of
    sufficient similarity such that an inference of
    homology is justified

7
How to align 2 sequences
  • Choose 2 sequences
  • Select an algorithm
  • Scoring method that reflects degree of similarity
  • Allow for gaps (insertions and deletions)
  • Estimate probability that alignment occurred by
    chance

8
Dotplots: visual alignment
9
(No Transcript)
10
Dotplots: self alignment
Alignments show up as diagonal lines on the plot.
Gaps are evidenced by vertical or horizontal shifts.
11
Can find repeats
  • Align sequence to itself
  • Repeats are shorter diagonals off the main
    diagonal
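
The dotplot idea from the last few slides can be sketched in a few lines. This is a minimal illustration rather than the tool used to make the slides: the window size, the example sequence, and the function names `dotplot` and `render` are assumptions chosen for the example.

```python
def dotplot(seq_a, seq_b, window=3):
    """Return the set of (i, j) cells where a window of seq_a equals a window of seq_b."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(dots, len_a, len_b, window=3):
    """Draw the plot as text: '*' marks a matching window, '.' marks no match."""
    rows = []
    for i in range(len_a - window + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len_b - window + 1)))
    return "\n".join(rows)


# Hypothetical sequence containing a tandem repeat of "ABCD".
seq = "ABCDABCDXYZ"
dots = dotplot(seq, seq)
print(render(dots, len(seq), len(seq)))
```

Aligning the sequence to itself fills the main diagonal, and the repeat shows up as short diagonals offset from it by the repeat length (here, 4).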

12
Low complexity regions
  • Mucin: 40 exact tandem repeats of 20 amino acids

13
Compare sequences
  • HMG1 and SRY are somewhat similar to each other
  • But dotplots do not give a quantitative measure
    of the degree of similarity

14
blast2seq output
15
identity and similarity
  • Identity: percentage of aligned residues that
    are identical
  • Similarity: percentage of aligned residues that
    have similar chemical/physical properties
    (amino acid alignments only)
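
To make the two percentages concrete, here is a small sketch. The similarity groups below are a hypothetical simplification (alignment programs usually count a pair as "positive" when its substitution-matrix score is positive), and dividing by the full alignment length, gaps included, is also an assumption.

```python
# Hypothetical grouping of amino acids by chemical/physical properties.
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def is_similar(a, b):
    """Identical residues, or residues in the same property group, count as similar."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(aln_a, aln_b):
    """Percent identity and percent similarity over all alignment columns."""
    assert len(aln_a) == len(aln_b)
    length = len(aln_a)
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    identical = sum(a == b for a, b in pairs)
    similar = sum(is_similar(a, b) for a, b in pairs)
    return 100 * identical / length, 100 * similar / length


# Made-up aligned pair: 2 identities and 4 conservative substitutions in 7 columns.
identity, similarity = identity_and_similarity("MKVLE-T", "MRILDAS")
print(round(identity, 1), round(similarity, 1))
```

Similarity is always at least as large as identity, because identical residues are counted as similar too.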

16
Scoring schemes
  • Biologically significant way of scoring matches,
    mismatches, and gaps
  • Nucleotide alignments
  • Identity only
  • Positive score for matches, negative scores for
    mismatches

17
Amino acid substitution matrices
  • Method to score matches and mismatches
  • Based on observed frequencies of amino acid
    distributions and substitutions
  • Must model conservative nature of substitutions
  • Implicitly represent evolutionary patterns
  • Scores are based in Information Theory

18
Scoring amino acid substitutions
  • Amino acids are NOT distributed evenly
  • Amino acids share similarity based on chemical
    and physical properties

19
(No Transcript)
20
PAM scoring matrices
  • Margaret O. Dayhoff developed the first protein
    substitution matrices based on observed
    frequencies of amino acids as well as observed
    substitutions in aligned proteins (1978)
  • PAM: Point Accepted Mutation
  • Observed variations in groups of sequences at
    least 85% similar
  • estimated substitutions in a group of evolving
    proteins
  • represent substitutions that do not significantly
    alter protein structure/function, so accepted
    by natural selection

21
PAM Scoring Matrices
  • Based on mutational model of evolution
  • Assume changes occur independently
  • Changes are a prediction of first changes that
    occur as proteins diverge from common ancestor
  • Matrices for more distantly related protein
    sequences extrapolated from short-term changes
  • All amino acid positions in related sequences
    were scored

22
PAM Scoring matrices
$$ S_{ij} = \log \frac{q_{ij}}{p_i \, p_j} $$
$S_{ij}$ is the score for pairing amino acids i and j in the alignment.
$q_{ij}$ is the observed pairing frequency of amino acids i and j.
$p_i$ and $p_j$ are the expected frequencies of amino acids i and j.
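
A log-odds score of this kind can be computed directly from the three frequencies. The sketch below is illustrative only: the frequencies are made up, and the base-2 logarithm with a scale factor of 2 (half-bit units) is an assumption, since published matrices differ in the scaling they use.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Scaled log2 ratio of the observed pairing frequency to the chance expectation."""
    return scale * math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies, chosen only for illustration.
favored = log_odds_score(0.02, 0.1, 0.1)      # pair seen twice as often as expected
disfavored = log_odds_score(0.005, 0.1, 0.1)  # pair seen half as often as expected
print(round(favored), round(disfavored))
```

Pairs observed more often than chance get positive scores, and pairs observed less often than chance get negative scores.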
23
PAM 250 Matrix
24
PAM 250 Matrix
25
BLOSUM matrices
  • BLOSUM matrices are based on local alignments
  • BLOSUM BLOcks SUbstitution Matrix
  • Sequences within segments clustered into blocks
    based on identity
  • Contributions of the sequences within a cluster
    were averaged.
  • BLOSUM62 is a matrix calculated from comparisons
    of sequences < 62% identical.
  • BLOSUM40 is roughly comparable to PAM250

26
BLOSUM matrices
  • All BLOSUM matrices are based on observed
    alignments; they are not extrapolated from
    comparisons of closely related proteins.
  • The BLOCKS database contains thousands of groups
    of multiple sequence alignments.
  • BLOSUM62 is the default matrix in BLAST 2.0.

27
BLOSUM62 Matrix
28
BLOSUM62 Matrix
29
BLOSUM90
Scores are more strongly positive and more strongly negative than in BLOSUM62
30
Choosing a matrix
31
Gaps
  • Insertions can lead to gaps of varying lengths
  • Use 2 gap penalties
  • higher penalty for opening a gap
  • lower penalty for extension of a gap
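
The two-penalty (affine) scheme can be written as a one-line cost function. This is a sketch under stated assumptions: the values 5 and 2 are borrowed from the raw-score example later in the deck, and the convention that every gap character pays the extension penalty follows that example (some programs charge extension only from the second character on).

```python
def gap_cost(length, gap_open=5, gap_extend=2):
    """Affine gap cost: one opening penalty plus an extension penalty per gap character."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = gap_cost(3)         # a single gap of three characters
three_short_gaps = 3 * gap_cost(1)  # three separate one-character gaps
print(one_long_gap, three_short_gaps)
```

One gap of length 3 costs far less than three separate gaps, which reflects the idea that a single insertion or deletion event can span several residues.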

32
BLAST
Steps: build word list from query sequence, find
hits in database sequence, extend the hits to form
HSPs, calculate statistical significance of matches
Step 1: Build word list from query sequence
  • Build a list of words from the query sequence
  • Word length: 3 for proteins, 11 for DNA
  • Evaluate each word for a match using the scoring
    matrix and discard all below the threshold
  • Generally about 50 matches per word
  • The T value is the threshold; it determines the
    sensitivity and speed of the search

33
Query sequence: PSATPVLICWAAG
Word list: PSA ATP VLI CWA
Threshold score (T): 11
Matches to PSA (word: score)
  • PSA: 15
  • PST: 9
  • PDA: 11
  • WSA: 4
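
The word-list step can be sketched as follows. This is an illustration, not BLAST's implementation: it lists every overlapping 3-letter word (the slide shows only a few of them), it includes just the handful of BLOSUM62 entries needed to score three of the candidate words, and it keeps words scoring at or above T, reading "discard all below threshold" literally.

```python
def word_list(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# A few BLOSUM62 entries, only those needed for this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("S", "S"): 4, ("A", "A"): 4,
    ("S", "D"): 0, ("P", "W"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look a pair up in either order, since the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def word_score(word, candidate):
    """Sum of the pairwise scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


T = 11  # threshold score from the slide
words = word_list("PSATPVLICWAAG")
kept = [c for c in ("PSA", "PDA", "WSA") if word_score("PSA", c) >= T]
print(words)
print(kept)
```

Raising T shrinks the list of kept words, which speeds up the search at the cost of sensitivity.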
34
BLAST
Steps: build word list from query sequence, find
hits in database sequence, extend the hits to form
HSPs, calculate statistical significance of matches
Step 2: Find hits in database sequence
  • Find a match for each word in the database
  • The database is indexed, so all possible words in
    all sequences are known
  • This search is very fast (500K words/sec)
  • Matches > T are used as seeds for alignments
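
The indexed lookup can be pictured as a dictionary from words to positions. The sketch below is a simplification under stated assumptions: the two database sequences are hypothetical, and a Python dictionary stands in for whatever index structure a real search program uses.

```python
from collections import defaultdict


def build_index(database, w=3):
    """Map each word of length w to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index


def find_hits(neighborhood_words, index):
    """Look up every high-scoring word; each occurrence is a seed for extension."""
    return [(word, seq_id, pos)
            for word in neighborhood_words
            for seq_id, pos in index.get(word, [])]


# Hypothetical two-sequence database.
database = {"seq1": "MKPSACYLL", "seq2": "GGPDAWW"}
index = build_index(database)
hits = find_hits(["PSA", "PDA"], index)
print(hits)
```

Because the index is built once, each word lookup takes roughly constant time, which is why this stage is fast.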

35
BLAST
Steps: build word list from query sequence, find
hits in database sequence, extend the hits to form
HSPs, calculate statistical significance of matches
Step 3: Extend the hits to form HSPs
  • Extend alignment from each word in both
    directions so long as score increases
  • These alignments are the HSPs
  • Keep HSPs if score is above a given threshold

36
Extending the hit
(1) Score of new alignment = score of previous
    alignment (A) + score of new aligned pair

    PSAC / PSAC    24   (new alignment)
    PSA / PSA      15   (previous alignment, A)
    C / C           9   (new aligned pair)

(2) Score of alignment (C) = score of previous
    alignment (B) + score of new aligned pair

    PSACY / PSACY  31   (alignment C)
    PSAC / PSAC    24   (previous alignment, B)
    Y / Y           7   (new aligned pair)

(3) Repeat adding aligned pairs until the score goes
    down or the end of the sequence is reached.
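
The extension rule on this slide can be sketched as a loop that adds one aligned pair at a time. This follows the slide's simplified stopping rule (stop as soon as a pair would not raise the score); BLAST itself uses a more tolerant drop-off criterion. The flat mismatch score of -1 and the sequences are assumptions, while the match scores are the BLOSUM62 diagonal values used on the slide.

```python
# BLOSUM62 diagonal entries for the residues used in this example.
MATCH_SCORE = {"P": 7, "S": 4, "A": 4, "C": 9, "Y": 7}


def pair_score(a, b):
    """Match score from the table; any mismatch scores -1 (assumption for brevity)."""
    return MATCH_SCORE[a] if a == b else -1


def extend_right(query, subject, q_start, s_start, word_len=3):
    """Extend a seed to the right while each added pair raises the score."""
    score = sum(pair_score(query[q_start + k], subject[s_start + k])
                for k in range(word_len))
    length = word_len
    while q_start + length < len(query) and s_start + length < len(subject):
        gain = pair_score(query[q_start + length], subject[s_start + length])
        if gain <= 0:
            break  # the score would go down, so stop extending
        score += gain
        length += 1
    return score, length


# Hypothetical query and subject that share the seed word PSA.
result = extend_right("PSACYP", "PSACYA", 0, 0)
print(result)
```

The seed PSA scores 15, adding C/C gives 24, adding Y/Y gives 31, and the mismatched final pair ends the extension. A full implementation would extend to the left in the same way.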
37
BLAST
Steps: build word list from query sequence, find
hits in database sequence, extend the hits to form
HSPs, combine HSPs into a gapped alignment
Step 4: Combine HSPs into a gapped alignment
  • Highest scoring HSPs are extended in both directions
    using dynamic programming
  • Extension continues as long as the score > threshold
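
As a reminder of what dynamic programming means here, the sketch below computes a Smith-Waterman local alignment score. It is a generic textbook recurrence, not BLAST's gapped-extension routine: it scores the whole sequences rather than extending from an HSP, and it uses a single linear gap penalty. The scoring values are assumptions.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Each cell takes the best of: restart, match/mismatch, gap in b, gap in a.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only two rows of the table are kept in memory, which is enough to report the score; recovering the alignment itself would require storing traceback information.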

38
Positives: 200/310 (64%)
Identities: 135/310 (43%)
Expect: 2e-73
39
BLAST statistics
  • BLASTN match with E-value 0.004

atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
atgctatggccacgggtcttgtggatccca---tgatgtgtgcacctgcgata

How is this calculated and what does it mean?
40
Significance
  • Significance of hit is measured by E-value or
    expect value
  • Each alignment has a bit score (S)
  • E-value is the number of alignments with bit score ≥
    S that you expect to find by chance
  • $E = m n \, 2^{-S}$
  • m: effective length of the query
  • n: effective length (total number of bases) of
    the database

41
BLAST statistics
  • Bit score: the larger S, the smaller the E-value
  • Length of query: longer queries usually generate
    larger E-values
  • Size of database: a larger database results in
    larger E-values

42
Calculation of raw score
  • Raw score calculated from number of identities,
    mismatches, gaps and characters in the
    alignment
  • $R = aI + bX - cO - dG$
  • I: number of identities; a: reward for each
    identity
  • X: number of mismatches; b: score for each
    mismatch (negative)
  • O: number of gaps; c: gap-opening penalty
  • G: number of gap ("-") characters; d:
    gap-extension penalty

43
Calculation of raw score
atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
atgctatggccacgggtcttgtggatccca---tgatgtgtgcacctgcgata

There are 46 identities, 4 mismatches, 1 gap, and
3 "-" characters.
$R = 46 + (-3)(4) - (5)(1) - (2)(3) = 23$
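
The arithmetic above can be checked with a short function. The counts are taken from the slide as stated; the parameter names are assumptions, and the default values (reward 1, mismatch -3, gap opening 5, gap extension 2) are the ones implied by the worked example.

```python
def raw_score(identities, mismatches, gap_opens, gap_chars,
              reward=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Raw score: rewards for identities, minus mismatch and gap penalties."""
    return (reward * identities
            + mismatch * mismatches
            - gap_open * gap_opens
            - gap_extend * gap_chars)


R = raw_score(identities=46, mismatches=4, gap_opens=1, gap_chars=3)
print(R)
```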
44
Calculation of bit score
  • Bit score is obtained from raw score by

$$ S = \frac{\lambda R - \ln K}{\ln 2} $$
  • λ and K are normalizing parameters that depend
    on the scoring matrix

45
  • λ = 1.37
  • K = 0.711
  • H = 1.31
  • Effective query length: 34
  • Effective database length: 7,806,816,630

46
Calculation of bit score
  • Bit score is obtained from raw score by

$$ S = \frac{\lambda R - \ln K}{\ln 2} = \frac{(1.37)(23) - \ln(0.711)}{\ln 2} \approx 46 $$
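
The same conversion in code, as a quick check of the numbers. The function and parameter names are assumptions; the values of λ, K, and the raw score are the ones given on the slides.

```python
import math


def bit_score(raw, lam=1.37, k=0.711):
    """Convert a raw score to a bit score: (lambda * R - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


S = bit_score(23)
print(round(S, 2), round(S))
```

The unrounded value is just under 46, which is why the slide reports a bit score of 46.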
47
Calculation of E-value
  • $E = m n \, 2^{-S}$
  • m: effective length of the query
  • n: effective length of the database

In this example, S = 46, m = 34, and n =
7,806,816,630, so E ≈ 0.004.
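
A short check of this calculation, using the values from the example. The function name is an assumption.

```python
def e_value(bit_score, m, n):
    """Expected number of chance alignments scoring at least bit_score: m * n * 2^(-S)."""
    return m * n * 2.0 ** (-bit_score)


E = e_value(46, 34, 7_806_816_630)
print(round(E, 3))
```

Because E is proportional to both m and n, the same bit score yields a larger E-value against a larger database, and each additional bit of score halves E.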
48
Significance of alignment
  • P: probability that the observed match could have
    happened by chance
  • E (Expect value): number of matches as good as the
    observed one that would be expected to appear by
    chance in a database of the size probed
49
Significance of alignments
  • P values range between 0 and 1
  • E = P × size of the database
  • E values range from 0 to the size of the database

50
E values
  • Strongly dependent on the size of the database
  • The E-value from a search of a 9,000-protein database
    is 100x smaller than the E-value for the exact same
    alignment in a search of a 900,000-protein database

51
Caveats
  • Repetitive sequence
  • Regions of low complexity
  • Repeated motifs
  • Unusually high number of low-abundance amino acids
    (e.g., cysteines)

52
Filtering LCR
>ASH1_HUMAN Achaete-scute homolog 1 (HASH1)
MESSAKMESGGAGQQPQPQPQQPFLPPAACFFATAAAAAAA
AAAAAAQSAQQQQQQQQQQQQAPQLRPAADGQPSGGGHKSAPKQVKRQRS
SSPELMRCKRRLNFSGFGYSLPQQQPAAVARRNERERNRVKLVNLGFATL
REHVPNGAANKKMSKVETLRSAVEYIRALQQLLDEHDAVSAAFQAGVLSP
TISPNYSNDLNSMAGSPVSSYSSDEGSYDPLSPEEQELLDFTNWF
  • >ASH1_HUMAN Achaete-scute homolog 1 (HASH1)
    MESSAKMESGGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXXXQPSGGGHKSAPKQVKRQRSSSPELMRCK
    RRLNFSGFGYSLPQQQPAAXXXXXXXXXXXXXXVNLGFATLREHVPNGAA
    NKKMSKVETLRSAVEYIRALQQLLDEHDAVSAAFQAGVLSPTISPNYSND
    LNSMAGSPVSSYSSDEGSYDPLSPEEQELLDFTNWF
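
The effect of low-complexity filtering can be imitated with a simple entropy filter. This is a toy sketch, not the algorithm BLAST uses for masking: the window length, the entropy cutoff, and the test sequence are all hypothetical, and a windowed Shannon entropy is only one of several ways to measure complexity.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue that falls in a low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = "X" * window
    return "".join(masked)


# Hypothetical sequence: a diverse segment, a poly-Q run, another diverse segment.
example = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"
masked = mask_low_complexity(example)
print(masked)
```

Runs dominated by one or two residues have low entropy and get replaced by Xs, so they can no longer seed or extend spurious matches.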

53
Alignment without filtering. Note the low E-value
(E = 1e-13) alignment in region 120-133.
54
Alignment with filtering turned on. Note the higher
E-value (4e-7) and the Xs in region 120-133 as a
consequence of the filtering.
55
Flavors of BLAST
  • BLASTP
  • Protein query, protein DB
  • BLASTN
  • Megablast, discontinuous megablast, blastn
  • Nucleotide query, nucleotide DB
  • BLASTX
  • Nucleotide query translated in 6 frames, protein
    DB
  • TBLASTN
  • Protein query, nucleotide DB translated in 6
    frames
  • TBLASTX
  • Nucleotide query and nucleotide DB, both translated
    in 6 frames
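
The translated searches compare sequences at the protein level by translating nucleotide sequences in every reading frame: three frames on each strand, six in total. The sketch below shows that translation step under stated assumptions: it uses the standard genetic code, handles only the letters A, C, G, and T, and the function names and input sequence are made up for the example.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered by BASES at each position ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate the three forward frames and the three reverse-strand frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
print(frames)
```

Each of the six translations is then searched as if it were a protein sequence.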