Title: Data Searches and Sequence Alignments
1. Data Searches and Sequence Alignments
- Assessing pairwise sequence similarity: BLAST
- Creation and analysis of protein multiple sequence alignments
2. Why do we want to perform an alignment?
- Assumption: an evolutionary (phylogenetic) relationship
- Functional implication
- Build a phylogenetic tree
- Sequences that can be used for alignment: nucleotide sequences or protein sequences
3.
- Homolog: a YES or NO question
- yes: share a common ancestor
- no: not related
- Similarity can be described as a fractional value
- AAATACGCGGTAATAGCATGCATTAGTGGT
- AATTACGCCGTAATTGCAAGCATTAGTGGT
- 26/30 = 87% identity
- Too short to determine whether they are homologous
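The identity fraction for the two nucleotide sequences above can be checked with a few lines of Python. This is a minimal sketch that assumes the sequences are already aligned position by position with no gaps; the function name `percent_identity` is illustrative and not from the slides.

```python
def percent_identity(a, b):
    """Return (matches, length, percent) for two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    # Count the positions where both sequences carry the same letter.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches, len(a), 100.0 * matches / len(a)


seq1 = "AAATACGCGGTAATAGCATGCATTAGTGGT"
seq2 = "AATTACGCCGTAATTGCAAGCATTAGTGGT"
matches, length, pct = percent_identity(seq1, seq2)
print(f"{matches}/{length} = {pct:.0f}% identity")  # 26/30 = 87% identity
```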
4.
- For amino acid sequences
- MATPGAGGRDKLIVASCYPVLIFIIAWQMQEP
- MHSPGAAGKERLLVASCYPVIGFILAWNSQDP
- Identity: 21/32 = 66%
- Similarity: 28/32 = 87.5%
- In general, protein sequences that share over 30% similarity are likely to be homologues.
- Usually the protein sequences are longer than 50 residues (the minimal length for a domain)
5. Evaluation of two sequences
6.
DOROTHYCROWFOOTHODKIN
DOROTHY--------HODKIN
GAP
7. Aligned local sequence
8. Tandem repeat
9. Low complexity
10. Identify exon-intron
11. Inverted repeat for terminator
12. Aligned with frame 1
frameshift
Aligned with frame 3
13. Simple alignment
14. Gaps in alignment
15. Origination of gaps
- Insertion vs. deletion (indel) events
- A one-step event in evolution
- Origination penalty (gap opening penalty) → higher penalty value
- Length penalty (gap extension penalty) → smaller penalty value
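The two penalties above combine into what is usually called an affine gap penalty. The sketch below shows why one long gap (a single indel event) ends up cheaper than many short ones. The default values 10 and 0.5 are illustrative assumptions, and the convention used (opening cost covers the first gap position) is only one of several in use.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` positions: open once, then extend."""
    if length <= 0:
        return 0.0
    # The opening penalty is charged once; each further position costs less.
    return gap_open + gap_extend * (length - 1)


# One 8-residue gap (a single indel event) versus
# eight separate 1-residue gaps (eight independent events).
one_long_gap = affine_gap_penalty(8)          # 10 + 0.5 * 7 = 13.5
eight_short_gaps = 8 * affine_gap_penalty(1)  # 8 * 10 = 80.0
print(one_long_gap, eight_short_gaps)
```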
16. Scoring matrices: nucleotides
Conservative substitutions are more likely to be preserved in evolution.
Not every mismatch should receive the same penalty → weighted score
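One common way to weight nucleotide mismatches is to penalize transitions (purine to purine, or pyrimidine to pyrimidine) less than transversions. The slide does not say which weighting it had in mind, so the scheme and the score values below are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Weighted score for one aligned pair of nucleotides."""
    if x == y:
        return match
    # A transition stays within the same chemical class; a transversion does not.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def ungapped_score(a, b):
    """Sum the weighted scores over two equal-length, ungapped sequences."""
    return sum(nucleotide_score(x, y) for x, y in zip(a, b))


# A/G is a transition (-1), C/C a match (+1), G/T a transversion (-2), T/T a match (+1).
print(ungapped_score("ACGT", "GCTT"))  # -1
```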
17. Scoring matrices: amino acid residues
- PAM (point accepted mutation) matrix: the scores are computed from the substitutions that occur in alignments between highly similar sequences (relative mutability)
- PAM unit: the amount of evolutionary time required for an average of one substitution per 100 residues to be observed
- Lower-numbered PAM matrices are more suitable for comparing closely related proteins
- Usually PAM250 is used
- BLOSUM matrix: the substitution rates are calculated by statistical clustering methods from un-gapped alignments of related protein sequences
- Higher-numbered BLOSUM matrices are more suitable for comparing closely related proteins
- Usually BLOSUM62 is used; it is built from sequence blocks clustered at approx. 62% identity
18. Similarity: Physico-Chemical Properties of Amino Acids
http://swift.embl-heidelberg.de/course/BSchap-4.html#prin
21. Algorithms for searching the best alignment
- Exhaustive search is intractable
- Dynamic programming: breaking a problem into sub-problems and using partial results to compute the final answer
- S. Needleman and C. Wunsch
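The dynamic programming idea can be sketched as a small Needleman-Wunsch style global aligner. This is a teaching sketch under simplifying assumptions: a single linear gap penalty rather than separate opening and extension penalties, and made-up match/mismatch values instead of a real scoring matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACACTGATCG", "ACACTG")
print(score)
print(top)
print(bottom)
```

Several alignments can tie for the best score; which one is printed depends on the tie-breaking order in the traceback.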
22. GLOBAL AND LOCAL ALIGNMENT
- Global alignment: compares two sequences over their entire length
- Gap penalties are assessed regardless of their location
- Semi-global alignment: terminal gaps are not penalized
23.
ACACTGATCG
ACACTG----
24. GLOBAL AND LOCAL ALIGNMENT
- LOCAL ALIGNMENT
- Functional units in a sequence are more conserved than flexible regions
- T. F. Smith and M. S. Waterman algorithm
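Local alignment changes the dynamic programming recurrence in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than the corner. The following Smith-Waterman style sketch assumes a linear gap penalty and illustrative score values.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best-scoring local region.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new local alignment start at any cell.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the conserved region is reported; the unmatched flank is ignored.
print(smith_waterman("DOROTHYCROWFOOTHODKIN", "HODKIN"))  # (12, 'HODKIN', 'HODKIN')
```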
25.
- Why do you want to do a databank search?
- Early discovery of protein purification artifacts
- Identification of new/unknown proteins
- Look for possible functions of an unknown protein
- Collect sequences for other studies
- phylogenetic analysis
- primer design
- looking for new motifs
26. Database Search
Six-frame translation
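Six-frame translation reads a nucleotide sequence in three offsets on the forward strand and three on the reverse complement. The sketch below assumes the standard genetic code and an unambiguous A/C/G/T alphabet; the frame labels "+1" to "-3" are a naming choice made here, not something taken from the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame label to the translated protein string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```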
28. Word size 4
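The word size sets the length of the short exact matches that seed a BLAST search. As a rough sketch of that seeding step only, the code below indexes every word of length 4 in a query and reports where the same words occur in a subject sequence. It assumes exact word matches; the neighborhood-word scoring used for protein searches and the extension of seeds into HSPs are left out.

```python
from collections import defaultdict


def word_index(seq, w=4):
    """Map each word of length w to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def word_hits(query, subject, w=4):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = word_index(query, w)
    return [(q, s)
            for s in range(len(subject) - w + 1)
            for q in index.get(subject[s:s + w], [])]


# The word ACGT starts at position 0 in the query and position 2 in the subject.
print(word_hits("ACGTAC", "TTACGTTT"))  # [(0, 2)]
```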
29. Score is related to the length of the match
30. Significance of the BLAST result
m: length of the query
N: total number of letters in the target database
λ: constant, dependent on the scoring matrix
S: score of the HSP
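The symbols listed above match the usual Karlin-Altschul expression for the expect value, E = K * m * N * exp(-λ * S). The formula itself and the second constant K do not survive in the slide text, so treat both as an assumption about what the slide showed. The sketch below uses placeholder values for λ and K, chosen only for illustration.

```python
import math


def expect_value(score, m, n, lam, k):
    """E = K * m * N * exp(-lambda * S): expected number of chance HSPs scoring >= S."""
    return k * m * n * math.exp(-lam * score)


# Placeholder parameters (assumptions, not taken from the slides).
LAMBDA, K = 0.267, 0.041
query_length = 300          # m
database_letters = 5e8      # N

for s in (40, 80, 160):
    e = expect_value(s, query_length, database_letters, LAMBDA, K)
    print(f"S = {s:3d}  ->  E = {e:.3g}")
```

E grows in proportion to the query and database sizes and falls exponentially with the score, so a higher-scoring HSP is less likely to be a chance hit.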
31. Score of two random sequences of the same length
Low score with a lot of hits → no significance
Z = (score - mean) / std dev ≥ 5
Score high enough → not just a random hit
If the score of the observed alignment is no better than expected from a random permutation of the sequence, then it is likely to have arisen by chance.
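The permutation test described above can be sketched by shuffling one sequence many times, rescoring each shuffle, and comparing the observed score with the resulting mean and standard deviation. The scoring function, the number of shuffles, and the fixed random seed are all assumptions made to keep the example small and repeatable.

```python
import random
import statistics


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, score only, two rows of memory)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffle_z_score(a, b, n_shuffles=200, seed=0):
    """Z = (observed score - mean of shuffled scores) / std dev of shuffled scores."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition and length, random order
        null_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    if sd == 0:
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd


seq1 = "AAATACGCGGTAATAGCATGCATTAGTGGT"
seq2 = "AATTACGCCGTAATTGCAAGCATTAGTGGT"
print(f"Z = {shuffle_z_score(seq1, seq2):.1f}")
```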
37. BLAST cutoffs
- Nucleotide
- E < 10^-6
- Identity > 70%
- Protein
- E < 10^-3
- Identity > 25%
- Be careful around the twilight zone; use reciprocal BLAST or BLAST a randomized sequence
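The cutoffs and the reciprocal check above could be applied to search results along these lines. This sketch works on plain Python values rather than real BLAST output; the hit identifiers and dictionaries are hypothetical, and parsing an actual BLAST report is left out.

```python
# Thresholds from the slide: E-value upper bound and percent identity lower bound.
CUTOFFS = {
    "nucleotide": {"max_evalue": 1e-6, "min_identity": 70.0},
    "protein": {"max_evalue": 1e-3, "min_identity": 25.0},
}


def passes_cutoff(evalue, percent_identity, kind):
    """True if a hit is below the E-value cutoff and above the identity cutoff."""
    limits = CUTOFFS[kind]
    return evalue < limits["max_evalue"] and percent_identity > limits["min_identity"]


def reciprocal_best_hits(a_to_b, b_to_a):
    """Keep pairs where A's best hit in B also has A as its own best hit."""
    return [(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a]


# Hypothetical best-hit tables from a forward and a reverse search.
forward = {"geneA1": "geneB7", "geneA2": "geneB3"}
reverse = {"geneB7": "geneA1", "geneB3": "geneA9"}

print(passes_cutoff(1e-4, 40.0, "protein"))     # True
print(passes_cutoff(1e-4, 40.0, "nucleotide"))  # False
print(reciprocal_best_hits(forward, reverse))   # [('geneA1', 'geneB7')]
```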
38. megablast
- Variation of blastn
- Aligns long or highly similar nucleotide sequences
- Locates the sequence in a contig
- Word size 28, for almost exact matches
- No gap opening penalty, for a fast search; tends to give matches with more small gaps.
39. PSI-BLAST
- Position-Specific Iterated BLAST
- Identifies distantly related proteins
- Uses a position-specific scoring matrix (PSSM) to train the search: each iteration incorporates the new residues into the profile (PSSM) and uses the new profile for the next iteration (search).
40. Start from BLAST → use the high-scoring hits to make a PSSM → use the PSSM for a second BLAST → repeat for a 3rd
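The profile-building step in this cycle can be sketched as a log-odds matrix computed column by column from the aligned hits. This is a simplified illustration: it assumes ungapped, equal-length hits, a uniform background frequency, and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The toy hit sequences are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2(observed / background)."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all hits must be aligned to the same length")
    background = 1.0 / len(alphabet)  # uniform background (an assumption)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({r: math.log2(((counts[r] + pseudocount) / total) / background)
                     for r in alphabet})
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


hits = ["MKVL", "MKIL", "MRVL"]  # toy high-scoring hits from a first search
pssm = build_pssm(hits)
print(score_with_pssm(pssm, "MKVL") > score_with_pssm(pssm, "AAAA"))  # True
```

In the next iteration the database would be rescored with this matrix, newly found hits added to `hits`, and the matrix rebuilt.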
41. Check the hits to include when making the new PSSM