Data Searches and Sequence Alignments - PowerPoint PPT Presentation

Slides: 42

Provided by: cclearn

Transcript and Presenter's Notes

Title: Data Searches and Sequence Alignments

1
Data Searches and Sequence Alignments

Assessing pairwise sequence similarity: BLAST
Creation and analysis of protein multiple sequence alignment

2
Why do we want to perform an alignment?

Assumption: evolutionary (phylogenetic) relationship
Functional implication
Build phylogenetic tree
Sequences that can be used for alignment: nucleotide sequence or protein sequence

3
Homolog: a YES or NO question
yes, share a common ancestor
no, not related
Similarity can be described as a fractional value
AAATACGCGGTAATAGCATGCATTAGTGGT
AATTACGCCGTAATTGCAAGCATTAGTGGT
26/30 = 87% identity
Too short to determine if they are homologous
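
The fractional identity quoted above can be checked with a short Python sketch. The function name `percent_identity` is a hypothetical helper, and the code assumes the two sequences are already aligned to the same length.

```python
def percent_identity(a: str, b: str) -> tuple[int, int, float]:
    """Count identical columns in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # A column counts as identical only if both rows hold the same residue.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches, len(a), 100.0 * matches / len(a)


seq1 = "AAATACGCGGTAATAGCATGCATTAGTGGT"
seq2 = "AATTACGCCGTAATTGCAAGCATTAGTGGT"
matches, length, pct = percent_identity(seq1, seq2)
print(f"{matches}/{length} = {pct:.0f}% identity")  # 26/30 = 87% identity
```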

4
For amino acid sequences
MATPGAGGRDKLIVASCYPVLIFIIAWQMQEP
MHSPGAAGKERLLVASCYPVIGFILAWNSQDP
Identity: 21/32 = 66%
Similarity: 28/32 = 87.5%
In general, protein sequences that share over 30% similarity are likely to be homologues.
Usually protein sequences are longer than 50 residues (minimal length for a domain)

5
Evaluation of two sequences

Dot plot

6
DOROTHYCROWFOOTHODKIN
DOROTHY--------HODKIN
GAP
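
A dot plot like the one on this slide can be sketched in a few lines of Python. This is a minimal text version with no window or threshold filtering; `dot_plot` is a hypothetical helper name. The two diagonal runs of `*` correspond to the matching segments, and the horizontal offset between them corresponds to the gap.

```python
def dot_plot(rows: str, cols: str) -> list[str]:
    """Return a text dot matrix: '*' where rows[i] == cols[j], else ' '."""
    return ["".join("*" if r == c else " " for c in cols) for r in rows]


seq_a = "DOROTHYCROWFOOTHODKIN"
seq_b = "DOROTHYHODKIN"
plot = dot_plot(seq_b, seq_a)

# Print seq_a across the top and seq_b down the side.
print("  " + seq_a)
for letter, line in zip(seq_b, plot):
    print(letter + " " + line)
```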
7
Aligned local sequence
8
Tandem repeat
9
Low complexity
10
Identify exon-intron
11
Inverted repeat for terminator
12
Aligned with frame 1
frameshift
Aligned with frame 3
13
Simple alignment

Match score and penalty

14
Gaps in alignment
15
Origination of gaps

Insertion vs. deletion (indel) events
One-step event in evolution
Origination penalty (gap-open penalty) → higher penalty value
Length penalty (gap-extension penalty) → smaller penalty value
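
The open/extension scheme can be illustrated with a small Python sketch. The penalty values (10 to open, 1 to extend), the match/mismatch scores, and the convention that a gap of length n costs open + (n - 1) × extend are assumptions chosen for illustration, not values from the slides.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Affine cost of one gap: the first position pays gap_open, the rest gap_extend."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def score_alignment(row_a: str, row_b: str, match: float = 1.0, mismatch: float = -1.0,
                    gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Score two pre-aligned rows, charging each run of '-' once for opening."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    score = 0.0
    prev_gap = None  # which row held the gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        gap_row = "a" if x == "-" else "b" if y == "-" else None
        if gap_row is None:
            score += match if x == y else mismatch
        elif gap_row == prev_gap:
            score -= gap_extend  # continuing an existing gap
        else:
            score -= gap_open    # starting a new gap
        prev_gap = gap_row
    return score


# One gap of length 4 is much cheaper than four separate gaps of length 1.
print(gap_penalty(4), 4 * gap_penalty(1))          # 13.0 40.0
print(score_alignment("ACACTGATCG", "ACACTG----"))  # -7.0
```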

16
Scoring matrices -- nucleotide
Conservative substitutions are more likely to be preserved in evolution
Not every mismatch should receive the same penalty → weighted score
17
Scoring matrices -- amino acid residues

PAM (point accepted mutation) matrix: the scores are computed from the substitutions that occur in alignments between highly similar sequences (relative mutability)
PAM unit -- the amount of evolutionary time required for an average of one substitution per 100 residues to be observed
Lower-numbered PAM matrices are more suitable for comparing closely related proteins
Usually PAM250 is used
BLOSUM matrix: the substitution rates are calculated by statistical clustering methods from aligned, ungapped related protein sequences
Higher-numbered BLOSUM matrices are more suitable for comparing closely related proteins
Usually BLOSUM62 is used, for comparing protein sequences with approx. 62% similarity

18
Similarity Physico-Chemical Properties of Amino
Acids
http://swift.embl-heidelberg.de/course/BSchap-4.html#prin
21
Algorithms for searching the best alignment

Exhaustive search is intractable
Dynamic programming: breaking a problem into sub-problems and using partial results to compute the final answer
S. Needleman and C. Wunsch algorithm
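
The dynamic-programming idea can be sketched as a Needleman-Wunsch global alignment in Python. This is a minimal version with a linear gap penalty; the scores (match 1, mismatch -1, gap -2) are assumptions for illustration, and when several alignments tie, the traceback returns just one of them.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = needleman_wunsch("ACACTGATCG", "ACACTG")
print(total)
print(top)
print(bottom)
```

Each cell is filled from three already-computed neighbours, so the table costs on the order of len(a) × len(b) steps instead of enumerating every possible alignment.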

22
GLOBAL AND LOCAL ALIGNMENT

Global alignment: compares two sequences over their entire length
Gap penalties are assessed regardless of where the gaps are located
Semi-global alignment -- terminal gaps are not penalized

23
ACACTGATCG
ACACTG----
24
GLOBAL AND LOCAL ALIGNMENT

LOCAL ALIGNMENT
Functional units in a sequence are more conserved than flexible regions
T. F. Smith and M. S. Waterman algorithm
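
A minimal Python sketch of Smith-Waterman local alignment follows. It assumes a linear gap penalty and illustrative scores (match 2, mismatch -1, gap -2). The key differences from global alignment are that cell scores are floored at zero and the traceback starts at the best-scoring cell rather than the corner. The second sequence in the example is made up, with `Z` flanks that cannot match anything.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere (local, not global).
            h[i][j] = max(0, h[i - 1][j - 1] + sub, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


result = smith_waterman("DOROTHYCROWFOOTHODKIN", "ZZZHODKINZZZ")
print(result)  # (12, 'HODKIN', 'HODKIN')
```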

25
Why do you want to do a databank search?
Early discovery of protein purification artifacts
Identification of new/unknown proteins
Look for possible functions of an unknown protein
Collect sequences for other studies
phylogenetic analysis
primer design
looking for new motifs

26
Database Search
Six frame translation
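
Six-frame translation can be sketched in Python as below. The code assumes the standard genetic code and an unambiguous DNA alphabet (A, C, G, T); codons containing anything else are translated as `X`. The example sequence and the frame labels (`+1` to `-3`) are illustrative choices.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict[str, str]:
    """Three frames on the given strand (+1..+3) and three on its reverse complement."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```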
28
Word size 4
29
Score is related to the length of match
30
Significance of the BLAST result

$E = k\,m\,N\,e^{-\lambda S}$

$m$: length of the query
$N$: total number of letters in the target database
$k$, $\lambda$: constants, dependent on the scoring matrix
$S$: score of the HSP
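
The formula can be evaluated directly, as in this Python sketch. The default values of `k` and `lam` are illustrative assumptions (they are in the range commonly quoted for gapped BLOSUM62 searches); in practice both constants come from the scoring system actually used, and the query and database sizes below are made up.

```python
import math


def expect_value(score: float, query_len: int, db_len: int,
                 k: float = 0.041, lam: float = 0.267) -> float:
    """E = k * m * N * exp(-lambda * S): expected number of chance hits scoring >= S."""
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical search: a 300-residue query against a database of 10^8 letters.
for s in (40, 60, 80, 100):
    print(s, f"{expect_value(s, 300, 100_000_000):.3g}")
```

E falls exponentially as the score rises and grows linearly with both the query length and the database size, which is why the same score is less significant in a larger database.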
31
Score of two random sequences of the same length
Low score with a lot of hits → no significance
Z = (score - mean) / std dev ≥ 5
Score high enough: not just a random hit
If the score of the alignment observed is no
better than expected from a random permutation of
the sequence, then it is likely to have arisen by
chance.
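
The permutation test described here can be sketched in Python as follows. The local scoring function, its parameters (match 2, mismatch -1, gap -2), the number of shuffles, and the fixed random seed are all assumptions for illustration. The subject sequence is shuffled so that its composition stays the same while its order is destroyed.

```python
import random
import statistics


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman style), keeping one DP row at a time."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_z_score(query: str, subject: str, trials: int = 200, seed: int = 0):
    """Compare the observed score with scores against shuffled copies of the subject."""
    rng = random.Random(seed)
    observed = local_score(query, subject)
    letters = list(subject)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # random permutation: same composition, new order
        null_scores.append(local_score(query, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    # A zero spread would make Z undefined; report infinity in that edge case.
    z = (observed - mean) / sd if sd > 0 else float("inf")
    return observed, mean, z


observed, mean, z = shuffle_z_score("MATPGAGGRDKLIVASCYPVLIFIIAWQMQEP",
                                    "MHSPGAAGKERLLVASCYPVIGFILAWNSQDP")
print(f"observed={observed} shuffled mean={mean:.1f} Z={z:.1f}")
```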
37
BLAST cutoffs

Nucleotide
E < 10^-6
Identity > 70%
Protein
E < 10^-3
Identity > 25%
Be careful around the twilight zone: use reciprocal BLAST, or BLAST a randomized sequence
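
The cutoffs above can be applied as a simple filter, sketched here in Python. The function name and the way a hit is represented (an E-value plus a percent identity) are hypothetical; the thresholds are the ones listed on the slide.

```python
# (maximum E-value, minimum percent identity) per molecule type, from the slide.
CUTOFFS = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def passes_cutoff(evalue: float, percent_identity: float, molecule: str) -> bool:
    """True if a hit is below the E-value cutoff and above the identity cutoff."""
    max_evalue, min_identity = CUTOFFS[molecule]
    return evalue < max_evalue and percent_identity > min_identity


print(passes_cutoff(1e-10, 85.0, "nucleotide"))  # True
print(passes_cutoff(1e-4, 85.0, "nucleotide"))   # False: E-value too high
print(passes_cutoff(1e-4, 30.0, "protein"))      # True
```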

38
megablast

Variation of blastn
Aligns long or highly similar nucleotide sequences
Locates a sequence in a contig
Word size 28, for almost exact matches
No gap-open penalty, for a faster search; tends to return matches with more small gaps

39
PSI-BLAST

Position-Specific Iterated BLAST
Identifies distantly related proteins
Uses a position-specific score matrix (PSSM) to train the search: each iteration adds the residues from the new hits to the profile (PSSM), and the updated profile is used for the next iteration (search).
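
The profile-building step can be illustrated with a simplified Python sketch. This is not PSI-BLAST's actual procedure: it assumes ungapped hits of equal length made of the 20 standard residues, a uniform background frequency, a flat pseudocount of 1, and no sequence weighting. The hit segments in the example are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """One log-odds column (log2 of observed/background frequency) per position."""
    length = len(aligned_hits[0])
    if any(len(hit) != length for hit in aligned_hits):
        raise ValueError("hits must be aligned to the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(hit[col] for hit in aligned_hits)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm: list[dict[str, float]], segment: str) -> float:
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


hits = ["VASCYPV", "VASCYPV", "LASCYPI", "VTSCFPV"]  # hypothetical aligned hit segments
pssm = build_pssm(hits)
print(round(score_with_pssm(pssm, "VASCYPV"), 2))
print(round(score_with_pssm(pssm, "IASCFPV"), 2))
```

In an iterative search, segments that score well against the current profile would be added to `hits` and the profile rebuilt before the next round.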

40
Start from BLAST → use the high-scoring hits to make a PSSM → use the PSSM for a second BLAST → repeat for a 3rd
41
Check the hits to include when making the new PSSM
