Title: Chapter 2 Data Searches and Pairwise Alignments
1Chapter 2: Data Searches and Pairwise Alignments
- 2004/03/08
2Introduction
- What is the difference between acctga and agcta?
a c c t g a
a g c t g a
a g c t - a
3Nomenclature
42.1 Dot Plots
52.2 Simple Alignments
6- mutation (substitution): common
- insertion and deletion (gap, indel): rare
- scoring scheme
- match score
- mismatch score
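A minimal sketch of a scoring scheme for a simple (ungapped) alignment. The function name `score_ungapped` and the default values (match = 1, mismatch = 0) are assumptions chosen for illustration, not values stated on the slide.

```python
def score_ungapped(top, bottom, match=1, mismatch=0):
    """Score two equal-length sequences column by column (no gaps allowed)."""
    if len(top) != len(bottom):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    # Each column adds the match score or the mismatch score.
    return sum(match if x == y else mismatch for x, y in zip(top, bottom))


# Five of the six columns match; the second column (c/g) is a substitution.
example_score = score_ungapped("acctga", "agctga")
print(example_score)
```

Changing the two score parameters changes how strongly substitutions are punished relative to matches, which is the whole point of making the scheme explicit.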
72.3 Gaps
82.3.1 Gap Penalty
- uniform gap
- affine gap
- origination penalty
- length penalty
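The two penalty styles can be compared with a short sketch. It assumes one common convention: a gap of length L costs origination + L × length, and a uniform penalty is the special case origination = 0. The function name, the default values, and the example alignment are all hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=0, origination=-2, length=-1):
    """Score a gapped alignment; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != prev_gap_row:
                score += origination  # charged once per run of gaps
            score += length           # charged for every gap position
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


affine = score_alignment("AC--TCG", "ACAGTAG")                  # one gap of length 2
uniform = score_alignment("AC--TCG", "ACAGTAG", origination=0)  # per-position only
print(affine, uniform)
```

With an origination penalty, one long gap is cheaper than several short gaps of the same total length, which matches the intuition that a single insertion or deletion event can span several positions.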
92.4 Scoring Matrices
10- Modeling
11Modeling
13Define the odds ratio as
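One common form of the odds ratio, written in the notation of Durbin et al. (an assumption; the slide's own symbols may differ): M is the related model with joint residue probabilities p_ab, R is the random model with independent residue frequencies q_a, and x, y are the two aligned sequences without gaps.

```latex
\[
\frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \prod_{i} \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}},
\qquad
s(a, b) = \log \frac{p_{ab}}{q_a\, q_b},
\qquad
S = \sum_{i} s(x_i, y_i).
\]
```

Taking the logarithm turns the product into a sum, so the alignment score S is the sum of per-column log-odds entries s(a, b), which is what a scoring matrix stores.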
142.4.1 PAM Matrices
- Dayhoff, Schwartz, Orcutt (1978)
- Point Accepted Mutation
- Based on observed substitution rates
- (Box. 2.1)
- Input
- A set of observed substitution rates
- Output
- PAM-1 matrix (log-odds matrix)
15Multiple Alignment
- (1) Group the sequences with high similarity (> 85% identity).
16Phylogenetic Tree
- (2) For each group, build the corresponding phylogenetic tree.
17Mutation Frequency
- A->G, I->L, A->G, A->L, C->S, G->A
- (3) F_{G,A} = 3 (A->G twice, G->A once)
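The count in step (3) can be reproduced with a few lines. This sketch assumes the frequency is symmetric, so A->G and G->A are tallied under the same pair; the function name and the tuple-based input format are hypothetical.

```python
from collections import Counter


def mutation_frequencies(substitutions):
    """Count observed substitutions per unordered amino acid pair."""
    counts = Counter()
    for source, target in substitutions:
        # Sort the pair so that A->G and G->A share one key.
        counts[tuple(sorted((source, target)))] += 1
    return counts


observed = [("A", "G"), ("I", "L"), ("A", "G"), ("A", "L"), ("C", "S"), ("G", "A")]
freq = mutation_frequencies(observed)
print(freq[("A", "G")])
```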
18Relative Mutability
19Mutation Probability
20Odds Ratio
21Log-Odds Ratio
22- Which PAM matrix is the most appropriate?
- the length of the sequences
- how closely the sequences are believed to be related
- => PAM 120 for database searches
- => PAM 200 for comparing two specific proteins
232.4.2 BLOSUM Matrices
- Henikoff & Henikoff (1992)
- PAM-k: larger k is suited to more divergent sequences
- BLOSUM-k: larger k is suited to more similar sequences
- => BLOSUM62 for ungapped matching
- => BLOSUM50 for gapped matching
242.5 Dynamic Programming
- The Needleman and Wunsch Algorithm (Global Alignment)
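A minimal sketch of Needleman-Wunsch global alignment with a linear gap penalty. The scoring values (match = 1, mismatch = 0, gap = -1) and the function name are assumptions for illustration. When several alignments share the optimal score, the traceback returns only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


nw_score, nw_top, nw_bottom = needleman_wunsch("ACTCG", "ACAGTAG")
print(nw_score)
print(nw_top)
print(nw_bottom)
```

The table has (n + 1) × (m + 1) cells and each cell is filled in constant time, so both time and memory grow with the product of the two sequence lengths.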
26Alignment Graph
28
A C - - T C G
A C A G T A G
29Complexity
302.6 Global and Local Alignments
- Semi-global alignment
- Local alignment
312.6.1 Semi-global Alignments
A A C A C G T G T C T
- - - A C G T - - - -
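A sketch of semi-global scoring, assuming the usual formulation in which gaps at either end of either sequence cost nothing: the first row and column start at zero, and the result is the best value in the last row or last column. The scoring values and the function name are assumptions.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Best alignment score when leading and trailing gaps are free."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero: leading gaps are not penalized.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing gaps are free as well, so the alignment may end anywhere
    # in the last row or the last column.
    last_row = max(score[n])
    last_col = max(score[i][m] for i in range(n + 1))
    return max(last_row, last_col)


sg_score = semiglobal_score("AACACGTGTCT", "ACGT")
print(sg_score)
```

For this pair the short sequence matches a stretch in the middle of the long one, and the seven unaligned end positions are not charged.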
332.6.2 Local Alignment
- The Smith-Waterman Alignment
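A minimal sketch of Smith-Waterman local alignment. It differs from the global recurrence in two ways: every cell is floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions; a negative mismatch score is used so that poorly matching regions are cut off.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, local_a, local_b) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                            # start a fresh local alignment
                H[i - 1][j - 1] + sub(i, j),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


sw_score, sw_top, sw_bottom = smith_waterman("AACACGTGTCT", "ACGT")
print(sw_score, sw_top, sw_bottom)
```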
352.7 Database Searches
- BLAST and its relatives
- FASTA and related algorithms
362.7.1 BLAST and Its Relatives
Program   Database                Query
BLASTN    Nucleotide              Nucleotide
BLASTP    Protein                 Protein
BLASTX    Protein                 Nucleotide -> Protein
TBLASTN   Nucleotide -> Protein   Protein
TBLASTX   Nucleotide -> Protein   Nucleotide -> Protein
37BLASTP
- Using PAM or BLOSUM matrices
382.7.2 FASTA and Related Algorithms
- Dot plot band search
- Preprocess the target sequence.
- Identify the position for each word.
- (for amino acids with word length = 1, a 20-entry array)
- Scan the query sequence.
- Compute the shifts of the query to align each word with the target.
- Find the mode of the shifts.
- Join the possible shifts into one new target sequence. Perform the full local alignment algorithm.
39- Target: FAMLGFIKYLPGCM
- Query: TGFIKYLPGACT
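The word-lookup and shift-counting steps can be sketched for the target and query above with word length 1. This illustrates the idea only and is not the FASTA program itself. A dictionary stands in for the 20-entry array, positions are 0-based, and the shift is defined as target position minus query position; all three choices, and the function names, are assumptions.

```python
from collections import Counter, defaultdict


def word_positions(target, k=1):
    """Map each length-k word of the target to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def shift_counts(target, query, k=1):
    """Count, for every word hit, the shift that lines the query up with the target."""
    index = word_positions(target, k)
    shifts = Counter()
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            shifts[tpos - qpos] += 1
    return shifts


shifts = shift_counts("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
best_shift, hits = shifts.most_common(1)[0]
print(best_shift, hits)
```

The mode identifies the diagonal of the dot plot with the most word hits; here it corresponds to sliding the query three positions to the right so that GFIKYLPG lines up in both sequences.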
402.7.3 Alignment Scores and Statistical Significance of Database Searches
- related model vs. random model
- S-score: the alignment score
- E-score: expected number of sequences with score > S by random chance
- P-score: probability that one or more sequences with score > S would be found randomly
- => Low E and P are better.
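The E-score and P-score are linked if the number of chance hits is assumed to follow a Poisson distribution with mean E, as in the statistics used by BLAST. That assumption is not stated on the slide. Under it, P = 1 - exp(-E); the function name below is hypothetical.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number of hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-5, 0.1, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values the two numbers are nearly identical, while for large E the probability saturates at 1.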
41
42PAM 120, scale (ln 2)/2 nats
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  3 -3 -1  0 -3 -1  0  1 -3 -1 -3 -2 -2 -4  1  1  1 -7 -4  0  0 -1 -1 -8
R -3  6 -1 -3 -4  1 -3 -4  1 -2 -4  2 -1 -5 -1 -1 -2  1 -5 -3 -2 -1 -2 -8
N -1 -1  4  2 -5  0  1  0  2 -2 -4  1 -3 -4 -2  1  0 -4 -2 -3  3  0 -1 -8
D  0 -3  2  5 -7  1  3  0  0 -3 -5 -1 -4 -7 -3  0 -1 -8 -5 -3  4  3 -2 -8
C -3 -4 -5 -7  9 -7 -7 -4 -4 -3 -7 -7 -6 -6 -4  0 -3 -8 -1 -3 -6 -7 -4 -8
Q -1  1  0  1 -7  6  2 -3  3 -3 -2  0 -1 -6  0 -2 -2 -6 -5 -3  0  4 -1 -8
E  0 -3  1  3 -7  2  5 -1 -1 -3 -4 -1 -3 -7 -2 -1 -2 -8 -5 -3  3  4 -1 -8
G  1 -4  0  0 -4 -3 -1  5 -4 -4 -5 -3 -4 -5 -2  1 -1 -8 -6 -2  0 -2 -2 -8
H -3  1  2  0 -4  3 -1 -4  7 -4 -3 -2 -4 -3 -1 -2 -3 -3 -1 -3  1  1 -2 -8
I -1 -2 -2 -3 -3 -3 -3 -4 -4  6  1 -3  1  0 -3 -2  0 -6 -2  3 -3 -3 -1 -8
L -3 -4 -4 -5 -7 -2 -4 -5 -3  1  5 -4  3  0 -3 -4 -3 -3 -2  1 -4 -3 -2 -8
K -2  2  1 -1 -7  0 -1 -3 -2 -3 -4  5  0 -7 -2 -1 -1 -5 -5 -4  0 -1 -2 -8
M -2 -1 -3 -4 -6 -1 -3 -4 -4  1  3  0  8 -1 -3 -2 -1 -6 -4  1 -4 -2 -2 -8
F -4 -5 -4 -7 -6 -6 -7 -5 -3  0  0 -7 -1  8 -5 -3 -4 -1  4 -3 -5 -6 -3 -8
P  1 -1 -2 -3 -4  0 -2 -2 -1 -3 -3 -2 -3 -5  6  1 -1 -7 -6 -2 -2 -1 -2 -8
S  1 -1  1  0  0 -2 -1  1 -2 -2 -4 -1 -2 -3  1  3  2 -2 -3 -2  0 -1 -1 -8
T  1 -2  0 -1 -3 -2 -2 -1 -3  0 -3 -1 -1 -4 -1  2  4 -6 -3  0  0 -2 -1 -8
W -7  1 -4 -8 -8 -6 -8 -8 -3 -6 -3 -5 -6 -1 -7 -2 -6 12 -2 -8 -6 -7 -5 -8
43Applications
- Reconstructing long sequences of DNA from overlapping sequence fragments
- Determining physical and genetic maps from probe data under various experiment protocols
- Database searching
- Comparing two or more sequences for similarities
44- Protein structure prediction (building profiles)
- Comparing the same gene sequenced by two different labs
452.8 Multiple Sequence Alignments
- CLUSTAL
- D. G. Higgins & P. M. Sharp, 1988
- CLUSTALW
- Sequences are weighted according to how divergent they are from the most closely related pair of sequences.
- Gaps are weighted for different sequences.
46Summary
- notion of similarity
- the scoring system used to rank alignments
- the algorithms used to find optimal-scoring alignments
- the statistical method used to evaluate the significance of an alignment score
47References
- Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.
- BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003.
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998.
- Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.