Chapter 2 Data Searches and Pairwise Alignments

Provided by: Shie91

1
Chapter 2: Data Searches and Pairwise Alignments

2004/03/08

2
Introduction

What is the difference between acctga and agcta?

a c c t g a
a g c t g a
a g c t - a
3
Nomenclature
4
2.1 Dot Plots
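
A dot plot marks every position where a residue (or a short window) of one sequence matches the other, so similar regions show up as diagonal runs. The sketch below is a minimal text version; the function name `dot_plot` and the `window` parameter are illustrative choices, not taken from the slides.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the rows of a dot matrix.

    Row i, column j holds '*' when the window starting at seq_a[i]
    equals the window starting at seq_b[j], and '.' otherwise.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# The two sequences from the Introduction slide.
for row in dot_plot("acctga", "agctga"):
    print(row)
```

A larger `window` suppresses isolated chance matches at the cost of missing short similar regions.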
5
2.2 Simple Alignments

No gap

mutation (substitution): common
insertion, deletion: gap, indel (rare)
scoring scheme
match score
mismatch score
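
A gap-free alignment can be scored by walking the two sequences column by column and adding a match or mismatch score. The sketch below assumes a match score of 1 and a mismatch score of 0; these values and the function name `ungapped_score` are illustrative, since the transcript does not give the actual numbers.

```python
def ungapped_score(seq_a, seq_b, match=1, mismatch=0):
    """Score two equal-length sequences aligned without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("a gap-free alignment needs equal-length sequences")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# acctga vs. agctga differ by a single substitution (c -> g).
print(ungapped_score("acctga", "agctga"))               # 5 matches, 1 mismatch
print(ungapped_score("acctga", "agctga", mismatch=-1))  # mismatch now costs 1
```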
7
2.3 Gaps
8
2.3.1 Gap Penalty

uniform gap
affine gap
origination penalty
length penalty
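
The two penalty schemes can be compared with a small sketch. It assumes the affine penalty is the origination penalty plus a length penalty for every gap position (some texts charge the length penalty only from the second position on); the default values are hypothetical.

```python
def uniform_gap_penalty(length, per_position=1):
    """Every gap position costs the same amount."""
    return per_position * length


def affine_gap_penalty(length, origination=2, length_penalty=1):
    """Opening a gap costs `origination`; each gap position adds `length_penalty`."""
    if length == 0:
        return 0
    return origination + length_penalty * length


# One gap of length 4 versus two separate gaps of length 2.
print(affine_gap_penalty(4))      # one origination penalty
print(2 * affine_gap_penalty(2))  # two origination penalties
```

Under the affine scheme a single long gap is cheaper than several short ones of the same total length, which reflects the view that one insertion or deletion event can span several positions.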

9
2.4 Scoring Matrices
10

Modeling

11
Modeling
12
(No Transcript)
13
Define the odds ratio as
14
2.4.1 PAM Matrices

Dayhoff, Schwartz, Orcutt (1978)
Point Accepted Mutation
Based on observed substitution rates
(Box 2.1)
Input
A set of observed substitution rates
Output
PAM-1 matrix (log-odds matrix)

15
Multiple Alignment

(1) Group the sequences with high similarity (> 85% identity).

16
Phylogenetic Tree

(2) For each group, build the corresponding
phylogenetic tree.

17
Mutation Frequency

A->G, I->L, A->G, A->L, C->S, G->A
(3)
F_{G,A} = 3
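
The count on this slide can be reproduced with a short sketch. It assumes substitutions are counted without regard to direction (A to G and G to A fall in the same cell), which is what makes the G/A count come out to 3; the function name `mutation_frequencies` is illustrative.

```python
from collections import Counter


def mutation_frequencies(mutations):
    """Count substitutions per unordered residue pair."""
    counts = Counter()
    for source, destination in mutations:
        counts[frozenset((source, destination))] += 1
    return counts


# The six substitutions listed on the slide.
observed = [("A", "G"), ("I", "L"), ("A", "G"), ("A", "L"), ("C", "S"), ("G", "A")]
freq = mutation_frequencies(observed)
print(freq[frozenset("GA")])  # 3
```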

18
Relative Mutability

19
Mutation Probability

20
Odds Ratio

21
Log-Odds Ratio
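
The slide's own formula is not in the transcript. As a hedged illustration, the sketch below uses the generic log-odds form: the probability of seeing two residues aligned in related sequences, divided by the probability of seeing them aligned by chance, then logged, scaled, and rounded. The scale of (ln 2)/2 matches the unit given with the PAM 120 table in this deck; the probabilities in the example are made up.

```python
import math


def log_odds_score(p_ab, q_a, q_b, scale=math.log(2) / 2):
    """Rounded log-odds score for aligning residues a and b.

    p_ab: probability of a aligned with b in related sequences
    q_a, q_b: background frequencies of a and b
    scale: size of one score unit in nats
    """
    odds_ratio = p_ab / (q_a * q_b)
    return round(math.log(odds_ratio) / scale)


# Hypothetical numbers: a pair seen 8 times more often than chance.
print(log_odds_score(0.02, 0.05, 0.05))
# A pair seen 4 times less often than chance gets a negative score.
print(log_odds_score(0.000625, 0.05, 0.05))
```

Taking logarithms turns the product of per-column odds ratios into a sum, so an alignment score is simply the sum of the matrix entries for its columns.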

Which PAM matrix is the most appropriate?
It depends on the length of the sequences and on how closely the sequences are believed to be related.
-> PAM 120 for database searches
-> PAM 200 for comparing two specific proteins

23
2.4.2 BLOSUM Matrices

Henikoff & Henikoff (1992)
PAM-k: larger k, more divergent sequences
BLOSUM-k: larger k, more similar sequences
-> BLOSUM62 for ungapped matching
-> BLOSUM50 for gapped matching

24
2.5 Dynamic Programming

The Needleman and Wunsch Algorithm (Global Alignment)
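
A minimal sketch of the Needleman and Wunsch dynamic programming table and traceback follows. The scoring values (match 1, mismatch 0, gap -1) are assumptions chosen for illustration, and the linear gap penalty is the simplest case rather than the affine scheme from Section 2.3.1.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("ACTCG", "ACAGTAG")
print(best)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score, so the traceback returns one of them; which one depends on the order in which the three moves are tried.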

25
(No Transcript)
26
Alignment Graph
27
(No Transcript)
28
A C - - T C G
A C A G T A G
29
Complexity
30
2.6 Global and Local Alignments

Semi-global alignment
Local alignment

31
2.6.1 Semi-global Alignments

A A C A C G T G T C T
- - - A C G T - - - -
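
A semi-global alignment can be obtained from the global algorithm by not charging for gaps at either end. The sketch below does this by leaving the first row and column at zero and taking the best score in the last row or last column; the scoring values are assumptions.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Best alignment score when leading and trailing gaps are free."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero: leading gaps cost nothing.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trailing gaps cost nothing: take the best cell in the last row or column.
    return max(max(score[n]), max(row[m] for row in score))


# The example on the slide: ACGT sits inside the longer sequence.
print(semiglobal_score("AACACGTGTCT", "ACGT"))
```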

32
(No Transcript)
33
2.6.2 Local Alignment

The Smith-Waterman Alignment
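
A minimal sketch of Smith-Waterman local alignment follows. It differs from the global algorithm in two ways: no cell may drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values are assumptions (a negative mismatch score is needed for local alignment to be meaningful), and the example pair is borrowed from the FASTA slide later in this deck.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local_score, local_a, local_b = smith_waterman("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
print(local_score, local_a, local_b)
```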

34
(No Transcript)
35
2.7 Database Searches

BLAST and its relatives
FASTA and related algorithms

36
2.7.1 BLAST and Its Relatives

Program   Database                Query
BLASTN    Nucleotide              Nucleotide
BLASTP    Protein                 Protein
BLASTX    Protein                 Nucleotide -> Protein
TBLASTN   Nucleotide -> Protein   Protein
TBLASTX   Nucleotide -> Protein   Nucleotide -> Protein
37
BLASTP

Using PAM or BLOSUM matrices

38
2.7.2 FASTA and Related Algorithms

Dot plot band search
Preprocess the target sequence.
Identify the position of each word (for amino acids with word length 1, a 20-entry array).
Scan the query sequence.
Compute the shifts of the query needed to align each word with the target.
Find the mode of the shifts.
Join the possible shifts into one new target sequence. Perform the full local alignment algorithm.

Target  FAMLGFIKYLPGCM
Query   TGFIKYLPGACT
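
The word-lookup and shift-counting steps can be sketched as follows, assuming word length 1 and defining the shift as the target position minus the query position. The function names are illustrative, and the final joining and local alignment steps are left out.

```python
from collections import Counter, defaultdict


def word_positions(target, k=1):
    """Preprocess the target: map each k-letter word to its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def shift_counts(target, query, k=1):
    """Scan the query and count how often each shift aligns a word with the target."""
    index = word_positions(target, k)
    counts = Counter()
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            counts[i - j] += 1
    return counts


counts = shift_counts("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
best_shift, hits = counts.most_common(1)[0]
print(best_shift, hits)  # the mode of the shifts and how many words support it
```

For the example pair the mode is a shift of 3, supported by the eight residues of GFIKYLPG, so only a narrow band around that diagonal needs the full local alignment.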

40
2.7.3 Alignment Scores and Statistical Significance of Database Searches

related model vs. random model
S-score: the alignment score
E-score: expected number of sequences with score > S by random chance
P-score: probability that one or more sequences with score > S would be found randomly
-> Low E and P values are better.
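
The E-score and P-score are two views of the same quantity. Assuming the number of chance hits follows a Poisson distribution with mean E (the usual assumption behind BLAST statistics, not stated in the transcript), the probability of at least one chance hit is 1 - exp(-E). The function name below is illustrative.

```python
import math


def p_value_from_e(e_value):
    """Probability of one or more chance hits, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (0.001, 0.1, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For small values the two are nearly equal, which is why E and P are often used interchangeably for strong hits; for large E the P-score saturates near 1.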

length correction
Scores

42
PAM 120, scale = (ln 2)/2 nats

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   3  -3  -1   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0   0  -1  -1  -8
R  -3   6  -1  -3  -4   1  -3  -4   1  -2  -4   2  -1  -5  -1  -1  -2   1  -5  -3  -2  -1  -2  -8
N  -1  -1   4   2  -5   0   1   0   2  -2  -4   1  -3  -4  -2   1   0  -4  -2  -3   3   0  -1  -8
D   0  -3   2   5  -7   1   3   0   0  -3  -5  -1  -4  -7  -3   0  -1  -8  -5  -3   4   3  -2  -8
C  -3  -4  -5  -7   9  -7  -7  -4  -4  -3  -7  -7  -6  -6  -4   0  -3  -8  -1  -3  -6  -7  -4  -8
Q  -1   1   0   1  -7   6   2  -3   3  -3  -2   0  -1  -6   0  -2  -2  -6  -5  -3   0   4  -1  -8
E   0  -3   1   3  -7   2   5  -1  -1  -3  -4  -1  -3  -7  -2  -1  -2  -8  -5  -3   3   4  -1  -8
G   1  -4   0   0  -4  -3  -1   5  -4  -4  -5  -3  -4  -5  -2   1  -1  -8  -6  -2   0  -2  -2  -8
H  -3   1   2   0  -4   3  -1  -4   7  -4  -3  -2  -4  -3  -1  -2  -3  -3  -1  -3   1   1  -2  -8
I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6   1  -3   1   0  -3  -2   0  -6  -2   3  -3  -3  -1  -8
L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5  -4   3   0  -3  -4  -3  -3  -2   1  -4  -3  -2  -8
K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5   0  -7  -2  -1  -1  -5  -5  -4   0  -1  -2  -8
M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8  -1  -3  -2  -1  -6  -4   1  -4  -2  -2  -8
F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8  -5  -3  -4  -1   4  -3  -5  -6  -3  -8
P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6   1  -1  -7  -6  -2  -2  -1  -2  -8
S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3   2  -2  -3  -2   0  -1  -1  -8
T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4  -6  -3   0   0  -2  -1  -8
W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12  -2  -8  -6  -7  -5  -8

43
Applications

Reconstructing long sequences of DNA from overlapping sequence fragments
Determining physical and genetic maps from probe data under various experimental protocols
Database searching
Comparing two or more sequences for similarities
Protein structure prediction (building profiles)
Comparing the same gene sequenced by two different labs

45
2.8 Multiple Sequence Alignments

CLUSTAL
D. G. Higgins & P. M. Sharp, 1988
CLUSTALW
Sequences are weighted according to how divergent they are from the most closely related pair of sequences.
Gaps are weighted for different sequences.

46
Summary

the notion of similarity
the scoring system used to rank alignments
the algorithms used to find the optimal-scoring alignment
the statistical method used to evaluate the significance of an alignment score

47
References

Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.
BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003.
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998.
Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.