Chapter 2 Data Searches and Pairwise Alignments

Provided by: Shie91
1
Chapter 2 Data Searches and Pairwise Alignments
  • 2004/03/08

2
Introduction
  • What is the difference between acctga and agcta?

a c c t g a
a g c t g a
a g c t - a
3
Nomenclature
4
2.1 Dot Plots
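
A dot plot can be sketched in a few lines. The following is a minimal illustration, not the slide's own figure: it marks every position where a window of one sequence equals a window of the other, using the two sequences from the introduction. The function name `dot_plot` and the `window` parameter are assumptions made for this example.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return the rows of a text dot plot.

    Row i, column j holds '*' when the window starting at seq_y[i]
    equals the window starting at seq_x[j], and '.' otherwise.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# Columns follow "acctga", rows follow "agcta".
plot = dot_plot("acctga", "agcta")
for line in plot:
    print(line)
```

Diagonal runs of `*` correspond to stretches that align without gaps; a larger `window` filters out isolated chance matches.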
5
2.2 Simple Alignments
  • No gap

6
  • mutation (substitution): common
  • insertion, deletion (gap, indel): rare
  • scoring scheme
  • match score
  • mismatch score

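
As a rough sketch of a simple (gap-free) alignment, the code below scores two sequences column by column and slides one along the other to find the best offset. The score values (match = 1, mismatch = 0) and both function names are assumptions chosen for illustration; the slides do not give specific numbers.

```python
def ungapped_score(a, b, match=1, mismatch=0):
    """Score two equal-length sequences column by column (no gaps)."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def best_ungapped_offset(a, b, match=1, mismatch=0):
    """Slide b along a; return (best score, offset of b relative to a)."""
    best = None
    for offset in range(-(len(b) - 1), len(a)):
        start = max(0, offset)              # first overlapping index in a
        end = min(len(a), offset + len(b))  # one past the last overlapping index
        score = ungapped_score(a[start:end],
                               b[start - offset:end - offset],
                               match, mismatch)
        if best is None or score > best[0]:
            best = (score, offset)
    return best


result = best_ungapped_offset("acctga", "agcta")
print(result)
```

With these assumed scores, the best placement of agcta against acctga is offset 0 with three matching columns.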
7
2.3 Gaps
8
2.3.1 Gap Penalty
  • uniform gap
  • affine gap
  • origination penalty
  • length penalty
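
The two penalty schemes can be compared with a small sketch. It assumes the affine penalty is the origination penalty plus the length penalty times the gap length; some texts charge the length penalty only for positions after the first. The numeric values are placeholders, not values from the slides.

```python
def uniform_gap_penalty(length, per_position=2):
    """Every gapped position costs the same amount."""
    return per_position * length


def affine_gap_penalty(length, origination=4, length_penalty=1):
    """Opening a gap costs `origination`; each gapped position adds `length_penalty`."""
    if length == 0:
        return 0
    return origination + length_penalty * length


# One gap of length 3 versus three separate gaps of length 1.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

Under the affine scheme a single long gap is cheaper than several short ones, which matches the view that one insertion or deletion event can span many residues.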

9
2.4 Scoring Matrices
10
  • Modeling

11
Modeling
12
(No Transcript)
13
Define the odds ratio as
14
2.4.1 PAM Matrices
  • Dayhoff, Schwartz, Orcutt (1978)
  • Point Accepted Mutation
  • Based on observed substitution rates
  • (Box 2.1)
  • Input
  • A set of observed substitution rates
  • Output
  • PAM-1 matrix (log-odds matrix)

15
Multiple Alignment
  • (1) Group the sequences with high similarity
    (> 85% identity).

16
Phylogenetic Tree
  • (2) For each group, build the corresponding
    phylogenetic tree.

17
Mutation Frequency
  • A->G, I->L, A->G, A->L, C->S, G->A
  • (3) Count how often each pair of residues is
    exchanged, e.g. F(G,A) = 3 (A->G twice and G->A
    once).

18
Relative Mutability
  • (4)

19
Mutation Probability
  • (5)

20
Odds Ratio
  • (6)

21
Log-Odds Ratio
  • (7)
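
The formulas for steps (4) to (7) did not survive in this transcript, so the following is only a sketch under commonly used definitions: the odds ratio is taken as M[i][j] / f[j], the log-odds score as 10 * log10 of that ratio rounded to an integer, and a PAM-k probability matrix as the PAM-1 probability matrix multiplied by itself k times. The two-letter alphabet, the probabilities and the frequencies are invented toy values, not Dayhoff's data.

```python
import math

ALPHABET = "AG"               # toy alphabet (assumption)
FREQS = [0.5, 0.5]            # background frequency of each letter (assumption)
PAM1 = [[0.99, 0.01],         # PAM1[i][j]: probability that letter i
        [0.01, 0.99]]         # is replaced by letter j in one PAM unit


def mat_mul(x, y):
    """Multiply two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(pam1, k):
    """Mutation probabilities after k PAM units (k >= 1)."""
    result = pam1
    for _ in range(k - 1):
        result = mat_mul(result, pam1)
    return result


def log_odds(prob, freqs):
    """Rounded log-odds scores: 10 * log10(prob[i][j] / freqs[j])."""
    n = len(prob)
    return [[round(10 * math.log10(prob[i][j] / freqs[j])) for j in range(n)]
            for i in range(n)]


pam1_scores = log_odds(PAM1, FREQS)
pam2 = pam_k(PAM1, 2)
print(pam1_scores)
```

Positive scores mark pairs seen more often than chance would predict; negative scores mark pairs seen less often.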

22
  • Which PAM matrix is the most appropriate?
  • The length of the sequences
  • How closely the sequences are believed to be
    related
  • → PAM 120 for database searches
  • → PAM 200 for comparing two specific proteins

23
2.4.2 BLOSUM Matrices
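  • Henikoff & Henikoff (1992)
  • PAM-k: the larger k is, the more divergent the
    sequences it suits
  • BLOSUM-k: the larger k is, the more similar the
    sequences it suits
  • → BLOSUM62 for ungapped matching
  • → BLOSUM50 for gapped matching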

24
2.5 Dynamic Programming
  • The Needleman and Wunsch Algorithm (Global
    Alignment)
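
A minimal sketch of Needleman-Wunsch global alignment follows. It uses a linear gap penalty, and the scores (match = 1, mismatch = -1, gap = -2) are assumptions for illustration rather than values from the slides. The table has (m + 1) x (n + 1) cells, each filled in constant time, so both time and space are O(mn).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):            # a prefix of a aligned to gaps only
        score[i][0] = i * gap
    for j in range(1, cols):            # a prefix of b aligned to gaps only
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, top, bottom = needleman_wunsch("acctga", "agcta")
print(nw_score)
print(top)
print(bottom)
```

For the two sequences from the introduction this recovers the alignment acctga over agct-a, with one substitution and one gap.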

25
(No Transcript)
26
Alignment Graph
27
(No Transcript)
28
A C - - T C G
A C A G T A G
29
Complexity
30
2.6 Global and Local Alignments
  • Semi-global alignment
  • Local alignment

31
2.6.1 Semi-global Alignments
  • A A C A C G T G T C T
  • - - - A C G T - - - -
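
One common way to obtain a semi-global alignment is to leave gaps at either end unpenalized: the first row and column of the table start at zero, and the final score is the best value in the last row or last column. The sketch below takes that approach with assumed scores (match = 1, mismatch = -1, gap = -2) and returns only the score.

```python
def semi_global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score when leading and trailing gaps are free."""
    rows, cols = len(a) + 1, len(b) + 1
    # Zero-filled first row and column: leading gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing gaps cost nothing: take the best cell on the bottom or right edge.
    return max(max(score[-1]), max(row[-1] for row in score))


sg = semi_global_score("AACACGTGTCT", "ACGT")
print(sg)
```

For the example above, ACGT matches four consecutive positions of the longer sequence, so the score is 4; a global alignment with the same scores would also pay for the seven end gaps.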

32
(No Transcript)
33
2.6.2 Local Alignment
  • The Smith-Waterman Alignment
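
A minimal sketch of the Smith-Waterman recurrence: it is the global recurrence with one extra choice, zero, so an alignment may start anywhere, and the result is the largest cell in the table. The scores (match = 1, mismatch = -1, gap = -2) are assumed values, and only the score is returned.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at this cell.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


sw = smith_waterman_score("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
print(sw)
```

Here the best local alignment is the shared segment GFIKYLPG, which scores 8 under the assumed values.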

34
(No Transcript)
35
2.7 Database Searches
  • BLAST and its relatives
  • FASTA and related algorithms

36
2.7.1 BLAST and Its Relatives
Program   Database               Query
BLASTN    Nucleotide             Nucleotide
BLASTP    Protein                Protein
BLASTX    Protein                Nucleotide → Protein
TBLASTN   Nucleotide → Protein   Protein
TBLASTX   Nucleotide → Protein   Nucleotide → Protein
37
BLASTP
  • Using PAM or BLOSUM matrices

38
2.7.2 FASTA and Related Algorithms
  • Dot plot band search
  • Preprocess the target sequence.
  • Identify the positions of each word.
  • (For amino acids with word length 1, a 20-entry
    array.)
  • Scan the query sequence.
  • Compute the shifts of the query that align each
    word with the target.
  • Find the mode (the most frequent value) of the
    shifts.
  • Join the possible shifts into one new target
    sequence. Perform the full local alignment
    algorithm.

39
  • Target: FAMLGFIKYLPGCM
  • Query: TGFIKYLPGACT
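
The word lookup and shift counting described for FASTA can be sketched on this target and query. This covers only the first stages (index the target, scan the query, find the mode of the shifts), not the joining or final local alignment steps. The function names are made up for this example, and a shift is defined here as target position minus query position.

```python
from collections import Counter, defaultdict


def word_positions(target, word_len=1):
    """Map each word of the target to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(target) - word_len + 1):
        positions[target[i:i + word_len]].append(i)
    return positions


def shift_counts(target, query, word_len=1):
    """Count, for each shift, how many query words it aligns with the target."""
    positions = word_positions(target, word_len)
    counts = Counter()
    for q in range(len(query) - word_len + 1):
        for t in positions.get(query[q:q + word_len], []):
            counts[t - q] += 1
    return counts


counts = shift_counts("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
best_shift, hits = counts.most_common(1)[0]
print(best_shift, hits)
```

The mode is a shift of 3 with eight matching words, which corresponds to the diagonal carrying the shared segment GFIKYLPG.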

40
2.7.3 Alignment Scores and Statistical
Significance of Database Searches
  • related model vs. random model
  • S-score: the alignment score
  • E-score: the expected number of sequences with a
    score > S by random chance
  • P-score: the probability that one or more
    sequences with a score > S would be found randomly
  • → Lower E and P are better.
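
The relation between the three quantities can be illustrated with the Karlin-Altschul form, which is assumed here because the transcript does not show the slide's formulas: E = K * m * n * exp(-lambda * S), and P = 1 - exp(-E) if the number of chance hits is taken to be Poisson distributed. The values of K and lambda below are placeholders; real values depend on the scoring matrix and gap penalties.

```python
import math


def e_value(score, query_len, db_len, k=0.1, lam=0.3):
    """Expected number of chance hits with a score of at least `score`.

    k and lam are placeholder parameters, not values from the slides.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, assuming a Poisson count."""
    return -math.expm1(-e)   # same as 1 - exp(-e), but accurate for small e


e_low = e_value(score=40, query_len=200, db_len=1_000_000)
e_high = e_value(score=80, query_len=200, db_len=1_000_000)
print(e_low, p_value(e_low))
print(e_high, p_value(e_high))
```

For small E the two numbers are nearly equal, so an E-score of 0.01 corresponds to a P-score of about 0.00995.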

41
  • length correction
  • Scores

42
PAM 120, scale (ln 2)/2 nats
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   3  -3  -1   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0   0  -1  -1  -8
R  -3   6  -1  -3  -4   1  -3  -4   1  -2  -4   2  -1  -5  -1  -1  -2   1  -5  -3  -2  -1  -2  -8
N  -1  -1   4   2  -5   0   1   0   2  -2  -4   1  -3  -4  -2   1   0  -4  -2  -3   3   0  -1  -8
D   0  -3   2   5  -7   1   3   0   0  -3  -5  -1  -4  -7  -3   0  -1  -8  -5  -3   4   3  -2  -8
C  -3  -4  -5  -7   9  -7  -7  -4  -4  -3  -7  -7  -6  -6  -4   0  -3  -8  -1  -3  -6  -7  -4  -8
Q  -1   1   0   1  -7   6   2  -3   3  -3  -2   0  -1  -6   0  -2  -2  -6  -5  -3   0   4  -1  -8
E   0  -3   1   3  -7   2   5  -1  -1  -3  -4  -1  -3  -7  -2  -1  -2  -8  -5  -3   3   4  -1  -8
G   1  -4   0   0  -4  -3  -1   5  -4  -4  -5  -3  -4  -5  -2   1  -1  -8  -6  -2   0  -2  -2  -8
H  -3   1   2   0  -4   3  -1  -4   7  -4  -3  -2  -4  -3  -1  -2  -3  -3  -1  -3   1   1  -2  -8
I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6   1  -3   1   0  -3  -2   0  -6  -2   3  -3  -3  -1  -8
L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5  -4   3   0  -3  -4  -3  -3  -2   1  -4  -3  -2  -8
K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5   0  -7  -2  -1  -1  -5  -5  -4   0  -1  -2  -8
M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8  -1  -3  -2  -1  -6  -4   1  -4  -2  -2  -8
F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8  -5  -3  -4  -1   4  -3  -5  -6  -3  -8
P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6   1  -1  -7  -6  -2  -2  -1  -2  -8
S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3   2  -2  -3  -2   0  -1  -1  -8
T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4  -6  -3   0   0  -2  -1  -8
W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12  -2  -8  -6  -7  -5  -8
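
As a small worked example of reading the matrix, the sketch below scores an ungapped pair of peptides using four entries copied from the PAM 120 table above, then converts the total to bits. A scale of (ln 2)/2 nats per unit means each unit is half a bit. The peptides GFIK and GFLK are chosen only for illustration.

```python
# Four entries copied from the PAM 120 table: (row, column) -> score.
PAM120_SUBSET = {
    ("G", "G"): 5,
    ("F", "F"): 8,
    ("I", "L"): 1,
    ("K", "K"): 5,
}


def ungapped_matrix_score(a, b, matrix):
    """Sum the matrix entries for each aligned pair of residues."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(x, y)] for x, y in zip(a, b))


raw = ungapped_matrix_score("GFIK", "GFLK", PAM120_SUBSET)
bits = raw / 2          # (ln 2)/2 nats per unit = half a bit per unit
print(raw, bits)
```

The raw score is 5 + 8 + 1 + 5 = 19 units, or 9.5 bits.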

43
Applications
  • Reconstructing long sequences of DNA from
    overlapping sequence fragments
  • Determining physical and genetic maps from probe
    data under various experiment protocols
  • Database searching
  • Comparing two or more sequences for similarities

44
  • Protein structure prediction (building profiles)
  • Comparing the same gene sequenced by two
    different labs

45
2.8 Multiple Sequence Alignments
  • CLUSTAL
  • D. G. Higgins and P. M. Sharp, 1988
  • CLUSTALW
  • Sequences are weighted according to how divergent
    they are from the most closely related pair of
    sequences.
  • Gaps are weighted for different sequences.

46
Summary
  • notion of similarity
  • the scoring system used to rank alignments
  • the algorithms used to find optimal scoring
    alignment
  • the statistical method used to evaluate the
    significance of an alignment score

47
References
  • Fundamental Concepts of Bioinformatics, by Dan E.
    Krane and Michael L. Raymer, Benjamin/Cummings,
    2003.
  • BLAST, by I. Korf, M. Yandell, and J. Bedell,
    O'Reilly & Associates, 2003.
  • Biological Sequence Analysis: Probabilistic
    Models of Proteins and Nucleic Acids, by R. Durbin,
    S. Eddy, A. Krogh, and G. Mitchison, Cambridge
    University Press, 1998.
  • Biochemistry, by J. M. Berg, J. L. Tymoczko, and
    L. Stryer, Fifth Edition, 2001.