BCB 444/544

Provided by: dobbslabG

1
BCB 444/544
Lecture 8: Substitution Matrices & BLAST (Sep 12)
  • Thanks to Drena Dobbs (ISU) for many borrowed &
    modified PPTs

2
Required Reading (before lecture)
  • Fri Sep 12 - for Lecture 8
  • Chp 4
  • Mon Sep 15 - for Lecture 9
  • Chp 4

3
Homework Assignment 2
  • Posted on the course webpage
  • Due today

4
PAM Matrix: Point Accepted Mutation
  • Relies on "evolutionary model" based on observed
    differences in closely related proteins
    (Dayhoff, 1978)
  • Model includes defined rate for each type of
    sequence change
  • Suffix number (n) reflects amount of "time"
    passed
  • = rate of expected mutation if n% of amino acids
    had changed
  • e.g., PAM1 matrix estimates what rate of
    substitution would be expected if 1% of the amino
    acids had changed
  • PAM1 matrix is used as basis for calculating
    other matrices; this assumes that repeated
    mutations would follow same pattern as those in
    PAM1 matrix, and that multiple substitutions can
    occur at the same site
  • PAM1 - for less divergent sequences (shorter
    time)
  • PAM250 - for more divergent sequences (longer
    time)
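
The extrapolation from PAM1 to higher PAM numbers can be sketched as repeated matrix multiplication. The 3-letter alphabet and the probabilities in `TOY_PAM1` are invented for illustration (the real PAM1 is a 20 x 20 matrix built from Dayhoff's data), and the helper names are hypothetical; only the multiplication step reflects the idea on the slide.

```python
# Toy sketch: PAM-n mutation probabilities as the n-th power of PAM1.
# TOY_PAM1 is a hypothetical 3-letter example, NOT Dayhoff's real data.
# Entry [i][j] = probability that residue i is replaced by residue j
# after one PAM unit (roughly 1% of positions changed).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def pam_n(pam1, n):
    """Return pam1 raised to the n-th power (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result

pam250 = pam_n(TOY_PAM1, 250)
for row in pam250:
    print([round(p, 3) for p in row])
```

In this toy case the diagonal (probability of no net change) shrinks as n grows, which is why higher PAM numbers suit more divergent sequences. Scoring matrices are then obtained by converting such probabilities to log-odds scores.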

5
BLOSUM: BLOck SUbstitution Matrix
  • Based on aa substitutions observed in blocks of
    conserved sequences within evolutionarily
    divergent proteins (in BLOCKS database)
    (Henikoff & Henikoff, 1992)
  • Doesn't rely on a specific evolutionary model
  • Suffix number (n) reflects expected similarity
  • = avg % aa identity in MSA from which matrix was
    generated
  • e.g., BLOSUM62 is derived from sequence
    alignments of proteins with no more than 62%
    identity
  • Blocks database contains ungapped aligned
    segments corresponding to the most highly
    conserved regions of proteins
  • BLOSUM45 - for more divergent sequences
  • BLOSUM62 - for less divergent sequences

6
(No Transcript)
7
Scoring Matrices: What are the scores?
  • See Xiong Textbook
  • Fig 3.5 PAM250
  • Fig 3.6 BLOSUM62
  • Usually only 1/2 of matrix is displayed (it is
    symmetric)
  • s(a,b) corresponds to score of aligning character
    a with character b
  • These are log-odds scores
  • each entry =
    log(freq(observed) / freq(expected))
  • + → more likely than random
  • 0 → at random base rate
  • - → less likely than random
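
A minimal sketch of the log-odds calculation, to show how the sign of a score arises. The frequencies are made up, and the choice of base 2 with no scaling or rounding is an assumption (published matrices scale and round their scores).

```python
import math

def log_odds(freq_observed, freq_expected, base=2.0):
    """Log-odds score: log(observed / expected) in the given base."""
    return math.log(freq_observed / freq_expected, base)

# Hypothetical frequencies, chosen only to show the three cases.
print(log_odds(0.04, 0.01))    # positive: pair seen more often than chance
print(log_odds(0.01, 0.01))    # zero: pair seen at the random base rate
print(log_odds(0.0025, 0.01))  # negative: pair seen less often than chance
```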

8
Which is Better? PAM or BLOSUM
  • PAM matrices
  • derived from evolutionary model
  • often used in reconstructing phylogenetic trees -
    but, not very good for highly divergent sequences
  • BLOSUM matrices
  • based on direct observations
  • more "realistic" - and outperform PAM matrices in
    terms of accuracy in local alignment

9
How Should Gaps be Scored?
  • So far, we've used a simple linear gap penalty
    function
  • Gap of length k incurs penalty = k × δ
    (δ = penalty per gap position)
  • However, in biological sequences, gaps often
    occur in clusters
  • AGKLAVRSTMIESTRVILTWRKW
  • AGKLAVRS------RVILTWRKW
  • More realistic? "Affine" gap penalty: penalty for
    one long gap is smaller than penalty for many
    smaller gaps that add up to same size
  • w(k) = γ + (k - 1) × δ
    (γ = gap opening penalty, δ = gap extension
    penalty)
10
Affine Gap Penalty Functions
  • Affine Gap Penalties = Differential Gap
    Penalties, used to reflect cost differences
    between opening a gap and extending an existing
    gap
  • Total Gap Penalty is function of gap length
  • W = γ + δ × (k - 1)
  • where γ = gap opening penalty
  • δ = gap extension penalty
  • k = length of gap
  • Sometimes, a Constant Gap Penalty is used, but it
    is usually less realistic than the Affine Gap
    Penalty
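
The two penalty functions can be compared with a short sketch. The function names and the numeric values (opening = 10, extension = 2) are illustrative assumptions, not settings from any particular program.

```python
def linear_gap_penalty(k, per_position):
    """Linear penalty: every gap position costs the same."""
    return k * per_position

def affine_gap_penalty(k, gap_open, gap_extend):
    """Affine penalty: gap_open + gap_extend * (k - 1) for a gap of length k."""
    if k < 1:
        return 0
    return gap_open + gap_extend * (k - 1)

# Illustrative values (assumed): opening = 10, extension = 2.
one_long_gap = affine_gap_penalty(6, 10, 2)        # 10 + 2 * 5
six_short_gaps = 6 * affine_gap_penalty(1, 10, 2)  # 6 * 10
print(one_long_gap, six_short_gaps)
```

With these values one gap of length 6 is penalized far less than six separate gaps of length 1, which is the behavior the affine model is meant to capture.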

11
Calculating an Alignment Score using a
Substitution Matrix & an Affine Gap Penalty
  • Alignment score is sum of all match/mismatch
    scores (from substitution matrix) with an affine
    penalty subtracted for each gap
  • Example:
      a b c - - d
      a c c e f d
      9 2 7     6   (values from substitution matrix)
  • Match score: 9 + 2 + 7 + 6 = 24
  • Gap penalty (opening + extension): 10 + 2 = 12
  • Alignment score: 24 - (10 + 2) = 12
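
The slide's worked example can be reproduced with a small scoring function. This is a sketch under stated assumptions: `SUBST` holds only the four pair scores shown on the slide, the function name is hypothetical, and adjacent gap positions are treated as one gap even if they switch between the two sequences.

```python
def alignment_score(top, bottom, subst, gap_open, gap_extend):
    """Score two equal-length gapped strings with an affine gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First position of a gap pays the opening penalty,
            # each further position pays the extension penalty.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += subst[(x, y)]
            in_gap = False
    return score

# Pair scores taken from the slide's example only (not a full matrix).
SUBST = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}

print(alignment_score("abc--d", "accefd", SUBST, gap_open=10, gap_extend=2))  # 12
```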
12
Parameter Selection in Sequence Alignment
  • Optimal alignment between a pair of sequences
    depends critically on the selection of
    substitution matrix & gap penalty function
  • In using alignment software, it is important to
    understand and, sometimes, to adjust these
    parameters (default is NOT always best!)
  • How do we pick parameters that give the most
    biologically meaningful alignments and alignment
    scores?

13
How Do We Assess the Statistical Significance of
an Alignment?
  • Compare score of an alignment with distribution
    of scores of alignments for many 'randomized'
    (shuffled) versions of the original sequence
  • If score is in extreme margin, then unlikely due
    to random chance
  • P-value = probability of obtaining a score at
    least as high as the original alignment's by
    random chance (lower P is better)
  • P = 10^-5 to 10^-50: sequences have clear homology
  • P > 10^-1: no better than random

Check out PRSS (Probability of Random Shuffles):
http://www.ch.embnet.org/software/PRSS_form.html
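
The shuffling idea can be sketched as an empirical P-value. This is not how PRSS is implemented; the scoring scheme (match = 2, mismatch = -1, linear gap = -2), the number of shuffles, and the +1 correction are all assumptions made to keep the example short. The two sequences are the gap example from slide 9, with the gap characters removed from the second.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_p_value(query, target, n_shuffles=200, seed=0):
    """Empirical P: fraction of shuffled targets scoring >= the real score."""
    rng = random.Random(seed)
    real = local_score(query, target)
    letters = list(target)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if local_score(query, "".join(letters)) >= real:
            hits += 1
    # Add 1 to numerator and denominator so P is never reported as exactly 0.
    return (hits + 1) / (n_shuffles + 1)

seq1 = "AGKLAVRSTMIESTRVILTWRKW"
seq2 = "AGKLAVRSRVILTWRKW"
p_value = shuffle_p_value(seq1, seq2)
print(p_value)
```

Shuffling keeps the length and composition of the target but destroys residue order, so the shuffled scores approximate what chance alone produces for sequences like these.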
14
Chp 4 - Database Similarity Searching
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong Chp 4: Database Similarity Searching
  • Unique Requirements of Database Searching
  • Heuristic Database Searching
  • Basic Local Alignment Search Tool (BLAST)
  • FASTA
  • Comparison of FASTA and BLAST
  • Database Searching with Smith-Waterman Method

15
Database searching
[Figure: a query sequence and a sequence database are
passed to a sequence comparison algorithm, which returns
target sequences ranked by score]
16
Why search a database?
  • Given a newly discovered gene,
  • Does it occur in other species?
  • Is its function known in another species?
  • Given a newly sequenced genome, which regions
    align with genomes of other organisms?
  • Identification of potential genes
  • Identification of other functional parts of
    chromosomes
  • Find members of a multigene family

17
Recall: There are 3 Basic Types of Alignment
Algorithms
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong Chp 3: 1) Dot Matrix
  • 2) Dynamic Programming
  • Xiong Chp 4: 3) Word or k-tuple methods
    (BLAST & FASTA)
  • From Wikipedia:
  • Word methods, also known as k-tuple methods, are
    heuristic methods that are not guaranteed to find
    an optimal alignment solution, but are
    significantly more efficient than dynamic
    programming.

18
Exhaustive vs Heuristic Methods
  • Exhaustive - tests every possible solution
  • guaranteed to give best answer
  • (identifies optimal solution)
  • can be very time/space intensive!
  • e.g., Dynamic Programming
  • (as in Smith-Waterman algorithm)
  • Heuristic - does NOT test every possibility
  • no guarantee that answer is best
  • (but, often can identify optimal
    solution)
  • sacrifices accuracy (potentially) for speed
  • uses "rules of thumb" or "shortcuts"
  • e.g., BLAST & FASTA

19
Why do we Need Fast Search Algorithms?
  • Your query is 200 amino acids long (N)
  • You are searching a non-redundant database, which
    currently contains > 10^6 proteins (K)
  • If proteins in database have avg length 200 aa
    (M), then
  • Must fill in 200 × 200 × 10^6 = 4 × 10^10 DP
    entries!!
  • 4 × 10^10 operations just to fill in the DP
    matrix!
  • DP for pairwise alignment is O(NM)
  • Searching in a database is O(NMK)
  • Need faster algorithms for searching in large
    databases!
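
The arithmetic above can be checked in a few lines. The throughput figure is purely an assumption for illustration; real speed depends heavily on hardware and implementation.

```python
def dp_cells(query_len, avg_target_len, num_targets):
    """Number of DP matrix entries for an exhaustive search: N * M * K."""
    return query_len * avg_target_len * num_targets

cells = dp_cells(200, 200, 10**6)
print(f"{cells:.1e} DP entries")

# Assumed throughput for illustration only: 10**8 cell updates per second.
ASSUMED_CELLS_PER_SECOND = 10**8
print(f"about {cells / ASSUMED_CELLS_PER_SECOND:.0f} seconds per query")
```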

20
BLAST - Statistical Significance?
  • 1. E-value: E = m × n × P
  • m = total number of residues in database
  • n = number of residues in query sequence
  • P = probability that an HSP is result of random
    chance
  • lower E-value, less likely to result from random
    chance, thus higher significance
  • 2. Bit Score: S'
  • normalized score, to account for differences in
    size of database (m) & sequence length (n) - more
    later
  • 3. Low Complexity Masking
  • remove repeats that confound scoring
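
A minimal sketch of the E-value formula as stated on this slide. The numbers are hypothetical: 2 x 10^8 database residues corresponds to 10^6 proteins of average length 200, and P = 10^-12 is picked arbitrarily.

```python
def e_value(db_residues, query_length, p_hsp):
    """E = m * n * P, using the slide's definitions of m, n and P."""
    return db_residues * query_length * p_hsp

# Hypothetical inputs: m = 2e8 residues, n = 200 residues, P = 1e-12.
print(e_value(2e8, 200, 1e-12))

# Same hit against a database twice as large: E doubles.
print(e_value(4e8, 200, 1e-12))
```

Because E scales with database and query size, the same alignment becomes less significant as the search space grows.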

21
BLAST algorithms can generate both "global" and
"local" alignments
[Figure: example of a global alignment vs. a local alignment]
22
BLAST - a Family of Programs: Different BLAST
"flavors"
  • BLASTP - protein sequence query against protein
    DB
  • BLASTN - DNA/RNA seq query against DNA DB
    (GenBank)
  • BLASTX - 6-frame translated DNA seq query against
    protein DB
  • TBLASTN - protein query against 6-frame DNA
    translation
  • TBLASTX - 6-frame DNA query to 6-frame DNA
    translation
  • PSI-BLAST - protein "profile" query against
    protein DB
  • PHI-BLAST - protein pattern against protein DB
  • Newest: MEGA-BLAST - optimized for highly similar
    sequences
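
Several of the flavors above rely on a "6-frame translation": the DNA is translated in three reading frames on each strand. The sketch below uses the standard genetic code; the function names are hypothetical, and it is meant to illustrate the concept, not how BLASTX or TBLASTN are implemented.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating codons in TCAG order.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return the 3 forward-strand and 3 reverse-strand translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]

print(six_frame_translation("ATGGCC"))
```

Each of the six protein strings can then be compared against a protein database (or a protein query compared against them), which is why translated searches cost more than BLASTP or BLASTN.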

Which tool should you use?
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
23
BLAST: Basic Local Alignment Search Tool
  • STEPS
  • Create list of every possible "word" (e.g., 3-11
    letters) from query sequence
  • Search database to identify sequences that
    contain matching words
  • Score match of word with sequence, using a
    substitution matrix
  • Extend match (seed) in both directions, while
    calculating alignment score at each step
  • Continue extension until score drops below a
    threshold (due to mismatches)
  • High Scoring Segment Pair (HSP) - contiguous
    aligned segment pair (no gaps)
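
The steps above can be sketched as a toy seed-and-extend search against a single target sequence. This is a simplification under several assumptions: seeds are exact word matches (no substitution-matrix scoring of words), scoring is a flat match = 2 / mismatch = -1, and extension stops once the score falls more than `x_drop` below the best seen so far. All names and parameter values are hypothetical.

```python
def find_seeds(query, target, word_size):
    """Yield (query_pos, target_pos) for every exact shared word."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(target) - word_size + 1):
        for i in index.get(target[j:j + word_size], []):
            yield i, j

def extend_one_way(query, target, i, j, step, match, mismatch, x_drop):
    """Extend without gaps in one direction; return (best gain, best length)."""
    score = best = length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        score += match if query[i] == target[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, best_len

def best_hsp(query, target, word_size=3, match=2, mismatch=-1, x_drop=3):
    """Return (score, query_segment, target_segment) of the best ungapped HSP."""
    best = (0, "", "")
    for i, j in find_seeds(query, target, word_size):
        right, r_len = extend_one_way(query, target, i + word_size,
                                      j + word_size, 1, match, mismatch, x_drop)
        left, l_len = extend_one_way(query, target, i - 1, j - 1,
                                     -1, match, mismatch, x_drop)
        score = word_size * match + left + right
        if score > best[0]:
            best = (score,
                    query[i - l_len:i + word_size + r_len],
                    target[j - l_len:j + word_size + r_len])
    return best

print(best_hsp("AGKLAVRSTMIESTRVILTWRKW", "MMMRVILTWRKWQQQ"))
```

Only diagonals that contain a shared word are ever examined, which is where the speed-up over filling a full DP matrix comes from, and also why an alignment with no exact shared word would be missed by this sketch.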