Accelerating sequence alignment and homology search algorithms

Slides: 15
Provided by: ianho9

Transcript and Presenter's Notes

1
Accelerating sequence alignment and homology
search algorithms
  • BioE131/231

2
DP Accelerator Hardware
3
Faster alignment
  • Smith-Waterman, Gotoh, etc., guarantee the
    best-scoring alignment
  • However, they are too slow to be practical for
    large database searches (except with expensive
    hardware)
  • Subsequent algorithms attempted to provide
    approximate solutions
  • FASTA
  • BLAST

4
FASTA (1988)
  • Acceleration of the Smith-Waterman algorithm
  • Identify strong diagonals in dotplots
  • Use exact word hits to find these diagonals
  • Rescore best N diagonals with approximate word
    hits
  • Try to join together good diagonals
  • For sequence-pairs that score well under these
    criteria, do full SW alignment
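
A minimal sketch of the first FASTA step (finding strong diagonals from exact word hits). The function names, the word length k, and the cutoff n are illustrative assumptions, not FASTA's actual code or parameters. A word at position i in x and position j in y lies on diagonal i - j, so diagonals with many hits are candidates for rescoring.

```python
from collections import Counter, defaultdict


def diagonal_hit_counts(x, y, k=2):
    """Count exact k-mer hits between x and y on each diagonal (i - j)."""
    # Index every k-mer of y by its start positions.
    positions = defaultdict(list)
    for j in range(len(y) - k + 1):
        positions[y[j:j + k]].append(j)

    # Look up every k-mer of x and tally the diagonal of each hit.
    counts = Counter()
    for i in range(len(x) - k + 1):
        for j in positions.get(x[i:i + k], ()):
            counts[i - j] += 1
    return counts


def best_diagonals(x, y, k=2, n=3):
    """Return the n diagonals with the most exact word hits."""
    return diagonal_hit_counts(x, y, k).most_common(n)


if __name__ == "__main__":
    print(best_diagonals("GATTACA", "TTGATTACA", k=3, n=1))
```

Only the sequence pairs whose best diagonals score well would then go on to the rescoring, joining, and full SW steps listed above.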

5
Homology Search
  • Good alignments have ungapped diagonal runs
  • Such runs contain conserved words
  • BLAST: lookup table of high-scoring words

6
BLAST (1990)
  • Basic Local Alignment Search Tool
  • Altschul, Gish, Miller, Myers, Lipman
  • Motivation: FASTA still too slow
  • Idea: find fewer, but better, seed hits
  • Uses longer word sizes
  • Allows approximate word hits in initial search
    phase

7
BLAST terminology
  • Segment pair: pair of equal-length substrings of
    sequences X and Y
  • Locally maximal segment pair: segment pair whose
    score cannot be improved by extending it
  • Maximum segment pair (MSP): segment pair with
    the highest score of all segment pairs between
    X and Y
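
To make the MSP definition concrete, here is a brute-force sketch that computes the MSP score of two sequences. The +1/-1 match/mismatch scoring and the function name are assumptions for illustration; a real search would use a substitution matrix. Because segment pairs are ungapped, each one lies on a single diagonal, so the MSP score is the best maximum-sum run over all diagonals.

```python
def msp_score(x, y, match=1, mismatch=-1):
    """Score of the maximum segment pair (best ungapped local match)."""
    best = 0
    # Diagonal d pairs x[i] with y[j] where i - j == d.
    for d in range(-(len(y) - 1), len(x)):
        i, j = (d, 0) if d >= 0 else (0, -d)
        running = 0
        while i < len(x) and j < len(y):
            s = match if x[i] == y[j] else mismatch
            # Restart the segment whenever the running score drops below 0.
            running = max(0, running + s)
            best = max(best, running)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(msp_score("GATTACA", "TTGATTACA"))
```

This takes time proportional to the product of the two lengths, which is exactly the cost that the word-hit heuristics are meant to avoid.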

8
BLAST outline
  • Query sequence, X
  • Database of target sequences, Y
  • Find all Ys such that X and Y have an MSP of
    score > S
  • Such MSPs are called HSPs (High-Scoring Segment
    Pairs)
  • Key: choose S large enough to minimize false
    positives, but small enough to still find true
    positives
9
BLAST parameters
  • Word length W
  • Threshold T
  • A hit is a W-length sequence that has score > T
    when aligned to some W-length sequence in X
  • Typically, W is 3-5 for amino acids and 11-12
    for DNA
  • Choice of T is critical to separate signal from
    noise

10
BLAST algorithm
  • Make a list of all length-W words in X
  • For each word, precalculate the score of an
    exact match
  • For each word, make a list of nearby words that
    score > T
  • Thresholding greatly reduces the number of words:
    from exponential to linear in W, if T is large
    enough
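
The word-list construction can be sketched as follows. This is a toy version under stated assumptions: a DNA alphabet with illustrative +5/-4 match/mismatch scores instead of a real substitution matrix, hypothetical function names, and brute-force enumeration of all candidate words (practical only for small W).

```python
from itertools import product


def word_score(a, b, match=5, mismatch=-4):
    """Ungapped score of two equal-length words."""
    return sum(match if p == q else mismatch for p, q in zip(a, b))


def neighborhood(word, T, alphabet="ACGT"):
    """All words of the same length that score > T against `word`."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return sorted(w for w in candidates if word_score(word, w) > T)


def build_word_table(x, W, T):
    """Map each neighborhood word to the query positions it came from."""
    table = {}
    for i in range(len(x) - W + 1):
        for nb in neighborhood(x[i:i + W], T):
            table.setdefault(nb, []).append(i)
    return table


if __name__ == "__main__":
    print(neighborhood("ACG", T=14))       # high T: only the exact word
    print(len(neighborhood("ACG", T=5)))   # lower T: one mismatch allowed
    print(build_word_table("ACGT", W=3, T=14))
```

With a high T only exact matches survive, so the table holds about one entry per query position; lowering T admits more neighbors and the table grows quickly.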

11
BLAST search algorithm
  • Search database for hits (matches to precomputed
    word list)
  • This can be done very fast, e.g. with a trie

(Figure: trie for the words i, in, inn, to, tea, ten)
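
A minimal sketch of a trie-based scan, using the example words from the figure. The nested-dict representation and the function names are assumptions chosen for brevity; they are not how BLAST itself stores its word list.

```python
END = None  # key marking "a word ends here"; never equal to a character


def build_trie(words):
    """Build a trie as nested dicts, one level per character."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return root


def scan(text, trie):
    """Return (start, word) for every listed word occurring in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                hits.append((start, node[END]))
    return hits


if __name__ == "__main__":
    trie = build_trie(["i", "in", "inn", "to", "tea", "ten"])
    print(scan("ten to", trie))
```

Each database position is checked against all words at once, with at most one step per character of the longest word, rather than one comparison per word.
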
12
Original vs gapped BLAST
  • Original BLAST
  • For every hit, extend match to a locally maximal
    segment pair and check if score > S
  • Gapped BLAST
  • Require two hits before you start extending
  • Improves speed by 3X
  • Hits must be within distance A of each other
  • Use with a lower T threshold
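
The two-hit rule can be sketched as below. Beyond what the slide states, this sketch assumes (following the published two-hit method) that the two hits must lie on the same diagonal and must not overlap; the function name and the (query_pos, target_pos) hit format are hypothetical.

```python
def two_hit_triggers(hits, W, A):
    """Return the hits that would trigger an extension.

    hits: iterable of (query_pos, target_pos) word hits.
    A hit triggers if an earlier hit on the same diagonal ended
    before it starts (distance >= W) and is at most A away.
    """
    last_on_diag = {}  # diagonal -> target position of most recent hit
    triggers = []
    for q, t in sorted(hits, key=lambda h: h[1]):
        d = q - t
        prev = last_on_diag.get(d)
        if prev is not None and W <= t - prev <= A:
            triggers.append((q, t))
        last_on_diag[d] = t
    return triggers


if __name__ == "__main__":
    hits = [(0, 10), (5, 15), (30, 40), (2, 50)]
    print(two_hit_triggers(hits, W=3, A=10))
```

Since most isolated hits never find a partner, far fewer extensions are attempted, which is why a lower T can be afforded.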

13
BLAST statistics
  • Extreme value distribution
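
As a sketch of how the extreme value distribution is used in practice: under Karlin-Altschul statistics, the expected number of HSPs scoring at least S between sequences of lengths m and n is E = K m n exp(-lambda S), and the probability of at least one such HSP is 1 - exp(-E). K and lambda depend on the scoring system; the values in the usage below are placeholders, not real BLAST parameters.

```python
import math


def expected_hsps(S, m, n, K, lam):
    """Expected number of chance HSPs with score >= S (the E-value)."""
    return K * m * n * math.exp(-lam * S)


def p_value(S, m, n, K, lam):
    """Probability of at least one chance HSP with score >= S."""
    return 1.0 - math.exp(-expected_hsps(S, m, n, K, lam))


if __name__ == "__main__":
    # Placeholder K and lam, chosen only to show the shape of the curve.
    for S in (30, 40, 50):
        print(S, expected_hsps(S, m=300, n=1_000_000, K=0.1, lam=0.3))
```

The E-value falls exponentially with S, which is what makes a score threshold an effective filter against chance hits.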

14
Masking
  • Repetitive or low-complexity sequence generates
    spurious hits
  • Hard masking
  • Replace low-complexity sequence with Ns
  • Soft masking
  • Use lower-case characters, e.g.
    AGCAGTGatatatatatatatataatatatGAGTCT
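
The two masking styles can be sketched as simple string operations. The function names and the explicit start/end coordinates are assumptions; detecting which region is low-complexity is a separate step not shown here.

```python
def hard_mask(seq, start, end, mask_char="N"):
    """Replace seq[start:end] with the mask character."""
    return seq[:start] + mask_char * (end - start) + seq[end:]


def soft_mask(seq, start, end):
    """Lower-case seq[start:end], leaving the rest unchanged."""
    return seq[:start] + seq[start:end].lower() + seq[end:]


def soft_to_hard(seq, mask_char="N"):
    """Convert a soft-masked sequence to a hard-masked one."""
    return "".join(mask_char if c.islower() else c for c in seq)


if __name__ == "__main__":
    s = "AGCAGTGATATATGAGTCT"
    print(soft_mask(s, 7, 13))
    print(hard_mask(s, 7, 13))
```

Soft masking keeps the original residues, so a tool can skip the masked region when seeding but still extend an alignment through it; hard masking discards that information.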