Class 4: Sequence Alignment II: Gaps, Heuristic Search

Slides: 39
Provided by: NirFri

Transcript and Presenter's Notes


1
Class 4: Sequence Alignment II
Gaps, Heuristic Search
2
Alignment with Gaps: Example
Alignment 1:
A A C - A A T T A A G - A C T A C G T - T C A T G A C
A - C G A - T T A - G C A C - A C - T G T - A - G A -
Alignment 2:
A A C A A T T A A G A C T A C G T T C A T G A C - - -
A A C A A T T - - - - - - - - G T T C A T G A C G C A
3
Gaps
  • Both alignments have the same number of matches
    and spaces, but alignment 2 seems better
  • Definition: a gap is any maximal, consecutive run
    of spaces in a single string
  • The length of a gap is the number of spaces in it
  • Alignment 1 has 11 gaps; alignment 2 has only 2
  • Idea: develop alignment scores that take gaps
    (not just spaces) into account
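The gap definition above can be checked mechanically. The sketch below counts matches, spaces and gaps for the two example alignments. The "-" positions in the rows are an assumption: they are chosen to agree with the counts stated on the slides (16 matches and 11 spaces in both alignments, 11 gaps versus 2), and the function name gap_stats is made up for illustration.

```python
import re

def gap_stats(row_a, row_b, space="-"):
    """Return (matches, spaces, gaps) for an alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != space)
    spaces = row_a.count(space) + row_b.count(space)
    # A gap is a maximal run of spaces within a single row.
    run = re.escape(space) + "+"
    gaps = len(re.findall(run, row_a)) + len(re.findall(run, row_b))
    return matches, spaces, gaps

# Assumed gap placement, consistent with the counts given on the slides.
ALIGNMENT_1 = ("AAC-AATTAAG-ACTACGT-TCATGAC",
               "A-CGA-TTA-GCAC-AC-TGT-A-GA-")
ALIGNMENT_2 = ("AACAATTAAGACTACGTTCATGAC---",
               "AACAATT--------GTTCATGACGCA")

for name, rows in (("alignment 1", ALIGNMENT_1), ("alignment 2", ALIGNMENT_2)):
    print(name, gap_stats(*rows))
```

Both alignments report the same matches and spaces; only the gap count tells them apart, which is the point of the slide.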

4
Biological Motivation
  • Number of mutational events
      • A single gap is due to a single event that removed
        a number of residues
      • Each gap is due to a distinct, independent event
  • Protein structure
      • Protein secondary structure consists of alpha
        helices, beta sheets and loops
      • Loops of varying size can lead to very similar
        structure

5
Biological Motivation
6
cDNA Matching
  • cDNA is the sequence after splicing (introns have
    been removed) and editing
  • We expect regions of high similarity, separated
    by long gaps

7
Gap Penalty Models (I)
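  • Constant model
      • Gives each gap a constant score; spaces are free
      • Maximize: (total match/mismatch score) - Wg · (number of gaps)
      • Time: O(mn)
      • Works well with cDNA matching
  • Affine model
      • Penalty for starting a gap + penalty for each
        additional space
      • A gap of length q costs Wg + q·Ws
      • Maximize: (total match/mismatch score) - Wg · (number of gaps)
        - Ws · (number of spaces)
      • Time: O(mn)
      • Widely used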

8
Gap Penalty Models (II)
  • Convex model
      • Each extra space contributes less penalty
      • Gap function is convex in its length
      • Example: Ws log(q)
      • Time: O(mn log m)
      • A better model of biology
  • General model
      • The weight of a gap is some arbitrary w(q)
      • Time: O(mn^2 + nm^2)

9
Example Revised
Alignment 1:
A A C - A A T T A A G - A C T A C G T - T C A T G A C
A - C G A - T T A - G C A C - A C - T G T - A - G A -
Alignment 2:
A A C A A T T A A G A C T A C G T T C A T G A C - - -
A A C A A T T - - - - - - - - G T T C A T G A C G C A
10
Indel Model
Alignment 1:
A A C - A A T T A A G - A C T A C G T - T C A T G A C
A - C G A - T T A - G C A C - A C - T G T - A - G A -
Score: -6
Alignment 2:
A A C A A T T A A G A C T A C G T T C A T G A C - - -
A A C A A T T - - - - - - - - G T T C A T G A C G C A
Score: -6
Scoring parameters: match +1, indel -2
11
Constant Model
Alignment 1:
A A C - A A T T A A G - A C T A C G T - T C A T G A C
A - C G A - T T A - G C A C - A C - T G T - A - G A -
Score: -6
Alignment 2:
A A C A A T T A A G A C T A C G T T C A T G A C - - -
A A C A A T T - - - - - - - - G T T C A T G A C G C A
Score: 12
Scoring parameters: match +1, open gap -2
12
Affine Model
Alignment 1:
A A C - A A T T A A G - A C T A C G T - T C A T G A C
A - C G A - T T A - G C A C - A C - T G T - A - G A -
Score: -17
Alignment 2:
A A C A A T T A A G A C T A C G T T C A T G A C - - -
A A C A A T T - - - - - - - - G T T C A T G A C G C A
Score: 1
Scoring parameters: match +1, open gap -2, each space -1
13
Convex Model
Alignment 1:
A A C - A A T T A A G - A C T A C G T - T C A T G A C
A - C G A - T T A - G C A C - A C - T G T - A - G A -
Score: -6
Alignment 2:
A A C A A T T A A G A C T A C G T T C A T G A C - - -
A A C A A T T - - - - - - - - G T T C A T G A C G C A
Score: 7
Scoring parameters: match +1, open gap -2, gap of length n: -log(n)
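The scores on the last four slides can be reproduced with a short script. This is a sketch under stated assumptions, not the lecture's own code: the "-" positions in the rows are chosen to agree with the gap and space counts given earlier, the logarithm is taken base 2 (the slides do not name the base; base 2 gives 7 for alignment 2 after rounding), and mismatches are not scored because the example contains none.

```python
import math
import re

def gap_lengths(row_a, row_b, space="-"):
    """Lengths of all gaps (maximal runs of spaces) in both rows."""
    run = re.escape(space) + "+"
    return [len(g) for row in (row_a, row_b) for g in re.findall(run, row)]

def score(row_a, row_b, model, match=1, indel=2, open_gap=2, per_space=1):
    """Score an alignment under one of the four gap models from the slides."""
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    gaps = gap_lengths(row_a, row_b)
    if model == "indel":          # every space costs `indel`
        penalty = indel * sum(gaps)
    elif model == "constant":     # every gap costs `open_gap`, spaces are free
        penalty = open_gap * len(gaps)
    elif model == "affine":       # gap of length q costs open_gap + q * per_space
        penalty = open_gap * len(gaps) + per_space * sum(gaps)
    elif model == "convex":       # gap of length q costs open_gap + log2(q) (base assumed)
        penalty = sum(open_gap + math.log2(q) for q in gaps)
    else:
        raise ValueError(f"unknown model: {model}")
    return match * matches - penalty

# Assumed gap placement, consistent with the counts given on the slides.
ALIGNMENT_1 = ("AAC-AATTAAG-ACTACGT-TCATGAC",
               "A-CGA-TTA-GCAC-AC-TGT-A-GA-")
ALIGNMENT_2 = ("AACAATTAAGACTACGTTCATGAC---",
               "AACAATT--------GTTCATGACGCA")

for model in ("indel", "constant", "affine", "convex"):
    print(model, score(*ALIGNMENT_1, model), score(*ALIGNMENT_2, model))
```

Only the indel model fails to separate the two alignments; every gap-aware model prefers alignment 2.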
14
Affine Weight Model
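  • We divide all possible alignments of the prefixes
    s[1..i] and t[1..j] into 3 types:
  • Type 1: s[i] is aligned with t[j]
      s ......i
      t ......j
  • Type 2: t[j] is aligned with a space (the alignment
    ends with a gap in s)
      s ...i-----
      t ........j
  • Type 3: s[i] is aligned with a space (the alignment
    ends with a gap in t)
      s ........i
      t ...j-----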

15
Affine Weight Model
Recurrence relations
16
Affine Weight Model
  • Initial condition
  • Optimal alignment
  • Complexity: time O(mn), space O(mn)
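The recurrence and initial-condition formulas did not survive in this transcript. As a stand-in, here is a minimal sketch of the standard three-table formulation for affine gaps (often attributed to Gotoh), which may differ in notation from the original slides. The table names V, E, F, the function name and the mismatch score are assumptions; a gap of length q costs w_open + q * w_space, as in the affine model above.

```python
def affine_global_score(s, t, match=1, mismatch=-1, w_open=2, w_space=1):
    """Optimal global alignment score when a gap of length q costs w_open + q * w_space."""
    NEG = float("-inf")
    n, m = len(s), len(t)
    # V[i][j]: best score for s[:i] vs t[:j] over all three alignment types.
    # E[i][j]: best score when t[j] is opposite a space (gap in s).
    # F[i][j]: best score when s[i] is opposite a space (gap in t).
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):            # s[:i] against an empty prefix: one gap
        F[i][0] = V[i][0] = -w_open - i * w_space
    for j in range(1, m + 1):            # empty prefix against t[:j]: one gap
        E[0][j] = V[0][j] = -w_open - j * w_space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            diag = V[i - 1][j - 1] + sub                     # type 1
            # Extend an existing gap, or open a new one after any alignment.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - w_open) - w_space
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - w_open) - w_space
            V[i][j] = max(diag, E[i][j], F[i][j])
    return V[n][m]

print(affine_global_score("AACAATTAAGACTACGTTCATGAC", "AACAATTGTTCATGACGCA"))
```

Each cell does a constant amount of work on three tables, which is where the O(mn) time and space come from. With the default parameters the result for the example strings is at least 1, the affine score of alignment 2.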
17
Affine Weight Model
  • This model has a natural explanation as a
    finite-state automaton

18
Alignment in Real Life
  • One of the major uses of alignments is to find
    sequences in a database
  • Such collections contain a massive number of
    sequences (on the order of 10^6)
  • Finding homologies in these databases with the
    standard dynamic programming can take too long
  • Example
      • Query protein: 232 AAs
      • NR protein DB: 2.7 million sequences, 748
        million AAs
      • mn ≈ 1.7 × 10^11 cells!
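The cell count is just the product of the query length and the total database length. A small sketch of the arithmetic follows; the throughput figure is hypothetical and only there to turn the cell count into a rough running time.

```python
def dp_cells(query_len, db_residues):
    """Number of DP cells when the query is aligned against every database residue."""
    return query_len * db_residues

cells = dp_cells(232, 748_000_000)
print(f"{cells:.2e} cells")

# Hypothetical throughput, chosen only for illustration.
CELLS_PER_SECOND = 1e8
print(f"about {cells / CELLS_PER_SECOND / 60:.0f} minutes at that rate")
```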

19
Heuristic Search
  • Instead, most searches rely on heuristic
    procedures
  • These are not guaranteed to find the best match
  • Sometimes, they will completely miss a
    high-scoring match
  • We now describe the main ideas used by some of
    these procedures
  • Actual implementations often contain additional
    tricks and hacks

20
Basic Intuition
  • The main resource-consuming factor in the
    standard DP is deciding where the gaps are. If
    there were no gaps, life would be easy!
  • Almost all heuristic search procedures are based
    on the observation that real-life well-matching
    pairs of sequences often do contain long strings
    with gap-less matches.
  • These heuristics try to find significant local
    gap-less matches and then extend them.

21
Banded DP
  • Suppose that we have two strings s[1..n] and
    t[1..m] such that n ≈ m
  • If the optimal global alignment of s and t has
    few gaps, then the path of the alignment will be
    close to the diagonal

[Figure: DP matrix, axes s and t]
22
Banded DP
  • To find such a path, it suffices to search in a
    diagonal region of the matrix
  • If the diagonal band has presumed width a, then
    the dynamic programming step takes O(an)
  • Much faster than the O(n^2) of standard DP in this case

[Figure: DP matrix with a diagonal band of width a, axes s and t]
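A minimal sketch of banded DP, assuming a simple per-space indel penalty rather than affine gaps, and taking the band to be all cells with |i - j| <= band. The function name and scoring parameters are illustrative. Only cells inside the band are stored, so the work is proportional to band × n rather than n × m.

```python
def banded_global_score(s, t, band, match=1, mismatch=-1, indel=-2):
    """Global alignment score restricted to cells (i, j) with |i - j| <= band."""
    n, m = len(s), len(t)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    NEG = float("-inf")
    # Row 0: the empty prefix of s against t[:j], for j inside the band.
    prev = {j: j * indel for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = NEG
            if j > 0 and (j - 1) in prev:      # s[i] aligned with t[j]
                sub = match if s[i - 1] == t[j - 1] else mismatch
                best = max(best, prev[j - 1] + sub)
            if j in prev:                      # s[i] opposite a space
                best = max(best, prev[j] + indel)
            if j > 0 and (j - 1) in cur:       # t[j] opposite a space
                best = max(best, cur[j - 1] + indel)
            cur[j] = best
        prev = cur
    return prev[m]

print(banded_global_score("ACGT", "AGT", band=1))
```

If the true optimal path leaves the band, the result is only a lower bound on the optimal score; widening the band trades time for safety.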
23
Banded DP
  • Problem (for local alignment)
  • If we know that t[i..j] matches the query
    s[p..q], then we can use banded DP to evaluate
    the quality of the match
  • However, we do not know i, j, p, q!
  • How do we select which sub-sequences to align
    using banded DP?

24
FASTA Overview
  • Main idea
  • Find (fast!) good diagonals and extend them to
    complete matches
  • Suppose that we have a relatively long gap-less
    local match (diagonal)
  • AGCGCCATGGATTGAGCGA
  • TGCGACATTGATCGACCTA
  • Can we find clues that will let us find it
    quickly?


25
Signature of a Match
  • Assumption good matches contain several
    patches of perfect matches
  • AGCGCCATGGATTGAGCGA
  • TGCGACATTGATCGACCTA

26
FASTA
  • Given s and t, and a parameter k
  • Find all pairs (i,j) such that s[i..i+k] and
    t[j..j+k] match perfectly
  • Locate sets of pairs that are on the same
    diagonal by sorting according to i-j, thus
    locating diagonals that contain many close pairs
  • This is faster than O(nm)!

[Figure: DP matrix, axes s and t, with the match between s[i..i+k] and t[j..j+k] marked on a diagonal]
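The diagonal-finding step can be sketched as follows. Two things are assumptions here: words are taken as exactly k letters, s[i:i+k] in Python slicing with 0-based positions, and hits are grouped by i - j in a hash table instead of being sorted, which yields the same sets of pairs per diagonal. The function names are made up for illustration; the demo strings are the pair shown on the FASTA overview slide.

```python
from collections import defaultdict

def diagonal_hits(s, t, k):
    """Map each diagonal i - j to the list of (i, j) where s[i:i+k] == t[j:j+k]."""
    index = defaultdict(list)                  # k-letter word -> start positions in t
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    diagonals = defaultdict(list)
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], ()):
            diagonals[i - j].append((i, j))
    return diagonals

def busiest_diagonals(s, t, k, top=3):
    """Diagonals ranked by number of exact k-letter hits (a crude stand-in for scoring)."""
    hits = diagonal_hits(s, t, k)
    return sorted(hits.items(), key=lambda item: len(item[1]), reverse=True)[:top]

s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
for diag, pairs in busiest_diagonals(s, t, k=3):
    print(diag, pairs)
```

Building the index takes time linear in the length of t, and the lookup cost depends on the number of hits rather than on the full n × m table.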
27
FASTA
  • Extend the best diagonal matches to imperfect
    (yet ungapped) matches, compute alignment scores
    per diagonal. Pick the best-scoring matches.
  • Try to combine close diagonals into potential
    gapped matches, picking the best-scoring matches.
  • Finally, run banded DP on the regions containing
    these matches, resulting in several good
    candidate alignments.
  • Most applications of FASTA use very small k (2
    for proteins, and 4-6 for DNA)

28
BLAST Overview
  • FASTA's drawback is its reliance on perfect matches
  • BLAST (Basic Local Alignment Search Tool) uses
    similar intuition, but relies on high-scoring
    matches rather than exact matches
  • Given parameters: length k and threshold T
  • Two strings s and t of length k are a high-scoring
    pair (HSP) if d(s,t) > T

29
High-Scoring Pair
  • Given a query string s, BLAST constructs all words
    w (neighborhood words) such that w is an HSP
    with a k-substring of s.
  • Note: not all k-mers have an HSP in s

30
BLAST phase 1
  • Phase 1: compile a list of word pairs (k = 3) above
    threshold T
  • Example: for the following query
  • FSGTWYA (query word is in green)
  • A list of words (k = 3) is
  • FSG SGT GTW TWY WYA
  • YSG TGT ATW SWY WFA
  • FTG SVT GSW TWF WYS

31
BLAST phase 1
Scores

Neighborhood word hits > threshold:
  Word   Per-position scores   Total
  GTW    6, 5, 11              22
  GSW    6, 1, 11              18
  ATW    0, 5, 11              16
  NTW    0, 5, 11              16
  GTY    6, 5, 2               13

Neighborhood word hits below threshold (T = 11):
  GNW                          10
  GAW                          9
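The threshold test on this slide can be sketched in a few lines. The pair scores below are only the ones readable from the per-position breakdown for the query word GTW; the row scoring 6, 1, 11 is read as GSW, which is an assumption. The candidate ASY is an extra word added to show a rejection, and the function and table names are illustrative. A real implementation would score every k-letter word with a full substitution matrix.

```python
# Pair scores taken from the slide's per-position breakdown, keyed as
# (query letter, word letter). No other pairs are defined here.
SLIDE_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}

def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs the slide does not give."""
    key = (a, b) if (a, b) in SLIDE_SCORES else (b, a)
    return SLIDE_SCORES[key]

def neighborhood(query_word, candidates, score, threshold):
    """Keep the candidate words whose total score against query_word exceeds threshold."""
    hits = {}
    for word in candidates:
        total = sum(score(a, b) for a, b in zip(query_word, word))
        if total > threshold:
            hits[word] = total
    return hits

candidates = ["GTW", "GSW", "ATW", "NTW", "GTY", "ASY"]
print(neighborhood("GTW", candidates, pair_score, threshold=11))
```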
32
BLAST phase 2
  • Search the database for perfect matches with
    neighborhood words. Those are hits for further
    alignment.
  • We can locate seed words in a large database in a
    single pass, given the database is properly
    preprocessed (using hashing techniques).
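One way to picture the preprocessing is a hash table from every k-letter word in the database to its occurrences, so that each neighborhood word is a single lookup. This is a sketch only: the toy database, sequence ids and function names are hypothetical, and positions are 0-based.

```python
from collections import defaultdict

def build_word_index(database, k):
    """Map each k-letter word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def find_hits(index, neighborhood_words):
    """Exact database occurrences of each neighborhood word; these seed later extension."""
    return {w: index[w] for w in neighborhood_words if w in index}

# Hypothetical toy database, for illustration only.
toy_db = {"p1": "MKGTWQ", "p2": "AATWGSW"}
index = build_word_index(toy_db, k=3)
print(find_hits(index, ["GTW", "GSW", "ATW", "NTW", "GTY"]))
```

The index is built once per database and reused for every query, so the per-query cost depends on the number of neighborhood words and hits.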

33
Extending Potential Matches
  • Once a hit is found, BLAST attempts to find a
    local alignment that extends it.
  • Seeds on the same diagonal tend to be combined
    (as in FASTA)

34
Two HSP diagonal
  • An improvement: look for 2 HSPs on close
    diagonals
  • Extend the alignment between them
  • Fewer extensions considered
  • There is a version of BLAST involving gapped
    extensions
  • Generally faster than FASTA, arguably better

[Figure: DP matrix, axes s and t]
35
Blast Variants
  • blastn (nucleotide BLAST)
  • blastp (protein BLAST)
  • tblastn (protein query, translated DB BLAST)
  • blastx (translated query, protein DB BLAST)
  • tblastx (translated query, translated DB BLAST)
  • bl2seq (pairwise alignment)

36
Biological Databases
  • Today, most of the biological information can be
    freely accessed on the web.
  • One can
      • Search for information on a known gene
      • Check if a sequence exists in a database
      • Find a homologous protein, helping us guess
          • Structure
          • Function

37
Databases and Tools
  • Important gateways
      • National Center for Biotechnology Information (GenBank)
        http://www.ncbi.nlm.nih.gov/
      • European Bioinformatics Institute (EMBL-Bank)
        http://www.ebi.ac.uk/
      • Expert Protein Analysis System (SwissProt)
        http://www.expasy.org/
  • Different tools and DBs allow biologists a
    rich suite of queries

38
Database Types
  • Nucleotide DBs (GenBank, EMBL-Bank)
      • Contain any and every type of DNA fragment
      • Full cDNA, ESTs, repeats, fragments
      • Dirty and redundant
  • Protein DBs (SwissProt)
      • Contain amino-acid sequences for full proteins
      • High quality, strict screening process
      • Lots of annotated information on each protein