Sequence Alignment - PowerPoint PPT Presentation

1 / 31
About This Presentation
Title:

Sequence Alignment

Description:

Hi-1j-1 m(si,tj) Hij = max Hij-1 m(-,tj) Hi-1j m(si,-) Hi0 = 0,k m(sk,-) H0j = 0,k m(-,tk) ... Find a path of max weight. Init-n score total path weight ... – PowerPoint PPT presentation

Number of Views:19
Avg rating:3.0/5.0
Slides: 32
Provided by: bioinfo3
Category:
Tags: alignment | max | sequence | tj

less

Transcript and Presenter's Notes

Title: Sequence Alignment


1
Sequence Alignment
Motivation
  • Storing, retrieving and comparing DNA sequences
    in Databases.
  • Comparing two or more sequences for similarities.
  • Searching databases for related sequences and
    subsequences.
  • Exploring frequently occurring patterns of
    nucleotides.
  • Finding informative elements in protein and DNA
    sequences.
  • Various experimental applications (reconstruction
    of DNA, etc.)

2
Seq.Align. Protein Function
Protein1
Protein2
More than 25 sequence identity
?
Similar function
3
Exact Pattern Matching
  • Given a pattern P of length m and a (longer)
    string T of length n, find all the occurrences of
    P in T.
  • Naïve algorithm O(mn)
  • Boyer-Moore, Knuth-Pratt-Morris O(nm)

4
Alignment - inexact matching
  • Substitution - replacing a sequence base by
    another.
  • Insertion - an insertion of a base (letter)
    or several bases to the sequence.
  • Deletion - deleting a base (or more) from
    the sequence.
  • (Insertion and deletion are the reverse of one
    another)

5
Seq. Align. Score
Commonly used matrices PAM250, BLOSUM64
6
Global Alignment
Global Alignment INPUT Two sequences S and T of
roughly the same length. QUESTION What is the
maximum similarity between them?
Find one of the best alignments.
7
The IDEA
s1n
t1m
To align s1...i with t1j we have three
choices align s1i-1 with t1j-1 and match
si with tj align s1i with t1j-1 and
match a space with tj align s1i-1 with
t1j and match si with a space
s1i-1 i
t1j-1 j
s1 i -
t1j-1 j
s1i-1 i
t1 j -
8
Recursive Relation
Define scoring matrix m(a,b) a,b ? ? U
- Define Hij the best score of alignment
between s1i and t1j for 1 lt i lt n, 1 lt
j lt m Hi-1j-1 m(si,tj) Hij
max Hij-1 m(-,tj) Hi-1j
m(si,-) Hi0 ?0,k m(sk,-) H0j ?0,k
m(-,tk)
Optimal alignment score Hnm
Needleman-Wunsch 1970
9
t
I T
s3 t4
I -
s3 -
- T
- t4
s
10
(No Transcript)
11
Local Alignment
Local Alignment INPUT Two sequences S and T
. QUESTION What is the maximum similarity
between a subsequence of S and a
subsequence of T ? Find most
similar subsequences.
12
Recursive Relation
for 1 lt i lt n, 1 lt j lt m Hi-1j-1
m(si,tj) Hij-1
m(-,tj) Hi-1j m(si,-)
0 Hi0 0 H0j 0
Hij max
Optimal alignment score maxij Hij
Smith-Waterman 1981
Penalties should be negative
13
Sequence Alignment
Complexity Time O(nm) Space O(nm) (exist
algorithm with O(min(n,m)) )
14
Ends free alignment
Ends free alignment
INPUT Two equences
S and T (possibly of different length).
QUESTION Find one of the best alignments between
subsequences of S and T when at least one of
these subsequences is a prefix of the original
sequence and one (not necessarily the other) is a
suffix.
or
15
Gap Alignment
Definition A gap is the maximal contiguous run
of spaces in a single sequence within a given
alignment.The length of a gap is the number of
indel operations on it. A gap penalty function is
a function that measure the cost of a gap as a
(nonlinear) function of its length.
Gap penalty

INPUT Two sequences S and T (possibly of
different length). QUESTION Find one of the
best alignments between the two
sequences using the gap penalty function.
Affine Gap
Wtotal Wg qWs Wg weight to open the gap Ws
weight to extend the gap
16
What Kind of Alignment to Use?
  • The same protein from the different organisms.
  • Two different proteins sharing the same function.
  • Protein domain against a database of complete
    proteins.
  • Protein against a database of small patterns
    (functional units)
  • ...

17
Sequence Alignment vs. Database
  • Task Given a query sequence and millions of
    database records, find the optimal alignment
    between the query and a record

ACTTTTGGTGACTGTAC
18
Sequence Alignment vs. Database
  • Tool Given two sequences,there exists an
    algorithm tofind the best alignment.
  • Naïve Solution Apply algorithm to each of the
    records, one by one

19
Sequence Alignment vs. Database
  • Problem An exact algorithm is too slowto run
    millions of times (even linear time algorithm
    will run slowly on a huge DB)
  • Solution
  • Run in parallel (expensive).
  • Use a fast (heuristic) method to discard
    irrelevant records. Then apply the exact
    algorithm to the remaining few.

20
Sequence Alignment vs. Database
  • General Strategy of Heuristic Algorithms
  • Homologous sequences are expected to contain
    un-gapped (at least) short segments (probably
    with substitutions, but without ins/dels)
  • Preprocess DB into some fast access data
    structure of short segments.

21
FASTA Idea
  • Idea a good alignment probably matches some
    identical words (ktups)
  • Example
  • Database record
  • ACTTGTAGATACAAAATGTG
  • Aligned query sequence
  • A-TTGTCG-TACAA-ATCTGT
  • Matching words of size 4







22
Dictionaries of Words
  • ACTTGTAGATAC Is translated to the dictionary
  • ACTT,
  • CTTG,
  • TTGT,
  • TGTA
  • Dictionaries of well aligned sequences share
    words.

23
FASTA Stage I
  • Prepare dictionary for db sequence (in advance)
  • Upon query
  • Prepare dictionary for query sequence
  • For each DB record
  • Find matching words
  • Search for long diagonal runs of matching words
  • Init-1 score longest run
  • Discard record if low score

matching word



Position in DB record
Position in query
24
FASTA stage II
  • Good alignment path through many runs,
    withshort connections
  • Assign weights to runs()and connections(-)
  • Find a path of max weight
  • Init-n score total path weight
  • Discard record if low score

25
FASTA Stage III
  • Improve Init-1. Apply an exact algorithm around
    Init-1 diagonal within a given width band.
  • Init-1 Opt-score new weight
  • Discard record if low score

26
FASTA final stage
  • Apply an exact algorithm to surviving records,
    computing the final alignment score.

27
BLAST (Basic Local Alignment Search Tool)
Approximate Matches
  • BLAST
  • Words are allowed to contain inexact matching.
  • Example
  • In the polypeptide sequence IHAVEADREAM
  • The 4-long word HAVE starting at position 2 may
    match
  • HAVE,RAVE,HIVE,HALE,

28
Approximate Matches
  • For each word of length w from a Data Base
    generate all similar words.
  • Similar means score( word, word ) gt T
  • Store all similar words in a look-up table.

29
DB search
  • 1) For each word of length w from a query
    sequence generate all similar words.
  • 2) Access DB.
  • 3) Each hit extend as much as possible -gt
    High-scoring Segment Pair (HSP)
  • score(HSP) gt V

THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
30
DB search (2)
4) Around HSP perform DP. At each step
alignment score should be gt T
s-db
starting point (seed pair)
s-query
31
Homework
  • Write a cgi script (using Perl) that performs
    pairwise Local/Global Alignment for DNA
    sequences. All I/O is via HTML only.
  • Input
  • Choice for Local/Global alignment.
  • Two sequences text boxes.
  • Values for match, mismatch, ins/dels.
  • Number of iterations for computing random scores.
  • Output
  • Alignment score.
  • z-score value (z (score-average)/standard
    deviation.)

Remarks 1) you are allowed to use only linear
space. 2)To compute z-score perform random
shuffling srand(time ) init,
-proc.id int(rand(i)) returns rand. number
between 0,i. 3)Shuffling is done in windows
(non-overlapping) of 10 bases length. Number of
shuffling for each window is random 0,10.
Write a Comment
User Comments (0)
About PowerShow.com