Pair-Wise Sequence Alignment Methods and Tools

Provided by: Jef5198
1
Pair-Wise Sequence Alignment Methods and Tools
2
Pair-Wise Alignment: Global and Local Alignments
Part I: Fundamental Concepts
9
Scoring Matrices
  • Alignment of two nucleotide sequences is
    traditionally scored using values which may be
    looked up in a weighted scoring matrix.
  • The matrices most frequently used for scoring
    alignments of amino acid sequences come from the
    PAM (Point Accepted Mutation, also expanded as
    Percent Accepted Mutation) and BLOSUM (Blocks
    Substitution Matrix) families.
  • PAM
  • The PAM matrices are constructed so that the
    highest scoring alignments using PAMn are those
    of sequences about n PAM units apart. PAM1 is the
    base PAM matrix. It is constructed so that
    maximal scores are produced for alignments with
    only 1% mutation (99% conservation).
  • To construct the PAMn matrix from PAM1 (M1), the
    following formula is used: $M_n = M_1^n$, that
    is, M1 multiplied by itself n times.
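
PAMn is obtained by multiplying the PAM1 mutation probability matrix by itself n times. The sketch below illustrates that step only. The 2-letter matrix `TOY_PAM1` and the helper names are assumptions made for illustration; the real PAM1 is a 20x20 matrix, and published PAM tables are log-odds scores derived from these probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(m1, n):
    """Return m1 raised to the n-th power (n >= 1) by repeated multiplication."""
    result = m1
    for _ in range(n - 1):
        result = mat_mul(result, m1)
    return result


# Toy 2-letter stand-in for PAM1 (hypothetical values): each letter is
# conserved with probability 0.99 and mutated with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam2 = pam_n(TOY_PAM1, 2)
```

In this toy case the conservation probability after two PAM units is 0.99 * 0.99 + 0.01 * 0.01 = 0.9802, slightly more than 0.98, because a position can mutate and then mutate back.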

10
PAM100
11
Scoring Matrices (contd.)
  • BLOSUM
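  • BLOSUM matrices are based on local multiple
    alignments of more distantly related sequences.
    Unlike PAM matrices, which are extrapolated from
    closely related sequences, BLOSUM matrices were
    derived directly from the amino acid
    substitutions observed in these alignments.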
  • For the creation of BLOSUM matrices, a database
    of ungapped multiple alignments of short regions
    of related sequences (blocks) was derived. Within
    each alignment, the sequences were clustered into
    groups of sequences whose similarity met some
    threshold percent identity.
  • Substitution frequencies for all pairs of amino
    acids were calculated between the groups, to
    create the Blocks Substitution Matrix for that
    threshold.
  • The number associated with the matrix is the
    percent identity threshold used for this
    clustering. For example, BLOSUM50 means that
    sequences at least 50% identical within a block
    were grouped together before substitutions were
    counted.
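
The clustering step described above can be sketched as follows. This is a minimal illustration, assuming single-linkage grouping of equal-length, ungapped block rows; the function names and the example block are hypothetical and do not come from any BLOSUM source code.

```python
def percent_identity(a, b):
    """Percent of positions that are identical in two equal-length rows."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def cluster_block(seqs, threshold):
    """Group block rows: a row joins (and merges) every group that has a
    member at least `threshold` percent identical to it."""
    clusters = []
    for s in seqs:
        linked, rest = [], []
        for c in clusters:
            close = any(percent_identity(s, m) >= threshold for m in c)
            (linked if close else rest).append(c)
        merged = [s] + [m for c in linked for m in c]
        clusters = rest + [merged]
    return clusters


# Hypothetical 3-row block: the first two rows are 80% identical.
groups = cluster_block(["ACDEF", "ACDEG", "WYYYW"], threshold=50)
```

With a threshold of 50, the first two rows fall into one group and the third stays alone, so substitutions would be counted only between those two groups.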

17
Not only one best path
23
FASTP
FASTA
24
FASTA: http://www.ebi.ac.uk/Tools/fasta33/index.html
25
FastP and FastA (1)
  • FastA is an algorithm that attempts to speed up
    string matching relative to standard optimal
    alignment.
  • String matching using dynamic programming runs in
    quadratic time. FastA uses direct addressing or
    k-tuple preprocessing to cut down the dynamic
    programming search space significantly. This
    results in reduced search time at the expense of
    some sensitivity.
  • The FastA algorithm is implemented in the
    following 6 stages:
  • Locate hot spots
  • Find the 10 best regions in the matrix
  • Score using a substitution matrix

26
FastP and FastA (2)
  • Combine initial regions from different diagonals
  • Optimal alignment
  • Presentation
  • Locating Hot spots
  • FastA allows the specification of a parameter
    called ktup. The ktup sets the basic word length
    for the comparisons between the query string and
    a given string in the database. ktup values are
    typically six for DNA sequences and two for
    protein sequences. For the DNA case each word is
    represented as a base 4 number that is also the
    index into the table.
  • The matching ktup-length substrings are referred
    to as hot spots. To locate the hot spots, FastA
    creates a dictionary of all words of length ktup
    that occur in the query sequence.
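
The lookup described above can be sketched as follows. This is a minimal illustration for DNA only, not FastA's actual code; the function names and the A=0, C=1, G=2, T=3 digit assignment are assumptions.

```python
from collections import defaultdict

BASE4 = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed digit assignment


def word_index(word):
    """Encode a DNA word as a base 4 integer, usable as a table index."""
    value = 0
    for ch in word:
        value = value * 4 + BASE4[ch]
    return value


def build_lookup(query, ktup):
    """Map each ktup-length word of the query to the offsets where it occurs."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[word_index(query[i:i + ktup])].append(i)
    return table


def hot_spots(query, subject, ktup):
    """Return (query_offset, subject_offset) pairs of matching ktup words."""
    table = build_lookup(query, ktup)
    spots = []
    for j in range(len(subject) - ktup + 1):
        for i in table.get(word_index(subject[j:j + ktup]), []):
            spots.append((i, j))
    return spots


spots = hot_spots("ACGTAC", "TACGT", ktup=3)
```

For this toy pair the result is `[(3, 0), (0, 1), (1, 2)]`: the subject words TAC, ACG and CGT are each found in the query with a single table lookup.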

27
FastP and FastA (3)
  • Each entry contains the offsets where this
    particular ktup-letter word (6 letters in the DNA
    case) occurs in the query sequence. In this way,
    for each word in the searched string, only the
    dictionary need be consulted to determine if and
    where the word occurs in the query string.
  • Finding the 10 Best Regions
  • A region is a sequence of consecutive hot spots
    on the same diagonal. Spaces between the hot
    spots are permitted.
  • FastA ranks regions by giving each hot spot a
    positive score. The intervening space between
    consecutive hot spots is given a negative score.
    The larger the gap, the more severe the penalty.
  • The score of the diagonal run is the sum of the
    hot spot scores and the interspot penalties.
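
The diagonal scoring above can be sketched as follows, taking hot spots as (query_offset, subject_offset) pairs. The score of 5 per hot spot and the penalty of 1 per intervening position are made-up values, and this simplified version scores each diagonal as a single run rather than searching for the best sub-run as a real implementation would.

```python
from collections import defaultdict


def score_diagonals(spots, ktup, spot_score=5, gap_penalty=1, keep=10):
    """Score the hot spots on each diagonal; return the `keep` best
    (score, diagonal) pairs, where diagonal = query_offset - subject_offset."""
    by_diag = defaultdict(list)
    for i, j in spots:
        by_diag[i - j].append(i)
    scored = []
    for diag, offsets in by_diag.items():
        offsets.sort()
        score = spot_score * len(offsets)
        for prev, nxt in zip(offsets, offsets[1:]):
            # Positions between the end of one hot spot and the start of
            # the next; overlapping hot spots cost nothing.
            score -= gap_penalty * max(0, nxt - (prev + ktup))
        scored.append((score, diag))
    scored.sort(reverse=True)
    return scored[:keep]


best = score_diagonals([(0, 0), (5, 5), (2, 7)], ktup=2)
```

Here diagonal 0 holds two hot spots separated by 3 positions, giving 5 + 5 - 3 = 7, and diagonal -5 holds one hot spot, giving 5.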

28
FastP and FastA (4)
  • Scoring with Substitution Matrix
  • FastA next applies a substitution matrix to the
    10 best regions found above. The substitution
    matrix may be amino acid based or nucleotide
    based. This step allows different matches to be
    weighted differently.
  • The single best subalignment found after the
    application of the substitution matrix is termed
    init1.
  • Combining Initial Regions from Different
    Diagonals
  • In this step, FastA checks to see if any of the
    initial regions from different diagonals may be
    combined to form a new higher scoring region.
    The score for the combined regions is the sum of
    the scores of the contributing regions less a
    joining penalty for each join.
  • The score for the highest scoring region after
    this step is termed initn.
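
Rescoring a region with a substitution matrix amounts to finding the best-scoring ungapped segment along its diagonal. The sketch below shows one way to do that with a running-sum scan. The `toy_score` function (+2 match, -1 mismatch) is a hypothetical stand-in for a real substitution matrix, and the function names are assumptions.

```python
def toy_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def best_segment_on_diagonal(query, subject, diag, score):
    """Best-scoring ungapped segment on the diagonal
    diag = query_pos - subject_pos.
    Returns (best_score, query_start, length)."""
    i = max(0, diag)
    j = max(0, -diag)
    best = (0, i, 0)
    run_score, run_start = 0, i
    while i < len(query) and j < len(subject):
        if run_score <= 0:
            # A non-positive prefix never helps, so start a new segment.
            run_score, run_start = 0, i
        run_score += score(query[i], subject[j])
        if run_score > best[0]:
            best = (run_score, run_start, i - run_start + 1)
        i += 1
        j += 1
    return best


segment = best_segment_on_diagonal("ACGTTT", "ACGAAA", 0, toy_score)
```

For this toy pair the best segment is the first three letters, with score 6. Applying the same scan to each of the retained regions and keeping the maximum would give a value playing the role of init1.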

29
FastP and FastA (5)
  • This step can be implemented using directed
    weighted graphs where the vertices are the
    subalignments from the last stage.
  • The maximum weight path gives the initn
    alignment.
  • Optimal Alignment
  • In addition to initn, FastA computes an
    alternative local alignment score opt. This
    score is obtained by considering a narrow
    diagonal band in the dynamic programming matrix,
    centered along the init1 diagonal.
  • Presentation
  • Finally the database is ordered by either the opt
    or initn scores and the highest ranking result
    sequences are run through a full Smith-Waterman
    alignment.
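
The banded computation behind opt can be sketched as a Smith-Waterman recurrence that only fills cells near a chosen diagonal. This is an illustration under simplifying assumptions: a linear gap penalty of -2, a toy +2/-1 scoring function, and hypothetical function names. It is not the FastA implementation.

```python
def toy_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def banded_local_score(query, subject, diag, width, score, gap=-2):
    """Local alignment score using only cells (i, j) with
    |(i - j) - diag| <= width. `gap` must be negative."""
    prev = {}  # band cells of the previous row: column -> score
    best = 0
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i - diag - width)
        hi = min(len(subject), i - diag + width)
        for j in range(lo, hi + 1):
            value = max(
                0,
                prev.get(j - 1, 0) + score(query[i - 1], subject[j - 1]),
                prev.get(j, 0) + gap,     # gap in the subject
                cur.get(j - 1, 0) + gap,  # gap in the query
            )
            cur[j] = value
            best = max(best, value)
        prev = cur
    return best


on_band = banded_local_score("ACGT", "TTTTACGT", -4, 1, toy_score)
off_band = banded_local_score("ACGT", "TTTTACGT", 0, 1, toy_score)
```

With the band centered on the diagonal where the match lies, the score is 8; with the band centered on diagonal 0, only a single matching T is reachable and the score drops to 2. This is the trade-off of restricting the search to a narrow band.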

32
BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
33
BLAST (1)
  • The Basic Local Alignment Search Tool (BLAST)
    program uses a heuristic algorithm to search for
    local alignments of a query string against a
    BLAST-formatted database. It directly
    approximates alignments that optimize a measure
    of local similarity, the maximal segment pair
    (MSP) score.
  • It is reported to run 100 times faster than a
    serial Smith-Waterman search. There is a
    different BLAST version for each combination of
    query type and database type.

34
BLAST (2)
  • The BLAST database consists of three files for
    every FASTA-format input file.
  • The first contains all of the sequence headers,
    textual information about the amino acid or
    nucleotide sequence.
  • The second contains the compressed sequences (2
    bits for each nucleotide, 5 bits for each amino
    acid).
  • The third file contains an index of the
    compressed sequences so that they can be matched
    with the corresponding headers.
  • The program runs in 3 rounds.
  • Database Scanning (table search or Finite state
    machine)
  • Seed Growing
  • Combining Alignments
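
The 2-bits-per-nucleotide compression mentioned above can be sketched as follows. The bit layout (first base in the high bits, A=0, C=1, G=2, T=3) and the function names are assumptions for illustration; this is not the actual BLAST database file format.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit codes
LETTER = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, 4 nucleotides per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for pos, ch in enumerate(seq):
        shift = 6 - 2 * (pos % 4)  # first base goes in the high bits
        out[pos // 4] |= CODE[ch] << shift
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` nucleotides from packed bytes."""
    return "".join(
        LETTER[(data[pos // 4] >> (6 - 2 * (pos % 4))) & 3]
        for pos in range(length)
    )


packed = pack("GATTACA")
```

Seven nucleotides fit in two bytes here. The sequence length has to be stored separately, since the padding bits of the last byte are indistinguishable from trailing A's.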

35
BLAST (3)
  • Database Scanning
  • BLAST scans the database sequentially for short
    words (k-tups) of length W (Word Size) that score
    higher than T (Neighborhood Word Score Threshold)
    against words in the query string.
  • In BLAST 1.4, W is usually 3 amino acids or 11
    nucleotides, and in BLAST 2.0 this is usually 2
    amino acids or 2 sets of 5 nucleotides that are
    not contiguous.
  • Once the tuple size has been selected, scanning
    may be accomplished in either of two ways.
  • The first maps each k-tup of length W to a
    unique integer.
  • The second approach for scanning a database is to
    construct a deterministic finite automaton (DFA)
    based on the query words to scan the database.
    This second approach was chosen since it saved on
    time and space. The list of successful words is
    called the neighborhood. This information is
    stored in an index for the next round.
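
Building the neighborhood can be sketched by brute force: for every W-letter word in the query, enumerate all candidate words and keep those scoring at least T. The two-letter alphabet, the +2/-1 `toy_score` function and the names below are assumptions chosen to keep the example small; real implementations prune the search rather than enumerate every word.

```python
from itertools import product


def toy_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def neighborhood(query, w, threshold, alphabet, score):
    """Map each w-letter word scoring >= threshold against some query word
    to the list of query positions it matches."""
    hits = {}
    for pos in range(len(query) - w + 1):
        q_word = query[pos:pos + w]
        for cand in product(alphabet, repeat=w):
            total = sum(score(a, b) for a, b in zip(q_word, cand))
            if total >= threshold:
                hits.setdefault("".join(cand), []).append(pos)
    return hits


words = neighborhood("AAB", w=2, threshold=1, alphabet="AB", score=toy_score)
```

Scanning the database then reduces to looking up each database word in `words`, whether through an integer table or an automaton.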

36
BLAST (4)
  • Seed Growing
  • In this round the matches found in the first
    round are used as "seeds" to "grow" the alignment
    in both directions.
  • Combining Alignments
  • Now the program attempts to combine multiple
    alignments.
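
Seed growing can be sketched as an ungapped extension in both directions that stops once the running score falls too far below the best score seen. The drop-off limit `x_drop`, the +2/-1 `toy_score` function and the function names are assumptions for illustration, not values or code taken from BLAST.

```python
def toy_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1


def extend_one_way(query, subject, i, j, step, score, x_drop):
    """Extend from (i, j) in direction step (+1 or -1).
    Returns (best_gain, length_of_best_extension)."""
    total = best = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        total += score(query[i], subject[j])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break  # score dropped too far below the best seen
        i += step
        j += step
    return best, best_len


def extend_seed(query, subject, qi, sj, w, score, x_drop=3):
    """Grow a w-letter seed at query[qi], subject[sj] without gaps.
    Returns (score, query_start, length)."""
    seed = sum(score(query[qi + k], subject[sj + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, qi + w, sj + w, 1,
                                  score, x_drop)
    left, l_len = extend_one_way(query, subject, qi - 1, sj - 1, -1,
                                 score, x_drop)
    return seed + left + right, qi - l_len, w + l_len + r_len


hit = extend_seed("GGACGTACAA", "TTACGTACCC", 4, 4, 2, toy_score)
```

Starting from the 2-letter seed GT, the extension recovers the shared segment ACGTAC (query positions 2 to 7) with score 12.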

39
Pair-Wise Alignment: Global and Local Alignments
Part II: Advanced Concepts
58
Three Sequence Alignment
59
Three-Sequence Alignment algorithm (1)
  • The definitions proposed by X. Huang (SAC, 1994):
  • 1-gap block: gap-open penalty is q1 and
    gap-extension penalty is r1
  • 2-gap block: gap-open penalty is q2 and
    gap-extension penalty is r2
  • A triple of 1-gap block
  • A triple of 2-gap block
  • Example: there are three possible forms for 1-gap
    and 2-gap triples
62
Three-Sequence Alignment algorithm (2)
  • Applying Hirschberg's algorithm

63
Open Problems
  • Local three-sequence alignment? (no solution)
  • How to align two whole genome sequences?
  • How to align other types of data, such as
    secondary or 3D structure data (or mixed data)?

64
Extended reading
67
SSEA: http://protein.bio.unipd.it/ssea/
70
CE: http://cl.sdsc.edu/ce.html
71
CE-MC: http://pathway.rit.albany.edu/cemc/
73
PRALINE: http://zeus.cs.vu.nl/programs/pralinewww/
74
Tools and Sequences
  • BioEdit (http://www.mbio.ncsu.edu/bioedit/bioedit.html)
  • MEGA (http://evolgen.biol.metro-u.ac.jp/MEGA/)
  • Sequences
  • U07611
  • JN400599
  • EF126963

75
BioEdit
78
MEGA