Pair-Wise Sequence Alignment Methods and Tools - PowerPoint PPT Presentation

Provided by: Jef5198

Transcript and Presenter's Notes

Title: Pair-Wise Sequence Alignment Methods and Tools

1
Pair-Wise Sequence Alignment Methods and Tools
2
Pair-Wise Alignment: Global and Local Alignments
Part I: Fundamental Concepts
9
Scoring Matrices

Alignment of two sequences is traditionally
scored using values looked up in a weighted
scoring matrix.
The matrices most frequently used for scoring
alignments of amino acid sequences come from the
PAM (Point Accepted Mutation, also read as
Percent Accepted Mutation) and BLOSUM (BLOcks
SUbstitution Matrix) families.
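
A minimal sketch of scoring by matrix lookup. The matrix values (+2 match, -1 mismatch) and the linear gap penalty of -2 are assumed toy numbers, not PAM or BLOSUM entries, and score_alignment is a hypothetical helper name.

```python
# Toy nucleotide scoring matrix (assumed values): +2 for a match, -1 for a mismatch.
BASES = "ACGT"
SCORES = {(a, b): (2 if a == b else -1) for a in BASES for b in BASES}
GAP = -2  # assumed linear gap penalty


def score_alignment(row1, row2):
    """Sum the per-column scores of two aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += SCORES[(a, b)]
    return total


print(score_alignment("ACGT-A", "ACCTGA"))  # 4 matches, 1 mismatch, 1 gap -> 5
```

Swapping SCORES for a real substitution matrix changes only the lookup table, not the summation.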
PAM
The PAM matrices are constructed so that the
highest scoring alignments using PAMn are between
sequences about n PAM units apart. PAM1 is the
base PAM matrix. It is constructed so that
maximal scores are produced for alignments with
only 1% mutation (99% conservation).
To construct the PAMn matrix from PAM1 (M1), M1
is multiplied by itself n times: Mn = (M1)^n.
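
A minimal sketch of this construction, assuming the usual Dayhoff convention that the PAMn mutation probability matrix is the n-th matrix power of M1. The two-letter alphabet and the 0.99/0.01 entries are toy values chosen for illustration; a real PAM1 is a 20 x 20 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return m multiplied by itself n times (identity for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy two-letter "PAM1": each letter is conserved with probability 0.99.
M1 = [[0.99, 0.01],
      [0.01, 0.99]]
M100 = mat_pow(M1, 100)
print(M100[0])  # probabilities of staying / changing after 100 PAM units
```

The rows of the result still sum to 1, since a product of probability matrices is again a probability matrix.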

10
PAM100
11
Scoring Matrices (contd.)

BLOSUM
BLOSUM matrices are based on local multiple
alignments of more distantly related sequences.
Unlike the higher PAM matrices, which are
extrapolated from PAM1, each BLOSUM matrix is
computed directly from observed substitutions.
For the creation of BLOSUM matrices, a database
of multiple alignments without gaps for short
regions of related sequences (blocks) was derived.
Within each alignment, the sequences were
clustered into groups whose members are similar
at or above some threshold percent identity.
Substitution frequencies for all pairs of amino
acids were calculated between the groups, to
create the Blocks Substitution Matrix for that
threshold.
The number associated with the matrix is the
clustering threshold. For example, in BLOSUM50
sequences that are at least 50% identical are
grouped into one cluster, so the counted
substitutions come from sequences that are less
than 50% identical.
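
A hedged sketch of how pair frequencies can be turned into scores, assuming the log-odds form commonly described for BLOSUM (score = 2 * log2(observed / expected), rounded to an integer). The two-letter alphabet and the pair counts are invented toy data, and log_odds_scores is a hypothetical helper name.

```python
from math import log2


def log_odds_scores(pair_counts):
    """Convert unordered pair counts into rounded half-bit log-odds scores."""
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each letter: identical pairs count fully,
    # mixed pairs contribute half to each of their two letters.
    p = {}
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


toy_counts = {("A", "A"): 60, ("A", "B"): 20, ("B", "B"): 20}
print(log_odds_scores(toy_counts))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative ones.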

17
Not only one best path
23
FASTP
FASTA
24
FASTA http://www.ebi.ac.uk/Tools/fasta33/index.html
25
FastP and FastA (1)

FastA is a heuristic that attempts to speed up
sequence comparison relative to the standard
optimal alignment.
Alignment by dynamic programming runs in
quadratic time. FastA uses direct addressing or
k-tuple preprocessing to cut down the dynamic
programming search space significantly. This
results in reduced search time at the expense of
some sensitivity.
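
For reference, a minimal sketch of the quadratic-time baseline that FastA tries to avoid: a Smith-Waterman local alignment score computed over the full matrix. The match, mismatch and gap values are assumed toy parameters with a linear gap penalty.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of s and t; visits all len(s) * len(t) cells."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(0,                  # start a new local alignment
                          prev[j - 1] + sub,  # align s[i-1] with t[j-1]
                          prev[j] + gap,      # gap in t
                          curr[j - 1] + gap)  # gap in s
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # shared "ACG" scores 6
```

Every cell is filled once, so the running time grows with the product of the two sequence lengths.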
The FastA algorithm is implemented in the
following 6 stages
Locate hot spots
Find the 10 best regions in the matrix
Score using a substitution matrix

26
FastP and FastA (2)

Combine initial regions from different diagonals
Optimal alignment
Presentation
Locating Hot spots
FastA allows the specification of a parameter
called ktup. The ktup sets the basic word length
for the comparisons between the query string and
a given string in the database. ktup values are
typically six for DNA sequences and two for
protein sequences. For the DNA case, each word is
represented as a base-4 number that is also the
index into the lookup table.
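
A small sketch of the base-4 encoding, assuming the digit assignment A=0, C=1, G=2, T=3 (the slides do not fix an order). word_to_index is a hypothetical helper name.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed digit assignment


def word_to_index(word):
    """Read a DNA word as a base-4 number, most significant base first."""
    index = 0
    for base in word:
        index = index * 4 + CODE[base]
    return index


print(word_to_index("ACGT"))    # 0*64 + 1*16 + 2*4 + 3 = 27
print(word_to_index("TTTTTT"))  # largest 6-letter word: 4**6 - 1 = 4095
```

With ktup = 6 the table therefore needs 4**6 = 4096 slots, one per possible word.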
The matching ktup-length substrings are referred
to as hot spots. To locate the hot spots, FastA
creates a dictionary of all words of length ktup
that occur in the query sequence.

27
FastP and FastA (3)

Each entry contains the offsets where this
particular combination of ktup letters occurs in
the query sequence. In this way, for each word in
the searched string, only the dictionary needs to
be consulted to determine if and where the word
occurs in the query string.
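
The dictionary lookup described above can be sketched as follows. The function names and the (diagonal, query offset, subject offset) tuple layout are illustrative assumptions. The diagonal is taken as query offset minus subject offset, so hot spots on the same diagonal share that value.

```python
from collections import defaultdict


def build_word_table(query, ktup):
    """Map every ktup-length word of the query to the offsets where it occurs."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def find_hot_spots(query, subject, ktup):
    """Return (diagonal, query_offset, subject_offset) for every shared word."""
    table = build_word_table(query, ktup)
    hot_spots = []
    for j in range(len(subject) - ktup + 1):
        # One dictionary lookup per subject word; no rescanning of the query.
        for i in table.get(subject[j:j + ktup], ()):
            hot_spots.append((i - j, i, j))
    return hot_spots


for spot in find_hot_spots("ACGTAC", "GTACG", 2):
    print(spot)
```

In this example three hot spots fall on diagonal 2 and two on diagonal -2, which is the grouping the next stage works with.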
Finding the 10 Best Regions
A region is a sequence of consecutive hot spots
on the same diagonal. Spaces between the hot
spots are permitted.
FastA ranks regions by giving each hot spot a
positive score. The intervening space between
consecutive hot spots is given a negative
score. The larger the gap, the more severe the
penalty.
The score of the diagonal run is the sum of the
hot spot scores and the interspot penalties.
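
A minimal sketch of this run score. The hot spot score of 5 and the penalty of 1 per intervening position are assumed values, and run_score is a hypothetical helper that takes the query offsets of the hot spots on one diagonal.

```python
def run_score(offsets, ktup, hot_score=5, gap_penalty=1):
    """Score a run of hot spots that all lie on the same diagonal."""
    score = 0
    prev_end = None
    for start in sorted(offsets):
        if prev_end is not None:
            # Penalty grows with the space between consecutive hot spots;
            # overlapping or adjacent hot spots incur no penalty.
            score -= gap_penalty * max(0, start - prev_end)
        score += hot_score
        prev_end = start + ktup
    return score


print(run_score([0, 2, 7], ktup=2))  # 5 + 5 - 3 + 5 = 12
print(run_score([0, 10], ktup=2))    # 5 - 8 + 5 = 2
```

A wide enough gap can make a second hot spot cost more than it adds, which is why distant hot spots end up in separate regions.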

28
FastP and FastA (4)

Scoring with Substitution Matrix
FastA next applies a substitution matrix to the
10 best regions found above. The substitution
matrix may be amino acid or nucleotide based.
This step allows different matches to be weighted
differently.
The single best subalignment found after the
application of the substitution matrix is termed
init1.
Combining Initial Regions from Different
Diagonals
In this step, FastA checks to see if any of the
initial regions from different diagonals may be
combined to form a new higher scoring region.
The score for the combined regions is the sum of
the scores of the contributing regions less a
joining penalty for each join.
The score for the highest scoring region after
this step is termed initn.

29
FastP and FastA (5)

This step can be implemented using directed
weighted graphs where the vertices are the
subalignments from the last stage.
The maximum weight path gives the initn
alignment.
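
One way to sketch the maximum weight path, assuming each region is a tuple (query start, query end, subject start, subject end, score) with exclusive ends, that one region may follow another only if it starts after the other ends in both sequences, and that every join costs a constant penalty of 3. These conventions are illustrative assumptions, not FastA's actual parameters.

```python
def initn_score(regions, join_penalty=3):
    """Best total score of a chain of compatible regions (simple DAG dynamic program)."""
    regions = sorted(regions)  # by query start, so predecessors come first
    best = []                  # best[k] = best chain score ending in region k
    for k, (q_start, _, s_start, _, score) in enumerate(regions):
        best_k = score  # chain consisting of this region alone
        for p in range(k):
            _, p_q_end, _, p_s_end, _ = regions[p]
            if p_q_end <= q_start and p_s_end <= s_start:
                best_k = max(best_k, best[p] + score - join_penalty)
        best.append(best_k)
    return max(best, default=0)


regions = [(0, 10, 0, 10, 20), (12, 20, 15, 23, 15), (5, 9, 30, 34, 8)]
print(initn_score(regions))  # joins the first two regions: 20 + 15 - 3 = 32
```

The third region cannot be joined to either of the others because it overlaps the first in the query and precedes the second there while following it in the subject.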
Optimal Alignment
In addition to initn, FastA computes an
alternative local alignment score opt. This
score is obtained by considering a narrow
diagonal band in the dynamic programming matrix,
centered along the init1 diagonal.
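
A hedged sketch of a banded local alignment score. The band is described by its center diagonal (query index minus subject index) and a half-width, and the scoring values are assumed toy parameters with a linear gap penalty. Cells outside the band are left at 0, which for local alignment is the same as not being able to pass through them.

```python
def banded_local_score(s, t, diagonal, width, match=2, mismatch=-1, gap=-2):
    """Local alignment score restricted to cells with |(i - j) - diagonal| <= width."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        lo = max(1, i - diagonal - width)        # first column inside the band
        hi = min(len(t), i - diagonal + width)   # last column inside the band
        for j in range(lo, hi + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(banded_local_score("AAAACCCC", "AAAAGCCCC", diagonal=0, width=1))  # 14
print(banded_local_score("AAAACCCC", "AAAAGCCCC", diagonal=0, width=0))  # 13
```

With width 1 the band is wide enough to hold the single gap opposite the G. With width 0 only the main diagonal is available, so the G is absorbed as a mismatch.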
Presentation
Finally, the database sequences are ranked by
either the opt or initn score, and the highest
ranking sequences are run through a full
Smith-Waterman alignment.

32
BLAST http://blast.ncbi.nlm.nih.gov/Blast.cgi
33
BLAST (1)

The Basic Local Alignment Search Tool (BLAST)
program uses a heuristic algorithm to search for
local alignments of a query string against a
BLAST-formatted database. It directly
approximates alignments that optimize a measure
of local similarity, the maximal segment pair
(MSP) score.
It is reported to run 100 times faster than a
serial Smith-Waterman search. There is a
different BLAST version for each of the
combinations of query types and database types.

34
BLAST (2)

The BLAST database consists of three files for
every FASTA-format input file.
The first contains all of the sequence headers,
textual information about the amino acid or
nucleotide sequence.
The second contains the compressed sequences (2
bits for each nucleotide, 5 bits for each amino
acid).
The third file contains an index of the
compressed sequences so that they can be matched
with the corresponding headers.
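
To illustrate the 2 bits per nucleotide figure only, here is a sketch that packs four bases into each byte. The code assignment (A=0, C=1, G=2, T=3) and the byte layout are assumptions made for this example. It is not a reader or writer for the actual BLAST database files.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit codes
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= CODE[base] << (2 * (3 - i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(BASES[(packed[i // 4] >> (2 * (3 - i % 4))) & 3]
                   for i in range(length))


data = pack("ACGTAC")
print(len(data))        # 6 bases fit in 2 bytes
print(unpack(data, 6))  # ACGTAC
```

The sequence length has to be stored separately, because the padding bits in the last byte are indistinguishable from the code for A.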
The program runs in 3 rounds.
Database Scanning (table search or Finite state
machine)
Seed Growing
Combining Alignments

35
BLAST (3)

Database Scanning
BLAST scans the database sequentially for short
words (k-tups) of length W (Word Size) that score
higher than T (Neighborhood Word Score Threshold)
against a word of the query string.
In BLAST 1.4, W is usually 3 amino acids or 11
nucleotides, and in BLAST 2.0 this is usually 2
amino acids or 2 sets of 5 nucleotides that are
not contiguous.
Once the tuple size has been selected, scanning
may be accomplished in either of two ways.
The first maps each k-tup of length W to a
unique integer that indexes a lookup table.
The second approach for scanning a database is to
construct a deterministic finite automaton (DFA)
based on the query words to scan the database.
This second approach was chosen since it saved on
time and space. The list of words that score
above T is called the neighborhood. This
information is stored in an index for the next
round.
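
A minimal sketch of building the neighborhood of one query word. The four-letter alphabet and toy_score (4 for identity, 2 between the hydrophobic letters I, L and V, -2 otherwise) are invented stand-ins for a real substitution matrix, and the enumeration is brute force rather than the table or DFA approach.

```python
from itertools import product

ALPHABET = "ILVK"          # toy alphabet, not the 20 amino acids
HYDROPHOBIC = set("ILV")


def toy_score(a, b):
    """Assumed toy substitution scores."""
    if a == b:
        return 4
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 2
    return -2


def neighborhood(word, threshold):
    """All words of the same length that score higher than threshold against word."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score > threshold:
            hits.append("".join(candidate))
    return hits


print(neighborhood("IL", threshold=5))  # ['II', 'IL', 'IV', 'LL', 'VL']
```

Raising T shrinks the neighborhood toward the exact word, which speeds up the scan but can miss weaker similarities.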

36
BLAST (4)

Seed Growing
In this round the matches found in the first
round are used as "seeds" to "grow" the alignment
in both directions.
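
A hedged sketch of growing a seed without gaps. The stopping rule used here, giving up once the running score falls more than x_drop below the best score seen, is an assumption for illustration, as are the scoring values and the function name extend_seed.

```python
def extend_seed(query, subject, q_pos, s_pos, length,
                match=2, mismatch=-1, x_drop=3):
    """Extend an ungapped seed in both directions; return (score, q_start, q_end)."""
    def sub(a, b):
        return match if a == b else mismatch

    seed = sum(sub(query[q_pos + k], subject[s_pos + k]) for k in range(length))

    def grow(step, q, s):
        best = running = best_len = steps = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += sub(query[q], subject[s])
            steps += 1
            if running > best:
                best, best_len = running, steps
            elif best - running > x_drop:
                break  # score has dropped too far below the best seen
            q += step
            s += step
        return best, best_len

    right, right_len = grow(+1, q_pos + length, s_pos + length)
    left, left_len = grow(-1, q_pos - 1, s_pos - 1)
    return seed + left + right, q_pos - left_len, q_pos + length + right_len


score, start, end = extend_seed("GGACGTACTT", "CCACGTACAA", 3, 3, 3)
print(score, start, end)  # seed "CGT" grows to "ACGTAC": 12 2 8
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches are trimmed from the reported segment.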
Combining Alignments
Now the program attempts to combine multiple
alignments.

39
Pair-Wise Alignment: Global and Local Alignments
Part II: Advanced Concepts
58
Three Sequence Alignment
59
Three-Sequence Alignment algorithm (1)

The definitions proposed by X. Huang (SAC, 1994).

1-gap block: gap-open penalty is q1 and
gap-extension penalty is r1
2-gap block: gap-open penalty is q2 and
gap-extension penalty is r2
a triple of 1-gap block
a triple of 2-gap block
Example
There are three possible forms for 1-gap and 2-gap
triple
62
Three-Sequence Alignment algorithm (2)

Applying Hirschberg's algorithm
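
The slide gives no detail, so the following is only a sketch of the core idea behind Hirschberg's algorithm in its familiar two-sequence form with a linear gap penalty (the three-sequence, affine-gap version is not reproduced here). Scores are computed keeping only one row at a time, and a forward and a backward pass locate the point where an optimal alignment crosses the middle row. All scoring values are assumed toy parameters.

```python
def nw_last_row(s, t, match=1, mismatch=-1, gap=-2):
    """Last row of the global alignment score matrix, using O(len(t)) memory."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def hirschberg_split(s, t):
    """Return (mid, j, score): an optimal alignment pairs s[:mid] with t[:j]."""
    mid = len(s) // 2
    forward = nw_last_row(s[:mid], t)
    backward = nw_last_row(s[mid:][::-1], t[::-1])
    totals = [forward[j] + backward[len(t) - j] for j in range(len(t) + 1)]
    best_j = max(range(len(t) + 1), key=totals.__getitem__)
    return mid, best_j, totals[best_j]


print(hirschberg_split("ACGT", "ACGT"))  # (2, 2, 4)
```

Applying the split recursively to the two halves recovers the full alignment while never storing more than two rows of the matrix.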

63
Open Problems

Local three-sequence alignment? (no solution)
How to align two whole genome sequences?
How to align other types of data, such as
secondary or 3D structure data? (or mixed data)

64
Extended reading
67
SSEA http://protein.bio.unipd.it/ssea/
70
CE http://cl.sdsc.edu/ce.html
71
CE-MC http://pathway.rit.albany.edu/cemc/
73
PRALINE http://zeus.cs.vu.nl/programs/pralinewww/
74
Tools and Sequences