Title: Homology Search Tools
1Homology Search Tools
- Kun-Mao Chao
- Department of Computer Science and Information Engineering
- National Taiwan University, Taiwan
- WWW: http://www.csie.ntu.edu.tw/kmchao
2Homology Search Tools
- Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
- FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
- BLAST (Altschul et al., 1990; Altschul et al., 1997)
- BLAT (Kent, 2002)
- PatternHunter (Li et al., 2004)
3Finding Exact Word Matches
- Hash Tables
- Suffix Trees
- Suffix Arrays
4Hash Tables
5Suffix Trees (I)
6Suffix Trees (II)
7Suffix Arrays
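The slide body did not survive extraction, so the following is only a minimal sketch of how a suffix array can report exact word matches. The function names are hypothetical, and sorting the suffixes directly is chosen for brevity, not speed.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Sorting the suffixes directly is simple but not the fastest construction.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, word):
    """Return the sorted 0-based positions where word occurs in text."""
    k = len(word)
    # Binary search for the first suffix whose k-letter prefix is >= word.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + k] < word:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Binary search for the first suffix whose k-letter prefix is > word.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + k] <= word:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes in suffix_array[start:lo] begin with word.
    return sorted(suffix_array[start:lo])


text = "AGATCGAT"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "GAT"))
```

All suffixes that begin with the query word sit next to each other in the sorted order, so two binary searches are enough to find them.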
8FASTA
- Find runs of identities, and identify regions with the highest density of identities.
- Re-score using a PAM matrix, and keep the top-scoring segments.
- Eliminate segments that are unlikely to be part of the alignment.
- Optimize the alignment in a band.
9FASTA
Step 1: Find runs of identities, and identify regions with the highest density of identities.
(Figure: axes labeled Sequence A and Sequence B.)
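Step 1 can be illustrated by counting shared k-tuples on each diagonal of the Sequence A versus Sequence B matrix. This is a simplified sketch, not the FASTA implementation: the function names, the default `k=2`, and the test sequences are assumptions, and FASTA itself scores runs along a diagonal, not just raw counts.

```python
from collections import defaultdict


def diagonal_identity_counts(seq_a, seq_b, k=2):
    """Count shared k-tuples on each diagonal d = i - j.

    i indexes seq_a and j indexes seq_b (both 0-based). A diagonal with many
    shared k-tuples is a region with a high density of identities.
    """
    positions_in_a = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        positions_in_a[seq_a[i:i + k]].append(i)

    counts = defaultdict(int)
    for j in range(len(seq_b) - k + 1):
        for i in positions_in_a.get(seq_b[j:j + k], []):
            counts[i - j] += 1
    return dict(counts)


def best_diagonals(seq_a, seq_b, k=2, top=3):
    """Return the `top` diagonals as (diagonal, count), densest first."""
    counts = diagonal_identity_counts(seq_a, seq_b, k)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]


print(best_diagonals("AGATCGAT", "GATCGA"))
```

The densest diagonals are the candidates that the later steps re-score and join.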
10FASTA
Step 2: Re-score using a PAM matrix, and keep the top-scoring segments.
11FASTA
Step 3: Eliminate segments that are unlikely to be part of the alignment.
12FASTA
Step 4: Optimize the alignment in a band.
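A band restricts dynamic programming to the cells near one diagonal. The sketch below computes a local alignment score inside such a band. The function name, the linear gap penalty, and the scoring values are assumptions for illustration (FASTA uses a PAM matrix for proteins), and only the score is returned, not the alignment itself.

```python
def banded_local_score(seq_a, seq_b, center=0, width=2,
                       match=5, mismatch=-4, gap=-6):
    """Best local alignment score using only cells with |(i - j) - center| <= width.

    i and j are 1-based positions in seq_a and seq_b. Cells outside the band
    are treated as unreachable.
    """
    minus_inf = float("-inf")
    best = 0
    prev = {}  # scores of row i - 1, keyed by column j
    for i in range(1, len(seq_a) + 1):
        cur = {}
        j_lo = max(1, i - center - width)
        j_hi = min(len(seq_b), i - center + width)
        for j in range(j_lo, j_hi + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score = max(
                0,                                 # start a new local alignment
                prev.get(j - 1, 0) + pair,         # diagonal move, stays on the same diagonal
                prev.get(j, minus_inf) + gap,      # letter of seq_a aligned to a gap
                cur.get(j - 1, minus_inf) + gap,   # letter of seq_b aligned to a gap
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


print(banded_local_score("ACGTACGT", "ACGTTACGT", center=0, width=1))
```

With `width=0` the band collapses to a single diagonal and no gaps are possible. A wider band allows more gaps at a cost proportional to the band width times the sequence length.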
13BLAST
- Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
- The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
14The maximal segment pair measure
- A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from two sequences. (For DNA: identities +5, mismatches -4.)
- The MSP score may be computed in time proportional to the product of the two sequence lengths. (How?) An exact procedure is too time-consuming.
- BLAST heuristically attempts to calculate the MSP score.
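One way to answer the "How?" above: a segment pair has no gaps, so it lies on a single diagonal, and the best segment on a diagonal is a maximum-sum run of its match and mismatch scores. Scanning every diagonal once takes time proportional to the product of the lengths. The sketch below uses the DNA scores from the slide; the function name and return format are assumptions.

```python
def msp_score(seq_a, seq_b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length) of a maximal segment pair.

    Positions are 0-based. Every cell of the len(seq_a) x len(seq_b) matrix
    is visited exactly once, one diagonal at a time.
    """
    best = (0, 0, 0, 0)
    m, n = len(seq_a), len(seq_b)
    for d in range(-(n - 1), m):
        # First cell of diagonal d, where d = i - j.
        i, j = (d, 0) if d >= 0 else (0, -d)
        run_score, run_start = 0, (i, j)
        while i < m and j < n:
            if run_score <= 0:
                # A non-positive prefix never helps, so restart the run here.
                run_score, run_start = 0, (i, j)
            run_score += match if seq_a[i] == seq_b[j] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start[0], run_start[1],
                        i - run_start[0] + 1)
            i += 1
            j += 1
    return best


print(msp_score("AGATCGAT", "GATCGA"))
```

This is exact, but running it against every sequence in a large database is the cost that BLAST's heuristic avoids.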
15A matrix of similarity scores
16A maximum-scoring segment
17BLAST
- Build the hash table for Sequence A.
- Scan Sequence B for hits.
- Extend hits.
18BLAST
Step 1: Build the hash table for Sequence A. (3-tuple example)

For protein sequences: Seq. A = ELVIS
- Add xyz to the hash table if Score(xyz, ELV) ≥ T
- Add xyz to the hash table if Score(xyz, LVI) ≥ T
- Add xyz to the hash table if Score(xyz, VIS) ≥ T

For DNA sequences:
Seq. A = AGATCGAT
         12345678

Word   Positions
AAA
AAC
...
AGA    1
...
ATC    3
...
CGA    5
...
GAT    2 6
...
TCG    4
...
TTT
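The DNA case above can be reproduced with a dictionary that maps each 3-letter word to its 1-based start positions. This is a minimal sketch with a hypothetical function name. It stores only the words that occur, and it does not cover the protein case, which additionally needs a scoring matrix and the threshold T.

```python
from collections import defaultdict


def build_word_table(seq, k=3):
    """Map each k-letter word of seq to the list of its 1-based start positions."""
    table = defaultdict(list)
    for start in range(len(seq) - k + 1):
        table[seq[start:start + k]].append(start + 1)
    return dict(table)


table = build_word_table("AGATCGAT")
for word in sorted(table):
    print(word, table[word])
```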
19BLAST
Step 2: Scan Sequence B for hits.
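Scanning amounts to looking up every word of Sequence B in the table built from Sequence A. The sketch below assumes exact 3-letter DNA words. The function name and the Sequence B used in the example are made up for illustration.

```python
from collections import defaultdict


def find_hits(seq_a, seq_b, k=3):
    """Return (pos_a, pos_b) pairs, 1-based, where the same k-letter word starts."""
    # Word table for Sequence A.
    table = defaultdict(list)
    for start in range(len(seq_a) - k + 1):
        table[seq_a[start:start + k]].append(start + 1)

    # Slide a k-letter window along Sequence B and look each word up.
    hits = []
    for start in range(len(seq_b) - k + 1):
        for pos_a in table.get(seq_b[start:start + k], []):
            hits.append((pos_a, start + 1))
    return hits


print(find_hits("AGATCGAT", "CGATT"))
```

Each hit is a pair of positions, so it also identifies a diagonal (pos_a minus pos_b) along which an extension can proceed.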
20BLAST
Step 2: Scan Sequence B for hits.
Step 3: Extend hits.
BLAST 2.0 reduces the time spent in extension, and considers gapped alignments.
Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)
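The termination rule can be sketched as an ungapped extension that stops once the running score falls more than a fixed amount below the best score seen so far. The function name, the parameter `x_drop`, and its value are assumptions for illustration. The match and mismatch scores are the DNA values given earlier.

```python
def extend_hit(seq_a, seq_b, pos_a, pos_b, k=3,
               match=5, mismatch=-4, x_drop=10):
    """Extend a k-letter hit without gaps in both directions.

    pos_a and pos_b are the 0-based start positions of the hit. Returns
    (score, start_a, start_b, length) of the best segment pair found.
    """
    def pair(i, j):
        return match if seq_a[i] == seq_b[j] else mismatch

    def extend(i, j, step):
        best = score = 0
        best_len = length = 0
        while 0 <= i < len(seq_a) and 0 <= j < len(seq_b):
            score += pair(i, j)
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # the score has faded too far below the best so far
            i += step
            j += step
        return best, best_len

    seed = sum(pair(pos_a + t, pos_b + t) for t in range(k))
    right, right_len = extend(pos_a + k, pos_b + k, +1)
    left, left_len = extend(pos_a - 1, pos_b - 1, -1)
    return (seed + left + right, pos_a - left_len, pos_b - left_len,
            left_len + k + right_len)


print(extend_hit("CCAGATCGATGG", "TTAGATCGATAA", 5, 5))
```

A small `x_drop` stops early and may miss a later high-scoring stretch. A large one explores further at a higher cost.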
21Gapped BLAST (I)
The two-hit method
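The slide gives only the name, so the following is a sketch based on the published description of Gapped BLAST (Altschul et al., 1997): an extension is triggered only when two non-overlapping hits lie on the same diagonal within a fixed distance of each other. The function name, the `window` value, and the example hits are assumptions.

```python
def two_hit_triggers(hits, k=3, window=40):
    """Return the hits that should trigger an extension.

    hits are (pos_a, pos_b) word hits. A hit triggers when an earlier,
    non-overlapping hit lies on the same diagonal at most `window` away.
    """
    last_on_diagonal = {}  # diagonal -> pos_b of the most recent hit kept
    triggers = []
    for pos_a, pos_b in sorted(hits, key=lambda hit: hit[1]):
        diagonal = pos_a - pos_b
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = pos_b - previous
            if distance < k:
                continue  # overlaps the earlier hit, so ignore this one
            if distance <= window:
                triggers.append((pos_a, pos_b))
        last_on_diagonal[diagonal] = pos_b
    return triggers


print(two_hit_triggers([(1, 1), (2, 2), (10, 10), (60, 60), (5, 1)]))
```

Requiring two hits means that far fewer extensions are attempted than when every single hit is extended.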
22Gapped BLAST (II)
Confining the dynamic-programming
23BLAT
24PatternHunter (I)
25PatternHunter (II)
26Remarks
- Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments.
- The idea of filtration was used in FASTA, BLAST, BLAT, and PatternHunter.