Title: Algorithms for Local Sequence Alignments BLAST, FASTA
1Algorithms forLocal Sequence AlignmentsBLAST,
FASTA
- A. Brüngger, Labhead BioinformaticsNovartis
Pharma AG - adrian.bruengger_at_pharma.novartis.com
2Algorithms for Local Sequence Alignments BLAST,
FASTA
- Sequence Similarity and Homology
- Origins of homology
- Sequence alignment
- Global Alignment
- Local Alignment
- Content of Sequence DBs
- GenBank, SwissProt, RefSeq
- Size of sequence DB requires special search tools
- Algorithms for searching Sequence Databases
- Basics of sequence DB searches
- Efficient detection of identical k-mers
- BLAST2 improvements
- Statistical significance of hits
Outline follows David W. Mount, "Bioionformatics
- Sequence and Genome Analysis Cold Spring
Harbour Laboratory Press, 2001. Online
http//www.bioinformaticsonline.org
3Rational for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function
Sequence Analysis as the basic tool to discover
functional, structural, evolutionary information
in biological sequences
Sequence A
Sequence B
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x y). Usually, the ancestor sequence
is not available, only (x y) can be computed.
y Steps
x Steps
common ancestor sequence
4Origins of Homology ? Significance of Sequence
Alignments
- Possible Origins of Sequence Homology
- orthologs (panel A and B) a1 in species I and a1
in species II (same ancestor!) - paralogs (panel A and B) a1 and a2 (arose from
gene duplication event) - analogs (panel C) different genes converge to
same function by different evolutionary paths - transfer of genetic material (panel D) between
different species - Homology vs. Similarity
- Similarity can be computed (by sequence
alignments) - Homology is deduced (e.g. from similarity, but
also from other evidence!)
5Definition of Sequence Alignment
- Computational procedure (algorithm) for
comparing two/many sequences - identify series of identical residues or patterns
of identical residuesthat appear in the same
order in the sequences - visualized by writing sequences as follows
- sequence alignment is an optimiztion
problembringing as many identical residues as
possible into corresponding positions
MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Global Alignment (over whole length of
sequences)
GKG GKG
Pairwise Local Alignment (similar parts of
sequences)
6Content of Sequence Databases
- Sequencing efforts during the last 15 years led
to a wealth of sequence DBs - GenBank (NCBI)
- Direct submission of DNA/RNA sequences by the
researchers - Uncurated varying quality of sequences (ESTs,
mRNAs, genomic DNA) - Entries are hardly ever changed (even if there
are obvious mistakes!) - Highly redundant
- 30'000'000 sequences
- SwissProt (EBI, SIB)
- Protein database
- curated ongoing effort to improve data quality
by human curation - annotations controlled vocabulary, structured
information - minimally redundant
- 30'000 sequences
- RefSeq (NCBI)
- manually and computationally annotated/curated
set of genes, mRNAs, and proteins - minimally redundant
- capture relationship between DNA, RNA, Proteins
(splice variants, SNPs etc.) - 100'000 sequences
7Growth of DBs requires specific algorithms to
search
- Given my sequence of interest ("query")
- Is the query contained in a sequence DB?
- Are there orthologs, paralogs, analogs to the
query in a sequence DB? - Are there sequences sharing high/medium/low
degree of similarity with query parts? - Are there other sequence variants (splice
variants, SNPs) in the DB?
8Searching a sequence DB with a query sequence
- Approach 1
- global/local pairwise sequence alignment of query
with each DB sequence - dynamic programming requires O(nm) space and
time - space/time complexity such that resulting
computation-time is prohibitive - Smith-Waterman or alike not feasible!
- Approach 2
- fast identification of identical subsequences in
DB ("seeds") - extension around seed to construct local
alignment - General difficulty
- small query, large database
- in some cases, an identified "hit" may happen by
pure chance - assign statistical significance to a "hit"
- Example
- DB human genomic DNA, 3109 b
- query tggtacaaatgttct (glucocorticoid response
element GRE)
9Basics of Sequence DB Searches Detection of
identical k-mers
- Idea identify identical k-mers in DB and q
(seed) expand alignment from seed in both
directions - Example
q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQM
AQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS
...
MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYE
TQLRRYINTLTRPRYGKRAEEENTGGLP...
- BLAST
- HSP, high scoring pair
- gapped alignment
- starting extension also from similar (and not
only identical) seeds
10Basics of Sequence DB Searches Detection of
identical k-mers
- Precompute position of all k-mers in DB sequence
- Indexing all peptides of length k in database
- Example
0 1 2 3 1234567890123456789
012345678901234 MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
MAAAR AAARL AARLC ....
APLEP Sorted AAARL 2
AARLC 3 APLEP 30 ... MAAAR 1 ...
VALLL 17
- For each peptide of length k in the query,
identical peptides in "database" are detected
efficiently (binary search in sorted list) - For identical pairs, extension step in both
directions is performed
11Basics of Sequence DB Searches Detection of
identical k-mers
- Indexing all peptides of length k in database
some refinements
0 1 2 3 1234567890123456789
012345678901234 MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers
to previous occurrences List of all words of
length k and last occurrence in query AAAAA
- AAAAC - AAAAD - .... AARLC 3
.... APLEP 30 ... ... VALLL 17 ...
Simple, yet efficient data-structure - array of
integers (sizelength of db) - array of integers
(sizenumber of words with length
k) Book-keeping - more than one db/query
sequence - build database chunks that fit into
main memory(speeds up computation
1000x) Extension step optimizations
- For each peptide of length k in the query, the
position in the wordlist can be easily computed
(no binary search!)
12Improvement of sensitivity/selectivity in BLAST
A W T V A S A V R T S I
- (optional) filtering for low complexity region in
query - all query words of length 3 are listed
- to each word, 50 'high scoring' additional words
are added - matching words are identified in DB (as described
before) - ungapped alignment constructed from word matches,
'HSP' - statistics determines, whether HSP is significant
- SW-alignment for significant HSPs
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS AWA
IAS TVR ... TWA LAS AIR ... ART ITS AVS ... ...
... ...
13An example comparing mus/rat/hum chr X
Each dot conserved stretch of AA, HSP, high
scoring pair Sequence lengths gt 140 M bp
14Significance of matches DNA case
- issue searching with short query vs. large
database? found match could have occurred by
pure chance - assume equal distribution of c,g,a,t
- what is ...
- the probability p, that sequence q (lenm) is
contained in sequence t (lenn)? - the expected length of the longest common
subsequence of two sequences? - the expected score of the best local alignment of
two sequences? - the expected score distribution when locally
aligning two seqeuences? - Example
- s tggtacaaatgttct (glucocorticoid response
element GRE) - t 10000 bp (promoter, upstream DNA to start
codon) - if promoter sequence was random, how often do we
expect to find a GRE? - Probability that q (lenm) is contained in t
(lenn) - a. There are (n-m) 'words' of length m in
sequence A - b. In total, there are 4m sequences of length m
- c. p (n-m) / 4m
This is wrong. Why?
15Conclusions and Outlook
- types of sequence alignments pairwise, multiple,
query vs. database - local and global alignment
- optimal alignment in practical terms only
feasible for pairwise alignment - sequence database searches
- has become the single most important tool in
sequence analysis - basic tool for hypothesis building about function
of unknown sequence - basic algorithm to identify local alignments in
sequence databases - efficiently find occurrence of query k-mers in db
sequences (seeds) - expand (ungapped or gapped) HSP from seeds
- attach statistical significance
- outlook
- some more about statistical significance
- multiple sequence alignments
- profile based sequence searches
- construction of phylogenetic trees from sequence
alignments