Title: Algorithms for Local Sequence Alignments BLAST, FASTA
1Algorithms forLocal Sequence AlignmentsBLAST,
FASTA
- A. Brüngger, Discovery InformaticsBasilea
Pharmaceutica Ltd. - adrian.bruengger_at_basilea.com
2Algorithms for Local Sequence Alignments BLAST,
FASTA
- Sequence Similarity and Homology
- Origins of homology
- Sequence alignment
- Global Alignment
- Local Alignment
- Scoring Systems for Proteins
- Content of Sequence DBs
- GenBank, SwissProt, UniProt, RefSeq
- Size of sequence DB requires special search tools
- Algorithms for searching Sequence Databases
- Basics of sequence DB searches
- Efficient detection of identical k-mers
- BLAST2 improvements
- Statistical significance of hits
Outline follows David W. Mount, "Bioionformatics
- Sequence and Genome Analysis Cold Spring
Harbour Laboratory Press, 2001. Online
http//www.bioinformaticsonline.org
3Rational for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function
Sequence Analysis as the basic tool to discover
functional, structural, evolutionary information
in biological sequences
Sequence A
Sequence B
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x y). Usually, the ancestor sequence
is not available, only (x y) can be computed.
y Steps
x Steps
common ancestor sequence
4Origins of Homology ? Significance of Sequence
Alignments
- Possible Origins of Sequence Homology
- orthologs (panel A and B) a1 in species I and a1
in species II (same ancestor!) - paralogs (panel A and B) a1 and a2 (arose from
gene duplication event) - analogs (panel C) different genes converge to
same function by different evolutionary paths - transfer of genetic material (panel D) between
different species - Homology vs. Similarity
- Similarity can be computed (by sequence
alignments) - Homology is deduced (e.g. from similarity, but
also from other evidence!)
5Definition of Sequence Alignment
- Computational procedure (algorithm) for
comparing two/many sequences - identify series of identical residues or patterns
of identical residuesthat appear in the same
order in the sequences - visualized by writing sequences as follows
- sequence alignment is an optimiztion
problembringing as many identical residues as
possible into corresponding positions
MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Global Alignment (over whole length of
sequences)
GKG GKG
Pairwise Local Alignment (similar parts of
sequences)
6Assigning Scores Quantify the Quality of an
Alignment
- Simple scoring system (e.g. for DNA sequences)
- two identical residues are aligned 1
- two non-identical residues are aligned -1
- when a gap is introduced -1
- Total score of alginment is the sum of the scores
in each position - Alignment is optimal when assigned score is
maximal - Note Scoring scheme defines optimal alignment
agaag-tagattcta aggaggtag-tt-ta
- Compute Score
- 11 matches
- 1 mismatch
- 3 gaps
- Score 11 - 1 -3 7
7Scoring Systems for Proteins (1)
- Observation Protein (Function, Structure) is
preserved when substitutions in certain AAs
happen - scoring system less simple in case of proteins
- Example The same protein in three different
species
A W T V A S A V R T S I A Y T V A A A V R T S I A
W T V A A A V L T S I
Organism A Organism B Organism C
A
B
C
Therefore Assigning L to R scores higher than
assigning it to another AA
W to Y
L to R
8Scoring Systems for Proteins (2)
- Generalization Dayhoff PAM Weight
Matrices(percent accepted mutation) - observed mutations during a unit of evolutionary
time(1 PAM time that is required that an AA is
changed with probability 1 ) - assumption mutation at a certain site
independent of previous changes in the site(-gt
Markov Model) - extrapolation to longer evolutionary distances
possible - initial PAM matrix (PAM1) derived from 1572
changes in 71 groups of protein sequences that
were at least 85 similar - observed in proteins that have the same function
-gt accepted mutations
C S T P .. C fcc fcs fct fcp S fsc
fss fst fsp . .
9Scoring Systems for Proteins (3)
- From PAM1 to PAM60 to PAM250
- multiplying the PAM1 matrix with itself!
- from matrices, percentage identity canbe
computed - PAM60 60PAM80 50PAM120 40PAM250
20 - Convenience for sequence alignments
- log odds scores
- can be summed up for aligned AA-pairs in order to
get score
M
R
K
PAM1 p(HM) 0 PAM2 p(HM) p(HK)p(KM) p(HR
)p(RM)
H
p (AA-pair is found in alignment of homologuous
proteins) p (AA-pair is found in alignment of
unrelatede proteins by chance)
log
10Scoring Systems for Proteins (4)
- Chosing the 'best' PAM matrix for sequence
searches - ungapped alignment scores are maximal, when
'correct' PAM matrix is used! - using PAM200 90 of this maximum score for
ungapped alignments of length 16-62 (Altschul
1991) - BLOSUM Using large set of diverse proteins
- Blocks Amino Acid Substitution Matrices BLOSUM
2000 conserved AA-patterns ("Blocks") in more
than 500 families (family signatures, Prosite
patterns Bairoch) - patterns with 60 identity led to BLOSUM60, 80
gt BLOSUM80 - Empirical tests, whether matrix finds family
members in DBs gt BLOSUM62 - Which one to use?
- BLOSUM for conserved domains (regardless
evolutionary distance) - PAM for "evolutionary origin"
11Content of Sequence Databases
- Sequencing efforts during the last 15 years led
to a wealth of sequence DBs - GenBank (NCBI)
- Direct submission of DNA/RNA sequences by the
researchers - Uncurated varying quality of sequences (ESTs,
mRNAs, genomic DNA) - Entries are hardly ever changed (even if there
are obvious mistakes!) - Highly redundant
- 40'000'000 sequences (40000000000 bp)
- SwissProt, UniProt (EBI, SIB)
- Protein database
- curated ongoing effort to improve data quality
by human curation - annotations controlled vocabulary, structured
information - minimally redundant
- 30'000 sequences
- RefSeq (NCBI), Gene
- manually and computationally annotated/curated
set of genes, mRNAs, and proteins - minimally redundant
- capture relationship between DNA, RNA, Proteins
(splice variants, SNPs etc.) - 100'000 sequences
12Growth of DBs requires specific algorithms to
search
- Given my sequence of interest ("query")
- Is the query contained in a sequence DB?
- Are there orthologs, paralogs, analogs to the
query in a sequence DB? - Are there sequences sharing high/medium/low
degree of similarity with query parts? - Are there other sequence variants (splice
variants, SNPs) in the DB?
13Searching a sequence DB with a query sequence
- Approach 1
- global/local pairwise sequence alignment of query
with each DB sequence - dynamic programming requires O(nm) space and
time - space/time complexity such that resulting
computation-time is prohibitive - Smith-Waterman or alike not feasible!
- Approach 2
- fast identification of identical subsequences in
DB ("seeds") - extension around seed to construct local
alignment - General difficulty
- small query, large database
- in some cases, an identified "hit" may happen by
pure chance - assign statistical significance to a "hit"
- Example
- DB human genomic DNA, 3109 b
- query tggtacaaatgttct (glucocorticoid response
element GRE)
14Basics of Sequence DB Searches Detection of
identical k-mers
- Idea identify identical k-mers in DB and q
(seed) expand alignment from seed in both
directions - Example
q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQM
AQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS
...
MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYE
TQLRRYINTLTRPRYGKRAEEENTGGLP...
- BLAST
- HSP, high scoring pair
- gapped alignment
- starting extension also from similar (and not
only identical) seeds
15Basics of Sequence DB Searches Detection of
identical k-mers
- Precompute position of all k-mers in DB sequence
- Indexing all peptides of length k in database
- Example
0 1 2 3 1234567890123456789
012345678901234 MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
MAAAR AAARL AARLC ....
APLEP Sorted AAARL 2
AARLC 3 APLEP 30 ... MAAAR 1 ...
VALLL 17
- For each peptide of length k in the query,
identical peptides in "database" are detected
efficiently (binary search in sorted list) - For identical pairs, extension step in both
directions is performed
16Basics of Sequence DB Searches Detection of
identical k-mers
- Indexing all peptides of length k in database
some refinements
0 1 2 3 1234567890123456789
012345678901234 MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers
to previous occurrences List of all words of
length k and last occurrence in query AAAAA
- AAAAC - AAAAD - .... AARLC 3
.... APLEP 30 ... ... VALLL 17 ...
Simple, yet efficient data-structure - array of
integers (sizelength of db) - array of integers
(sizenumber of words with length
k) Book-keeping - more than one db/query
sequence - build database chunks that fit into
main memory(speeds up computation
1000x) Extension step optimizations
- For each peptide of length k in the query, the
position in the wordlist can be easily computed
(no binary search!)
17Improvement of sensitivity/selectivity in BLAST
A W T V A S A V R T S I
- (optional) filtering for low complexity region in
query - all query words of length 3 are listed
- to each word, 50 'high scoring' additional words
are added - matching words are identified in DB (as described
before) - ungapped alignment constructed from word matches,
'HSP' - statistics determines, whether HSP is significant
- SW-alignment for significant HSPs
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS AWA
IAS TVR ... TWA LAS AIR ... ART ITS AVS ... ...
... ...
18An example comparing mus/rat/hum chr X
Each dot conserved stretch of AA, HSP, high
scoring pair Sequence lengths gt 140 M bp
19Significance of matches DNA case
- issue searching with short query vs. large
database? found match could have occurred by
pure chance - assume equal distribution of c,g,a,t
- what is ...
- the probability p, that sequence q (lenm) is
contained in sequence t (lenn)? - the expected length of the longest common
subsequence of two sequences? - the expected score of the best local alignment of
two sequences? - the expected score distribution when locally
aligning two seqeuences? - Example
- s tggtacaaatgttct (glucocorticoid response
element GRE) - t 10000 bp (promoter, upstream DNA to start
codon) - if promoter sequence was random, how often do we
expect to find a GRE? - Probability that q (lenm) is contained in t
(lenn) - a. There are (n-m) 'words' of length m in
sequence A - b. In total, there are 4m sequences of length m
- c. p (n-m) / 4m
This is wrong. Why?
20Conclusions and Outlook
- optimal alignment in practical terms only
feasible for pairwise alignment - sequence database searches
- has become the single most important tool in
sequence analysis - basic tool for hypothesis building about function
of unknown sequence - basic algorithm to identify local alignments in
sequence databases - efficiently find occurrence of query k-mers in db
sequences (seeds) - expand (ungapped or gapped) HSP from seeds
- attach statistical significance E-Value (BLAST)