Algorithms for Local Sequence Alignments BLAST, FASTA - PowerPoint PPT Presentation

1 / 20

About This Presentation

Title:

Algorithms for Local Sequence Alignments BLAST, FASTA

Description:

11 matches. 1 mismatch. 3 gaps. Score = 11 - 1 -3 = 7. Scoring Systems ... Direct submission of DNA/RNA sequences by ... ART ITS AVS ... An example: comparing ... – PowerPoint PPT presentation

Number of Views:345

Avg rating:3.0/5.0

Slides: 21

Provided by: adrianbr

Category:

more less

Transcript and Presenter's Notes

Title: Algorithms for Local Sequence Alignments BLAST, FASTA

1
Algorithms forLocal Sequence AlignmentsBLAST,
FASTA

A. Brüngger, Discovery InformaticsBasilea
Pharmaceutica Ltd.
adrian.bruengger_at_basilea.com

2
Algorithms for Local Sequence Alignments BLAST,
FASTA

Sequence Similarity and Homology
Origins of homology
Sequence alignment
Global Alignment
Local Alignment
Scoring Systems for Proteins
Content of Sequence DBs
GenBank, SwissProt, UniProt, RefSeq
Size of sequence DB requires special search tools
Algorithms for searching Sequence Databases
Basics of sequence DB searches
Efficient detection of identical k-mers
BLAST2 improvements
Statistical significance of hits

Outline follows David W. Mount, "Bioionformatics
- Sequence and Genome Analysis Cold Spring
Harbour Laboratory Press, 2001. Online
http//www.bioinformaticsonline.org
3
Rational for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function
Sequence Analysis as the basic tool to discover
functional, structural, evolutionary information
in biological sequences
Sequence A
Sequence B
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x y). Usually, the ancestor sequence
is not available, only (x y) can be computed.
y Steps
x Steps
common ancestor sequence
4
Origins of Homology ? Significance of Sequence
Alignments

Possible Origins of Sequence Homology
orthologs (panel A and B) a1 in species I and a1
in species II (same ancestor!)
paralogs (panel A and B) a1 and a2 (arose from
gene duplication event)
analogs (panel C) different genes converge to
same function by different evolutionary paths
transfer of genetic material (panel D) between
different species
Homology vs. Similarity
Similarity can be computed (by sequence
alignments)
Homology is deduced (e.g. from similarity, but
also from other evidence!)

5
Definition of Sequence Alignment

Computational procedure (algorithm) for
comparing two/many sequences
identify series of identical residues or patterns
of identical residuesthat appear in the same
order in the sequences
visualized by writing sequences as follows
sequence alignment is an optimiztion
problembringing as many identical residues as
possible into corresponding positions

MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Global Alignment (over whole length of
sequences)
GKG GKG
Pairwise Local Alignment (similar parts of
sequences)
6
Assigning Scores Quantify the Quality of an
Alignment

Simple scoring system (e.g. for DNA sequences)
two identical residues are aligned 1
two non-identical residues are aligned -1
when a gap is introduced -1
Total score of alginment is the sum of the scores
in each position
Alignment is optimal when assigned score is
maximal
Note Scoring scheme defines optimal alignment

agaag-tagattcta aggaggtag-tt-ta

Compute Score
11 matches
1 mismatch
3 gaps
Score 11 - 1 -3 7

7
Scoring Systems for Proteins (1)

Observation Protein (Function, Structure) is
preserved when substitutions in certain AAs
happen
scoring system less simple in case of proteins
Example The same protein in three different
species

A W T V A S A V R T S I A Y T V A A A V R T S I A
W T V A A A V L T S I
Organism A Organism B Organism C
A
B
C
Therefore Assigning L to R scores higher than
assigning it to another AA
W to Y
L to R
8
Scoring Systems for Proteins (2)

Generalization Dayhoff PAM Weight
Matrices(percent accepted mutation)
observed mutations during a unit of evolutionary
time(1 PAM time that is required that an AA is
changed with probability 1 )
assumption mutation at a certain site
independent of previous changes in the site(-gt
Markov Model)
extrapolation to longer evolutionary distances
possible
initial PAM matrix (PAM1) derived from 1572
changes in 71 groups of protein sequences that
were at least 85 similar
observed in proteins that have the same function
-gt accepted mutations

C S T P .. C fcc fcs fct fcp S fsc
fss fst fsp . .
9
Scoring Systems for Proteins (3)

From PAM1 to PAM60 to PAM250
multiplying the PAM1 matrix with itself!
from matrices, percentage identity canbe
computed
PAM60 60PAM80 50PAM120 40PAM250
20
Convenience for sequence alignments
log odds scores
can be summed up for aligned AA-pairs in order to
get score

M
R
K
PAM1 p(HM) 0 PAM2 p(HM) p(HK)p(KM) p(HR
)p(RM)
H
p (AA-pair is found in alignment of homologuous
proteins) p (AA-pair is found in alignment of
unrelatede proteins by chance)
log
10
Scoring Systems for Proteins (4)

Chosing the 'best' PAM matrix for sequence
searches
ungapped alignment scores are maximal, when
'correct' PAM matrix is used!
using PAM200 90 of this maximum score for
ungapped alignments of length 16-62 (Altschul
1991)
BLOSUM Using large set of diverse proteins
Blocks Amino Acid Substitution Matrices BLOSUM
2000 conserved AA-patterns ("Blocks") in more
than 500 families (family signatures, Prosite
patterns Bairoch)
patterns with 60 identity led to BLOSUM60, 80
gt BLOSUM80
Empirical tests, whether matrix finds family
members in DBs gt BLOSUM62
Which one to use?
BLOSUM for conserved domains (regardless
evolutionary distance)
PAM for "evolutionary origin"

11
Content of Sequence Databases

Sequencing efforts during the last 15 years led
to a wealth of sequence DBs
GenBank (NCBI)
Direct submission of DNA/RNA sequences by the
researchers
Uncurated varying quality of sequences (ESTs,
mRNAs, genomic DNA)
Entries are hardly ever changed (even if there
are obvious mistakes!)
Highly redundant
40'000'000 sequences (40000000000 bp)
SwissProt, UniProt (EBI, SIB)
Protein database
curated ongoing effort to improve data quality
by human curation
annotations controlled vocabulary, structured
information
minimally redundant
30'000 sequences
RefSeq (NCBI), Gene
manually and computationally annotated/curated
set of genes, mRNAs, and proteins
minimally redundant
capture relationship between DNA, RNA, Proteins
(splice variants, SNPs etc.)
100'000 sequences

12
Growth of DBs requires specific algorithms to
search

Given my sequence of interest ("query")
Is the query contained in a sequence DB?
Are there orthologs, paralogs, analogs to the
query in a sequence DB?
Are there sequences sharing high/medium/low
degree of similarity with query parts?
Are there other sequence variants (splice
variants, SNPs) in the DB?

13
Searching a sequence DB with a query sequence

Approach 1
global/local pairwise sequence alignment of query
with each DB sequence
dynamic programming requires O(nm) space and
time
space/time complexity such that resulting
computation-time is prohibitive
Smith-Waterman or alike not feasible!
Approach 2
fast identification of identical subsequences in
DB ("seeds")
extension around seed to construct local
alignment
General difficulty
small query, large database
in some cases, an identified "hit" may happen by
pure chance
assign statistical significance to a "hit"
Example
DB human genomic DNA, 3109 b
query tggtacaaatgttct (glucocorticoid response
element GRE)

14
Basics of Sequence DB Searches Detection of
identical k-mers

Idea identify identical k-mers in DB and q
(seed) expand alignment from seed in both
directions
Example

q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQM
AQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS

...
MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYE
TQLRRYINTLTRPRYGKRAEEENTGGLP...

BLAST
HSP, high scoring pair
gapped alignment
starting extension also from similar (and not
only identical) seeds

15
Basics of Sequence DB Searches Detection of
identical k-mers

Precompute position of all k-mers in DB sequence
Indexing all peptides of length k in database
Example

0 1 2 3 1234567890123456789
012345678901234 MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
MAAAR AAARL AARLC ....
APLEP Sorted AAARL 2
AARLC 3 APLEP 30 ... MAAAR 1 ...
VALLL 17

For each peptide of length k in the query,
identical peptides in "database" are detected
efficiently (binary search in sorted list)
For identical pairs, extension step in both
directions is performed

16
Basics of Sequence DB Searches Detection of
identical k-mers

Indexing all peptides of length k in database
some refinements

0 1 2 3 1234567890123456789
012345678901234 MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers
to previous occurrences List of all words of
length k and last occurrence in query AAAAA
- AAAAC - AAAAD - .... AARLC 3
.... APLEP 30 ... ... VALLL 17 ...
Simple, yet efficient data-structure - array of
integers (sizelength of db) - array of integers
(sizenumber of words with length
k) Book-keeping - more than one db/query
sequence - build database chunks that fit into
main memory(speeds up computation
1000x) Extension step optimizations

For each peptide of length k in the query, the
position in the wordlist can be easily computed
(no binary search!)

17
Improvement of sensitivity/selectivity in BLAST
A W T V A S A V R T S I

(optional) filtering for low complexity region in
query
all query words of length 3 are listed
to each word, 50 'high scoring' additional words
are added
matching words are identified in DB (as described
before)
ungapped alignment constructed from word matches,
'HSP'
statistics determines, whether HSP is significant
SW-alignment for significant HSPs

AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS AWA
IAS TVR ... TWA LAS AIR ... ART ITS AVS ... ...
... ...
18
An example comparing mus/rat/hum chr X
Each dot conserved stretch of AA, HSP, high
scoring pair Sequence lengths gt 140 M bp
19
Significance of matches DNA case

issue searching with short query vs. large
database? found match could have occurred by
pure chance
assume equal distribution of c,g,a,t
what is ...
the probability p, that sequence q (lenm) is
contained in sequence t (lenn)?
the expected length of the longest common
subsequence of two sequences?
the expected score of the best local alignment of
two sequences?
the expected score distribution when locally
aligning two seqeuences?
Example
s tggtacaaatgttct (glucocorticoid response
element GRE)
t 10000 bp (promoter, upstream DNA to start
codon)
if promoter sequence was random, how often do we
expect to find a GRE?
Probability that q (lenm) is contained in t
(lenn)
a. There are (n-m) 'words' of length m in
sequence A
b. In total, there are 4m sequences of length m
c. p (n-m) / 4m