Algorithms for Local Sequence Alignments: BLAST, FASTA

Slides: 16

Provided by: adrianbr

Transcript and Presenter's Notes

Title: Algorithms for Local Sequence Alignments BLAST, FASTA

1
Algorithms for Local Sequence Alignments: BLAST, FASTA

A. Brüngger, Labhead Bioinformatics, Novartis Pharma AG
adrian.bruengger_at_pharma.novartis.com

2
Algorithms for Local Sequence Alignments: BLAST, FASTA

- Sequence Similarity and Homology
  - Origins of homology
- Sequence alignment
  - Global Alignment
  - Local Alignment
- Content of Sequence DBs
  - GenBank, SwissProt, RefSeq
  - Size of sequence DB requires special search tools
- Algorithms for searching Sequence Databases
  - Basics of sequence DB searches
  - Efficient detection of identical k-mers
  - BLAST2 improvements
  - Statistical significance of hits

Outline follows David W. Mount, "Bioinformatics: Sequence and Genome Analysis", Cold Spring Harbor Laboratory Press, 2001. Online: http://www.bioinformaticsonline.org
3
Rationale for Sequence Analysis, Origins of Sequence Similarity
- Similar sequence leads to similar function
- Sequence analysis is the basic tool to discover functional, structural, and evolutionary information in biological sequences
[Figure: Sequence A and Sequence B, separated by x steps and y steps from a common ancestor sequence]
Evolutionary relationship between two similar sequences and a possible common ancestor. The number of steps to convert one sequence into the other is the "evolutionary" distance between the sequences (x + y). Usually, the ancestor sequence is not available, so only the sum (x + y) can be computed.
4
Origins of Homology; Significance of Sequence Alignments

Possible Origins of Sequence Homology
- orthologs (panels A and B): a1 in species I and a1 in species II (same ancestor!)
- paralogs (panels A and B): a1 and a2 (arose from a gene duplication event)
- analogs (panel C): different genes converge to the same function by different evolutionary paths
- transfer of genetic material (panel D) between different species
Homology vs. Similarity
- Similarity can be computed (by sequence alignments)
- Homology is deduced (e.g. from similarity, but also from other evidence!)

5
Definition of Sequence Alignment

- Computational procedure (algorithm) for comparing two or more sequences
- identifies series of identical residues, or patterns of identical residues, that appear in the same order in the sequences
- visualized by writing the sequences as follows
- sequence alignment is an optimization problem: bringing as many identical residues as possible into corresponding positions

Pairwise Global Alignment (over whole length of sequences):
MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA

Pairwise Local Alignment (similar parts of sequences):
GKG
GKG
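
The idea of "bringing identical residues into corresponding positions" can be made concrete by counting identical columns in the global alignment shown on this slide. The following is only a minimal sketch; the function name `alignment_stats` and the use of `-` as the gap character are assumptions for illustration, not part of the original slides.

```python
def alignment_stats(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count columns, identical residues and gapped columns of a pairwise alignment.

    Both rows must already be aligned, i.e. have the same length.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column is an identity if both rows carry the same residue (not a gap).
    identities = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    # A column is gapped if either row carries the gap character.
    gaps = sum(1 for a, b in zip(row_a, row_b) if a == gap or b == gap)
    return {"columns": len(row_a), "identities": identities, "gaps": gaps}


# The global alignment from the slide.
stats = alignment_stats("MLGPSSKQTGKGS-SRIWDN", "MLN-ITKSAGKGAIMRLGDA")
print(stats)  # {'columns': 20, 'identities': 8, 'gaps': 2}
```

An alignment algorithm searches over all possible gap placements for the arrangement that maximizes such a count (or, more generally, a score).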
6
Content of Sequence Databases

Sequencing efforts during the last 15 years led to a wealth of sequence DBs
- GenBank (NCBI)
  - Direct submission of DNA/RNA sequences by the researchers
  - Uncurated: varying quality of sequences (ESTs, mRNAs, genomic DNA)
  - Entries are hardly ever changed (even if there are obvious mistakes!)
  - Highly redundant
  - 30'000'000 sequences
- SwissProt (EBI, SIB)
  - Protein database
  - curated: ongoing effort to improve data quality by human curation
  - annotations: controlled vocabulary, structured information
  - minimally redundant
  - 30'000 sequences
- RefSeq (NCBI)
  - manually and computationally annotated/curated set of genes, mRNAs, and proteins
  - minimally redundant
  - captures the relationship between DNA, RNA, and proteins (splice variants, SNPs etc.)
  - 100'000 sequences

7
Growth of DBs requires specific algorithms to search

Given my sequence of interest ("query"):
- Is the query contained in a sequence DB?
- Are there orthologs, paralogs, analogs to the query in a sequence DB?
- Are there sequences sharing a high/medium/low degree of similarity with parts of the query?
- Are there other sequence variants (splice variants, SNPs) in the DB?

8
Searching a sequence DB with a query sequence

Approach 1
- global/local pairwise sequence alignment of query with each DB sequence
- dynamic programming requires O(nm) space and time
- this space/time complexity makes the resulting computation time prohibitive
- Smith-Waterman or the like is not feasible!
Approach 2
- fast identification of identical subsequences in DB ("seeds")
- extension around seed to construct local alignment
General difficulty
- small query, large database
- in some cases, an identified "hit" may happen by pure chance
- assign statistical significance to a "hit"
Example
- DB: human genomic DNA, 3 x 10^9 b
- query: tggtacaaatgttct (glucocorticoid response element, GRE)
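
To see where the O(nm) cost of Approach 1 comes from, here is a minimal sketch of the Smith-Waterman local alignment score. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for illustration; real searches use substitution matrices and affine gap penalties. This score-only version keeps just two rows of the matrix; recovering the alignment itself by traceback needs the full n x m matrix.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of a and b (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)   # row i-1 of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0 (start afresh).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTGKGTT", "CCGKGCC"))      # 3 (the shared GKG)
print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))   # 7 (8 matches, 1 gap)
```

The two nested loops visit every cell once, so aligning a query against each database sequence in turn costs time proportional to the query length times the total database length.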

9
Basics of Sequence DB Searches: Detection of identical k-mers

Idea: identify identical k-mers in DB and q (seed), then expand the alignment from the seed in both directions
Example

q: MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQMAQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS

DB: ...MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYETQLRRYINTLTRPRYGKRAEEENTGGLP...

BLAST
- HSP, high scoring pair
- gapped alignment
- starting extension also from similar (and not only identical) seeds
10
Basics of Sequence DB Searches: Detection of identical k-mers

Precompute position of all k-mers in DB sequence
Indexing all peptides of length k in database
Example

         1         2         3
1234567890123456789012345678901234
MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP

Peptides of length 5: MAAAR, AAARL, AARLC, ..., APLEP

Sorted:
AAARL  2
AARLC  3
APLEP 30
...
MAAAR  1
...
VALLL 17

For each peptide of length k in the query, identical peptides in the "database" are detected efficiently (binary search in sorted list)
For identical pairs, the extension step in both directions is performed
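
The sorted list with binary search described above can be sketched as follows, using the example sequence from this slide with 1-based positions. The function names are assumptions for illustration; the binary search uses the `bisect` module of the Python standard library.

```python
from bisect import bisect_left


def build_sorted_index(db: str, k: int):
    """Sorted list of (k-mer, 1-based position) for every k-mer in db."""
    return sorted((db[i:i + k], i + 1) for i in range(len(db) - k + 1))


def lookup(index, word: str):
    """All 1-based positions of word, located by binary search."""
    i = bisect_left(index, (word, 0))   # first entry >= (word, 0)
    hits = []
    while i < len(index) and index[i][0] == word:
        hits.append(index[i][1])
        i += 1
    return hits


def find_seeds(index, query: str, k: int):
    """(query position, db position) pairs of identical k-mers, 1-based."""
    return [(qpos + 1, dpos)
            for qpos in range(len(query) - k + 1)
            for dpos in lookup(index, query[qpos:qpos + k])]


db = "MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP"
index = build_sorted_index(db, k=5)
print(lookup(index, "APLEP"))             # [30]
print(lookup(index, "VALLL"))             # [17]
print(find_seeds(index, "GAPLEPM", k=5))  # [(1, 29), (2, 30)]
```

Each lookup costs O(log N) comparisons for a database with N k-mers, which is what makes the seed detection fast compared with aligning against every sequence.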

11
Basics of Sequence DB Searches: Detection of identical k-mers

Indexing all peptides of length k in database: some refinements

         1         2         3
1234567890123456789012345678901234
MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
       -        8
Pointers to previous occurrences

List of all words of length k and last occurrence in query:
AAAAA  -
AAAAC  -
AAAAD  -
....
AARLC  3
....
APLEP 30
...
VALLL 17
...

Simple, yet efficient data structure
- array of integers (size = length of db)
- array of integers (size = number of words with length k)
Book-keeping
- more than one db/query sequence
- build database chunks that fit into main memory (speeds up computation 1000x)
Extension step optimizations

For each peptide of length k in the query, the position in the word list can be easily computed (no binary search!)
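
The two integer arrays can be sketched like this: one array with the last occurrence of every possible word (size 20^k), and one array of pointers to the previous occurrence (size = length of the sequence). The encoding of a word as a base-20 number, the alphabet ordering, the use of 0 for "no occurrence" and all names are assumptions for illustration; the sketch only handles the 20 standard amino acids.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # assumed alphabet ordering
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def word_code(word: str) -> int:
    """Position of a word in the word list: read it as a base-20 number."""
    code = 0
    for aa in word:
        code = code * 20 + CODE[aa]
    return code


def build_index(db: str, k: int):
    """Return (last, prev) using 1-based positions; 0 means 'none'."""
    last = [0] * (20 ** k)        # last occurrence of every possible word
    prev = [0] * (len(db) + 1)    # prev[p]: previous occurrence of the word at p
    for pos in range(1, len(db) - k + 2):
        c = word_code(db[pos - 1:pos - 1 + k])
        prev[pos] = last[c]
        last[c] = pos
    return last, prev


def occurrences(last, prev, word: str):
    """Follow the pointer chain from the last occurrence backwards."""
    pos = last[word_code(word)]
    hits = []
    while pos:
        hits.append(pos)
        pos = prev[pos]
    return hits


last, prev = build_index("MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP", k=5)
print(occurrences(last, prev, "VALLL"))   # [17, 8]
print(occurrences(last, prev, "AARLC"))   # [3]
```

Because the array index is computed directly from the word, a lookup takes time proportional to k plus the number of hits, with no search at all.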

12
Improvement of sensitivity/selectivity in BLAST

Query: A W T V A S A V R T S I

- (optional) filtering for low complexity regions in query
- all query words of length 3 are listed
- to each word, 50 'high scoring' additional words are added
- matching words are identified in DB (as described before)
- ungapped alignment constructed from word matches, 'HSP'
- statistics determine whether an HSP is significant
- SW-alignment for significant HSPs

Query words of length 3:
AWT VAS AVR TSI
 WTV ASA VRT
  TVA SAV RTS

Query words with added 'high scoring' words:
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWA IAS TVR ...
TWA LAS AIR ...
ART ITS AVS ...
... ... ...
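
Listing the query words and adding 'high scoring' similar words can be sketched as below. Important assumption: BLAST scores word pairs with a substitution matrix such as BLOSUM62; to keep the example self-contained, the sketch uses a made-up toy scoring (identical residue 2, same hypothetical residue group 1, otherwise -1) and a score threshold instead of a fixed number of added words. Group definitions, names and the threshold are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups, NOT a real substitution matrix.
TOY_GROUPS = ["ILMV", "ST", "KR", "DE", "NQ", "FWY"]
GROUP_OF = {aa: g for g, members in enumerate(TOY_GROUPS) for aa in members}


def toy_score(a: str, b: str) -> int:
    """Toy residue score: 2 identical, 1 same group, -1 otherwise."""
    if a == b:
        return 2
    group = GROUP_OF.get(a)
    if group is not None and group == GROUP_OF.get(b):
        return 1
    return -1


def query_words(query: str, w: int = 3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word: str, threshold: int):
    """All words whose toy score against `word` reaches the threshold."""
    return ["".join(cand)
            for cand in product(AMINO_ACIDS, repeat=len(word))
            if sum(toy_score(a, b) for a, b in zip(word, cand)) >= threshold]


words = query_words("AWTVASAVRTSI")
print(words)
print(neighborhood("VAS", threshold=5))   # ['IAS', 'LAS', 'MAS', 'VAS', 'VAT']
```

With a real substitution matrix the neighborhood of VAS would differ, but the mechanism is the same: every query word is replaced by a set of similar words, and all of them are looked up in the database index.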
13
An example: comparing mus/rat/hum chr X
Each dot: conserved stretch of AA, HSP (high scoring pair). Sequence lengths > 140 Mbp
14
Significance of matches: DNA case

- issue: searching with a short query vs. a large database; a found match could have occurred by pure chance
- assume equal distribution of c, g, a, t
- what is ...
  - the probability p that sequence q (len = m) is contained in sequence t (len = n)?
  - the expected length of the longest common subsequence of two sequences?
  - the expected score of the best local alignment of two sequences?
  - the expected score distribution when locally aligning two sequences?
Example
- s: tggtacaaatgttct (glucocorticoid response element, GRE)
- t: 10000 bp (promoter, upstream DNA to start codon)
- if the promoter sequence was random, how often do we expect to find a GRE?
Probability that q (len = m) is contained in t (len = n)
a. There are n - m + 1 (roughly n - m for n much larger than m) 'words' of length m in sequence t
b. In total, there are 4^m sequences of length m
c. p ≈ (n - m) / 4^m
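
The estimate above can be evaluated for the GRE example. This is a sketch under the slide's assumptions (uniform, independent bases) plus two of my own: only one strand is searched, and the probability of at least one occurrence is approximated by a Poisson formula, 1 - exp(-E), where E is the expected number of occurrences. The code uses the exact word count n - m + 1; function names are illustrative.

```python
import math


def expected_occurrences(n: int, m: int, alphabet_size: int = 4) -> float:
    """Expected number of exact occurrences of one fixed word of length m
    in a random sequence of length n (uniform, independent letters)."""
    if m > n:
        return 0.0
    return (n - m + 1) / alphabet_size ** m


def prob_at_least_one(n: int, m: int, alphabet_size: int = 4) -> float:
    """Poisson approximation of the chance of at least one occurrence."""
    return -math.expm1(-expected_occurrences(n, m, alphabet_size))


gre = "tggtacaaatgttct"                                   # m = 15
promoter = expected_occurrences(10_000, len(gre))         # t: 10000 bp
genome = expected_occurrences(3 * 10**9, len(gre))        # DB: 3 x 10^9 b
print(f"promoter: {promoter:.2e}")   # about 9.3e-06
print(f"genome:   {genome:.2f}")     # about 2.79
```

So a GRE hit in a single random 10000 bp promoter would be very unlikely by chance, whereas in a random sequence the size of the human genome a few exact copies are expected by chance alone.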