Algorithms for Local Sequence Alignments BLAST, FASTA - PowerPoint PPT Presentation

1 / 20
About This Presentation
Title:

Algorithms for Local Sequence Alignments BLAST, FASTA

Description:

11 matches. 1 mismatch. 3 gaps. Score = 11 - 1 -3 = 7. Scoring Systems ... Direct submission of DNA/RNA sequences by ... ART ITS AVS ... An example: comparing ... – PowerPoint PPT presentation

Number of Views:345
Avg rating:3.0/5.0
Slides: 21
Provided by: adrianbr
Category:

less

Transcript and Presenter's Notes

Title: Algorithms for Local Sequence Alignments BLAST, FASTA


1
Algorithms forLocal Sequence AlignmentsBLAST,
FASTA
  • A. Brüngger, Discovery InformaticsBasilea
    Pharmaceutica Ltd.
  • adrian.bruengger_at_basilea.com

2
Algorithms for Local Sequence Alignments BLAST,
FASTA
  • Sequence Similarity and Homology
  • Origins of homology
  • Sequence alignment
  • Global Alignment
  • Local Alignment
  • Scoring Systems for Proteins
  • Content of Sequence DBs
  • GenBank, SwissProt, UniProt, RefSeq
  • Size of sequence DB requires special search tools
  • Algorithms for searching Sequence Databases
  • Basics of sequence DB searches
  • Efficient detection of identical k-mers
  • BLAST2 improvements
  • Statistical significance of hits

Outline follows David W. Mount, "Bioionformatics
- Sequence and Genome Analysis Cold Spring
Harbour Laboratory Press, 2001. Online
http//www.bioinformaticsonline.org
3
Rational for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function
Sequence Analysis as the basic tool to discover
functional, structural, evolutionary information
in biological sequences
Sequence A
Sequence B
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x y). Usually, the ancestor sequence
is not available, only (x y) can be computed.
y Steps
x Steps
common ancestor sequence
4
Origins of Homology ? Significance of Sequence
Alignments
  • Possible Origins of Sequence Homology
  • orthologs (panel A and B) a1 in species I and a1
    in species II (same ancestor!)
  • paralogs (panel A and B) a1 and a2 (arose from
    gene duplication event)
  • analogs (panel C) different genes converge to
    same function by different evolutionary paths
  • transfer of genetic material (panel D) between
    different species
  • Homology vs. Similarity
  • Similarity can be computed (by sequence
    alignments)
  • Homology is deduced (e.g. from similarity, but
    also from other evidence!)

5
Definition of Sequence Alignment
  • Computational procedure (algorithm) for
    comparing two/many sequences
  • identify series of identical residues or patterns
    of identical residuesthat appear in the same
    order in the sequences
  • visualized by writing sequences as follows
  • sequence alignment is an optimiztion
    problembringing as many identical residues as
    possible into corresponding positions

MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Global Alignment (over whole length of
sequences)
GKG GKG
Pairwise Local Alignment (similar parts of
sequences)
6
Assigning Scores Quantify the Quality of an
Alignment
  • Simple scoring system (e.g. for DNA sequences)
  • two identical residues are aligned 1
  • two non-identical residues are aligned -1
  • when a gap is introduced -1
  • Total score of alginment is the sum of the scores
    in each position
  • Alignment is optimal when assigned score is
    maximal
  • Note Scoring scheme defines optimal alignment

agaag-tagattcta aggaggtag-tt-ta
  • Compute Score
  • 11 matches
  • 1 mismatch
  • 3 gaps
  • Score 11 - 1 -3 7

7
Scoring Systems for Proteins (1)
  • Observation Protein (Function, Structure) is
    preserved when substitutions in certain AAs
    happen
  • scoring system less simple in case of proteins
  • Example The same protein in three different
    species

A W T V A S A V R T S I A Y T V A A A V R T S I A
W T V A A A V L T S I
Organism A Organism B Organism C
A
B
C
Therefore Assigning L to R scores higher than
assigning it to another AA
W to Y
L to R
8
Scoring Systems for Proteins (2)
  • Generalization Dayhoff PAM Weight
    Matrices(percent accepted mutation)
  • observed mutations during a unit of evolutionary
    time(1 PAM time that is required that an AA is
    changed with probability 1 )
  • assumption mutation at a certain site
    independent of previous changes in the site(-gt
    Markov Model)
  • extrapolation to longer evolutionary distances
    possible
  • initial PAM matrix (PAM1) derived from 1572
    changes in 71 groups of protein sequences that
    were at least 85 similar
  • observed in proteins that have the same function
    -gt accepted mutations

C S T P .. C fcc fcs fct fcp S fsc
fss fst fsp . .
9
Scoring Systems for Proteins (3)
  • From PAM1 to PAM60 to PAM250
  • multiplying the PAM1 matrix with itself!
  • from matrices, percentage identity canbe
    computed
  • PAM60 60PAM80 50PAM120 40PAM250
    20
  • Convenience for sequence alignments
  • log odds scores
  • can be summed up for aligned AA-pairs in order to
    get score

M
R
K
PAM1 p(HM) 0 PAM2 p(HM) p(HK)p(KM) p(HR
)p(RM)
H
p (AA-pair is found in alignment of homologuous
proteins) p (AA-pair is found in alignment of
unrelatede proteins by chance)
log
10
Scoring Systems for Proteins (4)
  • Chosing the 'best' PAM matrix for sequence
    searches
  • ungapped alignment scores are maximal, when
    'correct' PAM matrix is used!
  • using PAM200 90 of this maximum score for
    ungapped alignments of length 16-62 (Altschul
    1991)
  • BLOSUM Using large set of diverse proteins
  • Blocks Amino Acid Substitution Matrices BLOSUM
    2000 conserved AA-patterns ("Blocks") in more
    than 500 families (family signatures, Prosite
    patterns Bairoch)
  • patterns with 60 identity led to BLOSUM60, 80
    gt BLOSUM80
  • Empirical tests, whether matrix finds family
    members in DBs gt BLOSUM62
  • Which one to use?
  • BLOSUM for conserved domains (regardless
    evolutionary distance)
  • PAM for "evolutionary origin"

11
Content of Sequence Databases
  • Sequencing efforts during the last 15 years led
    to a wealth of sequence DBs
  • GenBank (NCBI)
  • Direct submission of DNA/RNA sequences by the
    researchers
  • Uncurated varying quality of sequences (ESTs,
    mRNAs, genomic DNA)
  • Entries are hardly ever changed (even if there
    are obvious mistakes!)
  • Highly redundant
  • 40'000'000 sequences (40000000000 bp)
  • SwissProt, UniProt (EBI, SIB)
  • Protein database
  • curated ongoing effort to improve data quality
    by human curation
  • annotations controlled vocabulary, structured
    information
  • minimally redundant
  • 30'000 sequences
  • RefSeq (NCBI), Gene
  • manually and computationally annotated/curated
    set of genes, mRNAs, and proteins
  • minimally redundant
  • capture relationship between DNA, RNA, Proteins
    (splice variants, SNPs etc.)
  • 100'000 sequences

12
Growth of DBs requires specific algorithms to
search
  • Given my sequence of interest ("query")
  • Is the query contained in a sequence DB?
  • Are there orthologs, paralogs, analogs to the
    query in a sequence DB?
  • Are there sequences sharing high/medium/low
    degree of similarity with query parts?
  • Are there other sequence variants (splice
    variants, SNPs) in the DB?

13
Searching a sequence DB with a query sequence
  • Approach 1
  • global/local pairwise sequence alignment of query
    with each DB sequence
  • dynamic programming requires O(nm) space and
    time
  • space/time complexity such that resulting
    computation-time is prohibitive
  • Smith-Waterman or alike not feasible!
  • Approach 2
  • fast identification of identical subsequences in
    DB ("seeds")
  • extension around seed to construct local
    alignment
  • General difficulty
  • small query, large database
  • in some cases, an identified "hit" may happen by
    pure chance
  • assign statistical significance to a "hit"
  • Example
  • DB human genomic DNA, 3109 b
  • query tggtacaaatgttct (glucocorticoid response
    element GRE)

14
Basics of Sequence DB Searches Detection of
identical k-mers
  • Idea identify identical k-mers in DB and q
    (seed) expand alignment from seed in both
    directions
  • Example

q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQM
AQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS

...
MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYE
TQLRRYINTLTRPRYGKRAEEENTGGLP...
  • BLAST
  • HSP, high scoring pair
  • gapped alignment
  • starting extension also from similar (and not
    only identical) seeds

15
Basics of Sequence DB Searches Detection of
identical k-mers
  • Precompute position of all k-mers in DB sequence
  • Indexing all peptides of length k in database
  • Example

0 1 2 3 1234567890123456789
012345678901234 MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
MAAAR AAARL AARLC ....
APLEP Sorted AAARL 2
AARLC 3 APLEP 30 ... MAAAR 1 ...
VALLL 17
  • For each peptide of length k in the query,
    identical peptides in "database" are detected
    efficiently (binary search in sorted list)
  • For identical pairs, extension step in both
    directions is performed

16
Basics of Sequence DB Searches Detection of
identical k-mers
  • Indexing all peptides of length k in database
    some refinements

0 1 2 3 1234567890123456789
012345678901234 MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers
to previous occurrences List of all words of
length k and last occurrence in query AAAAA
- AAAAC - AAAAD - .... AARLC 3
.... APLEP 30 ... ... VALLL 17 ...
Simple, yet efficient data-structure - array of
integers (sizelength of db) - array of integers
(sizenumber of words with length
k) Book-keeping - more than one db/query
sequence - build database chunks that fit into
main memory(speeds up computation
1000x) Extension step optimizations
  • For each peptide of length k in the query, the
    position in the wordlist can be easily computed
    (no binary search!)

17
Improvement of sensitivity/selectivity in BLAST
A W T V A S A V R T S I
  • (optional) filtering for low complexity region in
    query
  • all query words of length 3 are listed
  • to each word, 50 'high scoring' additional words
    are added
  • matching words are identified in DB (as described
    before)
  • ungapped alignment constructed from word matches,
    'HSP'
  • statistics determines, whether HSP is significant
  • SW-alignment for significant HSPs

AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS AWA
IAS TVR ... TWA LAS AIR ... ART ITS AVS ... ...
... ...
18
An example comparing mus/rat/hum chr X
Each dot conserved stretch of AA, HSP, high
scoring pair Sequence lengths gt 140 M bp
19
Significance of matches DNA case
  • issue searching with short query vs. large
    database? found match could have occurred by
    pure chance
  • assume equal distribution of c,g,a,t
  • what is ...
  • the probability p, that sequence q (lenm) is
    contained in sequence t (lenn)?
  • the expected length of the longest common
    subsequence of two sequences?
  • the expected score of the best local alignment of
    two sequences?
  • the expected score distribution when locally
    aligning two seqeuences?
  • Example
  • s tggtacaaatgttct (glucocorticoid response
    element GRE)
  • t 10000 bp (promoter, upstream DNA to start
    codon)
  • if promoter sequence was random, how often do we
    expect to find a GRE?
  • Probability that q (lenm) is contained in t
    (lenn)
  • a. There are (n-m) 'words' of length m in
    sequence A
  • b. In total, there are 4m sequences of length m
  • c. p (n-m) / 4m

This is wrong. Why?
20
Conclusions and Outlook
  • optimal alignment in practical terms only
    feasible for pairwise alignment
  • sequence database searches
  • has become the single most important tool in
    sequence analysis
  • basic tool for hypothesis building about function
    of unknown sequence
  • basic algorithm to identify local alignments in
    sequence databases
  • efficiently find occurrence of query k-mers in db
    sequences (seeds)
  • expand (ungapped or gapped) HSP from seeds
  • attach statistical significance E-Value (BLAST)
Write a Comment
User Comments (0)
About PowerShow.com