Title: Lecture 2: Database search Based on class by Larry Hunter
1Lecture 2: Database search (Based on class by
Larry Hunter)
2First Behind the Screen
- Databases are largely devoted to search.
- Also, integrity, security, etc.
- Search means taking a query and retrieving some
database entry that matches it.
- We will start by discussing how to find an exact
match, and then move to finding inexact matches
(like BLAST search).
- Efficiency is key: we want to find things fast,
regardless of how big the database gets.
3Computational Complexity
- A key idea in computer science: how much work
does it take to solve a class of problems?
- How do we measure complexity?
- Relative to problem size
- How long does it take?
- Clock time versus operations
- Order O(?) notation
- Other resources used (particularly space)
4The complexity of search
- Compare several algorithms for exact matching
- exhaustive (linear time, constant space)
- indexed (log time, linear space)
- hash tables (constant time, linear space)
- Then look at inexact search methods
- Dynamic programming (Smith-Waterman)
- BLAST
5Linear search
- Database
- ACTGA
- TTAGG
- CGTAA
- AGAGA
- CGATA
- CCGGA
- GCCCT
- TTACG
- Test query against each target sequentially
- Worst case, query matches last target and you
have as many tests as targets (size of database)
- Average case, test half the targets.
- Linear in the size of the database
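The sequential test described above can be sketched in a few lines of Python. The function name `linear_search` and the test counter are illustrative choices, not part of the lecture.

```python
def linear_search(database, query):
    """Test the query against each target in order.

    Returns (position, tests), where position is None if nothing matched.
    """
    tests = 0
    for position, target in enumerate(database):
        tests += 1
        if target == query:
            return position, tests
    return None, tests


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA",
            "CGATA", "CCGGA", "GCCCT", "TTACG"]

# Worst case: the query matches the last target, so we test all 8 entries.
print(linear_search(database, "TTACG"))  # (7, 8)
```

Doubling the database doubles the worst-case number of tests, which is what "linear" means here.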
6Data Representation
Data structures (internal representations of
data) affect computational complexity
- Linked lists
- Variable length
- Each element has a pointer to the next (doubles
space)
- Vectors (arrays)
- Fixed length
- Often allocate more space than necessary
- Each element has a specific position
actga
ttgaca
...
7Computational Complexity
- For linked lists
- finding the nth element takes time proportional
to the length of the list
- insertion or deletion takes constant time (once
the position has been found)
- For vectors
- finding the nth element takes constant time.
- insertion or deletion takes time proportional to
the length of the vector (have to move all
subsequent entries).
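A small sketch may make the contrast concrete. The `Node` class and helper names below are made up for illustration, and a Python list stands in for the vector.

```python
class Node:
    """One linked-list element: a value plus a pointer to the next element."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def nth(head, n):
    """Follow n pointers from the head: time grows with n."""
    node = head
    for _ in range(n):
        node = node.next
    return node


def insert_after(node, value):
    """Splice in a new element after `node`: constant time, nothing moves."""
    node.next = Node(value, node.next)


def to_list(head):
    """Collect the values in order (only used to display the result)."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


head = Node("actga", Node("ttgaca", Node("cgtaa")))
insert_after(head, "agaga")          # constant time

vector = ["actga", "ttgaca", "cgtaa"]
vector.insert(1, "agaga")            # shifts every later entry
third = vector[2]                    # constant-time access by position
```

Both structures end up holding the same four sequences in the same order. What differs is which operation is cheap.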
8Indexed (binary) search
- Create a sorted set of keys that point to
entries
- Start in the middle, then figure out which half
contains the query
- Eliminate half the database each step, so need
about log2(N) steps at worst for N entries
- Need to build the index (takes space and time at
each database update)
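As a rough sketch of the idea, the code below builds a sorted index over the example database from the linear search slide and halves the search range at each step. The names `binary_search` and `indexed_search` are illustrative.

```python
def binary_search(keys, query):
    """Search a sorted list of keys; return (slot, steps), slot None if absent."""
    lo, hi = 0, len(keys) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] == query:
            return mid, steps
        if keys[mid] < query:
            lo = mid + 1      # query can only be in the upper half
        else:
            hi = mid - 1      # query can only be in the lower half
    return None, steps


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA",
            "CGATA", "CCGGA", "GCCCT", "TTACG"]

# The index: sorted keys, each pointing back to its database position.
index = sorted((seq, pos) for pos, seq in enumerate(database))
keys = [seq for seq, _ in index]


def indexed_search(query):
    """Return (database position, steps) for an exact match, or (None, steps)."""
    slot, steps = binary_search(keys, query)
    position = None if slot is None else index[slot][1]
    return position, steps


print(indexed_search("TTACG"))  # (7, 3)
```

With 8 entries no lookup needs more than 4 steps. The price is the extra space for `index` and the time to re-sort it whenever the database changes.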
9Hash tables
f(TTACG) = 8
- Map each query to an arbitrary number with a
hash function
- Use those numbers as an index into a table
- Collisions can happen, but are rare
- Constant time lookup
10What makes a good hash function?
- Basic: must map keys to a number that is within
the size of the table
- Desired: minimize collisions
- So similar keys should lead to different numbers
- Good general method: map key to a number, and
then take the remainder when divided by a prime
number
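The "map to a number, then take the remainder modulo a prime" recipe might look like the sketch below. The base-4 encoding, the table size of 11, and the use of small buckets to absorb collisions are all assumptions made for this example. It is not the function f shown on the hash table slide, so the slot numbers will differ.

```python
TABLE_SIZE = 11  # a prime, picked only for this illustration


def key_to_number(seq):
    """Read a DNA string as a base-4 number (A=0, C=1, G=2, T=3)."""
    number = 0
    for base in seq:
        number = number * 4 + "ACGT".index(base)
    return number


def hash_sequence(seq):
    """Remainder modulo a prime keeps the result inside the table."""
    return key_to_number(seq) % TABLE_SIZE


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA",
            "CGATA", "CCGGA", "GCCCT", "TTACG"]

# Each slot holds a short bucket so that colliding keys can coexist.
table = [[] for _ in range(TABLE_SIZE)]
for seq in database:
    table[hash_sequence(seq)].append(seq)


def lookup(query):
    """Jump straight to one slot, then check its (short) bucket."""
    return query in table[hash_sequence(query)]
```

Lookup cost depends on the bucket length, not on the database size, as long as collisions stay rare.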
11Pairwise Sequence Alignment
- What is an alignment, and why might it be
significant?
- An alignment is a mapping from one sequence to
another, identifying elements that are likely to
have arisen from a common ancestor
- A good alignment is an indication of homology
- Alignments are NOT exact matches. We will need a
method to find good alignments in a database...
12Similarity vs. Homology; Paralogs vs. Orthologs
- Homology is an evolutionary relationship that
either exists or does not. It cannot be partial.
- An ortholog is a homolog that arose through a
speciation event. Orthologs often share function.
- A paralog is a homolog that arose through a gene
duplication event. Paralogs often have divergent
function.
- Similarity is a measure of the quality of
alignment between two sequences. High similarity
is evidence for homology. Similar sequences may
be orthologs or paralogs.
13How do we compute similarity?
- Similarity can be defined by counting positions
that are identical between two sequences
- Gaps (insertions/deletions) can be important
abcdef    abcdef    abcdef
abceef    acdef     a-cdef
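Counting identical positions is easy to sketch, and doing so shows why the gap matters. The function name `count_identities` is illustrative. Here a gap character never counts as a match.

```python
def count_identities(top, bottom):
    """Count columns where two aligned, equal-length strings agree."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(top, bottom) if x == y and x != "-")


print(count_identities("abcdef", "abceef"))  # 5: one mismatch
print(count_identities("abcdef", "acdef-"))  # 1: deletion shifts everything
print(count_identities("abcdef", "a-cdef"))  # 5: a gap restores the register
```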
14Not all mismatches are the same
- Some amino acids are more substitutable for each
other than others. Serine and threonine are
more alike than tryptophan and alanine.
- We can introduce "mismatch costs" for handling
different substitutions.
- We don't usually use mismatch costs in aligning
nucleotide sequences, since no substitution is
per se better than any other.
15Many possible alignments to consider
- Without gaps, there are NxM possible
alignments between sequences of length N and M
- Once we start allowing gaps, there are many
possible arrangements to consider
abcbcd    abcbcd    abcbcd
abc--d    a--bcd    ab--cd
- This becomes a very large number when we allow
mismatches, since we then need to look at every
possible pairing between elements: there are
roughly N^M possible alignments.
16Exponential computations get big fast
- If n = m = 100, there are 100^100 = 10^200
(a 1 followed by 200 zeros) different alignments.
- And 100 amino acids is a small protein!
17Avoiding random alignments with a score function
- Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible
s--e-----qu---en--ce
sometimesquipsentice
- Need to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance.
- Define a score function that accounts for both
element mismatches and a gap penalty
18Match scores
- Match scores are often calculated on the basis of
the frequency of particular mutations in very
similar sequences.
- We can transform substitution frequencies into
log odds scores, which can then be added together.
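A log odds score compares how often a pair of residues is seen aligned with how often it would be expected by chance. The sketch below uses base-2 logarithms and invented frequencies. The numbers are made up for illustration and are not taken from any published matrix.

```python
import math


def log_odds(pair_freq, freq_a, freq_b):
    """log2 of (observed frequency of the aligned pair) over
    (frequency expected if the two residues were paired at random)."""
    return math.log2(pair_freq / (freq_a * freq_b))


# Hypothetical values: a pair seen 8 times more often than chance predicts
# gets a positive score; a pair seen half as often gets a negative one.
favoured = log_odds(pair_freq=0.02, freq_a=0.05, freq_b=0.05)
disfavoured = log_odds(pair_freq=0.00125, freq_a=0.05, freq_b=0.05)

# Adding log odds scores corresponds to multiplying the underlying odds,
# which is why column scores can simply be summed along an alignment.
total = favoured + disfavoured
```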
19Local vs. Global alignments
- A global alignment includes all elements of a
sequence, and includes gaps
- A global alignment may or may not include "end
gap" penalties.
- A local alignment includes only subsequences,
and is sometimes computed without gaps.
- Local alignments can find shared domains in
divergent proteins and are fast to compute
- Global alignments are better indicators of
homology and take longer to compute.
20An alignment score
- An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted
for each gap.
- Gap penalties are usually "affine", meaning that
the penalty for one long gap is smaller than the
penalty for many smaller gaps that add up to the
same size.
a b c - - d
a c c e f d
Match score: 9 + 2 + 7 + 6 = 24
Gap penalty: 10 (start) + 2 (continuation)
Alignment score: 24 - (10 + 2) = 12
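The worked example above can be reproduced with a short scoring function. This sketch assumes the four match scores belong to the ungapped columns in order (a/a, b/c, c/c, d/d). It charges the start penalty for the first gapped column of a run and the continuation penalty for each further one.

```python
def alignment_score(top, bottom, match_score, gap_start, gap_continue):
    """Sum the match scores and subtract affine gap penalties."""
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First column of a gap costs gap_start, later ones gap_continue.
            # (Simplification: a gap that switches strands counts as one run.)
            score -= gap_continue if in_gap else gap_start
            in_gap = True
        else:
            score += match_score[(x, y)]
            in_gap = False
    return score


# Scores for the columns of the example, read off the slide in order.
match_score = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}

print(alignment_score("abc--d", "accefd", match_score,
                      gap_start=10, gap_continue=2))  # 12
```

With these penalties one gap of length two costs 12, while two separate gaps of length one would cost 20.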
21Finding the optimal alignment
- Given a pair of sequences and a score function,
identify the best scoring (optimal) alignment
between the sequences.
- Remember, exponential number of possible
alignments (most with terrible scores).
- Computer science to the rescue: dynamic
programming identifies optimal alignments in time
roughly proportional to the product of the
lengths of the sequences
22Dynamic programming
- The name comes from an operations research task,
and has nothing to do with writing programs.
- The key idea is to start aligning the sequences
left to right: once a prefix is optimally
aligned, nothing about the remainder of the
alignment changes the alignment of the prefix.
- We construct a matrix of possible alignment
scores (NxM cells, up to NxM^2 calculations worst
case) and then "traceback" to find the optimal
alignment.
- Called Needleman-Wunsch (global alignment) or
Smith-Waterman (local alignment)
23Alignment matrix
- Create a matrix with each sequence to be aligned
along one edge and the score of the alignment of
each pair of elements in a cell.
- Best ungapped local alignment is just the highest
scoring diagonal
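A minimal sketch of this matrix, assuming a toy score of +1 for identical elements and -1 otherwise. The function name `best_diagonal` is illustrative. It scans every diagonal for the contiguous run with the highest total, which is the best ungapped local alignment under this scoring.

```python
def best_diagonal(a, b, match=1, mismatch=-1):
    """Return (score, start_in_a, start_in_b, length) of the best run."""
    # One cell per pair of elements, holding that pair's score.
    matrix = [[match if x == y else mismatch for y in b] for x in a]
    best = (0, 0, 0, 0)
    # Each diagonal is identified by offset = column - row.
    for offset in range(-(len(a) - 1), len(b)):
        i, j = (-offset, 0) if offset < 0 else (0, offset)
        run, start = 0, (i, j)
        while i < len(a) and j < len(b):
            if run <= 0:              # a non-positive prefix never helps
                run, start = 0, (i, j)
            run += matrix[i][j]
            if run > best[0]:
                best = (run, start[0], start[1], i - start[0] + 1)
            i += 1
            j += 1
    return best


print(best_diagonal("xxabcdyy", "zabcdz"))  # (4, 2, 1, 4): the shared "abcd"
```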
24Dynamic programming matrix
- Each cell has the score for the best aligned
sequence prefix up to that position.
- Number in ( )s is the alignment score for the
pair of amino acids at that position.
- Gap penalty here is -12 to start and -4 to
continue.
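Filling such a matrix can be sketched as follows for the local (Smith-Waterman style) case. To keep the example short it assumes a single linear gap penalty and toy match and mismatch scores, rather than the affine -12/-4 penalties and amino acid scores used on the slide.

```python
def local_alignment_matrix(a, b, match=2, mismatch=-1, gap=-2):
    """Fill the dynamic programming matrix for local alignment.

    Cell [i][j] holds the best score of any local alignment that ends
    at a[i-1] and b[j-1]. Returns (matrix, best score anywhere).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a fresh local alignment
                H[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,       # gap in b
                H[i][j - 1] + gap,       # gap in a
            )
            best = max(best, H[i][j])
    return H, best


H, best = local_alignment_matrix("xxabcdyy", "zabcdz")
print(best)  # 8: four matching pairs at +2 each
```

Each cell is computed from just three neighbours, so the whole matrix takes on the order of NxM steps under this simplified gap model.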
25Optimal alignment by traceback
- We traceback a path that gets us the highest
score. If we don't have end gap penalties,
then take any path from the last row or column to
the first.
- Otherwise we need to include the top and bottom
corners
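For practice alongside the hand-worked examples, here is a hedged sketch of a global (Needleman-Wunsch style) alignment with traceback. It assumes end gaps are penalized, so the path runs from the bottom-right corner to the top-left corner. It also assumes a linear gap penalty with toy scores, which is simpler than the affine scheme on the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned to gaps only
        F[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned to gaps only
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback: walk from the bottom-right corner back to the top-left,
    # at each cell following a move that explains the score stored there.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("abcdef", "acdef"))  # (4, 'abcdef', 'a-cdef')
```

When several paths tie, this sketch prefers the diagonal move, so it reports only one of the optimal alignments.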
26Study guide....
- Dynamic programming alignments are a key
technology in bioinformatics, and you should
understand how they work.
- The method is counterintuitive
- Work some examples by hand.
- More detail and supplementary material on the
course web site.
27How do we pick match scores?
- For match scores, two main options
- PAM: based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives.
- BLOSUM: based on local alignments in much more
diverse sequences
- Picking the right distance is important, and may
be hard to do. BLOSUM seems to work better for
more evolutionarily distant sequences. BLOSUM62
is a good default.
28Picking gap penalties
- Many different possible forms
- Most common is affine (gap open + gap continue
penalties)
- More complex penalties have been proposed.
- Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences
the gap penalty
- Most alignment programs suggest appropriate
penalties for each match score option.
29Searching for optimal scores
- One possibility is to try several different match
scores and gap penalties, and choose the best
- In general, this is called parameter space search
and it is important in many areas.
- Problems
- requires a lot of computation
- we need some principled way to compare the
results.
- Use significance testing to compare...
30The significance of an alignment
- Significance testing is the branch of statistics
that is concerned with assessing the probability
that a particular result could have occurred by
chance.
- How do we calculate the probability that an
alignment occurred by chance?
- Either with a model of evolution, or
- Empirically, by scrambling our sequences and
calculating scores on many randomized sequences.
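The empirical approach can be sketched as a shuffle test. Everything here is an illustrative assumption: the toy local alignment score with a linear gap penalty, the number of trials, the fixed random seed, and the add-one estimate of the p-value.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (toy scores, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def empirical_p_value(a, b, trials=200, seed=0):
    """Estimate how often a scrambled b scores at least as well as the real b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)          # same composition, order destroyed
        if local_score(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (at_least_as_good + 1) / (trials + 1)


observed, p = empirical_p_value("ACGTTGCAAGCTTACGGATC",
                                "ACGTTGCAAGCTTACGGATC")
```

Shuffling keeps the composition of the sequence but destroys its order, so the scrambled scores show what this scoring scheme produces by chance alone.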
31For next week
- Read
- Molecular Biology Database Collection
- Entrez tutorial
- Lecture notes on class web site
- BLAST education site
- Try working some dynamic programming examples by
hand.