Lecture 2: Database search Based on class by Larry Hunter presentation

1
Lecture 2: Database search (Based on class by
Larry Hunter)
2
First: Behind the Screen

Databases are largely devoted to search.
Also, integrity, security, etc.
Search means taking a query and retrieving some
database entry that matches it.
We will start by discussing how to find an exact
match, and then move to finding inexact matches
(like BLAST search).
Efficiency is key: we want to find things fast,
regardless of how big the database gets.

3
Computational Complexity

A key idea in computer science: how much work
does it take to solve a class of problems?
How do we measure complexity?
Relative to problem size
How long does it take?
Clock time versus operations
Order O(?) notation
Other resources used (particularly space)

4
The complexity of search

Compare several algorithms for exact search
exhaustive (linear time, constant space)
indexed (log time, linear space)
hash tables (constant time, linear space)
Then look at inexact search methods
Dynamic programming (Smith-Waterman)
BLAST

5
Linear search

Database
ACTGA
TTAGG
CGTAA
AGAGA
CGATA
CCGGA
GCCCT
TTACG

Test query against each target sequentially
Worst case, query matches last target and you
have as many tests as targets (size of database)
Average case, test half the targets.
Linear in the size of the database
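
The sequential test described above can be sketched in a few lines of Python. The function name `linear_search` and the test counter are illustrative assumptions, not part of the lecture; the database is the one shown on the slide.

```python
def linear_search(database, query):
    """Return (index, tests) for the first exact match, or (-1, tests) if absent."""
    tests = 0
    for index, target in enumerate(database):
        tests += 1  # one comparison per target examined
        if target == query:
            return index, tests
    return -1, tests


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA", "CGATA", "CCGGA", "GCCCT", "TTACG"]

# Worst case: the query matches the last target, so all 8 targets are tested.
print(linear_search(database, "TTACG"))  # (7, 8)
```

A query that is not in the database also costs one test per target, which is why the work grows linearly with database size.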

6
Data Representation
Data structures (internal representations of
data) affect computational complexity

Linked lists
Variable length
Each element has a pointer to the next (doubles
space)

Vectors (arrays)
Fixed length
Often allocate more space than necessary
Each element has a specific position

actga
ttgaca
...
7
Computational Complexity

For linked lists
finding the nth element takes time proportional
to the length of the list
insertion or deletion takes constant time
For vectors
finding the nth element takes constant time.
insertion or deletion takes time proportional to
the length of the list (have to move all
subsequent entries).
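
A minimal sketch of the trade-off, assuming a hand-rolled `Node` class for the linked list and a built-in Python list standing in for the vector (both choices are illustrative):

```python
class Node:
    """One linked-list element: a value plus a pointer to the next element."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def nth(head, n):
    """Walk n links from the head: time proportional to n."""
    node = head
    for _ in range(n):
        node = node.next
    return node


def insert_after(node, value):
    """Splice a new node in after `node`: constant time, nothing else moves."""
    node.next = Node(value, node.next)


# Linked list: actga -> ttgaca
head = Node("actga", Node("ttgaca"))
insert_after(head, "cgtaa")  # actga -> cgtaa -> ttgaca
linked_second = nth(head, 1).value

# Vector: constant-time indexing, but insert shifts every later entry
vector = ["actga", "ttgaca"]
vector.insert(1, "cgtaa")
vector_second = vector[1]
```

Note that the constant-time insertion assumes you already hold a pointer to the node you are inserting after; finding that node is still a linear walk.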

8
Indexed (binary) search

Create a sorted set of keys that point to
entries
Start in the middle, then figure out which half
Eliminate half the database each step, so need
log2(N) steps at worst for N entries
Need to build the index (takes space and time at
each database update)
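
One way to sketch an indexed search is with Python's standard `bisect` module. The helper names `build_index` and `indexed_search` are assumptions made for illustration; the index here is simply a sorted list of (key, position) pairs.

```python
from bisect import bisect_left


def build_index(database):
    """Sorted (key, position) pairs; must be redone when the database changes."""
    return sorted((sequence, position) for position, sequence in enumerate(database))


def indexed_search(index, query):
    """Binary search over the sorted keys; returns the database position or -1."""
    # (query,) sorts just before any (query, position) pair
    i = bisect_left(index, (query,))
    if i < len(index) and index[i][0] == query:
        return index[i][1]
    return -1


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA", "CGATA", "CCGGA", "GCCCT", "TTACG"]
index = build_index(database)
print(indexed_search(index, "TTACG"))  # 7
```

Building the index costs a sort plus linear extra space, which is the price paid for logarithmic lookups.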

9
Hash tables
f(TTACG) = 8

Map each query to an arbitrary number with a
hash function
Use those numbers as an index into a table
Collisions can happen, but are rare
Constant time lookup

10
What makes a good hash function?

Basic: must map keys to a number that is within
the size of the table
Desired: minimize collisions
So similar keys should lead to different numbers
Good general method: map key to a number, and
then take the remainder when divided by a prime
number
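
A minimal sketch of the "map to a number, then take the remainder modulo a prime" recipe. The base-4 encoding, the table size of 11, and the function names are all assumptions for illustration; this is not the function f shown on the earlier slide, so it will not necessarily map TTACG to 8. Collisions are handled by keeping a short list in each table slot.

```python
TABLE_SIZE = 11  # a small prime, chosen only for illustration


def sequence_hash(sequence, table_size=TABLE_SIZE):
    """Read the sequence as a base-4 number, then take the remainder mod a prime."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    number = 0
    for base in sequence:
        number = number * 4 + code[base]
    return number % table_size


def build_table(database, table_size=TABLE_SIZE):
    """Place each (sequence, position) pair in the slot chosen by its hash."""
    table = [[] for _ in range(table_size)]
    for position, sequence in enumerate(database):
        table[sequence_hash(sequence, table_size)].append((sequence, position))
    return table


def hash_lookup(table, query):
    """Check only the one slot the query hashes to; return position or -1."""
    for sequence, position in table[sequence_hash(query, len(table))]:
        if sequence == query:
            return position
    return -1


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA", "CGATA", "CCGGA", "GCCCT", "TTACG"]
table = build_table(database)
```

Lookup time depends on how many entries share a slot, not on the size of the database, which is why a hash function that spreads keys evenly matters.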

11
Pairwise Sequence Alignment

What is an alignment, and why might it be
significant?
An alignment is a mapping from one sequence to
another, identifying elements that are likely to
have arisen from a common ancestor
A good alignment is an indication of homology
Alignments are NOT exact matches. We will need a
method to find good alignments in a database...

12
Similarity vs. Homology; Paralogs vs. Orthologs

Homology is an evolutionary relationship that
either exists or does not. It cannot be partial.
An ortholog is a homolog that arose through a
speciation event. Orthologs often share function.
A paralog is a homolog that arose through a gene
duplication event. Paralogs often have divergent
function.
Similarity is a measure of the quality of
alignment between two sequences. High similarity
is evidence for homology. Similar sequences may
be orthologs or paralogs.

13
How do we compute similarity?

Similarity can be defined by counting positions
that are identical between two sequences
Gaps (insertions/deletions) can be important
abcdef   abcdef   abcdef
abceef   acdef    a-cdef
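
Counting identical positions for the three pairings above can be sketched as follows. The function name `count_identities` is an assumption; `-` is treated as a gap character that never counts as a match.

```python
def count_identities(a, b):
    """Count aligned positions holding the same character; '-' marks a gap."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


print(count_identities("abcdef", "abceef"))  # 5: a single mismatch
print(count_identities("abcdef", "acdef"))   # 1: the deletion shifts everything after 'a'
print(count_identities("abcdef", "a-cdef"))  # 5: one gap restores the register
```

The second and third pairings compare the same two sequences; only the gap placement differs, which is why gaps matter so much to the similarity score.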

14
Not all mismatches are the same

Some amino acids are more substitutable for each
other than others. Serine and threonine are
more alike than tryptophan and alanine.
We can introduce "mismatch costs" for handling
different substitutions.
We don't usually use mismatch costs in aligning
nucleotide sequences, since no substitution is
per se better than any other.

15
Many possible alignments to consider

Without gaps, there are NxM possible
alignments between sequences of length N and M
Once we start allowing gaps, there are many
possible arrangements to consider:
abcbcd   abcbcd   abcbcd
abc--d   a--bcd   ab--cd
This becomes a very large number when we allow
mismatches, since we then need to look at every
possible pairing between elements: there are
roughly N^M possible alignments.

16
Exponential computations get big fast

If n = m = 100, there are 100^100 = 10^200 =
100,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000 different alignments.
And 100 amino acids is a small protein!

17
Avoiding random alignments with a score function

Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible
s--e-----qu---en--ce
sometimesquipsentice
Need to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance.
Define a score function that accounts for both
element mismatches and a gap penalty

18
Match scores

Match scores are often calculated on the basis of
the frequency of particular mutations in very
similar sequences.
We can transform substitution frequencies into
log odds scores, which can then be added together.
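
As a rough sketch of the log-odds idea: the score for a pair compares how often the pair is observed in trusted alignments against how often it would occur by chance. The function name, the use of base-2 logarithms, and all frequencies below are made-up assumptions for illustration, not values from any published matrix.

```python
import math


def log_odds(pair_frequency, freq_a, freq_b):
    """log2(observed pair frequency / frequency expected if a and b were independent)."""
    return math.log2(pair_frequency / (freq_a * freq_b))


# Hypothetical numbers: each residue has background frequency 0.1.
favoured = log_odds(0.02, 0.1, 0.1)     # seen twice as often as chance: about +1
disfavoured = log_odds(0.005, 0.1, 0.1)  # seen half as often as chance: about -1

# Adding log-odds scores corresponds to multiplying the underlying odds ratios.
total = favoured + disfavoured           # about 0: the two positions cancel out
```

Working in logarithms is what lets an alignment score be a simple sum over positions.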

19
Local vs. Global alignments

A global alignment includes all elements of a
sequence, and includes gaps
A global alignment may or may not include "end
gap" penalties.
A local alignment includes only subsequences,
and is sometimes computed without gaps.
Local alignments can find shared domains in
divergent proteins and are fast to compute
Global alignments are better indicators of
homology and take longer to compute.

20
An alignment score

An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted
for each gap.
Gap penalties are usually "affine", meaning that
the penalty for one long gap is smaller than the
penalty for many smaller gaps that add up to the
same size.

a b c - - d
a c c e f d
9 2 7     6

Match score: 9 + 2 + 7 + 6 = 24
Gap start + continuation penalty: 10 + 2 = 12
Alignment score: 24 - (10 + 2) = 12
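
The arithmetic of the slide's example (match scores 9, 2, 7, 6 and one gap of length two) can be sketched as a small function. The convention assumed here, that a gap of length L costs gap_start + gap_continue * (L - 1), is chosen because it reproduces the slide's 24 - (10 + 2) = 12; other programs define the two penalties slightly differently.

```python
def alignment_score(match_scores, gap_lengths, gap_start=10, gap_continue=2):
    """Sum of match scores minus an affine penalty for each gap.

    Assumed convention: a gap of length L costs
    gap_start + gap_continue * (L - 1).
    """
    penalty = sum(gap_start + gap_continue * (length - 1) for length in gap_lengths)
    return sum(match_scores) - penalty


print(alignment_score([9, 2, 7, 6], [2]))     # 12: one gap of length two
print(alignment_score([9, 2, 7, 6], [1, 1]))  # 4: two separate gaps of length one
```

The second call shows the affine property: the same number of gapped positions costs more when split across two gaps.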
21
Finding the optimal alignment

Given a pair of sequences and a score function,
identify the best scoring (optimal) alignment
between the sequences.
Remember, exponential number of possible
alignments (most with terrible scores).
Computer science to the rescue: dynamic
programming identifies optimal alignments in time
roughly proportional to the product of the lengths
of the sequences

22
Dynamic programming

The name comes from an operations research task,
and has nothing to do with writing programs.
The key idea is to start aligning the sequences
left to right: once a prefix is optimally
aligned, nothing about the remainder of the
alignment changes the alignment of the prefix.
We construct a matrix of possible alignment
scores (NxM^2 calculations worst case) and then
"traceback" to find the optimal alignment.
Called Needleman-Wunsch (global alignment) or
Smith-Waterman (local alignment)
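
A minimal sketch of the global (Needleman-Wunsch style) version, under simplifying assumptions: a flat match/mismatch score instead of a substitution matrix, and a linear gap penalty instead of an affine one. With these simplifications each cell needs only its three neighbours, so the fill takes about N x M steps. The function name and scoring values are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, top, bottom)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )

    # Traceback from the bottom-right corner to recover one optimal alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal_ok = False
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal_ok = score[i][j] == score[i - 1][j - 1] + pair
        if diagonal_ok:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("abcdef", "acdef"))  # (3, 'abcdef', 'a-cdef')
```

Each cell is computed once from already-optimal prefixes, which is exactly the "nothing about the remainder changes the prefix" property.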

23
Alignment matrix

Create a matrix with each sequence to be aligned
along one edge and the score of the alignment of
each pair of elements in a cell.
Best local alignment is just the highest
scoring diagonal
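
For the ungapped case, "highest scoring diagonal" can be read as the best-scoring contiguous run along any diagonal of the matrix. The sketch below assumes a flat match/mismatch score and a hypothetical function name; it restarts a run whenever its score drops to zero or below.

```python
def best_ungapped_local(a, b, match=1, mismatch=-1):
    """Return (score, start_in_a, start_in_b, length) of the best diagonal run."""
    best = (0, 0, 0, 0)
    # prev_score[j], prev_len[j]: best run ending at the cell up and to the left
    prev_score = [0] * (len(b) + 1)
    prev_len = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur_score = [0] * (len(b) + 1)
        cur_len = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            extended = prev_score[j - 1] + pair
            if extended > 0:
                cur_score[j] = extended
                cur_len[j] = prev_len[j - 1] + 1
                if extended > best[0]:
                    best = (extended, i - cur_len[j], j - cur_len[j], cur_len[j])
            # otherwise the run is dropped and restarts from zero
        prev_score, prev_len = cur_score, cur_len
    return best


score, start_a, start_b, length = best_ungapped_local("xxabcdyy", "zabcdz")
print(score, "xxabcdyy"[start_a:start_a + length])  # 4 abcd
```

Allowing gaps means a path may step off the diagonal, which is what the full dynamic programming matrix handles.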

24
Dynamic programming matrix

Each cell has the score for the best aligned
sequence prefix up to that position.
Numbers in ( )s are the alignment score for the
pair of amino acids at that position.
Gap penalty here is -12 to start and -4 to
continue.

25
Optimal alignment by traceback

We trace back a path that gets us the highest
score. If we don't have end gap penalties,
then take any path from the last row or column to
the first.
Otherwise we need to include the top and bottom
corners

26
Study guide....

Dynamic programming alignments are a key
technology in bioinformatics, and you should
understand how they work.
The method is counterintuitive
Work some examples by hand.
More detail and supplementary material on the
course web site.

27
How do we pick match scores?

For match scores, two main options:
PAM: based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives.
BLOSUM: based on local alignments in much more
diverse sequences
Picking the right distance is important, and may
be hard to do. BLOSUM seems to work better for
more evolutionarily distant sequences. BLOSUM62
is a good default.

28
Picking gap penalties

Many different possible forms
Most common is affine (gap open + gap continue
penalties)
More complex penalties have been proposed.
Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences
the gap penalty
Most alignment programs suggest appropriate
penalties for each match score option.

29
Searching for optimal scores

One possibility is to try several different match
scores and gap penalties, and choose the best
In general, this is called parameter space search
and it is important in many areas.
Problems
requires a lot of computation
we need some principled way to compare the
results.
Use significance testing to compare...

30
The significance of an alignment

Significance testing is the branch of statistics
that is concerned with assessing the probability
that a particular result could have occurred by
chance.
How do we calculate the probability that an
alignment occurred by chance?
Either with a model of evolution, or
Empirically, by scrambling our sequences and
calculating scores on many randomized sequences.
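
The empirical approach can be sketched as a shuffle test. Everything here is an illustrative assumption: the toy `identity_score` stands in for a real alignment score, the number of trials and the seed are arbitrary, and the add-one correction is one common way to avoid reporting a probability of exactly zero.

```python
import random


def identity_score(a, b):
    """Toy score: identical positions in an ungapped, same-offset comparison."""
    return sum(1 for x, y in zip(a, b) if x == y)


def empirical_p_value(a, b, score=identity_score, trials=1000, seed=0):
    """Estimate how often a shuffled b scores at least as well as the real b."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = score(a, b)
    letters = list(b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, scrambled order
        if score(a, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero
    return (at_least + 1) / (trials + 1)


p_related = empirical_p_value("ACGTACGTACGTACGT", "ACGTACGTACGTACGT")
p_trivial = empirical_p_value("AAAA", "AAAA")  # every shuffle scores the same
```

Shuffling preserves the composition of the sequence, so the test asks whether the order of residues, not just their frequencies, explains the score.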

31
For next week

Read
Molecular Biology Database Collection
Entrez tutorial
Lecture notes on class web site
BLAST education site
Try working some dynamic programming examples by
hand.
