Lecture 2: Database search (Based on class by Larry Hunter)




1
Lecture 2: Database search (Based on class by
Larry Hunter)
2
First Behind the Screen
  • Databases are largely devoted to search.
  • Also, integrity, security, etc.
  • Search means taking a query and retrieving some
    database entry that matches it.
  • We will start by discussing how to find an exact
    match, and then move to finding inexact matches
    (like BLAST search).
  • Efficiency is key: we want to find things fast,
    regardless of how big the database gets.

3
Computational Complexity
  • A key idea in computer science: How much work
    does it take to solve a class of problems?
  • How do we measure complexity?
  • Relative to problem size
  • How long does it take?
  • Clock time versus operations
  • Order O(?) notation
  • Other resources used (particularly space)

4
The complexity of search
  • Compare several algorithms for exact matching
  • exhaustive (linear time, constant space)
  • indexed (log time, linear space)
  • hash tables (constant time, linear space)
  • Then look at inexact search methods
  • Dynamic programming (Smith-Waterman)
  • BLAST

5
Linear search
  • Database
  • ACTGA
  • TTAGG
  • CGTAA
  • AGAGA
  • CGATA
  • CCGGA
  • GCCCT
  • TTACG
  • Test query against each target sequentially
  • Worst case, query matches last target and you
    have as many tests as targets (size of database)
  • Average case, test half the targets.
  • Linear in the size of the database
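
A minimal Python sketch of the linear search described above, using the slide's eight example sequences. The function name `linear_search` and the choice to return the number of tests alongside the match position are assumptions made for illustration.

```python
def linear_search(database, query):
    """Test the query against each target in order.

    Returns (index, tests): the position of the first exact match
    (or -1 if there is none) and how many comparisons were made.
    """
    tests = 0
    for index, target in enumerate(database):
        tests += 1
        if target == query:
            return index, tests
    return -1, tests


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA",
            "CGATA", "CCGGA", "GCCCT", "TTACG"]

# Worst case: the query matches the last target, so every target is tested.
position, tests = linear_search(database, "TTACG")
print(position, tests)
```

The number of tests grows in step with the number of targets, which is what "linear in the size of the database" means here.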

6
Data Representation
Data structures (internal representations of
data) affect computational complexity
  • Linked lists
  • Variable length
  • Each element has a pointer to the next (doubles
    space)
  • Vectors (arrays)
  • Fixed length
  • Often allocate more space than necessary
  • Each element has a specific position

[Figure: example entries (actga, ttgaca, ...) stored in each data structure]
7
Computational Complexity
  • For linked lists
  • finding the nth element takes time proportional
    to the length of the list
  • insertion or deletion takes constant time
  • For vectors
  • finding the nth element takes constant time.
  • insertion or deletion takes time proportional to
    the length of the vector (have to move all
    subsequent entries).
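
The trade-off above can be sketched in Python. The `Node` class and the helper names `nth` and `insert_after` are hypothetical, written only for this illustration; Python's built-in `list` stands in for a vector (it is a resizable array, so it is not strictly fixed length).

```python
class Node:
    """One linked-list element: a value plus a pointer to the next element."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def nth(head, n):
    """Follow n pointers from the head: work grows with n."""
    node = head
    for _ in range(n):
        node = node.next
    return node


def insert_after(node, value):
    """Splice in a new element after `node`: constant work, nothing moves."""
    node.next = Node(value, node.next)


# Linked list: actga -> ttgaca -> cgtaa
head = Node("actga", Node("ttgaca", Node("cgtaa")))
insert_after(head, "agaga")      # now actga -> agaga -> ttgaca -> cgtaa

# Vector: indexing is constant time, but inserting shifts later entries.
vector = ["actga", "ttgaca", "cgtaa"]
third = vector[2]                # direct jump to position 2
vector.insert(1, "agaga")        # ttgaca and cgtaa each move one slot
```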

8
Indexed (binary) search
  • Create a sorted set of keys that point to
    entries
  • Start in the middle, then figure out which half
  • Eliminate half the database each step, so need
    about log2(N) steps at worst for N entries
  • Need to build the index (takes space and time at
    each database update)
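
A minimal sketch of indexed (binary) search, assuming the index is simply a sorted list of (key, original position) pairs. The names `build_index` and `indexed_search` are illustrative, not from the lecture.

```python
def build_index(database):
    """Sorted list of (key, position) pairs; rebuilt when the database changes."""
    return sorted((seq, pos) for pos, seq in enumerate(database))


def indexed_search(index, query):
    """Binary search over the sorted keys.

    Returns (position, steps): the entry's position in the original
    database (or -1 if absent) and the number of halving steps taken.
    """
    lo, hi = 0, len(index)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if index[mid][0] < query:
            lo = mid + 1          # query can only be in the upper half
        else:
            hi = mid              # query can only be in the lower half
    if lo < len(index) and index[lo][0] == query:
        return index[lo][1], steps
    return -1, steps


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA",
            "CGATA", "CCGGA", "GCCCT", "TTACG"]
index = build_index(database)
print(indexed_search(index, "TTACG"))
```

The index costs extra space (one key and pointer per entry) and must be rebuilt or updated on every database change, as the last bullet notes.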

9
Hash tables
f(TTACG) = 8
  • Map each query to an arbitrary number with a
    hash function
  • Use those numbers as an index into a table
  • Collisions can happen, but are rare
  • Constant time lookup

10
What makes a good hash function?
  • Basic: must map keys to a number that is within
    the size of the table
  • Desired: minimize collisions
  • So similar keys should lead to different numbers
  • Good general method: map key to a number, and
    then take the remainder when divided by a prime
    number
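
A sketch of the "map key to a number, then take the remainder modulo a prime" method. Everything specific here is an assumption for illustration: the base-4 encoding of DNA letters, the table size of 11, and the use of per-slot lists (chaining) to handle collisions. It is not meant to reproduce the particular value of f(TTACG) shown on the hash table slide.

```python
TABLE_SIZE = 11                       # a prime, chosen arbitrarily for this sketch
DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def key_to_number(key):
    """Read a DNA string as a base-4 number."""
    number = 0
    for base in key:
        number = number * 4 + DIGITS[base]
    return number


def hash_key(key):
    """Remainder modulo a prime keeps the result within the table."""
    return key_to_number(key) % TABLE_SIZE


database = ["ACTGA", "TTAGG", "CGTAA", "AGAGA",
            "CGATA", "CCGGA", "GCCCT", "TTACG"]

# Each slot holds a short list so that colliding keys can share a slot.
table = [[] for _ in range(TABLE_SIZE)]
for seq in database:
    table[hash_key(seq)].append(seq)


def lookup(query):
    """Jump straight to one slot; only that slot's few entries are compared."""
    return query in table[hash_key(query)]
```

Lookup cost depends on how many keys share a slot, not on the size of the database, which is why lookups are treated as constant time when collisions are rare.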

11
Pairwise Sequence Alignment
  • What is an alignment, and why might it be
    significant?
  • An alignment is a mapping from one sequence to
    another, identifying elements that are likely to
    have arisen from a common ancestor
  • A good alignment is an indication of homology
  • Alignments are NOT exact matches. We will need a
    method to find good alignments in a database...

12
Similarity vs. Homology; Paralogs vs. Orthologs
  • Homology is an evolutionary relationship that
    either exists or does not. It cannot be partial.
  • An ortholog is a homolog that arose through a
    speciation event. Orthologs often share function.
  • A paralog is a homolog that arose through a gene
    duplication event. Paralogs often have divergent
    function.
  • Similarity is a measure of the quality of
    alignment between two sequences. High similarity
    is evidence for homology. Similar sequences may
    be orthologs or paralogs.

13
How do we compute similarity?
  • Similarity can be defined by counting positions
    that are identical between two sequences
  • Gaps (insertions/deletions) can be important
    abcdef    abcdef    abcdef
    abceef    acdef     a-cdef
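
Counting identical positions can be sketched as below, applied to the three example pairs above. The function name `count_identities` and the convention of padding the ungapped pair with a trailing "-" so both strings have equal length are assumptions for illustration.

```python
def count_identities(a, b):
    """Count aligned positions where both sequences have the same letter.

    Both strings must already be aligned to equal length, with "-"
    marking a gap; gap positions never count as identical.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


print(count_identities("abcdef", "abceef"))   # one substitution
print(count_identities("abcdef", "acdef-"))   # deletion, no gap allowed inside
print(count_identities("abcdef", "a-cdef"))   # same deletion, gap placed inside
```

Without a gap, the single deletion throws every later position out of register; allowing the gap restores the matches.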

14
Not all mismatches are the same
  • Some amino acids are more substitutable for each
    other than others. Serine and threonine are
    more alike than tryptophan and alanine.
  • We can introduce "mismatch costs" for handling
    different substitutions.
  • We don't usually use mismatch costs in aligning
    nucleotide sequences, since no substitution is
    per se better than any other.

15
Many possible alignments to consider
  • Without gaps, there are N x M possible
    alignments between sequences of length N and M
  • Once we start allowing gaps, there are many
    possible arrangements to consider:
    abcbcd    abcbcd    abcbcd
    abc--d    a--bcd    ab--cd
  • This becomes a very large number when we allow
    mismatches, since we then need to look at every
    possible pairing between elements: there are
    roughly N^M possible alignments.

16
Exponential computations get big fast
  • If n = m = 100, there are 100^100 = 10^200 =
    100,000,000,000,000,000,000,000,000,000,000,000,00
    0,000,000,000,000,000,000,000,000,000,000,000,000,
    000,000,000,000,000,000,000,000,000,000,000,000,00
    0,000,000,000,000,000,000,000,000,000,000,000,000,
    000,000,000,000,000,000,000,000,000,000,000,000,00
    0,000,000,000,000 different alignments.
  • And 100 amino acids is a small protein!
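
The arithmetic on this slide can be checked directly, since Python integers have arbitrary precision. The variable names are illustrative.

```python
n = m = 100
count = n ** m                 # 100^100 candidate alignments

# 100^100 = (10^2)^100 = 10^200
same_as_power_of_ten = (count == 10 ** 200)

# A 1 followed by 200 zeros has 201 digits.
digits = len(str(count))
print(same_as_power_of_ten, digits)
```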

17
Avoiding random alignments with a score function
  • Not only are there many possible gapped
    alignments, but introducing too many gaps makes
    nonsense alignments possible:
    s--e-----qu---en--ce
    sometimesquipsentice
  • Need to distinguish between alignments that occur
    due to homology, and those that could be expected
    to be seen just by chance.
  • Define a score function that accounts for both
    element mismatches and a gap penalty

18
Match scores
  • Match scores are often calculated on the basis of
    the frequency of particular mutations in very
    similar sequences.
  • We can transform substitution frequencies into
    log odds scores, which can then be added together.
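
One common way to write the log-odds transformation is given below. The symbols are assumptions introduced for this note: q_ab is the observed frequency with which residues a and b are aligned in trusted alignments, and p_a, p_b are their background frequencies.

```latex
s(a, b) = \log \frac{q_{ab}}{p_a \, p_b},
\qquad
\sum_{i} s(a_i, b_i)
  = \log \prod_{i} \frac{q_{a_i b_i}}{p_{a_i} \, p_{b_i}}
```

The second identity is why the scores can be added: summing log-odds over positions is the same as taking the log of the product of the per-position odds.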

19
Local vs. Global alignments
  • A global alignment includes all elements of a
    sequence, and includes gaps
  • A global alignment may or may not include "end
    gap" penalties.
  • A local alignment includes only subsequences,
    and is sometimes computed without gaps.
  • Local alignments can find shared domains in
    divergent proteins and are fast to compute
  • Global alignments are better indicators of
    homology and take longer to compute.

20
An alignment score
  • An alignment score is the sum of all the match
    scores of an alignment, with a penalty subtracted
    for each gap.
  • Gap penalties are usually "affine", meaning that
    the penalty for one long gap is smaller than the
    penalty for many smaller gaps that add up to the
    same size.

    a b c - - d
    a c c e f d
    9 2 7     6    =>  24 - (10 + 2) = 12

    Match scores: 9 + 2 + 7 + 6 = 24
    Gap start + continuation penalty: 10 + 2 = 12
    Alignment score: 24 - 12 = 12
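
The slide's example (match scores 9, 2, 7, 6; one gap of length 2; gap start penalty 10 and continuation penalty 2) can be recomputed with a short sketch. The function name and the convention that the start penalty covers the first gap position and the continuation penalty covers each further position are assumptions, chosen to agree with the slide's 24 - (10 + 2) = 12.

```python
def alignment_score(match_scores, gap_lengths, gap_start=10, gap_continue=2):
    """Sum of match scores minus an affine penalty for each gap."""
    total = sum(match_scores)
    for length in gap_lengths:
        total -= gap_start + gap_continue * (length - 1)
    return total


# One gap of length 2, as on the slide.
one_long_gap = alignment_score([9, 2, 7, 6], [2])

# The same two gap positions split into two separate gaps of length 1.
two_short_gaps = alignment_score([9, 2, 7, 6], [1, 1])
print(one_long_gap, two_short_gaps)
```

Splitting the gap costs two start penalties instead of one, so the single long gap scores higher; that is the behaviour "affine" refers to.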
21
Finding the optimal alignment
  • Given a pair of sequences and a score function,
    identify the best scoring (optimal) alignment
    between the sequences.
  • Remember, exponential number of possible
    alignments (most with terrible scores).
  • Computer science to the rescue: dynamic
    programming identifies optimal alignments in time
    roughly proportional to the product of the
    lengths of the sequences

22
Dynamic programming
  • The name comes from an operations research task,
    and has nothing to do with writing programs.
  • The key idea is to start aligning the sequences
    left to right: once a prefix is optimally
    aligned, nothing about the remainder of the
    alignment changes the alignment of the prefix.
  • We construct a matrix of possible alignment
    scores (N x M^2 calculations worst case) and then
    "traceback" to find the optimal alignment.
  • Called Needleman-Wunsch (global alignment) or
    Smith-Waterman (local alignment)
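
A compact sketch of global alignment by dynamic programming, in the Needleman-Wunsch style. It makes simplifying assumptions that are not from the lecture: a single fixed penalty per gap position (not affine), end gaps penalized like any other gap, and illustrative scores of +1 for a match, -1 for a mismatch and -2 per gap position.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("abcdef", "acdef"))
```

With a fixed per-position gap penalty, each of the N x M cells is filled by looking at three neighbours, so the work grows with the product of the sequence lengths.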

23
Alignment matrix
  • Create a matrix with each sequence to be aligned
    along one edge and the score of the alignment of
    each pair of elements in a cell.
  • Best ungapped local alignment is just the
    highest scoring run along a diagonal
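
A sketch of scanning the alignment matrix for its best diagonal, reading "highest scoring diagonal" as the best ungapped run of consecutive cells along any one diagonal; that reading, the function name, and the +1/-1 pair scores are assumptions for illustration.

```python
def best_diagonal_run(a, b, match=1, mismatch=-1):
    """Find the best-scoring ungapped local alignment.

    Returns (score, start_in_a, start_in_b, length). Each diagonal of
    the matrix corresponds to one fixed offset between the sequences.
    """
    best = (0, 0, 0, 0)
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        running, start_i, start_j, length = 0, i, j, 0
        while i < len(a) and j < len(b):
            running += match if a[i] == b[j] else mismatch
            length += 1
            if running <= 0:
                # A run that has dropped to zero is not worth extending.
                running, length = 0, 0
                start_i, start_j = i + 1, j + 1
            elif running > best[0]:
                best = (running, start_i, start_j, length)
            i += 1
            j += 1
    return best


score, start_a, start_b, length = best_diagonal_run("xxabcdyy", "zabcdz")
print(score, "xxabcdyy"[start_a:start_a + length])
```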

24
Dynamic programming matrix
  • Each cell has the score for the best aligned
    sequence prefix up to that position.
  • Number in ( )s is the alignment score for the
    pair of amino acids at that position.
  • Gap penalty here is -12 to start and -4 to
    continue.

25
Optimal alignment by traceback
  • We trace back a path that gets us the highest
    score. If we don't have end gap penalties,
    then take any path from the last row or column to
    the first.
  • Otherwise we need to include the top and bottom
    corners

26
Study guide....
  • Dynamic programming alignments are a key
    technology in bioinformatics, and you should
    understand how they work.
  • The method is counterintuitive
  • Work some examples by hand.
  • More detail and supplementary material on the
    course web site.

27
How do we pick match scores?
  • For match scores, two main options:
  • PAM: based on global alignments of closely related
    sequences. Normalized to changes per 100 sites,
    then exponentiated for more distant relatives.
  • BLOSUM: based on local alignments in much more
    diverse sequences
  • Picking the right distance is important, and may
    be hard to do. BLOSUM seems to work better for
    more evolutionarily distant sequences. BLOSUM62
    is a good default.

28
Picking gap penalties
  • Many different possible forms
  • Most common is affine (gap open + gap continue
    penalties)
  • More complex penalties have been proposed.
  • Penalties must be commensurate with match scores.
    Therefore, the match scoring scheme influences
    the gap penalty
  • Most alignment programs suggest appropriate
    penalties for each match score option.

29
Searching for optimal scores
  • One possibility is to try several different match
    scores and gap penalties, and choose the best
  • In general, this is called parameter space search
    and it is important in many areas.
  • Problems:
  • requires a lot of computation
  • we need some principled way to compare the
    results.
  • Use significance testing to compare...

30
The significance of an alignment
  • Significance testing is the branch of statistics
    that is concerned with assessing the probability
    that a particular result could have occurred by
    chance.
  • How do we calculate the probability that an
    alignment occurred by chance?
  • Either with a model of evolution, or
  • Empirically, by scrambling our sequences and
    calculating scores on many randomized sequences.
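
The empirical approach can be sketched as follows. The scoring function (a plain count of identical positions), the number of trials, the fixed random seed, and the "+1" correction in the estimate are all assumptions for illustration; a real analysis would use the actual alignment score.

```python
import random


def identity_score(a, b):
    """Toy score: number of positions where the two sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def empirical_p_value(a, b, score_fn, trials=1000, seed=0):
    """Estimate how often a scrambled version of b scores as well as b.

    Scrambling keeps b's composition but destroys its order, so the
    shuffled scores show what can be expected by chance alone.
    """
    rng = random.Random(seed)
    observed = score_fn(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if score_fn(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Adding 1 to both counts avoids reporting a probability of exactly zero.
    return (at_least_as_good + 1) / (trials + 1)


protein = "ACDEFGHIKLMNPQRSTVWY"
p = empirical_p_value(protein, protein, identity_score)
print(p)
```

A small value means the observed score is rarely reached by scrambled sequences, which is the sense in which the alignment is unlikely to have occurred by chance.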

31
For next week
  • Read
  • Molecular Biology Database Collection
  • Entrez tutorial
  • Lecture notes on class web site
  • BLAST education site
  • Try working some dynamic programming examples by
    hand.