Title: Large-Scale Oligonucleotide Design: Principles and Application in Osprey. Dr. Christoph Sensen
1Large-Scale Oligonucleotide Design: Principles
and Application in Osprey
Dr. Christoph Sensen
2Overview
- Review of the Biochemistry
- Review of existing techniques and component tools
- How we've improved calculations in Osprey
- How to access Osprey
- Goals
- See existing analysis tools in a larger context
- Realize the advantages of computationally
expensive dynamic (programming) methods
3Oligo Usage
- Osprey focuses on applications that require the
design of large numbers of oligos, i.e. those
that benefit from automation.
- Microarrays
- Large clone and directed genome sequencing
4Thermodynamic Parameters
- We want
- Binding to a target sequence that has the right
duplex melting temperature for the experiment
- To distinguish between similar genes
- We want to avoid
- Oligos that fold back on themselves (hairpins)
- Oligos that bind to other copies of themselves
(dimers)
- Oligos that bind well to more than one potential
location in the DNA sample (secondary binding)
5Melting Temperatures
- Oligo duplexes are not two-state, therefore the
melting temperature (Tm) is considered the point
at which half the duplexes are denatured.
- Simple models include
- Wallace: $T_m = 2(A+T) + 4(G+C)$ (in 1M NaCl),
where A, T, G and C are the base counts
- $T_m = 100.5 + 41(G+C)/(A+T+G+C) - 820/(A+T+G+C) + 16.6\log_{10}[\mathrm{Na^+}]$
- There are many models for determining melting
temperature, but they share a common theme...
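The two simple models on this slide can be sketched in a few lines of Python. This is only an illustration of the formulas as written above; the function names are hypothetical, and the sodium concentration is assumed to be given in mol/L.

```python
import math


def wallace_tm(seq: str) -> float:
    """Wallace rule: 2 degrees C per A/T and 4 degrees C per G/C (1 M NaCl)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def gc_fraction_tm(seq: str, na_molar: float = 1.0) -> float:
    """GC-fraction model with length and salt terms, in degrees C."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    n = at + gc  # only unambiguous bases are counted
    return 100.5 + 41.0 * gc / n - 820.0 / n + 16.6 * math.log10(na_molar)


print(wallace_tm("GCCCTA"))              # 20.0
print(round(gc_fraction_tm("AT" * 35), 2))
```

Both functions depend only on length and base counts, which is the common theme the slide points to.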
6Oligonucleotide Length
- Melting temperature increases with oligo duplex
length, and depends on the base composition of
the sequence.
- These two factors create a sweet spot for
finding oligos of the appropriate melting
temperature, based on oligo length and the base
composition probabilities in the input sequence
- e.g. In Desulfovibrio vulgaris (65% GC) the
average 70mer melts at about 100 °C in 1M NaCl
- In Sulfolobus solfataricus (37% GC) it's about
65 °C
7Melting Temperature Complex
- More thorough models are based on physics:
melting temperature is determined by the enthalpy
and the entropy of the oligo duplex
- $T_m = \Delta H^\circ / (\Delta S^\circ + R \ln(C/4))$
- Where
- $\Delta H^\circ$ = enthalpy (order) for the whole duplex
- $\Delta S^\circ$ = entropy (disorder) for the whole duplex
- $R$ = molar gas constant, 1.987 cal/(K·mol)
- $C$ = concentration of DNA
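As a rough sketch, the thermodynamic formula on this slide translates directly into code. The function name and the example values for enthalpy, entropy and concentration are made up for illustration; units are assumed to be cal/mol for enthalpy, cal/(K·mol) for entropy and mol/L for concentration, and the result is in kelvin.

```python
import math

R = 1.987  # molar gas constant in cal/(K*mol), as given on the slide


def duplex_tm_kelvin(dh_cal: float, ds_cal: float, conc_molar: float) -> float:
    """Tm = dH / (dS + R ln(C/4)), returned in kelvin."""
    return dh_cal / (ds_cal + R * math.log(conc_molar / 4.0))


# Illustrative (not measured) duplex values.
tm = duplex_tm_kelvin(dh_cal=-60000.0, ds_cal=-170.0, conc_molar=1e-6)
print(round(tm - 273.15, 1), "degrees C")
```

Raising the DNA concentration in this formula raises the predicted Tm, which is why the concentration term cannot be ignored when comparing oligos.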
8DNA Duplex Energy
- There are several models to determine enthalpy
and entropy, all based on the accepted Nearest
Neighbour (NN) concept: the overall energy of the
duplex can be predicted by summing the
interactions of adjacent base pairs, e.g.
- Enthalpy (H) of GCCCTA:
- $H(\mathrm{GCCCTA}) = H(\mathrm{GC}) + 2H(\mathrm{CC}) + H(\mathrm{CT}) + H(\mathrm{TA})$
- Neighbour energies are experimentally derived
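The decomposition of GCCCTA into neighbour pairs can be sketched as below. The function names are hypothetical, and the energy table in the example holds placeholder numbers only, not experimental values; a real table would come from one of the published NN parameter sets.

```python
from collections import Counter


def nearest_neighbour_counts(seq: str) -> Counter:
    """Count each overlapping pair of adjacent bases in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def nearest_neighbour_sum(seq: str, nn_table: dict) -> float:
    """Sum the per-pair energies over all adjacent pairs in the sequence."""
    counts = nearest_neighbour_counts(seq)
    return sum(nn_table[pair] * n for pair, n in counts.items())


print(nearest_neighbour_counts("GCCCTA"))
# Counter({'CC': 2, 'GC': 1, 'CT': 1, 'TA': 1})

# Placeholder energies purely to show the arithmetic.
placeholder = {"GC": -1.0, "CC": -2.0, "CT": -3.0, "TA": -4.0}
print(nearest_neighbour_sum("GCCCTA", placeholder))  # -12.0
```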
9NN Techniques
- Models derived from empirical data include
Gotoh, Vologodskii, Breslauer, Benight, Sugimoto
- SantaLucia created a unified model from the
combination of these datasets and his own data,
generally considered the best model yet.
10Popular Oligo Tools
- Primer3 (uses the Breslauer model, good for PCR
primers, interactive usage)
- HyTher (SantaLucia's own system, accurate,
interactive usage)
- No product has dominated the microarray design
market yet; most chips are designed using
commercial software
- Free software includes OligoArray, OligoChecker,
PrimeGens
11Why develop new tools?
- Many programs require intimate knowledge of
parameters or cast a wide net to find oligos
- For large scale projects such as microarrays, the
accuracy of the oligos can greatly affect their
usability (e.g. ensuring even melting
temperatures)
- Many tools can't deal with large secondary
binding data (i.e., the known transcriptome) well.
12Osprey Goals
- Require minimal user parameterization and data
preprocessing
- Deal nicely with large datasets
- Be as accurate as possible
- Be fast
13General Procedure for Design
- Read in sequence
- Filter repetitive elements
- Determine optimal melting temperature / oligo
length
- Find candidate duplexes
- Check for undesirable oligo configurations
- Check for undesirable binding to the potential
sample
14Search Techniques
- Run the program iteratively, starting with strict
constraints, and automatically relax them as
needed, optimizing oligo similarity of length and
melting temperature
- Why reinvent the wheel? There are techniques and
tools that can be used and adapted to satisfy the
oligo selection criteria
- These include heuristic methods such as BLAST
for filtering, and dynamic methods where utmost
accuracy is required
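The "start strict, relax as needed" idea can be sketched as a simple loop. This is not Osprey's actual search code; the function name, the tolerance parameters and their step sizes are all assumptions chosen for illustration, and candidates are assumed to arrive as (sequence, Tm) pairs with the Tm already computed.

```python
def pick_with_relaxation(candidates, target_tm, target_len,
                         tm_tol=1.0, len_tol=0,
                         tm_step=1.0, len_step=1, max_rounds=10):
    """Return (best_candidate, rounds_of_relaxation) or (None, max_rounds)."""
    for round_no in range(max_rounds):
        # Keep candidates that satisfy the current tolerances.
        hits = [
            (seq, tm) for seq, tm in candidates
            if abs(tm - target_tm) <= tm_tol
            and abs(len(seq) - target_len) <= len_tol
        ]
        if hits:
            # Prefer the hit closest to the target Tm, then target length.
            best = min(hits, key=lambda h: (abs(h[1] - target_tm),
                                            abs(len(h[0]) - target_len)))
            return best, round_no
        # Nothing passed: loosen both constraints and try again.
        tm_tol += tm_step
        len_tol += len_step
    return None, max_rounds


candidates = [("A" * 20, 60.0), ("A" * 22, 66.5)]
print(pick_with_relaxation(candidates, target_tm=65.0, target_len=20))
```

Because every gene starts from the same strict constraints, most oligos on a chip end up close to the same length and melting temperature, and only the difficult genes use the relaxed settings.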
15Repetitive Regions
- The design of oligos for degenerate repetitive
regions can be tricky, but we would like to
isolate the overlap regions (of significant
length) to bind unique sets of genes.
16Repetitive Analysis Tools
- MEGABLAST compares two large DNA sequences
against each other. If the query and the subject
are the same, you find repeats!
- We use this in Osprey, iteratively, to filter out
repeats, leaving one copy of each repeat.
- The key parameters to tweak in MEGABLAST are word
length (min. exact match), percent identity (ID)
cutoff, and filter disabling (the filter would
otherwise ignore repeats instead of marking them)
- Also useful for cross-genome comparisons
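A self-comparison of this kind might be set up as below. This sketch assumes the NCBI BLAST+ suite, where MEGABLAST is run as `blastn -task megablast`; it only builds the argument list, and the file name, word size and identity threshold are illustrative assumptions rather than the settings Osprey uses.

```python
def megablast_self_command(fasta_path, word_size=28, perc_identity=95.0):
    """Build a BLAST+ command comparing a sequence file against itself."""
    return [
        "blastn", "-task", "megablast",
        "-query", fasta_path,
        "-subject", fasta_path,              # same file: hits are repeats
        "-word_size", str(word_size),        # minimum exact match length
        "-perc_identity", str(perc_identity),
        "-dust", "no",                       # disable low-complexity filter
        "-outfmt", "6",                      # tabular output
    ]


cmd = megablast_self_command("genome.fasta")
print(" ".join(cmd))
# To execute (requires BLAST+ on the PATH):
#   import subprocess
#   result = subprocess.run(cmd, check=True, capture_output=True, text=True)
```

The trivial hit of the sequence against itself over its full length would need to be discarded before the remaining hits are treated as repeats.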
17DNA Folding
- The most commonly used tool for checking the
folding conformation of nucleic acids is the
DNA-FOLD/RNA-FOLD family of programs from Zuker.
We use this to check hairpins, hopefully no more
than 10 to 13 kcal/mol.
- A member of the package useful for large scale
analysis (many small seqs) is quikfold [sic],
which produces only thermodynamic statistics
rather than the pretty fold images available from
the other programs
18Secondary Binding
- The dataset to check against for a microarray
would be all the ORFs (prokaryotes) or all of
the known cDNA (eukaryotes)
- For sequencing it would be the vector and any
known sequence from the clone
- Most software uses a percent identity (ID)
cutoff, but the location of the mismatches can
greatly change the effect on melting temperature!
19Accurate Secondary Binding Checks
- Free energy (G) is derived from the same enthalpy
and entropy values as temperature, so why not use
a dynamic method to optimize it in sequence
searches?
- $\Delta G^\circ = \Delta H^\circ - T\Delta S^\circ$
[Figure: Free energy (G) vs. Tm for, left to right, random
50-, 25-, 20-, 15-, and 10-mers]
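The free energy relation on this slide is a one-line calculation. In this sketch the function name, the default temperature of 37 degrees C and the example enthalpy and entropy values are assumptions for illustration; units are cal/mol and cal/(K·mol).

```python
def free_energy(dh_cal: float, ds_cal: float, temp_celsius: float = 37.0) -> float:
    """dG = dH - T * dS, with T converted from degrees C to kelvin."""
    t_kelvin = temp_celsius + 273.15
    return dh_cal - t_kelvin * ds_cal


# Illustrative (not measured) duplex values.
print(free_energy(-60000.0, -170.0))  # about -7274.5 cal/mol
```

Since the same enthalpy and entropy terms feed both this value and the melting temperature, a search that optimizes free energy is also tracking duplex stability.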
20Profile Alignments
- Hidden Markov Models such as found in the Pfam
database, and Profiles such as found in ProSite
are types of Position Specific Scoring Matrices
(PSSMs).
- At each model position, a score is given to each
possible match/mismatch. An A-G mismatch in
different positions may score differently,
usually based on frequencies in MSAs.
[Figure: the start of a Pfam protein model match]
21Profile Alignments
- We can also use the PSSM to encode nearest
neighbour free energy (G) thermodynamics
- Note that the first A matches with a score of
580, the second with a score of 1000. This is a
position-specific difference, because they have
different neighbours (TA vs. AA)
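The position-specific effect described here can be sketched by deriving each position's match score from the base and its 5' neighbour. The function name is hypothetical, and the table contains only the two scores quoted on this slide (TA and AA); a full profile would need an entry for every neighbour pair plus the mismatch, gap and end terms.

```python
def match_scores(oligo: str, nn_score: dict) -> list:
    """One match score per position (from the 2nd base on), keyed on the
    pair formed by the base and its 5' neighbour."""
    oligo = oligo.upper()
    return [nn_score[oligo[i - 1:i + 1]] for i in range(1, len(oligo))]


# Only the two pair scores quoted on the slide.
slide_scores = {"TA": 580, "AA": 1000}
print(match_scores("TAA", slide_scores))  # [580, 1000]
```

The same base (A) gets two different match scores, which a single pairwise substitution matrix could not express.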
22Caveats
- Profiles, like Smith-Waterman alignments, use
dynamic methods, therefore the optimal solution
will always be found
- Using BLAST to find secondary binding, you may
miss matches with good thermodynamics, because of
the shortcuts used to make BLAST fast.
- At the very least use a dynamic method to check
oligo secondary matches. Profiles will
additionally tell you not only a percent ID, but
also the free energy
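For reference, a minimal Smith-Waterman local alignment score looks like the sketch below. The match, mismatch and gap values are arbitrary placeholders rather than thermodynamic scores, and the two sequences are simply compared as written; a real secondary binding check would score the oligo against the strand it hybridizes to.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))      # 8
print(smith_waterman_score("ACGTACGT", "ACGTTCGT"))  # 13
```

Every cell of the table is filled in, so no candidate alignment is skipped; that exhaustiveness is what makes the method slower than BLAST and also what guarantees the optimum.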
23Results Interpretation
- Given a free energy cutoff, we can check all
secondary results for their melting temperature,
and ensure it is outside the margin desired by
the user (e.g. more than 10 degrees below the
melting temperature of the target duplex)
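The margin check described above amounts to a simple filter. The function name and argument layout are hypothetical; the 10 degree default follows the example in the text, and the melting temperatures are assumed to have been computed already.

```python
def passes_tm_margin(target_tm: float, secondary_tms, margin: float = 10.0) -> bool:
    """True if every secondary duplex melts more than `margin` degrees
    below the target duplex."""
    return all(target_tm - tm > margin for tm in secondary_tms)


print(passes_tm_margin(75.0, [60.0, 64.9]))  # True
print(passes_tm_margin(75.0, [66.0]))        # False
```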
24Thermodynamics Encoding Rules
- Using a profile we can incorporate the most
important thermodynamic values at each position
- 1. a match score is the molar caloric free energy
contribution of the matched base and its 5'
neighbor, and a portion of the unified model's
length-dependent salt concentration penalty.
25Rules cont'd
- 2. a mismatch score includes the free energy
contribution of (a) the matched 5' neighbor and
mismatched base, (b) the matched 5' neighbor and
mismatched base on the opposite strand, and (c) a
discount for the NN contribution in the next
position
26Rules cont'd
- 3. the gap insertion penalty reflects the NN free
energy penalty for single base bulges in a duplex
- 4. the start of the sequence encodes the unified
model's self-complementarity penalties if
applicable. Mismatches in this position also
encode mismatched end thermodynamics.
27Thermodynamic Encoding
- 5. one extra state at each end of the profile
sequence encodes dangling end thermodynamics in
the case that the oligo matches to either
terminus.
- These rules are far too complicated to encode in
a standard pairwise scoring matrix, therefore we
use PSSMs.
- We use Profiles rather than HMMs because the
scores in HMMs must sum to 1 after a conversion
process, a restriction we can't comply with.
28Hardware Acceleration
- Dynamic methods such as PSSMs are expensive, but
we have hardware accelerators.
- Decypher runs these searches at least 50 times
faster than software, so we can use dynamic
methods on the Sulfolobus genome and design a
chip of 3456 optimized oligos in 1.5 hours
29Data Parallelism
- In addition to using hardware accelerators, we
can use software parallelization to speed up
processing, since the calculation of any oligo is
independent from the calculation of any other.
You can apply this principle generally in your
analyses: break down your query files into chunks
that can be run in different shells, then
concatenated at the end.
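Splitting a set of query records into chunks and joining the results afterwards can be sketched as follows. The function name is hypothetical and the per-chunk "analysis" here is a stand-in (sequence length); in practice each chunk would be written to its own file and run in a separate shell or process.

```python
def split_into_chunks(records, n_chunks):
    """Split records into n_chunks contiguous, near-equal parts."""
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


queries = ["ACGT", "GGCC", "TTAA", "ACAC", "GTGT", "CCGG", "ATAT"]
chunks = split_into_chunks(queries, 3)

# Each chunk is independent, so it could run anywhere; here they run in turn.
partial_results = [[len(seq) for seq in chunk] for chunk in chunks]

# Concatenate the per-chunk results in the original order.
results = [r for part in partial_results for r in part]
print(chunks)
print(results)
```

Contiguous chunks are used so that simple concatenation restores the original record order.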
30Web Interface
- A Web interface using the hardware acceleration
for the design of microarrays is available at
- http://osprey.ucalgary.ca/
- Your assignment
- Separately try some of the component tools
Osprey uses
- Try the Osprey interface