Title: Applied Computational Genomics Course
1. High throughput searches at the COE
- Using the Paracel and TimeLogic bioinformatics accelerators
2. Our Goals: Sensitivity & Selectivity
Sensitivity: What proportion of the real hits are reported? (More sensitive means more real hits.)
Selectivity: What proportion of the reported hits are real? (More selective means fewer false positives.)
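The two proportions above can be sketched in a few lines, assuming the set of real hits is known in advance (for example from a curated benchmark). The function name and the sequence identifiers are hypothetical and only serve the illustration.

```python
def sensitivity_selectivity(reported, real):
    """Return (sensitivity, selectivity) for one search.

    reported: set of hit identifiers returned by the search
    real:     set of identifiers known to be true hits (assumed known)
    """
    true_hits = reported & real
    # Sensitivity: share of the real hits that were reported.
    sensitivity = len(true_hits) / len(real) if real else 0.0
    # Selectivity: share of the reported hits that are real.
    selectivity = len(true_hits) / len(reported) if reported else 0.0
    return sensitivity, selectivity


real = {"seqA", "seqB", "seqC", "seqD"}
reported = {"seqA", "seqB", "seqX"}
sens, sel = sensitivity_selectivity(reported, real)
print(f"sensitivity={sens:.2f} selectivity={sel:.2f}")
```

Here two of the four real hits are found (sensitivity 0.50) and two of the three reported hits are real (selectivity about 0.67).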
3. Why do we need all these searches?
- BLAST is optimized for finding protein similarity and strong DNA homology
- Sequences with frameshifts (e.g. ESTs) will not be aligned properly
- BLAST alignments do not span introns nicely
- Distant or short DNA homology may be missed
- Similarity between a sequence and a family of sequences may be a lot stronger than to any particular member of that family
- BLAST will not do global alignments (e.g. make sure an EST matches completely against a genomic segment)
- We need special systems to find these more difficult matches, requiring much more computation
4. Physical Machinery
- TimeLogic system
  - a Sun v880 server with 4 PCI cards containing reprogrammable hardware (a.k.a. Decypher, coe02)
- Paracel system
  - a 24-CPU Linux cluster (a.k.a. BlastMachine, coe04)
  - a unit with 27,000 specialized processors (a.k.a. GeneMatcher, coe05)
5. What can we speed up?
6. Acceleration: General Principles
- Bottlenecks in the computation include:
  - Number of processing cycles available at one time (maxing out the processors)
  - Getting data from disk or memory to the processing unit (causing the processor to wait idly)
  - Unnecessary processing instructions created by generically compiled software
- Accelerators try to reduce these bottlenecks to improve total throughput, not always time-to-completion of a single job.
7. Acceleration Solutions
- Parallelization, including one or more of:
  - All sequences from a batch have their own independent processors to run on (GeneMatcher & Decypher)
  - A single database input stream passes through all the processors (GeneMatcher & Decypher)
  - The database searched is broken down into parts, searched independently, then results are combined (Paracel BlastMachine)
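The third strategy (split the database, search the parts independently, combine the results) can be sketched as follows. This is only an illustration under stated assumptions: `toy_score` is a made-up stand-in for a real search algorithm, threads stand in for cluster nodes, and none of the names reflect the BlastMachine's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def toy_score(query, subject, k=3):
    """Stand-in similarity score: number of distinct shared k-mers (not BLAST)."""
    q_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    s_kmers = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return len(q_kmers & s_kmers)


def search_part(query, part):
    """Search one database part independently; return (score, name) pairs."""
    return [(toy_score(query, seq), name) for name, seq in part]


def split_search(query, database, n_parts=2, top=2):
    # Break the database into n_parts roughly equal slices.
    parts = [database[i::n_parts] for i in range(n_parts)]
    # Search each part independently (threads stand in for cluster nodes).
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial = list(pool.map(lambda part: search_part(query, part), parts))
    # Combine the partial results into one ranked hit list.
    merged = [hit for hits in partial for hit in hits]
    return heapq.nlargest(top, merged)


database = [
    ("s1", "ACGTACGTAC"),
    ("s2", "TTTTTTTTTT"),
    ("s3", "ACGTTTGACC"),
    ("s4", "GGGGCCCCAA"),
]
best = split_search("ACGTACG", database)
print(best)
```

With this toy score the ranking does not depend on how the database is split. Real tools whose statistics depend on database size (such as e-values) need to account for the full database size when combining partial results.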
8. Acceleration Solutions
- Just like 3D graphics cards use hardware to accelerate games, hardware can speed up bioinformatics.
- Hardware logic available includes:
  - ASIC: low-cost hardwiring with some flexibility in the logic using microcode (used by GeneMatcher)
  - FPGA: expensive hardware whose logic is completely reconfigurable (used by Decypher), so new algorithms can be loaded without replacing the hardware
9. Strengths & Weaknesses: Decypher
- Profiles & Hidden Markov Models: fastest by far
- TeraBLAST algorithm performs seed finding in hardware, alignment in software
- GeneBLAST allows introns in local alignment
- User must optimize input batch size to get best throughput (e.g. Magpie sends HMM searches 500 at a time)
- No job progress reporting
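Splitting a large query set into fixed-size batches, as in the Magpie example above, might look like the sketch below. The batch size of 500 comes from the slide; the function name and the query identifiers are hypothetical, and the best size for a given search type would have to be found by experiment.

```python
def batches(records, size=500):
    """Yield successive lists of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Hypothetical query identifiers; in practice these would be sequence records.
queries = [f"query_{i}" for i in range(1200)]
sizes = [len(batch) for batch in batches(queries)]
print(sizes)  # [500, 500, 200]
```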
10. Strengths & Weaknesses: Paracel
- Smith-Waterman searches are the fastest available
- Performs GeneWise searches to match HMMs to genomic data (intron spanning)
- Smart queuing system lumps together jobs that can run in parallel (no need to optimize input size)
- Runs cluster-accelerated software BLAST, PSI-BLAST, and MEGABLAST with nearly the exact output expected from an NCBI search
11. Strengths & Weaknesses: Paracel
- HMM & Profile searches are moderately accelerated, but by default produce inverted results (HMMs vs. seq, not seq vs. HMMs)
- Sometimes large jobs can have RPC timeout issues for remote clients
12. Time comparison
- 21,000 ESTs against non-redundant nucleotide database using TeraTBLASTX on coe02
  - 3 days 5 hours (13 seconds per EST)
- Sample EST vs. same database, using NCBI service
  - 4.5 minutes without queue time (20x slower)
- 1000 proteins vs. Pfam using Decypher
  - 1.5 minutes (0.09 seconds per protein)
- Sample protein vs. Pfam using cluster_at_WUSTL
  - 16 seconds without queue time (175x slower)
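The per-query figures and slowdown factors above follow from simple arithmetic on the quoted timings. The short check below uses only the numbers from this slide; the variable names are illustrative.

```python
# TeraTBLASTX on coe02: 21,000 ESTs in 3 days 5 hours.
decypher_total_s = (3 * 24 + 5) * 3600
per_est_s = decypher_total_s / 21_000

# One EST through the NCBI service: 4.5 minutes, queue time excluded.
ncbi_single_s = 4.5 * 60
est_ratio = ncbi_single_s / per_est_s

# Decypher vs. Pfam: 1000 proteins in 1.5 minutes.
pfam_per_protein_s = (1.5 * 60) / 1000
# One protein on the comparison cluster: 16 seconds, queue time excluded.
pfam_ratio = 16 / pfam_per_protein_s

print(f"{per_est_s:.1f} s per EST, NCBI about {est_ratio:.0f}x slower")
print(f"{pfam_per_protein_s:.2f} s per protein, cluster about {pfam_ratio:.0f}x slower")
```

The computed Pfam ratio is about 178, which the slide rounds to 175x.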
13. Other uses of Smith-Waterman
- As we learn more about RNA interference, we see that microRNAs target genes with matches that BLAST cannot find due to short sequence length and gaps.
- SW and other dynamic programming methods will be important in identifying sites and minimizing off-target activity.
Kiriakidou M, et al. Genes and Development 18(10):1165-78.
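For readers who have not seen the algorithm, here is a minimal score-only Smith-Waterman sketch showing how a short, gapped match is scored. The scoring values (match 2, mismatch -1, linear gap -2) and the test sequences are illustrative assumptions, not the parameters used by the accelerators.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# A short match that needs one gap: 7 matches (14) minus one gap (2) = 12.
print(smith_waterman("ACGTTGCA", "ACGTGCA"))
```

Because every cell of the matrix is filled, no seed word is required, which is why short gapped matches are not missed. The cost is time proportional to the product of the two sequence lengths.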
14. Other Uses of Smith-Waterman
- Because only SW checks all substrings of the query, it is the only search method that can report a real Z-score (FastA also has an estimate)
- The Z-score is the number of standard deviations the alignment raw score is from the distribution mean for this query against this particular database.
- Z-scores are more useful than random expectation (e-value) scores for applications like EST clustering, because they measure how unique matches are. Kinases may have highly significant e-values against each other, but low Z-scores if there are many kinases in the database.
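The Z-score definition above can be written out directly. The sketch below assumes the raw scores of the query against every database sequence are available as a list; the score values are invented for illustration, and the use of the population standard deviation is an assumption (a given tool may estimate the distribution differently).

```python
from statistics import mean, pstdev


def z_score(raw_score, background_scores):
    """Standard deviations between raw_score and the mean of the background."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have no spread")
    return (raw_score - mu) / sigma


# Invented scores of one query against a database with few similar sequences...
sparse_db = [10, 12, 11, 9, 13, 10, 11, 12]
# ...and against a database crowded with relatives (e.g. many kinases).
crowded_db = [10, 12, 34, 33, 36, 35, 11, 34]

z_sparse = z_score(35, sparse_db)
z_crowded = z_score(35, crowded_db)
print(f"unique match Z = {z_sparse:.1f}, crowded family Z = {z_crowded:.1f}")
```

The same raw score of 35 stands far out from the sparse background but is unremarkable against the crowded one, which is the behaviour that makes Z-scores useful for clustering.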
15. Okay, they're fast. Now how do I use them?
- Web interfaces for small jobs using standard parameters
http://coe04.ucalgary.ca
http://coe02.ucalgary.ca
Also through the CBR Web site
16. Command line interface (CLI): why?
- All commands are launched from coe01, which remotely invokes the programs and saves output to a local file
- The command line lets you specify more options than the Web
- These searches can be automated
- Clients can be installed on your local machine too if you do a lot of searches
17. CLI: Paracel BLAST (BlastMachine)
- The same command line invocation as NCBI blastall/megablast/blastpgp, but with pb prepended, and input must be from a file, not standard input (keyboard)
- pb blastall -p blastp -d nr -i in.fasta -o out.txt
- Running pb blastall with no arguments will give command line usage
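Since these searches can be automated, a script might assemble one pb blastall command per query file. The sketch below only builds the argument lists and does not run anything; it assumes the option letters follow NCBI blastall (-p, -d, -i, -o), and the function names and output naming scheme are hypothetical.

```python
from pathlib import Path


def pb_blastall_args(program, database, query_file, output_file):
    """Build the argument list for one search (options assumed as in NCBI blastall)."""
    return ["pb", "blastall", "-p", program, "-d", database,
            "-i", str(query_file), "-o", str(output_file)]


def plan_batch(query_files, program="blastp", database="nr"):
    """One command per query file; output is written next to it with a .txt suffix."""
    return [
        pb_blastall_args(program, database, q, Path(q).with_suffix(".txt"))
        for q in query_files
    ]


for args in plan_batch(["in.fasta", "more.fasta"]):
    print(" ".join(args))
```

On a machine where the pb client is installed, each list could be handed to `subprocess.run(args, check=True)`.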
18. CLI: BioView Toolkit (GeneMatcher)
- A command argument hierarchy exists, all starting with btk
- Syntax help is available for each level of the hierarchy by adding -h1 through -h4 to the end of the command, where 1 is terse and 4 is verbose
- btk -h1
- btk search -h1
- btk search tswx -h2
- btk search tswx format=blast2 matrix=blosum62 evalue_threshold=0.01 dbset=all_est_aa output=results.txt query=query.fasta
19. CLI: TimeLogic Decypher
- The TimeLogic machine searches are invoked via a network socket connection API
- By default, requesting a search will take a second and return you to the command line, with results to be picked up some time later with another command that fetches the output via a socket.
- To wait for results once a search is launched, use the versions of the CLI commands that end in _rt (Real Time); they behave more like standard search programs
20. CLI: TimeLogic Decypher
- Launching Decypher jobs is based on filling in a search template
- Templates for most searches already exist; you just need to specify a few command line options to fill in missing template values (such as the query file name), or to override the template defaults (e.g. significance cutoff)
- The templates consist of plain text files with keyword/value pairs. The Keyword Reference Sheet is good to keep nearby to customize your search.
21. CLI: TimeLogic Decypher
- Templates can be found on the coe01 file system in the folders DECYPHER/templates and DECYPHER/tmpl_newtargs
- Example usage:
- dc_template_rt -template hmm_aa_vs_aa -targ pfam -sig evalue -thresh significance=1e-10 -query query.fasta > results.txt
22. CLI: Database Uploads
- We try to provide up-to-date versions of the common public databases for these servers
- If you need to search another database (called "my" in these examples), you must upload the data to the appropriate server, e.g.:
- BlastMachine: pb formatdb -n my -i db.fasta -p F -o T -t "my database"
- GeneMatcher: btk db load name=my seqtype=dna dst=/fdf/coe05/gm0/0/nt src=db.fasta -nt2aa -nt2codon
- Decypher: dc_new_target_rt -template format_nt_into_nt -targ nt -source db.fasta -desc "my database"
- Remember, disk space isn't infinite! Remove the databases when you no longer need them.
23. CLI: Database Availability
- Sequences: NR, NT, DBEST, SwissProt, Human, Mouse, Arabidopsis
- HMMs: Pfam, TIGRFam, SMART, SuperFamily (Structural Classification), Panther, CATH
- In-house constructed HMM libraries from NCBI Clusters of Orthologous Groups (COGs) and Protein Information Resource (PIR SuperFams)
- Profiles: Prosite
24. Other functions
- Decypher can do multiple sequence alignments, build an HMM, then use the HMM to search for more homologs, then refine the MSA with new hits, etc. Like PSI-BLAST on steroids.
- Paracel Transcript Assembler automatically clusters and assembles EST sequences, and can optionally use the GeneMatcher
- Osprey (oligonucleotide design software) can use Decypher's profile searches to model oligonucleotide thermodynamics
25. Assignment
- Run sample data through the regular channels
- Run the same data through the Paracel and TimeLogic systems using the command line interface.
- Note the difference in speed
- Note the difference in database hit scores