Applied Computational Genomics Course


Provided by: umani

1
High throughput searches at the COE

Using the Decypher and TimeLogic bioinformatics
accelerators

2
Our Goals: Sensitivity & Selectivity
Sensitivity: What proportion of the real hits are
reported? (More sensitive means more real hits)
Selectivity: What proportion of the reported
hits are real? (More selective means fewer false
positives)
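
The two definitions above can be written out as a small calculation. This is only an illustrative sketch: the function names and the sequence identifiers are made up for the example, and it assumes the set of real hits is known in advance (as in a benchmark).

```python
def sensitivity(real_hits, reported_hits):
    """Fraction of the real hits that were reported."""
    real, reported = set(real_hits), set(reported_hits)
    return len(real & reported) / len(real)


def selectivity(real_hits, reported_hits):
    """Fraction of the reported hits that are real."""
    real, reported = set(real_hits), set(reported_hits)
    return len(real & reported) / len(reported)


# Hypothetical benchmark: 4 true homologs, 5 hits reported, 3 of them real.
real = {"seqA", "seqB", "seqC", "seqD"}
reported = {"seqA", "seqB", "seqC", "seqX", "seqY"}

print(sensitivity(real, reported))  # 0.75
print(selectivity(real, reported))  # 0.6
```

Reporting more hits can only raise sensitivity, but it usually lowers selectivity, which is why the two are tracked together.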
3
Why do we need all these searches?

BLAST is optimized for finding protein similarity
and strong DNA homology
Sequences with frameshifts (e.g. ESTs) will not be
aligned properly
BLAST alignments do not span introns nicely
Distant or short DNA homology may be missed
Similarity between a sequence and a family of
sequences may be a lot stronger than to any
particular member of that family
BLAST will not do global alignments (e.g. make
sure an EST matches completely against a genomic
segment)
We need special systems to find these more
difficult matches, requiring much more
computation

4
Physical Machinery

TimeLogic system:
a Sun v880 server with 4 PCI cards containing
reprogrammable hardware (a.k.a. Decypher, coe02)
Paracel system:
a 24-CPU Linux cluster (a.k.a. BlastMachine, coe04)
a unit with 27,000 specialized processors (a.k.a.
GeneMatcher, coe05)

5
What can we speed up?
6
Acceleration: General Principles

Bottlenecks in the computation include:
Number of processing cycles available at one time
(maxing out the processors)
Getting data from disk or memory to the
processing unit (causing processor to idly wait)
Unnecessary processing instructions created by
generically compiled software
Accelerators try to reduce these bottlenecks to
improve total throughput, not always
time-to-completion of a single job.

7
Acceleration Solutions

Parallelization, including one or more of:
All sequences from a batch have their own
independent processors to run on (GeneMatcher &
Decypher)
A single database input stream passes through all
the processors (GeneMatcher & Decypher)
The database searched is broken down into parts,
searched independently, then results are combined
(Paracel BlastMachine)
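
The third strategy (split the database, search the parts independently, combine the results) can be sketched in a few lines. This is a toy illustration, not how the BlastMachine is implemented: the scoring function is a stand-in that counts shared 3-mers, and the function names, sequences, and number of parts are all assumptions made for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def toy_score(query, subject):
    """Stand-in for a real alignment score: number of subject 3-mers found in the query."""
    kmers = {query[i:i + 3] for i in range(len(query) - 2)}
    return sum(1 for i in range(len(subject) - 2) if subject[i:i + 3] in kmers)


def search_part(query, part):
    """Search one database part and return (score, name) pairs."""
    return [(toy_score(query, seq), name) for name, seq in part]


def split_search(query, database, n_parts=3, top=2):
    """Split the database, search each part independently, then merge the best hits."""
    parts = [database[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(lambda part: search_part(query, part), parts))
    all_hits = [hit for part_hits in results for hit in part_hits]
    return heapq.nlargest(top, all_hits)


# Hypothetical four-sequence database.
database = [
    ("s1", "ACGTACGT"),
    ("s2", "TTTTTTTT"),
    ("s3", "ACGTTTTT"),
    ("s4", "GGGGACGT"),
]
print(split_search("ACGTAC", database))
```

Because each part is scored independently, the merge step only has to compare scores, so the answer does not depend on how the database was divided.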

8
Acceleration Solutions

Just like 3D graphics cards use hardware to
accelerate games, hardware can speed up
bioinformatics.
Hardware logic available includes
ASIC, low-cost hardwiring with some flexibility
in the logic using microcode (used by
GeneMatcher)
FPGA, expensive hardware whose logic is
completely reconfigurable (used by Decypher),
meaning it is never obsolete

9
Strengths & Weaknesses: Decypher

Profile & Hidden Markov Model searches are the fastest by far
TeraBLAST algorithm performs seed finding in
hardware, alignment in software
GeneBLAST allows introns in local alignment
User must optimize input batch size to get best
throughput (e.g. Magpie sends HMM 500 at a time)
No job progress reporting
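
The batch-size point above amounts to cutting a large query set into fixed-size chunks before submission. A minimal sketch, assuming the queries are already loaded as a list; the helper name is hypothetical, and 500 is simply the figure quoted for Magpie, not a recommended value for every search type.

```python
def batches(records, size=500):
    """Yield successive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# Hypothetical query set of 1200 sequences.
queries = [f"query_{i}" for i in range(1200)]
print([len(b) for b in batches(queries)])  # [500, 500, 200]
```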

10
Strengths & Weaknesses: Paracel

Smith-Waterman searches are the fastest available
Performs GeneWise searches to match HMMs to
genomic data (intron spanning)
Smart queuing system lumps together jobs that can
run in parallel (no need to optimize input size)
Runs cluster-accelerated software BLAST,
PSI-BLAST, and MEGABLAST with nearly the exact
output expected from an NCBI search

11
Strengths & Weaknesses: Paracel

HMM & Profile searches are moderately
accelerated, but by default produce inverted
results (HMMs vs. sequences, not sequences vs. HMMs)
Sometimes large jobs can have RPC timeout issues
for remote clients
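
If the inverted HMM results have already been parsed, regrouping them per sequence is straightforward. The sketch below assumes a hypothetical in-memory layout (a dict from HMM name to a list of (sequence id, e-value) pairs); it does not parse any actual Paracel output format.

```python
from collections import defaultdict


def invert_hits(hits_by_hmm):
    """Regroup {hmm: [(seq_id, evalue), ...]} as {seq_id: [(hmm, evalue), ...]}."""
    by_seq = defaultdict(list)
    for hmm, hits in hits_by_hmm.items():
        for seq_id, evalue in hits:
            by_seq[seq_id].append((hmm, evalue))
    for seq_hits in by_seq.values():
        seq_hits.sort(key=lambda hit: hit[1])  # best (lowest) e-value first
    return dict(by_seq)


# Hypothetical parsed results, keyed by HMM.
hits_by_hmm = {
    "hmmA": [("seq1", 1e-30), ("seq2", 1e-5)],
    "hmmB": [("seq1", 1e-12)],
}
print(invert_hits(hits_by_hmm))
```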

12
Time comparison

21,000 ESTs against the non-redundant nucleotide
database using TeraTBLASTX on coe02:
3 days 5 hours (13 seconds per EST)
Sample EST vs. the same database, using the NCBI service:
4.5 minutes without queue time (20x slower)
1000 proteins vs. Pfam using Decypher:
1.5 minutes (0.09 seconds per protein)
Sample protein vs. Pfam using cluster_at_WUSTL:
16 seconds without queue time (175x slower)
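
The per-query times and slowdown factors quoted above follow from simple division. The sketch below only reproduces that arithmetic from the figures on the slide; the helper name is made up. The Pfam ratio works out to about 178, which the slide rounds to 175x.

```python
def seconds(days=0, hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + secs


# Accelerated runs: total time divided by number of queries.
est_per_query = seconds(days=3, hours=5) / 21_000
pfam_per_query = seconds(minutes=1.5) / 1_000

# Single-query runs on the comparison services.
est_ncbi = seconds(minutes=4.5)
pfam_cluster = seconds(secs=16)

print(round(est_per_query, 1))                # 13.2 seconds per EST
print(round(est_ncbi / est_per_query))        # 20
print(round(pfam_per_query, 2))               # 0.09 seconds per protein
print(round(pfam_cluster / pfam_per_query))   # 178
```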

13
Other uses of Smith-Waterman

As we learn more about RNA interference, we see
that microRNAs target genes with matches that
BLAST cannot find due to short sequence length
and gaps.
SW and other dynamic programming methods will be
important in identifying sites and minimizing
off-target activity.

Kiriakidou M, et al. Genes & Development
18(10):1165-78.
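
To show why a dynamic programming search can find short, gapped matches, here is a minimal Smith-Waterman scorer. It is a teaching sketch only: it uses a simple match/mismatch score and a linear gap penalty (values chosen arbitrarily for the example) rather than a substitution matrix with affine gaps, and the two sequences are invented.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical 12-nt query and a target that contains it with one inserted base.
query = "UGAGGUAGUAGG"
target = "ACUGAGGUAGAUAGGCA"
print(smith_waterman(query, target))  # 22: 12 matches (24) minus one gap (2)
```

Every cell of the table is filled in, so every pair of substrings is considered; that exhaustiveness is what costs the computation time.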
14
Other Uses of Smith-Waterman

Because only SW checks all substrings of the
query, it is the only search method that can
report a real Z-score (FASTA also provides an
estimate)
The Z-score is the number of standard deviations the
alignment raw score is from the mean of the score
distribution for this query against this particular
database.
Z-scores are more useful than random expectation
(e-value) scores for applications like EST
clustering, because they measure how unique
matches are. Kinases may have highly significant
e-values against each other, but low Z-scores if
there are many kinases in the database.
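
The Z-score definition above is just (raw score - mean) / standard deviation, taken over the scores of this query against the database. The sketch below uses invented score lists to show the effect described for kinases: the same raw score earns a much lower Z-score when many database entries score similarly.

```python
from statistics import mean, pstdev


def z_score(raw_score, background_scores):
    """Standard deviations between raw_score and the mean of the background scores."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    return (raw_score - mu) / sigma


# Hypothetical raw scores of one query against every database sequence.
few_relatives = [10, 12, 11, 9, 13, 10, 11, 12]       # match is nearly unique
many_relatives = [10, 95, 98, 102, 12, 97, 100, 11]   # database full of relatives

print(z_score(100, few_relatives))
print(z_score(100, many_relatives))
```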

15
Okay, they're fast. Now how do I use them?

Web interfaces for small jobs using standard
parameters

http://coe04.ucalgary.ca
http://coe02.ucalgary.ca
Also available through the CBR Web site
16
Command line interface (CLI): why?

All commands are launched from coe01, which
remotely invokes the programs and saves output to
a local file
The command line lets you specify more options
than the Web
These searches can be automated
Clients can be installed on your local machine
too if you do a lot of searches

17
CLI: Paracel BLAST (BlastMachine)

The command line invocation is the same as for NCBI
blastall/megablast/blastpgp, but with pb
prepended, and input must come from a file, not
standard input (keyboard)
pb blastall -p blastp -d nr -i in.fasta -o out.txt
Running pb blastall on its own will give command line usage
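
Since the searches can be automated, one simple approach is to generate one command per query file. The sketch below only builds the argument lists and prints them; it assumes the NCBI blastall flags (-p, -d, -i, -o) carry over unchanged, and the helper name, file names, and output naming scheme are all hypothetical.

```python
from pathlib import Path


def pb_blastall_command(query_file, program="blastp", database="nr"):
    """Build the argument list for one Paracel BLAST run; output goes next to the query."""
    query = Path(query_file)
    output = query.with_suffix(".txt")
    return ["pb", "blastall", "-p", program, "-d", database,
            "-i", str(query), "-o", str(output)]


# Hypothetical query files; the commands are printed, not executed.
for fasta in ["batch1.fasta", "batch2.fasta"]:
    print(" ".join(pb_blastall_command(fasta)))
```

On a machine with the client installed, each list could be passed to `subprocess.run` instead of being printed.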

18
CLI: BioView Toolkit (GeneMatcher)

A command argument hierarchy exists, all starting
with btk
Syntax help is available for each level of the
hierarchy by adding h1-4 to the end of the
command, where 1 is terse, 4 is verbose
btk h1
btk search h1
btk search tswx h2
btk search tswx format=blast2 matrix=blosum62 evalue_threshold=0.01 dbset=all_est_aa output=results.txt query=query.fasta

19
CLI: TimeLogic Decypher

The TimeLogic machine searches are invoked via
a network socket connection API
By default, requesting a search will take a
second and return you to the command line, with
results to be picked up some time later with
another command that fetches the output via a
socket.
To wait for results once a search is launched,
use the versions of the CLI commands that end in
_rt (Real Time). They behave more like standard
search programs

20
CLI: TimeLogic Decypher

Launching Decypher jobs is based on filling in a
search template
Templates for most searches already exist; you
just need to specify a few command line options
to fill in missing template values (such as the
query file name), or to override the template
defaults (e.g. significance cutoff)
The templates consist of plain text files with
keyword/value pairs. The Keyword Reference Sheet
is good to keep nearby to customize your search.
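
The template idea (defaults in a keyword/value text file, with missing or overridden values supplied at launch) can be illustrated as follows. This is not the Decypher template format: the whitespace-separated layout, the '#' comment convention, and every keyword shown are assumptions invented for the sketch.

```python
def parse_template(text):
    """Parse 'KEYWORD value' lines into a dict, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, value = line.partition(" ")
        settings[keyword] = value.strip()
    return settings


def fill_template(defaults, **overrides):
    """Return the template defaults with launch-time values filled in or overridden."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# Hypothetical template text; these are NOT real Decypher keywords.
template_text = """
# illustrative template
SEARCH_TYPE hmm_aa_vs_aa
THRESHOLD 1e-5
"""

job = fill_template(parse_template(template_text),
                    QUERY="query.fasta", THRESHOLD="1e-10")
print(job)
```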

21
CLI: TimeLogic Decypher

Templates can be found on the coe01 file system
in the folders DECYPHER/templates and
DECYPHER/tmpl_newtargs
Example usage:
dc_template_rt -template hmm_aa_vs_aa -targ pfam -sig evalue -thresh significance1e-10 -query query.fasta > results.txt

22
CLI: Database Uploads

We try to provide up-to-date versions of the
common public databases for these servers
If you need to search another database (called
"my" in the examples here), you must upload the
data to the appropriate server, e.g.
BlastMachine: pb formatdb -n my -i db.fasta -p F -o T -t "my database"
GeneMatcher: btk db load name=my seqtype=dna dst=/fdf/coe05/gm0/0/nt src=db.fasta -nt2aa -nt2codon
Decypher: dc_new_target_rt -template format_nt_into_nt -targ nt -source db.fasta -desc "my database"
Remember, disk space isn't infinite! Remove the
databases when you no longer need them.

23
CLI: Database Availability

Sequences: NR, NT, DBEST, SwissProt, Human,
Mouse, Arabidopsis
HMMs: Pfam, TIGRFam, SMART, SuperFamily
(Structural Classification), Panther, CATH
In-house constructed HMM libraries from NCBI
Clusters of Orthologous Groups (COGs) and Protein
Information Resource (PIR SuperFams)
Profiles: Prosite

24
Other functions

Decypher can do multiple sequence alignments,
build an HMM, then use the HMM to search for more
homologs, then refine the MSA with new hits, etc.
Like PSI-BLAST on steroids.
Paracel Transcript Assembler automatically
clusters and assembles EST sequences, and can
optionally use the GeneMatcher
Osprey (oligonucleotide design software) can use
Decypher's profile searches to model
oligonucleotide thermodynamics

25
Assignment

Run sample data through the regular channels
Run the same data through the Paracel and
TimeLogic systems using the command line
interface.
Note the difference in speed
Note the difference in database hit scores
