Applied Computational Genomics Course

Provided by: umani

1
High throughput searches at the COE
  • Using the Paracel and TimeLogic bioinformatics
    accelerators

2
Our Goals: Sensitivity and Selectivity
Sensitivity: What proportion of the real hits are
reported? (More sensitive means more real hits.)
Selectivity: What proportion of the reported
hits are real? (More selective means fewer false
positives.)
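
The two definitions above can be written as simple ratios. This is a minimal sketch; the hit counts are hypothetical numbers chosen only for illustration and do not come from any real search.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of the real hits that were reported."""
    return true_positives / (true_positives + false_negatives)


def selectivity(true_positives, false_positives):
    """Proportion of the reported hits that are real."""
    return true_positives / (true_positives + false_positives)


# Hypothetical search: 100 real homologs exist in the database.
# The search reports 90 hits, of which 80 are real.
real_hits_reported = 80
real_hits_missed = 20
false_hits_reported = 10

print(sensitivity(real_hits_reported, real_hits_missed))     # 0.8
print(selectivity(real_hits_reported, false_hits_reported))  # 80 of 90 reported hits are real
```

Loosening a score cutoff usually raises sensitivity and lowers selectivity, which is why both numbers are tracked together.
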
3
Why do we need all these searches?
  • BLAST is optimized for finding protein similarity
    and strong DNA homology
  • Sequences with frameshifts (e.g. ESTs) will not be
    aligned properly
  • BLAST alignments do not span introns nicely
  • Distant or short DNA homology may be missed
  • Similarity between a sequence and a family of
    sequences may be a lot stronger than to any
    particular member of that family
  • BLAST will not do global alignments (e.g. make
    sure an EST matches completely against a genomic
    segment)
  • We need special systems to find these more
    difficult matches, requiring much more
    computation

4
Physical Machinery
  • TimeLogic system:
      • a Sun v880 server with 4 PCI cards containing
        reprogrammable hardware (a.k.a. Decypher, coe02)
  • Paracel system:
      • a 24-CPU Linux cluster (a.k.a. BlastMachine,
        coe04)
      • a unit with 27,000 specialized processors
        (a.k.a. GeneMatcher, coe05)

5
What can we speed up?
6
Acceleration: General Principles
  • Bottlenecks in the computation include:
      • Number of processing cycles available at one
        time (maxing out the processors)
      • Getting data from disk or memory to the
        processing unit (causing the processor to wait
        idly)
      • Unnecessary processing instructions created by
        generically compiled software
  • Accelerators try to reduce these bottlenecks to
    improve total throughput, not necessarily the
    time-to-completion of a single job.

7
Acceleration Solutions
  • Parallelization, including one or more of:
      • All sequences from a batch have their own
        independent processors to run on (GeneMatcher
        and Decypher)
      • A single database input stream passes through
        all the processors (GeneMatcher and Decypher)
      • The database searched is broken down into
        parts, searched independently, then results are
        combined (Paracel BlastMachine)
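
The third strategy (split the database, search the parts independently, combine the results) can be sketched in a few lines. This is only an illustration of the structure: the shared 3-mer count is a toy stand-in for a real alignment score, the sequences are made up, and Python threads are used just to show the fan-out and merge steps, not to demonstrate real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_partition(query, partition):
    """Score every (name, sequence) entry in one database partition."""
    query_kmers = kmers(query)
    return [(name, len(query_kmers & kmers(seq))) for name, seq in partition]


def partitioned_search(query, database, n_parts=3, top=2):
    # Break the database into n_parts roughly equal slices.
    parts = [database[i::n_parts] for i in range(n_parts)]
    # Search each slice independently.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_results = list(pool.map(lambda p: search_partition(query, p), parts))
    # Combine the partial hit lists and rank by score, then by name.
    merged = [hit for part in partial_results for hit in part]
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:top]


database = [
    ("seq1", "ACGTACGTAC"),
    ("seq2", "TTTTTTTTTT"),
    ("seq3", "ACGTTTGACC"),
    ("seq4", "GGGGCCCCAA"),
]
print(partitioned_search("ACGTAC", database))  # [('seq1', 4), ('seq3', 2)]
```

In a real partitioned search, statistics that depend on database size (such as e-values) have to be computed against the whole database size, not the size of each part, before results are merged.
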

8
Acceleration Solutions
  • Just as 3D graphics cards use hardware to
    accelerate games, hardware can speed up
    bioinformatics.
  • Hardware logic available includes:
      • ASIC: low-cost hardwiring with some
        flexibility in the logic using microcode (used
        by GeneMatcher)
      • FPGA: expensive hardware whose logic is
        completely reconfigurable (used by Decypher),
        so new algorithms can be loaded without
        replacing the hardware

9
Strengths & Weaknesses: Decypher
  • Profile and Hidden Markov Model searches are the
    fastest by far
  • TeraBLAST algorithm performs seed finding in
    hardware, alignment in software
  • GeneBLAST allows introns in local alignment
  • User must optimize input batch size to get best
    throughput (e.g. Magpie sends HMM searches 500 at
    a time)
  • No job progress reporting
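
Splitting a query file into fixed-size batches before submission might look like the sketch below. The batch size of 500 follows the Magpie example on the slide; the FASTA reader and the generated query records are illustrative assumptions, and the best batch size for a given search still has to be found by experiment.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records, batch_size=500):
    """Group records into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Build 1200 made-up query records and split them into batches.
fasta_lines = []
for i in range(1200):
    fasta_lines += [f">query{i}", "MKVLAA"]

sizes = [len(batch) for batch in batches(read_fasta(fasta_lines))]
print(sizes)  # [500, 500, 200]
```

Each batch would then be written to its own file and submitted as one job.
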

10
Strengths & Weaknesses: Paracel
  • Smith-Waterman searches are the fastest available
  • Performs GeneWise searches to match HMMs to
    genomic data (intron spanning)
  • Smart queuing system groups together jobs that can
    run in parallel (no need to optimize input size)
  • Runs cluster-accelerated software BLAST,
    PSI-BLAST, and MEGABLAST with nearly the exact
    output expected from an NCBI search

11
Strengths & Weaknesses: Paracel
  • HMM profile searches are only moderately
    accelerated, and by default produce inverted
    results (HMMs vs. sequences, not sequences vs.
    HMMs)
  • Large jobs can sometimes hit RPC timeouts for
    remote clients

12
Time comparison
  • 21,000 ESTs against the non-redundant nucleotide
    database using TeraTBLASTX on coe02:
      • 3 days 5 hours (13 seconds per EST)
  • Sample EST vs. the same database, using the NCBI
    service:
      • 4.5 minutes without queue time (20x slower)
  • 1000 proteins vs. Pfam using Decypher:
      • 1.5 minutes (0.09 seconds per protein)
  • Sample protein vs. Pfam using cluster_at_WUSTL:
      • 16 seconds without queue time (about 175x
        slower)
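
The per-sequence figures and slowdown factors above can be re-derived from the totals on the slide. The short calculation below uses only the slide's numbers; the variable names are ours.

```python
# 21,000 ESTs in 3 days 5 hours on coe02
est_count = 21_000
decypher_total_s = (3 * 24 + 5) * 3600
decypher_per_est_s = decypher_total_s / est_count
print(round(decypher_per_est_s, 1))                   # 13.2 seconds per EST

# One EST through the NCBI service: 4.5 minutes
ncbi_per_est_s = 4.5 * 60
print(round(ncbi_per_est_s / decypher_per_est_s, 1))  # 20.5 times slower

# 1000 proteins vs. Pfam in 1.5 minutes on Decypher
pfam_per_protein_s = (1.5 * 60) / 1000
print(pfam_per_protein_s)                             # 0.09 seconds per protein

# One protein on the WUSTL cluster: 16 seconds
print(round(16 / pfam_per_protein_s))                 # 178 times slower
```

The slide's "20x" and "175x" are rounded versions of these ratios.
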

13
Other uses of Smith-Waterman
  • As we learn more about RNA interference, we see
    that microRNAs target genes with matches that
    BLAST cannot find due to short sequence length
    and gaps.
  • SW and other dynamic programming methods will be
    important in identifying target sites and
    minimizing off-target activity.

Kiriakidou M, et al. Genes and Development
18(10):1165-78.
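
For readers who have not seen the algorithm, a minimal Smith-Waterman scoring routine is sketched here. It is a plain software illustration with a linear gap penalty; the match, mismatch, and gap values are arbitrary assumptions, and it says nothing about how the accelerators implement the search.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8: four matches
print(smith_waterman_score("ACGT", "ACT"))   # 4
print(smith_waterman_score("AAAA", "TTTT"))  # 0: no local similarity
```

Because every cell of the matrix is filled in, short and gapped matches are scored exactly rather than depending on a seed word being found first.
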
14
Other Uses of Smith-Waterman
  • Because only SW checks all substrings of the
    query, it is the only search method that can
    report a real Z-score (FastA also provides an
    estimate).
  • The Z-score is the number of standard deviations
    the alignment raw score lies from the mean of the
    score distribution for this query against this
    particular database.
  • Z-scores are more useful than random expectation
    (e-value) scores for applications like EST
    clustering, because they measure how unique
    matches are. Kinases may have highly significant
    e-values against each other, but low Z-scores if
    there are many kinases in the database.
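
The Z-score definition above amounts to one line of arithmetic. In this sketch the score lists are invented, and the population standard deviation is an assumption on our part; a real implementation may estimate the distribution differently.

```python
from statistics import mean, pstdev


def z_score(raw_score, database_scores):
    """Standard deviations between raw_score and the database score mean."""
    mu = mean(database_scores)
    sigma = pstdev(database_scores)  # population standard deviation (assumed)
    return (raw_score - mu) / sigma


# Worked example: mean is 5 and standard deviation is 2
print(z_score(10, [2, 4, 4, 4, 5, 5, 7, 9]))  # 2.5

# The same raw score of 100 against two hypothetical databases
z_unique = z_score(100, [10, 12, 8, 11, 9])        # few similar sequences
z_crowded = z_score(100, [90, 95, 100, 105, 110])  # many similar sequences
print(z_unique > z_crowded)  # True
```

The second example mirrors the kinase case: a strong raw score stops being remarkable when most of the database scores just as well.
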

15
Okay, they're fast. Now how do I use them?
  • Web interfaces for small jobs using standard
    parameters

http://coe04.ucalgary.ca
http://coe02.ucalgary.ca
Also available through the CBR Web site
16
Command line interface (CLI): why?
  • All commands are launched from coe01, which
    remotely invokes the programs and saves output to
    a local file
  • The command line lets you specify more options
    than the Web
  • These searches can be automated
  • Clients can be installed on your local machine
    too if you do a lot of searches

17
CLI: Paracel BLAST (BlastMachine)
  • The same command line invocation as NCBI
    blastall/megablast/blastpgp, but with pb
    prepended, and input must come from a file, not
    standard input (keyboard)
  • pb blastall -p blastp -d nr -i in.fasta -o
    out.txt
  • Running pb blastall on its own will give command
    line usage
18
CLI: BioView Toolkit (GeneMatcher)
  • A command argument hierarchy exists, all starting
    with btk
  • Syntax help is available for each level of the
    hierarchy by adding h1 through h4 to the end of
    the command, where 1 is terse and 4 is verbose
  • btk h1
  • btk search h1
  • btk search tswx h2
  • btk search tswx format=blast2 matrix=blosum62
    evalue_threshold=0.01 dbset=all_est_aa
    output=results.txt query=query.fasta

19
CLI: TimeLogic Decypher
  • Searches on the TimeLogic machine are invoked via
    a network socket connection API
  • By default, requesting a search takes about a
    second and returns you to the command line; the
    results are picked up some time later with
    another command that fetches the output via a
    socket.
  • To wait for results once a search is launched,
    use the versions of the CLI commands that end in
    _rt (Real Time); they behave more like standard
    search programs

20
CLI: TimeLogic Decypher
  • Launching Decypher jobs is based on filling in a
    search template
  • Templates for most searches already exist; you
    just need to specify a few command line options
    to fill in missing template values (such as the
    query file name), or to override the template
    defaults (e.g. significance cutoff)
  • The templates consist of plain text files with
    keyword/value pairs. The Keyword Reference Sheet
    is good to keep nearby when customizing your
    search.
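
The idea of a template whose blanks are filled and whose defaults are overridden from the command line can be sketched as follows. The file layout here (one "keyword value" pair per line, "#" for comments) and the keyword names are hypothetical and are not the actual Decypher template syntax; consult the Keyword Reference Sheet for the real format.

```python
def parse_template(text):
    """Read 'keyword value' lines into a dict; blank values mean 'missing'."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, value = line.partition(" ")
        settings[keyword] = value.strip()
    return settings


def fill_template(settings, overrides):
    """Apply overrides, then insist that every keyword has a value."""
    merged = dict(settings)
    merged.update(overrides)
    missing = [key for key, value in merged.items() if value == ""]
    if missing:
        raise ValueError(f"template values still missing: {missing}")
    return merged


# Hypothetical template text, not a real Decypher file
template_text = """\
# HMM search template (illustrative only)
target pfam
significance evalue
threshold 10
query
"""

settings = fill_template(
    parse_template(template_text),
    {"query": "query.fasta", "threshold": "1e-10"},
)
print(settings)
```

Here "query" is a missing value supplied by the user, while "threshold" is a template default that the user overrides.
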

21
CLI: TimeLogic Decypher
  • Templates can be found on the coe01 file system
    in the folders DECYPHER/templates and
    DECYPHER/tmpl_newtargs
  • Example usage:
  • dc_template_rt -template hmm_aa_vs_aa
    -targ pfam -sig evalue -thresh
    significance=1e-10 -query query.fasta >
    results.txt

22
CLI: Database Uploads
  • We try to provide up-to-date versions of the
    common public databases for these servers
  • If you need to search another database (called
    "my" in these examples), you must upload the data
    to the appropriate server, e.g.:
  • BlastMachine: pb formatdb -n my -i db.fasta
    -p F -o T -t "my database"
  • GeneMatcher: btk db load name=my seqtype=dna
    dst=/fdf/coe05/gm0/0/nt src=db.fasta -nt2aa
    -nt2codon
  • Decypher: dc_new_target_rt -template
    format_nt_into_nt -targ nt -source db.fasta -desc
    "my database"
  • Remember, disk space isn't infinite! Remove the
    databases when you no longer need them.

23
CLI: Database Availability
  • Sequences: NR, NT, DBEST, SwissProt, Human,
    Mouse, Arabidopsis
  • HMMs: Pfam, TIGRFam, SMART, SuperFamily
    (Structural Classification), Panther, CATH
  • In-house constructed HMM libraries from NCBI
    Clusters of Orthologous Groups (COGs) and Protein
    Information Resource (PIR SuperFams)
  • Profiles: Prosite

24
Other functions
  • Decypher can do multiple sequence alignments,
    build an HMM, then use the HMM to search for more
    homologs, then refine the MSA with new hits, etc.
    Like PSI-BLAST on steroids.
  • Paracel Transcript Assembler automatically
    clusters and assembles EST sequences, and can
    optionally use the GeneMatcher
  • Osprey (oligonucleotide design software) can use
    Decypher's profile searches to model
    oligonucleotide thermodynamics

25
Assignment
  • Run sample data through the regular channels
  • Run the same data through the Paracel and
    TimeLogic systems using the command line
    interface.
  • Note the difference in speed
  • Note the difference in database hit scores
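
For the speed comparison, a small timing helper is enough. This sketch times any Python callable; the stand-in workload is a placeholder, and wrapping a real search would mean passing a function that launches the relevant command line client (for example via subprocess.run), which is left as an assumption here.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return elapsed, result


def stand_in_search(n):
    """Placeholder workload standing in for a real search."""
    return sum(i * i for i in range(n))


elapsed, result = time_call(stand_in_search, 100_000)
print(f"search took {elapsed:.3f} s")
```

Timing the same input through each channel, and recording the top hit scores alongside, gives the two comparisons the assignment asks for.
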