Title: Mark Silberstein, CS, Technion

1. Superlink-Online: Harnessing the world's computers to hunt for disease-provoking genes
Computational Biology Laboratory / Distributed Systems Laboratory
- Mark Silberstein, CS, Technion
- Dan Geiger, Computational Biology Lab
- Assaf Schuster, Distributed Systems Lab
- Genetics research institutes in Israel, the EU, and the US
2. Purpose of disease gene hunting
- Why search?
  - Detection of diseases before birth
  - Risk assessment and corresponding lifestyle changes
  - Finding the mutant proteins and developing medicine
  - Understanding basic biological functions
- How to search?
  - Find families segregating the disease (linkage analysis), or collect unrelated healthy and affected persons (association analysis or LD mapping)
  - Take a simple blood test from some individuals
  - Analyze the DNA in the lab
  - Compute the most likely location of the disease gene
3. Steps in Gene Hunting
Linkage analysis (10^6-10^7 bp)
4. Recombination During Meiosis
5. Familial Onychodysplasia and dysplasia of distal phalanges (ODP)
[Pedigree figure; affected individuals include III-15, IV-10, IV-7]
6. Family Pedigree
7. Marker Information Added
8. Maximum Likelihood Evaluation
The computational problem: find a value of θ maximizing Pr(data | θ).
LOD score (to quantify how confident we are):
Z(θ) = log10 [ Pr(data | θ) / Pr(data | θ = ½) ]
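As a sketch, the LOD score can be computed from pointwise likelihoods. The `likelihood` function below is a caller-supplied stand-in for the actual Bayesian-network pedigree likelihood that Superlink evaluates:

```python
import math

def lod_score(likelihood, theta):
    """LOD score Z(theta) = log10( Pr(data|theta) / Pr(data|theta=1/2) )."""
    return math.log10(likelihood(theta) / likelihood(0.5))

def argmax_lod(likelihood, grid=None):
    """Grid-search the recombination fraction theta in (0, 1/2].

    The grid starts at 0.01 to avoid log of zero for toy likelihoods
    that vanish at theta = 0.
    """
    grid = grid or [i / 100 for i in range(1, 51)]
    return max(grid, key=lambda t: lod_score(likelihood, t))
```

For example, a toy binomial likelihood for 2 recombinants out of 10 informative meioses, `lambda t: (1 - t) ** 8 * t ** 2`, is maximized at θ = 0.2.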
9. Results of Multipoint Analysis
10. The Bayesian network model
[Figure: Bayesian network spanning loci 1-4 (locus 2 is the disease locus), with selector variables S, paternal/maternal inheritance variables L (e.g. L_i3f, L_i3m), genotype variables X_i, and phenotype variables Y]
This model depicts the qualitative relations between the variables. We also need to specify the joint distribution over these variables.
11. The Computational Task
- Computing Pr(data | θ) for a specific value of θ
- Exponential time and space in:
  - number of variables
    - five per person
    - markers
    - gene loci
  - number of values per variable
    - alleles
    - non-typed persons
  - table dimensionality
  - cycles in the pedigree
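To make the blow-up concrete: the size of a probability table is the product of its variables' domain sizes, so it grows exponentially in the number of variables. The numbers below are illustrative, not from an actual Superlink pedigree:

```python
def table_size(domain_sizes):
    """Number of entries in a probability table over the given
    variables: the product of their domain sizes."""
    size = 1
    for d in domain_sizes:
        size *= d
    return size

# 5 variables per person per locus; a marker with 4 alleles gives
# each allele variable 4 values (an untyped person leaves all open).
print(table_size([4] * 5))       # one person, one locus: 1024 entries
print(table_size([4] * 5 * 10))  # ten persons jointly: 4**50 ≈ 1.3e30 entries
```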
12. Task length distribution
- Task length is unknown upon submission
  - From seconds to millennia
- Computing the task length in advance is NP-hard
- So we estimate task length as we go
Bins: <3 minutes, <2 hours, <2 days, <2 weeks, <3 months, >3 months
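The "estimate as we go" idea can be sketched as an escalation loop: try the task under each bin's time limit in turn, and promote it to the next (more heavily parallelized) bin when the limit expires. The bin limits and the `run_task` interface here are a hypothetical simplification of the real scheduler:

```python
# Time limits (seconds) matching the bins on this slide.
BINS = [3 * 60, 2 * 3600, 2 * 86400, 14 * 86400, 90 * 86400]

def run_with_escalation(run_task, bins=BINS):
    """Try the task under increasing time limits; each timeout
    promotes it to a bin served by a larger resource pool.
    `run_task(limit)` returns a result, or None on timeout."""
    for limit in bins:
        result = run_task(limit)
        if result is not None:
            return result, limit
    raise RuntimeError("task exceeds the largest bin; needs manual handling")
```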
13. Divisible Tasks through Variable Conditioning
Conditioning splits one long task into many independent jobs, at the cost of non-trivial parallelization overhead.
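A minimal sketch of the conditioning idea, assuming the likelihood decomposes as a sum over assignments to the conditioned variables (toy functions, not the Superlink kernel):

```python
from itertools import product

def conditioned_jobs(partial_likelihood, cond_domains):
    """Split Pr(data) = sum_v Pr(data, v), where v ranges over joint
    assignments to the conditioning variables, into one independent
    job per assignment; each job can run on a different machine."""
    return [lambda v=v: partial_likelihood(v) for v in product(*cond_domains)]

def combine(results):
    """Merging the jobs' outputs is just a sum."""
    return sum(results)
```

With two binary conditioning variables this yields four independent jobs whose results sum back to the full likelihood.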
14. Free resource pools (grids)
- Weak or no quality of service
  - Random failures of execution machines
  - Preemption by higher-priority tasks
  - Hardware bugs may lead to incorrect results
  - Potentially unbounded execution and queue waiting times
- Dynamic, abrupt changes in resource availability
- High network delays (communication over WAN)
- Multiple concurrent tasks
15. Terminology
- Basic unit of execution: a batch job
  - Non-interactive mode: enqueue, wait, execute, return
  - Self-contained execution sandbox
- A linkage analysis request: a task
  - A bag of (millions of) jobs
  - Turnaround time is important
16. Requirements
- The system must be geneticist-friendly
  - Interactive experience: low response time for short tasks, prompt user feedback
  - Simple, secure, reliable, stable, overload-resistant; concurrent tasks, multiple users...
- Fast computation of previously infeasible long tasks via parallel execution
- Harness all available resources: grids, clouds, clusters
  - And use them efficiently!
17. Grids or Clouds?
[Plot: remaining jobs in queue vs. time, showing a long tail due to failures]
- Small tasks are severely slow on grids
  - A task that takes 5 minutes on a 10-node dedicated cluster may take several hours on a grid
Should we move scientific loads to the cloud? YES!
18. Grids or Clouds?
- Consider 3.2×10^6 jobs, 40 minutes each
- It took 21 days on 6,000-8,000 CPUs
- It would cost about $10K on Amazon's EC2
Should we move scientific loads to the cloud? NO!
19. Clouds or Grids? Clouds and Grids!
- Opportunistic resources (grids): throughput computing
- Dedicated resources (clouds, clusters): burst computing
20. Cheap and Expensive Resources
- Task sensitivity to QoS differs across execution stages (see the remaining-jobs-in-queue curve)
- Use cheap, unreliable resources for the bulk of the work:
  - Grids
  - Community grids
  - Non-dedicated clusters
- Use expensive, reliable resources for the tail:
  - Dedicated clusters
  - Clouds
- Dynamically detect entering "tail mode"
  - Switch to expensive resources (gracefully)
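One simple way to realize the tail-mode switch; the threshold rule below is an assumed heuristic for illustration, not necessarily the one Superlink-online uses:

```python
def in_tail_mode(remaining_jobs, total_jobs, throughput, tail_fraction=0.05):
    """Heuristic tail detector: the bag is in its tail when few jobs
    remain but, at the current cheap-pool throughput (jobs/hour),
    finishing them would still take over an hour."""
    few_left = remaining_jobs <= tail_fraction * total_jobs
    slow_finish = throughput == 0 or remaining_jobs / throughput > 1.0
    return few_left and slow_finish

def route(job, remaining_jobs, total_jobs, throughput):
    """Send tail jobs to the reliable (expensive) pool, the rest to grids."""
    if in_tail_mode(remaining_jobs, total_jobs, throughput):
        return "dedicated/cloud", job
    return "grid", job
```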
21. Glue pools together via an overlay
[Diagram: submitters route jobs to multiple grids (e.g. a submitter to Grid 2) through the overlay]
Issues: granularity, load balancing, firewalls, failed resources, scheduler scalability
22. Practical considerations
- Overlay scalability and firewall penetration
  - The server may not initiate a connection to the agent
- Compatibility with community grids
  - The server is based on BOINC
  - Agents are upgraded BOINC clients
- Elimination of failed resources from scheduling
  - Performance statistics are analyzed
- Resource allocation depends on the task state
  - Dynamic policy updates via the Condor ClassAd mechanism
23. (No transcript)
24. Superlink-online 1.0: http://bioinfo.cs.technion.ac.il
25. Task Submission
26. Superlink-online statistics
- 1720 CPU years for 18,000 tasks during 2006-2008 (and counting)
- 37 citations (several mutations found)
  - Examples: ichthyosis, "uncomplicated" hereditary spastic paraplegia (1-9 people per 100,000)
- Over 250 (and counting) Israeli and international users
  - Soroka Hospital (Be'er Sheva), Galil Ma'aravi Hospital (Nahariya), Rabin Hospital (Petah Tikva), Rambam Hospital (Haifa), Beney Tzion Hospital (Haifa), Sha'arey Tzedek Hospital (Jerusalem), Hadassa Hospital (Jerusalem), Afula Hospital; NIH; universities and research centers in the US, France, Germany, UK, Italy, Austria, Spain, Taiwan, Australia, and others...
- Task example: 250 days on a single computer, 7 hours on 300-700 computers
- Short tasks: a few seconds, even during severe overload
27. Using our system in Israeli hospitals
- Rabin Hospital, by Motti Shochat's group
  - New locus for mental retardation
  - Infantile bilateral striatal necrosis
- Soroka Hospital, by Ohad Birk's group
  - Lethal congenital contractural syndrome
  - Congenital cataract
- Rambam Hospital, by Eli Shprecher's group
  - Congenital recessive ichthyosis
  - CEDNIK syndrome
- Galil Ma'aravi Hospital, by Tzipi Falik's group
  - Familial onychodysplasia and dysplasia
  - Familial juvenile hypertrophy
28. Utilizing Community Computing
3.4 TFLOPS, 3,000 users, from 75 countries
29. Superlink-online V2 (beta) deployment
- Submission server
- Resource pools: EGEE-II BIOMED VO, OSG GLOW VO, UW-Madison Condor pool, dedicated cluster, Superlink@Campus, Superlink@Technion
- 12,000 hosts operational during the last month
30. 3.1 million jobs in 21 days
- With only 60 dedicated CPUs
31. Conclusions
- Our system integrates clusters, grids, clouds, community grids, etc.
- Geneticist-friendly
- Minimizes use of expensive resources while providing QoS for tasks
- Generic mechanism for scheduling policy
  - Can dynamically reroute jobs from one pool to another according to a given optimization function (budget, energy, etc.)
32. Why GPUs?
- Memory bandwidth: 88 GB/s peak (56 GB/s observed) on an NVIDIA GTX 8800 (~$550)
- Memory bandwidth: 21 GB/s peak on a 3.0 GHz Intel Core 2 Quad (~$1100)
- Growth: CPUs ~1.4× annually, GPUs ~1.7× annually
33. NVIDIA Compute Unified Device Architecture (CUDA)
[Diagram: GPU memory hierarchy, contrasting on-chip memory (1-cycle latency, TB/s aggregate bandwidth) with off-chip global memory]
34. Key ideas (joint work with John Owens, UC Davis)
- Software-managed cache
  - We implement the cache replacement policy in software
- Maximization of data reuse
  - Better compute-to-memory-access ratio
- A simple model for performance bounds
  - Yes, we are (optimal)
- Use special function units (SFUs) for hardware-assisted execution
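When the kernel's access pattern is known in advance, a software-managed cache can use offline-optimal (Belady) replacement: evict the resident item whose next use is farthest in the future. The Python simulation below only illustrates the policy; the actual implementation manages GPU shared memory inside the CUDA kernel:

```python
def belady_misses(accesses, cache_size):
    """Count misses for a cache using Belady's offline-optimal
    replacement over a fully known access sequence."""
    cache, misses = set(), 0
    for i, item in enumerate(accesses):
        if item in cache:
            continue  # hit: data is already resident
        misses += 1
        if len(cache) >= cache_size:
            # Evict the item whose next access is farthest away
            # (infinity if it is never accessed again).
            def next_use(x):
                try:
                    return accesses.index(x, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses
```

Because the replacement decisions use the full future access sequence, no online policy (LRU included) can incur fewer misses on the same trace, which is the sense in which the slide's "optimal" claim can hold for a software-managed cache.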
35. Results summary
- Experiment setup:
  - CPU: single core, Intel Core 2 2.4 GHz, 4 MB L2
  - GPU: NVIDIA G80 (GTX 8800), 750 MB GDDR4, 128 SPs, 16 KB shared memory / 512 threads
  - Only kernel runtime included (no memory transfers, no CPU setup time)
[Chart: speedups for hardware vs. software-managed caching]
- With SFUs, expf is only about 6× slower than a simple operation on the GPU, but about 200× slower on the CPU
36. Acknowledgments
- Superlink-online team
  - Alumni: Anna Tzemach, Julia Stolin, Nikolay Dovgolevsky, Maayan Fishelson, Hadar Grubman, Ophir Etzion
  - Current: Artyom Sharov, Oren Shtark
- Prof. Miron Livny (Condor pool, UW-Madison; OSG)
- EGEE BIOMED VO and OSG GLOW VO
- Microsoft TCI program, NIH grant, SciDAC Institute for Ultrascale Visualization
If your grid is underutilized, let us know!
Visit us at http://bioinfo.cs.technion.ac.il/superlink-online
Superlink@TECHNION project home page: http://cbl-boinc-server2.cs.technion.ac.il/superlinkattechnion
37. Visit us at http://bioinfo.cs.technion.ac.il/superlink-online