Title: High performance computing in biology: challenges and perspectives
1 High performance computing in biology: challenges
and perspectives
PNNL-SA-61363
- SciDAC 2008
- Christopher Oehmen
- Pacific Northwest National Laboratory
2 Computational Biology: computing at many scales
- Biological processes occur simultaneously at many
temporal and spatial scales
- Molecules form working units and control systems
- Cells encapsulate information and function
- Tissues, communities can work in concert: emerging
behaviors linked to survival
3 Genes
- Challenge is to characterize cells by their
genetic signals
- Genes turned on indicate processes that are
activated
- Dynamic, complex signals result from and
contribute to many combined effects
4 Proteins and proteomics
- Proteins: working molecules in cells
- Understand what processes a cell is using by
taking a snapshot of molecular activity
- Mass spectrometry (MS): a good way to characterize all
proteins at once
- The key is effectively identifying proteins or
fragments that give rise to MS peaks.
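The identification step named in the last bullet can be illustrated with a toy sketch: compute the monoisotopic mass of each candidate peptide and keep those that fall within a tolerance of an observed mass. The candidate sequences, the 10 ppm tolerance, and the function names are assumptions made for illustration; this is not the method used by Polygraph or any other tool named in this talk, and real pipelines also score fragment spectra.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in seq) + WATER


def match_peak(observed_mass, candidates, tol_ppm=10.0):
    """Return candidates whose mass is within tol_ppm of the observed mass."""
    hits = []
    for seq in candidates:
        mass = peptide_mass(seq)
        if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
            hits.append(seq)
    return hits


# Hypothetical example: the "observed" mass is simulated from a known peptide.
observed = peptide_mass("AL")
hits = match_peak(observed, ["AL", "AI", "GG", "AK"])
print(hits)
```

Both "AL" and "AI" match here because leucine and isoleucine have the same mass, which hints at why mass alone is rarely enough and why identification at proteome scale becomes a large search problem.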
5 Modeling and simulation
- Using theory from chemistry and physics to
predict behavior of biomolecular systems
- Provides insight into complex or unmeasurable behaviors
- Supplies parameters, boundary conditions, etc. for
higher-level simulations
- Can help predict protein structure from sequence
6 Molecular interactions and relationships
- Graphs used in many different ways in biology
- Can we use graph theory to produce meaningful
visual metaphors and analysis techniques?
- Biological graphs can be very complex;
simplifying approximations often obscure the
underlying biology
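As a minimal sketch of what graph-based analysis can look like, the example below builds an undirected interaction graph from an edge list and computes two basic quantities: node degree and connected components. The node names P1 to P5 and the edge list are hypothetical, and the choice of these two measures is an assumption made for illustration.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def connected_components(graph):
    """Group nodes into connected components with breadth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical interaction data: two separate groups of molecules.
edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
graph = build_graph(edges)
degree = {node: len(neighbors) for node, neighbors in graph.items()}
components = connected_components(graph)
print(degree)
print(components)
```

Even on this toy input, collapsing each component to a single node would hide that P2 is the hub of its group, a small instance of a simplifying approximation obscuring structure.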
7 Multicell tissues and communities
- Understanding emerging behavior requires
integration across many levels
- Control systems
- Signals
- Coordinated Responses
- ...
- Needs computing at terascale, petascale and
beyond
8 HPC for genome analysis
- Joint Genome Institute (JGI)
- National Center for Biotechnology Information
(NCBI)
- The Institute For Genomic Research (TIGR-JCVI)
- ...
Online tools
Data resources
9 Genome sequencing
- Timeline labels: 1995, 2001, today
- Project scale labels: consortium, institute,
single-investigator project
- Cost/genome: M, 100K
- Base pairs/day: 1K-1M, 100M-1B
- Genomes: few, thousands
10 More genomes, more analysis!
11 But what I really want to do with my genomes is...
- Human health
- Drug design
- Biomarkers
- Energy
- Engineered systems for renewable energy
- Carbon management
- Defense
- Rapid identification
- Forensic analysis
How can we enable users to develop, refine
hypotheses in real time?
12 Conventional interface to HPC
Quality of solution related to theory used,
reliability of input and solution domain,
accuracy and precision of mathematical method.
Calculation driven by theoretical formulation and
observations. Generally the hypothesis being
evaluated maps well to the output of the
calculation.
13 Biology needs a different interface to HPC
DATA MATH
Calculations often driven by data and a mix of
statistical methods and chemistry, physics, other
equations. Driving question is often a high-level
hypothesis that is not easily mapped to the
output.
14 Spectrum of HPC coupling
High degree of coupling
Low degree of coupling
Remote HPC services
Local HPC integrated workflow
Local HPC services
- Website/web services interface to external,
shared large-scale facilities may require:
- Manual data entry
- Password/authentication
- Limit on priority for large-scale tasks
- Website/web services interface to internal,
dedicated facilities
- User has more control over resource allocation
- Application-level access to dedicated HPC
- User doesn't even have to know HPC is being used
15 Taking advantage of local HPC resources
- Web services, websites for launching parallel jobs
Local dedicated cluster
User or applications can integrate output into
workflow
Polygraph and ScalaBLAST are both implemented using
this model at PNNL
16 More tightly couple HPC resources to biological
workflows
- Case study: multiple whole-genome analysis
workflow
[Workflow figure labels: Genome 1, Genome 2, Genome 3, Genome n; High performance hardware, software; Post-processing; Visualization]
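The shape of such a multi-genome workflow can be sketched as a fan-out over genome pairs followed by a post-processing step. In the sketch below, `ThreadPoolExecutor` is only a local stand-in for dispatch to a dedicated cluster, the k-mer overlap score in `compare` is a placeholder chosen for brevity (it is not the comparison performed by ScalaBLAST), and the genome names and sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare(pair):
    """Placeholder comparison: Jaccard similarity of the two k-mer sets."""
    (name_a, seq_a), (name_b, seq_b) = pair
    a, b = kmers(seq_a), kmers(seq_b)
    union = a | b
    score = len(a & b) / len(union) if union else 0.0
    return name_a, name_b, score


def run_workflow(genomes, workers=4):
    """Fan out all pairwise comparisons, then rank results (post-processing)."""
    pairs = list(combinations(sorted(genomes.items()), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(compare, pairs))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical toy inputs standing in for Genome 1 .. Genome n.
genomes = {
    "genome_1": "ACGTACGT",
    "genome_2": "ACGTACGA",
    "genome_3": "TTTTTTTT",
}
ranked = run_workflow(genomes)
for name_a, name_b, score in ranked:
    print(name_a, name_b, round(score, 2))
```

The number of pairs grows quadratically with the number of genomes, which is one reason this kind of workflow benefits from being coupled directly to parallel hardware rather than run by hand.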
17 More tightly couple HPC resources to biological
workflows
- Case study: proteomics workflow
3. Public tools
Instrument output
Bioinformatics Resource Manager (BRM)
2. PQuad
1. HPC Polygraph
18 Challenges for HPC users in biology
- Biological computing is different from most HPC
- Integer mathematics more important than
floating-point arithmetic
- Memory or memory-latency bound
- Data-driven, data-intensive
- We need different kinds of hardware
- Mathematical challenges
- Biological data is growing exponentially
- 1% false positive rate is unacceptable when you
have 1 billion items
- Often want to understand space of good solutions
instead of 1 optimal solution
- Scalable applications must continually evolve
with new mathematical theory
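The false positive bullet above is simple arithmetic, worked here as a short sketch. The function names and the budget of 100 spurious hits are assumptions chosen for illustration, and the calculation treats every item as a true negative to give an upper bound.

```python
def expected_false_positives(n_items, fp_rate):
    """Expected number of false positives if every item is a true negative."""
    return n_items * fp_rate


def rate_for_budget(n_items, max_false_positives):
    """Largest false positive rate that keeps the expected count within budget."""
    return max_false_positives / n_items


n_items = 1_000_000_000  # 1 billion items
at_one_percent = expected_false_positives(n_items, 0.01)
needed_rate = rate_for_budget(n_items, 100)  # hypothetical budget of 100 hits

print(f"{at_one_percent:,.0f} expected false positives at a 1% rate")
print(f"rate needed for at most 100 false positives: {needed_rate:.0e}")
```

At a 1% rate, a billion items yield about 10 million false positives, so holding the count to a reviewable number requires an error rate several orders of magnitude lower.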
19 More challenges...
- Policy/permissions
- Most HPC systems operate using multiuser, batch
model
- Tightly coupling HPC into biological workflows
means applications will need more immediate
access to compute cycles, NOT batch mode
operation
- Local HPC systems not normally available for
anonymous users
20 Summary
- Biological sciences are just scratching the surface
of how high performance computing might be used.
- HPC solutions in biology will need to accommodate
exponentially growing datasets with evolving
mathematics
- Biology will likely continue to provide science
drivers for novel architectures that prioritize
integer mathematics and memory bandwidth/capacity
- HPC can maximize its value to biology by more
tightly coupling with analytical pipelines and
workflows, but this will require different access
and user models
21 Acknowledgments
- Support provided by the Data Intensive Computing
For Complex Biological Systems funded by the
Office of Advanced Scientific Computing Research,
and under the LDRD Program at Pacific Northwest
National Laboratory.