Title: High performance computing in biology: challenges and perspectives

1
High performance computing in biology: challenges
and perspectives
PNNL-SA-61363

SciDAC 2008
Christopher Oehmen
Pacific Northwest National Laboratory

2
Computational biology: computing at many scales

Biological processes occur simultaneously at many
temporal and spatial scales
Molecules form working units and control systems
Cells encapsulate information and function
Tissues and communities can work in concert: emerging
behaviors linked to survival

3
Genes

Challenge is to characterize cells by their
genetic signals
Genes turned on indicate processes that are
activated
Dynamic, complex signals result from and
contribute to many combined effects

4
Proteins and proteomics

Proteins: the working molecules in cells
Understand what processes a cell is using by
taking a snapshot of molecular activity
Mass spectrometry (MS): a good way to characterize all
proteins at once
The key is effectively identifying proteins or
fragments that give rise to MS peaks.
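The identification step can be illustrated with a minimal sketch, assuming the simplest possible approach: compare each observed mass against the theoretical masses of candidate peptides. This is not the method used by any tool named in this talk. The candidate peptides, the observed values, and the tolerance are hypothetical, and peaks are treated as neutral monoisotopic masses rather than m/z values.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def match_peaks(peaks, candidates, tol=0.01):
    """Map each observed mass to the candidates within tol daltons."""
    masses = {pep: peptide_mass(pep) for pep in candidates}
    return {
        peak: [pep for pep, m in masses.items() if abs(m - peak) <= tol]
        for peak in peaks
    }


# Hypothetical inputs, for illustration only.
candidates = ["GAS", "PVT", "SAG"]
observed = [233.1012, 315.18]
matches = match_peaks(observed, candidates)
for peak, peptides in matches.items():
    print(peak, peptides)
```

In this toy input the first mass matches both GAS and SAG, since peptides with the same composition have the same mass. Ambiguity of this kind is part of why identification is the hard step.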

5
Modeling and simulation

Using theory from chemistry and physics to
predict behavior of biomolecular systems
Provides insight into complex or unmeasurable behaviors
Provides parameters, boundary conditions, etc. for
higher-level simulations
Can help predict protein structure from sequence

6
Molecular interactions and relationships

Graphs used in many different ways in biology
Can we use graph theory to produce meaningful
visual metaphors and analysis techniques?
Biological graphs can be very complex;
simplifying approximations often obscure the
underlying biology

7
Multicell tissues and communities

Understanding emerging behavior requires
integration across many levels
Control systems
Signals
Coordinated Responses
...
Needs computing at terascale, petascale and
beyond

8
HPC for genome analysis

Joint Genome Institute (JGI)
National Center for Biotechnology Information
(NCBI)
The Institute For Genomic Research (TIGR-JCVI)
...

Online tools
Data resources
9
Genome sequencing
Table (layout lost in extraction). Row labels: consortium, institute, single-investigator project; 1995, 2001, today. Column labels: cost/genome, base pairs/day, genomes. Values as extracted: M 100K 1K-1M 100M-1B Few thousands.
10
More genomes, more analysis!
11
But what I really want to do with my genomes is...

Human health
Drug design
Biomarkers
Energy
Engineered systems for renewable energy
Carbon management
Defense
Rapid identification
Forensic analysis

How can we enable users to develop and refine
hypotheses in real time?
12
Conventional interface to HPC
Quality of solution related to theory used,
reliability of input and solution domain,
accuracy and precision of mathematical method.
Calculation driven by theoretical formulation and
observations. Generally the hypothesis being
evaluated maps well to the output of the
calculation.
13
Biology needs a different interface to HPC
DATA MATH
Calculations are often driven by data and a mix of
statistical methods and equations from chemistry,
physics, and other fields. The driving question is
often a high-level hypothesis that is not easily
mapped to the output.
14
Spectrum of HPC coupling
From low degree of coupling to high degree of coupling:

Remote HPC services
Website/web services interface to external,
shared large-scale facilities; may require
Manual data entry
Password/authentication
Limit on priority for large-scale tasks

Local HPC services
Website/web services interface to internal,
dedicated facilities
User has more control over resource allocation

Local HPC integrated workflow
Application-level access to dedicated HPC
User doesn't even have to know HPC is being used

15
Taking advantage of local HPC resources

Web services, websites for launching parallel jobs

Local dedicated cluster
User or applications can integrate output into
workflow
Polygraph and ScalaBLAST both implemented using
this model at PNNL
16
More tightly couple HPC resources to biological
workflows

Case study: multiple whole-genome analysis
workflow

Workflow diagram elements: Genome 1, Genome 2, Genome 3, ..., Genome n; high performance hardware, software; post-processing; visualization
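One reason a multiple whole-genome workflow benefits from parallel hardware is that comparing n genomes against each other yields n(n-1)/2 independent pairwise tasks. The sketch below is an illustration under stated assumptions, not the workflow used at PNNL: the genome names and sequences are made up, the shared 3-mer count is a toy stand-in for a real comparison such as an alignment, and a local thread pool stands in for a dedicated cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare(pair):
    """Toy comparison: number of 3-mers shared by two genomes."""
    (name_a, seq_a), (name_b, seq_b) = pair
    return name_a, name_b, len(kmers(seq_a) & kmers(seq_b))


def all_vs_all(genomes, workers=4):
    """Run every pairwise comparison; each pair is an independent task."""
    pairs = list(combinations(sorted(genomes.items()), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input pairs.
        return list(pool.map(compare, pairs))


# Hypothetical toy genomes, for illustration only.
genomes = {
    "genome1": "ACGTACGT",
    "genome2": "ACGTTTGA",
    "genome3": "TTGACCGT",
}
results = all_vs_all(genomes)
for name_a, name_b, shared in results:
    print(name_a, name_b, shared)
```

Because the pairs are independent, the same structure maps onto a cluster scheduler; the post-processing and visualization stages would then consume the collected results.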
17
More tightly couple HPC resources to biological
workflows

Case study: proteomics workflow

Instrument output
Bioinformatics Resource Manager (BRM)
1. HPC Polygraph
2. PQuad
3. Public tools
18
Challenges for HPC users in biology

Biological computing is different from most HPC
Integer mathematics more important than floating
point arithmetic
Memory or memory-latency bound
Data-driven, data-intensive
We need different kinds of hardware
Mathematical challenges
Biological data is growing exponentially
A 1% false positive rate is unacceptable when you
have 1 billion items
Often want to understand the space of good solutions
instead of a single optimal solution
Scalable applications must continually evolve
with new mathematical theory
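The false-positive point above can be made concrete with a short calculation, reading the slide's figure as a rate of 1 percent (an assumption, since the symbol appears to have been lost in extraction). The function names and the budget of one expected error are hypothetical choices for illustration.

```python
def expected_false_positives(n_items, false_positive_rate):
    """Expected number of false positives if every item is tested once."""
    return n_items * false_positive_rate


def per_item_rate_for_budget(n_items, max_false_positives):
    """Per-item rate that keeps the expected error count within a budget."""
    return max_false_positives / n_items


n_items = 1_000_000_000  # 1 billion items, as on the slide

# Assumed reading of the slide: a 1 percent false positive rate.
print(expected_false_positives(n_items, 0.01))

# Rate needed to expect at most one false positive overall.
print(per_item_rate_for_budget(n_items, 1))
```

At 1 percent, a billion items produce on the order of ten million expected false positives. Keeping the expected count near one requires a per-item rate of about one in a billion.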

19
More challenges...

Policy/permissions
Most HPC systems operate using a multiuser, batch
model
Tightly coupling HPC into biological workflows
means applications will need more immediate
access to compute cycles, NOT batch mode
operation
Local HPC systems not normally available for
anonymous users

20
Summary

The biological sciences are just scratching the surface
of how high performance computing might be used.
HPC solutions in biology will need to accommodate
exponentially growing datasets with evolving
mathematics
Biology will likely continue to provide science
drivers for novel architectures that prioritize
integer mathematics and memory bandwidth/capacity
HPC can maximize its value to biology by more
tightly coupling with analytical pipelines and
workflows, but this will require different access
and user models

21
Acknowledgments

Support provided by the Data Intensive Computing
for Complex Biological Systems project, funded by the
Office of Advanced Scientific Computing Research,
and under the LDRD Program at Pacific Northwest
National Laboratory.
