GTL- Modeling

Slides: 40

Provided by: roger167

Transcript and Presenter's Notes

1
GTL - Modeling & Simulation

Submitted slides

2
1) How DOE-type big-iron computing could in principle help biology (I am leading with things that I believe are sensible rather than starting from within the DOE's organismic and scientific constraints)
A) Molecular dynamics type stuff, structure prediction, etc.
B) Docking small molecules and proteins
- Small molecules to proteins
- Guided docking of proteins, from starting compounds to easily synthesized derivative pharmacophores, using genetic and biochemical QSAR information
- One grand challenge here: use computational methods to identify uM inhibitors of all the enzymes encoded by a microbial genome. You get new drug leads, it helps build up national rapid response capability, and it helps educate computer-type people in DOE more about the single major industrial application of biology.
- Another: use the structural insight coupled with evolution to change the specificity and catalytic properties of enzymes that make stuff (hydrogen, useful polymers) or break it down (cleanup).
C) Vaccines
- Use sequence analysis and structure prediction to pick all the good B cell epitopes from a microbial genome.
- Use sequence analysis and structural information about human Class 1 and Class 2 to pick all the good T cell epitopes from a microbial genome.
D) Any simulation work, particularly whizzing-molecule simulations, should we would-be simulationists succeed.
Roger Brent
3
2) Barriers to above (hardware, software, algorithms): Yes
3) How would you measure success?
a) Predicted structure of the majority of proteins in a newly sequenced bacterium or virus within one week of sequencing, with predictions validated by experiment (2006)
b) Validated lead drug compounds against new targets in a virus or bacterium one year after work starts (2005)
c) Predicted vaccine one week after a new microorganism is sequenced, with B and T cell epitopes going into validation steps (2006)
d) Simulation would have to work, give nontrivial insight, and be deemed to do so by a majority of academic biologists, NIH, HHMI, and NAS (2010)
4) Resources
a) Many questions seem to bear on simulation. Almost moot until simulation works.
b) The structure/drug/vaccine ideas would require an increase in DOE internal competency. A 20-year commitment would be completely appropriate. NIH, NSF and industry fund some efforts along these lines now. A serious effort on the structure/drug/vaccine front would require circa 1/2-1B/year, would probably need to be spent at a new, urban center rather than a national lab, and most of it wouldn't be computation. Could be complementary with NIH.
5) Why undertake the work?
a) Better security against biological attack on people, animals, plants, materiel, our ecology
b) This capability is part of stewardship of the planetary ecology, with DOE handling the microbial ecology
6) A general consideration that would help MSI interact with DOE, and DOE interact with the current research environment outside of the national labs: all DOE software should be open source under LGPL or equivalent, and all biological and chemical reagents freely licensed using standard academic treaty-type MTAs. JGI delays data release for a year, which is not the NIH or MRC/Wellcome standard.
Roger Brent
4
From the DOE's report on the GTL mathematics workshop: "DOE's current responsibility for remediating 1.7 trillion gallons of contaminated groundwater and 40 million cubic meters of contaminated soil demonstrates the significance and scale of the need for a new computational biology program."
Most academic biomedical biologists won't buy this; they will consider it a non sequitur.
Roger Brent
5
Larry Lok, The Molecular Sciences Institute.

Data management infrastructure. Data analysis, knowledge infrastructure, data mining.
Flexibility: facilitate development of intelligent, domain-specific interfaces. Monod.
TIA on recent publications. Help supplant publication?
Protein complexes are a database challenge.
Adapting in-memory techniques.
Inference tools for ... distributed biological data.
Quantitative simulation largely unsupported by current big piles of data.
Prediction of protein-protein interaction kinetics via MD.
Behavioral data, experimental and from simulation.
Inference of reductionist models. Reaction networks and their parameters.
Qualitative modeling styles: QDE, dynamic Bayesian networks, etc.
Data analysis, modeling, visualization facilities.
Batch uniP/SMP jobs always popular. SMP support in tera-scale facilities?
Toward device independence?
Reaction network generation: discrete-event-style difficulties.
Reaction network simulation: familiar territory for Nat. Labs?
ODE-like approaches. Connectivity clustering to reduce bandwidth.
Spatial approaches: PDE, particle, etc. Spatial distribution. Visualization demands both for setup (e.g. modeling membrane or E.R.) and analysis.

6
Prediction of Protein Structure
Very low homology T0173 Mycothiol deacetylase

Goals
- Better understanding of evolutionary
relationships
- Characterization of molecular function
- Guiding further experiments
Major challenges
Comparative modeling (homology modeling)
Reliability of sequence alignments
Identification and modeling of structural change
Refinement!
Fold recognition
Sequence alignments (potentially combinatorial)
Whole genome applications (model quality
assessment)
De novo structure prediction
Still mostly an unsolved problem!
Importance of methods development cannot be overestimated

K. Fidelis
7
Modeling and Simulation Issues

Data flow in a heterogeneous environment
Avoid bottlenecks, archiving, distributing
Build in performance measures
Complex modeling capability
Universality of storage/compression details
Capacity may be more important than capability
Parallel paradigms
Decomposable in space/time, macro/micro?
Security
Manageability
Scalability
Systems approach to design, user involvement
Keep it simple, focused, useful, don't reinvent
Choices have costs

Stephen Elbert, IBM
8
Inference and modeling of Microbial Regulatory
and Signaling Pathways

Reverse engineering problem: build pathway models that are most consistent with genomic, proteomic, and metabolic data and general biological knowledge
- Data mining is an essential first step in solving the reverse engineering problem: a great amount of information is hidden in the often noisy, incomplete, and sometimes conflicting data
- Computational prediction/modeling and data collection through experiments should be one integrated process: computation should be a key driver for rational design of experiments
Computational challenges
- It is a highly challenging computational problem to rigorously reverse engineer, or solve for, a network model (e.g., Boolean network, Bayesian network, Petri net) that best matches known data/knowledge, given a list of candidate genes possibly involved in a regulatory/signaling network, their predicted functions, their predicted interactions and causality relationships, and their predicted regulatory elements
- Network validation problem: how to design a set of experiments that could provide the maximal amount of information, in the most economical manner, for validation, rejection and revision of network models

Y.Xu
9
phosphorus assimilation pathway
Y.Xu
10
Petascale Distributed Data Analysis
Important issues for mining massive biological data sets
Distributed: Existing methods work on a single centralized dataset. Data transfer is prohibitive.
Scalable: Popular methods do not scale in terms of time and storage.
High-dimensional: Need new methods that scale up with the number of dimensions.
Dynamic: Most methods work with static data. Changes lead to complete re-computation.
(Figure labels: raw data, genomes, regulatory elements, protein structure, pathways, models)
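
One way to picture the "distributed" and "dynamic" issues on this slide: some statistics can be kept as small mergeable summaries, so each site summarizes its own data, only the summaries are transferred, and new data updates a summary without re-computation. The following is a minimal sketch under stated assumptions. The `Summary` class, the `summarize` helper, and the two site datasets are hypothetical, and the standard pairwise mean/variance update is swapped in purely for illustration; it is not a method described in the slides.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Mergeable running summary: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, x: float) -> None:
        # Incremental (streaming) update: no pass over earlier data is needed.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        # Combine two summaries computed at different sites.
        if self.n == 0:
            return Summary(other.n, other.mean, other.m2)
        if other.n == 0:
            return Summary(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance of everything summarized so far.
        return self.m2 / self.n if self.n else 0.0


def summarize(values) -> Summary:
    s = Summary()
    for v in values:
        s.add(v)
    return s


# Hypothetical data held at two separate sites; only summaries are exchanged.
site_a = summarize([1.0, 2.0, 3.0])
site_b = summarize([4.0, 5.0, 6.0, 7.0])
merged = site_a.merge(site_b)
print(merged.n, merged.mean, merged.variance)
```

Only three numbers per site cross the network here, whatever the size of the raw data. Many data-mining methods (clustering, for example) have no such simple summary, which is presumably the gap the slide is pointing at.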
11
Computational Feasibility on a Teraflop Computer

Biological Data Growth Trend
Genome Assembly: 300 TB/genome
Protein Structure Prediction: petabyte
Simulations of Bionetworks: 1000s of PBs

Algorithmic Complexity
Calculate means: O(n)
Calculate FFT: O(n log(n))
Clustering algorithms: O(n^2)

Algorithm Complexity
Data size, n   n^3                         n^2      n log(n)     n
1MB            11 days                     1 sec.   10^-5 sec.   10^-6 sec.
100MB          31 millennia                3 hrs    10^-3 sec.   10^-4 sec.
10GB           1011x age of the Universe   3 yrs.   0.1 sec.     10^-2 sec.

Bottom line: Bigger computers aren't going to solve our problems. We need breakthroughs in modeling and simulation algorithms.
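
The scaling argument on this slide can be checked with simple arithmetic: divide an operation count by the machine's rate. The sketch below assumes a sustained 10^12 operations per second (one teraflop), one operation per unit of the complexity expression, n equal to the number of bytes, and a base-2 logarithm. These are assumptions chosen for illustration, and `runtime_seconds` is a hypothetical helper name.

```python
import math

FLOPS = 1e12  # assumed sustained rate of a teraflop computer

# Operation counts for each complexity class (assumed: one operation per unit).
COMPLEXITIES = {
    "n": lambda n: n,
    "n log n": lambda n: n * math.log2(n),
    "n^2": lambda n: n ** 2,
    "n^3": lambda n: n ** 3,
}

# Data sizes from the slide, read as a number of bytes.
SIZES = {"1MB": 1e6, "100MB": 1e8, "10GB": 1e10}


def runtime_seconds(n: float, complexity: str, flops: float = FLOPS) -> float:
    """Estimated runtime in seconds for input size n on a machine of the given rate."""
    return COMPLEXITIES[complexity](n) / flops


for label, n in SIZES.items():
    row = ", ".join(
        f"{name}: {runtime_seconds(n, name):.3g} s" for name in COMPLEXITIES
    )
    print(f"{label:>6}  {row}")
```

Under these assumptions an O(n^2) algorithm on 1 MB takes about one second and an O(n^3) algorithm on the same input takes about 10^6 seconds, or roughly 11.6 days. Each hundredfold increase in data size multiplies the O(n^3) time by a factor of 10^6.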
12
100 TeraFLOP Computers Enable First Principles MD Simulations of Enzyme Mechanisms
We are starting to study enzyme mechanisms
We have been using FPMD to simulate the chemical reactions
FPMD: 500 atoms, 10^-11 sec
(Figure labels: G1, C2, G3, His162, Mg2, Glu14, Asp12, Asp167)

Constrained FPMD simulations of drug with 70
water molecules (231 atoms total)
80,000 basis functions
Computational requirements
ASCI Blue: 1 ps on 36 nodes takes 5 days
TC2K: 1 ps on 27 nodes takes 3.1 days

Long-term GTL applications

Design of O2 resistant hydrogenases
Re-engineering substrate specificity of
degradative enzymes
Modify properties of DNA-binding regulatory
enzymes

M. Colvin
13
Integrative Cellular models (for E.E. Selkov,
MCS, Argonne)

Integrative Cellular Models
Imperative for whole cell simulation
Expose modeling/mathematical issues
Conceptual, computational, algorithmic,
infrastructural
Integration with Bioinformatics
Uniting genomics, proteomics, metabolomics
Verification methodology
Experimental closure

Cyanobacterial Dynamics Models
Integrates genetic, MRT and regulatory data
Integrates bioinformatics data (EMP, WIT)
Practical importance
Carbon sequestration, bio-H2 via optimal
engineering
Circadian clock model: fundamental problem in biology with a lot of applications
Experimental verification
Proteomics, metabolomics
Experimental facility (CHM, Purdue)

Modeling Challenges
Bridging the scale gap (spatial temporal)
Multi-scale/multi-model approach
Integrating micro and macro description
New bioinformatics data model DB
Parameter determination, model validation
Systemic approach

Computation and Algorithms
Large-scale parallel simulation
Scalable stiff/differential-algebraic integrators
Multi-objective constrained optimization
Combinatorial and continuous
Integration with databases
Multi-parameter bifurcation and sensitivity analysis

16
Cyanobacterial Dynamics Models

Integrates genetic, MRT and regulatory data
Integrates bioinformatics data (EMP, WIT)
Practical importance
O2 production, bio-H2 via optimal engineering
Synchronous population cultivation
Experimental verification
Proteomics, metabolomics
Experimental facility (CHM, Purdue)

See PDF File

19
Computation Biology Infrastructure for Complex
Microbial Communities From Genomes to Molecular
Machines
Daniel Van Der Lelie
20
Molecular interaction networks are revolutionizing the study of biological pathways. Yeast now has over 20,000 measured protein-protein, protein-DNA, and protein-small molecule interactions. Similar networks will soon be available for a variety of bacteria, worm, fly, mouse, and human.
There is a pressing need for computational models and tools able to integrate molecular interaction networks with molecular states on a global scale.
Pathway mapping: Identify and verify pathways and complexes of interactions (circuit modules) that correlate with the observed changes in molecular state.
Pathway alignment: Identify conserved regions between the networks of pathogens and hosts, commensal species, or a single species under different environmental conditions, tissues, stages of development, etc.
Ideker and Lauffenburger, Trends in Biotech June
2003
21
The Scientific Demand for Modeling & Simulation
High Throughput Data
Cellular Complexity
Increasing R&D efficiency and productivity
22
Developing, implementing, and delivering
model-driven research methodologies

Demonstrating how microbial models can drive
biological discovery
Basic scientific understanding of energy-related
biological systems (improve efficiency of
discovery)
Bio-based economy, biomass-derived products
Bio-fuels
bioremediation
Tight integration with experimental approaches,
guide experimental design
Illustrate how models provide the biological
context for the integration of genomics,
proteomics, metabolomics (focus on biologically
driven integration as opposed to IT driven
integration)
Demonstrated case studies with real biological
impact! (Let the biology drive the math)
Provide QA/QC of biological content in models to
support Iterative Model Development
Distribution of Systems Biology/Modeling
Platforms and Methodologies (visible impact)
Scalable modeling framework for examining
cellular pathways on up to heterogeneous
microbial populations (focused on metabolism)
Expectation management with the biological
community (what data do I need?)

Metabolic biochemistry at the systems-level
23
Protein and Gene Networks Inference
1. New Science: What are the underlying principles (static and dynamic) of biological networks?
Dynamical attractors
Scale-free static networks
Pragmatic problem: search space size (100 nodes)
Random networks: 10^3010
Scale-free networks: 10^55
Non-chaotic networks: 10^8
Networks with similar dynamics: ?
Jean-Loup Faulon, GTL Modeling Simulation
Workshop, July 23, 2003
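
To make "dynamical attractors" concrete, here is a minimal sketch of a synchronous Boolean network and an exhaustive search for its attractors. The three-node network and its update rules are invented for illustration and do not come from the slides; `step` and `find_attractors` are hypothetical names. Exhaustive enumeration visits all 2^N states, so it is only feasible for very small networks.

```python
from itertools import product


def step(state):
    """Synchronous update of a hypothetical 3-node network (A, B, C)."""
    a, b, c = state
    # Assumed rules: A copies B, B copies A, C is (A AND B).
    return (b, a, a & b)


def find_attractors(update, n_nodes):
    """Follow every start state until it repeats; collect the cycles reached."""
    attractors = set()
    for start in product((0, 1), repeat=n_nodes):
        seen = {}  # state -> step index at which it was first visited
        state, i = start, 0
        while state not in seen:
            seen[state] = i
            state = update(state)
            i += 1
        # States visited from the first occurrence of `state` onward form the cycle.
        first = seen[state]
        cycle = frozenset(s for s, idx in seen.items() if idx >= first)
        attractors.add(cycle)
    return attractors


attractors = find_attractors(step, 3)
for cycle in sorted(attractors, key=lambda c: (len(c), sorted(c))):
    print(sorted(cycle))
```

For this toy network the eight states fall into two fixed points, (0, 0, 0) and (1, 1, 1), and one cycle of length two between (0, 1, 0) and (1, 0, 0). Inference works in the opposite direction, from observed dynamics back to candidate networks, which is where the search-space size becomes the pragmatic problem.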
24
Protein and Gene Networks Inference
2. Barriers
- Reaction rates (experimental)
- Static and dynamic network characterization tools (algo & math)
- Data format standard (software & hardware): 2-hybrid systems, phage display, MS, gene microarray, protein chips, bioinformatics
- Inference algorithm with sensitivity analysis (algo)
3. Success
- Biological question answered
- Inference prediction drives experiment
Number of data points required to infer unique parsimonious Boolean networks from microarray data, and number of clusters with similar dynamics vs. number of networks
4. Resources
- Database (hardware & software)
- Manpower
Jean-Loup Faulon, GTL Modeling Simulation
Workshop, July 23, 2003
25
(No Transcript)
26
(Diagram labels: Now Gen II; Science; Technology; Pilots; Facilities; Computing; Workshops; 1B Need Gen V)
27
Complex Systems Interactions Active
Management Patience Focus on End-to-End
Performance On Critical Targets
(Diagram labels: Now Gen II; Science; Technology; Pilots; Facilities; Computing; Workshops; 1B Need Gen V)
29
Quantitative and Computational Cell Biology: the Virtual Cell Perspective
Ion I. Moraru
National Resource for Cell Analysis and Modeling
http://www.nrcam.uchc.edu
30
QCB/CCB

Scope and Goals Tools for
Analyzing and modeling cellular function /
subcellular to tissue scale
Reverse engineering and re-engineering eukaryotes
Issues Power and Sophistication !
Spatial resolution / complex geometries
Temporal resolution / stiffness
Lack of data / parameter space searching
Too much data / 5D imaging, -omics
Stochastic behavior / particles, fluctuations
Encapsulation and scalability / model reuse,
supermodels
Simulations Grand Challenges ?
Complete organelle function (mitochondria, ER)
4D pattern development (embryogenesis, tissue
repair)
Cellular programming (apoptosis, cell cycle)
Structural control (mechanics, locomotion)
Neuronal signal integration (Purkinje cells)

31
Performance Progress

Neuroblastoma Model - simulation of 20 s real
time -
32
Near-term Potential Practical Wins for Modeling
and Simulation of Microbes

Bioinformatics
Predicting Domain-Ligand Interaction using
Signature Kernel Support Vector Machines
Natural Language Processing
Gene Finding, Phylogeny
Hardware Operating Systems Research
What does the architecture of the computer look
like that can solve these problems?

33
Near-term Potential Practical Wins for Modeling
and Simulation of Microbes

Computational Molecular Biophysics
40ns Simulation of Rhodopsin Membrane Protein
System for Insight into the determination of the
light-adapted structure

Complex Systems
Network Modeling

Complex Systems
Massively Parallel Finite Elements and Meshing

Computational Technologies
Parallel Algorithm Development, Optimization,
Data Mining and Management and Visualization,
Frameworks User Interfaces

34
PGF Raw Data Organization
Project: series of libraries that define a genome
Library: series of plates
Plate: 384 clones
Clone: 2 lanes
1 lane: 1 MB each, distributed into 4 files
- 1 FASTA file: 1 KB
- 1 scf file: 50 KB
- 1 abd file: 250 KB
- 1 rsd/ab1 file: 650 KB
In May-03, PGF ran 2.5 million successful lanes
- 2.5 TB/month
- 10 million files
- (0.75 TB/month (9 TB/year) non-trace files)
This does not include any assembly, database or metadata!
Michael Banda
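
The storage figures on this slide follow from the per-lane file sizes. The sketch below redoes the arithmetic, assuming decimal units (1 KB = 1000 bytes) and assuming that the "non-trace" figure excludes the 650 KB rsd/ab1 file, which is the reading that reproduces 0.75 TB/month. Variable names are hypothetical.

```python
# File sizes per lane in KB, as listed on the slide.
FILE_SIZES_KB = {"FASTA": 1, "scf": 50, "abd": 250, "rsd/ab1": 650}
LANES_PER_MONTH = 2_500_000  # successful lanes in May-03, from the slide

KB = 1_000                   # assumption: decimal units
TB = 1_000_000_000_000

kb_per_lane = sum(FILE_SIZES_KB.values())               # 951 KB, roughly 1 MB
files_per_month = LANES_PER_MONTH * len(FILE_SIZES_KB)  # 4 files per lane
total_tb_per_month = LANES_PER_MONTH * kb_per_lane * KB / TB

# Assumption: "non-trace" means everything except the rsd/ab1 file.
non_trace_kb = kb_per_lane - FILE_SIZES_KB["rsd/ab1"]
non_trace_tb_per_month = LANES_PER_MONTH * non_trace_kb * KB / TB
non_trace_tb_per_year = 12 * non_trace_tb_per_month

print(f"files per month:      {files_per_month:,}")
print(f"total TB per month:   {total_tb_per_month:.2f}")
print(f"non-trace TB / month: {non_trace_tb_per_month:.2f}")
print(f"non-trace TB / year:  {non_trace_tb_per_year:.1f}")
```

With exactly 951 KB per lane the monthly total comes to about 2.4 TB; the slide's 2.5 TB corresponds to rounding each lane up to 1 MB.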
35
Community Access to PGF Data

Access to these data is in demand by scientific
fields that were not anticipated by the Human
Genome Project
Microbiologists
Environmental Scientists BioGeologists
Evolutionary Scientists
GtL projects
Not everyone will want the same kind of files.

The computational sophistication of the user
community is uneven, at best.

Michael Banda
36
Data Organization Requirements
1. Metadata for the files being collected
-- schema definition development
-- the database system to support the metadata
-- query interfaces to query the metadata
-- possible rapid prototyping using the object-based tools
2. Data entry tools for the metadata
-- procedure to enforce metadata entry
-- checks on the correctness of the metadata entered
None of this was contemplated in the Human Genome Project, but it is essential for JGI and GTL data management
Michael Banda
37

Wide agreement on general need for new
theoretical and software infrastructure for
systems biology, beyond molecular biology,
bioinformatics, -omics.
Potential differences in details and emphasis.
Multiscale and large-scale stochastic simulation
must simultaneously deal with extreme stiffness
(Petzold), stochastics (Gillespie),
robustness/fragility, and complexity.
Simulation alone is not scalable to larger network problems, because answering biologically meaningful questions for complex, uncertain systems requires an exponentially large number of simulations.
There are fundamental (i.e. necessary) laws
governing the organization of biological
networks, most remaining to be discovered.
Without exploiting them, network complexity will
eventually become overwhelming.
Dramatic progress in all areas, but lacking
accessible exposition.
There have been extraordinary developments in the mathematics of complex networks in the last 2-3 years, with promising applications to engineering and biological networks. This work builds on operator theory, control theory, dynamical systems, computational complexity, and semidefinite programming.

John Doyle
38
Systems Simulations Needs

Most Core Simulation Technologies Available
Already existent simulators for
ODEs, SDEs, PDEs, discrete particle,
circuit-based, geometrically changing models
Models are not yet large enough for simulation to
be severely limited by hardware
Hybrid simulation systems still in VERY early
development
Mixed deterministic and stochastic
Mixed discrete and continuous
Mixed differential and algebraic (this is the
most sophisticated)
Mixed scale simulations systems also still in
early development
Combining structural and kinetic modeling e.g.
Formal methods for converting one model type to
another still lacking in many areas
For example, conversion of the Chemical Master Equation to the Langevin Equation is still an art
ALL of these are limited by the lack of good biophysical models of most cellular processes.
Model Deduction and Parameter Estimation
New algorithms beginning to rely on statistical
graph models stochastic optimization,
computationally intensive.
Collaborative data filtering for data constraints
on parameters large matrix manipulation,
optimization
Model Analysis
Model Reduction e.g. automated time-scale
separation, extensions of balanced truncation
Model abstraction e.g. conversion of physical
models to circuit-like descriptions

Adam Arkin
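
As a concrete instance of the stochastic simulators mentioned on this slide, a system governed by the Chemical Master Equation can be sampled exactly with Gillespie's stochastic simulation algorithm. The following is a minimal sketch for a birth-death process (production at a constant rate, first-order degradation). The model, the rate constants, and the function name `gillespie_birth_death` are assumptions made for illustration and are not taken from the slides.

```python
import random


def gillespie_birth_death(k_prod, k_deg, x0, t_end, seed=0):
    """Exact stochastic simulation of: 0 -> X (rate k_prod), X -> 0 (rate k_deg * x)."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_prod = k_prod       # propensity of the production reaction
        a_deg = k_deg * x     # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total == 0:
            break             # no reaction can fire
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


# Hypothetical parameters; the deterministic steady state is k_prod / k_deg = 10.
times, counts = gillespie_birth_death(k_prod=10.0, k_deg=1.0, x0=0, t_end=5.0, seed=1)
print(f"{len(times) - 1} reaction events, final count {counts[-1]}")
```

Each event costs one random waiting time and one random choice, so the run time grows with the total number of reaction events. That cost is one reason approximations such as the Langevin equation are used when molecule counts are large.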
39
Computation Biology Infrastructure for the
Analysis of Complex Microbial Communities From
Genomes to Molecular Machines

Displacement
- Commodity chemicals
- Fuels
- Metabolic pathways and reactions

CO2
CO2
CO2
Nitrogenase a MoFe protein (in blue and purple
at the center) and two copies of the Fe protein
dimer bound on either end (shown in green).
Newer carbon species