Title: In silico biology: computational path toward holistic understanding of living cells
1In silico biology computational path toward
holistic understanding of living cells
Andrey A. Gorin Computer Science and
Mathematics Oak Ridge National Laboratory agor_at_or
nl.gov
2Motivation Predictive Biology
- Developments in Biological Sciences
- Experimental From Reduction to Systems Science
- Computational From Validation to Prediction
- Development in Technologies
- High Throughput Experimentation
- High Performance Computing
- Uniqueness of Biology
- First Principles Approach is Impossible/Impractica
l - Enormous Multitude of Scales
- Descriptive Models from Diverse Data
- Large Uncertainties in Data
3Scientific Drivers
- Bio-remediation
- Environmental restoration using microbial
processes requires study of - Microbial attachment to mineral environments
- Uptake of contaminant ions by microbes and metal
reduction at the microbial membrane - Conversion of toxic chemicals by microbial
systems
- Bio-energy
- Development of integrated experimental an
computational approaches for - Feedstock optimization with the goal of better
cellulose deconstruction by special bacterial
systems - Understanding of microbial communities in single
batch processes (harsh environments) - Enzyme or regulatory circuit design to increase
desirable output
4Our Research Directions
- Mass-spectrometry based proteomics protein
identification and quantification in complex
biological samples - Structural models of protein complexes docking
from known components, fundamental principles of
molecular recognition, prediction of protein
complexes in novel genomes - Network models reconstruction from protein
interactions and other sources, simple simulation
to demonstrate predictive capabilities -
5Exploring Protein Dimension of Bio Universe
Entire proteome is analyzed in a few hours
1 out of 105-1011 must be selected as the correct
peptide
Mass spectrometry process
6De novo Platform Probability Profile Method
We made several advancements in the understanding
of the mass spec mathematics. Taken together
they lead to conceptually novel platform.
Peak Assignment
m1 m2 m3 m4
510 VDDLSSLT 305
7Results Capabilities in Mass Spectrometry
Advances in the mathematical understanding and
dramatic acceleration of fundamental operations
lead to principally new capabilities
- Output gains, and especially in highly confident
identifications. Our method gives several times
more of highly confident identifications - Capability to detect unexpected phenomena in the
samples. We have found novel biological phenomena
in the legacy data and uncovered mistakes in the
data sets regarded as benchmarks.
Deamidations (6 spectra) IHPFAQTQSLVYPFPGPIPN
IHPFAQTQSLVYPFPGPIPD Incorrect peptides (7
spectra) VIPAADLSQQISTAGTEASGTGNMK -gt
VIPAADLSEQISTAGTEASGTGNMK Disulphide bond (1
spectra) AAANFFSASCVPCADQSSFPK
De novo methods can improve even manually
verified benchmark data sets obtained by the
existing technology.
8Model-dependent Science Questions
Many fundamental questions in systems biology are
hampered by the lack of reliable predictive
models.
Network Models Reliant
Structural Models Reliant
- What biochemical processes in a microbe are
related to its traits (hydrogen or ethanol
producer, ethanol resistant)? - How does a bacteria degrade lignin or cellulose?
- What are the mechanisms behind the conversion of
toxic waste to nontoxic substances by bacteria?
- What are biochemical or regulatory functions of
the proteins that are shown to be important for
hydrogen production? - What are the hot spots of cellulases that could
impair their binding? - What are the components of hydrogen producing
protein assemblies?
9Predictive Model Building
10Protein Complexes in GenomicsGTL
GTL is focused on protein interactions that make
life work
Express Bait Protein
Is the interaction real or an artifact? What is
the structure of the protein complex? What is
its function? What is its dynamic
mechanism? Can we answer these questions at
scale?
Exogenous / Endogenous
Pulldown
Mass Spectrometry Analysis
Putative Interacting Proteins at High Throughput
Xray Diffraction
11Modeling Protein Structures and Complexes
Combinatorial and optimization techniques are
applied for two areas development of knowledge
based potentials and analysis of ultra large
structural sets.
Discovery of protein complexes
?
Geometry and bioinfo libraries
Shared memory indices
Protein folding
Parallel implemenations
Ligand binding
Graph algorithms
12Computational Algorithms in Structure Modeling
Multitude of combinatorial optimization problems
with different data access patterns.
Example Ab Initio Prediction of Protein 3-d
Structure
13Results Protein Docking
- Full implementation for Bayesian potential
energy functions for protein docking - The size of the docking benchmarking set
(1,200) is unprecedented in the field. High
quality results were obtained for over 70 of all
tested complexes (native in top 5), indicating
that the method is very efficient. - Successfully modeled protein complexes using the
structures from other organisms - Docking calculations are scalable to 1000s
processor due to surface patch approach for
parallelization
14FutureProtein Docking
- Develop further Protein Interface Server (PINS)
database (pins.ornl.gov). - Develop theory and computational implementation
for docking potentials taking into account
orientation of the contacting residues
(Bayesian). Petascale implementations for docking
platforms. - Improve mathematical methods for prediction
validation. - Analyze and annotate PINS interfaces related to
the metabolic functions involving
carbon-processing pathways in cellulose-degrading
bacteria. Construct predictions for the organisms
important for Bioenergy Centers in cases when
full or partial genome sequence are available.
15Predictive Model Building
16LDRD D07-014 Modeling Cellular Mechanisms for
Efficient Bioethanol Production through Petascale
Analysis of Biological Networks
Andrey Gorin, Nabeela Ahmad, Andrew Bordner,
Robert Day, Jessie Gu, Guruprasad Kora, Chongle
Pan, Byung-Hoon Park, Nagiza Samatova, Edward
Uberbacher, Cray. Inc
Oak Ridge National Laboratory, FY2007-FY2008
17Future Graph Algorithms for Networks
- Analysis of graphs reflecting physical
interactions in structures - Very wide set of problems, but with a uniform set
of translation rules - Enormously large graphs
- Directly connected to modeling petascale
applications - Developing graph representations for real
cellular subsystems. - Nodes genes, proteins, DNA elements,
metabolites. - Edges translation, positive regulation,
production of metabolite - Flexible representations
- Integration of many tools to annotate 5 regions
- Promotor alignment
- Models to predict transcription patterns based on
promotor models
18Future Mining Networks for Bioenergy
- Switch grass paralogs/homologs has to be
identified in collections of gt250,000 partial
gene data (EST) from rice, maize, sorghum genomes - It is expected that 6000 Populus lines will be
passed through cell wall phenotyping pipeline
(expression arrays, proteomics data, etc) - Huge data sets are already (e.g. Obauashi et al.
(2007) 1,388 expression arrays)
(From Bioenergy Center proposal)
19Future Taking It to Petascale
- Port to NLCF platforms
- Scale Maximal Clique Enumeration from 100s to
1000s processors - Introduce similar implementations for other codes
in pGraph and BioGraphE
- Optimize performance on NLCF architectures
- Deliver efficient MPI-based multi-core
implementations - Minimize thread locking/unlocking overheads
- Exploit data locality in job redistribution
strategies - Minimize I/O overheads via buffering, data
compression, hierarchical parallel I/O. -
20Potential Areas of Joint Work
- Mathematics
- Statistics of pattern recognition problems
- Combinatorial optimization methods beyond dynamic
programming - Bayesian system with strong feature correlations
- Computer science
- Graph algorithms working for VERY large graphs
104-105 nodes and across many graphs (103) in
parallel - Flexible software systems for representation
transcription regulation networks in the
bacterial cells (4 genes) - Petascale implementation strategies MPI
implementations, load balancing, etc -
- Biological Applications
- Novel applications in mass-spectrometry (e.g.
gene finding in new genomes, special gene
expression in stem cells, mass-spec of
cross-linked systems) - Expression array analysis and metabolic pathway
construction for bacteria involved in bioenergy
processing