Title: Virtual Organizations: Building Interdisciplinary Collaborations
1. Virtual Organizations: Building Interdisciplinary Collaborations
- Dan Reed
- reed_at_renci.org
- Chancellor's Eminent Professor
- Vice Chancellor for IT
- University of North Carolina at Chapel Hill
- Director, Renaissance Computing Institute
2. Acknowledgments
- Funding agencies
- NIH
- Carolina Center for Exploratory Genetic Analysis (CCEGA)
- NSF
- TeraGrid Science Gateways
- State of North Carolina
- RENCI and ancillary Bioportal support
- RENCI staff
- Alan Blatecky, Kevin Gamiel, Xiaojun Guan
- Clark Jefferies, Howard Lander
- John Magee, Ruth Marinshaw, Jeff Tilson
- Lavanya Ramakrishnan
- And a host of others
3. 21st Century Challenges
- The threefold way
- theory and scholarship
- experiment and measurement
- computation and analysis
- Supported by
- distributed, multidisciplinary teams
- multimodal collaboration systems
- distributed, large scale data sources
- leading edge computing systems
- distributed experimental facilities
- Socialization and community
- multidisciplinary groups
- geographic distribution
- new enabling technologies
- creation of 21st century IT infrastructure
- sustainable, multidisciplinary communities
- "Come as you are" response
[Figure labels: Computation, Experiment, Theory]
4. Exemplar 21st Century Challenges
- Population growth in sensitive areas
- severe weather sensitivity
- national impact
- geobiology and environment
- economics and finance
- sociology and policy
- Economics and health care
- longitudinal public health data
- environmental interactions
- genetic susceptibility
- heart disease, cancer, Alzheimer's
- privacy and insurance
- public policy and coordination
5. Mean Onset of Alzheimer's Disease
- apolipoprotein (apo)
- apoE2, apoE3 and apoE4 alleles
- on chromosome 19
- apoE4 allele
- 40% to 60% of Alzheimer's patients
- not the only cause of Alzheimer's
- apo gene inheritance
- 25% inherit 1 copy of apoE4 allele
- Alzheimer's risk increases 4X
- 2% inherit 2 copies of apoE4 allele
- Alzheimer's risk increases 10X
[Figure: proportion of each genotype unaffected (0 to 1.0) plotted against age at onset (60 to 85), with curves labeled 2/3, 2/4, 3/3, 3/4 and 4/4]
Source: Alan Roses, GSK
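The inheritance figures on this slide can be combined into a rough estimate of how many cases involve an apoE4 carrier. The sketch below takes the slide's carrier fractions as 25% (one copy) and 2% (two copies) with risk multipliers of 4X and 10X. It assumes, beyond what the slide states, that the remaining 73% carry no apoE4 copy and have a baseline relative risk of 1.0. It is a back-of-the-envelope illustration, not an epidemiological model.

```python
# Carrier fractions and risk multipliers come from the slide; the 73% remainder
# and its baseline relative risk of 1.0 are assumptions made for this sketch.
GENOTYPE_CLASSES = {
    "no apoE4 copy": (0.73, 1.0),
    "one apoE4 copy": (0.25, 4.0),
    "two apoE4 copies": (0.02, 10.0),
}


def case_shares(classes):
    """Return each class's share of all cases, weighting fraction by relative risk."""
    weighted = {name: frac * risk for name, (frac, risk) in classes.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


shares = case_shares(GENOTYPE_CLASSES)
carrier_share = shares["one apoE4 copy"] + shares["two apoE4 copies"]

for name, share in shares.items():
    print(f"{name}: {share:.1%} of cases")
print(f"apoE4 carriers: {carrier_share:.1%} of cases")
```

Under these assumptions carriers account for roughly 62% of cases, which is in the same neighborhood as the 40% to 60% range quoted on the slide.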
6. Big Questions
Protein sequence and regulation
DNA sequence
Sequence Annotation
Data integration
Network analysis
Pathway simulations
Multi-protein machines
Organs, Organisms and Ecologies
Metabolic pathways and regulatory networks
Bacteria and cells
7. Genetics and Disease Susceptibility
Phenotype 1 Phenotype 2 Phenotype 3
Phenotype 4
Ethnicity Environment
Age Gender
Identify Genes
Pharmacokinetics
Metabolism
Endocrine
Biomarker Signatures
Physiology
Proteome
Transcriptome
Immune
Morphometrics
Predictive Disease Susceptibility
Source: Terry Magnuson, UNC
8. PITAC Report Contents
- Computational Science: Ensuring America's Competitiveness
- A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness
- Medieval or Modern? Research and Education Structures for the 21st Century
- Multi-decade Roadmap for Computational Science
- Sustained Infrastructure for Discovery and Competitiveness
- Research and Development Challenges
- Two key appendices
- Examples of Computational Science at Work
- Computational Science Warnings: A Message Rarely Heeded
- Available at www.nitrd.gov
9. Life Science Lessons from Astronomy
- Historically, discoveries accrued to those
- with access to unique data
- who built next generation telescopes
- Two things changed
- growing costs and complexity of telescopes
- emergence of whole sky surveys
- The result: virtual astronomy
- discovering significant patterns
- analysis of rich image/catalog databases
- understanding complex astrophysical systems
- integrated data/large numerical simulations
10. International Virtual Observatory
Cluster Galaxy Morphology Analysis Portal (web browser on the user's machine)
1. User selects a cluster
2. Look up cluster in internally stored catalog of clusters
3. X-ray and optical images retrieved via SIA interface
4. User launches distributed analysis
5. Initial galaxy catalog generated via Cone Search
6. Image cutout pointers merged into catalog
7. Morphological parameters calculated on grid for each galaxy (Morphology Calculation Service)
8. User downloads final table and images for analysis and visualization
Services shown: Chandra SIA, Skyview SIA, DSS SIA, CNOC SIA, NED Cone Search, CADC CNOC Cone Search
Source: Ray Plante, NCSA
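The catalog-generation step in this workflow relies on a cone search: selecting every catalog object within a given angular radius of a sky position. The sketch below shows only that geometric selection on an in-memory list. The catalog rows, object names and the function names `angular_separation_deg` and `cone_search` are invented for illustration, and no actual Virtual Observatory service or protocol is called.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalog rows that lie within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


# Hypothetical catalog rows; positions are in degrees.
catalog = [
    {"name": "galaxy-a", "ra": 150.00, "dec": 2.20},
    {"name": "galaxy-b", "ra": 150.05, "dec": 2.25},
    {"name": "galaxy-c", "ra": 151.50, "dec": 3.00},
]

matches = cone_search(catalog, ra=150.0, dec=2.2, radius_deg=0.1)
print([row["name"] for row in matches])
```

A real portal would send the same three parameters (position and radius) to a remote catalog service and merge the returned rows, rather than filtering a local list.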
11. The Bioinformatics Challenge
- Challenge
- the rise of quantitative biology
- burgeoning bioinformatics data
- complex analysis and modeling problems
- education and training in new technologies
- Reality
- diverse tools with idiosyncratic interfaces
- steep learning curves
- software development by diverse groups
- distributed databases with diverse metadata
- Need
- integrated, easy-to-use toolset with standard interfaces
- extensible mechanisms that hide idiosyncrasies
- tool and bioinformatics training
- The solution
- bioinformatics infrastructure and coupled training
12. Need Simple, Easy-To-Use Tools
- "Genome. Bought the book. Hard to read."
- Eric Lander
13. Web and Social Processes
- Google
- it's a search engine, it's a verb, ...
- Blogs
- published self-expression
- Instant Messenger
- social networks
- Wireless messaging
- semi-synchronous
- Internet commerce
- the dot.com boom/bust
- eBay, Amazon
- Spam, phishing, ...
- anti-social behavior
14. Benefits of Standards
- Interoperability
- Separation of concerns
- Reuse
- Independence
- Dependability
- Sharing
- Commonality
- Shared knowledge base
- knowledge reuse
- simplification (one hopes)
15. Grids of All Flavors
16. What's a Grid/Web Service?
- Web: uniform access to documents (http://)
- Grid/Web Services: flexible, high-performance access to resources and services for distributed communities
[Figure labels: software catalogs, computers, sensors and instruments, colleagues, data archives]
17. Grid History: I-Way at SC95
- A prototype national infrastructure
- 17 sites, connected by
- vBNS and six other ATM networks
- 60 applications
- Features
- I-POPs for site access
- Kerberos authentication
- manual scheduling
- distributed communication libraries
- Experiences
- led to Globus Grid toolkit
- Concurrent industry needs
- led to web services for B2B interoperation
18. Web Services: Commercial Grids
- From browser-centric to service-centric
- from human-computer to computer-computer
- structured negotiation and response
- Workflow creation and management
- end-to-end service negotiation
- inter-organizational interaction
- Prerequisites
- metadata standard for service descriptions
- standard communication mechanisms
- resource discovery and registration
19. eBay Web Services Architecture
- Over 40% of eBay's listings are now via API calls
Source: IBM
20. Web Services: A Definition
- "A web service is designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact using its description using SOAP-messages, using HTTP with an XML serialization ...."
- W3C Working Draft, August 2003
[Figure: diagram with links labeled SOAP, WSDL and UDDI]
- SOAP (Simple Object Access Protocol)
- WSDL (Web Services Description Language)
- UDDI (Universal Description, Discovery and Integration)
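To make "machine-to-machine interaction" with an XML serialization concrete, here is a minimal sketch that builds and then parses a SOAP 1.1 style envelope using only the Python standard library. The service namespace `http://example.org/sequence-search`, the operation name `runSearch` and its parameters are hypothetical. A real service would publish these in its WSDL, and the message would normally be sent over HTTP rather than passed between two local functions.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/sequence-search"      # hypothetical service namespace


def build_request(operation, params):
    """Serialize an operation call as a SOAP envelope with a single body element."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text):
    """Recover the operation name and parameters from a SOAP envelope."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # first child of Body is the operation element
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("runSearch", {"sequence": "ACGT", "database": "nr"})
operation, params = parse_request(message)
print(operation, params)
```

The point of the exercise is that both sides agree only on the message format, not on each other's implementation language or platform.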
21. Technology Push
Source: Gartner Group
22. European myGrid Architecture
Source: www.mygrid.org
23. The Bioinformatics Challenges
- Complex, multilevel models
- integration and in silico designs
- Information visualization
- complexity and scale
- Data models and ontologies
- community definition
- Data federation, storage and management
- shared access and support
- User access portals
- web-based tool and service interfaces
- Packaging, distribution and deployment
- community building
24. Multilevel Cellular Models
- Signaling networks
- environmental triggers and behavior
- e.g., cell lifecycle
- different pathways in each tissue type
- Metabolic networks
- measurable products in pathway
- many systems are steady state
- negative feedback leads to stabilization
- Protein interaction networks
- localization of proteins that interact for function
- protein-protein interactions for specific actions
- Gene regulatory networks
- many things affect gene product concentration
- nucleic-nucleic, protein-nucleic interactions
- Computing, physics, engineering and biology
- control theory, mathematical models, phase spaces
- from biological cartoons to predictive models
- e.g., microRNAs and gene expression controls
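The metabolic-network bullets note that negative feedback leads to stabilization. A toy model makes that visible: a product whose own concentration represses its synthesis settles to the same steady state regardless of where it starts. The equation, parameter values and function name below are assumptions chosen for illustration (a Hill-type repression term with first-order decay, integrated by forward Euler), not a model taken from the talk.

```python
def simulate(x0, production=1.0, half_max=1.0, hill=2, decay=0.5,
             dt=0.01, steps=4000):
    """Integrate dx/dt = production / (1 + (x / half_max)**hill) - decay * x.

    The first term is synthesis repressed by the product itself (negative
    feedback); the second is first-order decay. Forward Euler is used for
    simplicity.
    """
    x = x0
    for _ in range(steps):
        synthesis = production / (1.0 + (x / half_max) ** hill)
        x += dt * (synthesis - decay * x)
    return x


final_from_low = simulate(0.0)
final_from_high = simulate(5.0)
print(f"start 0.0 -> {final_from_low:.4f}")
print(f"start 5.0 -> {final_from_high:.4f}")
```

With these parameters the steady state is x = 1, since synthesis 1 / (1 + 1) equals decay 0.5 * 1 there, and both trajectories converge to it.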
25. Biological Models
- Simulation and prediction
- structures and dynamics
- Reasoning and discovery
- reverse engineering
[Figure axes: temporal scale (seconds) and spatial scale (nm^3)]
26. Biophysical and Environmental Modeling
Airway/flow
Mucus
Disease, Environment and Medicine
Cilia
Cell biochemistry and structure
Proteomics
Genomics
Source: Ric Boucher, UNC
27. Data Heterogeneity and Complexity
Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), ...
Proteome
Source: Carole Goble (Manchester)
28. Sensor Data Overload
Source: Chris Johnson, Utah; Art Toga, UCLA
Source: Robert Morris, IBM
- High resolution brain imaging
- 4.5 petabytes (PB) per brain
29. RENCI: What Is It?
- Statewide objectives
- create broad benefit in a competitive world
- engage industry, academia, government and citizens
- Four target areas
- public benefit
- supporting urban planning, disaster response, ...
- economic development
- helping companies and people with innovative ideas
- research engagement across disciplines
- catalyzing new projects and increasing success
- building multidisciplinary partnerships
- education and outreach
- providing hands-on experiences and broadening participation
- Mechanisms and approaches
- partnerships and collaborations
- infrastructure as needed to accomplish goals
30. Carolina Center for Exploratory Genetic Analysis (CCEGA)
Interoperable Data Management
Faculty, Staff and Students
Driving Problems
Promoting Mutual Awareness
Experimental Genetics Portal
Analysis Techniques
Statistical and Computational Techniques
Extant Data Models
Virtuous Cycle
Interdisciplinary Research and Education
31. CCEGA Participants
- Coordination team
- Dan Reed, RENCI
- Terry Magnuson, CCGS
- Alan Blatecky, RENCI
- Kirk Wilhelmsen, CCGS
- Eleven departments/institutes
- Biostatistics
- Cancer Center
- Genetics
- Computer Science
- Epidemiology
- Genetics
- Health Science Library
- Information and Library Science
- Pharmacy
- RENCI
- Statistics
- Campus wide support
- from many sources
- Project participants
- Brad Hemminger, Information and Library Science
- James Evans, Genetics
- Kevin Gamiel, RENCI
- Xiaojun Guan, RENCI
- Barrie Hays, Health Science Library
- Clark Jefferies, RENCI
- Ethan Lange, Genetics
- Andrew Nobel, Statistics
- Karen Mohlke, Genetics
- Kari North, Epidemiology
- Susan Paulsen, Computer Science
- Fernando Manuel Pardo, Genetics
- Charles Perou, Cancer Center
- Lavanya Ramakrishnan, RENCI
- Jan Prins, Computer Science
- Patrick Sullivan, Genetics
- Lisa Susswein, Cancer Center
- David Threadgill, Genetics
32. Data: From Lab and Clinic to Analysis
- Independent data management
- data security
- version control
- redundancy
- controlled access
[Figure labels: Laboratory, Clinic, ELSI, Integration Informatics, Analysis]
- NIH CCEGA
- Carolina Center for Exploratory Genetic Analysis
Source: Brad Hemminger, UNC
33. Data Management and Information Viz
Published Domain Literature
Taxonomy Annotation
Ontology Annotation
..
DB Schema Ontology Annotation
Annotated Domain Literature
Information Mining Module
Information Visualization Module
34. From SNPs to HapMap
- Single Nucleotide Polymorphisms (SNPs)
- one in 1200 bases differs across individuals
- SNPs act as markers to locate genes
- Common groups of SNPs are shared
- i.e., form a haplotype
- HapMap data sources
- 90 Yoruba individuals (30 trios) from Nigeria (YRI)
- 90 individuals (30 trios) of European descent from Utah (CEU)
- 45 Han Chinese individuals from Beijing (CHB)
- 45 Japanese individuals from Tokyo (JPT)
- 3,500,000 SNPs typed
- basis for association studies for disease identification
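Two ideas on this slide lend themselves to a quick calculation: the SNP density and the notion that shared groups of SNPs form haplotypes. The sketch below assumes a genome of roughly 3 billion bases (a figure not stated on the slide) and uses a handful of invented four-SNP allele strings. It simply counts how often each allele combination occurs, which is the most basic way to see that a few haplotypes dominate.

```python
from collections import Counter

GENOME_BASES = 3_000_000_000  # assumed approximate genome length, not from the slide
SNP_SPACING = 1200            # "one in 1200 bases" from the slide

# Expected number of positions at which two individuals differ.
expected_differences = GENOME_BASES // SNP_SPACING
print(f"expected differing bases: {expected_differences:,}")

# Hypothetical phased chromosomes, each written as the alleles at four SNPs.
chromosomes = ["ACGT", "ACGT", "ATGA", "ACGT", "ATGA", "ACGA"]

# Identical allele combinations are copies of the same haplotype.
haplotype_counts = Counter(chromosomes)
common_haplotypes = [hap for hap, n in haplotype_counts.most_common() if n >= 2]

print(haplotype_counts)
print("common haplotypes:", common_haplotypes)
```

Because a few haplotypes account for most chromosomes, typing a subset of SNPs is often enough to identify which haplotype is present, which is what makes association studies on this scale practical.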
35. CCEGA HapMap Simulator
- Synthetic data
- disease models
- model testing
- mining bakeoffs
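As an illustration of what synthetic data under a disease model can look like, the sketch below generates genotypes at a single SNP and assigns disease status with a risk that multiplies for each copy of the risk allele. This is not the CCEGA simulator. The function names, the multiplicative model and every parameter value are assumptions chosen to keep the example small.

```python
import random


def simulate_cohort(n, allele_freq, baseline_risk, risk_per_copy, seed=0):
    """Return a list of (risk_allele_copies, affected) pairs for n individuals."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        # Each of the two chromosomes carries the risk allele independently.
        copies = sum(rng.random() < allele_freq for _ in range(2))
        # Multiplicative model: risk scales by risk_per_copy for every copy.
        risk = min(1.0, baseline_risk * risk_per_copy ** copies)
        cohort.append((copies, rng.random() < risk))
    return cohort


def mean_copies(cohort, affected):
    """Average number of risk-allele copies among cases or among controls."""
    group = [copies for copies, status in cohort if status == affected]
    return sum(group) / len(group)


cohort = simulate_cohort(n=20000, allele_freq=0.3,
                         baseline_risk=0.05, risk_per_copy=2.0)
case_mean = mean_copies(cohort, affected=True)
control_mean = mean_copies(cohort, affected=False)
print(f"mean risk-allele copies: cases {case_mean:.2f}, controls {control_mean:.2f}")
```

Because the true disease model is known, a data set like this can be handed to competing analysis methods to see which ones recover the planted association.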
36. Carolina Bioportal
- Three overlapping target groups
- undergraduate education
- graduate education and research
- academic/industrial research
- Features
- access to common bioinformatics tools
- extensible toolkit and infrastructure
- OGCE and National Middleware Initiative (NMI)
- leverages emerging international standards
- remotely accessible or locally deployable
- packaged and distributed with documentation
- National reach and community
- TeraGrid deployment
- science gateway
- Education and training
- hands-on workshops
- clusters, Grids, portals and bioinformatics
38. Distributed Grid and Web Services
Launch, configure and control
Grid Portals
Open Grid Service Infrastructure (web service component model)
Online instruments
Source: Dennis Gannon, Indiana
39. Bioportal Architecture
Bioportal
Interface Generator
HTML Files
PISE
Application XML Description
Application Processing
www.ncbioportal.org
Velocity Files
User Profile
Job Submission
Remote File Access
Job Records
Authentication, Grid Credential
Application Databases
Command Files
OGCE User Databases
Job History Database
Application Processing
MyProxy
GridFTP
Gatekeeper
Local cluster
- OGCE toolkit
- used by cyberinfrastructure projects
- LEAD, NEES, PACI, DOE, TeraGrid
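The architecture above pairs an XML application description with an interface generator and command files. A minimal sketch of that idea follows: one XML description drives both an HTML form and the command line that would be submitted as a job. The XML layout here is invented for illustration and is not the actual PISE schema. The helper names `render_form` and `build_command` are likewise hypothetical, and the real portal uses Velocity templates rather than string formatting.

```python
import html
import shlex
import xml.etree.ElementTree as ET

# Hypothetical application description; the real PISE format differs.
APP_XML = """
<application name="blastall">
  <parameter name="program" flag="-p" default="blastp"/>
  <parameter name="database" flag="-d" default="nr"/>
  <parameter name="evalue" flag="-e" default="10"/>
</application>
"""


def render_form(app):
    """Generate a simple HTML form with one input per described parameter."""
    app_name = html.escape(app.get("name"), quote=True)
    rows = [f'<form action="/submit/{app_name}" method="post">']
    for param in app.findall("parameter"):
        name = html.escape(param.get("name"), quote=True)
        default = html.escape(param.get("default"), quote=True)
        rows.append(f'  <label>{name} <input name="{name}" value="{default}"></label>')
    rows.append('  <input type="submit" value="Run">')
    rows.append("</form>")
    return "\n".join(rows)


def build_command(app, values):
    """Turn submitted form values into a shell-safe command line."""
    argv = [app.get("name")]
    for param in app.findall("parameter"):
        value = values.get(param.get("name"), param.get("default"))
        argv += [param.get("flag"), str(value)]
    return shlex.join(argv)


app = ET.fromstring(APP_XML)
form = render_form(app)
command = build_command(app, {"evalue": "1e-5"})
print(form)
print(command)
```

Adding a new tool to such a portal then means writing a new description file rather than new interface code, which is the extensibility the slides emphasize.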
40. Putting the Technologies Together
NC Bioportal
OGCE Toolkit (Grid middleware)
PISE (XML Wrapper)
Tomcat (Apache servlet container)
Chef (collaboration/standard portlets)
Jakarta Jetspeed (enterprise portal)
Bio Applications
Velocity (template engine)
Turbine (web app framework)
Grid Portlets, CoG
VMC
Databases
41. Community Software Toolkit Lessons
- NSF PACI Alliance "In a Box" toolkits
- cluster software (aka OSCAR)
- Grid infrastructure (aka NMI)
- Access Grid for distributed collaboration
- tiled display walls for visualization
- Distribution materials
- software and training materials
- CDs and web
- Community workshops and training
- Linux Clusters Institute
- MSI HPC workshops
- hands-on training
- Lowering the entry barrier
- usage and deployment
- Bioportal distribution
- workshops, tutorials
- training materials
- road shows
Bioportal Distribution
42. NC Bioportal: What's Next
- Engagement
- workshops, experiences and deployments
- Infrastructure
- dynamic job scheduling across multiple sites
- migration to OGCE 2.0
- fully automated database updates
- workflow construction and processing
- Portal tool suite
- expanded applications and databases
- phylogeny, morphology, microarray analysis, ...
- Training materials
- additional modules based on user feedback
- workshop materials packaged for self-study
- Leverage national presence
- TeraGrid/NCSA bioinformatics portal
43. The Vision of Grid/Web Services
- "Behold, the people is one, and they have all one language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do."
- Book of Genesis
Pieter Bruegel, The Tower of Babel (1563)
We're Not There Yet ...
44. Interdisciplinary Collaborations
- Appropriate reward structures
- well-matched time constants
- Intellectual equality
- balanced recognition of contributions
- Research/infrastructure distinctions
- timelines and people needs differ
- Confidentiality and openness
- academic/industry collaboration perspectives
- Intellectual property
- background IP and differential disciplinary models
45. Some Thoughts on the Future
- Grids/web services are not a panacea
- we have seen this movie before
- standards debates can be endless
- make new mistakes, not the same old ones
- code is shifted from modules to interfaces
- Danger of Death by CS Abstraction
- all problems can be solved by another level of indirection
- Appropriate decomposition is a challenge
- performance, usability, flexibility
- Generality and extensibility really matter
- incremental aggregation and interoperability
- data management and federation
- Better questions, not just private capabilities
- limited by creativity not resources
46. The Cambrian Explosion
- Most phyla appear
- sponges, archaeocyathids, brachiopods
- trilobites, primitive mollusks, echinoderms
- Indeed, most appeared quickly!
- Tommotian and Atdabanian
- as little as five million years
- Lessons for computing
- it doesn't take long when conditions are right
- raw materials and environment
- leave fossil records if you want to be remembered!
47. Thanks for the Invitation!