Title: IBM UK
1 IBM UK & Ireland Technical Consultancy Group
Prof. Malcolm Atkinson, Director
www.nesc.ac.uk
22nd May 2003
2 Outline
- What is e-Science?
- UK e-Science
- UK e-Science Roles and Resources
- Scientific Data Curation
- Data Access & Integration
- Data Analysis & Interpretation
- e-Science driving Disruptive Technology
- Economic impact, Mobile Code, Decomposition
- Global infrastructure, optimisation & management
- "Don't care where" computing
3 What is e-Science?
4 Foundation for e-Science
- e-Science methodologies will rapidly transform science, engineering, medicine and business
- Driven by exponential growth (1000x / decade)
- Enabling a whole-system approach
- Sensor nets
5 Convergence & Ubiquity
Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies
New Opportunities, New Results, New Rewards
6 [Images: UCSF, UIUC]
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
7 Global in-flight engine diagnostics
- 100,000 engines, 2-5 GB/flight, 5 flights/day
- 2.5 petabytes/day
- Distributed Aircraft Maintenance Environment (DAME)
- Universities of Leeds, Oxford, Sheffield & York
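As a rough cross-check of the 2.5 petabytes/day figure, a minimal sketch in Python, taking the upper 5 GB/flight end of the quoted range:

    # Back-of-the-envelope check of the in-flight diagnostics data rate
    engines = 100_000
    gb_per_flight = 5              # upper end of the quoted 2-5 GB range
    flights_per_day = 5
    gb_per_day = engines * gb_per_flight * flights_per_day
    print(gb_per_day / 1e6, "PB/day")   # -> 2.5 PB/day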
8 Tera → Peta Bytes
1 Terabyte:
- RAM time to move: 15 minutes
- 1 Gb WAN move time: 10 hours (~$1,000)
- Disk cost: 7 disks, $5,000 (SCSI)
- Disk power: 100 Watts
- Disk weight: 5.6 kg
- Disk footprint: inside machine
1 Petabyte:
- RAM time to move: 2 months
- 1 Gb WAN move time: 14 months (~$1 million)
- Disk cost: 6,800 disks + 490 units + 32 racks, $7 million
- Disk power: 100 Kilowatts
- Disk weight: 33 Tonnes
- Disk footprint: 60 m²
Now make it secure & reliable!
May 2003, approximately correct. See also: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
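The move-time and cost figures above can be reproduced with a small sketch. Assumptions not stated on the slide: its numbers imply a sustained rate of roughly 220 Mb/s on the nominal 1 Gb WAN link, and the ~$1/GB network cost is taken from Gray's paper.

    # Sketch: time and rough cost to move data across a WAN
    def wan_move(n_bytes, sustained_bps=220e6, dollars_per_gb=1.0):
        hours = n_bytes * 8 / sustained_bps / 3600
        cost = n_bytes / 1e9 * dollars_per_gb
        return hours, cost

    for label, size in [("1 TB", 1e12), ("1 PB", 1e15)]:
        hours, cost = wan_move(size)
        print(f"{label}: {hours:,.0f} h (~{hours/(30*24):.1f} months), ~${cost:,.0f}")

This gives about 10 hours and $1,000 for a terabyte, and about 14 months and $1 million for a petabyte, matching the table.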
9 e-Science in the UK
10 Additional UK e-Science Funding
- First Phase (2001-2004)
- Application Projects: £74M
- All areas of science and engineering
- >60 projects
- 340 at first All Hands Meeting
- Core Programme: £35M
- Collaborative industrial projects
- 80 companies, > £30 million
- Second Phase (2003-2006)
- Application Projects: £96M
- All areas of science and engineering
- Core Programme: £16M + £25M (?)
- Core Grid Middleware
Plus EU money! £40M JANET upgrade, £55M HPC(x)
11 e-Science and SR2002
- Research Council: 2004-06 (2001-04)
- Medical: £13.1M (£8M)
- Biological: £10.0M (£8M)
- Environmental: £8.0M (£7M)
- Eng & Phys: £18.0M (£17M)
- HPC: £2.5M (£9M)
- Core Prog.: £16.2M (?) (£15M + £20M)
- Particle Phys & Astro: £31.6M (£26M)
- Economic & Social: £10.6M (£3M)
- Central Labs: £5.0M (£5M)
12 NeSC in the UK
[Map of UK e-Science centres: Edinburgh, Glasgow, Newcastle, Belfast, Manchester, Daresbury Lab, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton. "You are here"]
13 UK Grid: Operational & Heterogeneous
- Currently a Level-2 Grid based on Globus Toolkit 2
- Transition to OGSI/OGSA will prove worthwhile
- There are still issues to be resolved
- OGSA definition / delivery
- Hosting environments & platforms
- Combinations of services supported
- Material and grids to support adopters
- A schedule of transitions should be (approximately & provisionally) published
- Expected time line
- Now: GT2 L2 service; GT3 middleware development & evaluation
- Q3-Q4 2003: GT2 L3; GT3 L1
- Q1-Q2 2004: significant project transitions to GT3 L2/L3
- Late Q4 2004: most projects have transitioned; end GT2 L3
14 Data Access & Integration
15 Biology & Medicine
- Extensive Research Community
- >1000 per research university
- Extensive Applications
- Many people care about them
- Health, Food, Environment
- Interacts with virtually every discipline
- Physics, Chemistry, Nanoengineering, ...
- 450 Databases relevant to bioinformatics
- Heterogeneity, Interdependence, Complexity, Change, ...
- Wonderful Scientific Questions
- How does a cell work?
- How does a brain work?
- How does an organism develop?
- Why is the biosphere so stable?
- What happens to the biosphere when the earth warms up?
1 petabyte of digital data / hospital / year
⇒ Lothian Region Hospitals produce more data than CERN
16 Database Growth
[Chart: PDB Content Growth; 39,856,567,747]
17 Infrastructure Architecture
Virtual Integration Architecture [layered diagram, top to bottom]:
- Data Intensive X Scientists
- Data Intensive Applications for Science X
- Simulation, Analysis & Integration Technology for Science X
- Generic Virtual Data Access and Integration Layer
- OGSA
- OGSI: Interface to Grid Infrastructure
- Compute, Data & Storage Resources (Distributed)
18 Data Access & Integration Services
19 Data Access and Integration Services
[Diagram: an Analyst's Client interacting with a Registry, a Factory, and a network of Grid Data Services (GDS) and GDTS units in front of XML and relational databases holding sources Sx and Sy]
- 1a. Client requests from the Registry sources of data about x & y
- 1b. Registry responds with a Factory handle
- 2a. Client requests from the Factory access and integration over resources Sx and Sy
- 2b. Factory creates a GridDataServices network
- 2c. Factory returns the handle of a GDS to the Client
- 3a. Client submits a sequence of scripts, each with a set of queries (XPath, SQL, etc.), to the GDS
- 3b. Client tells the Analyst
- 3c. Sequences of result sets are returned to the Analyst as formatted binary described in a standard XML notation
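The interaction pattern above can be paraphrased as a small client-side sketch. All class and method names here (Registry.lookup, Factory.create_gds, GridDataService.perform) are hypothetical illustrations of the pattern, not the OGSA-DAI API.

    # Hypothetical sketch of the registry -> factory -> GDS pattern
    class Registry:
        def __init__(self, factories):
            self.factories = factories        # data topic -> Factory handle
        def lookup(self, topic):
            return self.factories[topic]      # steps 1a/1b

    class Factory:
        def create_gds(self, *sources):
            return GridDataService(sources)   # steps 2a-2c

    class GridDataService:
        def __init__(self, sources):
            self.sources = sources
        def perform(self, script):
            # steps 3a/3c: run a query script (SQL, XPath, ...) on each source
            return [f"result of {script!r} on {s}" for s in self.sources]

    registry = Registry({"x&y": Factory()})
    factory = registry.lookup("x&y")                  # 1a, 1b
    gds = factory.create_gds("Sx", "Sy")              # 2a, 2c
    print(gds.perform("SELECT * FROM x JOIN y"))      # 3a, 3c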
20 ODD-Genes
PSE (Problem Solving Environment)
21 Scientific Data
- Challenges
- Data Huggers
- Meagre metadata
- Ease of Use
- Optimised integration
- Dependability
- Opportunities
- Global Production of Published Data
- Volume ↑, Diversity ↑
- Combination → Analysis → Discovery
- Opportunities
- Specialised Indexing
- New Data Organisation
- New Algorithms
- Varied Replication
- Shared Annotation
- Intensive Data Computation
- Challenges
- Fundamental Principles
- Approximate Matching
- Multi-scale optimisation
- Autonomous Change
- Legacy structures
- Scale and Longevity
- Privacy and Mobility
22 Disruptive e-Science Drivers?
23 Mohammed & the Mountains
- Petabytes of Data cannot be moved
- They stay where they are produced or curated
- Hospitals, observatories, European Bioinformatics Institute, ...
- Distributed collaborating communities
- Expertise in curation, simulation & analysis
- Distributed diverse data collections
- Discovery depends on insights
- Tested by combining data from many sources
- Using sophisticated models & algorithms
- What can you do?
24 Move computation to the data
- Assumption: code size << data size
- Develop the database philosophy for this?
- Queries are dynamically re-organised & bound
- Develop the storage architecture for this?
- Compute closer to disk?
- System-on-a-Chip using free space in the on-disk controller
- Safe hosting of arbitrary computation
- Proof-carrying code for data- and compute-intensive tasks; robust hosting environments
- Provision combined storage & compute resources
- Decomposition of applications
- To ship behaviour-bounded sub-computations to data
- Co-scheduling & co-optimisation
- Data & code (movement), code execution
- Recovery and compensation
Dave Patterson, SIGMOD '98
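A minimal sketch of the idea, under the stated assumption that code size << data size: hand a small, behaviour-bounded filter function to a service co-located with the data, so only the reduced result moves. DataHostStub and its run method are illustrative stand-ins, not part of any Grid middleware.

    # Illustrative only: ship a small bounded computation to the data
    class DataHostStub:
        def __init__(self, rows):
            self.rows = rows                  # petabyte-scale in reality
        def run(self, shipped_code, row_limit=1_000_000):
            # "Behaviour-bounded": cap how much the shipped code may touch
            return [r for r in self.rows[:row_limit] if shipped_code(r)]

    def interesting(row):                     # the shipped sub-computation
        return row["temperature"] > 40.0

    host = DataHostStub([{"temperature": t} for t in (12.0, 41.5, 39.9, 44.2)])
    print(host.run(interesting))              # only the small result moves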
25 Software Changes
- Integrated Problem Solving Environments
- Users & application developers see
- Abstract computer and storage system
- Where and how things are executed can be ignored
- Diversity, detail, ownership, dependability, cost
- Explicit and visible
- Increasing sophistication of description
- Metadata for discovery
- Metadata for management and optimisation
- Raising the semantic level of discourse
- Applications developed dynamically by composition
- Mobile, Safe & Re-organisable Code
- Predictable & Guaranteed behaviour
- Decomposition & re-composition
- New programming languages & understanding needed
26 Organisational & Cultural Changes
- Access to Computation & Data must be simple
- All use a computational, semantic, data-rich web
- i.e. it's invisible; the portal / browser lets you do more
- Responsibility of data publishers
- Cost, dependability, trustworthiness, capability, flexibility, ...
- Shared contributions compose indefinitely
- Knowledge accumulation and interdependence
- Contributor recognition and IPR
- Complexity and management of infrastructure
- Always on
- Must be sustained
- Paid for
- Hidden
Health, Energy, Finance, Government, Education
Games @ Home
27 Comments & Questions, Please
www.ogsadai.org.uk
www.nesc.ac.uk