Title: Discovery Net
1Discovery Net
- Yike Guo (1), John Darlington (Dept. of Computing)
- John Hassard (Depts. of Physics and Bioengineering)
- Bob Spence (Dept. of Electrical Engineering)
- Tony Cass (Department of Biochemistry)
- Sevket Durucan (T. H. Huxley School of Environment)
- (1) Contact address: yg_at_doc.ic.ac.uk
2Meeting the LTTR challenge
Sharing Data Information to Create Knowledge
3High Throughput Sensing Characteristics
- Different devices but same computational characteristics
- Data intensive
- Data dispersive
- large scale,
- heterogeneous
- distributed data
- Real-time data manipulation: need to
- calibrate
- integrate
- analyse
Discovery issues: distributed knowledge discovery and management; incremental, interactive discovery; collaborative discovery
Information issues: annotations, semantics, reference, integrated view of data
Data issues: different measurements for the same object; data registration, normalisation, calibration; quality control
GRID issues: wide area, high volume, scalability (data, users), collaboration
4DNet Architecture
High Throughput Sensing (HTS) Applications
- Large-scale Dynamic Real-time Decision Support
- Large-scale Dynamic System Knowledge Discovery
Based on Kensington Discovery Platform
- Grid-based Knowledge Discovery: grid-based data mining, collaborative visualisation
- Information Structuring: information integration, composition, semantics, domain-based ontologies, sharing
- Distributed Data Engineering: data registration, data normalisation, data quality
Based on Globus ORB Infrastructure
- High Throughput Computing Services: utilising Grid infrastructure for HT computing
- Grid Basic Infrastructure: Globus/Condor/SRB
5Test bed Performance Requirements
- Giga Byte mining: > 100,000,000,000 GBytes
- Mega column feature: > 1,000,000 columns
- Tera Byte warehousing: > 20 TB
- Tera Flop processing: > 1 Tflops
- Real-time deployment: < 1 msec (power grid reaction time)
- Device scalability: > 10,000 HTD (e.g. sensors)
- User scalability: > 100 scientists performing concurrent analysis
6The IC Advantage
The IC infrastructure: a microgrid for the testbed
Network
- More than 12,000 end devices
- 10 Mb/s to 1 Gb/s to end devices
- 1 Gb/s between floors
- 10 Gb/s to backbone
- 10 Gb/s between backbone router matrix and wireless capability
- 2 x 1 Gb/s to LMAN II (10 Gb/s scheduled 2004)
ICPC Resource
- 150 Gflops processing
- > 100 GB memory
- 5 TB of disk storage
3m SRIF funding
- Network upgrade
- 20 TB of disk storage
- 25 TB of tape storage
- 3 clusters (> 1 Tera Flops)
7Testbed Applications
HTS Applications
Large-scale Dynamic Real-time Decision Support
Large-scale Dynamic System Knowledge Discovery
- Bio Chip Applications
- Protein-folding chips, SNP chips, Diff. Gene chips using LFII
- Protein-based fluorescent micro arrays
- Renewable energy Applications
- Tidal Energy
- Connections to other renewable initiatives (solar, biomass, fuel cells), to CHP and baseload stations
- Remote Sensing Applications
- Air Sensing, GUSTO
- Geological, geohazard analysis
Application figures, listed as Throughput (GB/s); Size (petabytes); Node number; Operations
- 1-10; 1-10; > 20000; Structuring, Mining, Optimisation, RT decisions
- 1-100; 10-100; > 50000; Image Registration, Visualisation, Predictive Modelling, RT decisions
- 1-1000; 10-1000; > 10000; Data Quality, Visualisation, Structuring, Clustering, Distributed Dynamic Knowledge Management
8Biotechnology and Discovery Net
The Protein Data Bank (PDB), which maintains data on
the three-dimensional (3D) structure of
biological macromolecules, is doubling in size
every 18 months.
GenBank, the DNA sequence database maintained by
the National Center for Biotechnology Information
(NCBI), is doubling in size every 21 months.
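To put the doubling times above in perspective, the short sketch below computes the growth factor implied by a fixed doubling period. The 3-year horizon is an assumption chosen to match the length of the project plan, and the function name is illustrative only.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Factor by which a collection grows over `years` if it doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling periods taken from the slide; the 3-year horizon is an assumption.
for name, months in [("PDB", 18.0), ("GenBank", 21.0)]:
    print(f"{name}: x{growth_factor(3.0, months):.1f} over 3 years")
```

Under these assumptions the PDB grows fourfold and GenBank a little more than threefold over the project lifetime, before any additional data from new instruments is counted.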
9Protein and gene databases
Our LFII approach will enlarge the number and
size (x100?) of these databases. Our goal will be to
establish QC and backward compatibility with
legacy databases.
10Geo-hazard prediction
Each pixel of a radar image contains information
on the phase of the signal backscattered from the
underlying surface. By utilizing the geometry
provided by two marginally displaced, coherent
observations of the surface, phase difference
between the two observations can be related to
surface height. Furthermore, by repeated
observation, it is possible to measure surface
displacements of scattering features that have
been slightly shifted (due to an earthquake for
example), or that are moving continuously but
relatively slowly (such as ice sheets and
glaciers).
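The displacement measurement described above can be illustrated with the standard repeat-pass relation d = wavelength x phase difference / (4 pi), where the factor 4 pi accounts for the two-way signal path. This is a minimal sketch, not the project's processing chain: it assumes the topographic phase has already been removed, that the phase is unwrapped, and a C-band wavelength of about 5.66 cm (an assumed value for an ERS-class radar). Sign conventions vary between processors.

```python
import math

# Assumed C-band radar wavelength in metres (illustrative value).
ASSUMED_WAVELENGTH_M = 0.0566


def los_displacement(phase_diff_rad: float,
                     wavelength_m: float = ASSUMED_WAVELENGTH_M) -> float:
    """Line-of-sight displacement in metres for an unwrapped phase difference.

    The signal travels to the surface and back, so one full cycle (2*pi)
    of phase corresponds to half a wavelength of surface motion.
    """
    return wavelength_m * phase_diff_rad / (4.0 * math.pi)


# One full interference fringe corresponds to half a wavelength.
one_fringe_m = los_displacement(2.0 * math.pi)
print(f"One fringe = {one_fringe_m * 100:.2f} cm of line-of-sight motion")
```

Because one fringe corresponds to only a few centimetres, the technique is sensitive to the small, slow shifts mentioned in the text.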
Monitoring geo-hazards: we analyse temporal changes in soil erosion to predict landslides and floods.
- 5-6 Gbytes/day of image data, sensed at different wavelengths at 30 meters/pixel for a 180 x 180 km area
- Terabytes at 1 meter/pixel to cover the UK
The useful information comes from time-resolved correlations with other sensors, and with other environmental data sets.
Sensors: LANDSAT, ASTER, IKONOS, ERS (5 Gbytes/scene), Airborne Radar, INSAR
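The image volumes quoted for this application follow from simple pixel counts. The sketch below is an order-of-magnitude check only; the UK land area (about 243,000 square km) and the figure of one byte per pixel per band are assumptions introduced here, not values from the slides.

```python
def pixel_count(area_km2: float, metres_per_pixel: float) -> float:
    """Number of pixels needed to cover `area_km2` at the given ground resolution."""
    pixels_per_km2 = (1000.0 / metres_per_pixel) ** 2
    return area_km2 * pixels_per_km2


# A 180 x 180 km scene at 30 m/pixel (figures from the slide).
scene_pixels = pixel_count(180 * 180, 30.0)

# UK coverage at 1 m/pixel; the area is an assumed round figure.
ASSUMED_UK_AREA_KM2 = 243_000
uk_pixels = pixel_count(ASSUMED_UK_AREA_KM2, 1.0)

# Assuming 1 byte per pixel per band, the size of one band in terabytes.
uk_tb_per_band = uk_pixels / 1e12

print(f"Scene: {scene_pixels:.2e} pixels")
print(f"UK at 1 m: {uk_pixels:.2e} pixels, ~{uk_tb_per_band:.2f} TB per band")
```

With several spectral bands and repeated acquisitions, a single-band figure of this size reaches the terabyte range quickly.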
11Large-scale urban air sensing applications
Each GUSTO air pollution system produces 1 kbit
per second, or 10^10 bits per year. We expect to
increase the number (from the present 2 systems)
to over 20,000 over the next 3 years, to reach a
total of 0.6 petabytes of data within the 3-year
ramp-up.
The useful information comes from time-resolved
correlations among remote stations, and with
other environmental data sets.
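The data rates for this application can be checked with a few lines of arithmetic. The sketch below assumes continuous operation at 1 kbit per second for every system and a 365-day year; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, assuming a 365-day year


def bits_per_year(n_systems: int, bits_per_second: int = 1000) -> int:
    """Total bits produced in one year by `n_systems` running continuously."""
    return n_systems * bits_per_second * SECONDS_PER_YEAR


single = bits_per_year(1)        # one system
fleet = bits_per_year(20_000)    # full deployment quoted on the slide
fleet_terabytes = fleet / 8 / 1e12

print(f"One system:     {single:.2e} bits/year")
print(f"20,000 systems: {fleet:.2e} bits/year (~{fleet_terabytes:.0f} TB/year)")
```

Under these assumptions one system yields roughly 3 x 10^10 bits per year, and the full fleet about 6 x 10^14 bits per year, which is roughly 79 terabytes per year.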
Figure labels: NO simulant, 6.7.2001; "You are here"
12Electrical grid
- There is large potential in embedded generation from renewable sources; they will dominate in new build over the usual baseline (nuclear, hydro and carbon) power stations. Decentralised power is the new paradigm.
- Renewable sources include solar, wind, tidal and biomass, and must be combined with baseload, CHP etc.
- Renewables are characterised by
- a large number of small units,
- often in remote areas,
- wireless connectivity,
- fluctuating, unpredictable loading.
- As the total exceeds 12, grid control becomes very difficult without an RT e-grid.
The grid structure and the current regulatory and
charging regimes for the electricity supply
industry were set up to cater for centralised
generation and are often not appropriate for
smaller plant connected directly into the
distribution network. Deregulation, pollution
control standards, and the need for greater power
quality have severe implications for "seven
nines" users, e.g. ISPs, requiring 99.99999% uptime
(3 s down p.a.)! Sun Microsystems: 1m per
minute of power downtime. EPRI shows that 100
transients per month cost US 30-50bn per year.
- active management,
- dispatch metering RT monitoring,
- RT control,
- minute to minute security,
- pan network optimisation.
- This requires very high bandwidth RT remote station data acquisition, warehousing and analysis.
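The "seven nines" uptime figure on this slide translates directly into an annual downtime budget. The sketch below shows the conversion, assuming a 365-day year; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assuming a 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    Seven nines means 99.99999% uptime, i.e. an unavailability of 10**-7.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * SECONDS_PER_YEAR


for n in (3, 5, 7):
    print(f"{n} nines: {downtime_seconds_per_year(n):.2f} s of downtime per year")
```

For seven nines this gives a little over 3 seconds per year, in line with the figure quoted on the slide.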
13Deliverables I Testbed
- High Throughput Computing Services
- Transparent utilisation of distributed processing and storage facilities
- High volume computation based on grid software (Globus, Condor)
- Object abstraction framework
- Resource pooling and sharing
- Efficient resource discovery and utilisation (brokering, application level scheduling)
- Utilisation of Grid services
- Condor cycle-stealing
- Globus
14Deliverables II Testbed
- Knowledge Discovery Services
- Distributed data engineering
- Data registration, normalisation, quality control
- Information structuring and composition
- Application-oriented information structuring
- Domain-specific ontologies and information composition
- Large-scale distributed mining
- Grid-based data mining algorithm
- Knowledge management and auditing
- Collaborative visualisation
15Deliverables III Service Applications
- Virtual Cell: cell function modelling based on functional genomics (gene chip), protein expression and protein-protein interaction data (John Hassard, Tony Cass, Jeff Harford)
- Environment Vista: remote sensing analysis environment; real-time pollution analysis, modelling and visualisation for the urban (London) milieu, with upgrade path mapped for pan-Europe roll-out (Ray Wrigley)
- Global Geohazard e-Grid: optimised high bandwidth analysis and visualisation framework for European hazard monitoring (Steve Durucan)
- UK Regional Power Quality e-Grid: load balancing and grid optimisation for simulated increasing renewable power loading (Geoff Rochester)
16The Consortium
- Industry connection: 4 spin-off companies, > 100 related companies (AstraZeneca, Pfizer, GSK, Cisco, IBM, HP, Fujitsu, Gene Logic, Applera, Evotec, International Power, Hydro Quebec, BP, British Energy, ...)
- Innovation: > 30 patents, world class cross-disciplinary research
- Wide coverage of LLTR fields
- Long lasting working relationship: close collaboration for making deliverables
17Milestones
- 9 months
- Basic middleware, DAQ, collection, data registration, information structuring for at least two fields (demo in Supercomputing 2002)
- 18 months
- Scalable middleware, data normalisation, data integration, information composition, distributed mining and visualisation structure
- Integrated with USA infrastructure
- 27 months
- Online data quality control, ontologies and data reference, scalable mining, demo of Virtual Cell and Environment Vista
- 36 months
- Packaged D-Net as an e-science platform for national grid deployment, demo of all applications
18Industrial Contribution
- Hardware: sensors (photodiode arrays, hybrid photodiodes, PMTs), systems (optics, mechanical systems, DSPs, FPGAs)
- Software (analysis packages, algorithms, data warehousing and mining systems)
- Intellectual Property: access to IP portfolio suite at no cost (starting with 32 international patents)
- Data: raw and processed data from biotechnology, pharmacogenomic, remote sensing (GUSTO installations, satellite data from geo-hazard programmes) and renewable energy data (from our own remote tidal power systems)
- People: > 8 scientists
19Project Management
- Project PI and co-ordinator: Yike Guo
- Project director and strategist: John Darlington
- Applications co-ordinator: John Hassard
- LFII biotechnology: J Harford
- Protein chips: T Cass
- Geohazard: S Durucan
- Remote sensing: R Wrigley
- Renewable energy: G Rochester
- Project operation manager: Moustafa Ghanem
- We will establish a Scientific Advisory Board by month 6.