Discovery Net - PowerPoint PPT Presentation

Slides: 20
Provided by: jha103

1
Discovery Net
  • Yike Guo [1], John Darlington (Dept. of Computing)
  • John Hassard (Depts. of Physics and Bioengineering)
  • Bob Spence (Dept. of Electrical Engineering)
  • Tony Cass (Department of Biochemistry)
  • Sevket Durucan (T. H. Huxley School of Environment)
  • [1] Contact address: yg_at_doc.ic.ac.uk

2
Meeting the LTTR challenge
Sharing Data and Information to Create Knowledge
3
High Throughput Sensing Characteristics
  • Different devices but same computational
    characteristics
  • Data intensive
  • Data dispersive: large scale, heterogeneous,
    distributed data
  • Real-time data manipulation. Need to calibrate,
    integrate, analyse

  • Discovery issues: distributed knowledge
    discovery and management; incremental,
    interactive discovery; collaborative discovery
  • Information issues: annotations, semantics,
    reference, integrated view of data
  • Data issues: different measurements for same
    object; data registration, normalisation,
    calibration, quality control
  • GRID issues: wide area, high volume,
    scalability (data, users), collaboration
4
DNet Architecture
  • High Throughput Sensing (HTS) Applications:
    Large-scale Dynamic Real-time Decision Support;
    Large-scale Dynamic System Knowledge Discovery
Based on Kensington Discovery Platform
  • Grid-based Knowledge Discovery: Grid-based data
    mining, collaborative visualisation
  • Information Structuring: information
    integration, composition, semantics,
    domain-based ontologies, sharing
  • Distributed Data Engineering: data registration,
    data normalisation, data quality
Based on Globus ORB Infrastructure
  • High Throughput Computing Services: utilising
    Grid infrastructure for HT computing
  • Grid Basic Infrastructure: Globus/Condor/SRB
5
Test bed Performance Requirements
  • Giga Byte mining: > 100,000,000,000 GBytes
  • Mega column feature: > 1,000,000 columns
  • Tera Byte warehousing: > 20 TB
  • Tera Flop processing: > 1 Tflops
  • Real-time deployment: < 1 msec (power grid
    reaction time)
  • Device scalability: > 10,000 HTD (e.g. sensors)
  • User scalability: > 100 scientists performing
    concurrent analysis
6
The IC Advantage
The IC infrastructure: microgrid for the testbed
Network
  • More than 12,000 end devices
  • 10 Mb/s to 1 Gb/s to end devices
  • 1 Gb/s between floors
  • 10 Gb/s to backbone
  • 10 Gb/s between backbone router matrix and
    wireless capability
  • Network upgrade: 2x1 Gb/s to LMAN II (10 Gb/s
    scheduled 2004)
ICPC Resource
  • 150 Gflops processing
  • > 100 GB memory
  • 5 TB of disk storage
3m SRIF funding
  • 20 TB of disk storage
  • 25 TB of tape storage
  • 3 clusters (> 1 Tera Flops)
7
Testbed Applications
HTS Applications (Large-scale Dynamic Real-time
Decision Support; Large-scale Dynamic System
Knowledge Discovery)
  • Bio Chip Applications: protein-folding chips,
    SNP chips, diff. gene chips using LFII;
    protein-based fluorescent micro arrays
  • Renewable energy Applications: tidal energy;
    connections to other renewable initiatives
    (solar, biomass, fuel cells), to CHP and
    baseload stations
  • Remote Sensing Applications: air sensing, GUSTO;
    geological, geohazard analysis

Requirements (throughput in GB/s; size in
petabytes; node number; operations)
  • 1-10 GB/s; 1-10 petabytes; > 20,000 nodes;
    structuring, mining, optimisation, RT decisions
  • 1-100 GB/s; 10-100 petabytes; > 50,000 nodes;
    image registration, visualisation, predictive
    modelling, RT decisions
  • 1-1000 GB/s; 10-1000 petabytes; > 10,000 nodes;
    data quality, visualisation, structuring,
    clustering, distributed dynamic knowledge
    management
8
Biotechnology and Discovery Net
Protein Data Bank (PDB), which maintains data on
the three-dimensional (3D) structure of
biological macromolecules, is doubling in size
every 18 months.
Genbank, the DNA sequence database maintained by
the National Center for Biotechnology Information
(NCBI), is doubling in size every 21 months.
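
To make the doubling times above concrete, the following is a minimal sketch of the implied growth, assuming growth stays exponential with a fixed doubling time. The function name and the 36-month horizon (chosen to match the project length in the milestones) are illustrative assumptions, not part of the original slides.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after elapsed_months, given a fixed doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)


# Doubling times quoted on the slide.
PDB_DOUBLING_MONTHS = 18
GENBANK_DOUBLING_MONTHS = 21

# Illustrative horizon: 36 months (an assumption for this example).
pdb_growth = growth_factor(36, PDB_DOUBLING_MONTHS)          # 2**2 = 4.0
genbank_growth = growth_factor(36, GENBANK_DOUBLING_MONTHS)  # about 3.28

print(f"PDB grows x{pdb_growth:.2f}, GenBank grows x{genbank_growth:.2f} in 36 months")
```

Under these assumptions, both databases would grow by a factor of roughly three to four over three years.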
9
Protein and gene databases
Our LFII approach will enlarge the number and
size (x100?) of these databases. Our goal will be
to establish QC, and backward compatibility with
legacy databases.
10
Geo-hazard prediction
Each pixel of a radar image contains information
on the phase of the signal backscattered from the
underlying surface. By utilizing the geometry
provided by two marginally displaced, coherent
observations of the surface, phase difference
between the two observations can be related to
surface height. Furthermore, by repeated
observation, it is possible to measure surface
displacements of scattering features that have
been slightly shifted (due to an earthquake for
example), or that are moving continuously but
relatively slowly (such as ice sheets and
glaciers).
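
The displacement measurement described above can be sketched numerically. For repeat-pass interferometry the signal travels to the surface and back, so a phase difference Δφ corresponds to a line-of-sight displacement of λ·Δφ/(4π). The sketch below assumes that standard relation, an already unwrapped phase, and a C-band wavelength of about 5.66 cm (typical of ERS); the function name is hypothetical and sign conventions vary between processors. Converting phase to surface height additionally needs the baseline geometry, which is not covered here.

```python
import math


def los_displacement_m(phase_diff_rad: float, wavelength_m: float) -> float:
    """Line-of-sight displacement for an unwrapped interferometric phase difference.

    Assumes repeat-pass geometry (two-way path), hence the factor 4*pi.
    """
    return wavelength_m * phase_diff_rad / (4.0 * math.pi)


# Assumed C-band radar wavelength in metres (approximate ERS value).
C_BAND_WAVELENGTH_M = 0.0566

# One full fringe (2*pi of phase) corresponds to half a wavelength of motion.
one_fringe = los_displacement_m(2.0 * math.pi, C_BAND_WAVELENGTH_M)
print(f"One fringe = {one_fringe * 100:.2f} cm of line-of-sight displacement")
```

The centimetre-scale sensitivity per fringe is what makes slowly moving features such as glaciers measurable from repeated observations.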
Monitoring geo-hazards: we analyse temporal
changes in soil erosion to predict landslides
and floods.
  • 5-6 Gbytes/day of image data, sensed at
    different wavelengths at 30 metres/pixel for a
    180 x 180 km area
  • Terabytes at 1 metre/pixel to cover the UK
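
The effect of resolution on data volume in the figures above can be checked with a short calculation. This is a rough sketch that only counts pixels for a square scene; the function name is hypothetical, and bytes per pixel, number of wavelength bands and revisit rate are not given in the slides, so they are left out.

```python
def pixels_per_scene(side_km: float, metres_per_pixel: float) -> int:
    """Number of pixels in a square scene of the given side length and resolution."""
    side_px = round(side_km * 1000 / metres_per_pixel)
    return side_px * side_px


coarse = pixels_per_scene(180, 30)  # 6000 x 6000 pixels
fine = pixels_per_scene(180, 1)     # 180000 x 180000 pixels

print(f"30 m/pixel: {coarse:,} pixels")
print(f" 1 m/pixel: {fine:,} pixels ({fine // coarse}x more)")
```

Going from 30 m to 1 m per pixel multiplies the pixel count by 900 for the same area, which is consistent with the jump from gigabytes to terabytes.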
The useful information comes from time-resolved
correlations from other sensors, and with other
environmental data sets: LANDSAT, ASTER, IKONOS,
ERS (5 Gbytes/scene), airborne radar, INSAR.
11
Large-scale urban air sensing applications
Each GUSTO air pollution system produces 1 kbit
per second, or 10^10 bits per year. We expect to
increase the number (from the present 2 systems)
to over 20,000 over the next 3 years, to reach a
total of 0.6 petabytes of data within the 3-year
ramp-up.
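
The per-system rate above can be turned into yearly volumes with simple arithmetic. The sketch below assumes a constant 1 kbit/s per system and a 365-day year; the names are illustrative. It gives the steady-state yearly volume at full deployment, not the cumulative ramp-up total, because the slides do not state the deployment schedule or any storage overheads.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years


def bits_per_year(bits_per_second: int, n_systems: int = 1) -> int:
    """Yearly data volume in bits for n_systems streaming at a constant rate."""
    return bits_per_second * SECONDS_PER_YEAR * n_systems


one_system = bits_per_year(1000)              # order of 10^10 bits per year
full_deployment = bits_per_year(1000, 20000)  # order of 10^14 to 10^15 bits per year

print(f"1 system:       {one_system:.2e} bits/year")
print(f"20,000 systems: {full_deployment:.2e} bits/year "
      f"({full_deployment / 8 / 1e12:.1f} TB/year)")
```

Under these assumptions a single system produces about 3.2 x 10^10 bits per year, and 20,000 systems produce about 6.3 x 10^14 bits per year.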
The useful information comes from time-resolved
correlations among remote stations, and with
other environmental data sets.
(Figure: NO simulant, 6.7.2001)
12
Electrical grid
  • There is large potential in embedded generation
    from renewable sources: they will dominate in
    new build over usual baseline (nuc., hydro and
    carbon) power stations. Decentralised power is
    the new paradigm.
  • Renewable sources include solar, wind, tidal,
    biomass and must be combined with baseload and
    CHP etc.
  • Renewables are characterised by a large number
    of small units, often in remote areas, wireless
    connectivity, and fluctuating, unpredictable
    loading.
  • As total exceeds 12 grid control becomes very
    difficult without RT e-grid.

The grid structure and the current regulatory and
charging regimes for the electricity supply
industry were set up to cater for centralised
generation and are often not appropriate for
smaller plant connected directly into the
distribution network. Deregulation, pollution
control standards, and the need for greater power
quality have severe implications for "seven
nines" users, e.g. ISPs, requiring 99.99999%
uptime (3 s down p.a.)! Sun Microsystems: 1m per
minute of power downtime. EPRI shows that 100
transients per month cost US$30-50bn per year.
  • active management,
  • dispatch metering, RT monitoring,
  • RT control,
  • minute to minute security,
  • pan network optimisation.
  • This requires very high bandwidth RT remote
    station data acquisition, warehousing and
    analysis.
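
The "seven nines" downtime figure quoted on this slide can be reproduced directly. This is a minimal sketch assuming a 365-day year and that availability is expressed as a count of nines (seven nines meaning 99.99999%); the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of the given number of nines."""
    unavailable_fraction = 10.0 ** (-nines)
    return unavailable_fraction * SECONDS_PER_YEAR


seven_nines = downtime_seconds_per_year(7)  # about 3.15 s per year
five_nines = downtime_seconds_per_year(5)   # about 315 s per year

print(f"Seven nines: {seven_nines:.2f} s/year; five nines: {five_nines:.0f} s/year")
```

The result of roughly 3 seconds per year matches the figure on the slide, and shows how much stricter this is than the more common five-nines target.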

13
Deliverables I: Testbed
  • High Throughput Computing Services
  • Transparent Utilisation of Distributed Processing
    and Storage facilities
  • High Volume Computation based on grid software
    (Globus, Condor)
  • Object abstraction framework
  • Resource Pooling and Sharing
  • Efficient Resource Discovery and Utilisation
    (brokering, application level scheduling)
  • Utilisation of Grid Services
  • Condor cycle-stealing
  • Globus

14
Deliverables II: Testbed
  • Knowledge Discovery Services
  • Distributed data engineering
  • Data registration, Normalisation, Quality
    Control
  • Information Structuring and Composition
  • Application-oriented information structuring
  • Domain Specific Ontologies and Information
    Composition
  • Large-scale distributed mining
  • Grid-based data mining algorithm
  • Knowledge management and auditing
  • Collaborative visualisation

15
Deliverables III: Service Applications
  • Virtual Cell: cell function modelling based on
    functional genomics (gene chip), protein
    expression and protein-protein interaction data
    (John Hassard, Tony Cass, Jeff Harford)
  • Environment Vista: remote sensing analysis
    environment; real-time pollution analysis,
    modelling and visualisation for the urban
    (London) milieu, with upgrade path mapped for
    pan-Europe roll-out (Ray Wrigley)
  • Global Geohazard e-Grid: optimised high bandwidth
    analysis and visualisation framework for European
    hazard monitoring (Steve Durucan)
  • UK Regional Power Quality e-Grid: load balancing
    and grid optimisation for simulated increasing
    renewable power loading (Geoff Rochester)

16
The Consortium
  • Industry connection: 4 spin-off companies, > 100
    related companies (AstraZeneca, Pfizer, GSK,
    Cisco, IBM, HP, Fujitsu, Gene Logic, Applera,
    Evotec, International Power, Hydro Quebec, BP,
    British Energy, ...)
  • Innovation: > 30 patents; world-class
    cross-disciplinary research
  • Wide coverage of LLTR fields
  • Long-lasting working relationship: close
    collaboration for making deliverables

17
Milestones
  • 9 months: basic middleware, DAQ, collection, data
    registration, information structuring for at
    least two fields (demo at Supercomputing 2002)
  • 18 months: scalable middleware, data
    normalisation, data integration, information
    composition, distributed mining and visualisation
    structure; integrated with USA infrastructure
  • 27 months: online data quality control,
    ontologies and data reference, scalable mining,
    demo of Virtual Cell and Environment Vista
  • 36 months: packaged D-Net as an e-science
    platform for national grid deployment, demo of
    all applications

18
Industrial Contribution
  • Hardware: sensors (photodiode arrays, hybrid
    photodiodes, PMTs), systems (optics, mechanical
    systems, DSPs, FPGAs)
  • Software (analysis packages, algorithms, data
    warehousing and mining systems)
  • Intellectual Property: access to IP portfolio
    suite at no cost (starting with 32 international
    patents)
  • Data: raw and processed data from biotechnology,
    pharmacogenomic, remote sensing (GUSTO
    installations, satellite data from geo-hazard
    programmes) and renewable energy data (from our
    own remote tidal power systems)
  • People: > 8 scientists

19
Project Management
  • Project PI and co-ordinator: Yike Guo
  • Project director and strategist: John Darlington
  • Applications co-ordinator: John Hassard
  • LFII biotechnology: J Harford
  • Protein chips: T Cass
  • Geohazard: S Durucan
  • Remote sensing: R Wrigley
  • Renewable energy: G Rochester
  • Project operation manager: Moustafa Ghanem
  • We will establish a Scientific Advisory Board by
    month 6.