Title: OGSA-DAI data access and integration
1. OGSA-DAI: data access and integration
- NERC GridGIS workshop
- eSI, 1 February 2006
2. Overview
- The Data Deluge
- challenges of increasing data availability
- benefits of bringing data together
- OGSA-DAI
- overview
- use as a data integration base layer
3. The Data Deluge
- Entering an age of data
- Data Explosion
- CERN LHC will generate 1 GB/s, 10 PB/year
- VLBA (NRAO) generates 1 GB/s today
- Pixar generates 100 TB per movie
- Storage getting cheaper
- Data stored in many different ways
- Data resources
- Relational databases
- XML databases / files
- Result files
- Need ways to facilitate
- Data discovery
- Data access
- Data integration
- Empower e-Business and e-Science
- The Grid is a vehicle for achieving this
4. Composing Observations in Astronomy
- Number and sizes of data sets as of mid-2002, grouped by wavelength
- 12 waveband coverage of large areas of the sky
- Total about 200 TB data
- Doubling every 12 months
- Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
5. Data Services: motives
- Key to Integration of Scientific Methods
- Publication and sharing of results
- Primary data from observation, simulation and experiment
- Encourages novel uses
- Allows validation of methods and derivatives
- Enables discovery by combining data collected independently
- Key to Large-scale Collaboration
- Economies: data production, publication and management
- Sharing cost of storage, management and curation
- Many researchers contributing increments of data
- Pooling annotation leads to rapid incremental publication
- Accommodates global distribution
- Data and code travel faster and more cheaply
- Accommodates temporal distribution
- Researchers assemble data
- Later (other) researchers access data
6. Data Services: challenges to management
- Scale
- Many sites, large collections, many uses
- Longevity
- Research requirements outlive technical decisions
- Diversity
- No one-size-fits-all solution will work
- Primary Data, Data Products, Meta Data, Administrative data, ...
- Many Data Resources
- Independently owned and managed
- No common goals
- No common design
- Work hard for agreements on foundation types and ontologies
- Autonomous decisions change data, structure, policy, ...
- Geographically distributed
- and I haven't even mentioned security yet!
7. Small problems
- Not just Grand Challenges!
- Also the small problems
- For instance
- What happens to data when a researcher leaves a team?
- How can a research leader point to popular data when a new researcher joins?
- How can you manage your data when you start to run out of local storage space?
- How do I get my data from one format/database to another?
- How do I combine my data with your data?
- You need to manage your data
8. What is a data service?
- An interface to a stored collection of data
- e.g. Google and Amazon
- web services
- But the data could be
- replicated
- shared
- federated
- virtual
- incomplete
- Don't care about the underlying representation
- do care about the information it represents
- Adding a service layer to existing data sources can improve composability
9. Examples of Data Services
- Many Data Services and applications
- Commercial databases
- Web interfaces
- Applications developed individually by groups and projects
- Also many places to get hold of public data
- Publications and citation servers
- Results servers
- But no such thing as a free lunch
- Things are not yet Plug and Play
- You need to expend some effort to use these services effectively
10. Use Cases for Data Services
- Data Filtering
- Single source producing large amounts of data distributed to many sites downstream
- Data Discovery
- many sources, many query entry points in a linked system
- Data Translation
- source to sink, conversion of data model / structure
- Data Federation
- many sources, linked to provide view as a single source
- Data Replication
- full or partial copies to improve throughput
- Data Integration (model aggregation)
- e.g. integration of time variant data, streams, files
- Data Integration (knowledge expansion)
- forming links between databases to increase knowledge
11. Trade-offs
- Speed vs completeness
- do you require the exact answer or an answer?
- Application specific vs language specific queries
- how will users interrogate a data service?
- Static system vs Dynamic Discovery
- do you actually have dynamic resources?
- Static vs Dynamic data
- READ only, READ/INSERT only, UPDATE permitted
- Static vs Dynamic queries
- optimisation over flexibility
- Intranet vs Internet
- speed over security
- Single data model versus mixed data models
- ease/speed over integration
- Queries vs Questions
- assume that we know the structure when we form the query
12. Requirements on Data Services?
- Common Data Model e.g. RowSet
- Common Query Language(s) e.g. XQuery, SQL
- Standard access to
- data resource schema information for schema mapping
- physical data resource information for optimisation purposes
- data resource descriptive information for discovery / integration
- Single, seamless security model
- Dynamic publication and discovery
- Multiple, efficient delivery methods
- Move computation towards data
- Data aggregation functionality
- Provenance information
- Replication information
13. OGSA-DAI In One Slide
- An extensible framework for data access and integration.
- Expose heterogeneous data resources to a grid through web services.
- Interact with data resources
- Queries and updates.
- Data transformation / compression
- Data delivery.
- Customise for your project using
- Additional Activities
- Client Toolkit APIs
- Data Resource handlers
- A base for higher-level services
- federation, mining, visualisation, ...
14. OGSA-DAI team
NeSC, Edinburgh
EPCC Team, Edinburgh
NEReSC, Newcastle
IBM Dissemination Team
IBM Development Team, Hursley
15. OGSA-DAI Design Principles I
- Efficient client-server communication
- Minimise where possible
- One request specifies multiple operations
- No unnecessary data movement
- Move computation to the data
- Utilise third-party delivery
- Apply transforms (e.g., compression)
- Build on existing standards
- Fill in gaps where necessary
- DAIS specifications from DAIS WG at GGF
16. OGSA-DAI Design Principles II
- Do not hide underlying data model
- Users must know where to target queries
- Data virtualisation is hard
- Extensible architecture
- Modular and customisable
- e.g., to accommodate stronger security
- Extensible activity framework
- Cannot anticipate all desired functionality
- Activity: a unit of functionality
- Allow users to plug in their own
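
The principles on these two slides (one request specifying multiple operations, transforms applied before delivery, and activities as pluggable units of functionality) can be illustrated with a toy pipeline. This is a minimal sketch in Python, not the OGSA-DAI activity API: the registry, the activity names (`selectRows`, `toCsv`, `gzip`) and the request layout are all hypothetical.

```python
import csv
import gzip
import io

# Hypothetical activity registry. Names and signatures are illustrative
# only and do not correspond to the OGSA-DAI activity API.
ACTIVITIES = {}


def activity(name):
    """Register a function as a named activity (a unit of functionality)."""
    def register(func):
        ACTIVITIES[name] = func
        return func
    return register


@activity("selectRows")
def select_rows(rows, column, value):
    # Stand-in for a query activity: filter rows held in memory.
    return [row for row in rows if row[column] == value]


@activity("toCsv")
def to_csv(rows):
    # Transformation activity: serialise rows as CSV text.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted(rows[0]) if rows else [])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


@activity("gzip")
def gzip_text(text):
    # Compression activity applied before the data leaves the server.
    return gzip.compress(text.encode("utf-8"))


def perform(request, data):
    """Run every activity named in one request, piping output to input."""
    for name, params in request:
        data = ACTIVITIES[name](data, **params)
    return data


# Made-up sample rows standing in for a data resource.
SITES = [
    {"id": 1, "region": "north"},
    {"id": 2, "region": "south"},
    {"id": 3, "region": "north"},
]

# One request names three operations; only the final, compressed
# result would travel back to the client.
REQUEST = [
    ("selectRows", {"column": "region", "value": "north"}),
    ("toCsv", {}),
    ("gzip", {}),
]

result = perform(REQUEST, SITES)
print(gzip.decompress(result).decode("utf-8"))
```

A user-supplied activity is added by decorating another function, which is the plug-in idea in miniature.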
17. The OGSA-DAI Framework
[Diagram: layered architecture. Application and Client Toolkit; OGSA-DAI service with Engine; Activities (SQLQuery, GZip, GridFTP, XPath, readFile, XSLT); Data Resources (JDBC, XMLDB, File); Databases (MySQL, DB2, XIndice, SWISS PROT, SQL Server)]
18. Intermediary
- Simple intermediary
- potential to accelerate development, logging, or filtering
- Persistent intermediary
- e.g. to allow efficient local indexing
19. Redirector, Coordinator, Network
- Allowing composition and decentralisation
20. Extensibility Example
[Diagram: OGSA-DAI service Engine with two SQLQuery activities, JDBC, Multiple SQL GDS, MySQL]
21. Map Retrieval: Current
22. Map Retrieval: Grid Prototype
23. Map Retrieval: Security
- Exploit NGS infrastructure to provide secure access layer
[Diagram labels: EDINA, NGS Authentication, Allowed users DN, SO-OGC, OGC, ODS 1, GIS, Oracle]
24. Map Retrieval: Integration
- Exploit OGSA-DAI extensibility to add e.g. overlay
25. OGSA-DAI / EDINA prototyping work
- Stage 1: Using existing OGSA-DAI technology
- Stage 2: Extending OGSA-DAI
[Diagram labels: GIS Client, Input Parameters (URL), OGSA-DAI service, DeliverFromURL, GIS Activities, HTTP Data Resource, HTTP Request, WMS Server, HTTP Response, Image/XML File]
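
The prototype passes a URL as an input parameter and has the service issue the HTTP request to the WMS server. As a rough illustration of what such a URL could look like, the sketch below builds a WMS GetMap request. It assumes WMS 1.1.1 parameter names; the host `wms.example.org`, the layer names and the bounding box are made up, and the WMS version actually used by the prototype is not stated on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_map_url(base_url, layers, bbox, width, height,
                srs="EPSG:27700", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap request URL (assumed version)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests the default style per layer
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical server and layers, for illustration only.
url = get_map_url(
    "http://wms.example.org/map",
    ["roads", "rivers"],
    (320000, 670000, 330000, 680000),
    512,
    512,
)
print(url)
```

A URL of this shape is what a delivery activity would fetch, with the image or XML response then available to further activities such as an overlay.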
26. Core features of OGSA-DAI I
- A framework for building applications
- Supports data access, insert and update
- Relational: MySQL, Oracle, DB2, SQL Server, Postgres
- XML: Xindice, eXist
- Files: CSV, BinX, EMBL, OMIM, SWISSPROT, ...
- Supports data delivery
- SOAP over HTTP
- FTP and GridFTP
- E-mail
- Inter-service
- Supports data transformation
- XSLT
- ZIP and GZIP
- Supports security
- X.509 certificate based security
27. Core features of OGSA-DAI II
- A framework for building data clients
- Client toolkit library for application developers
- A framework for developing functionality
- Extend existing activities, or implement your own
- Mix and match activities to provide functionality you need
- Highly extensible
- Customise our out-of-the-box product
- Provide your own services, client-side support and data-related functionality
- Comprehensive documentation and tutorials
- Latest release supports GT4.0 and Axis 1.2 / OMII_2 using Java 1.4
28. Distributed Query Processing
- Higher level services building on OGSA-DAI
- specialised metadata extraction
- Execute queries in parallel over multiple data resources
- Queries mapped to algebraic expressions for evaluation
- Parallelism represented by partitioning queries
- Use exchange operators
- Equality based joins in current release
- supported types: long, integer, string, double and float
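
The combination of partitioning, exchange operators and equality-based joins can be sketched as a partitioned hash join. This is a generic textbook illustration under stated assumptions, not the DQP implementation: the function names and the station/readings data are hypothetical, and threads stand in for evaluators running on separate nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, key, n):
    """Exchange step: route each row to a partition by hashing its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def hash_join(left, right, key):
    """Equality join of two partitions using an in-memory hash table."""
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], ())]


def parallel_join(left, right, key, n=2):
    """Join matching partitions independently, then merge the results."""
    left_parts = partition(left, key, n)
    right_parts = partition(right, key, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(hash_join, left_parts, right_parts, [key] * n)
        return [row for part in results for row in part]


# Made-up sample data standing in for two separate data resources.
STATIONS = [
    {"station": 1, "name": "Leith"},
    {"station": 2, "name": "Dunbar"},
    {"station": 3, "name": "Oban"},
]
READINGS = [
    {"station": 1, "level": 2.4},
    {"station": 3, "level": 1.1},
    {"station": 1, "level": 2.6},
    {"station": 4, "level": 0.9},
]

joined = parallel_join(STATIONS, READINGS, "station")
for row in sorted(joined, key=lambda r: (r["station"], r["level"])):
    print(row)
```

Because rows with equal keys always hash to the same partition, each pair of partitions can be joined without seeing the others, which is why an equality join is the natural first case to support.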
29. DQP architecture
30. GridMiner Data Mediation Service
- Principles
- Tight Federation
- global (relational) schema
- Virtual integration
- leave the data where it is
- always up-to-date data
- Build on data access from OGSA-DAI
- Not bound to special architecture
- Supported data sources
- RDBMS (via JDBC), XMLDB (Xindice), CSV files
- Operators: UNION ALL and inner join
- Operators are XQuery based (using SAXON)
31. Data Integration Scenario
- Heterogeneities
- Name in A is "First Last" (as the target format)
- Name in C has to be combined
- Distribution
- 3 data sources
- Java based schema mapping to global schema
- types limited by WebRowSet
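
The mapping described above (one source already in the target "First Last" format, another whose name columns have to be combined) can be sketched as per-source mapping functions followed by a UNION ALL. The slide describes a Java-based mapping; this Python version is only an illustration, and every column name other than `p_name` and the `patient` table, as well as the sample rows, is hypothetical.

```python
def map_source_a(row):
    # Source A already stores the name in the target "First Last" format.
    return {"id": row["id"], "p_name": row["name"]}


def map_source_c(row):
    # Source C splits the name, so the mapping combines the two columns.
    return {"id": row["pid"], "p_name": f"{row['first']} {row['last']}"}


def union_all(*mapped_sources):
    """UNION ALL over already-mapped sources: concatenate, keep duplicates."""
    return [row for source in mapped_sources for row in source]


# Made-up rows with hypothetical column names.
SOURCE_A = [{"id": 10, "name": "Ada Lovelace"}]
SOURCE_C = [{"pid": 11, "first": "Alan", "last": "Turing"}]

# The global "patient" relation is virtual: it is built on demand from
# the sources, which stay where they are.
patient = union_all(
    [map_source_a(r) for r in SOURCE_A],
    [map_source_c(r) for r in SOURCE_C],
)
print([row["p_name"] for row in patient])
```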
32. Data Integration Scenario (cont.)
- Query
- SELECT p_name FROM patient WHERE id = 10
[Diagram: query plan, standard to optimized]
33. caBIG
- Object-Oriented view of data
- Data types are well-defined and registered in a repository
- Standardized metadata facilitates discovery
- custom query language implemented as an activity
34. LEAD
35. FirstDIG
- Data mining with the First Transport Group, UK
- Example: When buses are more than 10 minutes late there is an 82% chance that revenue drops by at least 10%
- "The results of this exercise will revolutionise the way we do things in the bus industry.", Darren Unwin, Divisional Manager, First South Yorkshire.
- Client based joins, using temporary tables
[Diagram: Data Mining Application and OGSA-DAI Client Application in front of four OGSA-DAI services]
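
A client-based join using temporary tables can be sketched as follows: the client queries one resource, uploads the rows into a temporary table beside a second resource, and lets that database do the join. This is a minimal sketch, not the FirstDIG code. Two in-memory SQLite databases stand in for the remote resources, and the table names, column names and figures are all made up.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two remote data resources.
# All table names, columns and values are invented for illustration.
timetable_db = sqlite3.connect(":memory:")
timetable_db.execute("CREATE TABLE delays (route TEXT, minutes_late INTEGER)")
timetable_db.executemany(
    "INSERT INTO delays VALUES (?, ?)",
    [("X1", 12), ("X2", 3), ("X3", 15)],
)

revenue_db = sqlite3.connect(":memory:")
revenue_db.execute("CREATE TABLE revenue (route TEXT, change_pct REAL)")
revenue_db.executemany(
    "INSERT INTO revenue VALUES (?, ?)",
    [("X1", -11.0), ("X2", 1.5), ("X3", -4.0)],
)


def client_join(source, target, threshold):
    """Join late routes from `source` with revenue held in `target`."""
    # 1. Query the first resource and pull the filtered rows to the client.
    late = source.execute(
        "SELECT route, minutes_late FROM delays WHERE minutes_late > ?",
        (threshold,),
    ).fetchall()
    # 2. Upload them into a temporary table beside the second resource.
    target.execute(
        "CREATE TEMP TABLE late_routes (route TEXT, minutes_late INTEGER)"
    )
    target.executemany("INSERT INTO late_routes VALUES (?, ?)", late)
    # 3. Let the second database perform the join, then clean up.
    rows = target.execute(
        "SELECT l.route, l.minutes_late, r.change_pct "
        "FROM late_routes AS l JOIN revenue AS r ON r.route = l.route "
        "ORDER BY l.route"
    ).fetchall()
    target.execute("DROP TABLE late_routes")
    return rows


print(client_join(timetable_db, revenue_db, 10))
```

Filtering before the upload keeps the data crossing the network small, at the cost of requiring write access for temporary tables on the second resource.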
36. OGSA-DAI Challenges
- Metadata extraction
- define a common model for e.g. database schema?
- Intermediate representation
- between multiple models (relational, XML, ...)
- XML WebRowSet is flexible (cf. GridMiner) but expansive
- DFDL and GridFTP/parallel HTTP?
- Query definition
- translation of queries
- aggregation of results
- Data transport and workflow
- workflow is typically compute driven
- Move computation to data
- mobile code activities?
- data services hosted on DBMS?
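
The remark that an XML WebRowSet is flexible but expansive can be made concrete by serialising the same rows twice and comparing sizes. The sketch below uses a simplified WebRowSet-like layout, not the real WebRowSet schema, and the rows are synthetic, so the ratio it prints is only indicative.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(columns, rows):
    """Serialise rows in a simplified WebRowSet-like layout (not the real schema)."""
    root = ET.Element("webRowSet")
    meta = ET.SubElement(root, "metadata")
    for name in columns:
        ET.SubElement(meta, "column-name").text = name
    data = ET.SubElement(root, "data")
    for row in rows:
        current = ET.SubElement(data, "currentRow")
        for value in row:
            # Every value is wrapped in its own element, which is where
            # most of the size overhead comes from.
            ET.SubElement(current, "columnValue").text = str(value)
    return ET.tostring(root, encoding="unicode")


def rows_to_csv(columns, rows):
    """Serialise the same rows as CSV for comparison."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(columns)
    writer.writerows(rows)
    return out.getvalue()


# Synthetic rows, for illustration only.
COLUMNS = ["id", "p_name"]
ROWS = [(i, f"patient-{i}") for i in range(1000)]

xml_size = len(rows_to_xml(COLUMNS, ROWS).encode("utf-8"))
csv_size = len(rows_to_csv(COLUMNS, ROWS).encode("utf-8"))
print(f"XML: {xml_size} bytes, CSV: {csv_size} bytes, "
      f"ratio {xml_size / csv_size:.1f}x")
```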
37. Contributing to OGSA-DAI
- Additional functionality
- Provide activities which implement specific functionality
- Provide extra client functionality
- Provide different security mechanisms
- Provide higher level components and applications
- Different levels of contributions
- Based on OGSA-DAI?
- Works with OGSA-DAI?
- Part of OGSA-DAI?
38. In the near future
- A new version of the OGSA-DAI Engine
- should look mostly the same externally
- better support for concurrency, sessions and monitoring
- Implementing new versions of specifications
- DAIS Specifications
- Key things that we will be addressing
- Performance
- A Security Model which can be applied across platforms
- Full Transactions framework, distributed transactions
- More data integration facilities
- Better abstraction over DBMS variation
- Application centric queries
- collaborating with other projects
- Research projects looking at
- schema mapping
- extended data resources
39. Associated Meetings and Workshops
- DIALOGUE Workshops (http://www.datagrids.org)
- Data Integration Applications Linking Organisations to Gain Understanding and Experience
- Bringing together Data Integration middleware and application providers with users
- Next one at NeSC 9-10th February 2006
- http://www.nesc.ac.uk/esi/events/636/
- Next Generation Distributed Data Management (HPDC15, Paris)
- http://www.isi.edu/annc/distributedDataWorkshop.html
- Data Management on Grids (VLDB06, Seoul)
40. Conclusions
- The benefits of trying to integrate data are hindered by challenges such as heterogeneity, scale and distribution
- A common data service layer should make data integration easier
- OGSA-DAI provides an extensible, data service based framework which makes it easier to implement data integration
- GIS data is amenable to integration using data services
41. Further information
- The OGSA-DAI Project Site
- http://www.ogsadai.org.uk
- The DAIS-WG site
- http://forge.gridforum.org/projects/dais-wg/
- OGSA-DAI Users Mailing list
- users_at_ogsadai.org.uk
- General discussion on grid DAI matters
- Formal support for OGSA-DAI releases
- http://bugs.ogsadai.org.uk/
- OGSA-DAI training courses