Title: Managing Grids with Information and Knowledge that are Incomplete
1MAGIK-I
- Managing Grids with Information and Knowledge that are Incomplete
- Andy Cooke, Alasdair Gray, Lisha Ma and Werner Nutt
- 8th June 2004
- Royal Observatory, Edinburgh
2Who are we?
- Part of the Database Group at Heriot-Watt University
- Interested in information integration
- Grid Monitoring
- in collaboration with DataGrid/EGEE
- funded by EPSRC
- Theoretical work
- e.g. query answering using views
- Integrating distributed data streams
- Query languages for streams (Lisha Ma)
- Managing views over streams (Alasdair Gray)
- Integrating biological data
3What is MAGIK-I?
- How can one handle incompleteness in an information-integration setting?
- An EPSRC-funded project (until Sept. 2007)
- part of Semantic Grid Initiative
- Collaborating with
- Bob Mann (ROE, AstroGrid project)
- Steve Fisher (RAL, EGEE project)
- Objectives
- develop a logical framework to back up code solutions
- extract requirements from collaborators
- use test-bed to try out ideas (and get feedback)
4A Common Problem
- Users want to obtain information published on a Grid, but
- many sources to find!
- how is their data related?
- which sources have relevant data?
- what query should be posed?
- Also
- possibly different data models to interact with
- distributed query processing is hard!
5Information Integration: the Paradigm
- Addresses problems 1 - 4 on previous slide
- not concerned with distributed query processing,
- nor how to accommodate different data models
- The general approach
- define a global schema,
- users query virtual database,
- mediator translates query into a distributed query over sources
6What do Mediators need?
- A mapping that relates each source schema with the global schema, e.g.
- global database described as view over sources
- sources described as view over global database
- a combination of these.
- A global query language
- A common source query language
- Information about capabilities of sources
- what queries do they support?
- how complete are they?
7Example: Grid Monitoring with the R-GMA System
- Allows Grid middleware, e.g. a broker, to find out about the state of Grid resources
- Offers APIs to users
- Producer (for publishing information streams), or
- Consumer (for posing queries against a global schema)
- APIs are supported by
- agents (work on behalf of producers and consumers)
- smart registry (can match queries with views)
- republishers (collect information together to optimize queries)
8R-GMA: Bird's Eye View
9Producers Register their View
Stream Producer 1 publishes and registers
SELECT * FROM CPULoad WHERE country = 'UK' AND site = 'RAL'
Stream Producer 2 publishes and registers
SELECT * FROM CPULoad WHERE country = 'UK' AND site = 'HW'
10Views Map to a Global Table
11How does R-GMA's Mediator work?
- Find relevant producers using a satisfiability check
- Query: SELECT ... WHERE site = 'RAL'
- View: ... WHERE site = 'HW', so this producer can never contribute!
- Choose the best plan
- e.g. contact one republisher, rather than 40 producers!
- Execute plan
- switch to alternative plan if first fails
- currently no distributed query processor is used
- Return answer, and report if incomplete
- by appending a warning to a result set
- but what about more complex views?
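The satisfiability check above can be sketched for the simple case on these slides, where a view and a query are both conjunctions of attribute = constant conditions. This is a minimal illustration, not R-GMA's actual code: the registry layout and the names `satisfiable_together` and `relevant_producers` are assumptions made for the example.

```python
# Minimal sketch: views and queries are conjunctions of "attribute = constant".
# The registry layout and all function names are hypothetical.

def satisfiable_together(query_pred, view_pred):
    """True unless some attribute is pinned to two different constants."""
    return all(view_pred[attr] == value
               for attr, value in query_pred.items()
               if attr in view_pred)

# Hypothetical registry: producer name -> equality conditions of its view
registry = {
    "producer1": {"country": "UK", "site": "RAL"},
    "producer2": {"country": "UK", "site": "HW"},
}

def relevant_producers(query_pred, registry):
    """Producers whose view is consistent with the query's conditions."""
    return sorted(name for name, view in registry.items()
                  if satisfiable_together(query_pred, view))

# Query "WHERE site = 'RAL'": the producer registered for site HW is ruled out.
relevant = relevant_producers({"site": "RAL"}, registry)
print(relevant)
```

Views with joins or inequalities need a stronger check than this dictionary comparison, which is the point the last bullet raises.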
12Information Manifold: Supports Join Views
- Global schema for employer's virtual db
- employees emp(eName)
- phone numbers phone(eName, phoneNo)
- managers mgr(eName, mName)
- departments dept(eName, dept)
- office office(eName, office)
- Three source relations
- S1(E,M,P) employees with managers and phones
- S2(E,O,D) employees with office and department
- S3(E,P) employees of toy department and phone
13How Can we Describe the Sources?
- S1(E, M, P) (employees with managers and phones)
- contains answers to the query
- SELECT E.eName, M.mName, P.phoneNo
- FROM emp E, mgr M, phone P
- WHERE E.eName = M.eName AND
- E.eName = P.eName
- Shorthand notation
- S1(E,M,P) <- emp(E), mgr(E,M), phone(E,P)
14Shorthand Notation for Source Descriptions and Queries
- Sources
- S1(E,M,P) <- emp(E), mgr(E,M), phone(E,P)
- S2(E,O,D) <- emp(E), office(E,O), dept(E,D)
- S3(E,P) <- emp(E), phone(E,P), dept(E,toys)
- Query: What is Sally's phone and office?
- q(P,O) <- phone(sally,P), office(sally,O)
15Query Plans
- Two plans are possible
- p1(P,O) <- s1(sally,M,P), s2(sally,O,D)
- p2(P,O) <- s3(sally,P), s2(sally,O,D)
- How good are these plans?
- How complete are s1 and s3 wrt phone numbers?
- What if s1 and s3 return different phone numbers?
- What if a source contains nulls?
- Matching queries and sources is harder than in today's R-GMA
- but some research systems have been built
- e.g. Information Manifold
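To make the question "what if s1 and s3 return different phone numbers?" concrete, here is a small sketch that runs both plans over toy data. The source contents are invented purely for illustration, and S1 is read as (employee, manager, phone), following its source description.

```python
# Hypothetical source contents, invented for illustration only.
s1 = {("sally", "mike", "555-0101")}   # S1(E, M, P)
s2 = {("sally", "room 12", "toys")}    # S2(E, O, D)
s3 = {("sally", "555-0199")}           # S3(E, P)

def plan_p1(name):
    """p1(P,O) <- s1(name,M,P), s2(name,O,D): join S1 and S2 on the employee."""
    return {(p, o) for (e1, m, p) in s1 if e1 == name
                   for (e2, o, d) in s2 if e2 == name}

def plan_p2(name):
    """p2(P,O) <- s3(name,P), s2(name,O,D): join S3 and S2 on the employee."""
    return {(p, o) for (e1, p) in s3 if e1 == name
                   for (e2, o, d) in s2 if e2 == name}

answers_p1 = plan_p1("sally")
answers_p2 = plan_p2("sally")

# Both plans are sound, so a mediator would normally return their union ...
union = answers_p1 | answers_p2
# ... and could flag the case where the sources disagree on the phone number.
conflicting = len({phone for phone, _ in union}) > 1
print(union, conflicting)
```

With this toy data the two plans agree on the office but not on the phone, so the union holds two answers and the conflict flag is set.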
16Could AstroGrid benefit from a Mediator?
- Typical query
- "Find objects that appear in x-ray but don't appear in infra-red, for these sky coordinates"
- A mediator could
- identify relevant sources.
- "don't bother considering this db, its sky coverage never overlaps with any x-ray db"
- build local queries/workflows on behalf of user
- estimate coverage of a user's query
- "your query would only cover 5% of what you want!"
- "you can only get answers from the southern hemisphere" (intensional answer)
- so what is needed?
17What would AstroGrid's Mediator Need?
- A global schema
- seems work has started, e.g. UCDs
- A common query language
- you are working on this
- do sources have a uniform interface?
- Mappings that relate source schemas to global schema
- source descriptions (UCDs) are registered
- but may need to be more expressive
- Other information would help
- how complete is the view description?
- is the source republishing data?
- what queries/algorithms can be processed?
- are there access restrictions?
- challenging, as views and queries are complex!
18Incompleteness when Integrating Data
- Databases cover different areas of sky
- DB1 contains all optical objects in its area
- DB2 contains all x-ray objects in its area
- "Give me all objects in optical that are not in x-ray"
- Useful concepts like certain answer and possible answer arise from the information-integration setting
- can compare global concepts with what is in database
- now consider types of incompleteness in AstroGrid
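The difference between certain and possible answers for the query on this slide can be sketched as follows. The sketch makes several simplifying assumptions that are not in the talk: the sky is a 1-D range of positions, objects in the two databases are matched by a shared identifier, and each database is complete within its own area, as the slide states.

```python
# Hypothetical toy data: positions on a 1-D "sky", objects matched by id.
xray_area = (50, 150)                          # area covered by DB2
optical_objects = {"a": 10, "b": 60, "c": 80}  # DB1: id -> position
xray_objects = {"c": 80, "d": 120}             # DB2: id -> position

def inside(position, area):
    low, high = area
    return low <= position <= high

def optical_not_xray(optical_objects, xray_objects, xray_area):
    """Split answers to 'optical but not x-ray' into certain and only-possible."""
    certain, only_possible = set(), set()
    for obj, position in optical_objects.items():
        if obj in xray_objects:
            continue                 # seen in x-ray: not an answer at all
        if inside(position, xray_area):
            certain.add(obj)         # DB2 is complete here and lacks the object
        else:
            only_possible.add(obj)   # outside DB2's area: x-ray status unknown
    return certain, only_possible

certain, only_possible = optical_not_xray(optical_objects, xray_objects, xray_area)
print(certain, only_possible)
```

Object "b" lies inside the x-ray coverage and is absent from DB2, so it is a certain answer; object "a" lies outside that coverage, so it is only a possible one.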
19UCD Imprecision
Incompleteness due to UCD imprecision, e.g. "UCD list cannot enumerate all optical bandpasses" (from Bob Mann)
- Make UCDs more expressive, to describe more precisely what databases contain
- e.g. source view "where x < wavelength < z and ..."
- Mediator can then reason, and accurately identify relevant databases
- e.g. whenever user asks "where wavelength between (x, y)"
- Makes it easier to construct good workflows automatically
20Sky Coverage Problem
- Incompleteness in sky coverage
- [Figure: the sky region "I want" compared with the region "You have"]
- Registry can't contain a full description of db
- So approximately describe coverage?
- Mediator could estimate quality
- "your query plan would only give 2% coverage"
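One way a mediator might estimate coverage from an approximate description is sketched below. It assumes, purely for illustration, that both the query region and a source's registered coverage are rectangles in right ascension and declination, and it ignores spherical geometry. The function name and box layout are hypothetical.

```python
def overlap_fraction(query, source):
    """Fraction of the query box covered by the source box.

    Boxes are (ra_min, ra_max, dec_min, dec_max) in degrees.
    Flat-sky approximation: areas are computed as plain rectangles.
    """
    q_ra0, q_ra1, q_dec0, q_dec1 = query
    s_ra0, s_ra1, s_dec0, s_dec1 = source
    # Extent of the intersection along each axis (zero if the boxes are disjoint)
    width = max(0.0, min(q_ra1, s_ra1) - max(q_ra0, s_ra0))
    height = max(0.0, min(q_dec1, s_dec1) - max(q_dec0, s_dec0))
    query_area = (q_ra1 - q_ra0) * (q_dec1 - q_dec0)
    return (width * height) / query_area

# Hypothetical boxes: the source only clips one corner of the requested region.
query_box = (0.0, 10.0, 0.0, 10.0)
source_box = (8.0, 20.0, 9.0, 30.0)
coverage = overlap_fraction(query_box, source_box)
print(f"your query plan would only give {coverage:.0%} coverage")
```

A plan that combines several sources would need the area of the union of their overlaps, which this single-source sketch does not attempt.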
21Precision Problem
- Incompleteness in spectral coverage
- "I want a flux at 1.5mm, you have fluxes at 1.4mm and 1.6mm: is that good enough?"
- Users specify precision in query?
- "I want flux at 1.5mm +/- 0.2mm"
- Enhance source views (UCDs)
- "source A where flux 1.4mm +/- 0.1mm"
- Mediator can then reason
- yes, source A is suitable!
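The reasoning step on this slide can be sketched as an interval check. The sketch assumes one particular reading of "suitable": the wavelength range a source describes must lie entirely within the range the user will accept. Exact fractions are used because binary floating point makes 1.4 - 0.1 come out slightly below 1.3, which would wrongly fail the boundary case.

```python
from fractions import Fraction

def interval(centre, tolerance):
    """Closed interval [centre - tolerance, centre + tolerance], in mm."""
    c, t = Fraction(centre), Fraction(tolerance)
    return (c - t, c + t)

def suitable(wanted, offered):
    """Assumed rule: the offered range must lie within the wanted range."""
    wanted_low, wanted_high = wanted
    offered_low, offered_high = offered
    return wanted_low <= offered_low and offered_high <= wanted_high

wanted = interval("1.5", "0.2")     # user: 1.5mm +/- 0.2mm
source_a = interval("1.4", "0.1")   # source A: 1.4mm +/- 0.1mm
source_a_ok = suitable(wanted, source_a)
print(source_a_ok)
```

Other readings are possible, for instance requiring only that the two ranges overlap; which rule is right would depend on what users mean by precision.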
22Service Unavailability
- Incompleteness due to service unavailability
- Registry says there's relevant data, but data centre is currently offline
- Mediator needs to keep track of availability
- e.g. send ping messages?
- Could suggest alternative workflows
- e.g. use republisher (warehouse) if possible.
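Choosing an alternative workflow when a service is down could look roughly like this. The service names, the availability table and the function `first_available_plan` are all hypothetical; a real mediator would replace the table lookup with some liveness check such as the ping messages suggested above.

```python
def first_available_plan(plans, is_available):
    """Return the first plan whose services are all reachable, else None."""
    for plan in plans:
        if all(is_available(service) for service in plan):
            return plan
    return None

# Hypothetical availability table, standing in for real ping results
status = {"xray_centre": False, "ir_centre": True, "warehouse": True}

# Plans in order of preference; each plan lists the services it must contact
plans = [
    ["xray_centre", "ir_centre"],   # preferred: query the data centres directly
    ["warehouse"],                  # alternative: use a republisher (warehouse)
]

chosen = first_available_plan(plans, lambda service: status.get(service, False))
print(chosen)
```

Returning `None` when no plan is usable gives the mediator a place to report that the answer is unavailable rather than silently incomplete.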
23Null Values
- Incompleteness in the Registry
- Allows "Not applicable", "Unknown" and "Not provided" entries
- Do these express a projection view?
- e.g. view "select a,b from aTable where ..."
- What if query refers to missing attribute?
- e.g. in the select clause: pad tuples with nulls?
- e.g. in a condition "where c = val": mark possible answers?
- would need to provide semantics of any nulls that are added
- also, nulls in databases
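The two options raised on this slide, padding with nulls and marking possible answers, can be sketched together. The sketch assumes a source that exposes only attributes a and b, and it treats an added null (Python's `None`) as "value unknown", which is only one of the semantics the slide says would have to be chosen. The function names are hypothetical.

```python
# Hypothetical source: a projection view exposing a and b but not c
rows = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]

def pad(rows, attributes):
    """Extend each tuple to the requested attributes, using None for missing ones."""
    return [{attr: row.get(attr) for attr in attributes} for row in rows]

def select_where(rows, attr, value):
    """Split rows into certain matches and possible matches (attribute unknown)."""
    certain, possible = [], []
    for row in rows:
        if row[attr] is None:
            possible.append(row)     # null read as "unknown": might match
        elif row[attr] == value:
            certain.append(row)
    return certain, possible

padded = pad(rows, ["a", "b", "c"])
certain, possible = select_where(padded, "c", 5)
print(certain, possible)
```

Under a "not applicable" reading of the null, the rows in `possible` would instead be dropped, which is why the semantics must be stated.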
24Conclusions
- A mediator for AstroGrid?
- access to sources is planned manually just now
- looks like a natural setting for a mediator
- many relevant pieces in place (query languages, ...)
- would be interesting to explore idea!
- Mediator could help with incompleteness problems
- as incompleteness is w.r.t. something else,
- and so something else must be provided!
25Proposal
- Set up a toy scenario to explore ideas
- we would need technical support with this
- plug trial components into AstroGrid?
- Iterative development
- users prioritise new features
- we develop solutions
- users give us feedback
- we develop supporting logical model