Title: Quarrying Unfamiliar DataSpaces
1Quarrying Unfamiliar DataSpaces
- Bill Howe
- David Maier
- Nick Rayner
- Sponsored by the NSF ITR Program 2001-2006
- In collaboration with Antonio Baptista, Paul Turner, and the entire CORIE Environmental Science Team at OGI
- http://www.cs.pdx.edu/howe
2Dataspaces
- DataSpace (DS)
- Autonomous, heterogeneous data sources
- grouped by an identifiable scope
- with respect to a set of requirements
- DataSpace Support Platform (DSSP)
- A collection of best effort services
- Catalog and Browse
- Search and Query
- Workflow (Events, Actions, and Monitoring)
- Integrity checks/guarantees
From Databases to Dataspaces: A New Abstraction for Information Management, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
3Databases vs. Dataspaces
Dataspaces:
- Data Coexistence
- Autonomous Sources
- Search, Browse, Approximate Answer
- Best Effort Guarantees
Databases:
- Single Schema
- Centralized Administration
- Structured Query
- Strict Integrity Constraints
4Dataspaces vs. Data Integration Systems
[Figure: insular databases DB1, DB2, and DB3 combined into an Integrated DB]
- May use Local-as-view or Global-as-view
- Always requires semantic integration
5Semantic Web vs. Dataspaces
Semantic Web:
- Autonomous agents crawling richly described autonomous data sources, which are related formally via ontologies
Dataspaces:
- No ontology
- Probably no inferencing
- Pay as you go
6Dataspace Timeline
[Figure: timeline along a "time, scope" axis, running from insular, application-specific databases to autonomous agents crawling richly described data sources integrated via an ontology]
7Example Scientific Data Repository
- Atmospheric forcings
- River forcings
- Global ocean forcings
- Sensor Data
- Ocean simulation results
- Configuration and log files
- Annotations
- Data Products
- salinity: /anim-sal_estuary_7.gif
8Example Pharmacology
RxNav Interface developed by the National
Library of Medicine
9Dataspace Timeline
[Figure: utility plotted against time, scope, effort]
10Unfamiliar Dataspaces
- No schema is available
- No query workload is available
- Browse is the dominant interaction
- keys, ids, URIs not directly useful
- properties and values alone convey meaning
- Goal: Maximize return on effort when working with an unfamiliar dataspace
11Green Field Tools for Unfamiliar Dataspaces
- Goal: A working, extensible application with the least possible effort
- We need at least
- a Data Model
- Lowest Common Denominator
- minimal modeling decisions
- an API
- narrow interface
- uniformly efficient
12Outline
- Dataspaces
- Quarry Data Model
- Quarry Storage
- Quarry API
- Experimental Results
- Wrap up
13Quarry Data Model
- resource, property, value
- (subject, predicate, object) if you prefer
- no explicit distinction between literal values and resource values
- no explicit types or classes
- no variables (no inference)
- no context/provenance (yet)
14Example Pharmacology
Concept
Relationship
Atom
15Example Scientific Data Repository
/anim-sal_estuary_7.gif
16Outline
- Dataspaces
- Quarry Data Model
- Quarry Storage
- Quarry API
- Experimental Results
- Wrap up
17Some Storage Models
- Schema dependent storage (RDFS)
- We assume schema is unavailable
- Indexed Triple Store
- Property Tables
18Simple Idea
- Signatures
- resources expressing the same properties clustered together
- Posit that |Signature| << |Resource| (far fewer signatures than resources)
- Queries evaluated over Signature Extents
191) Triple Store
A query in SPARQL/RDQL:
select ?v where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?v)

Triples
rsrc  property  value
101   depth     7
336   variable  temp
101   path      /iso_e_s_7.gif
101   variable  salt
843   channel   north
843   variable  salt
336   path      /trans_s_t.gif
843   path      /trans_n_s.gif
336   channel   south
101   region    estuary

and in SQL:
SELECT p.value AS path
FROM Triples r, Triples v, Triples d, Triples p
WHERE r.property = 'region' AND r.value = 'estuary'
  AND v.property = 'variable' AND v.value = 'salt'
  AND d.property = 'depth' AND d.value = '7'
  AND p.property = 'path'
  AND r.rsrc = v.rsrc AND v.rsrc = d.rsrc AND d.rsrc = p.rsrc

One join per condition
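The "one join per condition" pattern can be tried out directly. The following is a minimal sketch, assuming an in-memory SQLite database loaded with the example triples; `build_triples` and `conjunctive_paths` are hypothetical helper names, not part of Quarry.

```python
import sqlite3

# Example triples from the slide; values are stored as text (an assumption).
ROWS = [
    (101, "depth", "7"), (336, "variable", "temp"),
    (101, "path", "/iso_e_s_7.gif"), (101, "variable", "salt"),
    (843, "channel", "north"), (843, "variable", "salt"),
    (336, "path", "/trans_s_t.gif"), (843, "path", "/trans_n_s.gif"),
    (336, "channel", "south"), (101, "region", "estuary"),
]

def build_triples(rows=ROWS):
    """Create an in-memory Triples(rsrc, property, value) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Triples (rsrc INTEGER, property TEXT, value TEXT)")
    conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", rows)
    return conn

def conjunctive_paths(conn, conditions, target="path"):
    """Answer a conjunctive query with one self-join per (property, value) condition."""
    aliases = [f"t{i}" for i in range(len(conditions))]
    froms = ", ".join(f"Triples {a}" for a in aliases + ["tgt"])
    where, params = [], []
    for alias, (prop, val) in zip(aliases, conditions):
        # Each condition contributes one more copy of Triples joined on rsrc.
        where.append(
            f"{alias}.property = ? AND {alias}.value = ? AND {alias}.rsrc = tgt.rsrc"
        )
        params += [prop, val]
    where.append("tgt.property = ?")
    params.append(target)
    sql = f"SELECT tgt.value FROM {froms} WHERE " + " AND ".join(where)
    return [row[0] for row in conn.execute(sql, params)]
```

For example, `conjunctive_paths(build_triples(), [("region", "estuary"), ("variable", "salt"), ("depth", "7")])` returns `['/iso_e_s_7.gif']`, and the generated FROM clause grows by one `Triples` alias for every added condition.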
201) Triple Store
select ?v where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?v)

SELECT MAX(CASE WHEN property = 'path' THEN value END) AS path
FROM Triples
GROUP BY rsrc
HAVING MAX(CASE WHEN property = 'region' THEN value END) = 'estuary'
   AND MAX(CASE WHEN property = 'variable' THEN value END) = 'salt'
   AND MAX(CASE WHEN property = 'depth' THEN value END) = '7'

but can't exploit indexes
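The join-free formulation can be checked the same way. This is a small sketch, assuming SQLite and text-valued triples; the helper names are hypothetical. The query pivots each resource's triples inside one GROUP BY, so the filter is applied to computed aggregates rather than to indexable columns.

```python
import sqlite3

ROWS = [
    (101, "depth", "7"), (336, "variable", "temp"),
    (101, "path", "/iso_e_s_7.gif"), (101, "variable", "salt"),
    (843, "channel", "north"), (843, "variable", "salt"),
    (336, "path", "/trans_s_t.gif"), (843, "path", "/trans_n_s.gif"),
    (336, "channel", "south"), (101, "region", "estuary"),
]

# One pass over Triples: group by resource, pivot properties with CASE,
# then filter whole groups in HAVING.
PIVOT_SQL = """
SELECT MAX(CASE WHEN property = 'path' THEN value END) AS path
FROM Triples
GROUP BY rsrc
HAVING MAX(CASE WHEN property = 'region' THEN value END) = 'estuary'
   AND MAX(CASE WHEN property = 'variable' THEN value END) = 'salt'
   AND MAX(CASE WHEN property = 'depth' THEN value END) = '7'
"""

def build_triples(rows=ROWS):
    """Create an in-memory Triples(rsrc, property, value) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Triples (rsrc INTEGER, property TEXT, value TEXT)")
    conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", rows)
    return conn

def pivot_paths(conn):
    """Run the GROUP BY / HAVING query and return the matching paths."""
    return [row[0] for row in conn.execute(PIVOT_SQL)]
```

`pivot_paths(build_triples())` returns `['/iso_e_s_7.gif']`. Resources lacking a property produce NULL for that aggregate, so they fail the HAVING comparison and drop out.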
212) Property Tables
select ?p where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?p)

depth
rsrc  value
101   7

region
rsrc  value
101   estuary

variable
rsrc  value
336   temp
101   salt
843   salt

path
rsrc  value
101   /iso_e_s_7.gif
336   /trans_s_t.gif
843   /trans_n_s.gif

channel
rsrc  value
843   north
336   south

select p.value
from region r, variable v, depth d, path p
where r.value = 'estuary' and v.value = 'salt' and d.value = 7
  and r.rsrc = v.rsrc and v.rsrc = d.rsrc and d.rsrc = p.rsrc
223) Signature Tables
select ?p where (?r, <s:region>, <s:estuary>), (?r, <s:variable>, <s:salt>), (?r, <s:depth>, <s:7>), (?r, <s:path>, ?p)

S1: variable, channel, path
rsrc  variable  channel  path
336   temp      south    /trans_s_t.gif
843   salt      north    /trans_n_s.gif

S2: depth, region, variable, path
rsrc  depth  region   variable  path
101   7      estuary  salt      /iso_e_s_7.gif

select path from S2 where region = 'estuary' and variable = 'salt' and depth = 7
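To make the signature-table idea concrete, here is a minimal sketch, assuming SQLite and the two example signatures S1 and S2. The routing rule (query only the tables whose columns cover every property mentioned) is an illustration of the approach, not Quarry's actual implementation, and `build` and `query` are hypothetical names.

```python
import sqlite3

# Signature name -> property columns, as on the slide.
SIGNATURES = {
    "S1": ("variable", "channel", "path"),
    "S2": ("depth", "region", "variable", "path"),
}
# Extents: one wide row per resource, columns in signature order.
EXTENTS = {
    "S1": [(336, "temp", "south", "/trans_s_t.gif"),
           (843, "salt", "north", "/trans_n_s.gif")],
    "S2": [(101, "7", "estuary", "salt", "/iso_e_s_7.gif")],
}

def build():
    """Create one table per signature and load its extent."""
    conn = sqlite3.connect(":memory:")
    for name, props in SIGNATURES.items():
        cols = ", ".join(f"{p} TEXT" for p in props)
        conn.execute(f"CREATE TABLE {name} (rsrc INTEGER, {cols})")
        marks = ", ".join("?" * (len(props) + 1))
        conn.executemany(f"INSERT INTO {name} VALUES ({marks})", EXTENTS[name])
    return conn

def query(conn, conditions, target):
    """Evaluate property=value conditions over every covering signature table.

    Column names come from SIGNATURES and are trusted in this sketch.
    """
    needed = set(conditions) | {target}
    results = []
    for name, props in SIGNATURES.items():
        if needed <= set(props):  # this table has every column the query mentions
            where = " AND ".join(f"{p} = ?" for p in conditions)
            sql = f"SELECT {target} FROM {name}" + (f" WHERE {where}" if where else "")
            results += [row[0] for row in conn.execute(sql, list(conditions.values()))]
    return results
```

`query(build(), {"region": "estuary", "variable": "salt", "depth": "7"}, "path")` touches only S2 and needs no join, while `{"variable": "salt"}` is covered by both tables and becomes a union of two single-table scans.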
23Choosing a Storage Model
- Sources of information
- A priori knowledge (schema)
- Query workload (learning)
- The data
24Computing Signatures
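[Figure: triples pass through an External Sort on resource, then a Nest step that groups each resource's properties and values and hashes its property set]

After External Sort
rsrc  prop  value
r0    p0    v(0,0)
r0    p1    v(0,1)
r0    p2    v(0,2)
r1    p1    v(1,1)
r1    p3    v(1,3)
r2    p1    v(2,1)
r2    p3    v(2,3)

After Nest
rsrc  properties  values                  sighash
r0    p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(S0)
r1    p1, p3      v(1,1), v(1,3)          hash(S1)
r2    p1, p3      v(2,1), v(2,3)          hash(S2) (= hash(S1), same property set)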
25Computing Signatures
Nested resources
rsrc  properties  values                  sighash
r0    p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(S0)
r1    p1, p3      v(1,1), v(1,3)          hash(S1)
r2    p1, p3      v(2,1), v(2,3)          hash(S1)

signatures
signature   sighash
p0, p1, p2  hash(S0)
p1, p3      hash(S1)

hash(S0)
rsrc  p0      p1      p2
r0    v(0,0)  v(0,1)  v(0,2)

hash(S1)
rsrc  p1      p3
r1    v(1,1)  v(1,3)
r2    v(2,1)  v(2,3)
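The sort, nest, and cluster steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: an in-memory `sorted` stands in for the external sort, the sorted property tuple itself stands in for hash(S), and every property is single-valued per resource. `compute_signatures` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def compute_signatures(triples):
    """Cluster (rsrc, prop, value) triples into signature extents.

    Returns {signature: {rsrc: {prop: value}}}, where a signature is the
    sorted tuple of properties a resource expresses.
    """
    extents = {}
    # Sort by resource so that each resource's triples are adjacent.
    for rsrc, group in groupby(sorted(triples), key=itemgetter(0)):
        # Nest: collapse the resource's triples into one wide row.
        row = {prop: value for _, prop, value in group}
        signature = tuple(sorted(row))
        # Resources with identical property sets land in the same extent.
        extents.setdefault(signature, {})[rsrc] = row
    return extents

# Example input mirroring r0, r1, r2 on the slides.
TRIPLES = [
    ("r2", "p1", "v(2,1)"), ("r0", "p0", "v(0,0)"), ("r1", "p3", "v(1,3)"),
    ("r0", "p2", "v(0,2)"), ("r0", "p1", "v(0,1)"), ("r1", "p1", "v(1,1)"),
    ("r2", "p3", "v(2,3)"),
]
```

`compute_signatures(TRIPLES)` yields two extents: `("p0", "p1", "p2")` holding r0, and `("p1", "p3")` holding both r1 and r2.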
26Outline
- Dataspaces
- Quarry Data Model
- Quarry Storage
- Quarry API
- Experimental Results
- Wrap Up
27Quarry API Describe
- Describe(r)
- Property, Value pairs describing resource r
Describe(/2005-002//anim-sal_plume_5.gif)
year=2005, day=002, runid=2005-002, anim, region=plume, variable=salt, depth=5, plottype=isoline
Describe(/2005-002//anim-sal_channel_transects.gif)
year=2005, day=002, runid=2005-002, anim, channel=plume, variable=salt, plottype=transect
28Quarry API Values
- Values(B, p)
- Unique values of property p associated with any
resource that satisfies B
Values(var=salt, day) -> 1, 2, 3, 4, 5, 6, 7
29Quarry API Properties
- Properties(B)
- The set of properties that describe any resource
satisfying B
GetProperties(variable=salt) -> plottype, year, region, depth, channel, ...
GetProperties(plottype=isoline) -> region, depth, year, ...
30Quarry API
- Applications use sequences of Prop and Val calls
to explore the Dataspace
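The three calls can be mocked up over a plain list of triples to show how little the interface needs. This is a sketch under stated assumptions: B is restricted to a conjunction of property=value conditions and passed as a dict, the data is the small estuary example, and the function names and signatures are illustrative rather than Quarry's actual API.

```python
# (rsrc, property, value) triples; a hypothetical stand-in for a dataspace.
TRIPLES = [
    (101, "depth", "7"), (101, "path", "/iso_e_s_7.gif"),
    (101, "variable", "salt"), (101, "region", "estuary"),
    (336, "variable", "temp"), (336, "path", "/trans_s_t.gif"),
    (336, "channel", "south"),
    (843, "variable", "salt"), (843, "path", "/trans_n_s.gif"),
    (843, "channel", "north"),
]

def _matching(b):
    """Resources whose (property, value) pairs include every condition in b."""
    described = {}
    for rsrc, prop, value in TRIPLES:
        described.setdefault(rsrc, set()).add((prop, value))
    wanted = set(b.items())
    return {rsrc for rsrc, pairs in described.items() if wanted <= pairs}

def describe(r):
    """Describe(r): the (property, value) pairs describing resource r."""
    return sorted((p, v) for rsrc, p, v in TRIPLES if rsrc == r)

def values(b, p):
    """Values(B, p): unique values of p over resources satisfying B."""
    hits = _matching(b)
    return sorted({v for rsrc, prop, v in TRIPLES if prop == p and rsrc in hits})

def properties(b):
    """Properties(B): properties describing any resource satisfying B."""
    hits = _matching(b)
    return sorted({prop for rsrc, prop, _ in TRIPLES if rsrc in hits})
```

An exploration session alternates the calls: `properties({"variable": "salt"})` returns `['channel', 'depth', 'path', 'region', 'variable']`, and `values({"variable": "salt"}, "path")` then returns `['/iso_e_s_7.gif', '/trans_n_s.gif']`.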
31Quarry API Canonical Application
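[Figure: browse tree for the canonical application]
- Root: all unique properties
- Under a property p: all unique values of the parent property
- Under a value v: all properties of resources satisfying p=v
- Every path from a root represents a conjunctive query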
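To illustrate how a path through the browse tree corresponds to a conjunctive query, the sketch below renders a list of (property, value) choices as a sequence of triple patterns over one shared subject variable. `path_to_query` and the pattern syntax are assumptions made for illustration.

```python
def path_to_query(path, subject="?s"):
    """Render a browse path as a conjunction of triple patterns.

    `path` is the list of (property, value) pairs chosen while descending
    the tree; every pattern constrains the same subject variable.
    """
    return " ".join(f"{subject} {prop} {value} ." for prop, value in path)

# Two levels of browsing: pick region=estuary, then variable=salt.
example = path_to_query([("region", "estuary"), ("variable", "salt")])
```

Here `example` is `"?s region estuary . ?s variable salt ."`; descending one more level simply appends another pattern.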
32Expressiveness
- Mostly subsumed by RDF Query Languages
- Shooting for the minimal possible useful API
- We're limited to queries of the form
?s LANGUAGECODE en . ?s DESCRIPTIONTYPE 2 . ?s UMLSAUI A3711025 . ?s string Sodium_lactate_0.16_molar_infusion .
33Outline
- Dataspaces
- Quarry Data Model
- Quarry Storage
- Quarry API
- Experimental Results
- Wrap Up
34Experimental Results
- Yet Another RDF Store (YARS)
- Several B-Tree indexes
- rpv -> _, pv -> r, vr -> p, etc.
- authors report good performance against Redland and Sesame
- 3M triples, single term queries
- We investigate simple multi-term queries
?s <p0> <o0> . ?s <p1> <o1> . ... ?s <pn> <on> .
35Experimental Results Queries
3.6M triples, 606k resources, 149 signatures
36Frequent YARS Access Plan
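?s LANGUAGECODE en . ?s DESCRIPTIONTYPE 2 . ?s UMLSAUI A3711025 . ?s string Sodium_lactate_0.16_molar_infusion .
[Figure: access plan. A po -> s index lookup on UMLSAUI=A3711025 yields candidate subjects <s>; each candidate is then probed in the spo index for the remaining terms: <s> DESCRIPTIONTYPE 2, <s> LANGUAGECODE en, and <s> string Sodium]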
37YARS Plan Speed
[Plot: time (s) against cardinality of first join]
38Related Work
- RDF: Redland, Sesame, Jena, YARS, Forth, KAON
- Primarily Indexed Triple Stores; rich query support
- Path Indexes: Lorel, DataGuides
- Data Mining for Structure
- Ding, Wilkinson, Sayers, Kuno @ HP Labs: Application-specific Schema Design for Storing Large RDF Datasets
39Conclusions
- Dataspaces
- Smoothing the ROI curve for data management
- Quarry
- very simple Data Model
- very simple API
- no need to sacrifice efficiency, utility
40Questions?
41Quarry API
- /2004/2004-001//anim-tem_estuary_bottom.gif
  - aggregate = bottom
  - animation = isotem
  - day = 001
  - directory = images
  - plottype = isotem
  - region = estuary
  - runid = 2004-001
  - year = 2004
- /2004/2004-001//amp_plume_2d.gif
  - day = 001
  - directory = images
  - plottype = 2d
  - region = plume
  - runid = 2004-001
  - year = 2004
42Quarry Query Processing
- Props(B)
- B = (region=estuary and day=136 and variable=salt)
- let cover = {region, day, variable}
- Ans = {}
- for Sig in Signatures
  - if cover is a subset of Sig
    - if exists tup (tup in Extent(Sig) and B(tup))
      - Ans = Ans U Sig
select rho from Extent(Sig1) where B limit 1
43Quarry Query Processing
- Val(B, rho)
- B = (region=estuary and day=136 and variable=salt)
- let cover = {region, day, variable}
- Ans = {}
- for Sig in Signatures
  - if cover is a subset of Sig
    - for tuple in Extent(Sig)
      - if B(tuple)
        - insert tuple.rho in Ans
select rho from Extent(Sig1) where B union select rho from Extent(Sig2) where B
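Both procedures, Props(B) and Val(B, rho), can be sketched over in-memory signature extents. This is an illustrative version under stated assumptions: B is a dict of property=value conditions, extents are lists of dict rows, and Val skips signatures that do not contain rho (the slide does not say how that case is handled). All names are hypothetical.

```python
# Signature (tuple of properties) -> extent (list of rows).
SIGNATURES = {
    ("channel", "path", "variable"): [
        {"rsrc": 336, "variable": "temp", "channel": "south", "path": "/trans_s_t.gif"},
        {"rsrc": 843, "variable": "salt", "channel": "north", "path": "/trans_n_s.gif"},
    ],
    ("depth", "path", "region", "variable"): [
        {"rsrc": 101, "depth": "7", "region": "estuary", "variable": "salt",
         "path": "/iso_e_s_7.gif"},
    ],
}

def satisfies(row, b):
    """B(tup): every condition in b holds for this row."""
    return all(row.get(prop) == val for prop, val in b.items())

def props(b):
    """Props(B): union of covering signatures that hold at least one match."""
    cover, ans = set(b), set()
    for sig, extent in SIGNATURES.items():
        # any() stops at the first matching row, like the LIMIT 1 probe.
        if cover <= set(sig) and any(satisfies(row, b) for row in extent):
            ans |= set(sig)
    return ans

def val(b, rho):
    """Val(B, rho): distinct rho values across all covering signatures."""
    cover, ans = set(b), set()
    for sig, extent in SIGNATURES.items():
        if cover <= set(sig) and rho in sig:
            ans |= {row[rho] for row in extent if satisfies(row, b)}
    return ans
```

Signatures that do not cover the query's properties are never scanned, which is where the approach saves work when the number of signatures is small relative to the number of resources.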
44Scaling Up
- Queries covered by many signatures can be
inefficient
SELECT orig_code FROM sig1 WHERE va_class_name = 'DE820'
UNION SELECT orig_code FROM sig2 WHERE va_class_name = 'DE820'
UNION SELECT orig_code FROM sig3 WHERE va_class_name = 'DE820'
UNION SELECT orig_code FROM sig4 WHERE va_class_name = 'DE820'
45Scaling Up
S1(a,b,c), S2(a,b,d) -> S12(a,b,c,d)
pad with nulls
46Scaling Up
- Extract
- Find commonly accessed property sets and materialize them separately
S1(a,b,c), S2(a,b,d) -> S1(a,b,c), S2(a,b,d), Sab(a,b)
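The two scaling strategies, merging signatures with null padding and extracting a commonly accessed property set, can be sketched over dict-based rows. This is a minimal illustration; `merge` and `extract` are hypothetical names, and `None` plays the role of SQL NULL.

```python
def merge(sig_a, extent_a, sig_b, extent_b):
    """Merge two signature extents into one wider table, padding with None."""
    merged_sig = tuple(sorted(set(sig_a) | set(sig_b)))
    rows = [
        {"rsrc": row["rsrc"], **{p: row.get(p) for p in merged_sig}}
        for row in extent_a + extent_b
    ]
    return merged_sig, rows

def extract(common, extents):
    """Materialize only the `common` columns from every covering extent."""
    return [
        {"rsrc": row["rsrc"], **{p: row[p] for p in common}}
        for sig, rows in extents.items()
        if set(common) <= set(sig)
        for row in rows
    ]

# S1(a,b,c) and S2(a,b,d) with one hypothetical resource each.
S1 = ("a", "b", "c")
S2 = ("a", "b", "d")
S1_ROWS = [{"rsrc": 1, "a": "a1", "b": "b1", "c": "c1"}]
S2_ROWS = [{"rsrc": 2, "a": "a2", "b": "b2", "d": "d2"}]
```

`merge(S1, S1_ROWS, S2, S2_ROWS)` produces S12(a,b,c,d) with `None` in the column each source lacked, trading sparsity for fewer tables to union. `extract(("a", "b"), {S1: S1_ROWS, S2: S2_ROWS})` builds Sab(a,b), leaving the original extents untouched.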
47Updates
S1: variable, channel, path
rsrc  variable  channel  path
336   temp      south    /trans_s_t.gif
843   salt      north    /trans_n_s.gif

S2: depth, region, variable, path
rsrc  depth  region   variable  path
101   7      estuary  salt      /iso_e_s_7.gif
48Data Models
expressive power in terms of structure,
operations, and constraints
49Some Storage Models
- Schema dependent storage (RDFS)
- We assume schema is unavailable
- Indexed Triple Store
- Logically, one large table of (r, p, v) triples
- Physically, multiple indices for various access patterns
- Property Tables
- Some properties get their own (r, v) extents (basically isomorphic to a pso index)
- Selection of properties depends on query workload
50Data Management Solutions
[Figure: data management solutions placed on two axes, Administrative Proximity (Near to Far) and Semantic Integration (Low to High). Labels shown: Web Search, Virtual Organization, Enterprise Portal, Ontology, Data Integration System, Desktop Search, Scientific Repository, DBMS]
Diagram adapted, with permission, from Figure 1 in the paper From Databases to Dataspaces: A New Abstraction for Information Management, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
51Query Languages
- RQL
- RDQL
- RDFQL
- RxPath
- N3
- SeRQL
- Triple
- Versa
52Facts
- Environmental Observation and Forecasting System
- 7.5M triples describing 1M files
- Integrated Pharmacological Database
- 23M triples describing 0.6M concepts
53Dataspace Components
- Catalog and Browse
- Search and Query
- Global Query
- Structured Query
- Provenance Query
- Continuous Query (Monitoring)
- Local Store and Index
- Discovery
- Source Extension
54Scaling Way Up
55Integrity Constraints and Normalization
56Growing a Query Language
- Desc(k)
- Prop(B)
- Val(B, p)
57Pharmacological Database
- Signature <> ptty in 85% of the cases