Quarrying Unfamiliar DataSpaces

1
Quarrying Unfamiliar DataSpaces
  • Bill Howe
  • David Maier
  • Nick Rayner
  • Sponsored by the NSF ITR Program 2001-2006
  • In collaboration with Antonio Baptista, Paul
    Turner, and the entire CORIE Environmental
    Science Team at OGI
  • http://www.cs.pdx.edu/howe

2
Dataspaces
  • DataSpace (DS)
  • Autonomous, heterogeneous data sources
  • grouped by an identifiable scope
  • with respect to a set of requirements
  • DataSpace Support Platform (DSSP)
  • A collection of best effort services
  • Catalog and Browse
  • Search and Query
  • Workflow (Events, Actions, and Monitoring)
  • Integrity checks/guarantees

From Databases to Dataspaces: A New
Abstraction for Information Management, Michael
Franklin, Alon Halevy, David Maier, SIGMOD Record,
December 2005.
3
Databases vs. Dataspaces
  • Databases
      • Single Schema
      • Centralized Administration
      • Structured Query
      • Strict Integrity Constraints
  • Dataspaces
      • Data Coexistence
      • Autonomous Sources
      • Search, Browse, Approximate Answer
      • Best Effort Guarantees

4
Dataspaces vs. Data Integration Systems
[Diagram: insular databases DB1, DB2, and DB3 mapped into an Integrated DB]
  • May use Local-as-view or Global-as-view
  • Always requires semantic integration
5
Semantic Web vs. Dataspaces
  • Semantic Web
      • Autonomous agents
      • crawling richly described autonomous data sources
      • which are related formally via ontologies
  • Dataspaces
      • No ontology
      • Probably no inferencing
      • Pay as you go

6
Dataspace Timeline
[Diagram: a timeline along an axis of time and scope, running from insular, application-specific databases to autonomous agents crawling richly described data sources integrated via an ontology]
7
Example: Scientific Data Repository
[Figure: repository contents: atmospheric forcings, river forcings, global ocean forcings; sensor data; ocean simulation results, configuration and log files, annotations, data products (e.g. salinity, /anim-sal_estuary_7.gif)]
8
Example: Pharmacology
[Screenshot: RxNav interface, developed by the National Library of Medicine]
9
Dataspace Timeline
[Plot: utility versus time, scope, and effort]
10
Unfamiliar Dataspaces
  • No schema is available
  • No query workload is available
  • Browse is the dominant interaction
  • keys, ids, URIs not directly useful
  • properties and values alone convey meaning
  • Goal: Maximize return on effort when working with
    an unfamiliar dataspace

11
Green Field Tools for Unfamiliar Dataspaces
  • Goal: A working, extensible application with the
    least possible effort
  • We need at least:
      • a Data Model
          • Lowest Common Denominator
          • minimal modeling decisions
      • an API
          • narrow interface
          • uniformly efficient

12
Outline
  • Dataspaces
  • Quarry Data Model
  • Quarry Storage
  • Quarry API
  • Experimental Results
  • Wrap up

13
Quarry Data Model
  • resource, property, value
  • (subject, predicate, object) if you prefer
  • no explicit distinction between literal values
    and resource values
  • no explicit types or classes
  • no variables (no inference)
  • no context/provenance (yet)
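
A minimal sketch of this data model in Python, assuming one plain tuple per statement. The name `Triple` and the sample values (borrowed from the triple-store example later in the talk) are illustrative only and are not Quarry's actual implementation.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One (resource, property, value) statement; all three are plain strings."""
    resource: str
    prop: str
    value: str

# No types, classes, or variables: a dataset is just a bag of triples.
triples = [
    Triple("101", "region", "estuary"),
    Triple("101", "variable", "salt"),
    Triple("101", "depth", "7"),
    Triple("101", "path", "/iso_e_s_7.gif"),
]

# Literal values and resource values are not distinguished: both are strings.
properties_of_101 = sorted(t.prop for t in triples if t.resource == "101")
print(properties_of_101)
```

Keeping everything as strings mirrors the "minimal modeling decisions" goal: nothing has to be decided about types before data can be loaded.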

14
Example: Pharmacology
[Figure labels: Concept, Relationship, Atom]
15
Example: Scientific Data Repository
/anim-sal_estuary_7.gif
16
Outline
  • Dataspaces
  • Quarry Data Model
  • Quarry Storage
  • Quarry API
  • Experimental Results
  • Wrap up

17
Some Storage Models
  • Schema dependent storage (RDFS)
  • We assume schema is unavailable
  • Indexed Triple Store
  • Property Tables

18
Simple Idea
  • Signatures
  • resources expressing the same properties
    clustered together
  • Posit that |Signature| << |Resource|
  • Queries evaluated over Signature Extents

19
1) Triple Store
A Query in SPARQL/RDQL

    select ?v
    where (?r, <s:region>, <s:estuary>),
          (?r, <s:variable>, <s:salt>),
          (?r, <s:depth>, <s:7>),
          (?r, <s:path>, ?v)

Triples

    rsrc  property  value
    101   depth     7
    336   variable  temp
    101   path      /iso_e_s_7.gif
    101   variable  salt
    843   channel   north
    843   variable  salt
    336   path      /trans_s_t.gif
    843   path      /trans_n_s.gif
    336   channel   south
    101   region    estuary

and in SQL

    SELECT p.value AS path
    FROM Triples r, Triples v, Triples d, Triples p
    WHERE r.property = 'region'   AND r.value = 'estuary'
      AND v.property = 'variable' AND v.value = 'salt'
      AND d.property = 'depth'    AND d.value = '7'
      AND p.property = 'path'
      AND r.rsrc = v.rsrc AND v.rsrc = d.rsrc
      AND d.rsrc = p.rsrc

One join per condition
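
A runnable sketch of the triple-store plan using Python's built-in `sqlite3` module. The in-memory database and the choice to store every value as text are assumptions made for illustration; the rows are the ten triples from this slide, and the SQL includes a value test for each condition so that it matches the four-condition query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (rsrc TEXT, property TEXT, value TEXT)")
rows = [
    ("101", "depth", "7"), ("336", "variable", "temp"),
    ("101", "path", "/iso_e_s_7.gif"), ("101", "variable", "salt"),
    ("843", "channel", "north"), ("843", "variable", "salt"),
    ("336", "path", "/trans_s_t.gif"), ("843", "path", "/trans_n_s.gif"),
    ("336", "channel", "south"), ("101", "region", "estuary"),
]
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", rows)

# One alias of Triples per condition, joined on the shared resource id.
query = """
SELECT p.value AS path
FROM Triples r, Triples v, Triples d, Triples p
WHERE r.property = 'region'   AND r.value = 'estuary'
  AND v.property = 'variable' AND v.value = 'salt'
  AND d.property = 'depth'    AND d.value = '7'
  AND p.property = 'path'
  AND r.rsrc = v.rsrc AND v.rsrc = d.rsrc AND d.rsrc = p.rsrc
"""
paths = [row[0] for row in conn.execute(query)]
print(paths)
```

Each extra condition adds another alias of `Triples` and another join, which is the cost the slide points at.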
20
1) Triple Store

    select ?v
    where (?r, <s:region>, <s:estuary>),
          (?r, <s:variable>, <s:salt>),
          (?r, <s:depth>, <s:7>),
          (?r, <s:path>, ?v)

    SELECT MAX(CASE WHEN property = 'path' THEN value END) AS path
    FROM Triples
    GROUP BY rsrc
    HAVING MAX(CASE WHEN property = 'region' THEN value END) = 'estuary'
       AND MAX(CASE WHEN property = 'variable' THEN value END) = 'salt'
       AND MAX(CASE WHEN property = 'depth' THEN value END) = '7'

...but can't exploit indexes
21
2) Property Tables

    select ?p
    where (?r, <s:region>, <s:estuary>),
          (?r, <s:variable>, <s:salt>),
          (?r, <s:depth>, <s:7>),
          (?r, <s:path>, ?p)

    depth                 region
    rsrc  value           rsrc  value
    101   7               101   estuary

    variable              channel
    rsrc  value           rsrc  value
    336   temp            843   north
    101   salt            336   south
    843   salt

    path
    rsrc  value
    101   /iso_e_s_7.gif
    336   /trans_s_t.gif
    843   /trans_n_s.gif

    select p.value
    from region r, variable v, depth d, path p
    where r.value = 'estuary' and v.value = 'salt' and d.value = '7'
      and r.rsrc = v.rsrc and v.rsrc = d.rsrc
      and d.rsrc = p.rsrc
22
3) Signature Tables

    select ?p
    where (?r, <s:region>, <s:estuary>),
          (?r, <s:variable>, <s:salt>),
          (?r, <s:depth>, <s:7>),
          (?r, <s:path>, ?p)

S1 = {variable, channel, path}

    rsrc  variable  channel  path
    336   temp      south    /trans_s_t.gif
    843   salt      north    /trans_n_s.gif

S2 = {depth, region, variable, path}

    rsrc  depth  variable  region   path
    101   7      salt      estuary  /iso_e_s_7.gif

    select path from S2
    where region = 'estuary' and variable = 'salt' and depth = '7'
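
The same query against signature tables can be sketched with `sqlite3` as follows. The table layouts follow S1 and S2 from this slide; storing values as text and using an in-memory database are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table per signature; each resource is a single row in exactly one table.
conn.execute("CREATE TABLE S1 (rsrc TEXT, variable TEXT, channel TEXT, path TEXT)")
conn.execute(
    "CREATE TABLE S2 (rsrc TEXT, depth TEXT, region TEXT, variable TEXT, path TEXT)"
)
conn.executemany("INSERT INTO S1 VALUES (?, ?, ?, ?)", [
    ("336", "temp", "south", "/trans_s_t.gif"),
    ("843", "salt", "north", "/trans_n_s.gif"),
])
conn.execute(
    "INSERT INTO S2 VALUES (?, ?, ?, ?, ?)",
    ("101", "7", "estuary", "salt", "/iso_e_s_7.gif"),
)

# Only S2 has region, variable, depth, and path, so S1 is never touched
# and no join is needed.
paths = [row[0] for row in conn.execute(
    "SELECT path FROM S2 "
    "WHERE region = 'estuary' AND variable = 'salt' AND depth = '7'"
)]
print(paths)
```

Because all properties of a resource sit in one row, the conditions become plain column predicates on a single table.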
23
Choosing a Storage Model
  • Sources of information
  • A priori knowledge (schema)
  • Query workload (learning)
  • The data

24
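Computing Signatures
Input triples (unsorted):

    rsrc  prop  value
    r0    p0    v(0,0)
    r2    p1    v(2,1)
    r0    p2    v(0,2)
    r0    p1    v(0,1)
    r1    p3    v(1,3)
    r1    p1    v(1,1)
    r2    p3    v(2,3)

External Sort (by resource, then property):

    rsrc  prop  value
    r0    p0    v(0,0)
          p1    v(0,1)
          p2    v(0,2)
    r1    p1    v(1,1)
          p3    v(1,3)
    r2    p1    v(2,1)
          p3    v(2,3)

Nest (one row per resource, with a hash of its property set):

    rsrc  properties  values                  sighash
    r0    p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(S0)
    r1    p1, p3      v(1,1), v(1,3)          hash(S1)
    r2    p1, p3      v(2,1), v(2,3)          hash(S2)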
25
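Computing Signatures
Nested rows:

    rsrc  properties  values                  sighash
    r0    p0, p1, p2  v(0,0), v(0,1), v(0,2)  hash(S0)
    r1    p1, p3      v(1,1), v(1,3)          hash(S1)
    r2    p1, p3      v(2,1), v(2,3)          hash(S2)

r1 and r2 share the property set {p1, p3}, so hash(S2) = hash(S1) and both rows land in the same extent.

signatures

    signature   sighash
    p0, p1, p2  hash(S0)
    p1, p3      hash(S1)

hash(S0)

    rsrc  p0      p1      p2
    r0    v(0,0)  v(0,1)  v(0,2)

hash(S1)

    rsrc  p1      p3
    r1    v(1,1)  v(1,3)
    r2    v(2,1)  v(2,3)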
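
The sort, nest, and hash steps can be sketched in a few lines of Python. This is an in-memory stand-in for the external sort, it assumes each resource has at most one value per property, and the helper name `signature_hash` and the use of SHA-1 are illustrative choices rather than details from the talk.

```python
import hashlib
from collections import defaultdict
from itertools import groupby

triples = [
    ("r0", "p0", "v00"), ("r2", "p1", "v21"), ("r0", "p2", "v02"),
    ("r0", "p1", "v01"), ("r1", "p3", "v13"), ("r1", "p1", "v11"),
    ("r2", "p3", "v23"),
]

def signature_hash(props):
    """Stable hash of a property set (sorted first, so order does not matter)."""
    return hashlib.sha1("|".join(sorted(props)).encode()).hexdigest()[:8]

# Sort by (resource, property); stands in for the external sort.
ordered = sorted(triples)

# Nest: collapse each resource's triples into one row, grouped by signature.
extents = defaultdict(list)  # signature (tuple of properties) -> rows
for rsrc, group in groupby(ordered, key=lambda t: t[0]):
    row = {prop: value for _, prop, value in group}
    signature = tuple(sorted(row))
    extents[signature].append({"rsrc": rsrc, **row})

for signature, rows in extents.items():
    print(signature_hash(signature), signature, [r["rsrc"] for r in rows])
```

Resources r1 and r2 end up in the same extent because they express the same property set, which is what keeps the number of signatures small relative to the number of resources.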
26
Outline
  • Dataspaces
  • Quarry Data Model
  • Quarry Storage
  • Quarry API
  • Experimental Results
  • Wrap Up

27
Quarry API: Describe
  • Describe(r)
  • Property, Value pairs describing resource r

Describe(/2005-002//anim-sal_plume_5.gif) =
  [year=2005, day=002, runid=2005-002, anim=,
  region=plume, variable=salt, depth=5,
  plottype=isoline]
Describe(/2005-002//anim-sal_channel_transects.gif) =
  [year=2005, day=002, runid=2005-002, anim=,
  channel=plume, variable=salt, plottype=transect]
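
A minimal sketch of Describe over a plain list of triples. The function name `describe`, the sample resources, and the assumption of one value per property are illustrative and are not taken from Quarry's code.

```python
def describe(triples, resource):
    """Return the {property: value} pairs recorded for one resource."""
    return {prop: value for rsrc, prop, value in triples if rsrc == resource}

triples = [
    ("/iso_e_s_7.gif", "region", "estuary"),
    ("/iso_e_s_7.gif", "variable", "salt"),
    ("/iso_e_s_7.gif", "depth", "7"),
    ("/trans_n_s.gif", "variable", "salt"),
    ("/trans_n_s.gif", "channel", "north"),
]
description = describe(triples, "/iso_e_s_7.gif")
print(description)
```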
28
Quarry API: Values
  • Values(B, p)
  • Unique values of property p associated with any
    resource that satisfies B

Values(var=salt, day) = [1,2,3,4,5,6,7]
29
Quarry API: Properties
  • Properties(B)
  • The set of properties that describe any resource
    satisfying B

GetProperties(variable=salt) = [plottype, year,
  region, depth, channel, ...]
GetProperties(plottype=isoline) = [region, depth, year, ...]
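
A naive sketch of the Values and Properties calls, evaluated directly over triples with no signature tables. The condition B is modelled as a dict of property=value pairs, and all function names and sample data here are assumptions for illustration.

```python
from collections import defaultdict

def index_by_resource(triples):
    """Group triples into {resource: {property: value}}."""
    resources = defaultdict(dict)
    for rsrc, prop, value in triples:
        resources[rsrc][prop] = value
    return resources

def satisfies(description, condition):
    """True if the resource has every property=value pair in condition B."""
    return all(description.get(p) == v for p, v in condition.items())

def values(triples, condition, prop):
    """Values(B, p): unique values of prop over resources satisfying B."""
    return {d[prop] for d in index_by_resource(triples).values()
            if prop in d and satisfies(d, condition)}

def properties(triples, condition):
    """Properties(B): every property used by some resource satisfying B."""
    found = set()
    for d in index_by_resource(triples).values():
        if satisfies(d, condition):
            found.update(d)
    return found

triples = [
    ("101", "region", "estuary"), ("101", "variable", "salt"), ("101", "depth", "7"),
    ("336", "variable", "temp"), ("336", "channel", "south"),
    ("843", "variable", "salt"), ("843", "channel", "north"),
]
salt_props = properties(triples, {"variable": "salt"})
salt_channels = values(triples, {"variable": "salt"}, "channel")
print(sorted(salt_props), sorted(salt_channels))
```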
30
Quarry API
  • Applications use sequences of Prop and Val calls
    to explore the Dataspace

31
Quarry API: Canonical Application
[Diagram: a browse tree]
  • Root level: all unique properties p
  • Next level: all unique values v of the parent property
  • Next level: all properties of resources satisfying p=v
  • Every path from a root represents a conjunctive query
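
One way to read the browse tree: a path of (property, value) choices is a conjunction, and the next level lists the properties of the resources that satisfy it. The sketch below assumes resources are held as dicts and, as a design choice of this example only, hides properties already fixed on the path.

```python
def children(resources, path):
    """Properties offered below a path of (property, value) choices.

    The path is read as the conjunction p1=v1 AND p2=v2 AND ...
    """
    matching = [d for d in resources.values()
                if all(d.get(p) == v for p, v in path)]
    fixed = {p for p, _ in path}
    return sorted({p for d in matching for p in d} - fixed)

resources = {
    "101": {"region": "estuary", "variable": "salt", "depth": "7"},
    "336": {"variable": "temp", "channel": "south"},
    "843": {"variable": "salt", "channel": "north"},
}
root = children(resources, [])
below_salt = children(resources, [("variable", "salt")])
below_salt_estuary = children(
    resources, [("variable", "salt"), ("region", "estuary")]
)
print(root, below_salt, below_salt_estuary)
```

Each deeper level narrows the set of matching resources, so the list of offered properties shrinks as the user descends.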
32
Expressiveness
  • Mostly subsumed by RDF Query Languages
  • We're limited to queries of the form shown below
  • Shooting for the minimal possible useful API

    ?s LANGUAGECODE en .
    ?s DESCRIPTIONTYPE 2 .
    ?s UMLSAUI A3711025 .
    ?s string Sodium_lactate_0.16_molar_infusion .
33
Outline
  • Dataspaces
  • Quarry Data Model
  • Quarry Storage
  • Quarry API
  • Experimental Results
  • Wrap Up

34
Experimental Results
  • Yet Another RDF Store (YARS)
  • Several B-Tree indexes
  • rpv → _, pv → r, vr → p, etc.
  • authors report good performance against Redland
    and Sesame
  • 3M triples, single term queries
  • We investigate simple multi-term queries

    ?s <p0> <o0> . ?s <p1> <o1> . ... ?s <pn> <on>
35
Experimental Results: Queries
3.6M triples, 606k resources, 149 signatures
36
Frequent YARS Access Plan

    ?s LANGUAGECODE en .
    ?s DESCRIPTIONTYPE 2 .
    ?s UMLSAUI A3711025 .
    ?s string Sodium_lactate_0.16_molar_infusion .

[Diagram: the plan uses the po → s index to resolve UMLSAUI A3711025 → <s>, then probes the spo index for <s> string Sodium, <s> LANGUAGECODE en, and <s> DESCRIPTIONTYPE 2]
37
YARS Plan Speed
[Plot: time (s) versus cardinality of first join]
38
Related Work
  • RDF: Redland, Sesame, Jena, YARS, Forth, KAON
  • Primarily Indexed Triple Stores; rich query
    support
  • Path Indexes: Lorel, DataGuides
  • Data Mining for Structure
  • Ding, Wilkinson, Sayers, Kuno @ HP Labs:
    Application-specific Schema Design for Storing
    Large RDF Datasets

39
Conclusions
  • Dataspaces
  • Smoothing the ROI curve for data management
  • Quarry
  • very simple Data Model
  • very simple API
  • no need to sacrifice efficiency, utility

40
Questions?
41
Quarry API
  • /2004/2004-001//anim-tem_estuary_bottom.gif
      • aggregate=bottom, animation=isotem, day=001,
        directory=images, plottype=isotem,
        region=estuary, runid=2004-001, year=2004
  • /2004/2004-001//amp_plume_2d.gif
      • day=001, directory=images, plottype=2d,
        region=plume, runid=2004-001, year=2004

42
Quarry Query Processing
Props(B)

    B = (region=estuary and day=136 and variable=salt)
    let cover = {region, day, variable}
    Ans = {}
    for Sig in Signatures
        if cover ⊆ Sig
            if exists tup (tup in Extent(Sig) and B(tup))
                Ans = Ans U Sig

The existence test for one signature, in SQL:

    select rho from Extent(Sig1) where B limit 1
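
The Props(B) loop can be sketched in Python as follows, assuming each signature's extent is held in memory as a list of row dicts. The function name `props`, the dict-based condition, and the sample extents are hypothetical.

```python
def props(signatures, condition):
    """Props(B): union of the signatures holding at least one row satisfying B."""
    cover = set(condition)
    answer = set()
    for sig, extent in signatures.items():
        if not cover <= set(sig):
            continue  # signature lacks a property B mentions: skip its extent
        if any(all(row[p] == v for p, v in condition.items()) for row in extent):
            answer |= set(sig)
    return answer

signatures = {
    ("channel", "path", "variable"): [
        {"variable": "temp", "channel": "south", "path": "/trans_s_t.gif"},
        {"variable": "salt", "channel": "north", "path": "/trans_n_s.gif"},
    ],
    ("depth", "path", "region", "variable"): [
        {"depth": "7", "region": "estuary", "variable": "salt",
         "path": "/iso_e_s_7.gif"},
    ],
}
estuary_props = props(signatures, {"region": "estuary", "variable": "salt"})
print(sorted(estuary_props))
```

The cover test prunes whole extents before any row is read, and `any` stops at the first matching row, which plays the role of `limit 1`.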
43
Quarry Query Processing
Val(B, rho)

    B = (region=estuary and day=136 and variable=salt)
    let cover = {region, day, variable}
    Ans = {}
    for Sig in Signatures
        if cover ⊆ Sig
            for tuple in Extent(Sig)
                if B(tuple)
                    insert tuple[rho] in Ans

In SQL, one branch per covering signature:

    select rho from Extent(Sig1) where B
    union
    select rho from Extent(Sig2) where B
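
A sketch of Val(B, rho) that builds one SELECT per covering signature table and combines them with UNION, run here on `sqlite3`. The `catalog` mapping, the table names, and the extra requirement that rho itself be stored in the table are assumptions of this example; B is assumed non-empty, and property names are assumed to be trusted identifiers because they are interpolated into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S1 (rsrc, variable, channel, path)")
conn.execute("CREATE TABLE S2 (rsrc, depth, region, variable, path)")
conn.execute("INSERT INTO S1 VALUES ('336', 'temp', 'south', '/trans_s_t.gif')")
conn.execute("INSERT INTO S1 VALUES ('843', 'salt', 'north', '/trans_n_s.gif')")
conn.execute("INSERT INTO S2 VALUES ('101', '7', 'estuary', 'salt', '/iso_e_s_7.gif')")

# Hypothetical catalog: signature table name -> the properties it stores.
catalog = {
    "S1": {"variable", "channel", "path"},
    "S2": {"depth", "region", "variable", "path"},
}

def val(conn, catalog, condition, rho):
    """Val(B, rho): one SELECT per covering signature table, joined by UNION."""
    needed = set(condition) | {rho}
    tables = sorted(t for t, props in catalog.items() if needed <= props)
    if not tables:
        return set()
    where = " AND ".join(f"{p} = ?" for p in condition)
    sql = " UNION ".join(f"SELECT {rho} FROM {t} WHERE {where}" for t in tables)
    params = list(condition.values()) * len(tables)  # one set of values per branch
    return {row[0] for row in conn.execute(sql, params)}

salt_paths = val(conn, catalog, {"variable": "salt"}, "path")
print(sorted(salt_paths))
```

The number of UNION branches grows with the number of covering signatures, which is why queries covered by many signatures can become expensive.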
44
Scaling Up
  • Queries covered by many signatures can be
    inefficient

    SELECT orig_code FROM sig1 WHERE va_class_name = 'DE820'
    UNION
    SELECT orig_code FROM sig2 WHERE va_class_name = 'DE820'
    UNION
    SELECT orig_code FROM sig3 WHERE va_class_name = 'DE820'
    UNION
    SELECT orig_code FROM sig4 WHERE va_class_name = 'DE820'
45
Scaling Up
  • Union(S1, S2)

    S1(a,b,c) + S2(a,b,d)  →  S12(a,b,c,d), pad with nulls
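
A small sketch of the Union operation on two signature extents, assuming rows are tuples aligned with their column lists. The function name `union_signatures` is hypothetical, and `None` stands in for SQL NULL.

```python
def union_signatures(cols1, rows1, cols2, rows2):
    """Merge two signature extents into one table, padding gaps with None."""
    merged_cols = list(cols1) + [c for c in cols2 if c not in cols1]

    def pad(cols, rows):
        # Re-align each row to the merged column order; absent columns -> None.
        return [tuple(dict(zip(cols, row)).get(c) for c in merged_cols)
                for row in rows]

    return merged_cols, pad(cols1, rows1) + pad(cols2, rows2)

cols, rows = union_signatures(
    ("a", "b", "c"), [(1, 2, 3)],
    ("a", "b", "d"), [(4, 5, 6)],
)
print(cols, rows)
```

Merging trades fewer UNION branches at query time for wider, sparser tables.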
46
Scaling Up
  • Extract
  • Find commonly accessed property sets and
    materialize them separately

    Before: S1(a,b,c), S2(a,b,d)
    After:  S1(a,b,c), S2(a,b,d), Sab(a,b)
47
Updates
S1 = {variable, channel, path}

    rsrc  variable  channel  path
    336   temp      south    /trans_s_t.gif
    843   salt      north    /trans_n_s.gif

S2 = {depth, region, variable, path}

    rsrc  depth  variable  region   path
    101   7      salt      estuary  /iso_e_s_7.gif
48
Data Models
expressive power in terms of structure,
operations, and constraints
49
Some Storage Models
  • Schema dependent storage (RDFS)
  • We assume schema is unavailable
  • Indexed Triple Store
  • Logically, one large table of (r, p, v) triples
  • Physically, multiple indices for various access
    patterns
  • Property Tables
  • Some properties get their own (r, v) extents
    (basically isomorphic to a pso index)
  • Selection of properties depends on query workload

50
Data Management Solutions
[Diagram: data management solutions plotted by Administrative Proximity (Near to Far) against Semantic Integration (Low to High): Web Search, Virtual Organization, Enterprise Portal, Ontology, Data Integration System, Desktop Search, Scientific Repository, DBMS]
Diagram adapted, with permission, from Figure 1
in the paper From Databases to Dataspaces: A New
Abstraction for Information Management, Michael
Franklin, Alon Halevy, David Maier, SIGMOD Record,
December 2005.
51
Query Languages
  • RQL
  • RDQL
  • RDFQL
  • RxPath
  • N3
  • SeRQL
  • Triple
  • Versa

52
Facts
  • Environmental Observation and Forecasting System
  • 7.5M triples describing 1M files
  • Integrated Pharmacological Database
  • 23M triples describing 0.6M concepts

53
Dataspace Components
  • Catalog and Browse
  • Search and Query
  • Global Query
  • Structured Query
  • Provenance Query
  • Continuous Query (Monitoring)
  • Local Store and Index
  • Discovery
  • Source Extension

54
Scaling Way Up
55
Integrity Constraints and Normalization
56
Growing a Query Language
  • Desc(k)
  • Prop(B)
  • Val(B, p)

57
Pharmacological Database
  • Signature <=> ptty in 85% of the cases