Emergent Semantics: Towards Self-Organizing Scientific Metadata

Bill Howe, David Maier, Oregon Health and Science University

1
Emergent Semantics: Towards Self-Organizing Scientific Metadata
  • Bill Howe, David Maier
  • Oregon Health and Science University

2
  • The file anim-sal_estuary_7.gif is a data
    product derived from the output of the ELCIRC
    simulation program run for the period January
    8-15, 2002. The image shows salinity (in practical
    salinity units) in the estuary region of the
    domain. It's actually an animation, where each
    frame is a horizontal slice 7 meters below
    mean sea level. There are 96 frames, each
    representing 15 minutes.

program: ELCIRC
simStart: 1/8/02
simEnd: 1/15/02
region: estuary
variable: salinity
timesteps: 96
plottype: animation

3
Environmental Observation and Forecasting System
  • Daily forecasts and 1000s of ad hoc hindcasts
  • One simulation involves 20k files
      • inputs, parameters, outputs, derived data
        products
  • This scale mandates
      • query access rather than simple filesystem
        browsing
      • automation everywhere

4
Tasks
  1. Collect metadata.
  2. Organize collected metadata.
  3. Publish organized metadata for querying.

5
Challenges
  • Metadata is scattered
      • in file paths
      • within file headers
      • in nearby files
  • Metadata requirements change frequently
      • new simulation codes
      • new data product types
      • new users, internal and external

[Figure: /anim-sal_estuary_7.gif with descriptors Depth: 7, Variable: Salinity, Type: Animation, Region: Estuary]

6
Obvious Solution
  • Data Managers work with Domain Experts
      • design a relational schema, load data, test,
        repeat
  • But
      • Large up-front cost of DB design
      • Slow return on investment
      • Use cases unknown
      • Significant change is anticipated
      • DB languages/APIs not necessarily within
        scientists' skill set

[Figure: relational schema sketch with entities file, data product, region]

7
Alternative Solution: Steps 1-3
  1. Harvest metadata via simple collection scripts
    written by the domain experts
  2. Use RDF as a schema-independent metadata
    representation
  3. Use RDBMS technology for storage and management

[Figure: filesystem, (1) collection scripts, (2) RDF, (3) db]
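
Step 1 can be pictured with a small collection script. This is only a sketch: the filename convention `<plottype>-<variable>_<region>_<depth>` is inferred from the single example anim-sal_estuary_7.gif, and the `collect` function, `NAME_PATTERN`, and the `EXPANSIONS` table are hypothetical names, not part of the presented system.

```python
import re
from pathlib import PurePosixPath

# Hypothetical naming convention, inferred from one example file:
#   <plottype>-<variable>_<region>_<depth>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<plottype>[a-z]+)-(?P<variable>[a-z]+)_(?P<region>[a-z]+)_(?P<depth>\d+)$"
)

# Hypothetical abbreviation table; only "anim" and "sal" appear in the slides.
EXPANSIONS = {"anim": "animation", "sal": "salinity"}


def collect(path):
    """Return (subject, property, object) triples harvested from a file path."""
    stem = PurePosixPath(path).stem
    match = NAME_PATTERN.match(stem)
    if match is None:
        return []  # unknown naming scheme: assert nothing rather than guess
    return [
        (path, prop, EXPANSIONS.get(value, value))
        for prop, value in match.groupdict().items()
    ]


example = collect("/forecasts/2003-184/images/anim-sal_estuary_7.gif")
```

A real script would also read file headers and nearby files; the point is that the domain expert only has to emit triples, not SQL.
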
8
A Narrower Interface
[Figure: two configurations. Wide interface: filesystem to rich schema via SQL statements, database APIs, load strategies, and data formats/models. Narrow interface: filesystem to generic schema via collection scripts and RDF triples.]
9
Generic RDF Schema
subject                                                  property           object
file://forecasts/2003-184/images/anim-sal_estuary_7.gif  property:region    estuary
file://forecasts/2003-184/images/anim-sal_estuary_7.gif  property:variable  salt
file://forecasts/2003-184/images/anim-sal_estuary_7.gif  property:plottype  animation
file://forecasts/2003-184/images/anim-sal_estuary_7.gif  property:source    file://forecasts/2003-184/run/1_salt.63
10
Is Generic RDF Good Enough?
  • Find files with region, plottype, and variable
    descriptors

SELECT r.subject AS file, r.object AS region,
       p.object AS plottype, v.object AS variable
FROM statements r, statements p, statements v
WHERE r.subject = p.subject
  AND p.subject = v.subject
  AND r.property = 'property:region'
  AND p.property = 'property:plottype'
  AND v.property = 'property:variable'

A 3-way self-join!
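
To make the cost of the generic schema concrete, the sketch below builds the same kind of self-join for any list of properties and runs it against an in-memory SQLite table. The `self_join_query` helper is hypothetical, SQLite stands in for whatever RDBMS is actually used, and property names are shortened to region, plottype, and variable for readability.

```python
import sqlite3


def self_join_query(properties):
    """Build a SELECT that pivots the given properties into columns.

    Each property gets its own alias of the statements table, joined to the
    first alias on subject. Property values are bound as parameters; the
    column aliases are interpolated, so names must not contain double quotes.
    """
    if not properties:
        raise ValueError("at least one property is required")
    aliases = [f"s{i}" for i in range(len(properties))]
    columns = ", ".join(
        f'{alias}.object AS "{prop}"' for alias, prop in zip(aliases, properties)
    )
    tables = ", ".join(f"statements {alias}" for alias in aliases)
    conditions = [f"{alias}.subject = s0.subject" for alias in aliases[1:]]
    conditions += [f"{alias}.property = ?" for alias in aliases]
    return (
        f"SELECT s0.subject AS file, {columns} "
        f"FROM {tables} WHERE {' AND '.join(conditions)}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)")
gif = "file://forecasts/2003-184/images/anim-sal_estuary_7.gif"
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?)",
    [
        (gif, "region", "estuary"),
        (gif, "variable", "salt"),
        (gif, "plottype", "animation"),
        (gif, "source", "file://forecasts/2003-184/run/1_salt.63"),
    ],
)

wanted = ["region", "plottype", "variable"]
rows = conn.execute(self_join_query(wanted), wanted).fetchall()
```

Every extra descriptor in the result adds another alias of `statements`, which is why hand-writing such queries quickly becomes awkward.
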
11
Decomposed Data
  • So we can query the RDF directly, but
      • there are no grouping structures to aid query
        formulation and processing.
  • Automatically infer groupings from the RDF data,
    observing that related files often share
    signatures.
  • Let users impose groupings using a web interface
    (like views)

[Figure: filesystem and db holding triples such as <isofar.gif, type, isoline>, <isofar.gif, region, far>, <animsal.gif, timesteps, 10>, <animsal.gif, var, salt>; groupings labeled plot and animation]
12
Alternative Solution: Steps 4-6
  4. Partition descriptors into equivalence classes
    based on file signatures
  5. Expose signatures via the web to facilitate
    browsing and querying
  6. Recompute signature extents as new metadata is
    integrated

[Figure: (4) partition data in the db, (5) publish to the web as a website, (6) query and browse via profiles]
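
One way to picture what gets published is to pivot each group of files into a small table with one column per property. The sketch below assumes triples are plain Python tuples; `profile_tables` is a hypothetical helper, and the sample triples are the ones shown for isofar.gif and animsal.gif.

```python
from collections import defaultdict


def profile_tables(triples):
    """Return {signature: [row, ...]} with one row dict per file.

    A signature is the sorted tuple of property names used for a file.
    If a property is asserted twice for a file, the last value wins.
    A property literally named "file" would overwrite the file column.
    """
    descriptors = defaultdict(dict)
    for subject, prop, obj in triples:
        descriptors[subject][prop] = obj

    tables = defaultdict(list)
    for subject in sorted(descriptors):
        record = descriptors[subject]
        signature = tuple(sorted(record))
        tables[signature].append({"file": subject, **record})
    return dict(tables)


triples = [
    ("isofar.gif", "type", "isoline"),
    ("isofar.gif", "region", "far"),
    ("animsal.gif", "timesteps", "10"),
    ("animsal.gif", "var", "salt"),
]
tables = profile_tables(triples)
```

Calling `profile_tables` again after new triples arrive corresponds to recomputing the extents; nothing about the tables is fixed in advance.
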
13
  • The set of properties defined for a particular
    file

14
Signatures
  • A file's signature is just the set of properties
    used to describe it.
  • If signatures were fixed, we might derive a
    relational schema from them. Instead, we need to
    respond to changes.

[Figure: (4) partition data in the db: find signatures, then compute signature extents]
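
Finding signatures and computing their extents can be sketched in a few lines, assuming the triples are available as Python tuples. The function names are hypothetical, and isonear.gif is an invented third file added only so that one signature has more than one member.

```python
from collections import defaultdict


def signatures(triples):
    """Map each subject to the frozenset of properties used to describe it."""
    props = defaultdict(set)
    for subject, prop, _obj in triples:
        props[subject].add(prop)
    return {subject: frozenset(p) for subject, p in props.items()}


def extents(triples):
    """Group subjects by signature: signature -> sorted list of subjects."""
    groups = defaultdict(list)
    for subject, sig in signatures(triples).items():
        groups[sig].append(subject)
    return {sig: sorted(subjects) for sig, subjects in groups.items()}


triples = [
    ("isofar.gif", "type", "isoline"),
    ("isofar.gif", "region", "far"),
    ("isonear.gif", "type", "isoline"),   # hypothetical file, for illustration
    ("isonear.gif", "region", "near"),    # hypothetical file, for illustration
    ("animsal.gif", "timesteps", "10"),
    ("animsal.gif", "var", "salt"),
]
groups = extents(triples)
```

Because signatures are recomputed from the data, a new property on a file simply moves it to a different group on the next run.
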
15
Example: Consolidate Files with Similar Signatures
(DM: data manager; DE: domain expert)
  • Modify schema (DM)
  • Transfer tuples from A to B (DM)
  • Modify collection programs
      • Modify extraction routines (DE)
      • Modify internal organization (DE)
  • Modify SQL statements (DM)

16
Alternative
  • Change two lines in a collection script (DE)
      • Assert(fileA, animation, )
      • Assert(fileA, plottype, animation)
      • Assert(fileB, plottype, animation)
  • Reload data (Automatic)
  • Recompute signatures (Automatic)
  • Republish data (Automatic)
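
The effect of that edit on the derived groupings can be sketched as follows. The `signature_groups` helper is hypothetical, and because the object of the original fileA assertion is not shown in the slide, `None` is used as a stand-in.

```python
def signature_groups(triples):
    """Group subjects by the set of properties asserted about them."""
    props = {}
    for subject, prop, _obj in triples:
        props.setdefault(subject, set()).add(prop)
    groups = {}
    for subject, sig in props.items():
        groups.setdefault(frozenset(sig), []).append(subject)
    return {sig: sorted(files) for sig, files in groups.items()}


# Before the edit: fileA uses a property named "animation".
before = [
    ("fileA", "animation", None),  # object not shown in the slide; None is a stand-in
    ("fileB", "plottype", "animation"),
]

# After the edit: both files assert plottype = animation.
after = [
    ("fileA", "plottype", "animation"),
    ("fileB", "plottype", "animation"),
]

groups_before = signature_groups(before)
groups_after = signature_groups(after)
```

Before the change the two files fall into separate groups; afterwards they share one, with no schema change, tuple transfer, or SQL rewrite.
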

17
Benefits
  • Narrow interface between data creators and data
    managers
  • Metadata exploitable prior to finalizing a
    thorough schema
  • Derived schema can adapt to changing requirements
    automatically
  • Profiles constitute emergent semantics: meaning
    is assigned after data is collected.