Title: iTrails: Pay-as-you-go Information Integration in Dataspaces


1
iTrails: Pay-as-you-go Information Integration in Dataspaces
  • Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
  • ETH Zurich
  • VLDB 2007

2
Outline
  • Motivation
  • iTrails
  • Experiments
  • Conclusions and Future Work

3
Problem: Querying Several Sources
Query: What is the impact of global warming in Zurich?
Systems: ?
Data Sources: Email Server, Web Server, DB Server, Laptop
4
Solution 1: Use a Search Engine
Query: global warming zurich
System: Graph / IR Search Engine (TopX VLDB05, FleXPath SIGMOD04, XSearch VLDB03, XRank SIGMOD03)
Drawback: Query semantics are not precise!
Data Sources (each contributes text, links): Email Server, Web Server, DB Server, Laptop
5
Solution 2: Use an Information Integration System
Query: //Temperatures/*[city = "zurich"]
System: Information Integration System, e.g. GAV (e.g. ICDE95), LAV (e.g. VLDB96), GLAV (AAAI99), P2P (e.g. SIGMOD04)
Drawback: two of the four source connections are labeled "schema mapping", the other two "missing schema mapping"
Data Sources: Email Server, Web Server, DB Server, Laptop
6
Research Challenge: Is There an Integration Solution in-between These Two Extremes?
Query: global warming zurich
System: Dataspace System (?)
Pay-as-you-go Information Integration
Dataspace Vision by Franklin, Halevy, and Maier, SIGMOD Record 05
Data Sources (each contributes text, links): Email Server, Web Server, DB Server, Laptop
7
Outline
  • Motivation
  • iTrails
  • Experiments
  • Conclusions and Future Work

8
iTrails Core Idea: Add Integration Hints Incrementally
  • Step 1: Provide a search service over all the data
  • Use a general graph data model (see VLDB 2006)
  • Works for unstructured documents, XML, and relations
  • Step 2: Add integration semantics via hints (trails) on top of the graph
  • Works across data sources, not only between sources
  • Step 3: If more semantics are needed, go back to step 2
  • Impact
  • Smooth transition between search and data integration
  • Semantics added incrementally improve precision / recall

9
iTrails: Defining Trails
  • Basic Form of a Trail
  • QL[.CL] → QR[.CR]
  • Intuition
  • When I query for QL[.CL], you should also query for QR[.CR]

QL, QR: queries (NEXI-like keyword and path expressions)
CL, CR: attribute projections
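
A trail's two sides can be held in a small data structure. The following is a minimal sketch, not the iMeMex implementation: the class names `Side` and `Trail` and their fields are hypothetical, and queries are kept as plain strings rather than parsed expressions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Side:
    """One side of a trail: a query plus an optional attribute projection."""
    query: str                         # keyword or path expression
    projection: Optional[str] = None   # attribute projection, if any

    def __str__(self) -> str:
        if self.projection is None:
            return self.query
        return f"{self.query}.{self.projection}"


@dataclass(frozen=True)
class Trail:
    """Hypothetical container: querying for `left` should also query `right`."""
    left: Side
    right: Side

    def __str__(self) -> str:
        return f"{self.left} -> {self.right}"


# A keyword trail and an attribute trail, both taken from later slides.
car_auto = Trail(Side("car"), Side("auto"))
emp_name = Trail(Side("//Employee//*", "tuple.empName"),
                 Side("//Person//*", "tuple.name"))
print(car_auto)
print(emp_name)
```

Keeping the projection separate from the query mirrors the two parts of each trail side, so a keyword trail simply leaves the projection empty.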
10
Trail Examples: Global Warming Zurich
Query: global warming zurich
  • Trail for Implicit Meaning: When I query for global warming, you should also query for Temperature data above 10 degrees
  • global warming → //Temperatures/*[celsius > 10]
  • Trail for an Entity: When I query for zurich, you should also query for references of zurich as a region
  • zurich → //*[region = "ZH"]

DB Server, table Temperatures:
city    celsius  region  date
Bern    20       BE      24-Sep
Uster   15       ZH      24-Sep
Zurich  14       ZH      25-Sep
Zurich  9        ZH      26-Sep
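
The effect of these two trails on the Temperatures table can be illustrated with a few lines of Python. This is only a sketch under assumptions: the rows are the values as read from the slide's table, the dictionary `trails` and the function `answer` are hypothetical, and only the right sides of the trails are evaluated (the union with the original keyword search is left out).

```python
# Rows of the Temperatures table as read from the slide.
rows = [
    {"city": "Bern",   "celsius": 20, "region": "BE", "date": "24-Sep"},
    {"city": "Uster",  "celsius": 15, "region": "ZH", "date": "24-Sep"},
    {"city": "Zurich", "celsius": 14, "region": "ZH", "date": "25-Sep"},
    {"city": "Zurich", "celsius": 9,  "region": "ZH", "date": "26-Sep"},
]

# Hypothetical encoding: each keyword phrase maps to a predicate over a row.
trails = {
    "global warming": lambda r: r["celsius"] > 10,   # temperature above 10 degrees
    "zurich": lambda r: r["region"] == "ZH",          # zurich as a region
}


def answer(phrases, table):
    """Return the rows that satisfy the trail predicate of every phrase."""
    predicates = [trails[p] for p in phrases if p in trails]
    return [row for row in table if all(pred(row) for pred in predicates)]


hits = answer(["global warming", "zurich"], rows)
for row in hits:
    print(row["city"], row["celsius"], row["date"])
```

With these rows, the Bern reading is dropped by the region trail and the 9-degree Zurich reading by the temperature trail.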
11
Trail Example: Deep Web Bookmarks
Web Server
Query: train home
  • Trail for a Bookmark: When I query for train home, you should also query for the TrainCompany's website with origin at ETH Uni and destination at Seilbahn Rigiblick
  • train home → //trainCompany.com//*[origin = "ETH Uni" and dest = "Seilbahn Rigiblick"]
12
Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
Email Server, Laptop
  • Trail for Thesauri: When I query for car, you should also query for auto
  • car → auto
  • Trails for Dictionary: When I query for car, you should also query for carro and vice-versa
  • car → carro, carro → car
13
Trail Examples: Schema Equivalences
DB Server
  • Relation Employee: empName, empId, salary
  • Relation Person: name, age, SSN, income
  • Trail for schema match on names: When I query for Employee.empName, you should also query for Person.name
  • //Employee//*.tuple.empName → //Person//*.tuple.name
  • Trail for schema match on salaries: When I query for Employee.salary, you should also query for Person.income
  • //Employee//*.tuple.salary → //Person//*.tuple.income
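
As an illustration of how such attribute trails widen a projection, here is a minimal sketch. The sample tuples are invented for the example, and `TRAILS`, `DATA`, and `project` are hypothetical names; the trails are treated as one-directional, as written on the slide.

```python
# Invented sample tuples for the two relations named on the slide.
employees = [{"empName": "Ada", "empId": 1, "salary": 100}]
persons = [{"name": "Bob", "age": 40, "SSN": "000", "income": 80}]

DATA = {"Employee": employees, "Person": persons}

# The two schema-match trails, encoded as (relation, attribute) pairs.
TRAILS = {
    ("Employee", "empName"): ("Person", "name"),
    ("Employee", "salary"): ("Person", "income"),
}


def project(relation, attribute):
    """Return values of the attribute, plus values reached through a trail."""
    targets = [(relation, attribute)]
    if (relation, attribute) in TRAILS:
        targets.append(TRAILS[(relation, attribute)])
    return [row[attr] for rel, attr in targets for row in DATA[rel]]


print(project("Employee", "empName"))  # also reads Person.name
print(project("Person", "name"))       # no trail starts at Person.name
```

Because the trails only point from Employee to Person, a query on Person.name is not widened; a match in both directions would need a second pair of trails.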
14
Outline
  • Motivation
  • iTrails: Core Idea, Trail Examples, How are Trails Created?, Uncertainty and Trails, Rewriting Queries with Trails, Recursive Matches
  • Experiments
  • Conclusion and Future Work

15
How are Trails Created?
  • Given by the user
  • Explicitly
  • Via Relevance Feedback
  • (Semi-)Automatically
  • Information extraction techniques
  • Automatic schema matching
  • Ontologies and thesauri (e.g., WordNet)
  • User communities (e.g., trails on gene data,
    bookmarks)

16
Uncertainty and Trails
  • Probabilistic Trails
  • model uncertain trails
  • probabilities used to rank trails
  • QL[.CL] → QR[.CR] with probability p, 0 ≤ p ≤ 1
  • Example: car → auto, p = 0.8
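
Ranking trails by probability could look like the following sketch. Only the trail from car to auto with p = 0.8 comes from the slide; the other two trails, their probabilities, and the function name `ranked_expansions` are made up for illustration.

```python
# (left side, right side, probability); only the first entry is from the slide.
trails = [
    ("car", "auto", 0.8),
    ("car", "carro", 0.6),   # hypothetical probability
    ("car", "cart", 0.1),    # hypothetical low-quality trail
]


def ranked_expansions(keyword, candidates, min_p=0.0):
    """Return (right side, p) pairs for matching trails, most probable first."""
    matches = [(rhs, p) for lhs, rhs, p in candidates
               if lhs == keyword and p >= min_p]
    return sorted(matches, key=lambda m: m[1], reverse=True)


print(ranked_expansions("car", trails))
print(ranked_expansions("car", trails, min_p=0.5))
```

A threshold such as `min_p` is one possible way to use the ranking; the slides themselves only state that probabilities are used to rank trails.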
17
Certainty and Trails
  • Scored Trails
  • give higher value to certain trails
  • scoring factors used to boost scores of query results obtained by the trail
  • QL[.CL] → QR[.CR] with scoring factor sf, sf > 1
  • Examples
  • T1: weather → //Temperatures/*, p = 0.9, sf = 2
  • T2: yesterday → //*[date = today() - 1], p = 1, sf = 3
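
One way to picture a scoring factor is as a multiplier on the scores of results that were reached through a trail. The sketch below assumes exactly that, and further assumes that a result found both ways keeps its higher score; neither rule is confirmed by the slides, and the function names and sample scores are hypothetical.

```python
def boost(results, sf):
    """Multiply every score of trail-derived results by the scoring factor."""
    return {doc: score * sf for doc, score in results.items()}


def merge(original, via_trail):
    """Combine both result sets; a document found twice keeps its best score."""
    merged = dict(original)
    for doc, score in via_trail.items():
        merged[doc] = max(merged.get(doc, 0.0), score)
    return merged


# Hypothetical scores: results of the plain query and of a trail's right side.
plain = {"a": 0.75, "b": 0.125}
from_trail = {"b": 0.25, "c": 0.5}

combined = merge(plain, boost(from_trail, sf=2))
ranking = sorted(combined, key=combined.get, reverse=True)
print(combined)
print(ranking)
```

In this example the boosted trail result "c" overtakes the best plain result "a", which is the intended effect of giving certain trails a higher value.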
18
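Rewriting Queries with Trails
Query: weather yesterday
Trail T2: yesterday → //*[date = today() - 1]
  • (1) Matching: T2 matches the keyword yesterday in the query
  • (2) Transformation: the right side of T2, //*[date = today() - 1], is generated
  • (3) Merging: the matched keyword yesterday is combined by union (U) with //*[date = today() - 1], while weather stays unchanged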
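
The three rewriting steps can be sketched on a tiny query tree. This is an illustration under assumptions, not the iMeMex rewriter: the tree is a nested tuple, the operator labels "and" and "or" are hypothetical stand-ins for the operators drawn on the slide, and matching is plain string equality on leaves.

```python
def rewrite(node, lhs, rhs, replace=False):
    """Apply one trail (lhs -> rhs) once to every leaf that equals lhs."""
    if isinstance(node, tuple):
        op, *children = node
        return (op, *(rewrite(c, lhs, rhs, replace) for c in children))
    if node == lhs:                        # (1) matching
        transformed = rhs                  # (2) transformation
        if replace:
            return transformed             # replace semantics
        return ("or", node, transformed)   # (3) merging by union
    return node


QUERY = ("and", "weather", "yesterday")
T2_LHS = "yesterday"
T2_RHS = "//*[date = today() - 1]"

merged = rewrite(QUERY, T2_LHS, T2_RHS)
print(merged)
```

The function does not descend into the subtree it has just created, so each call applies the trail exactly once; repeated application is a separate concern.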
19
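Replacing Trails
  • Trails that use replace instead of union semantics

Query: weather yesterday
Trail T2 (replace semantics): yesterday is replaced by //*[date = today() - 1]
  • (1) Matching: T2 matches the keyword yesterday
  • (2) Transformation: //*[date = today() - 1] is generated
  • (3) Merging: yesterday is replaced, so the rewritten query combines weather with //*[date = today() - 1] only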
20
Problem: Recursive Matches (1/2)
Trail T2: yesterday → //*[date = today() - 1]
  • T2 matches yesterday in the query weather yesterday; merging yields weather combined with the union (U) of yesterday and //*[date = today() - 1]
  • New query still matches T2, so T2 could be applied again
  • Each further application adds another union with //*[date = today() - 1] around yesterday
  • Infinite recursion
21
Problem: Recursive Matches (2/2)
Trails may be mutually recursive
T3: //*.tuple.date → //*.tuple.modified
T10: //*.tuple.modified → //*.tuple.date
  • T3 matches //*[date = today() - 1] in the rewritten query and adds //*[modified = today() - 1]
  • T10 matches //*[modified = today() - 1] and adds //*[date = today() - 1]
  • We again match T3 and enter an infinite loop
22
Solution: Multiple Match Coloring Algorithm
Query: weather yesterday
T1: weather → //Temperatures/*
T2: yesterday → //*[date = today() - 1]
T3: //*.tuple.date → //*.tuple.modified
T4: //*.tuple.date → //*.tuple.received
  • First Level: T1 matches weather and T2 matches yesterday; the query gains //Temperatures/* and //*[date = today() - 1]
  • Second Level: T3 and T4 match //*[date = today() - 1]; the query gains //*[modified = today() - 1] and //*[received = today() - 1]
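
The level-by-level behavior can be approximated with a small sketch. It is a simplification in the spirit of the coloring idea and should not be read as the algorithm from the paper: the query is a flat list of atoms rather than a tree, each node carries the names of the trails applied on the way to it (its "colors"), a trail is never applied to a node that already carries its color, and only nodes produced in the previous level are examined. The names `Node`, `TRAILS`, `apply_trail`, and `mmca` and the tuple encoding of atoms are assumptions, and duplicate atoms are not removed.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Node(NamedTuple):
    atom: Tuple[str, ...]     # ("kw", word), ("path", expr) or ("attr", name, value)
    colors: FrozenSet[str]    # names of trails applied on the way to this node


# Trails T1..T4 from the slide: (left-side prefix, right-side prefix).
TRAILS = {
    "T1": (("kw", "weather"), ("path", "//Temperatures/*")),
    "T2": (("kw", "yesterday"), ("attr", "date", "today() - 1")),
    "T3": (("attr", "date"), ("attr", "modified")),
    "T4": (("attr", "date"), ("attr", "received")),
}


def apply_trail(name, trails, node):
    """Return the node produced by the trail, or None if it may not be applied."""
    lhs, rhs = trails[name]
    if name in node.colors or node.atom[:len(lhs)] != lhs:
        return None
    # Keep whatever follows the matched prefix, e.g. the compared value.
    return Node(rhs + node.atom[len(lhs):], node.colors | {name})


def mmca(keywords, trails, max_levels=10):
    """Expand the query level by level until no trail produces a new node."""
    query = [Node(("kw", k), frozenset()) for k in keywords]
    frontier = list(query)
    for _ in range(max_levels):
        produced = []
        for node in frontier:
            for name in trails:
                new = apply_trail(name, trails, node)
                if new is not None:
                    produced.append(new)
        if not produced:
            break
        query.extend(produced)
        frontier = produced
    return query


for node in mmca(["weather", "yesterday"], TRAILS):
    print(node.atom, sorted(node.colors))
```

With a mutually recursive pair such as T3 and a reverse trail from modified to date, the color check stops T3 from firing again on a date node that was itself derived through T3, so the expansion ends after a few levels.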
23
Multiple Match Coloring Algorithm: Analysis
  • Problem: MMCA is exponential in the number of levels
  • Solution: Trail Pruning
  • Prune by number of levels
  • Prune by top-K trails matched in each level
  • Prune by both top-K trails and number of levels
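
The three pruning options can be expressed with one small function. This is a sketch under assumptions: matched trails are given per level as (name, probability) pairs, the top-K choice ranks by probability, and the probabilities shown for T3 and T4 are invented for the example.

```python
def prune(matches_per_level, max_levels=None, top_k=None):
    """Keep at most max_levels levels and the top_k most probable trails per level."""
    kept = []
    for level in matches_per_level[:max_levels]:   # None keeps all levels
        ranked = sorted(level, key=lambda m: m[1], reverse=True)
        kept.append(ranked[:top_k])                # None keeps all trails
    return kept


# Trails matched in each level; T3/T4 probabilities are hypothetical.
matched = [
    [("T1", 0.9), ("T2", 1.0)],
    [("T3", 0.5), ("T4", 0.7)],
]

print(prune(matched, max_levels=1))            # prune by number of levels
print(prune(matched, top_k=1))                 # prune by top-K per level
print(prune(matched, max_levels=1, top_k=1))   # prune by both
```

Bounding both the number of levels and the number of trails per level bounds the total number of trail applications, which is what keeps the rewrite time under control.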

24
Outline
  • Motivation
  • iTrails
  • Experiments
  • Conclusion and Future Work

25
iTrails Evaluation in iMeMex
  • iMeMex Dataspace System: open-source prototype available at http://www.imemex.org
  • Main Questions in Evaluation
  • Quality: Top-K Precision and Recall
  • Performance: Use of Materialization
  • Scalability: Query-rewrite Time vs. Number of Trails

26
iTrails Evaluation in iMeMex
  • Scenario 1: Few High-quality Trails
  • Closer to information integration use cases
  • Obtained real datasets and indexed them
  • 18 hand-crafted trails
  • 14 hand-crafted queries
  • Scenario 2: Many Low-quality Trails
  • Closer to search use cases
  • Generated up to 10,000 trails

27
iTrails Evaluation in iMeMex: Scenario 1
  • Configured iMeMex to act in three modes
  • Baseline: Graph / IR search engine
  • iTrails: Rewrite search queries with trails
  • Perfect Query: Semantics-aware query
  • Data shipped to central index

Table (values not preserved in this transcript): dataset sizes in MB for Email Server, Laptop, Web Server, DB Server
28
Quality: Top-K Precision and Recall
Scenario 1: few high-quality trails (18 trails), K = 20
  • Perfect Query always has precision and recall equal to 1
  • Search Query is partially semantics-aware
  • Search Engine misses relevant results
Queries
  • Q3: pdf yesterday
  • Q13: to raimund.grube_at_enron.com
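
For reference, top-K precision and recall can be computed as in the sketch below. It assumes the common definitions (precision over the returned top-K list, recall over all relevant items); the slides do not spell out the exact formulas used in the evaluation, and the document identifiers are invented.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (precision, recall) of the first k ranked results."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5", "d6"}

print(precision_recall_at_k(ranked, relevant, k=4))
# A result list that is exactly the relevant set scores 1 on both measures.
print(precision_recall_at_k(sorted(relevant), relevant, k=20))
```

The second call mirrors the statement that the perfect query always has precision and recall equal to 1.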
29
Performance: Use of Materialization
Scenario 1: few high-quality trails (18 trails), response times in sec.
  • Trail merging adds overhead to query execution
  • Trail Materialization provides interactive times for all queries
30
Scalability: Query-rewrite Time vs. Number of Trails
Scenario 2: many low-quality trails
  • Query-rewrite time can be controlled with pruning
31
Conclusion: Pay-as-you-go Information Integration
Figure: the query global warming zurich is posed to the Dataspace System over the Data Sources (text, links)
  • Step 1: Provide a search service over all the data
  • Step 2: Add integration semantics via trails
  • Step 3: If more semantics are needed, go back to step 2
  • Our Contributions
  • iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
  • We propose a framework and algorithms for Pay-as-you-go Information Integration
  • Smooth transition between search and data integration

32
Future Work
  • Trail Creation
  • Use collections (ontologies, thesauri, Wikipedia)
  • Work on automatic mining of trails from the
    dataspace
  • Other types of trails
  • Associations
  • Lineage

33
Questions? Thanks in advance for your feedback!
marcos.vazsalles_at_inf.ethz.ch
http://www.imemex.org
34
Backup Slides
35
Problem: Global Warming in Zurich
  • Query: What is the impact of global warming in Zurich?
  • Search for
  • global warming zurich
  • Meaning of keyword query
  • global warming should lead to a query on Temperatures
  • zurich should lead to a query for a city

36
Problem: PDF Yesterday
  • Query: Retrieve all PDF documents added/modified yesterday
  • Search for
  • pdf yesterday
  • Meaning of keywords pdf and yesterday
  • Different sources, different schemas
  • Laptop: modified
  • Email: received
  • DBMS: changed

37
Related Work: Search vs. Data Integration vs. Dataspaces

                    Integration Solution
Features            Search              Dataspaces          Data Integration
Integration Effort  Low                 Pay-as-you-go       High
Query Semantics     Precision / Recall  Precision / Recall  Precise
Need for Schema     Schema-never        Schema-later        Schema-first
38
Personal Dataspaces Literature
  • Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September 2005.
  • Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006.
  • Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August 2006.
  • Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January 2007.
  • Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW 2007, March 2007.
  • Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB, September 2007.

39
iDM: iMeMex Data Model
  • Our approach: get the data model closer to personal information, not the other way around
  • Supports
  • Unstructured, semi-structured and structured data, e.g., files and folders, XML, relations
  • Clear separation of logical and physical representation of data
  • Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
  • Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
  • Infinite data, e.g., media and data streams

See VLDB 2006
40
Data Model Options
Data models compared: Bag of Words, Relational, XML, iDM
Support for personal data, criteria:
  • Non-schematic data
  • Serialization independent
  • Support for graph data
  • Support for lazy computation
  • Support for infinite data
Cell annotations on the slide (their positions are not preserved in this transcript):
  • Extension: XLink/XPointer
  • Specific schema
  • View mechanism
  • Extension: ActiveXML
  • Extension: Document streams
  • Extension: Relational streams
  • Extension: XML streams
41
Data Models for Personal Information
Figure: data models arranged by abstraction level, from lower to higher
42
Architectural Perspective of iMeMex
  • Complex operators (query algebra)
  • Indexes and replicas access (warehousing)
  • Data source access (mediation)