Query Relaxation Using Malleable Schema - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Query Relaxation Using Malleable Schema
  • Xuan Zhou, Julien Gaugaz, Wolf-Tilo Balke, Wolfgang Nejdl
  • L3S Research Center, Leibniz University Hanover, Hannover, Germany

2
Data in Real-world Information Systems
  • Examples:
  • Personal information system (desktop)
  • Enterprise information system
  • World Wide Web
  • Characteristics:
  • A lot of text (unstructured information)
  • A lot of structure, e.g. title, author, creation-date
  • Heterogeneity in structure
  • Different holders (applications) use different
    schemas
  • By nature, the structure of a domain is too
    complex for us to give it a clear and certain
    definition, e.g. types of service departments:
    long-term, short-term, vacation

3
Two Approaches to Search
  • Information Retrieval (IR):
  • Keywords as query: easy to use
  • No structures: less expressive
  • Can be applied to heterogeneous data: flexible
  • Database Systems (DB):
  • Structured queries: more expressive, powerful
  • Weak in dealing with heterogeneity: inflexible

4
Dealing with Heterogeneity in DB
  • Traditional Data Integration:
  • Transform data into a clear and uniform structure
    before we use it
  • Cannot avoid human intervention: very laborious
    and not scalable
  • New idea: Malleable Schema
  • Allow overlapping and vague elements to be
    defined in a single schema
  • Automatically capture and quantify the
    correlations between schema elements, though not
    perfectly
  • Properly relax the user query using schema
    correlations to obtain best-effort results
  • A trade-off between accuracy and automation

5
Example Data in a Malleable Schema
[Figure: example entity graph in a malleable schema.
Entity types: Person, Doc.
Attribute and relationship labels: first name, sur name, name, title,
subject, body, contents, date, author, writer, sender, attachment, email,
Isa book, Isa paper.
Sample values: Jack, Pan, John Gary, "xml search", "My paper",
"Xml is the standard for data exchange.",
"Dear Sergey, Please find attached the file.",
"Desktop Search. We have many data.", 25.03.2006, True, False.]

6
In This Talk
  • How to query using a malleable schema?
  • How to discover the schema correlations to enable
    query relaxation?
  • How to perform query relaxation efficiently?
  • Some experimental results.

7
The Need for Query Relaxation
  • For example, a user issues the query
    Q1: Select Person Where first_name Contains "Philip"
  • To obtain the complete results, we should relax
    the query to
    Q2: Select Person Where first_name Contains "Philip"
    Or name Contains "Philip"
  • Principle 1: a query has to be relaxed to related
    schema elements.
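
Principle 1 can be sketched in a few lines. This is only an illustration, not the authors' implementation: the `RELATED` map and the `relax` helper are hypothetical names, and the related attributes are listed by hand rather than discovered.

```python
# Hypothetical map from an attribute to related schema elements.
# The entries are illustrative assumptions, not mined correlations.
RELATED = {
    "first_name": ["name"],
}


def relax(entity, attribute, keyword, related=RELATED):
    """Build a query string that also searches attributes related to `attribute`."""
    attributes = [attribute] + related.get(attribute, [])
    conditions = [f'{a} Contains "{keyword}"' for a in attributes]
    return f"Select {entity} Where " + " Or ".join(conditions)


q2 = relax("Person", "first_name", "Philip")
print(q2)
```

With the map above, `q2` has the same shape as Q2 on this slide.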

8
The Need for Ranking Query Results
  • Example persons: James M. Philip vs. Philip Bernstein
  • Which query's results should be returned first?
    Q2.1: Select Person Where first_name Contains "Philip"
    Q2.2: Select Person Where name Contains "Philip"
  • To return more relevant results first, we perform
    Q2.1 prior to Q2.2.
  • Principle 2: relaxed queries should be performed
    in a sequence based on their likelihood of
    returning correct answers.

9
A Model for Query Relaxation
  • Given a query Q: Select E Where A1 Contains
    a1 And ... And Ak Contains ak
  • If Q' is a relaxed query of Q,
  • then the probability that Q' returns correct
    results for Q is computed from the correlations
    between the schema elements of Q and Q'
    [formula not preserved in transcript]
  • How to assess the correlation between schema
    elements?
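
The scoring formula itself is missing from this transcript, so the following is a sketch under a stated assumption rather than the model from the talk: it multiplies per-element correlations, which treats the schema elements as independent. The `CORRELATION` values, the `Author` entity type and the function name are all hypothetical.

```python
from math import prod

# Hypothetical correlation values: CORRELATION[(original, relaxed)] is the
# assumed probability that `relaxed` carries what the user meant by `original`.
CORRELATION = {
    ("first_name", "first_name"): 1.0,
    ("first_name", "name"): 0.6,
    ("Person", "Person"): 1.0,
    ("Person", "Author"): 0.8,
}


def relaxation_score(original, relaxed, correlation=CORRELATION):
    """Score a relaxed query as the product of per-element correlations.

    `original` and `relaxed` are equal-length lists of schema elements
    (entity type first, then attributes). Multiplying assumes the
    elements are independent, which is an assumption of this sketch.
    """
    return prod(correlation.get(pair, 0.0) for pair in zip(original, relaxed))


q = ["Person", "first_name"]
candidates = [["Person", "name"], ["Person", "first_name"], ["Author", "name"]]
# Most promising relaxation first.
ranked = sorted(candidates, key=lambda c: relaxation_score(q, c), reverse=True)
```

Any other monotone combination of the correlations could be swapped in for `prod` without changing the ranking code.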

10
Roadmap
  • How to query using a malleable schema? Answer:
    query relaxation.
  • How to discover the schema correlations to enable
    query relaxation?
  • How to perform query relaxation efficiently?
  • Some experimental results.

11
Discovering Schema Correlations
  • Task: for each pair of attributes (relationships)
    A and A', assess the conditional probability
    relating them
  • For example: [two example formulas not preserved
    in transcript]
  • A straightforward solution:
    1. find a number of entities where A and A' co-occur;
    2. statistically estimate the expected overlap
    between A and A'.
  • Problem: correlated attributes (relationships)
    seldom co-occur.
  • For example, no entity uses first_name and name
    simultaneously.
  • Remedy: find duplicates that use different
    attributes.
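
The remedy can be illustrated with a small estimator over known duplicate pairs. This is a sketch under assumptions: records are plain dictionaries, the overlap test is a case-insensitive substring check, and the function name and sample records are made up for illustration.

```python
def estimate_correlation(duplicate_pairs, attr_a, attr_b):
    """Estimate how often attr_a in one record matches attr_b in its duplicate.

    Returns matched / comparable, or None when no duplicate pair uses
    attr_a on one side and attr_b on the other.
    """
    comparable = matched = 0
    for left, right in duplicate_pairs:
        for x, y in ((left, right), (right, left)):
            if attr_a in x and attr_b in y:
                comparable += 1
                a_val, b_val = x[attr_a].lower(), y[attr_b].lower()
                # Assumed overlap test: the attr_a value appears inside the attr_b value.
                if a_val in b_val:
                    matched += 1
    return matched / comparable if comparable else None


# Illustrative duplicate pairs; each pair describes the same person twice.
pairs = [
    ({"first_name": "Philip", "sur_name": "Bernstein"}, {"name": "Philip Bernstein"}),
    ({"first_name": "James"}, {"name": "James M. Philip"}),
    ({"first_name": "Jack"}, {"name": "J. Pan"}),
]
estimate = estimate_correlation(pairs, "first_name", "name")
```

The estimate only becomes meaningful with many duplicate pairs, which is why finding duplicates is the next problem the talk addresses.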

12
Duplicate Detection in Malleable Schema
  • Problem: existing duplicate detection algorithms
    assume a uniform schema, so they do not work here.
  • Observations:
    1. more duplicates lead to better schema
    correlation discovery;
    2. more accurate schema correlations lead to
    better duplicate detection.
  • Solution: let schema correlation discovery and
    duplicate detection reinforce each other to
    achieve improved results.

13
An Example
  • Duplicates: (E1, E2), (E3, E4), (E5, E6)
  • Attribute matches: (title, subject), (author,
    writer), (pub-date, rec-date)

14
The DSCD Algorithm
  • DSCD stands for Duplication and Schema
    Correlation Discovery
  • Suppose there are n possible duplicates and m
    possible correlated attributes
  • D: belief vector of duplicate candidates
  • C: belief vector of attribute match candidates
  • Let S be an evidence matrix relating attribute
    match candidates to duplicate candidates

15
The DSCD Algorithm contd.
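  • When attribute matches are known, the beliefs in
    duplicates D can be updated from C and S
  • When duplicates are known, the beliefs in
    attribute matches C can be updated from D and S
  • Repeat the process until C and D converge
  • Very similar to the HITS algorithm (a link-analysis
    method related to PageRank). It can be proved
    that the belief vectors converge to a principal
    eigenvector, as in HITS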
16
Revisit the Example
  • D: (E1, E2) 1.0; (E3, E4) 0.89; (E5, E6) 0.28
  • C: (title, subject) 1.0; (author, writer) 1.0;
    (pub-date, rec-date) 0.56
  • S:
                          (E1, E2)  (E3, E4)  (E5, E6)
    (title, subject)      1         0.7       0
    (author, writer)      1         0.7       0
    (pub-date, rec-date)  0         0.7       1
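
The update formulas are not preserved in this transcript, so the sketch below assumes HITS-style updates: D is set proportional to the transpose of S times C, C proportional to S times D, and each vector is rescaled so that its largest entry is 1. The function name `dscd` and the rescaling rule are assumptions. Under them, C and D approach the principal eigenvectors of S·Sᵀ and Sᵀ·S, and on the S matrix of this slide the iteration should reproduce the listed D and C values to two decimals.

```python
def dscd(S, iterations=50):
    """Mutually reinforce beliefs in attribute matches (C) and duplicates (D).

    S[i][j] is the evidence linking attribute match i to duplicate pair j.
    Assumed updates (HITS-style): D <- S^T C, then C <- S D, each rescaled
    so that its largest entry is 1.
    """
    m, n = len(S), len(S[0])
    C = [1.0] * m
    D = [1.0] * n
    for _ in range(iterations):
        D = [sum(S[i][j] * C[i] for i in range(m)) for j in range(n)]
        top = max(D)
        D = [d / top for d in D]
        C = [sum(S[i][j] * D[j] for j in range(n)) for i in range(m)]
        top = max(C)
        C = [c / top for c in C]
    return C, D


# Evidence matrix from the slide: rows are attribute matches
# (title/subject, author/writer, pub-date/rec-date), columns are
# duplicate pairs (E1,E2), (E3,E4), (E5,E6).
S = [
    [1.0, 0.7, 0.0],
    [1.0, 0.7, 0.0],
    [0.0, 0.7, 1.0],
]
C, D = dscd(S)
```

A fixed iteration count keeps the sketch short; a convergence check on the change in C and D would be the natural stopping rule.
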
17
Steps to Discover and Quantify Schema Correlations
  • Step 1: find all pairs of attributes that are
    possibly correlated (using schema matching
    techniques).
  • Step 2: find a number of duplicate candidates
    (find similar entities based on TFxIDF).
  • Step 3: run the DSCD algorithm on the
    pre-selected duplicate candidates and attribute
    match candidates.
  • Step 4: use the resulting belief vector of
    attribute matches C to identify more
    duplicates.
  • Step 5: use the duplicates to statistically
    quantify the correlations.
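
Step 2 can be sketched with a plain TFxIDF cosine similarity. The assumptions here are loud ones: each entity is flattened into one whitespace-separated string of its attribute values, the weighting is raw term frequency times log(N/df), and the threshold and sample entities are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TFxIDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def duplicate_candidates(docs, threshold):
    """Return index pairs whose TFxIDF cosine similarity reaches the threshold."""
    vecs = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine(vecs[i], vecs[j]) >= threshold
    ]


entities = [
    "titanic james cameron 1997",
    "titanic cameron dvd",
    "desktop search paper",
]
candidates = duplicate_candidates(entities, threshold=0.1)
```

Comparing all pairs is quadratic; on collections the size of those used in the experiments an inverted index or blocking step would be needed first.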

18
Roadmap
  • How to query using a malleable schema? Answer:
    query relaxation.
  • How to discover the schema correlations to enable
    query relaxation? Answer: DSCD algorithm.
  • How to perform query relaxation efficiently?
  • Some experimental results.

19
Query Processing
  • Given a query Q
  • Find all relaxed queries Q1, Q2, ..., Qn, using
    schema correlations
  • Order the relaxed queries by their probability of
    returning correct results
  • Execute the queries progressively
  • Two optimizations can be made.
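
The processing loop above can be sketched as a generator. This is an illustration only: `run_query` stands for whatever executes a query against the store, and the toy `FAKE_DB` and its probabilities are invented.

```python
def execute_progressively(relaxed_queries, run_query):
    """Yield results of relaxed queries, most promising query first.

    `relaxed_queries` is a list of (probability, query) pairs and
    `run_query` is any callable returning an iterable of result ids.
    Results already returned by an earlier query are skipped.
    """
    seen = set()
    for prob, query in sorted(relaxed_queries, key=lambda pq: pq[0], reverse=True):
        for result in run_query(query):
            if result not in seen:
                seen.add(result)
                yield result, prob


# Toy stand-in for a database: query text -> matching entity ids.
FAKE_DB = {
    "first_name Contains Philip": ["Philip Bernstein"],
    "name Contains Philip": ["Philip Bernstein", "James M. Philip"],
}
queries = [(0.6, "name Contains Philip"), (1.0, "first_name Contains Philip")]
results = list(execute_progressively(queries, lambda q: FAKE_DB.get(q, [])))
```

Because it is a generator, a caller can stop after the first few results and the lower-ranked relaxed queries are never executed.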

20
Query Processing Optimization I
[Figure: query graph with nodes E, A, a, R, E, B, b]
21
Query Processing Optimization I
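[Figure: query graph with nodes E, A, a, R, E, B, b, and the lattice of
relaxed queries, ordered by relaxation depth:]
(A,B,R)
(A1,B,R)  (A,B1,R)  (A,B,R1)
(A2,B,R)  (A1,B1,R)  (A1,B,R1)  (A,B1,R1)
(A2,B1,R)  (A2,B,R1)  (A1,B1,R1)
(A2,B1,R1)
  • Idea: an A* algorithm that takes advantage of this
    partial order can save much effort in evaluating
    relaxed queries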
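
The partial order in the lattice can be exploited with a best-first search that expands nodes lazily. The sketch below is a simplified stand-in for the algorithm named on the slide, not a reconstruction of it: it assumes a query's score is the product of per-element scores and that each element's relaxations are listed in decreasing score. The scores are invented.

```python
import heapq


def best_first(relaxations):
    """Yield (score, combination) in non-increasing score order.

    `relaxations` holds, per query element, a list of (name, score) pairs
    sorted by decreasing score, with the original element first. A node's
    children relax exactly one element one step further, so a child never
    outscores its parent and the heap can expand nodes lazily.
    """
    def score(idx):
        s = 1.0
        for options, i in zip(relaxations, idx):
            s *= options[i][1]
        return s

    start = (0,) * len(relaxations)
    heap = [(-score(start), start)]
    seen = {start}
    while heap:
        neg, idx = heapq.heappop(heap)
        yield -neg, tuple(opts[i][0] for opts, i in zip(relaxations, idx))
        for pos in range(len(idx)):
            if idx[pos] + 1 < len(relaxations[pos]):
                child = idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:]
                if child not in seen:
                    seen.add(child)
                    heapq.heappush(heap, (-score(child), child))


# Same shape as the lattice on the slide; the scores are hypothetical.
lattice = [
    [("A", 1.0), ("A1", 0.8), ("A2", 0.5)],
    [("B", 1.0), ("B1", 0.7)],
    [("R", 1.0), ("R1", 0.6)],
]
ordered = list(best_first(lattice))
```

Consuming only the first k items of the generator touches only the top of the lattice, which is where the savings come from.
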
22
Query Processing Optimization II
[Figure: query graph with nodes E, A, a, R, E, B, b]
  • The joining results in Q1 can be reused by Q2
  • Idea: cache the temporary results of relaxed
    queries
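
The caching idea can be illustrated with a small memoizing executor. This is a simplification under assumptions: it caches the result set of each single condition rather than intermediate join results, and the class name, the `TOY` data and the condition encoding are hypothetical.

```python
class CachedExecutor:
    """Evaluate conditions through a cache so shared parts run only once."""

    def __init__(self, evaluate):
        self._evaluate = evaluate   # callable: condition -> set of entity ids
        self._cache = {}
        self.evaluations = 0        # number of real (uncached) evaluations

    def condition(self, cond):
        """Return the cached result set for `cond`, evaluating it on first use."""
        if cond not in self._cache:
            self.evaluations += 1
            self._cache[cond] = frozenset(self._evaluate(cond))
        return self._cache[cond]

    def query(self, conditions):
        """Intersect the result sets of all conditions of a conjunctive query."""
        sets = [self.condition(c) for c in conditions]
        return frozenset.intersection(*sets)


# Toy data: (attribute, value) condition -> matching entity ids.
TOY = {
    ("A", "a"): {1, 2, 3},
    ("B", "b"): {2, 3, 4},
    ("A1", "a"): {3, 4},
}
ex = CachedExecutor(lambda cond: TOY.get(cond, set()))
q1 = ex.query([("A", "a"), ("B", "b")])
q2 = ex.query([("A1", "a"), ("B", "b")])  # reuses the cached ("B", "b") result
```

Two queries with two conditions each trigger only three real evaluations here, because the second query shares one condition with the first.
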
23
Roadmap
  • How to query using a malleable schema? Answer:
    query relaxation.
  • How to discover the schema correlations to enable
    query relaxation? Answer: DSCD algorithm.
  • How to perform query relaxation efficiently?
    Answer: A* algorithm, caching temporary query
    results.
  • Some experimental results.

24
Experiment setup
  • Dataset:
  • IMDB (movies, TV series): 850,000 items, 32
    attributes
  • Amazon (DVD, VHS): 115,000 items, 28 attributes
  • Setup:
  • DBMS: MySQL 4.1.11
  • Code: Java 5
  • Platform: CPU 2.7 GHz, RAM 1 GB

25
The First Run
  • Correlations between IMDB and Amazon attributes

26
Performance of DSCD Algorithm
  • We divide the Amazon data into two sets: a
    training set and a test set
  • We run the DSCD algorithm on the training set
  • We use the resulting C vector to detect
    duplicates in the Amazon test set and IMDB

27
Query Relaxation
  • Select from Amazon where keywords contains Titanic Disney
  • Select from Amazon where Title contains Titanic And Director contains Cameron

28
Comparison Against Keywords Search
  • Create queries using Title, Directors and
    Actors in Amazon and perform the queries on IMDB

29
Performance
  • Set thresholds on attribute correlations: 0.2,
    0.1 and 0.05, respectively

30
Thank you!