Query Relaxation Using Malleable Schema

Transcript and Presenter's Notes

1
Query Relaxation Using Malleable Schema

Xuan Zhou, Julien Gaugaz, Wolf-Tilo Balke, Wolfgang Nejdl
L3S Research Center, Leibniz University Hanover
Hannover, Germany

2
Data in Real-world Information Systems

Examples
Personal information system (desktop)
Enterprise information system
World Wide Web
Characteristics
A lot of text (unstructured information)
A lot of structure, e.g. title, author, creation-date
Heterogeneity in structure
Different holders (applications) use different schemas
By nature, the structure of a domain is too complex for us to give it a clear and certain definition, e.g. types of service departments: long-term, short-term, vacation

3
Two Approaches to Search

Information Retrieval (IR)
Keywords as queries: easy to use
No structure: less expressive
Can be applied to heterogeneous data: flexible
Database Systems (DB)
Structured queries: more expressive and powerful
Weak in dealing with heterogeneity: inflexible

4
Dealing with Heterogeneity in DB

Traditional Data Integration
Transform data into a clear and uniform structure before we use it
Cannot avoid human intervention: very laborious and not scalable
New idea: Malleable Schema
Allow overlapping and vague elements to be defined in a single schema
Automatically capture and quantify the correlations between schema elements, though not perfectly
Properly relax the user query using schema correlations to obtain best-effort results
A trade-off between accuracy and automation

5
Example Data in a Malleable Schema
[Figure: example data graph in a malleable schema. Labels in the figure: Person, Doc, email; first name, sur name, name, title, subject, body, contents, date, author, writer, sender, attachment, Isa book, Isa paper. Sample values include "Jack", "Pan", "John Gary", "xml search", "My paper", "Xml is the standard for data exchange.", "Dear Sergey, Please find attached the file.", and "25.03.2006".]

6
In This Talk

How to query using a malleable schema?
How to discover the schema correlations that enable query relaxation?
How to perform query relaxation efficiently?
Some experimental results.

7
The Need for Query Relaxation

For example, a user issues the query
Q1: Select Person Where first_name Contains "Philip"
To obtain the complete results, we should relax the query to
Q2: Select Person Where first_name Contains "Philip" Or name Contains "Philip"
Principle 1: a query has to be relaxed to related schema elements.
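
Principle 1 can be illustrated with a minimal Python sketch. The `RELATED` map and the `relax` helper are hypothetical names introduced here for illustration; the slides do not say how related elements are stored. Each condition is an (attribute, keyword) pair, and relaxation swaps an attribute for any attribute related to it.

```python
from itertools import product

# Hypothetical map from an attribute to its related attributes
# (the attribute itself comes first). Not taken from the slides.
RELATED = {
    "first_name": ["first_name", "name"],
}

def relax(conditions):
    """Return every variant of a conjunctive query obtained by swapping
    each attribute for one of its related attributes."""
    options = [
        [(alt, value) for alt in RELATED.get(attr, [attr])]
        for attr, value in conditions
    ]
    return [list(combo) for combo in product(*options)]

q1 = [("first_name", "Philip")]
relaxed = relax(q1)  # the original query plus its relaxed variants
```

Taking the union of the results of all variants corresponds to the disjunctive query Q2.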

8
The Need for Ranking Query Results

Example persons: James M. Philip, Philip Bernstein
Which query's results should be returned first?
Q2.1: Select Person Where first_name Contains "Philip"
Q2.2: Select Person Where name Contains "Philip"
To return more relevant results first, we perform Q2.1 before Q2.2.
Principle 2: relaxed queries should be performed in a sequence based on their likelihood of returning the right answers.

9
A Model for Query Relaxation

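Given a query Q: Select E Where A1 Contains a1 And ... And Ak Contains ak, namely [formula not preserved in transcript]
If Q' is a relaxed query of Q, then the probability that Q' returns correct results for Q is [formula not preserved in transcript]
How to assess the correlation between schema elements, namely [formula not preserved in transcript]?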
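
The probability formula on this slide did not survive extraction. The sketch below is therefore only one plausible reading, not the authors' definition: it assumes the attributes of a query are relaxed independently, so a relaxed query's score is the product of per-attribute conditional probabilities. The `COND_PROB` table, its values, and the function name are made up for illustration.

```python
from math import prod

# Hypothetical correlations: COND_PROB[(a, r)] stands for the probability
# that a match on the relaxed attribute r is a correct match for the
# original attribute a. The values are invented for illustration.
COND_PROB = {
    ("first_name", "first_name"): 1.0,
    ("first_name", "name"): 0.4,
}

def relaxed_query_score(original_attrs, relaxed_attrs):
    """Product of per-attribute conditional probabilities
    (independence between attributes is an assumption)."""
    return prod(
        COND_PROB.get((a, r), 0.0)
        for a, r in zip(original_attrs, relaxed_attrs)
    )

original = ["first_name"]
candidates = [["name"], ["first_name"]]
# Principle 2: run the most promising relaxed query first.
ranked = sorted(
    candidates,
    key=lambda r: relaxed_query_score(original, r),
    reverse=True,
)
```

Under these made-up numbers the unrelaxed query on first_name is ranked ahead of the relaxed query on name, which matches the ordering of Q2.1 before Q2.2.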

10
Roadmap

How to query using a malleable schema? Answer: query relaxation.
How to discover the schema correlations that enable query relaxation?
How to perform query relaxation efficiently?
Some experimental results.

11
Discovering Schema Correlations

Task: for each pair of attributes (relationships) A and A', assess the conditional probability [formula not preserved in transcript]
For example, [formula not preserved in transcript] and [formula not preserved in transcript]
A straightforward solution:
1. Find a number of entities where A and A' co-occur.
2. Statistically estimate the expected overlap between A and A'.
Problem: correlated attributes (relationships) seldom co-occur. For example, no entity uses first_name and name simultaneously.
Remedy: find duplicates which use different attributes.
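
The remedy can be sketched as follows. The exact estimator is not given in the transcript, so this is an assumption: among duplicate pairs where one entity uses attribute A and the other uses A', count how often the two values overlap (here, one string contains the other). The function name and the sample data are hypothetical.

```python
def estimate_correlation(duplicates, attr_a, attr_b):
    """Among duplicate pairs where one entity uses attr_a and the other
    uses attr_b, return the fraction whose values overlap
    (one value contains the other, ignoring case)."""
    seen = agree = 0
    for e1, e2 in duplicates:
        # Check both orientations of the pair.
        for x, y in ((e1, e2), (e2, e1)):
            if attr_a in x and attr_b in y:
                seen += 1
                va, vb = x[attr_a].lower(), y[attr_b].lower()
                if va in vb or vb in va:
                    agree += 1
    return agree / seen if seen else 0.0

# Made-up duplicate pairs: each pair describes the same person twice.
duplicates = [
    ({"first_name": "Philip"}, {"name": "Philip Bernstein"}),
    ({"first_name": "John"}, {"name": "J. Gary"}),
]
overlap = estimate_correlation(duplicates, "first_name", "name")
```

With these two pairs the values overlap in one case out of two, so the estimate is 0.5.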

12
Duplicate Detection in Malleable Schema

Problem: existing duplicate detection algorithms assume a uniform schema, so they do not work here.
Observations:
1. More duplicates lead to better schema correlation discovery.
2. More accurate schema correlations lead to better duplicate detection.
Solution: let schema correlation discovery and duplicate detection reinforce each other to achieve improved results.

13
An Example
Duplicate candidates: (E1, E2), (E3, E4), (E5, E6)
Attribute match candidates: (title, subject), (author, writer), (pub-date, rec-date)

14
The DSCD Algorithm

DSCD stands for Duplication and Schema Correlation Discovery.
Suppose there are n possible duplicates and m possible correlated attributes.
Belief vector of duplicate candidates: D
Belief vector of attribute match candidates: C
Let S be an evidence matrix.

15
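The DSCD Algorithm (cont'd)

When attribute matches are known: [update formula not preserved in transcript]
When duplicates are known: [update formula not preserved in transcript]
Repeat the process until C and D converge.
Very similar to the HITS algorithm (cf. PageRank). It can be proved that [symbol not preserved in transcript] will converge to the principal eigenvector of [matrix not preserved in transcript].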
16
Revisit the Example
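D (beliefs of duplicate candidates):
(E1, E2)  1.0
(E3, E4)  0.89
(E5, E6)  0.28
C (beliefs of attribute match candidates):
(title, subject)      1.0
(author, writer)      1.0
(pub-date, rec-date)  0.56
S (evidence matrix):
                      (E1, E2)  (E3, E4)  (E5, E6)
(title, subject)      1         0.7       0
(author, writer)      1         0.7       0
(pub-date, rec-date)  0         0.7       1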
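
The update rules of DSCD are not preserved in this transcript. The sketch below assumes HITS-style mutual reinforcement, which the slides name as the closest analogue: duplicate beliefs are recomputed from attribute match beliefs through the evidence matrix, and vice versa, with each vector rescaled so that its largest entry is 1. The function names, the rescaling choice, and the fixed iteration count are assumptions. Run on the evidence matrix S of this example, it reproduces the C and D values shown on the slide after rounding.

```python
def normalize(v):
    """Scale a vector so that its largest entry is 1 (an assumed choice)."""
    top = max(v)
    return [x / top for x in v] if top else v

def dscd_sketch(S, iterations=50):
    """HITS-style iteration over evidence matrix S, whose rows are attribute
    match candidates and whose columns are duplicate candidates."""
    m, n = len(S), len(S[0])
    C = [1.0] * m  # beliefs of attribute match candidates
    D = [1.0] * n  # beliefs of duplicate candidates
    for _ in range(iterations):
        # Assumed rule: a duplicate is credible if credible matches support it.
        D = normalize([sum(S[i][j] * C[i] for i in range(m)) for j in range(n)])
        # Assumed rule: a match is credible if credible duplicates support it.
        C = normalize([sum(S[i][j] * D[j] for j in range(n)) for i in range(m)])
    return C, D

# Evidence matrix from the example slide.
S = [
    [1.0, 0.7, 0.0],  # (title, subject)
    [1.0, 0.7, 0.0],  # (author, writer)
    [0.0, 0.7, 1.0],  # (pub-date, rec-date)
]
C, D = dscd_sketch(S)
C_rounded = [round(x, 2) for x in C]
D_rounded = [round(x, 2) for x in D]
```

A fixed number of iterations keeps the sketch short; a real implementation would more likely stop once C and D change by less than a tolerance.
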
17
Steps to Discover and Quantify Schema Correlations

Step 1: find all pairs of attributes that are possibly correlated (using schema matching techniques).
Step 2: find a number of duplicate candidates (find similar entities based on TFxIDF).
Step 3: run the DSCD algorithm on the pre-selected duplicate candidates and attribute match candidates.
Step 4: use the resulting belief vector of attribute matches, C, to identify more duplicates.
Step 5: use the duplicates to statistically quantify the correlations ([formula not preserved in transcript]).
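
Step 2 can be sketched with a small TF-IDF and cosine similarity routine. This is a generic textbook formulation, not necessarily the weighting the authors used; the function names and the sample entities are made up, and each entity is simplified to a bag of lower-case tokens.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0

# Made-up entities, flattened into token lists.
entities = [
    "titanic james cameron 1997".split(),
    "titanic cameron dvd".split(),
    "desktop search paper".split(),
]
vecs = tfidf_vectors(entities)
sim_01 = cosine(vecs[0], vecs[1])  # the two entities share terms
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
```

Pairs whose similarity exceeds some chosen threshold would then serve as duplicate candidates for Step 3.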

18
Roadmap

How to query using a malleable schema? Answer: query relaxation.
How to discover the schema correlations that enable query relaxation? Answer: DSCD algorithm.
How to perform query relaxation efficiently?
Some experimental results.

19
Query Processing

Given a query Q:
Find all relaxed queries Q1, Q2, ..., Qn, using schema correlations
Order the relaxed queries by their probability of returning correct results
Execute the queries progressively
Two optimizations can be made.

20
Query Processing Optimization I
[Figure: query graph with entities E, attributes A and B (values a and b), and relationship R]

21
Query Processing Optimization I
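[Figure: query graph with entities E, attributes A and B (values a and b), and relationship R, shown next to the lattice of relaxed queries]
Lattice of relaxed queries, level by level:
A,B,R
A1,B,R  A,B1,R  A,B,R1
A2,B,R  A1,B1,R  A1,B,R1  A,B1,R1
A2,B1,R  A2,B,R1  A1,B1,R1
A2,B1,R1
Idea: an A* algorithm that takes advantage of this partial order can save much effort in evaluating relaxed queries.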
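
The lattice can be explored lazily instead of being enumerated and sorted up front. The sketch below is a plain best-first traversal with a priority queue, in the spirit of the idea on this slide; it is not claimed to be the authors' algorithm. The probabilities are invented, and the sketch assumes that each alternative list is sorted by descending probability, so relaxing an element further never raises the score.

```python
import heapq

def best_first(alternatives):
    """Yield (probability, choice) over the relaxation lattice in
    non-increasing probability order.

    alternatives[i] lists (name, probability) options for query element i,
    original element first, sorted by descending probability.
    """
    start = (0,) * len(alternatives)

    def prob(state):
        p = 1.0
        for options, idx in zip(alternatives, state):
            p *= options[idx][1]
        return p

    heap = [(-prob(start), start)]  # max-heap via negated probabilities
    seen = {start}
    while heap:
        neg_p, state = heapq.heappop(heap)
        yield -neg_p, tuple(alternatives[i][j][0] for i, j in enumerate(state))
        # Successors relax exactly one element one step further.
        for i in range(len(state)):
            if state[i] + 1 < len(alternatives[i]):
                nxt = state[:i] + (state[i] + 1,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-prob(nxt), nxt))

# Made-up probabilities for the elements of the lattice on this slide.
alternatives = [
    [("A", 1.0), ("A1", 0.8), ("A2", 0.5)],
    [("B", 1.0), ("B1", 0.6)],
    [("R", 1.0), ("R1", 0.7)],
]
order = list(best_first(alternatives))
```

Because the function is a generator, a caller can stop as soon as enough results have been returned, and the remaining lattice nodes are never scored.
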
22
Query Processing Optimization II
[Figure: query graph with entities E, attributes A and B (values a and b), and relationship R]
The join results computed for Q1 can be reused by Q2.
Idea: cache the temporary results of relaxed queries.
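
A minimal sketch of the caching idea, under simplifying assumptions: the data lives in a hypothetical in-memory dictionary rather than in a DBMS, and what is cached is the result of each single condition rather than a join. Relaxed queries that share a condition then reuse its result. All names and data are made up.

```python
# Hypothetical in-memory "table": entity id -> attribute dictionary.
ENTITIES = {
    1: {"title": "Titanic", "director": "James Cameron"},
    2: {"subject": "Titanic", "director": "Cameron"},
    3: {"title": "Bambi", "director": "David Hand"},
}

cache = {}
evaluations = 0  # number of conditions actually evaluated against the data

def matching_ids(attr, keyword):
    """Ids of entities whose attr contains keyword, memoized per condition."""
    global evaluations
    key = (attr, keyword)
    if key not in cache:
        evaluations += 1
        cache[key] = frozenset(
            eid for eid, entity in ENTITIES.items()
            if keyword.lower() in entity.get(attr, "").lower()
        )
    return cache[key]

def run(conditions):
    """Evaluate a conjunctive query as the intersection of the
    cached results of its conditions."""
    result = set(ENTITIES)
    for attr, keyword in conditions:
        result &= matching_ids(attr, keyword)
    return result

q1 = run([("title", "Titanic"), ("director", "Cameron")])
# The relaxed query swaps title for subject and reuses the director result.
q2 = run([("subject", "Titanic"), ("director", "Cameron")])
```

Two queries with two conditions each trigger only three evaluations here, because the shared director condition is computed once.
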
23
Roadmap

How to query using a malleable schema? Answer: query relaxation.
How to discover the schema correlations that enable query relaxation? Answer: DSCD algorithm.
How to perform query relaxation efficiently? Answer: A* algorithm, caching temporary query results.
Some experimental results.

24
Experiment setup

Dataset
IMDB (movies and TV series): 850,000 items, 32 attributes
Amazon (DVD and VHS): 115,000 items, 28 attributes
Setup
DBMS: MySQL 4.1.11
Code: Java 5
Platform: CPU 2.7GHz, RAM 1GB

25
The First Run

Correlations between IMDB and Amazon attributes

26
Performance of DSCD Algorithm

We divide the Amazon data into two sets: a training set and a test set.
We run the DSCD algorithm on the training set.
We use the resulting C vector to detect duplicates between the Amazon test set and IMDB.

27
Query Relaxation
Select * From Amazon Where keywords Contains "Titanic Disney"
Select * From Amazon Where Title Contains "Titanic" And Director Contains "Cameron"

28
Comparison Against Keywords Search