Title: Query Relaxation Using Malleable Schema
1Query Relaxation Using Malleable Schema
- Xuan Zhou, Julien Gaugaz, Wolf-Tilo Balke, Wolfgang Nejdl
- L3S Research Center, Leibniz University Hanover, Hannover, Germany
2Data in Real-world Information Systems
- Examples
- Personal information system (desktop)
- Enterprise information system
- World Wide Web
- Characteristics
- A lot of text (unstructured information)
- A lot of structures, e.g. title, author, creation-date, ...
- Heterogeneity in structure
- Different holders (applications) use different schemas
- In nature, the structure of a domain is too complex for us to give it a clear and certain definition, e.g. types of service departments long-term, short-term, vacation
3Two Approaches to Search
- Information Retrieval (IR)
- Keywords as query: easy to use
- No structures: less expressive
- Can be applied to heterogeneous data: flexible
- Database Systems (DB)
- Structured queries: more expressive, powerful
- Weak in dealing with heterogeneity: inflexible
4Dealing with Heterogeneity in DB
- Traditional Data Integration
- Transform data into a clear and uniform structure before we use it
- Cannot avoid human intervention: very laborious and not scalable
- New idea: Malleable Schema
- Allow overlapping and vague elements to be defined in a single schema
- Automatically capture and quantify the correlations between schema elements, though not perfectly
- Properly relax the user query using schema correlations to obtain best-effort results
- A trade-off between accuracy and automation
5Example Data in a Malleable Schema
[Figure: example data graph in a malleable schema. Entity types: Person, Doc, email. Attributes and relationships: first name, sur name, name, title, subject, body, contents, author, writer, sender, attachment, date, Isa book, Isa paper. Sample values: Jack, Pan, John Gary, xml search, Desktop Search, My paper, 25.03.2006.]
6In This Talk
- How to query using malleable schema?
- How to discover the schema correlations to enable query relaxation?
- How to perform query relaxation efficiently?
- Some experiment results.
7The Need for Query Relaxation
- For example, a user issues the query Q1: Select Person Where first_name Contains Philip
- To obtain the complete results, we should relax the query to Q2: Select Person Where first_name Contains Philip Or name Contains Philip
- Principle 1: a query has to be relaxed to related schema elements.
8The Need for Ranking Query Results
- James M. Philip
- Philip Bernstein
- Whose results should be returned first?
- Q2.1: Select Person Where first_name Contains Philip
- Q2.2: Select Person Where name Contains Philip
- To return more relevant results first, we perform Q2.1 prior to Q2.2.
- Principle 2: relaxed queries should be performed in a sequence based on their likelihoods of returning right answers.
9A Model for Query Relaxation
- Given a query Q: Select E Where A1 Contains a1 And ... And Ak Contains ak, namely [formula]
- If Q' is a relaxed query of Q
- Then, the probability that Q' returns correct results to Q is [formula]
- How to assess the correlation between schema elements, namely [formula]?
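The ranking idea can be illustrated with a small sketch. It assumes (this is not stated on the slide) that schema elements are relaxed independently, so the score of a relaxed query is the product of the correlations between each original element and its replacement. The correlation values and the names `CORRELATION` and `relaxed_queries` are hypothetical.

```python
from itertools import product

# Hypothetical correlations: CORRELATION[original][replacement] is the assumed
# likelihood that the replacement element stands for the original one.
CORRELATION = {
    "Person": {"Person": 1.0},
    "first_name": {"first_name": 1.0, "name": 0.6},
}

def relaxed_queries(entity, attribute, keyword):
    """Return (score, query) pairs, best first.

    Assumption: elements relax independently, so their scores multiply.
    """
    results = []
    for (e, p_e), (a, p_a) in product(CORRELATION[entity].items(),
                                      CORRELATION[attribute].items()):
        query = f"Select {e} Where {a} Contains {keyword}"
        results.append((p_e * p_a, query))
    return sorted(results, reverse=True)

ranked = relaxed_queries("Person", "first_name", "Philip")
for score, query in ranked:
    print(f"{score:.2f}  {query}")
```

Under these assumed values the original query is ranked first and the relaxation to `name` second, which mirrors the order Q2.1 before Q2.2.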
10Roadmap
- How to query using malleable schema? Answer: query relaxation.
- How to discover the schema correlations to enable query relaxation?
- How to perform query relaxation efficiently?
- Some experiment results.
11Discovering Schema Correlations
- Task: for each pair of attributes (relationships) A and A', assess the conditional probability [formula]
- For example, [formula] and [formula]
- A straightforward solution: 1. find a number of entities where A and A' co-occur; 2. statistically estimate the expected overlap between A and A'.
- Problem: correlated attributes (relationships) seldom co-occur.
- For example, no entity uses first_name and name simultaneously.
- Remedy: find duplicates which use different attributes.
12Duplicate Detection in Malleable Schema
- Problem: existing duplicate detection algorithms assume a uniform schema, so they do not work here
- Observation: 1. more duplicates mean better schema correlation discovery; 2. more accurate schema correlations mean better duplicate detection
- Solution: let schema correlation discovery and duplicate detection reinforce each other to achieve improved results
13An Example
- Duplicates: (E1, E2), (E3, E4), (E5, E6)
- Attribute matches: (title, subject), (author, writer), (pub-date, rec-date)
14The DSCD Algorithm
- DSCD stands for Duplication and Schema Correlation Discovery
- Suppose there are n possible duplicates and m possible correlated attributes
- Belief vector of duplicate candidates: D
- Belief vector of attribute match candidates: C
- Let S be an evidence matrix
15The DSCD Algorithm contd.
- When attribute matches are known: [formula]
- When duplicates are known: [formula]
- Repeat the process until C and D converge
- Very similar to the HITS algorithm (a link-analysis method related to PageRank). It can be proved that [formula] will converge to the principal eigenvector of [formula]
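The mutual reinforcement can be sketched in code. The sketch assumes HITS-style updates: D is set proportional to S^T C, then C proportional to S D, and each vector is rescaled so that its largest entry is 1. These update rules and the rescaling are an assumption based on the HITS analogy, and the function name `dscd` is hypothetical. The evidence matrix is the one from the example on the following slide.

```python
def dscd(S, iterations=50):
    """HITS-style mutual reinforcement (assumed update rules).

    S[i][j] is the evidence that attribute match i supports duplicate j.
    Returns the belief vectors (C, D), each scaled to a maximum of 1.
    """
    m, n = len(S), len(S[0])
    C = [1.0] * m  # beliefs in attribute match candidates
    D = [1.0] * n  # beliefs in duplicate candidates
    for _ in range(iterations):
        # Attribute matches known: D is proportional to S^T C.
        D = [sum(S[i][j] * C[i] for i in range(m)) for j in range(n)]
        top = max(D)
        D = [d / top for d in D]
        # Duplicates known: C is proportional to S D.
        C = [sum(S[i][j] * D[j] for j in range(n)) for i in range(m)]
        top = max(C)
        C = [c / top for c in C]
    return C, D

# Rows: (title, subject), (author, writer), (pub-date, rec-date).
# Columns: (E1, E2), (E3, E4), (E5, E6).
S = [[1.0, 0.7, 0.0],
     [1.0, 0.7, 0.0],
     [0.0, 0.7, 1.0]]
C, D = dscd(S)
print([round(c, 2) for c in C])
print([round(d, 2) for d in D])
```

With these assumed rules the sketch prints `[1.0, 1.0, 0.56]` for C and `[1.0, 0.89, 0.28]` for D, the same values shown for this example on the following slide. Under this reading, D converges to the principal eigenvector of S^T S and C to that of S S^T.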
16Revisit the Example
D (belief in duplicates):
(E1, E2)  1.0
(E3, E4)  0.89
(E5, E6)  0.28

C (belief in attribute matches):
(title, subject)      1.0
(author, writer)      1.0
(pub-date, rec-date)  0.56

S (evidence matrix):
                      (E1, E2)  (E3, E4)  (E5, E6)
(title, subject)      1         0.7       0
(author, writer)      1         0.7       0
(pub-date, rec-date)  0         0.7       1
17Steps to Discover and Quantify Schema Correlations
- Step 1: find all pairs of attributes that are possibly correlated (using schema matching techniques).
- Step 2: find a number of duplicate candidates (find similar entities based on TFxIDF).
- Step 3: run the DSCD algorithm on the pre-selected duplicate candidates and attribute match candidates.
- Step 4: use the resulting belief vector of attribute matches C to identify more duplicates.
- Step 5: use the duplicates to statistically quantify the correlations.
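Step 2 can be illustrated with a minimal sketch of TFxIDF-based candidate selection. It assumes each entity is flattened into one string of attribute values (ignoring attribute names, since schemas differ), whitespace tokenization, a plain log(N/df) weighting, and a cosine threshold of 0.2; none of these details come from the slides, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def duplicate_candidates(entities, threshold=0.2):
    """Return index pairs whose flattened text is similar enough."""
    vecs = tfidf_vectors(entities)
    return [(i, j)
            for i in range(len(vecs))
            for j in range(i + 1, len(vecs))
            if cosine(vecs[i], vecs[j]) >= threshold]

# Toy entities, each flattened to the text of its attribute values.
entities = [
    "titanic james cameron 1997",
    "titanic cameron james dvd",
    "desktop search tutorial",
]
pairs = duplicate_candidates(entities)
print(pairs)
```

Here only the first two entities are paired, because they share most of their terms while the third shares none.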
18Roadmap
- How to query using malleable schema? Answer: query relaxation.
- How to discover the schema correlations to enable query relaxation? Answer: DSCD algorithm.
- How to perform query relaxation efficiently?
- Some experiment results.
19Query Processing
- Given a query Q
- Find all relaxed queries Q1, Q2, ..., Qn, using schema correlations
- Order the relaxed queries by their probability of returning correct results
- Execute the queries progressively
- Two optimizations can be made.
20Query Processing Optimization I
[Figure: example query graph with entities E, attributes A and B (values a and b), and relationship R]
21Query Processing Optimization I
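[Figure: the example query graph and the lattice of its relaxed queries, from the original (A, B, R) at the top to the most relaxed (A2, B1, R1) at the bottom]
(A, B, R)
(A1, B, R)  (A, B1, R)  (A, B, R1)
(A2, B, R)  (A1, B1, R)  (A1, B, R1)  (A, B1, R1)
(A2, B1, R)  (A2, B, R1)  (A1, B1, R1)
(A2, B1, R1)
- Idea: an A* algorithm that takes advantage of this partial order can save much effort in evaluating relaxed queries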
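How the partial order helps can be sketched with a plain best-first traversal of the lattice using a priority queue. This is a simplification for illustration, not necessarily the A* variant the authors use. The correlation values are hypothetical, scores are assumed to be products of per-element correlations, and the alternatives of each element are assumed to be sorted from most to least correlated.

```python
import heapq

# Hypothetical alternatives per query element, best first, with an assumed
# correlation to the original element (index 0 is the original itself).
ALTERNATIVES = [
    [("A", 1.0), ("A1", 0.8), ("A2", 0.5)],
    [("B", 1.0), ("B1", 0.6)],
    [("R", 1.0), ("R1", 0.7)],
]

def score(alternatives, state):
    """Product of the correlations of the chosen alternatives."""
    result = 1.0
    for options, choice in zip(alternatives, state):
        result *= options[choice][1]
    return result

def best_first(alternatives):
    """Yield (score, names) in non-increasing score order.

    A lattice node is generated only after one of its parents has been
    emitted, so less promising relaxations are never touched early.
    """
    start = (0,) * len(alternatives)
    heap = [(-score(alternatives, start), start)]
    seen = {start}
    while heap:
        neg, state = heapq.heappop(heap)
        yield -neg, tuple(opts[i][0] for opts, i in zip(alternatives, state))
        for pos in range(len(state)):
            if state[pos] + 1 < len(alternatives[pos]):
                nxt = state[:pos] + (state[pos] + 1,) + state[pos + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(alternatives, nxt), nxt))

order = list(best_first(ALTERNATIVES))
for s, names in order[:4]:
    print(f"{s:.2f}", names)
```

Because every step down the lattice can only lower the score, the traversal can stop as soon as enough results have been returned, without ever scoring the remaining relaxed queries.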
22Query Processing Optimization II
[Figure: example query graph with entities E, attributes A and B (values a and b), and relationship R]
- The join results in Q1 can be reused by Q2
- Idea: cache the temporary results of relaxed queries
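The caching idea can be sketched as simple memoization. For brevity this sketch caches the result of each individual condition rather than full join results, which is a simplification of what the slide describes. The toy data and all names (`PERSONS`, `select`, `run`) are hypothetical.

```python
# Toy data: entity id -> {attribute: value}. All names are hypothetical.
PERSONS = {
    1: {"first_name": "Philip", "name": "Philip Bernstein"},
    2: {"name": "James M. Philip"},
    3: {"first_name": "Jack", "sur_name": "Pan"},
}

cache = {}
evaluations = 0  # number of real scans, cache hits excluded

def select(attribute, keyword):
    """Ids of entities whose attribute contains keyword, memoized."""
    global evaluations
    key = (attribute, keyword)
    if key not in cache:
        evaluations += 1
        cache[key] = frozenset(
            eid for eid, attrs in PERSONS.items()
            if keyword in attrs.get(attribute, ""))
    return cache[key]

def run(conditions):
    """Evaluate a conjunction of (attribute, keyword) conditions."""
    result = None
    for attribute, keyword in conditions:
        ids = select(attribute, keyword)
        result = ids if result is None else result & ids
    return result

# Q1 and its relaxation Q2 share the condition on "Bernstein".
q1 = run([("first_name", "Philip"), ("name", "Bernstein")])
q2 = run([("name", "Philip"), ("name", "Bernstein")])
print(sorted(q1), sorted(q2), evaluations)
```

The two queries contain four conditions in total, but only three scans are performed because the shared condition is answered from the cache on its second use.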
23Roadmap
- How to query using malleable schema? Answer: query relaxation.
- How to discover the schema correlations to enable query relaxation? Answer: DSCD algorithm.
- How to perform query relaxation efficiently? Answer: A* algorithm, caching temporary query results.
- Some experiment results.
24Experiment setup
- Dataset
- IMDB movies and TV series: 850,000 items, 32 attributes
- Amazon DVD and VHS: 115,000 items, 28 attributes
- Setup
- DBMS: MySQL 4.1.11
- Code: Java 5
- Platform: CPU 2.7GHz, RAM 1GB
25The First Run
- Correlations between IMDB and Amazon attributes
26Performance of DSCD Algorithm
- We divide the Amazon data into two sets: a training set and a test set
- We perform the DSCD algorithm on the training set
- We use the resulting C vector to detect duplicates in the Amazon test set and IMDB
27Query Relaxation
Select from Amazon where keywords contains Titanic Disney
Select from Amazon where Title contains Titanic And Director contains Cameron
28Comparison Against Keywords Search
- Create queries using Title, Directors and
Actors in Amazon and perform the queries on IMDB
29Performance
- Set thresholds on attribute correlations: 0.2, 0.1 and 0.05, respectively
30Thank you!