Mayssam Sayyadian, AnHai Doan presentation

About This Presentation

Transcript and Presenter's Notes

Title: Mayssam Sayyadian, AnHai Doan

1
Efficient Keyword Search across Heterogeneous
Relational Databases

Mayssam Sayyadian, AnHai Doan
University of Wisconsin - Madison
Hieu LeKhac
University of Illinois - Urbana
Luis Gravano
Columbia University

2
Key Message of Paper

Precise data integration is expensive
But we can do IR-style data integration
very cheaply, with no manual cost!
just apply automatic schema/data matching
then do keyword search across the databases
no need to verify anything manually
Already very useful

Build upon keyword search over a single database
...
3
Keyword Search over a Single Relational Database

A growing field, numerous current works
DBXplorer ICDE02, BANKS ICDE02
DISCOVER VLDB02
Efficient IR-style keyword search in databases
VLDB03,
VLDB-05, SIGMOD-06, etc.
Many related works over XML / other types of data
XKeyword ICDE03, XRank Sigmod03
TeXQuery WWW04
ObjectRank Sigmod06
TopX VLDB05, etc.
More are coming at SIGMOD-07 ...

4
A Typical Scenario
Customers
Complaints
Foreign-Key Join
Q Michael Smith Cisco
Ranked list of answers
score.8
u1 c124 Michael Smith Repair didnt work
t1 c124 Cisco Michael Jones
score.7
5
Our ProposalKeyword Search across Multiple
Databases
Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
Groups
Customers
tid eid reports-to
x1 e23 e37 x2 e14
e37
Query Cisco Jack Lucas
across databases
? IR-style data integration
6
A Naive Solution

1. Manually identify FK joins across DBs
2. Manually identify matching data instances
across DBs
3. Now treat the combination of DBs as a single
DB
? apply current keyword search techniques

Just like in traditional data integration,
this is too much
manual work
7
Kite Solution

Automatically find FK joins / matching data
instances
across databases
? no manual work is required from user

Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
Groups
Customers
tid eid reports-to
x1 e23 e37 x2 e14
e37
8
Automatically Find FK Joinsacross Databases
Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas

Current solutions analyze data values (e.g.,
Bellman)
Limited accuracy
e.g., waterfront with values yes/no
electricity with values yes/no
Our solution data analysis schema matching
improve accuracy drastically (by as much as 50
F-1)

Automatic join/data matching can be wrong ?
incorporate confidence scores into answer scores
9
Incorporate Confidence Scores into Answer Scores

Recall answer example in single-DB settings

score.8
t1 c124 Cisco Michael Jones
u1 c124 Michael Smith Repair didnt work

Recall answer example in multiple-DB settings

score 0.7 for data matching
score 0.9 for FK join
a.score_kw (A, Q) ß.score_join (A, Q)
?.score_data (A, Q)
score (A, Q)
size (A)
10
Summary of Trade-Offs
SQL queries

Precise data integration
the holy grail

IR-style data integration, naive way
manually identify FK joins, matching data
still too expensive

IR-style data integration, using Kite
automatic FK join finding / data matching
cheap
only approximates the ideal ranked list found
by naive

11
Kite Architecture
Q Smith Cisco
Index Builder
Condensed CN Generator
IR index1
IR indexn

Partial
Full
Deep

Foreign key joins
Refinement
rules
Top-k Searcher
Data instance
Foreign-Key Join Finder
matcher
Data-based Schema Join Finder
Matcher
Distributed SQL queries

D1
Dn
D1
Dn
Offline preprocessing
Online querying
12
Online Querying

What current solutions do
1. Create answer templates
2. Materialize answer templates to obtain
answers

13
Create Answer Templates
Service-DB

Find tuples that contain query keywords
Use DBs IR index
example
Q Smith Cisco
Tuple sets
Create tuple-set graph
Schema graph
Tuple set graph

HR-DB
Service-DB ComplaintsQu1, u2
CustomersQv1
Groups x1 x2
Employees t1 t2 t3
HR-DB EmployeesQt1
GroupsQ
14
Create Answer Templates (cont.)

Search tuple-set graph to generate answer
templates
also called Candidate Networks (CNs)
Each answer template
one way to join tuples to form an answer

15
Materialize Answer Templatesto Generate Answers

By generating and executing a SQL query

CN CustomersQ ? ComplaintsQ
(CustomersQ v1 , ComplaintsQ u1, u2) SQL
SELECT FROM Customers C, Complaints P
WHERE C.cust-id P.id AND
(C.tuple-id v1) AND (P.tuple-id u1 OR
tuple-id u2)
J1

Naive solution
materialize all answer templates, score, rank,
then return answers
Current solutions
find only top-k answers
materialize only certain answer templates
make decisions using refinement rules
statistics

16
Challenges for Kite Setting

More databases ? way too many answer
templates to generate
can take hours on just 3-4 databases
Materializing an answer template takes way too
long
requires SQL query execution across multiple
databases
invoking each database incurs large overhead
Difficult to obtain reliable statistics across
databases
See paper for our solutions

17
Empirical Evaluation
Domains
Domain DBs Avg tables per DB Avg attributes per schema Avg approximate FK joins tuples per table Avg approximate FK joins tuples per table Avg approximate FK joins tuples per table Avg tuples per table Total size
Domain DBs Avg tables per DB Avg attributes per schema total across DBs per pair Avg tuples per table Total size
DBLP 2 3 3 11 6 11 500K 400M
Inventory 8 5.8 5.4 890 804 33.6 2K 50M
18
Runtime Performance (1)
runtime vs. maximum CCN size
19
Runtime Performance (2)
20
Query Result Quality
Pr_at_k
Pr_at_k
k
k
OR-semantic queries
AND-semantic queries

Pr_at_k the fraction of answers that appear in the
ideal list

21
Summary

Kite executes IR-style data integration
performs some automatic preprocessing
then immediately allows keyword querying
Relatively painless
no manual work!
no need to create global schema, to understand
SQL
Can be very useful in many settings
e.g., on-the-fly, best-effort, for non-technical
people
enterprises, on the Web, need only a few answers
emergency (e.g., hospital police), need answers
quickly

22
Future Directions

Incorporate user feedback
? interactive IR-style data integration
More efficient query processing
large of databases, network latency
Extends to other types of data
XML, ontologies, extracted data, Web data

IR-style data integration is feasible
and useful extends current
works on keyword search over DB raises
many opportunities for future work
23
BACKUP
24
Other Experiments

Mayssam Sayyadian, AnHai Doan PowerPoint PPT Presentation