Title: Mayssam Sayyadian, AnHai Doan
1Efficient Keyword Search across Heterogeneous
Relational Databases
- Mayssam Sayyadian, AnHai Doan
- University of Wisconsin - Madison
- Hieu LeKhac
- University of Illinois - Urbana
- Luis Gravano
- Columbia University
2Key Message of Paper
- Precise data integration is expensive
- But we can do IR-style data integration
very cheaply, with no manual cost! - just apply automatic schema/data matching
- then do keyword search across the databases
- no need to verify anything manually
- Already very useful
Build upon keyword search over a single database
...
3Keyword Search over a Single Relational Database
- A growing field, numerous current works
- DBXplorer ICDE02, BANKS ICDE02
- DISCOVER VLDB02
- Efficient IR-style keyword search in databases
VLDB03, - VLDB-05, SIGMOD-06, etc.
- Many related works over XML / other types of data
- XKeyword ICDE03, XRank Sigmod03
- TeXQuery WWW04
- ObjectRank Sigmod06
- TopX VLDB05, etc.
- More are coming at SIGMOD-07 ...
4A Typical Scenario
Customers
Complaints
Foreign-Key Join
Q Michael Smith Cisco
Ranked list of answers
score.8
u1 c124 Michael Smith Repair didnt work
t1 c124 Cisco Michael Jones
score.7
5Our ProposalKeyword Search across Multiple
Databases
Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
Groups
Customers
tid eid reports-to
x1 e23 e37 x2 e14
e37
Query Cisco Jack Lucas
across databases
? IR-style data integration
6A Naive Solution
- 1. Manually identify FK joins across DBs
- 2. Manually identify matching data instances
across DBs - 3. Now treat the combination of DBs as a single
DB - ? apply current keyword search techniques
Just like in traditional data integration,
this is too much
manual work
7Kite Solution
- Automatically find FK joins / matching data
instances
across databases - ? no manual work is required from user
Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
Groups
Customers
tid eid reports-to
x1 e23 e37 x2 e14
e37
8Automatically Find FK Joinsacross Databases
Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
- Current solutions analyze data values (e.g.,
Bellman) - Limited accuracy
- e.g., waterfront with values yes/no
electricity with values yes/no - Our solution data analysis schema matching
- improve accuracy drastically (by as much as 50
F-1)
Automatic join/data matching can be wrong ?
incorporate confidence scores into answer scores
9Incorporate Confidence Scores into Answer Scores
- Recall answer example in single-DB settings
score.8
t1 c124 Cisco Michael Jones
u1 c124 Michael Smith Repair didnt work
- Recall answer example in multiple-DB settings
score 0.7 for data matching
score 0.9 for FK join
a.score_kw (A, Q) ß.score_join (A, Q)
?.score_data (A, Q)
score (A, Q)
size (A)
10Summary of Trade-Offs
SQL queries
- Precise data integration
- the holy grail
- IR-style data integration, naive way
- manually identify FK joins, matching data
- still too expensive
- IR-style data integration, using Kite
- automatic FK join finding / data matching
- cheap
- only approximates the ideal ranked list found
by naive
11Kite Architecture
Q Smith Cisco
Index Builder
Condensed CN Generator
IR index1
IR indexn
Foreign key joins
Refinement
rules
Top-k Searcher
Data instance
Foreign-Key Join Finder
matcher
Data-based Schema Join Finder
Matcher
Distributed SQL queries
D1
Dn
D1
Dn
Offline preprocessing
Online querying
12Online Querying
- What current solutions do
- 1. Create answer templates
- 2. Materialize answer templates to obtain
answers
13Create Answer Templates
Service-DB
- Find tuples that contain query keywords
- Use DBs IR index
- example
- Q Smith Cisco
- Tuple sets
- Create tuple-set graph
- Schema graph
- Tuple set graph
HR-DB
Service-DB ComplaintsQu1, u2
CustomersQv1
Groups x1 x2
Employees t1 t2 t3
HR-DB EmployeesQt1
GroupsQ
14Create Answer Templates (cont.)
- Search tuple-set graph to generate answer
templates - also called Candidate Networks (CNs)
- Each answer template
one way to join tuples to form an answer
15Materialize Answer Templatesto Generate Answers
- By generating and executing a SQL query
CN CustomersQ ? ComplaintsQ
(CustomersQ v1 , ComplaintsQ u1, u2) SQL
SELECT FROM Customers C, Complaints P
WHERE C.cust-id P.id AND
(C.tuple-id v1) AND (P.tuple-id u1 OR
tuple-id u2)
J1
- Naive solution
- materialize all answer templates, score, rank,
then return answers - Current solutions
- find only top-k answers
- materialize only certain answer templates
- make decisions using refinement rules
statistics
16Challenges for Kite Setting
- More databases ? way too many answer
templates to generate - can take hours on just 3-4 databases
- Materializing an answer template takes way too
long - requires SQL query execution across multiple
databases - invoking each database incurs large overhead
- Difficult to obtain reliable statistics across
databases - See paper for our solutions
17Empirical Evaluation
Domains
Domain DBs Avg tables per DB Avg attributes per schema Avg approximate FK joins tuples per table Avg approximate FK joins tuples per table Avg approximate FK joins tuples per table Avg tuples per table Total size
Domain DBs Avg tables per DB Avg attributes per schema total across DBs per pair Avg tuples per table Total size
DBLP 2 3 3 11 6 11 500K 400M
Inventory 8 5.8 5.4 890 804 33.6 2K 50M
18 Runtime Performance (1)
runtime vs. maximum CCN size
19Runtime Performance (2)
20Query Result Quality
Pr_at_k
Pr_at_k
k
k
OR-semantic queries
AND-semantic queries
- Pr_at_k the fraction of answers that appear in the
ideal list
21Summary
- Kite executes IR-style data integration
- performs some automatic preprocessing
- then immediately allows keyword querying
- Relatively painless
- no manual work!
- no need to create global schema, to understand
SQL - Can be very useful in many settings
e.g., on-the-fly, best-effort, for non-technical
people - enterprises, on the Web, need only a few answers
- emergency (e.g., hospital police), need answers
quickly
22Future Directions
- Incorporate user feedback
? interactive IR-style data integration - More efficient query processing
- large of databases, network latency
- Extends to other types of data
- XML, ontologies, extracted data, Web data
IR-style data integration is feasible
and useful extends current
works on keyword search over DB raises
many opportunities for future work
23BACKUP
24Other Experiments
- Schema matching helps improve join discovery
algorithm drastically
- Kite also improves single-database keyword search
algorithm mHybrid