Mayssam Sayyadian, AnHai Doan - PowerPoint PPT Presentation

About This Presentation
Title:

Mayssam Sayyadian, AnHai Doan

Description:

Efficient Keyword Search across Heterogeneous Relational Databases Mayssam Sayyadian, AnHai Doan University of Wisconsin - Madison Hieu LeKhac University of Illinois ... – PowerPoint PPT presentation

Number of Views:38
Avg rating:3.0/5.0
Slides: 25
Provided by: zam98
Category:

less

Transcript and Presenter's Notes

Title: Mayssam Sayyadian, AnHai Doan


1
Efficient Keyword Search across Heterogeneous
Relational Databases
  • Mayssam Sayyadian, AnHai Doan
  • University of Wisconsin - Madison
  • Hieu LeKhac
  • University of Illinois - Urbana
  • Luis Gravano
  • Columbia University

2
Key Message of Paper
  • Precise data integration is expensive
  • But we can do IR-style data integration
    very cheaply, with no manual cost!
  • just apply automatic schema/data matching
  • then do keyword search across the databases
  • no need to verify anything manually
  • Already very useful

Build upon keyword search over a single database
...
3
Keyword Search over a Single Relational Database
  • A growing field, numerous current works
  • DBXplorer ICDE02, BANKS ICDE02
  • DISCOVER VLDB02
  • Efficient IR-style keyword search in databases
    VLDB03,
  • VLDB-05, SIGMOD-06, etc.
  • Many related works over XML / other types of data
  • XKeyword ICDE03, XRank Sigmod03
  • TeXQuery WWW04
  • ObjectRank Sigmod06
  • TopX VLDB05, etc.
  • More are coming at SIGMOD-07 ...

4
A Typical Scenario
Customers
Complaints
Foreign-Key Join
Q Michael Smith Cisco
Ranked list of answers
score.8
u1 c124 Michael Smith Repair didnt work
t1 c124 Cisco Michael Jones
score.7
5
Our ProposalKeyword Search across Multiple
Databases
Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
Groups
Customers
tid eid reports-to
x1 e23 e37 x2 e14
e37
Query Cisco Jack Lucas
across databases
? IR-style data integration
6
A Naive Solution
  • 1. Manually identify FK joins across DBs
  • 2. Manually identify matching data instances
    across DBs
  • 3. Now treat the combination of DBs as a single
    DB
  • ? apply current keyword search techniques

Just like in traditional data integration,
this is too much
manual work
7
Kite Solution
  • Automatically find FK joins / matching data
    instances
    across databases
  • ? no manual work is required from user

Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
Groups
Customers
tid eid reports-to
x1 e23 e37 x2 e14
e37
8
Automatically Find FK Joinsacross Databases
Employees
Complaints
tid empid name
v1 e23 Mike D. Smith
v2 e14 John Brown v3
e37 Jack Lucas
  • Current solutions analyze data values (e.g.,
    Bellman)
  • Limited accuracy
  • e.g., waterfront with values yes/no
    electricity with values yes/no
  • Our solution data analysis schema matching
  • improve accuracy drastically (by as much as 50
    F-1)

Automatic join/data matching can be wrong ?
incorporate confidence scores into answer scores
9
Incorporate Confidence Scores into Answer Scores
  • Recall answer example in single-DB settings

score.8
t1 c124 Cisco Michael Jones
u1 c124 Michael Smith Repair didnt work
  • Recall answer example in multiple-DB settings

score 0.7 for data matching
score 0.9 for FK join
a.score_kw (A, Q) ß.score_join (A, Q)
?.score_data (A, Q)
score (A, Q)
size (A)
10
Summary of Trade-Offs
SQL queries
  • Precise data integration
  • the holy grail



  • IR-style data integration, naive way
  • manually identify FK joins, matching data
  • still too expensive
  • IR-style data integration, using Kite
  • automatic FK join finding / data matching
  • cheap
  • only approximates the ideal ranked list found
    by naive

11
Kite Architecture
Q Smith Cisco
Index Builder
Condensed CN Generator
IR index1
IR indexn
  • Partial
  • Full
  • Deep

Foreign key joins
Refinement
rules
Top-k Searcher
Data instance
Foreign-Key Join Finder
matcher
Data-based Schema Join Finder
Matcher
Distributed SQL queries


D1
Dn
D1
Dn
Offline preprocessing
Online querying
12
Online Querying
  • What current solutions do
  • 1. Create answer templates
  • 2. Materialize answer templates to obtain
    answers

13
Create Answer Templates
Service-DB
  • Find tuples that contain query keywords
  • Use DBs IR index
  • example
  • Q Smith Cisco
  • Tuple sets
  • Create tuple-set graph
  • Schema graph
  • Tuple set graph

HR-DB
Service-DB ComplaintsQu1, u2
CustomersQv1
Groups x1 x2
Employees t1 t2 t3
HR-DB EmployeesQt1
GroupsQ
14
Create Answer Templates (cont.)
  • Search tuple-set graph to generate answer
    templates
  • also called Candidate Networks (CNs)
  • Each answer template
    one way to join tuples to form an answer

15
Materialize Answer Templatesto Generate Answers
  • By generating and executing a SQL query

CN CustomersQ ? ComplaintsQ
(CustomersQ v1 , ComplaintsQ u1, u2) SQL
SELECT FROM Customers C, Complaints P
WHERE C.cust-id P.id AND
(C.tuple-id v1) AND (P.tuple-id u1 OR
tuple-id u2)
J1
  • Naive solution
  • materialize all answer templates, score, rank,
    then return answers
  • Current solutions
  • find only top-k answers
  • materialize only certain answer templates
  • make decisions using refinement rules
    statistics

16
Challenges for Kite Setting
  • More databases ? way too many answer
    templates to generate
  • can take hours on just 3-4 databases
  • Materializing an answer template takes way too
    long
  • requires SQL query execution across multiple
    databases
  • invoking each database incurs large overhead
  • Difficult to obtain reliable statistics across
    databases
  • See paper for our solutions

17
Empirical Evaluation
Domains
Domain DBs Avg tables per DB Avg attributes per schema Avg approximate FK joins tuples per table Avg approximate FK joins tuples per table Avg approximate FK joins tuples per table Avg tuples per table Total size
Domain DBs Avg tables per DB Avg attributes per schema total across DBs per pair Avg tuples per table Total size
DBLP 2 3 3 11 6 11 500K 400M
Inventory 8 5.8 5.4 890 804 33.6 2K 50M
18
Runtime Performance (1)
runtime vs. maximum CCN size
19
Runtime Performance (2)
20
Query Result Quality
Pr_at_k
Pr_at_k
k
k
OR-semantic queries
AND-semantic queries
  • Pr_at_k the fraction of answers that appear in the
    ideal list

21
Summary
  • Kite executes IR-style data integration
  • performs some automatic preprocessing
  • then immediately allows keyword querying
  • Relatively painless
  • no manual work!
  • no need to create global schema, to understand
    SQL
  • Can be very useful in many settings
    e.g., on-the-fly, best-effort, for non-technical
    people
  • enterprises, on the Web, need only a few answers
  • emergency (e.g., hospital police), need answers
    quickly

22
Future Directions
  • Incorporate user feedback
    ? interactive IR-style data integration
  • More efficient query processing
  • large of databases, network latency
  • Extends to other types of data
  • XML, ontologies, extracted data, Web data

IR-style data integration is feasible
and useful extends current
works on keyword search over DB raises
many opportunities for future work
23
BACKUP
24
Other Experiments
  • Schema matching helps improve join discovery
    algorithm drastically
  • Kite also improves single-database keyword search
    algorithm mHybrid
Write a Comment
User Comments (0)
About PowerShow.com