Transcript and Presenter's Notes

Title: Issues in Bridging DB & IR


1
Issues in Bridging DB & IR
4/29
  • Announcements
  • Next class Interactive Review (Come prepared)
  • Homework III solutions online
  • Demos tomorrow (instructions will be mailed by
    the end of the class)

2
First did some discussion of BibFinder: how
queries are mapped, etc.
3
CEAS Online Evaluations
  • You can do them at
  • https://intraweb.eas.asu.edu/eval
  • Will be available until the end of day May 5th
  • (so the exam is unfettered by what you might
    think about it :-)
  • Instructors get to see it only after the grades
    have been given
  • (so you don't need to feel compelled to be
    particularly nice)
  • Your feedback would be appreciated (especially
    the written comments)
  • Last semester I got 2,196 words of comments; let
    us see if we can break the record :-)

4
The popularity of the Web brings two broad
challenges to databases
  • Supporting heterogeneous data (combining DB/IR)
  • This can be tackled in the presence of a single
    database
  • The issues are
  • How to do effective querying in the presence of
    structured and text data
  • E.g. the Stuff I've Seen project
  • How to support IR-style querying on DB
  • Because users seem to know IR/keyword style
    querying more
  • (notice the irony here: we said structure is good
    because it supports structured querying)
  • How to support imprecise queries
  • Integration of autonomous data sources
  • Data/information integration
  • Technically has to handle heterogeneous data too
  • But we will sort of assume that the sources are
    quasi-relational

5
DB vs. IR
  • DBs allow structured querying
  • Queries and results (tuples) are different
    objects
  • Soundness & Completeness expected
  • User is expected to know what she is doing
  • IR only supports unstructured querying
  • Queries and results are both documents!
  • High Precision & Recall is hoped for
  • User is expected to be a dunderhead.

6
Some specific problems
  • How to handle textual attributes?
  • How to support keyword-based querying?
  • How to handle imprecise queries?
  • (Ullas Nambiar's work)

7
1. Handling text fields in data tuples
  • Often you have database relations some of whose
    fields are Textual
  • E.g. a movie database, which has, in addition to
    year, director etc., a column called Review
    which is unstructured text
  • Normal DB operations ignore this unstructured
    stuff (can't join over them).
  • SQL sometimes supports a Contains constraint
    (e.g. give me movies that contain 'Rotten' in
    the review); a small sketch follows
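A minimal sketch of such a constraint, using SQLite's FTS5 full-text index as a stand-in for a SQL CONTAINS predicate; the movies table and its rows are made up:

```python
import sqlite3

# A Contains-style text constraint via SQLite's FTS5 full-text index,
# standing in for SQL's CONTAINS predicate (FTS5 ships with most Python
# builds). The movies table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE movies USING fts5(title, year, director, review)")
conn.execute("INSERT INTO movies VALUES ('Plan 9', '1959', 'Ed Wood', 'a rotten mess')")
conn.execute("INSERT INTO movies VALUES ('Vertigo', '1958', 'Hitchcock', 'a masterpiece')")

# Keep only the movies whose review field contains the keyword 'rotten'.
for (title,) in conn.execute(
        "SELECT title FROM movies WHERE movies MATCH 'review : rotten'"):
    print(title)  # -> Plan 9
```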

8
Soft Joins: WHIRL [Cohen]
  • We can extend the notion of Joins to Similarity
    Joins, where similarity is measured in terms of
    vector similarity over the text attributes. So,
    the join tuples are output in a ranked form, with
    the rank proportional to the similarity (see the
    sketch below)
  • Neat idea but does have some implementation
    difficulties
  • Most tuples in the cross-product will have
    non-zero similarities. So, need query processing
    that will somehow just produce highly ranked
    tuples
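A minimal sketch of such a similarity join, assuming scikit-learn's TF-IDF vectorizer and cosine similarity; the two relations and their contents are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up relations to be soft-joined on their text attributes.
movies  = ["Star Wars Episode IV A New Hope", "The Godfather Part II"]
reviews = ["godfather part 2 is a sequel that outdoes the original",
           "a new hope launched the star wars saga"]

# Vector-space representation of both sides over a shared vocabulary.
vec  = TfidfVectorizer().fit(movies + reviews)
sims = cosine_similarity(vec.transform(movies), vec.transform(reviews))

# Output joined pairs ranked by similarity instead of exact equality.
pairs = [(sims[i, j], m, r) for i, m in enumerate(movies)
                            for j, r in enumerate(reviews)]
for s, m, r in sorted(pairs, reverse=True):
    print(f"{s:.2f}  {m}  ~  {r}")
```

Note that almost every pair gets some non-zero score, which is exactly why query processing that produces only the highly ranked tuples, without materializing the cross-product, is the hard part.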

9
(No Transcript)
10
2. Supporting keyword search on databases
How do we answer a query like "Soumen Sunita"?
Issues:
-- the schema is normalized (not everything is
in one table)
-- how to rank multiple tuples which contain the
keywords?

11
What BANKS Does
The whole DB is seen as a directed graph
(edges correspond to foreign keys). Answers
are subgraphs, ranked by edge weights (see the
sketch below).
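A minimal sketch of this idea, assuming the networkx graph library; the tuple identifiers and edge weights are made up:

```python
import networkx as nx

# Tuples are nodes and foreign-key references are edges; an answer to a
# keyword query is a low-weight subgraph connecting the matching tuples.
G = nx.Graph()
G.add_edge("paper:p1", "writes:w1", weight=1)
G.add_edge("writes:w1", "author:Soumen", weight=1)
G.add_edge("paper:p1", "writes:w2", weight=1)
G.add_edge("writes:w2", "author:Sunita", weight=1)

# Tuples matching the keywords 'Soumen' and 'Sunita'.
hits = ["author:Soumen", "author:Sunita"]

# A simple two-keyword answer: the cheapest path connecting the hits;
# BANKS proper searches for low-weight connection trees more generally.
print(nx.shortest_path(G, hits[0], hits[1], weight="weight"))
# -> the paper tuple p1 is what joins the two author tuples
```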
12
BANKS: Keyword Search in DB
13
3. Supporting Imprecise Queries
  • Increasing number of Web accessible databases
  • E.g. bibliographies, reservation systems,
    department catalogs etc
  • Support for precise queries only: exactly
    matching tuples are returned
  • Difficulty in extracting desired information
  • Limited query capabilities provided by form based
    query interface
  • Lack of schema/domain information
  • Increasing complexity of types of data, e.g.
    hypertext, images etc.
  • Oftentimes the user wants 'about the same'
    instead of 'exact'
  • Bibliography search: find similar publications

Solution: Provide answers closely matching query
constraints
14
Relaxing queries
  • It is obvious how to relax certain types of
    attribute values
  • E.g. Price=7000 is approximately the same as
    Price=7020
  • But how do we relax categorical attributes?
  • How should we relax Make=Honda?
  • Two possible approaches
  • Assume that domain specific information about
    similarity of values is available (difficult to
    satisfy in practice)
  • Attempt to derive the similarity between
    attribute values directly from the data
  • Qn: How do we compute the similarity between
    Make=Honda and Make=Chevrolet?
  • Idea: Compare the set of all tuples where
    Make=Honda to the set of all tuples where
    Make=Chevrolet
  • Consider the set of tuples as a vector of bags
    (where bags correspond to the individual
    attributes)
  • Use IR similarity techniques to compare the
    vectors (see the sketch below)
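A minimal sketch of the vector-of-bags representation; the car tuples are made up:

```python
from collections import Counter

# Representing a categorical value (e.g. Make=Honda) as a vector of
# bags, one bag per remaining attribute, pooled over every tuple that
# carries the value. The car tuples are hypothetical.
cars = [
    {"Make": "Honda",     "Model": "Civic",  "Body": "sedan", "Color": "red"},
    {"Make": "Honda",     "Model": "Accord", "Body": "sedan", "Color": "blue"},
    {"Make": "Chevrolet", "Model": "Malibu", "Body": "sedan", "Color": "red"},
]

def vector_of_bags(attr, value):
    """One bag of values per other attribute, over tuples with attr == value."""
    bags = {}
    for t in cars:
        if t[attr] == value:
            for k, v in t.items():
                if k != attr:
                    bags.setdefault(k, Counter())[v.lower()] += 1
    return bags

honda = vector_of_bags("Make", "Honda")
chevy = vector_of_bags("Make", "Chevrolet")
# IR measures (e.g. the Jaccard variants on a later slide) can now
# compare the two representations bag by bag.
print(honda["Body"], chevy["Body"])  # both dominated by 'sedan'
```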

15
Finding similarities between attribute values
16
5/4
17
Challenges in answering Imprecise Queries
  • Challenges
  • Extracting additional tuples with minimal domain
    knowledge
  • Estimating similarity with minimal user input
  • We introduce IQE (Imprecise Query Engine)
  • Uses query workload to identify other precise
    queries
  • Extracts additional tuples satisfying a query by
    issuing similar precise queries
  • Measures distance between queries using
    Answerset Similarity

18
Answerset Similarity
  • Answerset A(Q): Set of all answer tuples of
    query Q over relation R
  • Query Similarity
  • Sim(Q1, Q2) = Sim(A(Q1), A(Q2))
  • Measuring answerset similarity
  • Relational model
  • exact match between tuples
  • captures complete overlap
  • Vector space model
  • match keywords
  • also detects partial overlaps
  • Problem: Vector space model representation for
    answersets?
  • Answer: the SuperTuple (see the sketch below)

Answerset for Q(Author=Widom)
Answerset for Q(Author=Ullman)
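A minimal sketch of building a SuperTuple; the answer tuples for Q(Author=Widom) are made up:

```python
from collections import Counter

# Collapsing an answerset into a SuperTuple: one bag of keywords per
# attribute, pooled over all answer tuples. The sample answers are
# hypothetical.
answers = [
    {"Title": "Lore a database for XML", "Conference": "VLDB", "Year": "1997"},
    {"Title": "Change detection in XML", "Conference": "ICDE", "Year": "1998"},
]

def supertuple(answerset):
    """Pool each attribute's values into a single bag of keywords."""
    st = {}
    for tup in answerset:
        for attr, text in tup.items():
            st.setdefault(attr, Counter()).update(text.lower().split())
    return st

print(supertuple(answers)["Title"])
# 'xml' occurs twice; every other title word occurs once
```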
19
Similarity Measures
  • Jaccard similarity metric with bag semantics
  • SimJ(Q1, Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
  • Doc-Doc Similarity
  • Equal importance to all attributes
  • Supertuple considered as a single bag of keywords
  • Simdoc-doc(Q1, Q2) = SimJ(STQ1, STQ2)
  • Weighted-Attribute Similarity
  • Weights assigned to attributes signify importance
    to the user
  • Simwatr(Q1, Q2) = Σ wi × SimJ(STQ1(Ai), STQ2(Ai))
    (both measures are sketched below)
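A minimal sketch of both measures, assuming supertuples are attribute-to-bag mappings as built above; the weights and contents are made up:

```python
from collections import Counter

def sim_j(b1, b2):
    """Jaccard similarity with bag (multiset) semantics."""
    union = sum((b1 | b2).values())
    return sum((b1 & b2).values()) / union if union else 0.0

def sim_doc_doc(st1, st2):
    """Whole supertuple as one bag: all attributes weigh equally."""
    return sim_j(sum(st1.values(), Counter()), sum(st2.values(), Counter()))

def sim_watr(st1, st2, w):
    """Attribute-wise Jaccard, weighted by user-assigned importance."""
    return sum(w[a] * sim_j(st1.get(a, Counter()), st2.get(a, Counter()))
               for a in w)

# Hypothetical supertuples for two workload queries.
st1 = {"Title": Counter({"xml": 2, "database": 1}), "Conference": Counter({"vldb": 1})}
st2 = {"Title": Counter({"database": 2, "xml": 1}), "Conference": Counter({"icde": 1})}
print(sim_doc_doc(st1, st2))                                  # single-bag Jaccard
print(sim_watr(st1, st2, {"Title": 0.8, "Conference": 0.2}))  # weighted variant
```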

20
Empirical Evaluation
  • Goal
  • Evaluate the efficiency and effectiveness of our
    approach
  • Setup
  • A database system extending the bibliography
    mediator BibFinder, projecting the relation
  • Publications(Author, Title, Conference,
    Journal, Year)
  • Query log consists of 10K precise queries
  • User study
  • 3 graduate students
  • 90 test queries - 30 chosen by each student
  • Platform: Java 2 on a Linux server (Intel
    Celeron 2.2 GHz, 512 MB)

21
Answering Imprecise Query
  • Estimating query similarity
  • For each q ∈ Qlog
  • Compute Sim(q, q′) for all q′ ∈ Qlog
  • Simdoc-doc(q, q′) = SimJ(STq, STq′)
  • Simwatr(q, q′) = Σ wi × SimJ(STq(Ai), STq′(Ai))
  • Extracting similar answers
  • Given a query Q
  • Map Q to a query q ∈ Qlog
  • Identify the k queries most similar to q
  • Execute the k new queries (see the sketch below)
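A minimal sketch of this query-time phase; Qlog, sim, and k are placeholders, and in practice the pairwise similarities over the workload can be precomputed offline:

```python
# Map an imprecise query Q to a workload query, then fetch the k most
# similar workload queries. sim(q1, q2) would be one of the supertuple
# measures above; all names here are hypothetical.
def answer_imprecise(Q, Qlog, sim, k=3):
    # Map Q to its closest precise query in the workload.
    q = max(Qlog, key=lambda w: sim(Q, w))
    # Identify the k workload queries most similar to q (excluding q).
    ranked = sorted((w for w in Qlog if w != q),
                    key=lambda w: sim(q, w), reverse=True)
    # Executing q plus these k queries yields the candidate answer tuples.
    return [q] + ranked[:k]
```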

22
Some Results
23
Relevance of Suggested Answers
Are the results precise? Average error in
relevance estimation is around 25%
24
User Study Summary
  • Precision for top-10 related queries is above 75%
  • Doc-Doc similarity measure dominates
    Weighted-Attribute similarity
  • Lessons
  • Queries with popular keywords are difficult
  • Efficiently and effectively capturing user
    interest is difficult
  • A solution requiring less input is more
    acceptable

25
What's Next?
  • Open Issues
  • Most similar query may not be present in the
    workload.
  • Answers to a similar query will have varying
    similarity depending on the affected attributes
  • Solution
  • Given an imprecise query, generate the most
    similar query.
  • Use attribute importance and value-value
    similarity to order tuples.
  • Challenges
  • Estimating attribute importance
  • Estimating value-value similarity

26
Learning the Semantics of the data
  • Estimate for value-value similarity
  • Similarity between values of categorical
    attribute
  • Sim(v11, v12) = Σ wi × Sim(Co-related_value(Ai, v11),
    Co-related_value(Ai, v12)), where Ai ∈
    Attributes(R), Ai <> A
  • Euclidean distance for numerical attributes
  • Use the model of the database (AFDs, Keys, Value
    correlations) to
  • Identify an implicit structure for the tuple.
  • Show other tuples that least break the structure.

CarDb(Make, Model, Year, Price, Mileage, Location,
Color)
Approximate Keys
  Model, Mileage, Location uniquely decides 90% of
  cars in Db
  Model, Mileage, Color uniquely decides 84% of
  cars in Db
Approximate Functional Dependencies (AFDs)
  Model -> Make
  Year -> Price
  Mileage -> Year
(a sketch of mining such AFDs follows)
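A minimal sketch of estimating the confidence of one such AFD (Model -> Make) from the data; the car tuples are made up:

```python
from collections import defaultdict

# An AFD's confidence here is the fraction of tuples whose Make agrees
# with the majority Make of their Model. The tuples are hypothetical.
cars = [
    {"Model": "Civic",  "Make": "Honda"},
    {"Model": "Civic",  "Make": "Honda"},
    {"Model": "Civic",  "Make": "Acura"},      # noise: the AFD is approximate
    {"Model": "Malibu", "Make": "Chevrolet"},
]

def afd_confidence(lhs, rhs, tuples):
    """Fraction of tuples consistent with the majority rhs per lhs value."""
    groups = defaultdict(lambda: defaultdict(int))
    for t in tuples:
        groups[t[lhs]][t[rhs]] += 1
    kept = sum(max(by_rhs.values()) for by_rhs in groups.values())
    return kept / len(tuples)

print(afd_confidence("Model", "Make", cars))   # 0.75
```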
27
Query relaxation
28
Finding similarities between attribute values
29
Summary
  • An approach for answering imprecise queries over
    Web databases
  • Answerset Similarity using Supertuple
  • Workload queries
  • Database unaffected
  • Empirical evaluation showing
  • High relevance of identified similar queries
  • Applicable to any existing database

30
Conclusion
  • Havasu Integration System
  • Introduced and described its salient features
  • Elaborated on StatMiner, the coverage/overlap
    learning component
  • Imprecise Query Answering
  • An IR-based solution described
  • Results showing effectiveness presented
  • Open questions and a new solution described

31
Publications
  • Imprecise Queries & Data Integration
  • Answering Imprecise Database Queries, ACM WIDM,
    2003.
  • Mining Coverage Statistics for Websource
    Selection in a Mediator, CIKM 2002.
  • Mining Source Coverage Statistics for Data
    Integration, in ACM WIDM, 2001.
  • Optimizing Recursive Information Gathering Plans
    in EMERAC, JIIS, February 2004 (to appear).
  • Other
  • The XOO7 Benchmark, EEXTT 2002. 
  • Efficient XML Data Management: An Analysis, ECWEB
    2002.
  • XOO7: Applying the OO7 Benchmark to XML Query
    Processing Tools, CIKM 2001.