Title: Issues in Bridging DB
1 Issues in Bridging DB and IR
4/29
- Announcements
  - Next class: Interactive Review (come prepared)
  - Homework III solutions online
  - Demos tomorrow (instructions will be mailed by the end of the class)
2 First did some discussion of BibFinder: how queries are mapped, etc.
3 CEAS Online Evaluations
- You can do them at
  - https://intraweb.eas.asu.edu/eval
- Will be available until the end of day May 5th
  - (so the exam is unfettered by what you might think about it?)
- Instructors get to see it only after the grades have been given
  - (so you don't need to feel compelled to be particularly nice)
- Your feedback would be appreciated (especially the written comments)
  - Last semester I got 2,196 words of comments; let us see if we can break the record :-)
4 The popularity of the Web brings two broad challenges to Databases
- Supporting heterogeneous data (combining DB/IR)
  - This can be tackled in the presence of a single database
  - The issues are
    - How to do effective querying in the presence of structured and text data
      - E.g. the Stuff I have Seen project
    - How to support IR-style querying on DB
      - Because users seem to know IR/keyword style querying more
      - (notice the irony here: we said structure is good because it supports structured querying)
    - How to support imprecise queries
- Integration of autonomous data sources
  - Data/information integration
    - Technically has to handle heterogeneous data too
    - But we will sort of assume that the sources are quasi-relational
5 DB vs. IR
- DBs allow structured querying
  - Queries and results (tuples) are different objects
  - Soundness and Completeness expected
  - User is expected to know what she is doing
- IR only supports unstructured querying
  - Queries and results are both documents!
  - High Precision and Recall is hoped for
  - User is expected to be a dunderhead.
6 Some specific problems
- How to handle textual attributes?
- How to support keyword-based querying?
- How to handle imprecise queries?
  - (Ullas Nambiar's work)
7 1. Handling text fields in data tuples
- Often you have database relations some of whose fields are textual
  - E.g. a movie database, which has, in addition to year, director etc., a column called Review which is unstructured text
- Normal DB operations ignore this unstructured stuff (can't join over them).
  - SQL sometimes supports a "Contains" constraint (e.g. give me movies that contain "Rotten" in the review)
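A minimal sketch of the "Contains" idea, assuming an in-memory SQLite table. The table name, column names and rows are made up for illustration, and `LIKE` is used here as a stand-in because full-text `CONTAINS` operators are vendor-specific.

```python
import sqlite3

# Hypothetical movie relation with one unstructured text column (review).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (title TEXT, year INTEGER, director TEXT, review TEXT)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        ("Movie A", 2001, "Director X", "Rotten script, decent acting"),
        ("Movie B", 2002, "Director Y", "A fresh and funny story"),
        ("Movie C", 2003, "Director Z", "Simply rotten from start to finish"),
    ],
)


def movies_containing(word):
    """Return titles whose review text contains the given word."""
    # SQLite's LIKE is case-insensitive for ASCII letters by default.
    rows = conn.execute(
        "SELECT title FROM movies WHERE review LIKE ? ORDER BY title",
        (f"%{word}%",),
    )
    return [row[0] for row in rows]


print(movies_containing("Rotten"))
```

Note that this is a boolean filter only: every matching tuple is equally good, which is exactly the limitation the ranked (IR-style) approaches try to remove.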
8 Soft Joins: WHIRL (Cohen)
- We can extend the notion of Joins to Similarity Joins, where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output in a ranked form, with the rank proportional to the similarity
- Neat idea, but does have some implementation difficulties
  - Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples
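The similarity-join idea can be sketched as follows. This is not WHIRL's actual query processing: it naively scores the whole cross-product and then keeps the top k, which is precisely the cost the slide says a real implementation has to avoid. The TF-IDF weighting, the whitespace tokenizer and the function names are assumptions for illustration.

```python
import heapq
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one unit-length TF-IDF vector (term -> weight) per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    df = Counter(term for d in docs for term in d)  # document frequencies
    n = len(docs)
    vectors = []
    for d in docs:
        vec = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join(left, right, k):
    """Return the k highest-scoring (left, right, similarity) pairs."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[: len(left)], vecs[len(left):]
    scored = (
        (cosine(a, b), i, j) for i, a in enumerate(lv) for j, b in enumerate(rv)
    )
    top = heapq.nlargest(k, scored)
    return [(left[i], right[j], round(s, 3)) for s, i, j in top]


names_a = ["star wars a new hope", "the matrix"]
names_b = ["the matrix reloaded", "star wars"]
for pair in soft_join(names_a, names_b, k=2):
    print(pair)
```

An exact join on these two columns would return nothing; the soft join returns the two plausible pairings, ranked by similarity.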
10 2. Supporting keyword search on databases
How do we answer a query like "Soumen Sunita"?
Issues
-- the schema is normalized (not everything in one table)
-- How to rank multiple tuples which contain the keywords?
11 What BANKS Does
The whole DB is seen as a directed graph (edges correspond to foreign keys)
Answers are subgraphs
Ranked by edge weights
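A rough sketch of the graph view described above, not the actual BANKS algorithm. Assumptions: foreign-key edges are traversed in both directions with the same weight, an answer is the union of shortest paths from a candidate root to the nearest tuple matching each keyword, and answers are ranked by total path weight. The tuples, node ids and weights are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical tuples (node id -> searchable text) of a normalized schema.
NODES = {
    "author:1": "Soumen", "author:2": "Sunita",
    "paper:1": "Paper P1", "paper:2": "Paper P2",
    "writes:1": "", "writes:2": "", "writes:3": "",
}
# Foreign-key edges (referencing tuple, referenced tuple, weight).
FK_EDGES = [
    ("writes:1", "author:1", 1.0), ("writes:1", "paper:1", 1.0),
    ("writes:2", "author:2", 1.0), ("writes:2", "paper:1", 1.0),
    ("writes:3", "author:1", 1.0), ("writes:3", "paper:2", 1.0),
]


def build_graph(edges):
    graph = defaultdict(list)
    for src, dst, w in edges:
        graph[src].append((dst, w))
        graph[dst].append((src, w))  # simplification: follow keys both ways
    return graph


def shortest_paths(graph, sources):
    """Multi-source Dijkstra: distance and parent pointer toward nearest source."""
    dist = {s: 0.0 for s in sources}
    parent = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent


def keyword_search(nodes, graph, keywords):
    """Return (total weight, sorted node list) answers, best first."""
    searches = []
    for kw in keywords:
        hits = [n for n, text in nodes.items() if kw.lower() in text.lower()]
        if not hits:
            return []
        searches.append(shortest_paths(graph, hits))
    best = {}  # answer subgraph -> lowest total weight seen
    for root in nodes:
        if not all(root in dist for dist, _ in searches):
            continue
        tree, score = set(), 0.0
        for dist, parent in searches:
            score += dist[root]
            node = root
            while node is not None:  # walk back to the matching tuple
                tree.add(node)
                node = parent[node]
        key = frozenset(tree)
        best[key] = min(score, best.get(key, float("inf")))
    return sorted((s, sorted(t)) for t, s in best.items())


answers = keyword_search(NODES, build_graph(FK_EDGES), ["Soumen", "Sunita"])
print(answers[0])
```

The top answer connects the two author tuples through the paper they share, even though no single tuple contains both keywords.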
12 BANKS: Keyword Search in DB
13 3. Supporting Imprecise Queries
- Increasing number of Web accessible databases
  - E.g. bibliographies, reservation systems, department catalogs etc
  - Support for precise queries only: exactly matching tuples
- Difficulty in extracting desired information
  - Limited query capabilities provided by form based query interface
  - Lack of schema/domain information
  - Increasing complexity of types of data, e.g. hypertext, images etc
- Often times user wants "about the same" instead of "exact"
  - Bibliography search: find similar publications
Solution: Provide answers closely matching query constraints
14 Relaxing queries
- It is obvious how to relax certain types of attribute values
  - E.g. price=7000 is approximately the same as price=7020
- But how do we relax categorical attributes?
  - How should we relax Make=Honda?
- Two possible approaches
  - Assume that domain specific information about similarity of values is available (difficult to satisfy in practice)
  - Attempt to derive the similarity between attribute values directly from the data
- Qn: How do we compute similarity between Make=Honda and Make=Chevrolet?
  - Idea: Compare the set of all tuples where Make=Honda to the set of all tuples where Make=Chevrolet
  - Consider the set of tuples as a vector of bags (where bags correspond to the individual attributes)
  - Use IR similarity techniques to compare the vectors
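The "vector of bags" idea can be sketched like this. Assumptions: the car rows are invented, each attribute's bag is compared with Jaccard similarity under bag semantics, and the per-attribute scores are simply averaged (the slides leave the exact IR measure open).

```python
from collections import Counter

# Hypothetical car tuples.
CARS = [
    {"Make": "Honda", "Body": "Sedan", "PriceBand": "Low"},
    {"Make": "Honda", "Body": "Sedan", "PriceBand": "Mid"},
    {"Make": "Honda", "Body": "Coupe", "PriceBand": "Low"},
    {"Make": "Toyota", "Body": "Sedan", "PriceBand": "Low"},
    {"Make": "Toyota", "Body": "Sedan", "PriceBand": "Mid"},
    {"Make": "Chevrolet", "Body": "Truck", "PriceBand": "High"},
    {"Make": "Chevrolet", "Body": "Sedan", "PriceBand": "Mid"},
]


def bags_for(rows, attr, value):
    """One bag (Counter) per other attribute, over tuples where attr == value."""
    bags = {}
    for row in rows:
        if row[attr] != value:
            continue
        for other, v in row.items():
            if other != attr:
                bags.setdefault(other, Counter())[v] += 1
    return bags


def bag_jaccard(x, y):
    """Jaccard with bag semantics: sum of min counts over sum of max counts."""
    union = sum((x | y).values())
    return sum((x & y).values()) / union if union else 0.0


def value_similarity(rows, attr, v1, v2):
    """Average per-attribute bag similarity between attr=v1 and attr=v2."""
    b1, b2 = bags_for(rows, attr, v1), bags_for(rows, attr, v2)
    attrs = sorted(set(b1) | set(b2))
    if not attrs:
        return 0.0
    total = sum(
        bag_jaccard(b1.get(a, Counter()), b2.get(a, Counter())) for a in attrs
    )
    return total / len(attrs)


print(value_similarity(CARS, "Make", "Honda", "Toyota"))
print(value_similarity(CARS, "Make", "Honda", "Chevrolet"))
```

On this toy data Honda comes out closer to Toyota than to Chevrolet, so a query on Make=Honda would be relaxed to Toyota first. No domain knowledge about car makes was supplied; the ordering comes only from co-occurring values.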
15 Finding similarities between attribute values
16 5/4
17 Challenges in answering Imprecise Queries
- Challenges
  - Extracting additional tuples with minimal domain knowledge
  - Estimating similarity with minimal user input
- We introduce IQE (Imprecise Query Engine)
  - Uses query workload to identify other precise queries
  - Extracts additional tuples satisfying a query by issuing similar precise queries
  - Measures distance between queries using Answerset Similarity
18 Answerset Similarity
- Answerset A(Q): Set of all answer tuples of query Q given by relation R.
- Query Similarity
  - $\mathrm{Sim}(Q_1, Q_2) = \mathrm{Sim}(A(Q_1), A(Q_2))$
- Measuring answerset similarity
  - Relational model
    - exact match between tuples
    - captures complete overlap
  - Vector space model
    - match keywords
    - also detects partial overlaps
- Problem: Vector Space model representation for answersets
  - Answer: SuperTuple
Answerset for Q(Author=Widom)
Answerset for Q(Author=Ullman)
19 Similarity Measures
- Jaccard similarity metric with bag semantics
  - $\mathrm{Sim}_J(Q_1, Q_2) = \dfrac{|Q_1 \cap Q_2|}{|Q_1 \cup Q_2|}$
- Doc-Doc Similarity
  - Equal importance to all attributes
  - Supertuple considered as single bag of keywords
  - $\mathrm{Sim}_{doc\text{-}doc}(Q_1, Q_2) = \mathrm{Sim}_J(ST_{Q_1}, ST_{Q_2})$
- Weighted-Attribute Similarity
  - Weights assigned to attributes signify importance to user
  - $\mathrm{Sim}_{watr}(Q_1, Q_2) = \sum_i w_i \times \mathrm{Sim}_J(ST_{Q_1}(A_i), ST_{Q_2}(A_i))$
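To make the difference between the two measures concrete, here is a small sketch that builds a supertuple from an answerset and computes both scores. The answer rows, the attribute weights and the whitespace keyword split are assumptions for illustration only.

```python
from collections import Counter


def supertuple(answers):
    """Collapse an answerset into one bag of keywords per attribute."""
    st = {}
    for row in answers:
        for attr, value in row.items():
            st.setdefault(attr, Counter()).update(str(value).lower().split())
    return st


def sim_j(x, y):
    """Jaccard with bag semantics."""
    union = sum((x | y).values())
    return sum((x & y).values()) / union if union else 0.0


def sim_doc_doc(st1, st2):
    """Treat each supertuple as a single bag of keywords."""
    return sim_j(sum(st1.values(), Counter()), sum(st2.values(), Counter()))


def sim_watr(st1, st2, weights):
    """Weighted sum of per-attribute Jaccard similarities."""
    return sum(
        w * sim_j(st1.get(a, Counter()), st2.get(a, Counter()))
        for a, w in weights.items()
    )


# Toy answersets for two precise queries (made-up rows).
answers_q1 = [
    {"Title": "data streams", "Conference": "sigmod"},
    {"Title": "xml views", "Conference": "vldb"},
]
answers_q2 = [
    {"Title": "data mining", "Conference": "sigmod"},
    {"Title": "datalog views", "Conference": "vldb"},
]
st1, st2 = supertuple(answers_q1), supertuple(answers_q2)
weights = {"Title": 0.7, "Conference": 0.3}  # hypothetical importance weights

print(sim_doc_doc(st1, st2))
print(sim_watr(st1, st2, weights))
```

The two scores differ because the weighted version scores each attribute separately before combining, while doc-doc pools all keywords into one bag.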
20 Empirical Evaluation
- Goal
  - Evaluate the efficiency and effectiveness of our approach
- Setup
  - A database system extending the bibliography mediator BibFinder, projecting relation
    - Publications(Author, Title, Conference, Journal, Year)
  - Query log consists of 10K precise queries
- User study
  - 3 graduate students
  - 90 test queries: 30 chosen by each student
- Platform: Java 2 on a Linux Server (Intel Celeron 2.2 GHz, 512 MB)
21 Answering Imprecise Query
- Estimating query similarity
  - For each $q \in Q_{log}$
    - Compute $\mathrm{Sim}(q, q')$ for all $q' \in Q_{log}$
    - $\mathrm{Sim}_{doc\text{-}doc}(q, q') = \mathrm{Sim}_J(ST_q, ST_{q'})$
    - $\mathrm{Sim}_{watr}(q, q') = \sum_i w_i \times \mathrm{Sim}_J(ST_q(A_i), ST_{q'}(A_i))$
- Extracting similar answers
  - Given a query Q
    - Map Q to a query $q \in Q_{log}$
    - Identify k queries similar to q
    - Execute the k new queries
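The two phases above can be sketched end to end on a toy relation. Assumptions: queries are conjunctions of attribute=value constraints, "mapping Q to the log" is an exact lookup, doc-doc similarity is used, and similarities are computed on demand rather than precomputed for the whole log. All rows and names are invented.

```python
from collections import Counter

# Hypothetical publication tuples and query log.
PUBS = [
    {"Author": "a1", "Title": "query optimization", "Conference": "sigmod"},
    {"Author": "a2", "Title": "query processing", "Conference": "sigmod"},
    {"Author": "a3", "Title": "protein folding", "Conference": "ismb"},
]
QLOG = [{"Author": "a1"}, {"Author": "a2"}, {"Author": "a3"}]


def execute(query, rows):
    """Answer a precise query: every constraint must match exactly."""
    return [r for r in rows if all(r[a] == v for a, v in query.items())]


def keyword_bag(answers):
    """Supertuple flattened into a single bag of keywords (doc-doc view)."""
    bag = Counter()
    for row in answers:
        for value in row.values():
            bag.update(str(value).lower().split())
    return bag


def sim_j(x, y):
    union = sum((x | y).values())
    return sum((x & y).values()) / union if union else 0.0


def similar_queries(q, qlog, rows, k):
    """The k logged queries whose answersets look most like q's."""
    bag_q = keyword_bag(execute(q, rows))
    scored = [
        (sim_j(bag_q, keyword_bag(execute(p, rows))), i)
        for i, p in enumerate(qlog)
        if p != q
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [qlog[i] for s, i in scored[:k] if s > 0]


def answer_imprecise(query, qlog, rows, k):
    """Exact answers first, then answers of the k most similar logged queries."""
    if query not in qlog:
        raise LookupError("query has no counterpart in the workload")
    answers = execute(query, rows)
    for p in similar_queries(query, qlog, rows, k):
        answers.extend(r for r in execute(p, rows) if r not in answers)
    return answers


for row in answer_imprecise({"Author": "a1"}, QLOG, PUBS, k=1):
    print(row["Author"], row["Title"])
```

The database itself is only ever asked precise queries; the relaxation happens entirely in the choice of which logged queries to issue.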
22 Some Results
23 Relevance of Suggested Answers
Are the results precise? Average error in relevance estimation is around 25%
24 User Study Summary
- Precision for top-10 related queries is above 75%
- Doc-Doc similarity measure dominates Weighted-attribute similarity
- Lessons
  - Queries with popular keywords are difficult
  - Efficiently and effectively capturing user interest is difficult
  - A solution requiring less input is more acceptable
25 What's Next?
- Open Issues
  - Most similar query may not be present in the workload.
  - Answers to a similar query will have varying similarity depending on the affected attributes
- Solution
  - Given an imprecise query, generate the most similar query.
  - Use attribute importance and value-value similarity to order tuples.
- Challenges
  - Estimating attribute importance
  - Estimating value-value similarity
26 Learning the Semantics of the data
- Estimate for value-value similarity
  - Similarity between values of categorical attribute
    - $\mathrm{Sim}(v_{11}, v_{12}) = \sum_i w_i \times \mathrm{Sim}(\text{Co-related\_value}(A_i, v_{11}), \text{Co-related\_value}(A_i, v_{12}))$ where $A_i \in \mathrm{Attributes}(R)$, $A_i \neq A$
  - Euclidean distance for numerical attributes
- Use the Model of the database (AFDs, Keys, Value correlations) to
  - Identify an implicit structure for the tuple.
  - Show other tuples that least break the structure.
CarDb(Make, Model, Year, Price, Mileage, Location, Color)
Approximate Keys
- {Model, Mileage, Location}: uniquely decides 90% of cars in Db
- {Model, Mileage, Color}: uniquely decides 84% of cars in Db
Approximate Functional Dependencies (AFDs)
- Model -> Make
- Year -> Price
- Mileage -> Year
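As a rough illustration of what "approximate" means here, the sketch below measures how well a candidate key and a candidate dependency hold on a small table. The definitions are assumptions, since the slides do not give them: key quality is the fraction of tuples uniquely identified, and AFD confidence is the fraction of tuples that remain after keeping, for each left-hand value, only the most common right-hand value. The rows are invented, including one deliberately misspelled Make.

```python
from collections import Counter, defaultdict

# Hypothetical car tuples; the last row carries a noisy Make value.
CARS = [
    {"Make": "Honda", "Model": "Civic", "Year": 2001, "Color": "Red"},
    {"Make": "Honda", "Model": "Civic", "Year": 2001, "Color": "Blue"},
    {"Make": "Honda", "Model": "Accord", "Year": 2000, "Color": "Red"},
    {"Make": "Toyota", "Model": "Camry", "Year": 2000, "Color": "Red"},
    {"Make": "Hoonda", "Model": "Civic", "Year": 2002, "Color": "Red"},
]


def key_uniqueness(rows, attrs):
    """Fraction of rows whose projection on attrs is unique in the table."""
    projections = [tuple(r[a] for a in attrs) for r in rows]
    counts = Counter(projections)
    return sum(1 for p in projections if counts[p] == 1) / len(rows)


def afd_confidence(rows, lhs, rhs):
    """Fraction of rows consistent with lhs -> rhs under a majority vote."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[lhs]][r[rhs]] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)


print(key_uniqueness(CARS, ["Model", "Year"]))
print(key_uniqueness(CARS, ["Model", "Year", "Color"]))
print(afd_confidence(CARS, "Model", "Make"))
```

A dependency such as Model -> Make holds for most but not all rows, so it is kept with a confidence score instead of being rejected outright, which is what lets it guide the order in which attributes are relaxed.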
27 Query relaxation
28 Finding similarities between attribute values
29 Summary
- An approach for answering imprecise queries over Web databases
  - Answerset Similarity using Supertuple
  - Workload queries
  - Database unaffected
- Empirical evaluation showing
  - High relevance of identified similar queries
  - Applicable to any existing database
30 Conclusion
- Havasu Integration System
  - Introduced and described its salient features
  - Elaborated StatMiner, the coverage/overlap learning component
- Imprecise Query Answering
  - An IR-based solution described
  - Results showing effectiveness presented
  - Open questions and a new solution described
31 Publications
- Imprecise Queries and Data Integration
  - Answering Imprecise Database Queries, ACM WIDM, 2003.
  - Mining Coverage Statistics for Websource Selection in a Mediator, CIKM 2002.
  - Mining Source Coverage Statistics for Data Integration, in ACM WIDM, 2001.
  - Optimizing Recursive Information Gathering Plans in EMERAC, to appear in JIIS, February 2004.
- Other
  - The XOO7 Benchmark, EEXTT 2002.
  - Efficient XML Data Management: An Analysis, ECWEB 2002.
  - XOO7: Applying OO7 Benchmark to XML Query Processing Tools, CIKM 2001.