Transcript and Presenter's Notes

Title: Issues in Bridging DB & IR


1
Issues in Bridging DB & IR
4/29
  • Announcements
  • Next class Interactive Review (Come prepared)
  • Homework III solutions online
  • Demos tomorrow (instructions will be mailed by
    the end of the class)

2
First did some discussion of BibFinder: how
queries are mapped, etc.
3
CEAS Online Evaluations
  • You can do them at
  • https://intraweb.eas.asu.edu/eval
  • Will be available until the end of day May 5th
  • (so the exam is unfettered by what you might
    think about it :-)
  • Instructors get to see it only after the grades
    have been given
  • (so you don't need to feel compelled to be
    particularly nice)
  • Your feedback would be appreciated (especially
    the written comments)
  • Last semester I got 2,196 words of comments; let
    us see if we can break the record :-)

4
The popularity of the Web brings two broad
challenges to databases
  • Supporting heterogeneous data (combining DB/IR)
  • This can be tackled in the presence of a single
    database
  • The issues are
  • How to do effective querying in the presence of
    structured and text data
  • E.g. the Stuff I've Seen project
  • How to support IR-style querying on DB
  • Because users seem to know IR/keyword style
    querying more
  • (notice the irony here: we said structure is good
    because it supports structured querying)
  • How to support imprecise queries
  • Integration of autonomous data sources
  • Data/information integration
  • Technically has to handle heterogeneous data too
  • But we will sort of assume that the sources are
    quasi-relational

5
DB vs. IR
  • DBs allow structured querying
  • Queries and results (tuples) are different
    objects
  • Soundness & Completeness expected
  • User is expected to know what she is doing
  • IR only supports unstructured querying
  • Queries and results are both documents!
  • High Precision & Recall is hoped for
  • User is expected to be a dunderhead.

6
Some specific problems
  • How to handle textual attributes?
  • How to support keyword-based querying?
  • How to handle imprecise queries?
  • (Ullas Nambiar's work)

7
1. Handling text fields in data tuples
  • Often you have database relations some of whose
    fields are Textual
  • E.g. a movie database, which has, in addition to
    year, director etc., a column called Review
    which is unstructured text
  • Normal DB operations ignore this unstructured
    stuff (can't join over them).
  • SQL sometimes supports a Contains constraint
    (e.g. give me movies that contain 'Rotten' in
    the review); a small sketch follows
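A minimal sketch of such a constraint, using SQLite's FTS5 full-text index as a stand-in for a SQL CONTAINS predicate; the movies table and its rows are made up:

```python
import sqlite3

# A Contains-style text constraint via SQLite's FTS5 full-text index,
# standing in for SQL's CONTAINS predicate (FTS5 ships with most Python
# builds). The movies table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE movies USING fts5(title, year, director, review)")
conn.execute("INSERT INTO movies VALUES ('Plan 9', '1959', 'Ed Wood', 'a rotten mess')")
conn.execute("INSERT INTO movies VALUES ('Vertigo', '1958', 'Hitchcock', 'a masterpiece')")

# Keep only the movies whose review field contains the keyword 'rotten'.
for (title,) in conn.execute(
        "SELECT title FROM movies WHERE movies MATCH 'review : rotten'"):
    print(title)  # -> Plan 9
```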

8
Soft Joins: WHIRL [Cohen]
  • We can extend the notion of Joins to Similarity
    Joins, where similarity is measured in terms of
    vector similarity over the text attributes. So,
    the join tuples are output in a ranked form, with
    the rank proportional to the similarity (see the
    sketch below)
  • Neat idea but does have some implementation
    difficulties
  • Most tuples in the cross-product will have
    non-zero similarities. So, need query processing
    that will somehow just produce highly ranked
    tuples
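A minimal sketch of such a similarity join, assuming scikit-learn's TF-IDF vectorizer and cosine similarity; the two relations and their contents are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up relations to be soft-joined on their text attributes.
movies  = ["Star Wars Episode IV A New Hope", "The Godfather Part II"]
reviews = ["godfather part 2 is a sequel that outdoes the original",
           "a new hope launched the star wars saga"]

# Vector-space representation of both sides over a shared vocabulary.
vec  = TfidfVectorizer().fit(movies + reviews)
sims = cosine_similarity(vec.transform(movies), vec.transform(reviews))

# Output joined pairs ranked by similarity instead of exact equality.
pairs = [(sims[i, j], m, r) for i, m in enumerate(movies)
                            for j, r in enumerate(reviews)]
for s, m, r in sorted(pairs, reverse=True):
    print(f"{s:.2f}  {m}  ~  {r}")
```

Note that almost every pair gets some non-zero score, which is exactly why query processing that produces only the highly ranked tuples, without materializing the cross-product, is the hard part.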

9
(No Transcript)
10
2. Supporting keyword search on databases
How do we answer a query like "Soumen Sunita"?
Issues:
-- the schema is normalized (not everything is
in one table)
-- how to rank multiple tuples which contain the
keywords?

11
What BANKS Does
The whole DB is seen as a directed graph
(edges correspond to foreign keys). Answers
are subgraphs, ranked by edge weights (see the
sketch below).
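A minimal sketch of this idea, assuming the networkx graph library; the tuple identifiers and edge weights are made up:

```python
import networkx as nx

# Tuples are nodes and foreign-key references are edges; an answer to a
# keyword query is a low-weight subgraph connecting the matching tuples.
G = nx.Graph()
G.add_edge("paper:p1", "writes:w1", weight=1)
G.add_edge("writes:w1", "author:Soumen", weight=1)
G.add_edge("paper:p1", "writes:w2", weight=1)
G.add_edge("writes:w2", "author:Sunita", weight=1)

# Tuples matching the keywords 'Soumen' and 'Sunita'.
hits = ["author:Soumen", "author:Sunita"]

# A simple two-keyword answer: the cheapest path connecting the hits;
# BANKS proper searches for low-weight connection trees more generally.
print(nx.shortest_path(G, hits[0], hits[1], weight="weight"))
# -> the paper tuple p1 is what joins the two author tuples
```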
12
BANKS: Keyword Search in DB
13
3. Supporting Imprecise Queries
  • Increasing number of Web accessible databases
  • E.g. bibliographies, reservation systems,
    department catalogs etc
  • Support for precise queries only: exactly
    matching tuples are returned
  • Difficulty in extracting desired information
  • Limited query capabilities provided by form based
    query interface
  • Lack of schema/domain information
  • Increasing complexity of types of data, e.g.
    hypertext, images etc.
  • Oftentimes the user wants 'about the same'
    instead of 'exact'
  • Bibliography search: find similar publications

Solution: Provide answers closely matching query
constraints
14
Relaxing queries
  • It is obvious how to relax certain types of
    attribute values
  • E.g. Price=7000 is approximately the same as
    Price=7020
  • But how do we relax categorical attributes?
  • How should we relax Make=Honda?
  • Two possible approaches
  • Assume that domain specific information about
    similarity of values is available (difficult to
    satisfy in practice)
  • Attempt to derive the similarity between
    attribute values directly from the data
  • Qn: How do we compute the similarity between
    Make=Honda and Make=Chevrolet?
  • Idea: Compare the set of all tuples where
    Make=Honda to the set of all tuples where
    Make=Chevrolet
  • Consider the set of tuples as a vector of bags
    (where bags correspond to the individual
    attributes)
  • Use IR similarity techniques to compare the
    vectors (see the sketch below)
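A minimal sketch of the vector-of-bags representation; the car tuples are made up:

```python
from collections import Counter

# Representing a categorical value (e.g. Make=Honda) as a vector of
# bags, one bag per remaining attribute, pooled over every tuple that
# carries the value. The car tuples are hypothetical.
cars = [
    {"Make": "Honda",     "Model": "Civic",  "Body": "sedan", "Color": "red"},
    {"Make": "Honda",     "Model": "Accord", "Body": "sedan", "Color": "blue"},
    {"Make": "Chevrolet", "Model": "Malibu", "Body": "sedan", "Color": "red"},
]

def vector_of_bags(attr, value):
    """One bag of values per other attribute, over tuples with attr == value."""
    bags = {}
    for t in cars:
        if t[attr] == value:
            for k, v in t.items():
                if k != attr:
                    bags.setdefault(k, Counter())[v.lower()] += 1
    return bags

honda = vector_of_bags("Make", "Honda")
chevy = vector_of_bags("Make", "Chevrolet")
# IR measures (e.g. the Jaccard variants on a later slide) can now
# compare the two representations bag by bag.
print(honda["Body"], chevy["Body"])  # both dominated by 'sedan'
```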

15
Finding similarities between attribute values
16
5/4
17
Challenges in answering Imprecise Queries
  • Challenges
  • Extracting additional tuples with minimal domain
    knowledge
  • Estimating similarity with minimal user input
  • We introduce IQE (Imprecise Query Engine)
  • Uses query workload to identify other precise
    queries
  • Extracts additional tuples satisfying a query by
    issuing similar precise queries
  • Measures distance between queries using
    Answerset Similarity

18
Answerset Similarity
  • Answerset A(Q): Set of all answer tuples of
    query Q over relation R
  • Query Similarity
  • Sim(Q1, Q2) = Sim(A(Q1), A(Q2))
  • Measuring answerset similarity
  • Relational model
  • exact match between tuples
  • captures complete overlap
  • Vector space model
  • match keywords
  • also detects partial overlaps
  • Problem: Vector space model representation for
    answersets?
  • Answer: the SuperTuple (see the sketch below)

Answerset for Q(Author=Widom)
Answerset for Q(Author=Ullman)
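A minimal sketch of building a SuperTuple; the answer tuples for Q(Author=Widom) are made up:

```python
from collections import Counter

# Collapsing an answerset into a SuperTuple: one bag of keywords per
# attribute, pooled over all answer tuples. The sample answers are
# hypothetical.
answers = [
    {"Title": "Lore a database for XML", "Conference": "VLDB", "Year": "1997"},
    {"Title": "Change detection in XML", "Conference": "ICDE", "Year": "1998"},
]

def supertuple(answerset):
    """Pool each attribute's values into a single bag of keywords."""
    st = {}
    for tup in answerset:
        for attr, text in tup.items():
            st.setdefault(attr, Counter()).update(text.lower().split())
    return st

print(supertuple(answers)["Title"])
# 'xml' occurs twice; every other title word occurs once
```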
19
Similarity Measures
  • Jaccard similarity metric with bag semantics
  • SimJ(Q1, Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
  • Doc-Doc Similarity
  • Equal importance to all attributes
  • Supertuple considered as a single bag of keywords
  • Simdoc-doc(Q1, Q2) = SimJ(STQ1, STQ2)
  • Weighted-Attribute Similarity
  • Weights assigned to attributes signify importance
    to the user
  • Simwatr(Q1, Q2) = Σ wi × SimJ(STQ1(Ai), STQ2(Ai))
    (both measures are sketched below)
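A minimal sketch of both measures, assuming supertuples are attribute-to-bag mappings as built above; the weights and contents are made up:

```python
from collections import Counter

def sim_j(b1, b2):
    """Jaccard similarity with bag (multiset) semantics."""
    union = sum((b1 | b2).values())
    return sum((b1 & b2).values()) / union if union else 0.0

def sim_doc_doc(st1, st2):
    """Whole supertuple as one bag: all attributes weigh equally."""
    return sim_j(sum(st1.values(), Counter()), sum(st2.values(), Counter()))

def sim_watr(st1, st2, w):
    """Attribute-wise Jaccard, weighted by user-assigned importance."""
    return sum(w[a] * sim_j(st1.get(a, Counter()), st2.get(a, Counter()))
               for a in w)

# Hypothetical supertuples for two workload queries.
st1 = {"Title": Counter({"xml": 2, "database": 1}), "Conference": Counter({"vldb": 1})}
st2 = {"Title": Counter({"database": 2, "xml": 1}), "Conference": Counter({"icde": 1})}
print(sim_doc_doc(st1, st2))                                  # single-bag Jaccard
print(sim_watr(st1, st2, {"Title": 0.8, "Conference": 0.2}))  # weighted variant
```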

20
Empirical Evaluation
  • Goal
  • Evaluate the efficiency and effectiveness of our
    approach
  • Setup
  • A database system extending the bibliography
    mediator BibFinder, projecting the relation
  • Publications(Author, Title, Conference,
    Journal, Year)
  • Query log consists of 10K precise queries
  • User study
  • 3 graduate students
  • 90 test queries - 30 chosen by each student
  • Platform: Java 2 on a Linux server (Intel
    Celeron 2.2 GHz, 512 MB)

21
Answering Imprecise Query
  • Estimating query similarity
  • For each q ∈ Qlog
  • Compute Sim(q, q′) for all q′ ∈ Qlog
  • Simdoc-doc(q, q′) = SimJ(STq, STq′)
  • Simwatr(q, q′) = Σ wi × SimJ(STq(Ai), STq′(Ai))
  • Extracting similar answers
  • Given a query Q
  • Map Q to a query q ∈ Qlog
  • Identify the k queries most similar to q
  • Execute the k new queries (see the sketch below)
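A minimal sketch of this query-time phase; Qlog, sim, and k are placeholders, and in practice the pairwise similarities over the workload can be precomputed offline:

```python
# Map an imprecise query Q to a workload query, then fetch the k most
# similar workload queries. sim(q1, q2) would be one of the supertuple
# measures above; all names here are hypothetical.
def answer_imprecise(Q, Qlog, sim, k=3):
    # Map Q to its closest precise query in the workload.
    q = max(Qlog, key=lambda w: sim(Q, w))
    # Identify the k workload queries most similar to q (excluding q).
    ranked = sorted((w for w in Qlog if w != q),
                    key=lambda w: sim(q, w), reverse=True)
    # Executing q plus these k queries yields the candidate answer tuples.
    return [q] + ranked[:k]
```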

22
Some Results
23
Relevance of Suggested Answers
Are the results precise? Average error in
relevance estimation is around 25%
24
User Study Summary
  • Precision for top-10 related queries is above 75%
  • Doc-Doc similarity measure dominates
    Weighted-Attribute similarity
  • Lessons
  • Queries with popular keywords are difficult
  • Efficiently and effectively capturing user
    interest is difficult
  • A solution requiring less input is more
    acceptable

25
What's Next?
  • Open Issues
  • Most similar query may not be present in the
    workload.
  • Answers to a similar query will have varying
    similarity depending on the affected attributes
  • Solution
  • Given an imprecise query, generate the most
    similar query.
  • Use attribute importance and value-value
    similarity to order tuples.
  • Challenges
  • Estimating attribute importance
  • Estimating value-value similarity

26
Learning the Semantics of the data
  • Estimate for value-value similarity
  • Similarity between values of categorical
    attribute
  • Sim(v11, v12) = Σ wi × Sim(Co-related_value(Ai, v11),
    Co-related_value(Ai, v12)), where Ai ∈
    Attributes(R), Ai <> A
  • Euclidean distance for numerical attributes
  • Use the model of the database (AFDs, Keys, Value
    correlations) to
  • Identify an implicit structure for the tuple.
  • Show other tuples that least break the structure.

CarDb(Make, Model, Year, Price, Mileage, Location,
Color)
Approximate Keys
  Model, Mileage, Location uniquely decides 90% of
  cars in Db
  Model, Mileage, Color uniquely decides 84% of
  cars in Db
Approximate Functional Dependencies (AFDs)
  Model -> Make
  Year -> Price
  Mileage -> Year
(a sketch of mining such AFDs follows)
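A minimal sketch of estimating the confidence of one such AFD (Model -> Make) from the data; the car tuples are made up:

```python
from collections import defaultdict

# An AFD's confidence here is the fraction of tuples whose Make agrees
# with the majority Make of their Model. The tuples are hypothetical.
cars = [
    {"Model": "Civic",  "Make": "Honda"},
    {"Model": "Civic",  "Make": "Honda"},
    {"Model": "Civic",  "Make": "Acura"},      # noise: the AFD is approximate
    {"Model": "Malibu", "Make": "Chevrolet"},
]

def afd_confidence(lhs, rhs, tuples):
    """Fraction of tuples consistent with the majority rhs per lhs value."""
    groups = defaultdict(lambda: defaultdict(int))
    for t in tuples:
        groups[t[lhs]][t[rhs]] += 1
    kept = sum(max(by_rhs.values()) for by_rhs in groups.values())
    return kept / len(tuples)

print(afd_confidence("Model", "Make", cars))   # 0.75
```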
27
Query relaxation
28
Finding similarities between attribute values
29
Summary
  • An approach for answering imprecise queries over
    Web databases
  • Answerset Similarity using Supertuple
  • Workload queries
  • Database unaffected
  • Empirical evaluation showing
  • High relevance of identified similar queries
  • Applicable to any existing database

30
Conclusion
  • Havasu Integration System
  • Introduced and described its salient features
  • Elaborated on StatMiner, the coverage/overlap
    learning component
  • Imprecise Query Answering
  • An IR-based solution described
  • Results showing effectiveness presented
  • Open questions and a new solution described

31
Publications
  • Imprecise Queries & Data Integration
  • Answering Imprecise Database Queries, ACM WIDM,
    2003.
  • Mining Coverage Statistics for Websource
    Selection in a Mediator, CIKM 2002.
  • Mining Source Coverage Statistics for Data
    Integration, in ACM WIDM, 2001.
  • Optimizing Recursive Information Gathering Plans
    in EMERAC, JIIS, February 2004 (to appear).
  • Other
  • The XOO7 Benchmark, EEXTT 2002. 
  • Efficient XML Data Management: An Analysis, ECWEB
    2002.
  • XOO7: Applying the OO7 Benchmark to XML Query
    Processing Tools, CIKM 2001.