On Provenance of NonAnswers for Queries over Extracted Data PowerPoint PPT Presentation

presentation player overlay
About This Presentation
Transcript and Presenter's Notes

Title: On Provenance of NonAnswers for Queries over Extracted Data


1
On Provenance of Non-Answers for Queries over
Extracted Data
  • Jiansheng Huang
  • Ting Chen
  • AnHai Doan
  • Jeffrey F. Naughton

2
Imprecise Data in Information Extraction (IE)?
Examples DBLife (WISC)Avatar (IBM)etc.
Source-1
Fuzzy
Database
Imprecise
Source-2
Imprecise

integrate
query
Incorrect Incomplete
extract
Source-n
Answers
3
State of the Art
4
Motivating Example
  • Crawl web to extract information related to
    academic job openings.
  • Store result of extraction in an RDBMS.
  • Ask SQL queries and try to interpret results

5
Extracted Jobs
CS Dept. web sites
6
Extracted Ranking
CS Ranking Web Site
7
Question Answering
What are the CS PhD programs in California (CA)
that have job openings and are in the top 25?
SELECT Jobs.school_nameFROM Jobs, Ranking
WHERE Ranking.rank lt 25
AND Jobs.school_state ca
AND Jobs.job_opening yes
AND Jobs.school_name Ranking.school_name

8
Answer
  • berkeley

9
Non-Answers
  • ucsd, uc merced, harvard, ucsc,
  • But why?
  • Data exists.
  • No mechanism.

10
Assumptions
  • Relational data model
  • Subset of SQL
  • Selection (e.g., R.a 2)?
  • Projection (e.g., return R.a)?
  • Join (e.g., R.a S.b)?
  • Conjunctive predicates (e.g., a 2 and b 3)?
  • Satisfiable (e.g., no a 2 and a 3)?

11
Provenance of Non-Answers
query
y
x
z is a potential answer, (x, y) and (x, y) are
the provenance of z.
(x - x, y) explains why z is not an answer and
how z can become an answer.
12
Example
Jobs
Ranking
ucsd is a potential answer, The set of base
tuples is a provenance of ucsd.
13
Another Example
Jobs
Ranking
14
Trust and Constraints
Satisfy constraints
Dont consider updates
Untrusted
Trusted
query
y
x
15
Example Using Trust
trust
Jobs
Ranking
16
Factors DeterminingProvenance of Non-Answers
  • Trusted Data
  • Constraints
  • Query specification

17
Algorithm
  • Start from a user query and a specific non-answer
  • Add predicates derived from the non-answer
  • Add constraint predicates
  • Retain only predicates on trusted attributes
  • For attributes of a potential tuple
  • Determine equivalent constant value (e.g., a
    2)?
  • If none, return a variable
  • Evaluate the provenance query

18
Example Why is UCSD Not Answer?
  • Assume that we trust
  • Jobs(school_name, school_state)?
  • Ranking(school_name, rank)?
  • Completeness of Jobs and Ranking

19
Computing Provenance of UCSD
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
SELECT J.school_name, J.job_opening
, R.school_name, R.rankFROM Jobs AS J, Ranking
AS R
- yes
Trusted
Specifyingnon-answer
WHERE Jobs.school_name ucsd
AND J.school_state caAND R.rank lt 25AND
J.school_name R.school_name
Hypotheticalupdate
20
Provenance of UCSD
UCSD is a potential answer.
Why not an answer? because job_opening no.
How to become an answer? job_opening no - yes.
21
Provenance-Assisted Debugging
  • While implementing our job extraction example,
    actually used provenance of non-answers to find a
    bug.
  • Specifically, noticed UCSD is not an answer to
    find all dept. in top 25 with job openings
  • Informed by provenance, we checked UCSD web page
    and found it does have a job opening.
  • What happened?

22
Our Bug
  • UCSD web page has a job opening
  • Debugged extraction for UCSD instance
  • Bug a line in source longer than the line buffer
    for read
  • Fix increase line buffer size
  • Re-extract and re-query produces UCSD as answer

23
Deeper Issues
  • New records can be inserted.
  • Use an all-null tuple as a proxy in our
    provenance report (not actually inserted).
  • The join expression for a provenance query
    depends on the trust and constraints on the
    joined tables.

24
More on Join Expression
Given a join between R and S on R.c1 S.c2
andassuming R.c1 is trusted and R is complete,
the join expression for the provenance query is
  • if S.c2 is trusted,
  • if S is complete, R join S
  • if S.c2 is unique, R x S
  • Otherwise, (R join S) union (R x null, )
  • if S.c2 is not trusted
  • If S is complete, R x S
  • Otherwise, R x (S union null, )

25
Example Why is UC Santa Cruz (UCSC) Not Answer?
  • Assume that we trust
  • Jobs(school_name, school_state)?
  • Ranking(school_name, rank)?
  • Completeness of Jobs
  • Ranking.school_name is unique

26
Computing Provenance of UCSC
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
Trusted
Specifyingnon-answer
Hypotheticalupdate
SELECT J.school_name, J.job_opening - yes,
R.school_name - J.school_name, R.rank
- X FROM Jobs AS J LEFT OUTER JOIN
Ranking AS RWHERE J.school_state caAND R.rank
lt 25AND J.school_name R.school_nameAND
Jobs.school_name ucsc
27
Provenance of UCSC
UCSC is a potential answer.
Why not an answer? Because no ranking for ucsc.
How to become an answer? a new ranking tuple is
inserted (null - ucsc, null - X lt 25)
28
Dataset for Experiment
  • Extracted CS Ph.D. program ranks from the CRA web
    site (108 schools).
  • Extracted job openings from department web sites
    (108 schools).
  • Assumption Trust Jobs(school_name, school_state)
    and Jobs completeness.

29
Impact of Trust/Constraints on Provenance of UCSD
  • No trust/constraints on Ranking 109 provenance
    tuples
  • Trust Ranking.school 2 provenance tuples
  • Trust Ranking.school and Ranking.school is
    unique 1 provenance tuple

30
Impact of Trust/Constraints on Provenance
Scalability
  • Scale up the database by a factor of 100, compare
    the number of provenance of tuples of UCSD.
  • No trust/constraints on Ranking x 100.
  • Trust Ranking.school no change.

31
Conclusion
  • Proposed a mechanism for explaining a non-answer
    by using data, constraints, and query.
  • Showed that trust and constraints are critical
    for getting focused provenance.
  • Some opportunities for future work
  • Formal theory (e.g., in relational algebra)?
  • Provenance ranking, etc.

32
THANKS!
33
Original Context Condor Project
  • Distributed computing research project
  • Develops and maintains the Condor system
    software, and supports a production distributed
    computing facility at UW-Madison.

34
Data Management in Condor
Central Manager
Job and system state in local log files!
Collector
Negotiator
Execute Machine
Submit Machine
Scheduler
Executor
35
CondorDB to the Rescue
CondorDBDatabase
Central Manager
Collector
Negotiator
Execute Machine
Schedule Machine
Scheduler
Executor
Query
36
CondorDB Deployments
many more
37
Imprecise Data in CondorDB
Condornodes
Out of date,Inconsistent,Incorrect
Incorrect, Incomplete
unpredictable
uncontrollable
Database
Autonomous
38
Does this problem also occur in any other
application?
39
Example Why is UC Merced Not Answer?
  • Assume that we trust
  • Jobs(school_name, school_state)?
  • Ranking(school_name, rank)?
  • Completeness of Jobs and Ranking

40
Computing Provenance of UC Merced
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
SELECT J.school_name, J.job_opening - yes,
R.school_name, R.rankFROM Jobs AS J,
Ranking AS RWHERE J.school_state caAND R.rank
lt 25AND J.school_name R.school_nameAND
Jobs.school_name uc merced
Trusted
Specifyingnon-answer
Hypotheticalupdate
41
Provenance of UC Merced
UC Merced is not a potential answer. Why not a
potential answer?because the trusted data (rank)
does not satisfy the query.
42
Example Relaxing Trust for Provenance of UC
Merced
  • Assume that we trust
  • Jobs(school_name, school_state)?
  • Ranking(school_name)?
  • Completeness of Jobs and Ranking

43
Computing Provenance of UC Merced
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
Trusted
SELECT J.school_name, J.job_opening - yes,
R.school_name, R.rank - X lt25FROM
Jobs AS J, Ranking AS RWHERE J.school_state
caAND J.school_name R.school_nameAND
Jobs.school_name uc merced
Specifyingnon-answer
Hypotheticalupdate
44
Provenance of UC Merced
UC Merced is a potential answer.
Why not an answer? because job_opening no and
rank null.
How to become an answer? job_opening no - yes
and rank null - X lt 25.
Write a Comment
User Comments (0)
About PowerShow.com