Title: On Provenance of NonAnswers for Queries over Extracted Data
1On Provenance of Non-Answers for Queries over
Extracted Data
- Jiansheng Huang
- Ting Chen
- AnHai Doan
- Jeffrey F. Naughton
2Imprecise Data in Information Extraction (IE)?
Examples DBLife (WISC)Avatar (IBM)etc.
Source-1
Fuzzy
Database
Imprecise
Source-2
Imprecise
integrate
query
Incorrect Incomplete
extract
Source-n
Answers
3State of the Art
4Motivating Example
- Crawl web to extract information related to
academic job openings. - Store result of extraction in an RDBMS.
- Ask SQL queries and try to interpret results
5Extracted Jobs
CS Dept. web sites
6Extracted Ranking
CS Ranking Web Site
7Question Answering
What are the CS PhD programs in California (CA)
that have job openings and are in the top 25?
SELECT Jobs.school_nameFROM Jobs, Ranking
WHERE Ranking.rank lt 25
AND Jobs.school_state ca
AND Jobs.job_opening yes
AND Jobs.school_name Ranking.school_name
8Answer
9Non-Answers
- ucsd, uc merced, harvard, ucsc,
- But why?
- Data exists.
- No mechanism.
10Assumptions
- Relational data model
- Subset of SQL
- Selection (e.g., R.a 2)?
- Projection (e.g., return R.a)?
- Join (e.g., R.a S.b)?
- Conjunctive predicates (e.g., a 2 and b 3)?
- Satisfiable (e.g., no a 2 and a 3)?
11Provenance of Non-Answers
query
y
x
z is a potential answer, (x, y) and (x, y) are
the provenance of z.
(x - x, y) explains why z is not an answer and
how z can become an answer.
12Example
Jobs
Ranking
ucsd is a potential answer, The set of base
tuples is a provenance of ucsd.
13Another Example
Jobs
Ranking
14Trust and Constraints
Satisfy constraints
Dont consider updates
Untrusted
Trusted
query
y
x
15Example Using Trust
trust
Jobs
Ranking
16Factors DeterminingProvenance of Non-Answers
- Trusted Data
- Constraints
- Query specification
17Algorithm
- Start from a user query and a specific non-answer
- Add predicates derived from the non-answer
- Add constraint predicates
- Retain only predicates on trusted attributes
- For attributes of a potential tuple
- Determine equivalent constant value (e.g., a
2)? - If none, return a variable
- Evaluate the provenance query
18Example Why is UCSD Not Answer?
- Assume that we trust
- Jobs(school_name, school_state)?
- Ranking(school_name, rank)?
- Completeness of Jobs and Ranking
19Computing Provenance of UCSD
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
SELECT J.school_name, J.job_opening
, R.school_name, R.rankFROM Jobs AS J, Ranking
AS R
- yes
Trusted
Specifyingnon-answer
WHERE Jobs.school_name ucsd
AND J.school_state caAND R.rank lt 25AND
J.school_name R.school_name
Hypotheticalupdate
20Provenance of UCSD
UCSD is a potential answer.
Why not an answer? because job_opening no.
How to become an answer? job_opening no - yes.
21Provenance-Assisted Debugging
- While implementing our job extraction example,
actually used provenance of non-answers to find a
bug. - Specifically, noticed UCSD is not an answer to
find all dept. in top 25 with job openings - Informed by provenance, we checked UCSD web page
and found it does have a job opening. - What happened?
22Our Bug
- UCSD web page has a job opening
- Debugged extraction for UCSD instance
- Bug a line in source longer than the line buffer
for read - Fix increase line buffer size
- Re-extract and re-query produces UCSD as answer
23Deeper Issues
- New records can be inserted.
- Use an all-null tuple as a proxy in our
provenance report (not actually inserted). - The join expression for a provenance query
depends on the trust and constraints on the
joined tables.
24More on Join Expression
Given a join between R and S on R.c1 S.c2
andassuming R.c1 is trusted and R is complete,
the join expression for the provenance query is
- if S.c2 is trusted,
- if S is complete, R join S
- if S.c2 is unique, R x S
- Otherwise, (R join S) union (R x null, )
- if S.c2 is not trusted
- If S is complete, R x S
- Otherwise, R x (S union null, )
25Example Why is UC Santa Cruz (UCSC) Not Answer?
- Assume that we trust
- Jobs(school_name, school_state)?
- Ranking(school_name, rank)?
- Completeness of Jobs
- Ranking.school_name is unique
26Computing Provenance of UCSC
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
Trusted
Specifyingnon-answer
Hypotheticalupdate
SELECT J.school_name, J.job_opening - yes,
R.school_name - J.school_name, R.rank
- X FROM Jobs AS J LEFT OUTER JOIN
Ranking AS RWHERE J.school_state caAND R.rank
lt 25AND J.school_name R.school_nameAND
Jobs.school_name ucsc
27Provenance of UCSC
UCSC is a potential answer.
Why not an answer? Because no ranking for ucsc.
How to become an answer? a new ranking tuple is
inserted (null - ucsc, null - X lt 25)
28Dataset for Experiment
- Extracted CS Ph.D. program ranks from the CRA web
site (108 schools). - Extracted job openings from department web sites
(108 schools). - Assumption Trust Jobs(school_name, school_state)
and Jobs completeness.
29Impact of Trust/Constraints on Provenance of UCSD
- No trust/constraints on Ranking 109 provenance
tuples - Trust Ranking.school 2 provenance tuples
- Trust Ranking.school and Ranking.school is
unique 1 provenance tuple
30Impact of Trust/Constraints on Provenance
Scalability
- Scale up the database by a factor of 100, compare
the number of provenance of tuples of UCSD. - No trust/constraints on Ranking x 100.
- Trust Ranking.school no change.
31Conclusion
- Proposed a mechanism for explaining a non-answer
by using data, constraints, and query. - Showed that trust and constraints are critical
for getting focused provenance. - Some opportunities for future work
- Formal theory (e.g., in relational algebra)?
- Provenance ranking, etc.
32THANKS!
33Original Context Condor Project
- Distributed computing research project
- Develops and maintains the Condor system
software, and supports a production distributed
computing facility at UW-Madison.
34Data Management in Condor
Central Manager
Job and system state in local log files!
Collector
Negotiator
Execute Machine
Submit Machine
Scheduler
Executor
35CondorDB to the Rescue
CondorDBDatabase
Central Manager
Collector
Negotiator
Execute Machine
Schedule Machine
Scheduler
Executor
Query
36CondorDB Deployments
many more
37Imprecise Data in CondorDB
Condornodes
Out of date,Inconsistent,Incorrect
Incorrect, Incomplete
unpredictable
uncontrollable
Database
Autonomous
38Does this problem also occur in any other
application?
39Example Why is UC Merced Not Answer?
- Assume that we trust
- Jobs(school_name, school_state)?
- Ranking(school_name, rank)?
- Completeness of Jobs and Ranking
40Computing Provenance of UC Merced
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
SELECT J.school_name, J.job_opening - yes,
R.school_name, R.rankFROM Jobs AS J,
Ranking AS RWHERE J.school_state caAND R.rank
lt 25AND J.school_name R.school_nameAND
Jobs.school_name uc merced
Trusted
Specifyingnon-answer
Hypotheticalupdate
41Provenance of UC Merced
UC Merced is not a potential answer. Why not a
potential answer?because the trusted data (rank)
does not satisfy the query.
42Example Relaxing Trust for Provenance of UC
Merced
- Assume that we trust
- Jobs(school_name, school_state)?
- Ranking(school_name)?
- Completeness of Jobs and Ranking
43Computing Provenance of UC Merced
SELECT Jobs.school_nameFROM Jobs, RankingWHERE
Jobs.job_opening yesAND Jobs.school_state
caAND Ranking.rank lt 25AND Jobs.school_name
Ranking.school_name
Trusted
SELECT J.school_name, J.job_opening - yes,
R.school_name, R.rank - X lt25FROM
Jobs AS J, Ranking AS RWHERE J.school_state
caAND J.school_name R.school_nameAND
Jobs.school_name uc merced
Specifyingnon-answer
Hypotheticalupdate
44Provenance of UC Merced
UC Merced is a potential answer.
Why not an answer? because job_opening no and
rank null.
How to become an answer? job_opening no - yes
and rank null - X lt 25.