Improving Query Results using Answer Corroboration presentation

1
Improving Query Results using Answer
Corroboration

Amélie Marian
Rutgers University

2
Motivations

Queries on databases traditionally return exact
answers
(set of) tuples that match the query exactly
Queries in information retrieval traditionally
return the best documents containing the answer
(list of) documents from which users have to find
the relevant information
Both query models are insufficient for today's
information needs
New models have been used and studied: top-k
queries, question answering (QA)

But these models consider answers individually
(except for some QA systems)
3
Data Corroboration

Data sources cannot be fully trusted
Low quality data (e.g., data integration,
user-input data)
Web data (anybody can say anything on the web)
Non exact query models
Top-k answers are requested
Repeated information lends more credence to the
quality of the information
Aggregate similar information, and increase its
score

4
Outline

Answer Corroboration for Data Cleaning
joint work with Yannis Kotidis and Divesh
Srivastava
Motivations
Multiple Join Path Framework
Our Approach
Experimental Evaluation
Answer Corroboration for Web Search
Motivations
Our Approach
Query Interface

5
Motivating Example
[Figure: schema of four applications (Sales, Inventory, Ordering, Provisioning) linked through the fields TN, BAN, ORN, PON, SubPON, CircuitID, and CustName]
TN: Telephone Number; ORN: Order Number; BAN: Billing Account Number; PON: Provisioning Order Number; SubPON: Related PON
What is the Circuit ID associated with a
Telephone Number that appears in SALES?
6
Motivations

Data applications with overlapping features
Data integration
Web sources
Data quality issues (duplicate, null, default
values, data inconsistencies)
Data-entry problems
Data integration problems

7
Contributions

Multiple Join Path (MJP) framework
Quantifies answer quality
Takes corroborating evidence into account
Agglomerative scoring of answers
Answer computation techniques
Designed for MJP scoring methodologies
Several output options (top-k, top-few)
Experimental evaluation on real data
VIP integration platform
Quality of answers
Efficiency of our techniques

8
Multiple Join Path Framework: Problem Definition

Query of the form
Given X = a, find the value of Y
Examples
Given a telephone number of a customer, find the
ID of the circuit to which the telephone line is
attached.
One answer expected
Given a circuit ID, find the name of customers
whose telephones are attached to the circuit ID.
Possibly several answers

9
Schema Graph

Directed acyclic graph
Nodes are field names
Intra-application edge
Links fields in the same application
Inter-application edge
Links fields across applications

All (non-source, non-sink) nodes in schema graph
are (possibly approximate) primary or foreign
keys of their applications
10
Data Graph

Given a specific value of the source node X what
are values of the sink node Y?
Considers all join paths from X to Y in the
schema graph

[Figure: data graph for one value of X; paths that dead-end are marked X, e.g. no corresponding SALES.BAN]
Example: two paths lead to answer c1
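
As a rough illustration of "considers all join paths from X to Y in the schema graph", the sketch below enumerates every path between two fields of a small directed acyclic graph. The field names and edges are hypothetical, loosely inspired by the motivating example; they are not the actual VIP schema.

```python
from typing import Dict, List

# Hypothetical schema graph: each field maps to the fields it links to.
# These edges are illustrative assumptions, not the real VIP schema.
SCHEMA_GRAPH: Dict[str, List[str]] = {
    "SALES.TN": ["SALES.BAN", "INVENTORY.TN", "ORDERING.TN"],
    "SALES.BAN": ["ORDERING.BAN"],
    "ORDERING.BAN": ["ORDERING.TN"],
    "ORDERING.TN": ["INVENTORY.TN"],
    "INVENTORY.TN": ["INVENTORY.CircuitID"],
    "INVENTORY.CircuitID": [],
}


def all_join_paths(graph: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Return every path from source to sink in a directed acyclic graph."""
    if source == sink:
        return [[sink]]
    paths = []
    for nxt in graph.get(source, []):
        # Extend each path found from the neighbor with the current node.
        for tail in all_join_paths(graph, nxt, sink):
            paths.append([source] + tail)
    return paths


paths = all_join_paths(SCHEMA_GRAPH, "SALES.TN", "INVENTORY.CircuitID")
for p in paths:
    print(" -> ".join(p))
```

Because the schema graph is acyclic, the recursion terminates without a visited set; a graph with cycles would need one.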
11
Scoring Answers

Which are the correct values?
Unclean data
No a priori knowledge
Technique to score data edges
What is the probability that the fields
associated by the edge are correct?
Probabilistic interpretation of data edge scores
to score full join paths
Edge score aggregation
Independent of the length of the path

12
Scoring Data Edges

Rely on functional dependencies (we are
considering fields that are keys)
Data edge scores model the error in the data
Intra-application edge
Inter-application edge equals 1, unless
approximate matching

Fields A and B within the same application:
[Equation not preserved in the transcript: score of the data edge A → B]
(and symmetrically for B → A)
where b_i are the values instantiated from
querying the application with value a
13
Scoring Data Paths

A single data path is scored using a simple
sequential composition of its data edge
probabilities
Data paths leading to the same answer are scored
using parallel composition

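Independence Assumption
Sequential composition: path X → a → b → Y with edge scores 0.5, 0.8, 0.6
pathScore = 0.5 × 0.8 × 0.6 = 0.24
Parallel composition: the path above (score 0.24) and a second path through c with edge scores 0.4 and 0.5 (score 0.2)
pathScore = 0.24 + 0.2 − (0.24 × 0.2) = 0.392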
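
The sequential and parallel composition rules on this slide can be sketched in a few lines. The function names are hypothetical. The parallel rule is written as 1 − ∏(1 − p_i), which equals p1 + p2 − p1·p2 for two paths; extending it to more than two paths is an assumption that follows from the stated independence assumption.

```python
from typing import Iterable


def sequential_score(edge_scores: Iterable[float]) -> float:
    """Score of a single path: product of its data edge probabilities."""
    score = 1.0
    for s in edge_scores:
        score *= s
    return score


def parallel_score(path_scores: Iterable[float]) -> float:
    """Score of an answer reached by several independent paths:
    one minus the probability that every path is wrong."""
    all_wrong = 1.0
    for p in path_scores:
        all_wrong *= (1.0 - p)
    return 1.0 - all_wrong


p1 = sequential_score([0.5, 0.8, 0.6])   # path through a and b
p2 = sequential_score([0.4, 0.5])        # path through c
combined = parallel_score([p1, p2])
print(round(p1, 3), round(p2, 3), round(combined, 3))
```

With the edge scores from the slide, this gives 0.24 for the first path, 0.2 for the second, and 0.392 for the corroborated answer.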
14
Identifying Answers

Only interested in best answers
Standard top-k techniques do not apply
Answer scores can always be increased by new
information
We keep score range information
Return top answers when identified, may not have
complete scores (similar to NRA by Fagin et al.)
Two return strategies
Top-k
Top-few (weaker stop condition)
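
One way to picture "return top answers when identified" with score ranges is an NRA-style check: each candidate keeps a lower bound (score from the paths seen so far) and an upper bound (score if all remaining paths contributed fully). The sketch below is an assumption about how such a check could look, not the stop condition used in the talk; the function name and the example ranges are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[float, float]  # (lower bound, upper bound) on an answer's score


def top_k_if_identified(ranges: Dict[str, Range], k: int,
                        unseen_upper: float = 0.0) -> Optional[List[str]]:
    """Return the top-k answers if the ranges already separate them from
    every other candidate (and from unseen answers); otherwise None."""
    if len(ranges) < k:
        return None
    ranked = sorted(ranges, key=lambda a: ranges[a][0], reverse=True)
    top, rest = ranked[:k], ranked[k:]
    kth_lower = ranges[top[-1]][0]
    # Best score any non-selected (or not yet seen) answer could still reach.
    best_other = max([ranges[a][1] for a in rest] + [unseen_upper])
    return top if kth_lower >= best_other else None


early = {"c1": (0.39, 0.55), "c2": (0.20, 0.30), "c3": (0.05, 0.45)}
later = {"c1": (0.39, 0.55), "c2": (0.20, 0.30), "c3": (0.05, 0.25)}
print(top_k_if_identified(early, k=1))  # c3 could still overtake c1
print(top_k_if_identified(later, k=1))  # c1 is safely the top-1 answer
```

Note that the answer is returned without knowing its final score, only that no other candidate can beat it.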

15
Computing Answers

Take advantage of early pruning
Only interested in best answers
Incremental data graph computation
Probes to each application
Cost model is number of probes
Standard graph searching techniques (DFS, BFS) do
not take advantage of score information
We propose a technique based on the notion of
maximum benefit

16
Maximum Benefit

Benefit computation of a path uses two components
Known scores of the explored data edges
Best way to augment an answer's score
Uses residual benefit of unexplored schema edges
Our strategy makes choices that aim at maximizing
this benefit metric
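
The slides do not give the benefit formula, so the following is only a sketch under an explicit assumption: the benefit of a partially explored path is the product of its known data edge scores times an optimistic bound on what the unexplored schema edges could still contribute. All names and numbers are hypothetical.

```python
from math import prod
from typing import List, NamedTuple


class PartialPath(NamedTuple):
    label: str                 # hypothetical description of the next probe
    known_scores: List[float]  # scores of the data edges explored so far
    residual_bound: float      # optimistic bound for unexplored schema edges


def benefit(path: PartialPath) -> float:
    """Assumed benefit: known path score times the optimistic residual."""
    return prod(path.known_scores) * path.residual_bound


def pick_next_probe(frontier: List[PartialPath]) -> PartialPath:
    """Greedy choice: expand the partial path with the largest benefit."""
    return max(frontier, key=benefit)


frontier = [
    PartialPath("probe ORDERING with BAN", [0.5], 1.0),
    PartialPath("probe INVENTORY with TN", [0.9, 0.8], 0.6),
    PartialPath("probe PROVISIONING with PON", [0.3], 1.0),
]
choice = pick_next_probe(frontier)
print(choice.label, round(benefit(choice), 3))
```

Unlike plain DFS or BFS, this ordering uses score information to decide which application to probe next, which matters when the cost model is the number of probes.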

17
VIP Experimental Platform

Integration platform developed at AT&T
30 legacy systems
Real data
Developed as a platform for resolving disputes
between applications that are due to data
inconsistencies
Front-end web interface

18
VIP Queries

Random sample of 150 user queries.
Analysis shows that queries can be classified
according to the number of answers they retrieve
noAnswer (nA): 56 queries
anyAnswer (aA): 94 queries
oneLarge (oL): 47 queries
manyLarge (mL): 4 queries
manySmall (mS): 8 queries
heavyHitters (hH): 10 queries that returned
between 128 and 257 answers per query

19
VIP Schema Graph
Paths leading to an answer /paths leading to
top-1 answer (94 queries)
Not considering all paths may lead to missing
top-1 answers
20
Number of Parallel Paths Contributing to the
Top-1 Answer
Average of 10 parallel paths per answer, 2.5
significant
21
Cost of Execution
22
Related Work (Data Cleaning)

Keyword Search in DBMS (BANKS, DBXPlorer,
DISCOVER, ObjectRank)
Query is set of keywords
Top-k query model
DB as data graph
Do not agglomerate scores
Top-k query evaluation (TA, MPro, Upper)
Consider tuples as an entity
Wait for exact answer (Except for NRA)
Do not agglomerate scores
Probabilistic ranking of DB results
Queries not selective, large answer set

We take corroborative evidence into account to
rank query results
23
Contributions

Multiple Join Path Framework
Uses corroborating evidence to identify high
quality results
Looks at all paths in the schema graph
Scoring mechanism
Probabilistic interpretation
Takes schema information into account
Techniques to compute answers
Take into account agglomerative scoring
Top-k and top-few

24
Outline

Answer Corroboration for Data Cleaning
Motivations
Multiple Join Path Framework
Our Approach
Experimental Evaluation
Answer Corroboration for Web Search
Motivations
Our Approach
Challenges

25
Motivations

Information on web sources is unreliable
Erroneous
Misleading
Biased
Outdated
Users check many web sites to confirm the
information
Data corroboration
Can we do that automatically to save time?

26
Example: What is the gas mileage of my Honda Civic?

Query "honda civic 2005 gas mileage" on MSN
Search
Is the top hit, the carhybrids.com site,
trustworthy?
Is the Honda web site unbiased?
Are all these values referring to the correct year
of the model?

Users may check several web sites to get an answer
27
Example: Aggregating Results using Data
Corroboration

Combines similar values
Use frequency of the answer as the ranking
measure
(out of the first 10 pages one page had no
answer)
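
A minimal sketch of frequency-based corroboration, assuming one numeric value has already been extracted per result page (None when a page has no answer). The values below are made up for illustration, and treating values as "similar" when they round to the same integer is an assumption.

```python
from collections import Counter
from typing import List, Optional, Tuple


def rank_by_frequency(values: List[Optional[float]]) -> List[Tuple[int, int]]:
    """Group similar values (here: same rounded integer) and rank the
    groups by how many pages report them."""
    counts = Counter(round(v) for v in values if v is not None)
    return counts.most_common()


# Hypothetical values extracted from the first 10 result pages;
# one page had no answer.
extracted = [30, 30.2, 40, 30, None, 36, 30, 40, 29.8, 36]
ranked = rank_by_frequency(extracted)
print(ranked)
```

Here the value 30 is reported by five of the nine pages that contain an answer, so it ranks first.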

28
Challenges

Designing a meaningful ranking function
Frequency of the answer in the result set
Importance of the web pages containing the answer
As measured by the search engine (e.g. PageRank)
Importance of the answer within the page
Use of formatting information within the page
Proximity of the answer to query terms
Multiple answers per page
Similarity of the page with other pages
Dampening factor
Reduce the impact of copy-paste sites
Reduce the impact of pages from same domain
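
To make the ingredients above concrete, here is one possible aggregation, offered as a sketch rather than the ranking function from the talk. It assumes page importance can be approximated by 1/rank in the search result and that each additional page from the same domain is dampened by a constant factor. The URLs, the damping value, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import List, Optional, Tuple
from urllib.parse import urlparse


def score_answers(hits: List[Tuple[str, Optional[int]]],
                  damping: float = 0.5) -> List[Tuple[int, float]]:
    """hits: (url, extracted answer) pairs in search-engine rank order."""
    seen_per_domain = defaultdict(int)
    scores = defaultdict(float)
    for rank, (url, answer) in enumerate(hits, start=1):
        domain = urlparse(url).netloc
        weight = 1.0 / rank                           # assumed page importance
        weight *= damping ** seen_per_domain[domain]  # same-domain dampening
        seen_per_domain[domain] += 1
        if answer is not None:
            scores[answer] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


hits = [
    ("http://a.example/civic", 30),
    ("http://a.example/specs", 30),   # same domain: contribution is halved
    ("http://b.example/mpg", 40),
    ("http://c.example/review", 30),
]
result = score_answers(hits)
print(result)
```

A similar factor could dampen near-duplicate (copy-paste) pages, given some page similarity measure.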

29
Challenges (cont.)

Selecting the result set (web pages)
How deep in the search engine result are we
going?
Low-ranked pages will not contribute much to the
score: use top-k pruning techniques
Extracting information from the web page
Use existing Information Extraction (IE) and
Question Answering (QA) techniques

30
Current work

Focus on numerical queries
Analysis of MSN queries shows that they have a
higher clickthrough rate than general queries
Answer easier to identify in the text
Scoring function
Currently a simple aggregation of individual
parameter scores
Working on a probabilistic approach
Number of pages accessed
Dynamic selection based on score information

31
Evaluation

15 million query logs from MSN
Focus on
Queries with high clickthrough rate
Numerical value queries (for now)
Compare clickthrough with best-ranked sites to
measure precision and recall
User studies

32
Interface
33
Related work

Web Search
Our interface is built on top of a standard
search engine
Question Answering Systems (START, askMSR,
MULDER)
Some have used frequency of answer to increase
score (askMSR, MULDER)
We are considering more complex scoring
mechanisms
Information Extraction (Snowball)
We can use existing techniques to identify
information within a page
Our problem is much simpler than standard IE
Top-k queries (TA, Upper, MPro)
We need some pruning functionalities to stop
retrieving web search results

34
Conclusions

Large amount of low-quality data
Users have to rummage through a lot of
information
Data corroboration can improve the quality of
query results
Has not been used much in practice
Makes sense in many applications
Standard ranking techniques have to be modified
to handle corroborative scoring
Standard ranking scores each answer individually
Corroborative ranking combines answers
Pruning conditions in top-k queries do not work
on corroborative answers
