Improving Query Results using Answer Corroboration
(Presentation Transcript)
1
Improving Query Results using Answer
Corroboration
  • Amélie Marian
  • Rutgers University

2
Motivations
  • Queries on databases traditionally return exact
    answers
  • (set of) tuples that match the query exactly
  • Queries in information retrieval traditionally
    return the best documents containing the answer
  • (list of) documents within which users have to
    find the relevant information
  • Both query models are insufficient for today's
    information needs
  • New models have been studied and used: top-k
    queries, question answering (QA)

But these models consider answers individually
(except for some QA systems)
3
Data Corroboration
  • Data sources cannot be fully trusted
  • Low quality data (e.g., data integration,
    user-input data)
  • Web data (anybody can say anything on the web)
  • Non-exact query models
  • Top-k answers are requested
  • Repeated information lends more credence to the
    quality of the information
  • Aggregate similar information, and increase its
    score

4
Outline
  • Answer Corroboration for Data Cleaning
  • joint work with Yannis Kotidis and Divesh
    Srivastava
  • Motivations
  • Multiple Join Path Framework
  • Our Approach
  • Experimental Evaluation
  • Answer Corroboration for Web Search
  • Motivations
  • Our Approach
  • Query Interface

5
Motivating Example
[Schema diagram: four applications (Sales, Inventory, Ordering,
Provisioning) with overlapping fields such as TN, BAN, PON, ORN,
CircuitID, and CustName.
Legend: TN = Telephone Number, ORN = Order Number, BAN = Billing
Account Number, PON = Provisioning Order Number, SubPON = Related
PON]
What is the Circuit ID associated with a Telephone Number that
appears in SALES?
6
Motivations
  • Data applications with overlapping features
  • Data integration
  • Web sources
  • Data quality issues (duplicate, null, default
    values, data inconsistencies)
  • Data-entry problems
  • Data integration problems

7
Contributions
  • Multiple Join Path (MJP) framework
  • Quantifies answer quality
  • Takes corroborating evidence into account
  • Agglomerative scoring of answers
  • Answer computation techniques
  • Designed for MJP scoring methodologies
  • Several output options (top-k, top-few)
  • Experimental evaluation on real data
  • VIP integration platform
  • Quality of answers
  • Efficiency of our techniques

8
Multiple Join Path Framework: Problem Definition
  • Queries of the form
  • Given X = a, find the value of Y
  • Examples
  • Given a telephone number of a customer, find the
    ID of the circuit to which the telephone line is
    attached.
  • One answer expected
  • Given a circuit ID, find the name of customers
    whose telephones are attached to the circuit ID.
  • Possibly several answers

9
Schema Graph
  • Directed acyclic graph
  • Nodes are field names
  • Intra-application edge
  • Links fields in the same application
  • Inter-application edge
  • Links fields across applications

All (non-source, non-sink) nodes in the schema graph
are (possibly approximate) primary or foreign keys
of their applications
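As a rough illustration of this structure, here is a minimal sketch in Python; the class name, field types, and edge-kind labels are assumptions for the example, not the talk's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SchemaGraph:
    """Directed acyclic graph over field names. Edge labels record
    whether a join links fields within one application ("intra")
    or across applications ("inter"). Illustrative only."""
    nodes: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # (src, dst) -> kind

    def add_edge(self, src, dst, kind):
        assert kind in ("intra", "inter")
        self.nodes |= {src, dst}
        self.edges[(src, dst)] = kind

g = SchemaGraph()
g.add_edge("SALES.TN", "SALES.CircuitID", "intra")  # same application
g.add_edge("SALES.TN", "ORDERING.TN", "inter")      # across applications
```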
10
Data Graph
  • Given a specific value of the source node X, what
    are the values of the sink node Y?
  • Considers all join paths from X to Y in the
    schema graph

[Data graph: join paths instantiated from the source value X; one
path terminates early because there is no corresponding SALES.BAN.
Example: two paths lead to answer c1]
11
Scoring Answers
  • Which are the correct values?
  • Unclean data
  • No a priori knowledge
  • Technique to score data edges
  • What is the probability that the field
    association represented by the edge is correct?
  • Probabilistic interpretation of data edge scores
    to score full join paths
  • Edge score aggregation
  • Independent of the length of the path

12
Scoring Data Edges
  • Rely on functional dependencies (we are
    considering fields that are keys)
  • Data edge scores model the error in the data
  • Intra-application edge scores are derived from
    the data (see below)
  • Inter-application edge scores equal 1, unless
    approximate matching is used

For fields A and B within the same application: querying the
application with value a instantiates values b_i of B; the A -> B
edge scores model the probability that each association (a, b_i)
is correct (and symmetrically for B -> A)
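The slide's exact formula did not survive extraction; as a stand-in, here is a minimal relative-frequency sketch in Python. The function name and the uniform weighting are assumptions, not the talk's scoring rule:

```python
from collections import Counter

def edge_scores(instantiated_values):
    """Score the A -> B data edges for one source value a.

    `instantiated_values` lists the B values (the b_i) returned when
    the application is probed with a; repeated values get
    proportionally more weight. This relative-frequency model is an
    illustrative assumption, not the talk's exact formula.
    """
    counts = Counter(instantiated_values)
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}

# Probing with a telephone number returns three circuit IDs,
# two of which agree:
print(edge_scores(["c1", "c1", "c2"]))  # {'c1': 0.667, 'c2': 0.333} (approx.)
```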
13
Scoring Data Paths
  • A single data path is scored using a sequential
    composition of the probabilities of its data
    edges
  • Data paths leading to the same answer are scored
    using parallel composition

Independence assumption
Example: a path X -> a -> b -> Y with edge scores 0.5, 0.8, and
0.6 has pathScore = 0.5 × 0.8 × 0.6 = 0.24
A second path through c with edge scores 0.4 and 0.5 has
pathScore = 0.4 × 0.5 = 0.2
Parallel composition of the two paths:
pathScore = 0.24 + 0.2 - (0.24 × 0.2) = 0.392
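A minimal sketch of both compositions under the independence assumption; the function names are mine, but the arithmetic matches the example above:

```python
from functools import reduce

def sequential(edge_scores):
    """Score of a single path: the product of its edge probabilities."""
    return reduce(lambda p, q: p * q, edge_scores, 1.0)

def parallel(path_scores):
    """Combine independent paths to the same answer:
    P(at least one path is correct) = 1 - prod(1 - p_i),
    which for two paths equals p1 + p2 - p1*p2."""
    miss = 1.0
    for p in path_scores:
        miss *= 1.0 - p
    return 1.0 - miss

p1 = sequential([0.5, 0.8, 0.6])   # 0.24
p2 = sequential([0.4, 0.5])        # 0.20
print(parallel([p1, p2]))          # 0.392 (up to float rounding)
```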
14
Identifying Answers
  • Only interested in best answers
  • Standard top-k techniques do not apply
  • Answer scores can always be increased by new
    information
  • We keep score range information
  • Return top answers as soon as they are
    identified, possibly without complete scores
    (similar to NRA by Fagin et al.)
  • Two return strategies
  • Top-k
  • Top-few (weaker stop condition)
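A hedged sketch of the bookkeeping this implies: each candidate carries a (lower, upper) score range, and the top answer can be returned once its lower bound dominates every other candidate's upper bound. The interval representation and stop test are illustrative, not the talk's exact algorithm:

```python
def top1_identified(ranges):
    """`ranges` maps answer -> (lower, upper) score bounds.
    The top-1 answer can safely be returned once its lower bound
    is at least the upper bound of every other candidate.
    Illustrative NRA-style test, not the talk's exact algorithm."""
    best, (lo, _) = max(ranges.items(), key=lambda kv: kv[1][0])
    if all(a == best or lo >= hi for a, (_, hi) in ranges.items()):
        return best
    return None

# c1's score can only grow from 0.39; c2 can reach at most 0.35,
# so c1 is already the safe top-1 answer:
print(top1_identified({"c1": (0.39, 0.80), "c2": (0.10, 0.35)}))  # c1
```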

15
Computing Answers
  • Take advantage of early pruning
  • Only interested in best answers
  • Incremental data graph computation
  • Probes to each application
  • Cost model is number of probes
  • Standard graph searching techniques (DFS, BFS) do
    not take advantage of score information
  • We propose a technique based on the notion of
    maximum benefit

16
Maximum Benefit
  • Benefit computation of a path uses two components
  • Known scores of the explored data edges
  • Best way to augment an answer's score
  • Uses residual benefit of unexplored schema edges
  • Our strategy makes choices that aim at
    maximizing this benefit metric (sketched below)
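A rough best-first sketch of such a strategy: frontier nodes are kept in a priority queue keyed on (known prefix score) × (residual benefit bound). All names and the benefit formula are assumptions for illustration, not the talk's exact algorithm:

```python
import heapq

def max_benefit_search(start, residual, expand, answers):
    """Best-first data-graph exploration driven by maximum benefit.

    `residual(node)` upper-bounds the score an answer reached via
    `node` could still attain (from unexplored schema edges);
    `expand(node)` probes an application and returns
    (edge_score, next_node) pairs.  Illustrative assumptions only.
    """
    heap = [(-residual(start), 1.0, start)]  # max-heap via negation
    found = {}
    while heap:
        _, prefix, node = heapq.heappop(heap)
        if node in answers:                  # sink node: record score
            found[node] = max(found.get(node, 0.0), prefix)
            continue
        for edge_score, nxt in expand(node):
            p = prefix * edge_score
            heapq.heappush(heap, (-(p * residual(nxt)), p, nxt))
    return found

# Two-hop toy graph: X -> a -> Y with edge scores 0.5 and 0.8.
graph = {"X": [(0.5, "a")], "a": [(0.8, "Y")]}
print(max_benefit_search("X", lambda n: 1.0,
                         lambda n: graph.get(n, []),
                         answers={"Y"}))    # {'Y': 0.4}
```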

17
VIP Experimental Platform
  • Integration platform developed at AT&T
  • 30 legacy systems
  • Real data
  • Developed as a platform for resolving disputes
    between applications caused by data
    inconsistencies
  • Front-end web interface

18
VIP Queries
  • Random sample of 150 user queries
  • Analysis shows that queries can be classified
    according to the number of answers they retrieve
  • noAnswer (nA): 56 queries
  • anyAnswer (aA): 94 queries
  • oneLarge (oL): 47 queries
  • manyLarge (mL): 4 queries
  • manySmall (mS): 8 queries
  • heavyHitters (hH): 10 queries that returned
    between 128 and 257 answers per query

19
VIP Schema Graph
[Chart: paths leading to an answer vs. paths leading to the top-1
answer, over the 94 anyAnswer queries]
Not considering all paths may lead to missing top-1 answers
20
Number of Parallel Paths Contributing to the
Top-1 Answer
On average, 10 parallel paths contribute to each top-1 answer, of
which about 2.5 are significant
21
Cost of Execution
22
Related Work (Data Cleaning)
  • Keyword Search in DBMS (BANKS, DBXPlorer,
    DISCOVER, ObjectRank)
  • Query is set of keywords
  • Top-k query model
  • DB as data graph
  • Do not agglomerate scores
  • Top-k query evaluation (TA, MPro, Upper)
  • Consider tuples as an entity
  • Wait for exact answers (except for NRA)
  • Do not agglomerate scores
  • Probabilistic ranking of DB results
  • Queries not selective, large answer set

We take corroborative evidence into account to
rank query results
23
Contributions
  • Multiple Join Path Framework
  • Uses corroborating evidence to identify high
    quality results
  • Looks at all paths in the schema graph
  • Scoring mechanism
  • Probabilistic interpretation
  • Takes schema information into account
  • Techniques to compute answers
  • Take into account agglomerative scoring
  • Top-k and top-few

24
Outline
  • Answer Corroboration for Data Cleaning
  • Motivations
  • Multiple Join Path Framework
  • Our Approach
  • Experimental Evaluation
  • Answer Corroboration for Web Search
  • Motivations
  • Our Approach
  • Challenges

25
Motivations
  • Information on web sources is unreliable
  • Erroneous
  • Misleading
  • Biased
  • Outdated
  • Users check many web sites to confirm the
    information
  • Data corroboration
  • Can we do that automatically to save time?

26
Example: What is the gas mileage of my Honda Civic?
  • Query "honda civic 2005 gas mileage" on MSN
    Search
  • Is the top hit, the carhybrids.com site,
    trustworthy?
  • Is the Honda web site unbiased?
  • Are all these values referring to the correct
    model year?

Users may check several web sites to get an answer
27
Example: Aggregating Results using Data Corroboration
  • Combines similar values (sketched below)
  • Uses the frequency of the answer as the ranking
    measure
  • (out of the first 10 pages, one had no answer)
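A minimal sketch of that aggregation step for numerical answers: extracted values are bucketed so that near-identical numbers corroborate each other, then groups are ranked by frequency. The tolerance-based bucketing and the sample numbers are illustrative assumptions:

```python
def corroborate(values, tol=0.05):
    """Group near-identical numerical answers (within `tol` relative
    difference of a group's representative) and rank the groups by
    how many pages report them. The bucketing rule is an
    illustrative assumption, not the talk's method."""
    groups = []  # list of (representative, count)
    for v in values:
        for i, (rep, n) in enumerate(groups):
            if abs(v - rep) <= tol * rep:
                # Fold v into the group: update the running mean.
                groups[i] = ((rep * n + v) / (n + 1), n + 1)
                break
        else:
            groups.append((v, 1))
    return sorted(groups, key=lambda g: -g[1])

# Hypothetical mileage values extracted from the nine result pages
# that contained an answer:
print(corroborate([36, 35, 36, 48, 36, 35, 29, 36, 35]))
# Most frequent group (about 35-36 mpg, 7 pages) ranks first.
```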

28
Challenges
  • Designing a meaningful ranking function (see the
    sketch after this list)
  • Frequency of the answer in the result set
  • Importance of the web pages containing the answer
  • As measured by the search engine (e.g., PageRank)
  • Importance of the answer within the page
  • Use of formatting information within the page
  • Proximity of the answer to the query terms
  • Multiple answers per page
  • Similarity of the page with other pages
  • Dampening factor
  • Reduce the impact of copy-paste sites
  • Reduce the impact of pages from the same domain
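One plausible (assumed) way to combine these signals, in the spirit of the "simple aggregation of individual parameter scores" mentioned on slide 30: a weighted sum per occurrence, dampened for copy-paste or same-domain pages. The weights and structure are illustrative, not the talk's function:

```python
def answer_score(occurrences, weights=(0.5, 0.3, 0.2), damp=0.5):
    """Aggregate one answer's evidence across pages.

    Each occurrence is (page_importance, in_page_importance,
    proximity, is_duplicate): the per-page score is a weighted sum
    of the first three signals, dampened for copy-paste/same-domain
    pages. Weights and structure are illustrative assumptions.
    """
    w1, w2, w3 = weights
    total = 0.0
    for page_imp, in_page, prox, dup in occurrences:
        s = w1 * page_imp + w2 * in_page + w3 * prox
        total += damp * s if dup else s
    return total

# Two occurrences of the same answer; the second is a near-duplicate
# page, so it contributes only half its score:
print(answer_score([(0.9, 0.7, 0.8, False), (0.9, 0.7, 0.8, True)]))
```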

29
Challenges (cont.)
  • Selecting the result set (web pages)
  • How deep in the search engine results are we
    going?
  • Low-ranked pages will not contribute much to the
    score: use top-k pruning techniques
  • Extracting information from the web page
  • Use existing Information Extraction (IE) and
    Question Answering (QA) techniques

30
Current work
  • Focus on numerical queries
  • Analysis of MSN queries shows that they have a
    higher clickthrough rate than general queries
  • Answers are easier to identify in the text
  • Scoring function
  • Currently a simple aggregation of individual
    parameter scores
  • Working on a probabilistic approach
  • Number of pages accessed
  • Dynamic selection based on score information

31
Evaluation
  • A log of 15 million queries from MSN
  • Focus on
  • Queries with high clickthrough rate
  • Numerical value queries (for now)
  • Compare clickthrough with best-ranked sites to
    measure precision and recall
  • User studies

32
Interface
33
Related work
  • Web Search
  • Our interface is built on top of a standard
    search engine
  • Question Answering Systems (START, askMSR,
    MULDER)
  • Some have used frequency of answer to increase
    score (askMSR, MULDER)
  • We are considering more complex scoring
    mechanisms
  • Information Extraction (Snowball)
  • We can use existing techniques to identify
    information within a page
  • Our problem is much simpler than standard IE
  • Top-k queries (TA, Upper, MPro)
  • We need some pruning functionalities to stop
    retrieving web search results

34
Conclusions
  • Large amount of low-quality data
  • Users have to rummage through a lot of
    information
  • Data corroboration can improve the quality of
    query results
  • Has not been used much in practice
  • Makes sense in many applications
  • Standard ranking techniques have to be modified
    to handle corroborative scoring
  • Standard ranking scores each answer individually
  • Corroborative ranking combines answers
  • Pruning conditions in top-k queries do not work
    on corroborative answers