Title: Circumventing Data Quality Problems Using Multiple Join Paths
1Circumventing Data Quality Problems Using
Multiple Join Paths
- Yannis Kotidis, Athens University of Economics
and Business - Amélie Marian, Rutgers University
- Divesh Srivastava, ATT Labs-Research
2Motivating Example
Sales
Inventory
CircuitID
TN
BAN
TN
CustName
CircuitID
TN
PON
PON
TN
Ordering
CustName
BAN
ORN
TN
TN
ORN
TN Telephone Number ORN Order Number BAN
Billing Accoung Number PON Provisoning Order
Number SubPON Related PON
Provisioning
CustName
PON
SubPON
CustName
What is the Circuit ID associated with a
Telephone Number that appears in SALES?
3Motivations
- Data applications with overlapping features
- Data integration
- Web sources
- Data quality issues (duplicate, null, default
values, data inconsistencies) - Data-entry problems
- Data integration problems
4Contributions
- Multiple Join Path (MJP) framework
- Quantifies answer quality
- Takes corroborating evidence into account
- Agglomerative scoring of answers
- Answer computation techniques
- Designed for MJP scoring methodologies
- Several output options (top-k, top-few)
- Experimental evaluation on real data
- VIP integration platform
- Quality of answers
- Efficiency of our techniques
5Outline
- Multiple Join Path Framework
- Problem Definition
- Our Approach
- Scoring Answers
- Computing Answers
- Experimental Evaluation
- Related Work
6Multiple Join Path Framework Problem Definition
- Query of the form
- Given Xa find the value of Y
- Examples
- Given a telephone number of a customer, find the
ID of the circuit to which the telephone line is
attached. - One answer expected
- Given a circuit ID, find the name of customers
whose telephones are attached to the circuit ID. - Possibly several answers
7Schema Graph
- Directed acyclic graph
- Nodes are field names
- Intra-application edge
- Links fields in the same application
- Inter-application edge
- Links fields across applications
All (non-source, non-sink) nodes in schema graph
are (possibly approximate) primary or foreign
keys of their applications
8Data Graph
- Given a specific value of the source node X what
are values of the sink node Y? - Considers all join paths from X to Y in the
schema graph
X (no corresponding SALES.BAN)
X
X
Example two paths lead to answer c1
9Scoring Answers
- Which are the correct values?
- Unclean data
- No a priori knowledge
- Technique to score data edges
- What is the probability that the fields
associated by the edge is correct - Probabilistic interpretation of data edge scores
to score full join paths - Edge score aggregation
- Independent on the length of the path
10Scoring Data Edges
- Rely on functional dependencies (we are
considering fields that are keys) - Data edge scores model the error in the data
- Intra-application edge
- Inter-application edge equals 1, unless
approximate matching
Fields A and B within the same application
A
B
(and symetrically for B - A)
Where bi are the values instantiated from
querying the application with value a
A
B
B
A
and
11Scoring Data Paths
- A single data path is scored using a simple
sequential composition of its data edges
probabilities - Data paths leading to the same answer are scored
using parallel composition
Independence Assumption
X
a
b
Y
0.5
0.8
0.6
pathScore0.50.80.60.24
c
0.4
0.5
X
a
b
Y
0.5
0.8
0.6
pathScore0.240.2-(0.240.2) pathScore0.392
12Identifying Answers
- Only interested in best answers
- Standard top-k techniques do not apply
- Answer scores can always be increased by new
information - We keep score range information
- Return top answers when identified, may not have
complete scores - Two return strategies
- Top-k
- Top-few (weaker stop condition)
13Computing Answers
- Take advantage of early pruning
- Only interested in best answers
- Incremental data graph computation
- Probes to each applications
- Cost model is number of probes
- Standard graph searching techniques (DFS, BFS) do
not take advantage of score information - We propose a technique based on the notion of
maximum benefit
14Maximum Benefit
- Benefit computation of a path uses two components
- Known scores of the explored data edges
- Best way to augment an answers scores
- Uses residual benefit of unexplored schema edges
- Our strategy makes choices that aim at maximizing
this benefit metric
15VIP Experimental Platform
- Integration platform developed at ATT
- 30 legacy systems
- Real data
- Developed as a platform for resolving disputes
between applications that are due to data
inconsistencies - Front-end web interface
16VIP Queries
- Random sample of 150 user queries.
- Analysis shows that queries can be classified
according to the number of answers they retrieve - noAnswer(nA) 56 queries
- anyAnswer(aA) 94 queries
- oneLarge(oL) 47 queries
- manyLarge(mL) 4 queries
- manySmall(mS) 8 queries
- heavyHitters(hH) 10 queries that returned
between 128 and 257 answers per query
17VIP Schema Graph
Paths leading to an answer /paths leading to
top-1 answer (94 queries)
Not considering all paths may lead to missing
top-1 answers
18Number of Parallel Paths Contributing to the
Top-1 Answer
Average of 10 parallel paths per answer, 2.5
significant
19Cost of Execution
20Related Work
- Keyword Search in DBMS (BANKS, DBXPlorer,
DISCOVER, ObjectRank) - Query is set of keywords
- Top-k query model
- DB as data graph
- Do not agglomerate scores
- Top-k query evaluation (TA, MPro, Upper)
- Consider tuples as an entity
- Wait for exact answer (Except for NRA)
- Do not agglomerate scores
- Probabilistic ranking of DB results
- Queries not selective, large answer set
We take corroborative evidence into account to
rank query results
21Conclusion
- Multiple Join Path Framework
- Uses corroborating evidence to identify high
quality results - Looks at all paths in the schema graph
- Scoring mechanism
- Probabilistic interpretation
- Takes schema information into account
- Techniques to compute answers
- Take into account agglomerative scoring
- Top-k and top-few