Title: Tuning using Synthetic Workload
Schema Matching Systems
Modeling Schema Matching Systems
Tuning Schema Matching Systems
- Schema Matching
- Finding semantic matches between the schemas of disparate data sources
- Applications: data warehousing, scientific collaboration, e-commerce, bioinformatics, data integration on the WWW, ...
- Current Trends
- Manually finding matches is labor intensive
- Numerous automatic matching techniques have been developed
- Each technique has its own strengths and weaknesses
- Hence, most current matching systems adopt a multi-component strategy
- Each component employs a particular matching technique
- Highly extensible and customizable
- Examples: LSD, COMA, GLUE, Embley02, SimFlood, iMAP, ProtoPlasm, ...
Matching tool M = (L, G, k)
- L: Library of matching components (e.g. matchers, combiners, filters, etc.)
- k: Collection of control variables (i.e. knobs)
Given a particular matching situation, how do we select the right matching components to execute, and how do we adjust the multiple knobs of those components?
- Tuning is necessary to get high matching accuracy
- Crucial in many applications: automatic data exchange, data integration, peer-to-peer systems, ...
- Tuning is extremely difficult
- Huge space of knobs
- Wide variety of matching techniques
- Complex interactions among the components
- No reasonable guideline for tuning
Example: LSD (L, G, k)
Developing efficient techniques for tuning is now crucial!
Generating Synthetic Workload
Formalization of Tuning Problem
The eTUNER Architecture
- Generate synthetic workload
- Tune a matching system M using the synthetic workload and tuning procedures stored in the repository
- Exploit user assistance to generate an even higher quality synthetic workload, if possible
Exploiting user assistance
- Grouping semantically equivalent attributes over S
- Adding domain-specific perturbation rules
- General tuning problem
- Given
- M: a schema matching tool
- Workload: a set of matching scenarios (S1,T1), (S2,T2), ..., (Sk,Tk)
- U: a utility function defined over the process of matching two schemas
- Find the knob configuration k maximizing the utility over the workload
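The general problem can be written compactly as follows. This is only a restatement under assumed notation: n is used for the number of scenarios (to avoid clashing with the knob configuration k), M_k denotes M run with knob configuration k, and K denotes the space of knob configurations. None of these symbols appear on the slides.

```latex
k^{*} \;=\; \arg\max_{k \in K} \; U\bigl(M_k,\ \{(S_i, T_i)\}_{i=1}^{n}\bigr)
```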
Figure: generating the synthetic workload from schema S
- Split S into V and U with disjoint data tuples
- Perturb V into V1, ..., Vn:
- Perturb number of tables
- Perturb number of columns in each table
- Perturb column and table names
- Our tuning problem
- Given
- M: a schema matching tool
- S: a source schema
- Workload: a set of matching scenarios (S,T1), (S,T2), ..., (S,Tk)
- (The Ti's are future schemas)
- U: matching accuracy
- Find the knob configuration k maximizing the average accuracy
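The objective above can be sketched in Python. This is a minimal illustration, not the eTuner implementation: the function names are hypothetical, the matcher is assumed to be any callable taking (S, T, knobs) and returning a set of attribute pairs, and accuracy is assumed here to be the F1 score of predicted against correct matches (the slides only say "matching accuracy").

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Match = Tuple[str, str]                 # (source attribute, target attribute)
Knobs = Dict[str, float]                # one knob configuration k
Scenario = Tuple[object, Set[Match]]    # (target schema T, correct matches for (S, T))
Matcher = Callable[[object, object, Knobs], Set[Match]]


def f1_accuracy(predicted: Set[Match], truth: Set[Match]) -> float:
    """Assumed accuracy measure: F1 of the predicted matches against the truth."""
    correct = len(predicted & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def average_accuracy(matcher: Matcher, source: object,
                     workload: List[Scenario], knobs: Knobs) -> float:
    """Average accuracy of the matcher over all scenarios (S, Ti)."""
    scores = [f1_accuracy(matcher(source, target, knobs), truth)
              for target, truth in workload]
    return sum(scores) / len(scores)


def best_configuration(matcher: Matcher, source: object,
                       workload: List[Scenario],
                       candidates: Iterable[Knobs]) -> Knobs:
    """Exhaustively pick the candidate configuration with the highest average accuracy."""
    return max(candidates,
               key=lambda knobs: average_accuracy(matcher, source, workload, knobs))
```

Exhaustive search over candidates is used only to keep the sketch short; with a huge space of knobs it is not practical, which is what motivates a staged approach.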
- Perturb data tuples in each table
Example tables: EMPLOYEES and its perturbed counterpart EMPS
O1: a set of semantic matches between V1 and U, e.g. EMPS.emp-last = EMPLOYEES.last, EMPS.id = EMPLOYEES.id, EMPS.wage = EMPLOYEES.salary
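The workload-generation idea (split S into V and U, perturb V, and record the correct matches as a by-product) can be sketched as below. This is a simplified illustration under stated assumptions, not eTuner's generator: the schema representation, the function names, and the rename dictionaries standing in for perturbation rules are all hypothetical, and only column dropping and renaming are shown. Perturbing the number of tables and the data tuples is omitted.

```python
import random
from typing import Dict, List, Set, Tuple

Table = Dict[str, List[object]]     # column name -> column values
Schema = Dict[str, Table]           # table name -> table
Match = Tuple[str, str]             # (attribute in Vi, attribute in U)


def split_schema(schema: Schema) -> Tuple[Schema, Schema]:
    """Split S into V and U: same tables and columns, disjoint data tuples."""
    v: Schema = {}
    u: Schema = {}
    for table, columns in schema.items():
        v[table] = {c: vals[: len(vals) // 2] for c, vals in columns.items()}
        u[table] = {c: vals[len(vals) // 2:] for c, vals in columns.items()}
    return v, u


def perturb(v: Schema, table_names: Dict[str, str], column_names: Dict[str, str],
            drop_probability: float, rng: random.Random) -> Tuple[Schema, Set[Match]]:
    """Build one perturbed schema Vi and the matches between Vi and U.

    Because every surviving column is traced back to its original name,
    the correct matches are known without any manual labeling.
    """
    perturbed: Schema = {}
    matches: Set[Match] = set()
    for table, columns in v.items():
        new_table = table_names.get(table, table)       # perturb the table name
        perturbed[new_table] = {}
        for column, values in columns.items():
            if rng.random() < drop_probability:
                continue                                # perturb the number of columns
            new_column = column_names.get(column, column)  # perturb the column name
            perturbed[new_table][new_column] = list(values)
            matches.add((f"{new_table}.{new_column}", f"{table}.{column}"))
    return perturbed, matches


def generate_workload(schema: Schema, n: int, table_names: Dict[str, str],
                      column_names: Dict[str, str], drop_probability: float = 0.2,
                      seed: int = 0) -> Tuple[Schema, List[Tuple[Schema, Set[Match]]]]:
    """Return U and n synthetic scenarios (Vi, Oi)."""
    rng = random.Random(seed)
    v, u = split_schema(schema)
    scenarios = [perturb(v, table_names, column_names, drop_probability, rng)
                 for _ in range(n)]
    return u, scenarios
```

With the rename rules EMPLOYEES to EMPS, last to emp-last, and salary to wage, and no columns dropped, the recorded matches are the three pairs shown in the example above.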
Tuning using Synthetic Workload
Experimental Results
Summary and Future Work
- Efficient tuning is extremely important
- Our contributions
- Establish that tuning matching systems automatically is feasible
- Synthesize workload to estimate the quality of a matching system with given knob configurations
- Establish that staged tuning is a reasonable optimization technique
- Experiment extensively over 4 real-world domains with 4 matching systems
- Future Work
- Explore better search methods and more extensive evaluation
- Deploy the idea of using synthetic input/output pairs to other applications (e.g. wrapper maintenance)
Staged Tuning
Figure: component levels 1 to 4 of the matching system; the tuning direction runs from Level 1 up to Level 4
- Tune sequentially starting from the lowest-level components
- Find best knob configuration for a component based on matching accuracy over the synthetic workload
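A rough sketch of the staged idea follows. It is a simplification under stated assumptions rather than the procedure used by eTuner: each knob stands in for one component's configuration, `levels` lists the knobs from the lowest level upward, and `evaluate` is a hypothetical callback returning matching accuracy over the synthetic workload for a full configuration.

```python
from typing import Callable, Dict, Sequence

Config = Dict[str, object]


def staged_tuning(levels: Sequence[Dict[str, Sequence[object]]],
                  defaults: Config,
                  evaluate: Callable[[Config], float]) -> Config:
    """Tune knobs level by level, freezing each choice before moving on."""
    config = dict(defaults)                   # do not mutate the caller's defaults
    for level in levels:                      # Level 1 first, then upward
        for knob, candidates in level.items():
            best_value = config[knob]
            best_score = evaluate(config)
            for value in candidates:
                trial = {**config, knob: value}
                score = evaluate(trial)
                if score > best_score:
                    best_value, best_score = value, score
            config[knob] = best_value         # frozen for all later stages
    return config
```

The number of evaluations grows with the sum of the candidate counts per knob rather than their product, which is the appeal of tuning in stages. The trade-off is that interactions between knobs at different levels are only partly explored.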