Title: AnHai Doan
1 Learning Source Descriptions for Data Integration
- AnHai Doan
- Pedro Domingos
- Alon Levy
- Department of Computer Science & Engineering, University of Washington
2 Overview
- Problem definition
- schema matching
- Solution
- multi-strategy learning
- Prototype system
- LSD (Learning Source Descriptions)
- Experiments
- Related work
- Summary and future work
3 Data Integration
Find houses with four bathrooms and price under 500,000
mediated schema
source schema
source schema
source schema
superhomes.com
realestate.com
homeseekers.com
4 Semantic Mappings between Schemas
- Mediated and source schemas are XML DTDs
house
address
num-baths
amenities
contact
name phone
house
location contact-info
full-baths
half-baths
handicap-equipped
agent-name agent-phone
5 Map of the Problem
source descriptions
schema matching
data translation
scope
completeness
reliability
query capability
1-1 mappings
complex mappings
leaf elements
higher-level elements
6 Current State of Affairs
- Largely done by hand
- labor-intensive and error-prone
- key bottleneck in building applications
- Will only be exacerbated
- data sharing and XML become pervasive
- proliferation of DTDs
- translation of legacy data
- Need automatic approaches to scale up!
7 Our Approach
- Use machine learning to match schemas
- Basic idea
- 1. create training data
- manually map a set of sources to mediated schema
- 2. train system on training data
- learns from
- name of schema elements
- format of values
- frequency of words and symbols
- characteristics of value distribution
- proximity, position, structure, ...
- 3. system proposes mappings for subsequent sources
8 Example
mediated schema
realestate.com
address phone price description
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
comments: Fantastic house ...; Great ...; Hurry! ...; ...
location: Seattle, WA; Seattle, WA; Dallas, TX; ...
listed-price: 250,000; 162,000; 180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
9 Multi-Strategy Learning
- Use a set of base learners
- each exploits certain types of information
- Match schema elements of a new source
- apply the learners
- combine their predictions using a meta-learner
- Meta-learner
- measures base learner accuracy on training data
- weighs each learner based on its accuracy
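A minimal sketch of the weighting idea on this slide, assuming each base learner returns a dictionary of label confidences. The names `learner_accuracy` and `combine`, and the weights used in the example, are hypothetical and not taken from LSD.

```python
from collections import defaultdict


def learner_accuracy(predict, labeled_examples):
    """Fraction of labeled examples whose top-scoring label is correct."""
    correct = 0
    for x, label in labeled_examples:
        scores = predict(x)
        if max(scores, key=scores.get) == label:
            correct += 1
    return correct / len(labeled_examples)


def combine(predictions, weights):
    """Weighted sum of per-learner confidences, renormalised to sum to 1."""
    totals = defaultdict(float)
    for learner, scores in predictions.items():
        for label, conf in scores.items():
            totals[label] += weights[learner] * conf
    norm = sum(totals.values())
    return {label: score / norm for label, score in totals.items()}


# Two base learners disagree; the weights are assumed training accuracies.
predictions = {
    "name_matcher": {"name": 0.7, "phone": 0.3},
    "frequency_learner": {"name": 0.2, "phone": 0.8},
}
weights = {"name_matcher": 0.6, "frequency_learner": 0.9}
combined = combine(predictions, weights)
```

Here the more accurate learner dominates, so the combined prediction favors `phone` even though the name matcher prefers `name`.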
10 Learners
- Input
- schema information: name, proximity, structure, ...
- data information: value, format, ...
- Output
- prediction weighted by confidence score
- Examples
- Name matcher
- agent-name => (name, 0.7), (phone, 0.3)
- Frequency learner
- Seattle, WA => (address, 0.8), (name, 0.2)
- Great location ... => (description, 0.9), (address, 0.1)
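To illustrate the learner interface (element in, confidence-weighted labels out), here is a rough name-matcher sketch. It assumes plain string similarity from Python's `difflib`, which is a stand-in and not the matching technique used in LSD; `name_matcher` is a hypothetical name.

```python
from difflib import SequenceMatcher


def name_matcher(source_name, mediated_names):
    """Score each mediated-schema element by string similarity to the source name.

    Returns confidences that sum to 1 (uniform if nothing is similar).
    """
    sims = {
        m: SequenceMatcher(None, source_name.lower(), m.lower()).ratio()
        for m in mediated_names
    }
    total = sum(sims.values())
    if total == 0:
        return {m: 1 / len(mediated_names) for m in mediated_names}
    return {m: s / total for m, s in sims.items()}


scores = name_matcher("agent-phone", ["address", "phone", "price", "description"])
best = max(scores, key=scores.get)
```

A pure string matcher like this cannot relate names such as `location` and `address`, which is one reason the slides combine it with learners that look at data values.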
11 Training the Learners
realestate.com
mediated schema
address phone price description
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
location
listed-price
agent-phone
comments
Name Matcher: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...
Frequency Learner: ("Seattle, WA", address), ("(206) 729 0831", phone), ("250,000", price), ("Fantastic house ...", description), ...
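The training pairs above could be derived from a manually mapped source roughly as follows. This is a sketch under stated assumptions: `LISTING`, `MAPPING`, and `training_examples` are hypothetical names, and the listing is the single house shown on the slide.

```python
import xml.etree.ElementTree as ET

# One listing from the source, as on the slide.
LISTING = """
<house>
  <location>Seattle, WA</location>
  <agent-phone>(206) 729 0831</agent-phone>
  <listed-price>250,000</listed-price>
  <comments>Fantastic house ...</comments>
</house>
"""

# Manually specified mapping: source element -> mediated-schema element.
MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}


def training_examples(xml_text, mapping):
    """Return (name examples, value examples) for the two kinds of learner."""
    root = ET.fromstring(xml_text)
    # The name matcher trains on (source element name, mediated element) pairs.
    name_examples = list(mapping.items())
    # Data-based learners train on (data value, mediated element) pairs.
    value_examples = [
        ((el.text or "").strip(), mapping[el.tag])
        for el in root.iter()
        if el.tag in mapping
    ]
    return name_examples, value_examples


name_examples, value_examples = training_examples(LISTING, MAPPING)
```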
12 Applying the Learners
homes.com
mediated schema
address phone price description
area: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA
Name Matcher, Frequency Learner -> Meta-learner
per-instance predictions: address, address, description, address
Combiner -> address
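One simple way to realise the combiner step is to average the per-instance predictions and take the top label. The slide does not say how LSD's combiner works, so the averaging rule, the name `combine_instances`, and the confidence values below are all assumptions for illustration.

```python
from collections import defaultdict


def combine_instances(instance_predictions):
    """Average per-instance confidences and return (best label, averages)."""
    totals = defaultdict(float)
    for scores in instance_predictions:
        for label, conf in scores.items():
            totals[label] += conf
    n = len(instance_predictions)
    averaged = {label: total / n for label, total in totals.items()}
    return max(averaged, key=averaged.get), averaged


# Hypothetical meta-learner output for the four "area" instances.
area_predictions = [
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.7, "description": 0.3},  # Kent, WA
    {"address": 0.4, "description": 0.6},  # Austin, TX (misclassified)
    {"address": 0.9, "description": 0.1},  # Seattle, WA
]
label, averaged = combine_instances(area_predictions)
```

The single misclassified instance is outvoted, so the element `area` is still mapped to `address`.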
13 The LSD System
- Base learners/modules
- name matcher
- Naive Bayesian learner
- Whirl nearest-neighbor classifier [Cohen & Hirsh, KDD-98]
- county-name recognizer
- Meta-learner
- uses stacking [Ting & Witten 99, Wolpert 92]
- uses training data to learn weights for base learners
- combines predictions using confidence scores/weights
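As an illustration of a Naive Bayesian base learner over word and symbol frequencies, here is a small multinomial Naive Bayes sketch with add-one smoothing. The tokenizer, the class name `NaiveBayesLearner`, and the toy training set are assumptions; the slides do not describe LSD's actual implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split into lowercase words, digit runs, and single symbols."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())


class NaiveBayesLearner:
    def fit(self, examples):
        """examples: iterable of (data value, mediated-schema label)."""
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        """Return label -> confidence, normalised to sum to 1."""
        total = sum(self.label_counts.values())
        tokens = tokenize(text)
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


learner = NaiveBayesLearner().fit([
    ("Seattle, WA", "address"),
    ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"),
    ("(214) 722 4035", "phone"),
    ("Fantastic house ...", "description"),
    ("Great location ...", "description"),
])
scores = learner.predict("Kent, WA")
```

Even though "Kent" never appears in training, the comma and the state token "WA" are enough to favor `address`.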
14 Experiments
15 Related Work
- Rule-based approaches
- TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98]
- utilize only schema information
- Learner-based approaches
- SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
- employ a single learner, limited applicability
- Multi-strategy learning in other domains
- series of workshops (91, 93, 96, 98, 00)
- [Freitag 98], Proverb [Keim et al. 99]
16 Summary
- Schema matching
- automated by learning
- Multi-strategy learning is essential
- handles different types of data
- incorporates different types of domain knowledge
- easy to incorporate new learners
- alleviates effects of noise and dirty data
- Implemented LSD
- promising results with initial experiments
17 Future Work
source descriptions
schema matching
data translation
scope
completeness
reliability
query capability
1-1 mappings
complex mappings
leaf elements
higher-level elements