Title: Pedro Domingos
1 Data Integration: A Killer App for
Multi-Strategy Learning
- Pedro Domingos
- Joint work with AnHai Doan & Alon Levy
- Department of Computer Science & Engineering, University of Washington
2 Overview
- Data integration & XML
- Schema matching
- Multi-strategy learning
- Prototype system & experiments
- Related work
- Future work
- Summary
3 Data Integration
- User query: find houses with four bathrooms and price under $500,000
- The query is posed against a single mediated schema
- Each source has its own source schema
- Sources: superhomes.com, realestate.com, homeseekers.com
4 Why Data Integration Matters
- Very active area in database & AI
- research / workshops
- start-ups
- Large organizations
- multiple databases with differing schemas
- Data warehousing
- The Web HTML sources
- The Web XML sources
5 XML
- Extensible Markup Language
- introduced in 1996
- The standard for data publishing & exchange
- replaces HTML & proprietary formats
- embraced by database/web/e-commerce communities
- XML versus HTML
- both use tags to mark up data elements
- HTML tags specify format
- XML tags define meaning
- relationships among elements provided via nesting
6 Example
HTML
<h1> Residential Listings </h1>
<ul> House For Sale
<li> location Seattle, WA, USA
<li> agent-phone (206) 729 0831
<li> listed-price $250,000
<li> comments Fantastic house ...
</ul>
<hr>
<ul> House For Sale ... </ul>
...
XML
<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>
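Because XML tags name what each value means, a program can recover (element, value) pairs without any page-specific scraping. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `leaf_values` is an assumption made for illustration, and the comments text is shortened from the slide.

```python
import xml.etree.ElementTree as ET

# Listing modeled on the slide's XML example (comments text shortened).
XML_LISTING = """
<residential-listings>
  <house>
    <location>
      <city>Seattle</city><state>WA</state><country>USA</country>
    </location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house</comments>
  </house>
</residential-listings>
"""

def leaf_values(xml_text):
    """Return (path, text) pairs for every leaf element, in document order."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(node, path):
        here = path + [node.tag]
        children = list(node)
        if not children:
            pairs.append(("/".join(here), (node.text or "").strip()))
        for child in children:
            walk(child, here)

    walk(root, [])
    return pairs

for path, text in leaf_values(XML_LISTING):
    print(path, "=>", text)
```

The same extraction on the HTML version would need hand-written rules per page, since `<li>` says nothing about whether an item is a price or a phone number.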
7 XML DTD
- Document Type Definition
- BNF grammar
- constraints on element structure: type, order, # of times
- A real-estate DTD
<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>
- A DTD can be visualized as a tree
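A content model such as `(location?, agent-phone, listed-price, comments?)` is in effect a regular expression over child element names. The sketch below translates a flat content model into a Python regex and checks a sequence of child tags against it. It is an illustration only: the function names are hypothetical, and it handles just comma-separated names with an optional `?`, `*` or `+` suffix, not nested groups or `|` choices.

```python
import re

def content_model_to_regex(model):
    """Translate a flat, comma-separated DTD content model into a regex.

    Only names with an optional ?, * or + suffix are supported; nested
    groups and '|' choices are out of scope for this sketch.
    """
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "?*+" else ""
        name = item[:-1] if suffix else item
        # Each child name is followed by one space in the encoded sequence.
        parts.append("(?:" + re.escape(name) + " )" + suffix)
    return re.compile("".join(parts))

def children_match(model, child_tags):
    """True if the sequence of child tag names satisfies the content model."""
    pattern = content_model_to_regex(model)
    encoded = "".join(tag + " " for tag in child_tags)
    return pattern.fullmatch(encoded) is not None

HOUSE_MODEL = "(location?, agent-phone, listed-price, comments?)"
print(children_match(HOUSE_MODEL, ["agent-phone", "listed-price"]))  # True
print(children_match(HOUSE_MODEL, ["listed-price", "agent-phone"]))  # False
```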
8 Semantic Mappings between Schemas
- Mediated & source schemas: XML DTDs
- Figure: two schema trees whose elements are to be mapped onto each other
- house: address, num-baths, amenities, contact (name, phone)
- house: location, contact-info (agent-name, agent-phone), full-baths, half-baths, handicap-equipped
9 Map of the Problem
- Source descriptions: schema matching, data translation, scope, completeness, reliability, query capability
- Schema matching: 1-1 mappings, complex mappings
- Schema matching: leaf elements, higher-level elements
10 Current State of Affairs
- Largely done by hand
- labor intensive & error prone
- key bottleneck in building applications
- Will only be exacerbated
- as data sharing & XML become pervasive
- proliferation of DTDs
- translation of legacy data
- Need automatic approaches to scale up!
11 Our Approach
- Use machine learning to match schemas
- Basic idea
- 1. create training data
- manually map a set of sources to mediated schema
- 2. train system on training data
- learns from
- name of schema elements
- format of values
- frequency of words & symbols
- characteristics of value distribution
- proximity, position, structure, ...
- 3. system proposes mappings for subsequent sources
12 Example
- Mediated schema: address, phone, price, description
- Source: realestate.com
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
- Data instances collected per source element
- location: Seattle, WA; Seattle, WA; Dallas, TX; ...
- agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
- listed-price: $250,000; $162,000; $180,000; ...
- comments: Fantastic house ...; Great ...; Hurry! ...; ...
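The per-element columns in this example can be produced by grouping leaf values by tag name across all listings of a source. A minimal sketch follows; the `<listings>` wrapper element and the function name `columns_by_tag` are assumptions for illustration, and the comment texts are shortened because the slide elides them.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Three listings using the values shown on the slide (comments shortened).
LISTINGS = """
<listings>
  <house><location>Seattle, WA</location><agent-phone>(206) 729 0831</agent-phone>
         <listed-price>$250,000</listed-price><comments>Fantastic house</comments></house>
  <house><location>Seattle, WA</location><agent-phone>(206) 321 4571</agent-phone>
         <listed-price>$162,000</listed-price><comments>Great</comments></house>
  <house><location>Dallas, TX</location><agent-phone>(214) 722 4035</agent-phone>
         <listed-price>$180,000</listed-price><comments>Hurry!</comments></house>
</listings>
"""

def columns_by_tag(xml_text):
    """Group the text of every leaf element by its tag name."""
    columns = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        if len(elem) == 0:  # leaf element: no children
            columns[elem.tag].append((elem.text or "").strip())
    return dict(columns)

columns = columns_by_tag(LISTINGS)
for tag, values in columns.items():
    print(tag, values)
```

Each column then serves as the set of data instances for one source element when learners are trained or applied.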
13 Multi-Strategy Learning
- Use a set of base learners
- each exploits certain types of information
- Match schema elements of a new source
- apply the learners
- combine their predictions using a meta-learner
- Meta-learner
- measures base learner accuracy on training data
- weighs each learner based on its accuracy
14 Learners
- Input
- schema information: name, proximity, structure, ...
- data information: value, format, ...
- Output
- prediction weighted by confidence score
- Example learners
- name matcher
- agent-name => (name, 0.7), (phone, 0.3)
- Naive Bayes
- "Seattle, WA" => (address, 0.8), (name, 0.2)
- "Great location ..." => (description, 0.9), (address, 0.1)
15 Training the Learners
- Source: realestate.com
- Mediated schema: address, phone, price, description
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
- Source elements: location, listed-price, agent-phone, comments
- Name Matcher training examples: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...
- Naive Bayes training examples: ("Seattle, WA", address), ("(206) 729 0831", phone), ("$250,000", price), ("Fantastic house ...", description), ...
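The two kinds of training examples on this slide can be derived mechanically once a source has been mapped by hand: the name matcher sees (source element name, mediated element) pairs, while Naive Bayes sees (data value, mediated element) pairs. A small sketch, with hypothetical function names and only the values shown on the slide:

```python
# Manual mapping for one training source, as listed on the slide.
MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}

# One extracted data instance per source element, as shown on the slide.
INSTANCES = {
    "location": ["Seattle, WA"],
    "agent-phone": ["(206) 729 0831"],
    "listed-price": ["$250,000"],
    "comments": ["Fantastic house ..."],
}

def name_matcher_examples(mapping):
    """(source element name, mediated element) pairs."""
    return list(mapping.items())

def naive_bayes_examples(mapping, instances):
    """(data value, mediated element) pairs, one per extracted instance."""
    return [
        (value, mapping[tag])
        for tag, values in instances.items()
        for value in values
    ]

print(name_matcher_examples(MAPPING))
print(naive_bayes_examples(MAPPING, INSTANCES))
```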
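16 Applying the Learned Models
- New source: homes.com
- Mediated schema: address, phone, price, description
- Source element area, with instances: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA
- Each instance goes through Name Matcher & Naive Bayes, then the meta-learner
- Per-instance predictions: address, address, description, address
- Combiner merges the per-instance predictions
- Final prediction for area: address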
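The last step of this slide turns several per-instance predictions into one prediction for the source element. The slide does not state the combination rule, so the sketch below assumes a simple majority vote; the function name is hypothetical, and a real combiner might average confidence scores instead.

```python
from collections import Counter

def combine_instance_predictions(labels):
    """Pick the label predicted most often across an element's instances.

    Majority voting is an assumption made for this sketch; the slide only
    shows that per-instance predictions are merged into one label.
    """
    return Counter(labels).most_common(1)[0][0]

# Per-instance predictions for the element "area", as shown on the slide.
per_instance = ["address", "address", "description", "address"]
print(combine_instance_predictions(per_instance))  # address
```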
17 The LSD System
- Base learners/modules
- name matcher
- Naive Bayes
- Whirl nearest-neighbor classifier [Cohen & Hirsh, KDD-98]
- county-name recognizer
- Meta-learner
- stacking [Ting & Witten 99, Wolpert 92]
18 Name Matcher
- Matches based on names
- including all names on path from root to current node
- allowing synonyms
- Good for ...
- specific, descriptive names: agent-phone, listed-price
- Bad for ...
- vacuous names: item, listings
- partially specified, ambiguous names: office (for office phone)
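One way to picture a name matcher is as token overlap between the words on an element's path and the words in each mediated-schema element name, after synonym expansion. The sketch below uses Jaccard overlap normalized into confidences. The synonym table, the function names and the scoring rule are all assumptions for illustration; the slides do not specify how LSD scores names.

```python
import re

# Hypothetical synonym table; the slides do not list the synonyms used.
SYNONYMS = {"location": "address", "comments": "description", "telephone": "phone"}

def tokens(path):
    """Split a root-to-node path such as 'house/agent-phone' into normalized words."""
    words = re.findall(r"[a-z]+", path.lower())
    return {SYNONYMS.get(w, w) for w in words}

def name_match(path, mediated_elements):
    """Return {mediated element: confidence} from Jaccard overlap of name tokens."""
    source = tokens(path)
    scores = {}
    for element in mediated_elements:
        target = tokens(element)
        scores[element] = len(source & target) / len(source | target)
    total = sum(scores.values())
    if total == 0:
        # No overlap at all: the name carries no signal, so stay uniform.
        return {e: 1 / len(scores) for e in scores}
    return {e: s / total for e, s in scores.items()}

MEDIATED = ["address", "phone", "price", "description"]
print(name_match("house/agent-phone", MEDIATED))
print(name_match("listings/item", MEDIATED))
```

The second call illustrates the "vacuous names" case: with no shared tokens the matcher has nothing to go on and returns a uniform distribution.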
19 Naive Bayes Learner
- Exploits frequencies of words & symbols
- Good for ...
- elements with words/symbols that are strongly indicative
- examples
- "fantastic" & "great" in house descriptions
- $ in prices, parentheses in phone numbers
- Bad for ...
- short, numeric elements: num-baths, num-bedrooms
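To make the word-and-symbol frequency idea concrete, here is a small multinomial Naive Bayes classifier with Laplace smoothing, written from scratch. It is a sketch under assumptions: the tokenizer, the class name and the training values beyond those on the slides ("Great location" comes from the Learners slide, the rest from the example slides) are illustrative, not the LSD implementation.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split into lowercase words, digit runs, and single symbols such as '$' or '('."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class NaiveBayes:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(count / n)
            for tok in tokenize(text):
                # Laplace smoothing keeps unseen tokens from zeroing the score.
                seen = self.token_counts[label][tok]
                score += math.log((seen + 1) / (total + len(self.vocab)))
            log_scores[label] = score
        top = max(log_scores.values())
        scaled = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(scaled.values())
        return {l: v / z for l, v in scaled.items()}

TRAINING = [
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house", "description"), ("Great location", "description"),
]

model = NaiveBayes()
model.train(TRAINING)
prediction = model.predict("$180,000")
print(max(prediction, key=prediction.get), prediction)
```

Symbols such as `$` and `(` do most of the work here, which is also why a bare number like a bathroom count gives this learner little to discriminate on.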
20 WHIRL Nearest-Neighbor Classifier
- Similarity-based
- stores all examples seen so far
- classifies a new example based on similarity to training examples
- IR document similarity metric
- Good for ...
- long, textual elements: house description, names
- limited, descriptive set of values: color (blue, red, ...)
- Bad for ...
- short, numeric elements: num-baths, num-bedrooms
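The similarity-based idea can be sketched with a plain 1-nearest-neighbor classifier over TF-IDF vectors and cosine similarity. This is a simplified stand-in, not WHIRL itself; the class name, the IDF formula and the description texts are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build TF-IDF vectors (as dicts) for a list of token lists; also return the IDF table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(1 + n / df[tok]) for tok in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class NearestNeighbor:
    def __init__(self, examples):
        # Store every training example; classification compares against all of them.
        self.labels = [label for _, label in examples]
        docs = [text.lower().split() for text, _ in examples]
        self.vectors, self.idf = tf_idf_vectors(docs)

    def predict(self, text):
        """Return (label of the most similar stored example, its similarity)."""
        counts = Counter(text.lower().split())
        query = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        sims = [cosine(query, v) for v in self.vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        return self.labels[best], sims[best]

EXAMPLES = [
    ("fantastic house with great view", "description"),
    ("blue", "color"),
    ("red", "color"),
    ("great location near park", "description"),
]

nn = NearestNeighbor(EXAMPLES)
print(nn.predict("great house"))
print(nn.predict("red"))
```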
21 County-Name Recognizer
- Stores all county names, obtained from the Web
- Verifies if the input name is a county name
- Essential to matching a county-name element
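A recognizer of this kind amounts to a dictionary lookup. The sketch below uses a tiny hand-written set of county names in place of the Web-derived list, and reports the fraction of an element's values that are known counties; the function names and the confidence rule are assumptions for illustration.

```python
# Tiny illustrative gazetteer; the slide says the real list was obtained from the Web.
COUNTY_NAMES = {"king", "pierce", "snohomish", "dallas", "travis"}

def is_county(value):
    """True if the value, with an optional trailing 'County', is a known county name."""
    name = value.strip().lower()
    if name.endswith(" county"):
        name = name[: -len(" county")]
    return name in COUNTY_NAMES

def county_confidence(values):
    """Fraction of an element's values that are known county names."""
    if not values:
        return 0.0
    return sum(is_county(v) for v in values) / len(values)

print(county_confidence(["King", "Pierce County", "Seattle, WA"]))
```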
22 Meta-Learner: Stacking
- Training
- uses training data to learn weights
- one for each (base learner, mediated-schema element)
- Combining predictions
- for each mediated-schema element
- computes weighted sum of base-learner confidence scores
- picks mediated-schema element with highest sum
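The combining step above can be written directly as a weighted sum per mediated-schema element. In this sketch the weights are supplied by hand rather than learned, and the learner names, weights and confidence values are hypothetical numbers chosen only to show the arithmetic.

```python
def combine(predictions, weights):
    """Weighted-sum combination of base-learner confidences.

    predictions: {learner: {element: confidence}}
    weights:     {(learner, element): weight}
    Returns the element with the highest weighted sum, plus all sums.
    """
    elements = {e for scores in predictions.values() for e in scores}
    totals = {
        e: sum(
            weights.get((learner, e), 0.0) * scores.get(e, 0.0)
            for learner, scores in predictions.items()
        )
        for e in elements
    }
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical confidences from two base learners for one source element.
predictions = {
    "name_matcher": {"address": 0.4, "description": 0.6},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
# Hypothetical weights, one per (base learner, mediated-schema element).
weights = {
    ("name_matcher", "address"): 0.3, ("name_matcher", "description"): 0.3,
    ("naive_bayes", "address"): 0.7, ("naive_bayes", "description"): 0.7,
}

best, totals = combine(predictions, weights)
print(best, totals)
```

Here address scores 0.3 * 0.4 + 0.7 * 0.8 = 0.68 against 0.3 * 0.6 + 0.7 * 0.2 = 0.32 for description, so the more trusted learner outvotes the name matcher.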
23 Experiments
24 Reasons for Incorrect Matchings
- Unfamiliarity
- suburb
- solution: add a suburb-name recognizer
- Insufficient information
- correctly identified the general type
- failed to pinpoint the exact type
- <agent-name> Richard Smith </agent-name> <phone> (206) 234 5412 </phone>
- solution: add a proximity learner
25 Experiments: Summary
- Multi-strategy learning
- better performance than any single learner
- Accuracy of 100% unlikely to be reached
- difficult even for humans
- Lots of room for improvement
- more learners
- better learning algorithms
26 Related Work
- Rule-based approaches
- TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98]
- utilize only schema information
- Learner-based approaches
- SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
- employ a single learner, limited applicability
27 Future Work
- Source descriptions: schema matching, data translation, scope, completeness, reliability, query capability
- Schema matching: 1-1 mappings, complex mappings
- Schema matching: leaf elements, higher-level elements
28 Future Work
- Improve matching accuracy
- more learners, more domains
- Incorporate domain knowledge
- semantic integrity constraints
- concept hierarchy of mediated-schema elements
- Learn with structured data
29 Learning with Structured Data
- Each example with >1 level of structure
- Generative model for XML
- XML classifier
- XML: killer app for relational learning
30 Summary
- Schema matching
- automated by learning
- Multi-strategy learning is essential
- handles different types of data
- incorporates different types of domain knowledge
- easy to incorporate new learners
- alleviates effects of noise & dirty data
- Implemented LSD
- promising results with initial experiments