Pedro Domingos

1
Data Integration: A Killer App for Multi-Strategy Learning

Pedro Domingos
Joint work with AnHai Doan & Alon Levy
Department of Computer Science & Engineering
University of Washington

2
Overview

Data integration & XML
Schema matching
Multi-strategy learning
Prototype system & experiments
Related work
Future work
Summary

3
Data Integration

Query: find houses with four bathrooms and price under 500,000
[Figure: the query is posed against a single mediated schema, which is mapped to the source schemas of superhomes.com, realestate.com, and homeseekers.com]

4
Why Data Integration Matters

Very active area in database & AI
research / workshops
start-ups
Large organizations
multiple databases with differing schemas
Data warehousing
The Web: HTML sources
The Web: XML sources

5
XML

Extensible Markup Language
introduced in 1996
The standard for data publishing & exchange
replaces HTML & proprietary formats
embraced by database/web/e-commerce communities
XML versus HTML
both use tags to mark up data elements
HTML tags specify format
XML tags define meaning
relationships among elements provided via nesting

6
Example

XML
  <residential-listings>
    <house>
      <location>
        <city> Seattle </city>
        <state> WA </state>
        <country> USA </country>
      </location>
      <agent-phone> (206) 729 0831 </agent-phone>
      <listed-price> 250,000 </listed-price>
      <comments> Fantastic house ... </comments>
    </house>
    ...
  </residential-listings>

HTML
  <h1> Residential Listings </h1>
  <ul> House For Sale
    <li> location Seattle, WA, USA
    <li> agent-phone (206) 729 0831
    <li> listed-price 250,000
    <li> comments Fantastic house ...
  </ul>
  <hr>
  <ul> House For Sale ... </ul>
  ...

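
Because XML tags name the meaning of each value, a program can pull fields out by tag rather than by position on the page. A minimal sketch using Python's standard `xml.etree.ElementTree`; the listing mirrors the slide's example, and the helper name `house_records` is an illustrative assumption rather than anything from the presentation.

```python
import xml.etree.ElementTree as ET

# Listing text modelled on the slide's XML example.
LISTING = """
<residential-listings>
  <house>
    <location>
      <city>Seattle</city>
      <state>WA</state>
      <country>USA</country>
    </location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>250,000</listed-price>
    <comments>Fantastic house ...</comments>
  </house>
</residential-listings>
"""


def house_records(xml_text):
    """Return one dict per <house>, keyed by leaf tag name."""
    root = ET.fromstring(xml_text)
    records = []
    for house in root.iter("house"):
        record = {}
        for elem in house.iter():
            # Leaf elements (no children) carry the data values.
            if len(elem) == 0 and elem.text is not None:
                record[elem.tag] = elem.text.strip()
        records.append(record)
    return records


records = house_records(LISTING)
print(records[0]["city"], records[0]["listed-price"])
```

The same lookup against the HTML version would have to rely on the position of each list item, which is the contrast the slide draws.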
7
XML DTD

Document Type Definition
BNF grammar
constraints on element structure: type, order, # of times
A real-estate DTD

  <!ELEMENT residential-listings (house*)>
  <!ELEMENT house (location?, agent-phone, listed-price, comments?)>
  <!ELEMENT location (city, state, country?)>

A DTD can be visualized as a tree

8
Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

Mediated schema tree:
  house
    address
    num-baths
    amenities
    contact
      name
      phone

Source schema tree:
  house
    location
    contact-info
      agent-name
      agent-phone
    full-baths
    half-baths
    handicap-equipped

9
Map of the Problem

[Figure: map of the problem. Source descriptions cover schema matching, data translation, scope, completeness, reliability, and query capability. Schema matching is divided into 1-1 mappings vs. complex mappings, and leaf elements vs. higher-level elements.]

10
Current State of Affairs

Largely done by hand
labor intensive & error prone
key bottleneck in building applications
Will only be exacerbated
data sharing & XML become pervasive
proliferation of DTDs
translation of legacy data
Need automatic approaches to scale up!

11
Our Approach

Use machine learning to match schemas
Basic idea
1. create training data
manually map a set of sources to mediated schema
2. train system on training data
learns from
name of schema elements
format of values
frequency of words & symbols
characteristics of value distribution
proximity, position, structure, ...
3. system proposes mappings for subsequent sources

12
Example

mediated schema: address, phone, price, description

realestate.com
  <house>
    <location> Seattle, WA </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> 250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...

Values collected per source element:
location: Seattle, WA; Seattle, WA; Dallas, TX; ...
listed-price: 250,000; 162,000; 180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
comments: Fantastic house ...; Great ...; Hurry! ...; ...

13
Multi-Strategy Learning

Use a set of base learners
each exploits certain types of information
Match schema elements of a new source
apply the learners
combine their predictions using a meta-learner
Meta-learner
measures base learner accuracy on training data
weighs each learner based on its accuracy

14
Learners

Input
schema information: name, proximity, structure, ...
data information: value, format, ...
Output
prediction weighted by confidence score
Example learners
name matcher
agent-name => (name, 0.7), (phone, 0.3)
Naive Bayes
"Seattle, WA" => (address, 0.8), (name, 0.2)
"Great location ..." => (description, 0.9), (address, 0.1)
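
One way to picture a learner's output is as a mapping from mediated-schema element to confidence. The sketch below is only an illustration of that shape: the helper names `to_confidences` and `ranked`, and the idea of normalising raw scores so they sum to 1, are assumptions and are not described in the slides.

```python
def to_confidences(raw_scores):
    """Scale non-negative raw scores so that they sum to 1."""
    total = sum(raw_scores.values())
    if total <= 0:
        # No evidence at all: spread confidence evenly (an assumed fallback).
        return {label: 1 / len(raw_scores) for label in raw_scores}
    return {label: score / total for label, score in raw_scores.items()}


def ranked(prediction):
    """List (label, confidence) pairs, most confident first."""
    return sorted(prediction.items(), key=lambda pair: pair[1], reverse=True)


# The name-matcher example from the slide, written as a prediction.
agent_name = {"phone": 0.3, "name": 0.7}
print(ranked(agent_name))
print(to_confidences({"address": 4, "name": 1}))
```

Giving every learner the same output shape is what lets a meta-learner combine them without knowing how each one works.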

15
Training the Learners

realestate.com
  <house>
    <location> Seattle, WA </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> 250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...

mediated schema: address, phone, price, description
source elements: location, listed-price, agent-phone, comments

Training examples generated for each learner:
Name Matcher: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...
Naive Bayes: ("Seattle, WA", address), ("(206) 729 0831", phone), ("250,000", price), ("Fantastic house ...", description), ...

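
The two kinds of training example on this slide can both be derived from one manually mapped source. A minimal sketch, assuming the manual mapping is available as a dictionary and the source data as (tag, value) pairs; the function name `training_pairs` and both data structures are hypothetical.

```python
def training_pairs(mapping, instances):
    """Build training examples for a name-based and a value-based learner.

    mapping:   source tag -> mediated-schema element (the manual mapping)
    instances: list of (source tag, data value) pairs from the source
    """
    # The name matcher learns from (source name, mediated element) pairs.
    name_examples = sorted(mapping.items())
    # Value-based learners such as Naive Bayes learn from (value, mediated element).
    value_examples = [
        (value, mapping[tag]) for tag, value in instances if tag in mapping
    ]
    return name_examples, value_examples


MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}
INSTANCES = [
    ("location", "Seattle, WA"),
    ("agent-phone", "(206) 729 0831"),
    ("listed-price", "250,000"),
    ("comments", "Fantastic house ..."),
]

name_examples, value_examples = training_pairs(MAPPING, INSTANCES)
print(name_examples)
print(value_examples)
```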
16
Applying the Learned Models

homes.com
mediated schema: address, phone, price, description
source element: area
values: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA
each value: Name Matcher + Naive Bayes => Meta-learner
per-value predictions: address, address, description, address
Combiner => address

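
The slide shows per-value predictions being merged into one prediction for the whole source element, but does not say how the combiner does it. The sketch below assumes simple averaging of confidences; the function name `combine_instances` and the confidence numbers are invented for illustration.

```python
from collections import defaultdict


def combine_instances(instance_predictions):
    """Average per-label confidence over all values of one source element."""
    if not instance_predictions:
        raise ValueError("need at least one prediction")
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, confidence in prediction.items():
            totals[label] += confidence
    count = len(instance_predictions)
    averaged = {label: total / count for label, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


# Invented meta-learner outputs for the four "area" values on the slide;
# the third value is individually mistaken for a description.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.7, "description": 0.3},
    {"address": 0.4, "description": 0.6},
    {"address": 0.9, "description": 0.1},
]
best, averaged = combine_instances(area_predictions)
print(best, averaged)
```

Averaging lets the three confident "address" predictions outvote the single "description" prediction, which matches the outcome drawn on the slide.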
17
The LSD System

Base learners/modules
name matcher
Naive Bayes
WHIRL nearest-neighbor classifier [Cohen & Hirsh, KDD-98]
county-name recognizer
Meta-learner
stacking [Ting & Witten 99, Wolpert 92]

18
Name Matcher

Matches based on names
including all names on path from root to current node
allowing synonyms
Good for ...
specific, descriptive names: agent-phone, listed-price
Bad for ...
vacuous names: item, listings
partially specified, ambiguous names: office (for office phone)
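
A name matcher along these lines can be sketched with token overlap. This is an illustration under stated assumptions, not the LSD implementation: the synonym table, the Jaccard similarity, and the names `tokens` and `name_match` are all hypothetical choices.

```python
import re

# Hypothetical synonym table; a real system would curate or learn one.
SYNONYMS = {"location": "address", "comments": "description", "telephone": "phone"}


def tokens(path):
    """Split every name on the root-to-node path into canonical tokens."""
    words = re.findall(r"[a-z0-9]+", " ".join(path).lower())
    return {SYNONYMS.get(word, word) for word in words}


def name_match(path, mediated_elements):
    """Score each mediated element by Jaccard overlap with the path tokens."""
    source = tokens(path)
    scores = {}
    for label in mediated_elements:
        target = tokens([label])
        union = source | target
        scores[label] = len(source & target) / len(union) if union else 0.0
    total = sum(scores.values())
    if total == 0:
        # Vacuous names match nothing, so every element is equally likely.
        return {label: 1 / len(scores) for label in scores}
    return {label: score / total for label, score in scores.items()}


MEDIATED = ["address", "phone", "price", "description"]
print(name_match(["house", "agent-phone"], MEDIATED))
print(name_match(["listings", "item"], MEDIATED))
```

The second call shows the failure mode named on the slide: a vacuous name such as "item" gives the matcher nothing to work with, so it returns a uniform guess.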

19
Naive Bayes Learner

Exploits frequencies of words & symbols
Good for ...
elements with words/symbols that are strongly indicative
examples:
"fantastic" & "great" in house descriptions
$ in prices, parentheses in phone numbers
Bad for ...
short, numeric elements: num-baths, num-bedrooms
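
To make the word-and-symbol frequency idea concrete, here is a small multinomial Naive Bayes classifier with add-one smoothing. It is a sketch under assumptions: the tokenizer, the class name `NaiveBayesMatcher`, and the training values (which add a "$" to prices for illustration) are not taken from the slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(value):
    """Words, digit runs, and individual symbols all count as tokens."""
    return re.findall(r"[a-z]+|\d+|[^\w\s]", value.lower())


class NaiveBayesMatcher:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        for value, label in examples:
            toks = tokenize(value)
            self.label_counts[label] += 1
            self.token_counts[label].update(toks)
            self.vocabulary.update(toks)

    def predict(self, value):
        toks = tokenize(value)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            # Add-one (Laplace) smoothing over the shared vocabulary.
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary)
            score = math.log(count / total)
            for tok in toks:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Turn log scores into confidences that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: v / norm for label, v in exp.items()}


matcher = NaiveBayesMatcher()
matcher.train([
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house ...", "description"),
    ("Great location ... Hurry!", "description"),
])
price_guess = matcher.predict("$180,000")
phone_guess = matcher.predict("(206) 321 4571")
print(max(price_guess, key=price_guess.get), max(phone_guess, key=phone_guess.get))
```

The "$" and the parentheses do most of the work here, which is why the same learner struggles on short numeric fields such as num-baths that share no distinctive tokens.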

20
WHIRL Nearest-Neighbor Classifier

Similarity-based
stores all examples seen so far
classifies a new example based on similarity to training examples
IR document similarity metric
Good for ...
long, textual elements: house description, names
limited, descriptive set of values: color (blue, red, ...)
Bad for ...
short, numeric elements: num-baths, num-bedrooms
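
The similarity-based idea can be sketched as a nearest-neighbor classifier over TF-IDF vectors with cosine similarity. This is a simplified stand-in and not WHIRL itself; the class name `TfidfNearestNeighbor`, the IDF formula, and the training values are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfNearestNeighbor:
    def __init__(self, examples):
        # Stores every (value, label) training example, as the slide describes.
        self.examples = examples
        docs = [Counter(words(value)) for value, _ in examples]
        # Document frequency: number of examples containing each word.
        df = Counter(word for doc in docs for word in doc)
        self.idf = {w: math.log(1 + len(docs) / c) for w, c in df.items()}
        self.vectors = [self._vector(doc) for doc in docs]

    def _vector(self, counts):
        """Unit-length TF-IDF vector; words unseen in training are dropped."""
        vec = {w: tf * self.idf[w] for w, tf in counts.items() if w in self.idf}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        return {w: x / norm for w, x in vec.items()} if norm else {}

    def predict(self, value):
        """Score each label by its most similar stored example."""
        query = self._vector(Counter(words(value)))
        best = {}
        for vec, (_, label) in zip(self.vectors, self.examples):
            similarity = sum(x * vec.get(w, 0.0) for w, x in query.items())
            best[label] = max(best.get(label, 0.0), similarity)
        return best


knn = TfidfNearestNeighbor([
    ("Fantastic house with a great view", "description"),
    ("Great location close to schools", "description"),
    ("blue", "color"),
    ("red", "color"),
])
text_guess = knn.predict("great house near schools")
color_guess = knn.predict("blue")
print(text_guess, color_guess)
```

A value such as "3" for num-baths shares no words with any stored example, so every label scores zero, which illustrates why short numeric elements are listed as a weak spot.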

21
County-Name Recognizer

Stores all county names, obtained from the Web
Verifies if the input name is a county name
Essential to matching a county-name element
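
A recognizer of this kind reduces to a dictionary lookup. A minimal sketch, assuming the county list has already been downloaded into a set; the three names below are a tiny sample of Washington counties, and the helpers `is_county` and `county_confidence` are hypothetical.

```python
# Tiny illustrative sample; the slide's recognizer stores all county names.
COUNTY_NAMES = {"king", "snohomish", "pierce"}


def is_county(value):
    """Check one value against the stored list, ignoring a ' County' suffix."""
    name = value.strip().lower()
    suffix = " county"
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name in COUNTY_NAMES


def county_confidence(values):
    """Fraction of an element's values that are recognised county names."""
    if not values:
        return 0.0
    return sum(is_county(value) for value in values) / len(values)


print(county_confidence(["King", "Pierce County", "Snohomish"]))
print(county_confidence(["Seattle, WA", "Kent, WA"]))
```

County names look like ordinary short text to the other learners, which is why a dedicated lookup is needed for this one element.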

22
Meta-Learner: Stacking

Training
uses training data to learn weights
one for each (base learner, mediated-schema element) pair
Combining predictions
for each mediated-schema element
computes weighted sum of base-learner confidence
scores
picks mediated-schema element with highest sum
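
The combination step is a weighted sum, which the sketch below implements directly. The weight-learning step is a deliberate simplification: it sets each (base learner, mediated-schema element) weight to the fraction of training examples with that label that the learner ranks first, whereas stacking proper fits the weights from the learners' predictions on held-out training data. The function names and all numbers are assumptions for illustration.

```python
def learn_weights(learner_outputs, true_labels, labels):
    """One weight per (learner, label), estimated from training predictions.

    learner_outputs: learner name -> list of prediction dicts, one per example
    true_labels:     the correct mediated-schema element for each example
    """
    weights = {}
    for learner, predictions in learner_outputs.items():
        for label in labels:
            relevant = [p for p, t in zip(predictions, true_labels) if t == label]
            hits = sum(1 for p in relevant if max(p, key=p.get) == label)
            weights[(learner, label)] = hits / len(relevant) if relevant else 0.0
    return weights


def combine(predictions, weights, labels):
    """Weighted sum of base-learner confidences; pick the highest sum."""
    sums = {
        label: sum(
            weights[(learner, label)] * prediction.get(label, 0.0)
            for learner, prediction in predictions.items()
        )
        for label in labels
    }
    return max(sums, key=sums.get), sums


LABELS = ["address", "description"]
training_outputs = {
    "name_matcher": [
        {"address": 0.6, "description": 0.4},
        {"address": 0.7, "description": 0.3},
    ],
    "naive_bayes": [
        {"address": 0.8, "description": 0.2},
        {"address": 0.1, "description": 0.9},
    ],
}
weights = learn_weights(training_outputs, ["address", "description"], LABELS)

choice, sums = combine(
    {
        "name_matcher": {"address": 0.5, "description": 0.5},
        "naive_bayes": {"address": 0.2, "description": 0.8},
    },
    weights,
    LABELS,
)
print(choice, sums)
```

In this toy run the name matcher never identified a description correctly during training, so its vote for "description" gets weight zero and the Naive Bayes prediction decides the outcome.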

23
Experiments
24
Reasons for Incorrect Matchings

Unfamiliarity
suburb
solution: add a suburb-name recognizer
Insufficient information
correctly identified the general type
failed to pinpoint the exact type
<agent-name> Richard Smith </agent-name> <phone> (206) 234 5412 </phone>
solution: add a proximity learner

25
Experiments Summary

Multi-strategy learning
better performance than any single learner
Accuracy of 100% unlikely to be reached
difficult even for humans
Lots of room for improvement
more learners
better learning algorithms

26
Related Work

Rule-based approaches
TranScm [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98]
utilize only schema information
Learner-based approaches
SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
employ a single learner, limited applicability

27
Future Work
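
[Figure: map of the problem, repeated. Source descriptions cover schema matching, data translation, scope, completeness, reliability, and query capability. Schema matching is divided into 1-1 mappings vs. complex mappings, and leaf elements vs. higher-level elements.]
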
28
Future Work

Improve matching accuracy
more learners, more domains
Incorporate domain knowledge
semantic integrity constraints
concept hierarchy of mediated-schema elements
Learn with structured data

29
Learning with Structured Data

Each example with >1 level of structure
Generative model for XML
XML classifier
XML: killer app for relational learning

30
Summary

Schema matching
automated by learning
Multi-strategy learning is essential
handles different types of data
incorporates different types of domain knowledge
easy to incorporate new learners
alleviates effects of noise & dirty data
Implemented LSD
promising results with initial experiments
