Pedro Domingos - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Pedro Domingos


1
Data Integration: A Killer App for
Multi-Strategy Learning
  • Pedro Domingos
  • Joint work with AnHai Doan & Alon Levy
  • Department of Computer Science & Engineering,
    University of Washington

2
Overview
  • Data integration & XML
  • Schema matching
  • Multi-strategy learning
  • Prototype system & experiments
  • Related work
  • Future work
  • Summary

3
Data Integration
Find houses with four bathrooms and price under
$500,000
[Figure: mediated schema; source schemas for superhomes.com, realestate.com, homeseekers.com]

4
Why Data Integration Matters
  • Very active area in database & AI
      • research / workshops
      • start-ups
  • Large organizations
      • multiple databases with differing schemas
  • Data warehousing
  • The Web HTML sources
  • The Web XML sources

5
XML
  • Extensible Markup Language
      • introduced in 1996
  • The standard for data publishing & exchange
      • replaces HTML & proprietary formats
      • embraced by database/web/e-commerce communities
  • XML versus HTML
      • both use tags to mark up data elements
      • HTML tags specify format
      • XML tags define meaning
      • relationships among elements provided via nesting

6
Example
HTML
<h1> Residential Listings </h1>
<ul> House For Sale
  <li> location: Seattle, WA, USA
  <li> agent-phone: (206) 729 0831
  <li> listed-price: $250,000
  <li> comments: Fantastic house ...
</ul>
<hr>
<ul> House For Sale ... </ul>
...
XML
<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>

7
XML DTD
  • Document Type Descriptor
      • BNF grammar
      • constraints on element structure: type, order,
        # of times
  • A real-estate DTD

<!ELEMENT residential-listings (house)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>
  • A DTD can be visualized as a tree
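
To make the "grammar over child elements" idea concrete, here is a minimal sketch that checks a parsed document against the three content models from the slide. It is not a real DTD validator: Python's xml.etree.ElementTree does not validate, so each content model is translated into a regular expression over child tag names. The `house*` repetition on the root, the `CONTENT_MODELS` table, and the helper names are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content models copied from the slide. The "*" after house is an assumption
# so that a listing file may hold any number of houses.
CONTENT_MODELS = {
    "residential-listings": "(house*)",
    "house": "(location?, agent-phone, listed-price, comments?)",
    "location": "(city, state, country?)",
}

def model_to_regex(model):
    """Translate a simple sequence content model into a regex over child names."""
    parts = []
    for item in model.strip("()").split(","):
        item = item.strip()
        if item[-1] in "?*+":
            name, suffix = item[:-1], item[-1]
        else:
            name, suffix = item, ""
        parts.append("(?:" + re.escape(name) + " )" + suffix)
    return re.compile("".join(parts))

def conforms(element):
    """Return True if element and all of its descendants satisfy the models."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # Elements without a declared model are treated as text-only leaves.
        return len(element) == 0
    children = "".join(child.tag + " " for child in element)
    if not model_to_regex(model).fullmatch(children):
        return False
    return all(conforms(child) for child in element)

good = ET.fromstring(
    "<residential-listings><house>"
    "<location><city>Seattle</city><state>WA</state></location>"
    "<agent-phone>(206) 729 0831</agent-phone>"
    "<listed-price>250,000</listed-price>"
    "</house></residential-listings>"
)
wrong_order = ET.fromstring(
    "<house><listed-price>250,000</listed-price>"
    "<agent-phone>(206) 729 0831</agent-phone></house>"
)
print(conforms(good), conforms(wrong_order))
```

The second document fails because the DTD fixes the order of agent-phone and listed-price, which is the kind of structural constraint the slide refers to.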

8
Semantic Mappings between Schemas
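  • Mediated & source schemas are XML DTDs

[Figure: two schema trees with semantic mappings between their elements. One tree: house with address, num-baths, amenities, and contact (name, phone). The other tree: house with location, contact-info (agent-name, agent-phone), full-baths, half-baths, and handicap-equipped.]
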
9
Map of the Problem
[Figure: map of the problem. Source descriptions cover schema matching, data translation, scope, completeness, reliability, and query capability. Schema matching is divided into 1-1 mappings vs. complex mappings, and leaf elements vs. higher-level elements.]

10
Current State of Affairs
  • Largely done by hand
      • labor intensive & error prone
      • key bottleneck in building applications
  • Will only be exacerbated
      • as data sharing & XML become pervasive
      • proliferation of DTDs
      • translation of legacy data
  • Need automatic approaches to scale up!

11
Our Approach
  • Use machine learning to match schemas
  • Basic idea
      1. create training data
          • manually map a set of sources to the mediated schema
      2. train system on training data
          • learns from
              • names of schema elements
              • format of values
              • frequency of words & symbols
              • characteristics of value distribution
              • proximity, position, structure, ...
      3. system proposes mappings for subsequent sources

12
Example
mediated schema: address, phone, price, description
realestate.com:
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
comments: Fantastic house ...; Great ...; Hurry! ...; ...
location: Seattle, WA; Seattle, WA; Dallas, TX; ...
listed-price: $250,000; $162,000; $180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...

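
The step from XML listings to per-element value columns can be sketched as follows. This is only an illustration built from the values on the slide: the `<listings>` wrapper, the leading `$` on prices, and the function name `columns_by_tag` are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample source data using the slide's values. The <listings> wrapper is an
# assumption added so the text parses as a single document.
SOURCE_XML = """
<listings>
  <house>
    <location>Seattle, WA</location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house ...</comments>
  </house>
  <house>
    <location>Seattle, WA</location>
    <agent-phone>(206) 321 4571</agent-phone>
    <listed-price>$162,000</listed-price>
    <comments>Great ...</comments>
  </house>
  <house>
    <location>Dallas, TX</location>
    <agent-phone>(214) 722 4035</agent-phone>
    <listed-price>$180,000</listed-price>
    <comments>Hurry! ...</comments>
  </house>
</listings>
"""

def columns_by_tag(xml_text):
    """Collect the text values of every leaf under <house>, grouped by tag."""
    columns = defaultdict(list)
    for house in ET.fromstring(xml_text).iter("house"):
        for leaf in house:
            columns[leaf.tag].append((leaf.text or "").strip())
    return dict(columns)

columns = columns_by_tag(SOURCE_XML)
for tag, values in columns.items():
    print(tag, values)
```

Each resulting column is what a learner would see for one source element when proposing a mediated-schema match.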
13
Multi-Strategy Learning
  • Use a set of base learners
      • each exploits certain types of information
  • Match schema elements of a new source
      • apply the learners
      • combine their predictions using a meta-learner
  • Meta-learner
      • measures base learner accuracy on training data
      • weighs each learner based on its accuracy
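
One simple way to read "weighs each learner based on its accuracy" is sketched below: measure each base learner's accuracy on labeled examples, then let each learner vote with that weight. This is an assumed simplification for illustration, not necessarily how the presented system computes its weights; the learner names and sample outputs are hypothetical.

```python
def accuracy_weights(learner_outputs, true_labels):
    """Weight each learner by the fraction of training labels it got right."""
    weights = {}
    for learner, predicted in learner_outputs.items():
        correct = sum(p == t for p, t in zip(predicted, true_labels))
        weights[learner] = correct / len(true_labels)
    return weights

def weighted_vote(votes, weights):
    """Pick the label whose supporting learners carry the most total weight."""
    totals = {}
    for learner, label in votes.items():
        totals[label] = totals.get(label, 0.0) + weights[learner]
    return max(totals, key=totals.get)

# Hypothetical outputs of two learners on four labeled training elements.
truth = ["address", "phone", "price", "description"]
outputs = {
    "name_matcher": ["address", "phone", "price", "address"],
    "naive_bayes": ["address", "phone", "price", "description"],
}
weights = accuracy_weights(outputs, truth)
choice = weighted_vote({"name_matcher": "description", "naive_bayes": "address"}, weights)
print(weights, choice)
```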

14
Learners
  • Input
      • schema information: name, proximity, structure, ...
      • data information: value, format, ...
  • Output
      • prediction weighted by confidence score
  • Example learners
      • name matcher
          • agent-name => (name,0.7), (phone,0.3)
      • Naive Bayes
          • Seattle, WA => (address,0.8), (name,0.2)
          • Great location ... => (description,0.9), (address,0.1)
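
A learner's output, as described above, can be represented as a mapping from mediated-schema element to confidence score. The sketch below shows one possible representation; the `Prediction` alias and both helper names are assumptions, not part of the presented system.

```python
from typing import Dict

# A prediction maps each candidate mediated-schema element to a confidence.
Prediction = Dict[str, float]

def normalize(scores: Dict[str, float]) -> Prediction:
    """Rescale non-negative scores so that the confidences sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        # No evidence at all: spread confidence evenly over the candidates.
        return {label: 1.0 / len(scores) for label in scores}
    return {label: score / total for label, score in scores.items()}

def top_match(prediction: Prediction) -> str:
    """Return the mediated-schema element with the highest confidence."""
    return max(prediction, key=prediction.get)

# The slide's name-matcher example for the source element agent-name.
agent_name = normalize({"name": 7, "phone": 3})
print(agent_name, top_match(agent_name))
```

Keeping every learner's output in the same shape is what lets a meta-learner combine them later.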

15
Training the Learners
realestate.com
mediated schema: address, phone, price, description
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
Source elements: location, listed-price, agent-phone, comments
Name Matcher training examples: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...
Naive Bayes training examples: ("Seattle, WA", address), ("(206) 729 0831", phone), ("$250,000", price), ("Fantastic house ...", description), ...

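
The two kinds of training example shown above can be derived from one manual mapping, as in this small sketch. The mapping is the one on the slide; the function name and the dictionary layout of the source columns are assumptions.

```python
# Manual mapping from realestate.com elements to mediated-schema elements,
# as given on the slide.
MANUAL_MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}

def training_sets(columns, mapping):
    """Build (name, label) pairs for a name matcher and (value, label) pairs
    for a content-based learner such as Naive Bayes."""
    name_examples = [(tag, mapping[tag]) for tag in columns]
    value_examples = [
        (value, mapping[tag]) for tag, values in columns.items() for value in values
    ]
    return name_examples, value_examples

# One listing's worth of values, keyed by source element.
columns = {
    "location": ["Seattle, WA"],
    "agent-phone": ["(206) 729 0831"],
    "listed-price": ["$250,000"],
    "comments": ["Fantastic house ..."],
}
name_examples, value_examples = training_sets(columns, MANUAL_MAPPING)
print(name_examples)
print(value_examples)
```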
16
Applying the Learned Models
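homes.com
mediated schema: address, phone, price, description
area: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA
[Figure: each area value is passed to the Name Matcher and Naive Bayes; the Meta-learner merges their predictions into one label per value (address, address, description, address); the Combiner merges these into a single label for area: address]
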
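
The slide does not say how the combiner merges per-value predictions into one label for the whole element. One plausible choice, shown here purely as an assumption, is to average the per-value confidence scores and take the highest average. The score values are hypothetical.

```python
from collections import defaultdict

def combine_column(instance_predictions):
    """Average per-value confidence scores into one prediction for the element."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}

# Hypothetical meta-learner outputs for the four area values on the slide;
# the third value is (wrongly) judged to look like a description.
area_predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.7, "description": 0.3},
    {"address": 0.4, "description": 0.6},
    {"address": 0.9, "description": 0.1},
]
combined = combine_column(area_predictions)
best_label = max(combined, key=combined.get)
print(combined, best_label)
```

Pooling over many values lets a single misjudged value be outvoted, which matches the outcome pictured on the slide.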
17
The LSD System
  • Base learners/modules
      • name matcher
      • Naive Bayes
      • WHIRL nearest-neighbor classifier
        [Cohen & Hirsh, KDD-98]
      • county-name recognizer
  • Meta-learner
      • stacking [Ting & Witten 99, Wolpert 92]

18
Name Matcher
  • Matches based on names
      • including all names on path from root to current node
      • allowing synonyms
  • Good for ...
      • specific, descriptive names: agent-phone, listed-price
  • Bad for ...
      • vacuous names: item, listings
      • partially specified, ambiguous names: office (for office phone)
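
A name matcher along these lines might look like the sketch below. It splits the root-to-node path into words, applies a synonym table, and scores each mediated-schema name by word overlap. The Jaccard measure, the synonym table, and the function names are all assumptions chosen for illustration; the slides do not specify the matching function.

```python
import re

# Hypothetical synonym table mapping source vocabulary to mediated-schema words.
SYNONYMS = {"location": "address", "comments": "description"}

def name_tokens(path):
    """Split a root-to-node path such as 'house/agent-phone' into a word set."""
    words = re.split(r"[^a-z0-9]+", path.lower())
    return {SYNONYMS.get(word, word) for word in words if word}

def name_match(source_path, mediated_names):
    """Score each mediated-schema name by Jaccard overlap of word sets."""
    source = name_tokens(source_path)
    raw = {}
    for name in mediated_names:
        target = name_tokens(name)
        union = source | target
        raw[name] = len(source & target) / len(union) if union else 0.0
    total = sum(raw.values())
    if total == 0:
        # Vacuous or ambiguous names carry no evidence: return a uniform guess.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: score / total for name, score in raw.items()}

MEDIATED = ["address", "phone", "price", "description"]
print(name_match("house/agent-phone", MEDIATED))
print(name_match("listings/item", MEDIATED))
```

The second call illustrates the "bad for" case: a vacuous name such as item shares no words with any mediated-schema element, so the matcher cannot prefer one.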

19
Naive Bayes Learner
  • Exploits frequencies of words & symbols
  • Good for ...
      • elements with words/symbols that are strongly indicative
      • examples
          • "fantastic" & "great" in house descriptions
          • $ in prices, parentheses in phone numbers
  • Bad for ...
      • short, numeric elements: num-baths, num-bedrooms
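
The following is a minimal multinomial Naive Bayes sketch over words and symbols, written to show why tokens like $ or parentheses are strongly indicative. The tokenizer, the add-one smoothing, and the class and method names are assumptions; the training values come from the earlier example slides, with prices assumed to carry a leading $.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(value):
    """Split a value into words, digit runs, and single symbols such as $ or (."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", value.lower())

class NaiveBayes:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()
        self.vocabulary = set()

    def fit(self, examples):
        for value, label in examples:
            tokens = tokenize(value)
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, value):
        """Return normalized confidence scores, using add-one smoothing."""
        tokens = tokenize(value)
        n_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary)
            score = math.log(count / n_examples)
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnormalized.values())
        return {l: u / z for l, u in unnormalized.items()}

model = NaiveBayes().fit([
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(206) 321 4571", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
phone_scores = model.predict("(214) 722 4035")
price_scores = model.predict("$180,000")
print(max(phone_scores, key=phone_scores.get), max(price_scores, key=price_scores.get))
```

Neither test value shares a number with the training data, so the decision rests on the symbols alone.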

20
WHIRL Nearest-Neighbor Classifier
  • Similarity-based
      • stores all examples seen so far
      • classifies a new example based on similarity to training examples
      • IR document similarity metric
  • Good for ...
      • long, textual elements: house description, names
      • limited, descriptive set of values: color (blue, red, ...)
  • Bad for ...
      • short, numeric elements: num-baths, num-bedrooms
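
The sketch below is not WHIRL itself. It is a simplified stand-in that illustrates the same idea: store every training example, represent values as TF-IDF vectors, and label a new value with the class of its most similar stored example under cosine similarity. The weighting formula, class name, and example values are assumptions.

```python
import math
import re
from collections import Counter

def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def _cosine(a, b):
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TfIdfNearestNeighbor:
    def fit(self, examples):
        """Store all (value, label) examples as TF-IDF vectors."""
        self.labels = [label for _, label in examples]
        docs = [_tokens(value) for value, _ in examples]
        self.n_docs = len(docs)
        self.doc_freq = Counter(t for doc in docs for t in set(doc))
        self.vectors = [self._vector(doc) for doc in docs]
        return self

    def _vector(self, tokens):
        # Term frequency times a smoothed inverse document frequency.
        counts = Counter(tokens)
        return {
            t: c * (math.log((1 + self.n_docs) / (1 + self.doc_freq[t])) + 1)
            for t, c in counts.items()
        }

    def predict(self, value):
        """Return (label, similarity) of the most similar stored example."""
        query = self._vector(_tokens(value))
        sims = [_cosine(query, vector) for vector in self.vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        return self.labels[best], sims[best]

knn = TfIdfNearestNeighbor().fit([
    ("Fantastic house with a great view", "description"),
    ("Great location, hurry!", "description"),
    ("blue", "color"),
    ("red", "color"),
    ("Seattle, WA", "address"),
])
print(knn.predict("great house near the lake"))
print(knn.predict("red"))
```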

21
County-Name Recognizer
  • Stores all county names, obtained from the Web
  • Verifies if the input name is a county name
  • Essential to matching a county-name element
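
A recognizer of this kind reduces to a dictionary lookup, as in the sketch below. The county list here is a tiny illustrative sample rather than the full list the slide describes, and the function names and the per-column confidence measure are assumptions.

```python
# A few real U.S. county names for illustration; a full recognizer would load
# the complete list.
KNOWN_COUNTIES = {"king", "pierce", "snohomish", "travis", "dallas"}

def is_county_name(value):
    """Check a single value against the stored county names."""
    name = value.strip().lower()
    if name.endswith(" county"):
        name = name[: -len(" county")]
    return name in KNOWN_COUNTIES

def county_confidence(values):
    """Fraction of an element's values that are recognized county names."""
    if not values:
        return 0.0
    return sum(is_county_name(v) for v in values) / len(values)

print(is_county_name("King County"), is_county_name("Seattle, WA"))
print(county_confidence(["King", "Pierce", "Travis County", "Fantastic house"]))
```

A narrow recognizer like this contributes nothing for most elements but gives decisive evidence for the one element type it knows.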

22
Meta-Learner: Stacking
  • Training
      • uses training data to learn weights
      • one for each (base learner, mediated-schema element)
  • Combining predictions
      • for each mediated-schema element
          • computes weighted sum of base-learner confidence scores
      • picks mediated-schema element with highest sum
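
The combining step above can be sketched directly: one weight per (base learner, mediated-schema element) pair, a weighted sum per element, and an argmax. The weights and confidence scores below are hypothetical placeholders; in stacking they would be learned from training data, a step this sketch does not implement.

```python
def stack(predictions, weights, labels):
    """Combine base-learner confidences with per-(learner, label) weights.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}
    Returns the winning label and the weighted sum for every label.
    """
    sums = {}
    for label in labels:
        sums[label] = sum(
            weights.get((learner, label), 0.0) * prediction.get(label, 0.0)
            for learner, prediction in predictions.items()
        )
    return max(sums, key=sums.get), sums

# Hypothetical learned weights and base-learner outputs for one source element.
WEIGHTS = {
    ("name_matcher", "address"): 0.3,
    ("name_matcher", "description"): 0.3,
    ("naive_bayes", "address"): 0.9,
    ("naive_bayes", "description"): 0.8,
}
PREDICTIONS = {
    "name_matcher": {"address": 0.4, "description": 0.6},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
best, sums = stack(PREDICTIONS, WEIGHTS, ["address", "description"])
print(best, sums)
```

Because the weights are indexed by element as well as learner, a learner can be trusted for some elements and discounted for others.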

23
Experiments
24
Reasons for Incorrect Matchings
  • Unfamiliarity
      • suburb
      • solution: add a suburb-name recognizer
  • Insufficient information
      • correctly identified the general type
      • failed to pinpoint the exact type
      • <agent-name> Richard Smith </agent-name> <phone> (206) 234 5412 </phone>
      • solution: add a proximity learner

25
Experiments Summary
  • Multi-strategy learning
      • better performance than any single learner
  • Accuracy of 100% unlikely to be reached
      • difficult even for humans
  • Lots of room for improvement
      • more learners
      • better learning algorithms

26
Related Work
  • Rule-based approaches
      • TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98]
      • utilize only schema information
  • Learner-based approaches
      • SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
      • employ a single learner, limited applicability

27
Future Work
[Figure: map of the problem. Source descriptions cover schema matching, data translation, scope, completeness, reliability, and query capability. Schema matching is divided into 1-1 mappings vs. complex mappings, and leaf elements vs. higher-level elements.]

28
Future Work
  • Improve matching accuracy
      • more learners, more domains
  • Incorporate domain knowledge
      • semantic integrity constraints
      • concept hierarchy of mediated-schema elements
  • Learn with structured data

29
Learning with Structured Data
  • Each example with >1 level of structure
  • Generative model for XML
  • XML classifier
  • XML as a killer app for relational learning

30
Summary
  • Schema matching
      • automated by learning
  • Multi-strategy learning is essential
      • handles different types of data
      • incorporates different types of domain knowledge
      • easy to incorporate new learners
      • alleviates effects of noise & dirty data
  • Implemented LSD
      • promising results with initial experiments