AnHai Doan - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Learning Source Descriptions for Data Integration
  • AnHai Doan
  • Pedro Domingos
  • Alon Levy
  • Department of Computer Science & Engineering, University of Washington

2
Overview
  • Problem definition
    • schema matching
  • Solution
    • multi-strategy learning
  • Prototype system
    • LSD (Learning Source Descriptions)
  • Experiments
  • Related work
  • Summary & future work

3
Data Integration
Find houses with four bathrooms and price under $500,000
[Figure: a mediated schema connected to three source schemas, one each for superhomes.com, realestate.com, and homeseekers.com]
4
Semantic Mappings between Schemas
  • Mediated & source schemas are XML DTDs

[Figure: two schema trees, each rooted at house. Element names in slide order: address, num-baths, amenities, contact (name, phone); location, contact-info, full-baths, half-baths, handicap-equipped, agent-name, agent-phone]
5
Map of the Problem
source descriptions
schema matching
data translation, scope, completeness, reliability, query capability
1-1 mappings
complex mappings
leaf elements
higher-level elements
6
Current State of Affairs
  • Largely done by hand
    • labor intensive & error prone
    • key bottleneck in building applications
  • Will only be exacerbated
    • data sharing & XML become pervasive
    • proliferation of DTDs
    • translation of legacy data
  • Need automatic approaches to scale up!

7
Our Approach
  • Use machine learning to match schemas
  • Basic idea
    • 1. create training data
      • manually map a set of sources to mediated schema
    • 2. train system on training data
      • learns from
        • name of schema elements
        • format of values
        • frequency of words & symbols
        • characteristics of value distribution
        • proximity, position, structure, ...
    • 3. system proposes mappings for subsequent sources

8
Example
mediated schema: address, phone, price, description
realestate.com
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
comments: Fantastic house ... | Great ... | Hurry! ... | ...
location: Seattle, WA | Seattle, WA | Dallas, TX | ...
listed-price: $250,000 | $162,000 | $180,000 | ...
agent-phone: (206) 729 0831 | (206) 321 4571 | (214) 722 4035 | ...
9
Multi-Strategy Learning
  • Use a set of base learners
    • each exploits certain types of information
  • Match schema elements of a new source
    • apply the learners
    • combine their predictions using a meta-learner
  • Meta-learner
    • measures base learner accuracy on training data
    • weighs each learner based on its accuracy
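
The weighting idea on this slide can be sketched as follows. This is a minimal illustration, not LSD's actual meta-learner: the function names `accuracy_weight` and `combine`, and all numbers in the example, are assumptions made for the sketch. Each base learner is assumed to return a dictionary mapping mediated-schema labels to confidence scores.

```python
from collections import defaultdict


def accuracy_weight(predict, labelled):
    """Fraction of labelled examples whose top-scoring label is correct."""
    correct = 0
    for item, label in labelled:
        scores = predict(item)
        if max(scores, key=scores.get) == label:
            correct += 1
    return correct / len(labelled)


def combine(predictions, weights):
    """Weighted sum of per-learner confidence scores, normalised to sum to 1."""
    totals = defaultdict(float)
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            totals[label] += weights[learner] * confidence
    norm = sum(totals.values())
    return {label: total / norm for label, total in totals.items()}


# Hypothetical weights and predictions, chosen only for illustration.
weights = {"name_matcher": 0.6, "frequency_learner": 0.9}
predictions = {
    "name_matcher": {"name": 0.7, "phone": 0.3},
    "frequency_learner": {"name": 0.2, "phone": 0.8},
}
combined = combine(predictions, weights)
```

Here the more accurate learner dominates, so `phone` ends up with the higher combined score even though the name matcher preferred `name`.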

10
Learners
  • Input
    • schema information: name, proximity, structure, ...
    • data information: value, format, ...
  • Output
    • prediction weighted by confidence score
  • Examples
    • Name matcher
      • agent-name => (name,0.7), (phone,0.3)
    • Frequency learner
      • "Seattle, WA" => (address,0.8), (name,0.2)
      • "Great location ..." => (description,0.9), (address,0.1)
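
To make the learner interface concrete, here is a toy name matcher that returns a confidence score for every mediated-schema element. The slides do not say how LSD's name matcher scores names, so plain string similarity from Python's standard `difflib` is swapped in as an assumption; the function name `name_matcher` and the `MEDIATED` list are likewise illustrative.

```python
from difflib import SequenceMatcher

# Mediated-schema element names taken from the example slides.
MEDIATED = ["address", "phone", "price", "description"]


def name_matcher(source_name, mediated=MEDIATED):
    """Score each mediated element by string similarity to the source name.

    Scores are normalised so they sum to 1 and can be read as confidences.
    """
    raw = {
        target: SequenceMatcher(None, source_name.lower(), target).ratio()
        for target in mediated
    }
    total = sum(raw.values())
    if total == 0:
        # No overlap at all: fall back to a uniform prediction.
        return {target: 1 / len(mediated) for target in mediated}
    return {target: score / total for target, score in raw.items()}


scores = name_matcher("agent-phone")
best = max(scores, key=scores.get)
```

Pure string similarity finds `phone` for `agent-phone`, but it has no way to connect `location` with `address`, which is one reason to learn from manually mapped sources and from data values as well.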

11
Training the Learners
realestate.com
mediated schema: address, phone, price, description
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
source elements: location, listed-price, agent-phone, comments
Name Matcher: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...
Frequency Learner: ("Seattle, WA", address), ("(206) 729 0831", phone), ("$250,000", price), ("Fantastic house ...", description), ...
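
The training pairs on this slide can be derived mechanically once a source has been mapped by hand. The sketch below assumes the listing is available as an XML string and that the manual mapping is a plain dictionary; `LISTING`, `MAPPING`, and `training_examples` are names introduced here for illustration and are not part of LSD.

```python
import xml.etree.ElementTree as ET

# One listing in the style of the realestate.com example.
LISTING = """<house>
  <location>Seattle, WA</location>
  <agent-phone>(206) 729 0831</agent-phone>
  <listed-price>$250,000</listed-price>
  <comments>Fantastic house ...</comments>
</house>"""

# Manually specified mapping: source element -> mediated-schema element.
MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}


def training_examples(xml_text, mapping):
    """Return (name pairs, value pairs) for name- and data-based learners."""
    root = ET.fromstring(xml_text)
    name_pairs = list(mapping.items())
    value_pairs = []
    for element in root.iter():
        text = (element.text or "").strip()
        if element.tag in mapping and text:
            value_pairs.append((text, mapping[element.tag]))
    return name_pairs, value_pairs


name_pairs, value_pairs = training_examples(LISTING, MAPPING)
```

The name pairs feed a learner that looks only at element names, while the value pairs feed learners that look at the data itself.
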
12
Applying the Learners
homes.com
mediated schema: address, phone, price, description
area: Seattle, WA | Kent, WA | Austin, TX | Seattle, WA
Name Matcher, Frequency Learner => Meta-learner => per-value predictions: address, address, description, address
Combiner => address
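
A minimal sketch of the final combining step, assuming the combiner simply averages the per-value predictions for one source element and picks the top label. The slide does not specify the combination rule, so averaging is an assumption, as are the function name `combine_instances` and the confidence numbers below.

```python
from collections import defaultdict


def combine_instances(instance_predictions):
    """Average per-value predictions into one prediction for the element."""
    totals = defaultdict(float)
    for scores in instance_predictions:
        for label, confidence in scores.items():
            totals[label] += confidence
    count = len(instance_predictions)
    averaged = {label: total / count for label, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


# Hypothetical meta-learner outputs for the four values of "area".
area_predictions = [
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.7, "description": 0.3},  # Kent, WA
    {"address": 0.4, "description": 0.6},  # Austin, TX (misclassified)
    {"address": 0.9, "description": 0.1},  # Seattle, WA
]
best, averaged = combine_instances(area_predictions)
```

One misclassified value does not change the outcome: `area` is still matched to `address`.
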
13
The LSD System
  • Base learners/modules
    • name matcher
    • Naive Bayesian learner
    • Whirl nearest-neighbor classifier [Cohen & Hirsh, KDD-98]
    • county-name recognizer
  • Meta-learner
    • uses stacking [Ting & Witten 99, Wolpert 92]
    • uses training data to learn weights for base learners
    • combines predictions using confidence scores/weights
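
As an illustration of one base learner, here is a small multinomial Naive Bayes classifier over word and symbol tokens with add-one smoothing. It is a textbook sketch under stated assumptions, not LSD's implementation: the class name `NaiveBayesLearner`, the tokenisation rule, and the tiny training set are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesLearner:
    """Multinomial Naive Bayes over word, number, and symbol tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()
        self.vocabulary = set()

    @staticmethod
    def tokens(text):
        # Words, digit runs, and single non-space symbols are all tokens.
        return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for token in self.tokens(text):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def predict(self, text):
        """Return normalised posterior scores for every known label."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary)
            score = math.log(count / total)
            for token in self.tokens(text):
                score += math.log((self.token_counts[label][token] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        scaled = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(scaled.values())
        return {label: value / norm for label, value in scaled.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
posterior = learner.predict("Kent, WA")
```

Even though "Kent" never appears in training, the comma and the state abbreviation are enough to make `address` the top label.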

14
Experiments
15
Related Work
  • Rule-based approaches
    • TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98]
    • utilize only schema information
  • Learner-based approaches
    • SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
    • employ a single learner, limited applicability
  • Multi-strategy learning in other domains
    • series of workshops [91, 93, 96, 98, 00]
    • [Freitag 98], Proverb [Keim et al. 99]

16
Summary
  • Schema matching
    • automated by learning
  • Multi-strategy learning is essential
    • handles different types of data
    • incorporates different types of domain knowledge
    • easy to incorporate new learners
    • alleviates effects of noise & dirty data
  • Implemented LSD
    • promising results with initial experiments

17
Future Work
source descriptions
schema matching
data translation, scope, completeness, reliability, query capability
1-1 mappings
complex mappings
leaf elements
higher-level elements