AnHai Doan

Title: AnHai Doan

1
Schema & Ontology Matching: Current Research Directions

AnHai Doan
Database and Information System Group
University of Illinois, Urbana Champaign
Spring 2004

2
Road Map

Schema Matching
motivation & problem definition
representative current solutions: LSD, iMAP, Clio
broader picture
Ontology Matching
motivation & problem definition
representative current solution: GLUE
broader picture
Conclusions & Emerging Directions

3
Motivation: Data Integration
Find houses with 2 bedrooms priced under 200K
New faculty member
homes.com
realestate.com
homeseekers.com
4
Architecture of Data Integration System
Find houses with 2 bedrooms priced under 200K
mediated schema
source schema 2
source schema 3
source schema 1
homes.com
realestate.com
homeseekers.com
5
Semantic Matches between Schemas
Mediated schema: price, agent-name, address
homes.com schema: listed-price, contact-name, city, state
homes.com data:
320K, Jane Brown, Seattle, WA
240K, Mike Smith, Miami, FL
Match types shown: 1-1 match, complex match
6
Schema Matching is Ubiquitous!

Fundamental problem in numerous applications
Databases
data integration
data translation
schema/view integration
data warehousing
semantic query processing
model management
peer data management
AI
knowledge bases, ontology merging, information
gathering agents, ...
Web
e-commerce
marking up data using ontologies (e.g., on
Semantic Web)

7
Why Schema Matching is Difficult

Schema & data never fully capture semantics!
not adequately documented
schema creator has retired to Florida!
Must rely on clues in schema & data
using names, structures, types, data values, etc.
Such clues can be unreliable
same names => different entities: area => location or square-feet
different names => same entity: area & address => location
Intended semantics can be subjective
house-style = house-description?
military applications require committees to decide!
Cannot be fully automated, needs user feedback!

8
Current State of Affairs

Finding semantic mappings is now a key bottleneck!
largely done by hand
labor intensive & error prone
data integration at GTE [Li & Clifton, 2000]
40 databases, 27000 elements, estimated time 12 years
Will only be exacerbated
data sharing becomes pervasive
translation of legacy data
Need semi-automatic approaches to scale up!
Many research projects in the past few years
Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
AI: Stanford, Karlsruhe University, NEC Japan, ...

9
Road Map

Schema Matching
motivation & problem definition
representative current solutions: LSD, iMAP, Clio
broader picture
Ontology Matching
motivation & problem definition
representative current solution: GLUE
broader picture
Conclusions & Emerging Directions

10
LSD

Learning Source Description
Developed at Univ of Washington 2000-2001
with Pedro Domingos and Alon Halevy
Designed for data integration settings
has been adapted to several other contexts
Desirable characteristics
learn from previous matching activities
exploit multiple types of information in schema
and data
incorporate domain integrity constraints
handle user feedback
achieves high matching accuracy (66 -- 97%) on real-world data

11
Schema Matching for Data Integration: the LSD Approach

Suppose user wants to integrate 100 data
sources
1. User
manually creates matches for a few sources, say 3
shows LSD these matches
2. LSD learns from the matches
3. LSD predicts matches for remaining 97 sources

12
Learning from the Manual Matches
Mediated schema: price, agent-name, agent-phone, office-phone, description
Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
realestate.com data:
250K, James Smith, (305) 729 0831, (305) 616 1822, Fantastic house
320K, Mike Doan, (617) 253 1429, (617) 112 2315, Great location
Learned rules:
If "office" occurs in name => office-phone
If "fantastic" & "great" occur frequently in data instances => description
homes.com schema: sold-at, contact-agent, extra-info
homes.com data:
350K, (206) 634 9435, Beautiful yard
230K, (617) 335 4243, Close to Seattle
13
Must Exploit Multiple Types of Information!
14
Multi-Strategy Learning

Use a set of base learners
each exploits well certain types of information
To match a schema element of a new source
apply base learners
combine their predictions using a meta-learner
Meta-learner
uses training sources to measure base learner
accuracy
weighs each learner based on its accuracy

15
Base Learners

Name Learner
training: (location, address), (contact name, name)
matching: agent-name => (name,0.7), (phone,0.3)
Naive Bayes Learner
training: ("Seattle, WA", address), ("250K", price)
matching: "Kent, WA" => (address,0.8), (name,0.2)
labels weighted by confidence score
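A minimal sketch of how a Naive Bayes base learner of this kind might turn a data instance into labels weighted by confidence. The tokenizer, the Laplace smoothing, and the two extra training rows are assumptions made for illustration; the slide only gives the (instance, label) training pairs and the shape of the output.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a data instance and split it into word/number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesLearner:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of instances
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return {label: confidence}, normalized so the scores sum to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Laplace-smoothed token likelihoods (an assumed smoothing choice).
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        peak = max(log_scores.values())
        exp = {label: math.exp(s - peak) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Seattle, WA", "address"),
    ("250K", "price"),
    ("Miami, FL", "address"),  # hypothetical extra rows, not from the slide
    ("320K", "price"),
])
prediction = learner.predict("Kent, WA")
```

With this toy training set, "Kent, WA" leans toward address because it shares the token "wa" with the address examples; the exact confidences depend on the assumed smoothing and will not match the slide's 0.8 / 0.2.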
16
The LSD Architecture
Training Phase
Mediated schema, Source schemas => Training data for base learners
Base-Learner1 .... Base-Learnerk => Hypothesis1 .... Hypothesisk
Meta-Learner => Weights for Base Learners
Matching Phase
Base-Learner1 .... Base-Learnerk, Meta-Learner => Predictions for instances
Prediction Combiner => Predictions for elements
Constraint Handler, Domain constraints => Mappings
17
Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com schema: location, price, contact-name, contact-phone, office, comments
realestate.com data:
Miami, FL; 250K; James Smith; (305) 729 0831; (305) 616 1822; Fantastic house
Boston, MA; 320K; Mike Doan; (617) 253 1429; (617) 112 2315; Great location
18
Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]

Training
uses training data to learn weights
one for each (base-learner, mediated-schema element) pair
weight(Name-Learner, address) = 0.2
weight(Naive-Bayes, address) = 0.8
Matching: combine predictions of base learners
computes weighted average of base-learner confidence scores

Example for area (instances: Seattle, WA; Kent, WA; Bend, OR)
Name Learner: (address,0.4)
Naive Bayes: (address,0.9)
Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
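The matching-time combination on this slide can be sketched in a few lines. The function name `combine` and the dictionary layout are hypothetical; the weights and scores are the ones shown for address.

```python
def combine(base_predictions, weights, label):
    """Weighted sum of base-learner confidence scores for one label."""
    return sum(weights[(learner, label)] * scores.get(label, 0.0)
               for learner, scores in base_predictions.items())

# One weight per (base-learner, mediated-schema element) pair, as on the slide.
weights = {("Name-Learner", "address"): 0.2, ("Naive-Bayes", "address"): 0.8}
base_predictions = {
    "Name-Learner": {"address": 0.4},
    "Naive-Bayes": {"address": 0.9},
}
score = combine(base_predictions, weights, "address")  # 0.4*0.2 + 0.9*0.8
```

Because the two weights sum to 1, the weighted sum is also a weighted average, which is how the slide describes it.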
19
The LSD Architecture
20
Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
area
instances: Seattle, WA; Kent, WA; Bend, OR
Name Learner + Naive Bayes => Meta-Learner, one prediction per instance:
(address,0.8), (description,0.2)
(address,0.6), (description,0.4)
(address,0.7), (description,0.3)
Prediction-Combiner: (address,0.7), (description,0.3)
Predictions for the remaining homes.com elements
sold-at: (price,0.9), (agent-phone,0.1)
contact-agent: (agent-phone,0.9), (description,0.1)
extra-info: (address,0.6), (description,0.4)
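A small sketch of the prediction-combiner step for area, reading the three listed predictions as the meta-learner's output for the three instances. Plain averaging is an assumption here; it happens to reproduce the (address,0.7), (description,0.3) result on the slide, but the slide does not name the combination rule.

```python
from collections import defaultdict

def combine_instances(instance_predictions):
    """Average per-instance label scores into one prediction for the element."""
    totals = defaultdict(float)
    for pred in instance_predictions:
        for label, score in pred.items():
            totals[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}

# Per-instance predictions for the instances of "area".
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instances(area_instances)
```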
21
Domain Constraints

Encode user knowledge about domain
Specified only once, by examining mediated schema
Examples
at most one source-schema element can match
address
if a source-schema element matches house-id then
it is a key
avg-value(price) > avg-value(num-baths)
Given a mapping combination
can verify if it satisfies a given constraint

area: address, sold-at: price, contact-agent: agent-phone, extra-info: address
22
The Constraint Handler
Predictions from Prediction Combiner
area: (address,0.7), (description,0.3)
sold-at: (price,0.9), (agent-phone,0.1)
contact-agent: (agent-phone,0.9), (description,0.1)
extra-info: (address,0.6), (description,0.4)
Domain Constraints: At most one element matches address
Mapping combinations and their scores
0.3 * 0.1 * 0.1 * 0.4 = 0.0012
area: address, sold-at: price, contact-agent: agent-phone, extra-info: description => 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
area: address, sold-at: price, contact-agent: agent-phone, extra-info: address => 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (violates the constraint: two elements match address)

Searches space of mapping combinations efficiently
Can handle arbitrary constraints
Also used to incorporate user feedback
sold-at does not match price
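The search over mapping combinations can be illustrated with an exhaustive sketch. This is brute-force enumeration, not the efficient search the slide refers to, and the function names are hypothetical; the scores and the constraint are taken from the slide.

```python
from itertools import product
from math import prod

predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

def at_most_one_address(mapping):
    """Domain constraint: at most one source element matches address."""
    return sum(1 for label in mapping.values() if label == "address") <= 1

def best_mapping(predictions, constraints):
    """Try every combination; keep the highest-scoring one that passes all constraints."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

mapping, score = best_mapping(predictions, [at_most_one_address])

# User feedback can be expressed as one more constraint.
feedback = lambda m: m["sold-at"] != "price"
mapping_fb, score_fb = best_mapping(predictions, [at_most_one_address, feedback])
```

Treating feedback as just another predicate keeps one code path for both domain constraints and user corrections, which matches the slide's remark that the handler is also used to incorporate feedback.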

23
The Current LSD System

Can also handle data in XML format
matches XML DTDs
Base learners
Naive Bayes [Duda & Hart 93, Domingos & Pazzani 97]
exploits frequencies of words & symbols
WHIRL Nearest-Neighbor Classifier [Cohen & Hirsh, KDD-98]
employs information-retrieval similarity metric
Name Learner [SIGMOD-01]
matches elements based on their names
County-Name Recognizer [SIGMOD-01]
stores all U.S. county names
XML Learner [SIGMOD-01]
exploits hierarchical structure of XML data

24
Empirical Evaluation

Four domains
Real Estate I & II, Course Offerings, Faculty Listings
For each domain
created mediated schema & domain constraints
chose five sources
extracted & converted data into XML
mediated schemas: 14 - 66 elements, source schemas: 13 - 48

Ten runs for each domain, in each run
manually provided 1-1 matches for 3 sources
asked LSD to propose matches for remaining 2 sources
accuracy = % of 1-1 matches correctly identified

25
High Matching Accuracy
Average Matching Accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
26
Contribution of Schema vs. Data
Average matching accuracy (%)

LSD with only schema info.
LSD with only data info.
Complete LSD

More experiments in [Doan et al., SIGMOD-01]
27
LSD Summary

LSD
learns from previous matching activities
exploits multiple types of information
by employing multi-strategy learning
incorporates domain constraints & user feedback
achieves high matching accuracy
LSD focuses on 1-1 matches
Next challenge: discover more complex matches!
iMAP (illinois Mapping) system [SIGMOD-04]
developed at Washington and Illinois, 2002-2004
with Robin Dhamankar, Yoonkyong Lee, Alon Halevy, Pedro Domingos

28
The iMAP Approach
Mediated schema: price, num-baths, address
homes.com schema: listed-price, agent-id, full-baths, half-baths, city, zipcode

For each mediated-schema element
searches space of all matches
finds a small set of likely match candidates
uses LSD to evaluate them
To search efficiently
employs a specialized searcher for each element
type
Text Searcher, Numeric Searcher, Category
Searcher, ...

29
The iMAP Architecture [SIGMOD-04]
Inputs: Mediated schema, Source schema + data
Searcher1, Searcher2, ..., Searcherk => Match candidates
Base-Learner1 .... Base-Learnerk, Meta-Learner => Similarity Matrix
Match selector => 1-1 and complex matches
Also shown: Domain knowledge and data, Explanation module, User
30
An Example Text Searcher

Beam search in space of all concatenation matches
Example: find match candidates for address

Mediated schema: price, num-baths, address
homes.com schema: listed-price, agent-id, full-baths, half-baths, city, zipcode
homes.com data:
320K, 532a, 2, 1, Seattle, 98105
240K, 115c, 1, 1, Miami, 23591
Candidate concatenations and their values:
concat(agent-id,zipcode): 532a 98105, 115c 23591
concat(city,zipcode): Seattle 98105, Miami 23591
concat(agent-id,city): 532a Seattle, 115c Miami

Best match candidates for address
(agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)
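A toy sketch of beam search over concatenation candidates for address, using the two homes.com rows from the slide. The scoring function `address_score` is a hand-written stand-in and purely an assumption (iMAP scores candidates with learned models), so the numbers it produces do not correspond to the 0.7 / 0.75 / 0.9 scores above.

```python
import re

rows = [
    {"listed-price": "320K", "agent-id": "532a", "full-baths": "2",
     "half-baths": "1", "city": "Seattle", "zipcode": "98105"},
    {"listed-price": "240K", "agent-id": "115c", "full-baths": "1",
     "half-baths": "1", "city": "Miami", "zipcode": "23591"},
]

def address_score(value):
    """Toy scorer (assumed): reward a city-like word and a 5-digit zip, penalize other tokens."""
    tokens = value.split()
    is_zip = lambda t: re.fullmatch(r"\d{5}", t) is not None
    has_word = any(t.isalpha() for t in tokens)
    has_zip = any(is_zip(t) for t in tokens)
    noise = sum(1 for t in tokens if not t.isalpha() and not is_zip(t))
    return 0.5 * has_word + 0.5 * has_zip - 0.1 * noise

def score(columns):
    """Average score of the concatenated column values over all rows."""
    values = [" ".join(row[c] for c in columns) for row in rows]
    return sum(address_score(v) for v in values) / len(values)

def beam_search(all_columns, beam_width=2, max_len=2):
    """Grow concatenations one column at a time, keeping only the top candidates."""
    beam = sorted(((score((c,)), (c,)) for c in all_columns),
                  key=lambda x: -x[0])[:beam_width]
    best = list(beam)
    for _ in range(max_len - 1):
        expanded = [(score(cols + (c,)), cols + (c,))
                    for _, cols in beam for c in all_columns if c not in cols]
        beam = sorted(expanded, key=lambda x: -x[0])[:beam_width]
        best = sorted(best + beam, key=lambda x: -x[0])[:beam_width]
    return best

candidates = beam_search(list(rows[0]))
```

The beam keeps the search tractable: only the best `beam_width` partial concatenations are extended at each step instead of every combination of columns.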

31
Empirical Evaluation

Current iMAP system
12 searchers
Four real-world domains
real estate, product inventory, cricket,
financial wizard
target schema: 19 -- 42 elements, source schema: 32 -- 44
Accuracy: 43 -- 92%
Sample discovered matches
agent-name = concat(first-name,last-name)
area = building-area / 43560
discount-cost = (unit-price * quantity) * (1 - discount)
More detail in [Dhamankar et al., SIGMOD-04]

32
Observations

Finding complex matches much harder than 1-1 matches!
require gluing together many components
e.g., num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms
if missing one component => incorrect match
However, even partial matches are already very useful!
so are top-k matches => need methods to handle partial/top-k matches
Huge/infinite search spaces
domain knowledge plays a crucial role!
Matches are fairly complex, hard to know if they are correct
must be able to explain matches
Human must be fairly active in the loop
need strong user interaction facilities
Break matching architecture into multiple "atomic" boxes!

33
Road Map

Schema Matching
motivation & problem definition
representative current solutions: LSD, iMAP, Clio
broader picture
Ontology Matching
motivation & problem definition
representative current solution: GLUE
broader picture
Conclusions & Emerging Directions

34
Finding Matches is only Half of the Job!

To translate data/queries, need mappings, not matches

Schema S
HOUSES (location, price ($), agent-id)
Atlanta, GA; 360,000; 32
Raleigh, NC; 430,000; 15
AGENTS (id, name, city, state, fee-rate)
32; Mike Brown; Athens; GA; 0.03
15; Jean Laup; Raleigh; NC; 0.04

Schema T
LISTINGS (area, list-price, agent-address, agent-name)
Denver, CO; 550,000; Boulder, CO; Laura Smith
Atlanta, GA; 370,800; Athens, GA; Mike Brown

Mappings
area = SELECT location FROM HOUSES
agent-address = SELECT concat(city,state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
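To make the difference between a match and an executable mapping concrete, here is a sketch that loads the HOUSES and AGENTS rows into an in-memory SQLite database and runs the list-price mapping. Folding the three mappings into a single joined query and renaming columns with underscores are assumptions made so the example runs; the slide lists the mappings separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
INSERT INTO houses VALUES ('Atlanta, GA', 360000, 32), ('Raleigh, NC', 430000, 15);
INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                          (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# Populate LISTINGS-shaped rows from schema S via the join path agent_id = id.
listings = conn.execute("""
    SELECT h.location                  AS area,
           h.price * (1 + a.fee_rate)  AS list_price,
           a.city || ', ' || a.state   AS agent_address,
           a.name                      AS agent_name
    FROM houses AS h JOIN agents AS a ON h.agent_id = a.id
    ORDER BY h.price
""").fetchall()
conn.close()
```

The Atlanta row comes out with a list price of about 370,800 and agent Mike Brown in Athens, GA, which agrees with the corresponding LISTINGS row on the slide.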

35
Clio: Elaborating Matches into Mappings

Developed at Univ of Toronto & IBM Almaden, 2000-2003
by Renee Miller, Laura Haas, Mauricio Hernandez, Lucian Popa, Howard Ho, Ling Yan, Ron Fagin
Given a match
list-price = price * (1 + fee-rate)
Refine it into a mapping
list-price = SELECT price * (1 + fee-rate) FROM HOUSES (FULL OUTER JOIN) AGENTS WHERE agent-id = id
Need to discover
the correct join path among tables, e.g., agent-id = id
the correct join, e.g., full outer join? inner join?
Use heuristics to decide
when in doubt, ask users
employ sophisticated user interaction methods [VLDB-00, SIGMOD-01]

36
Clio: Illustrating Examples

37
Road Map

Schema Matching
motivation & problem definition
representative current solutions: LSD, iMAP, Clio
broader picture
Ontology Matching
motivation & problem definition
representative current solution: GLUE
broader picture
Conclusions & Emerging Directions

38
Broader Picture: Find Matches
Hand-crafted rules, exploit schema, 1-1 matches
TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
Single learner, exploit data, 1-1 matches
SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95], DELTA [Clifton et al. 97], AutoMatch, Autoplex [Berlin & Motro, 01-03]
Learners + rules, use multi-strategy learning, exploit schema + data, 1-1 + complex matches, exploit domain constraints
LSD [Doan et al., SIGMOD-01], iMAP [Dhamankar et al., SIGMOD-04]
Other Important Works
COMA by Erhard Rahm group
David Embley group at BYU
Jaewoo Kang group at NCSU
Kevin Chang group at UIUC
Clement Yu group at UIC
More about some of these works soon ....
39
Broader Picture: From Matches to Mappings
Rules, exploit data, powerful user interaction
CLIO [Miller et al., 00], [Yan et al. 01]
Learners + rules, exploit schema + data, 1-1 + complex matches, automate as much as possible
iMAP [Dhamankar et al., SIGMOD-04]
?
40
Road Map

Schema Matching
motivation & problem definition
representative current solutions: LSD, iMAP, Clio
broader picture
Ontology Matching
motivation & problem definition
representative current solution: GLUE
broader picture
Conclusions & Emerging Directions

41
Ontology Matching

Increasingly critical for
knowledge bases, Semantic Web
An ontology
concepts organized into a taxonomy tree
each concept has
a set of attributes
a set of instances
relations among concepts
Matching
concepts
attributes
relations

CS Dept. US
Entity
Undergrad Courses
Grad Courses
People
Staff
Faculty
Assistant Professor
Associate Professor
Professor
name: Mike Burns, degree: Ph.D.
42
Matching Taxonomies of Concepts
CS Dept. Australia
Entity
Courses
Staff
Technical Staff
Academic Staff
Senior Lecturer
Lecturer
Professor
43
Glue

Solution
Use data instances extensively
Learn classifiers using information within
taxonomies
Use a rich constraint satisfaction scheme
[Doan, Madhavan, Domingos, Halevy, WWW-2002]

44
Concept Similarity
Concept S
Concept A
Hypothetical universe of all examples
Joint Probability Distribution
P(A,S), P(¬A,S), P(A,¬S), P(¬A,¬S)

Multiple Similarity measures in terms of the JPD

45
Machine Learning for Computing Similarities
Taxonomy 1
Taxonomy 2

JPD estimated by counting the sizes of the
partitions
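A sketch of estimating the joint probability distribution by counting partition sizes and then deriving a similarity from it. The Jaccard coefficient is used as one example of a measure defined in terms of the JPD; which measures the slide has in mind is not stated, and the membership counts below are made up for illustration.

```python
from collections import Counter

def estimate_jpd(memberships):
    """memberships: iterable of (in_A, in_S) booleans, one per instance.
    Returns the four joint probabilities as relative partition sizes."""
    counts = Counter(memberships)
    n = sum(counts.values())
    cells = [(True, True), (True, False), (False, True), (False, False)]
    return {cell: counts[cell] / n for cell in cells}

def jaccard(jpd):
    """P(A and S) / P(A or S), written in terms of the JPD cells."""
    union = jpd[(True, True)] + jpd[(True, False)] + jpd[(False, True)]
    return jpd[(True, True)] / union if union else 0.0

# Hypothetical partition sizes: 3 instances in both A and S, 1 in A only,
# 2 in S only, 4 in neither.
memberships = ([(True, True)] * 3 + [(True, False)] * 1 +
               [(False, True)] * 2 + [(False, False)] * 4)
jpd = estimate_jpd(memberships)
sim = jaccard(jpd)
```

In practice the membership flags for instances of the other taxonomy would come from a learned classifier rather than being known, which is where the machine learning on this slide enters.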

46
The Glue System
Inputs: Taxonomy O1 (tree structure + data instances), Taxonomy O2 (tree structure + data instances)
Distribution Estimator (Base Learner, Base Learner, Meta Learner) => Joint Probability Distribution P(A,B), P(A,¬B), ...
Similarity Estimator (Similarity Function) => Similarity Matrix
Relaxation Labeling (Common Knowledge & Domain Constraints) => Matches for O1, Matches for O2
47
Constraints in Taxonomy Matching

Domain-dependent
at most one node matches department-chair
a node that matches professor can not be a child
of a node that matches assistant-professor
Domain-independent
two nodes match if parents & children match
if all children of X match Y, then X also matches Y
Variations have been exploited in many restricted settings [Melnik & Garcia-Molina, ICDE-02], [Milo & Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
Challenge: find a general & efficient approach

48
Solution: Relaxation Labeling

Relaxation labeling [Hummel & Zucker, 83]
applied to graph labeling in vision, NLP, hypertext classification
finds best label assignment, given a set of constraints
starts with initial label assignment
iteratively improves labels, using constraints
Standard relax. labeling not applicable
extended it in many ways [Doan et al., WWW-02]
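The "start with an initial assignment, then iteratively improve it using constraints" loop can be sketched with a simple textbook-style update. This is not GLUE's extended algorithm: the update rule, the compatibility function, and the two-node example (including the child-to-parent fragment of the target taxonomy) are all assumptions for illustration.

```python
def relaxation_labeling(probs, neighbors, compat, iterations=10):
    """probs: {node: {label: p}}, neighbors: {node: [nodes]},
    compat(label, neighbor_label) -> value in [0, 1]."""
    for _ in range(iterations):
        updated = {}
        for node, dist in probs.items():
            nbrs = neighbors.get(node, [])
            raw = {}
            for label, p in dist.items():
                # Support: how strongly the neighbors' current labels agree with `label`.
                support = sum(compat(label, other) * q
                              for nb in nbrs
                              for other, q in probs[nb].items())
                raw[label] = p * (1.0 + support / max(len(nbrs), 1))
            total = sum(raw.values())
            updated[node] = {label: v / total for label, v in raw.items()}
        probs = updated  # synchronous update: all nodes use the previous round
    return probs

# Hypothetical fragment of the target taxonomy: child -> parent.
target_parent = {"Lecturer": "Academic Staff"}

def compat(label, other):
    """1 if the two labels are parent and child in the target taxonomy, else 0."""
    linked = target_parent.get(label) == other or target_parent.get(other) == label
    return 1.0 if linked else 0.0

initial = {
    "Faculty": {"Academic Staff": 0.5, "Technical Staff": 0.5},
    "Assistant Professor": {"Lecturer": 0.9, "Technical Staff": 0.1},
}
neighbors = {"Faculty": ["Assistant Professor"],
             "Assistant Professor": ["Faculty"]}
result = relaxation_labeling(initial, neighbors, compat)
```

In this toy run the undecided node Faculty is pulled toward Academic Staff because its child is confidently labeled Lecturer, which illustrates how neighborhood constraints sharpen an initially ambiguous assignment.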

49
Real World Experiments

Taxonomies on the web
University organization (UW and Cornell)
Colleges, departments and sub-fields
Companies (Yahoo and The Standard)
Industries and Sectors
For each taxonomy
Extract data instances: course descriptions, company profiles
Trivial data cleaning
100 - 300 concepts per taxonomy
3-4 depth of taxonomies
10-90 data instances per concept
Evaluation against manual mappings as the gold standard

50
Glue's Performance
University Depts 1
Company Profiles
University Depts 2
51
Broader Picture

Ontology matching parallels the development of
schema matching
rule-based & learning-based approaches
PROMPT family, OntoMorph, OntoMerge, Chimaera, Onion, OBSERVER, FCAMerge, ...
extensive work by Ed Hovy's group
ontology versioning (e.g., by Noy et al.)
More powerful user interaction methods
e.g., iPROMPT, Chimaera
Much more theoretical work in this area

52
Road Map

Schema Matching
motivation & problem definition
representative current solutions: LSD, iMAP, Clio
broader picture
Ontology Matching
motivation & problem definition
representative current solution: GLUE
broader picture
Conclusions & Emerging Directions

53
Develop the Theoretical Foundation

Not much is going on, however ...
see works by Alon Halevy (AAAI-02) and Phil
Bernstein (in model management contexts)
some preliminary work in AnHai Doan's Ph.D.
dissertation
work by Stuart Russell and other AI people on
identity uncertainty is potentially relevant
Most likely foundation
probability framework

54
Need Much More Domain Knowledge

Where to get it?
past matches (e.g., LSD, iMAP)
other schemas in the domain
holistic matching approach by Kevin Chang group
SIGMOD-02
corpus-based matching by Alon Halevy group
IJCAI-03
clustering to achieve bridging effects by Clement
Yu group SIGMOD-04
external data (e.g., iMAP at SIGMOD-04)
mass of users (e.g., MOBS at WebDB-03)
How to get it and how to use it?
no clear answer yet

55
Employ Multi-Module Architecture

Many "black boxes", each is good at doing a
single thing
Combine them and tailor them to each application
Examples
LSD, iMAP, COMA, David Embley's systems
Open issues
what are these black boxes?
how to build them?
how to combine them?

56
Powerful User Interaction

Minimize user effort, maximize its impact
Make it very easy for users to
supply domain knowledge
provide feedback on matches/mappings
Develop powerful explanation facilities

57
Other Issues

What to do with partial/top-k matches?
Meaning negotiation
Fortifying schemas for interoperability
Very-large-scale matching scenarios (e.g., the
Web)
What can we do without the mappings?
Interaction between schema matching and tuple
matching?
Benchmarks, tools?

58
Summary

Schema/ontology matching key to
numerous data management problems
much attention in the database, AI, Semantic Web
communities
Simple problem definition, yet very difficult to
do
no satisfactory solution yet
AI complete?
We now understand the problems much better
still at the beginning of the journey
will need techniques from multiple fields

59
Backup Slides
61
Training the Meta-Learner

For address

Extracted XML instances, with Name Learner score, Naive Bayes score, and true prediction:
<location> Miami, FL </>: 0.5, 0.8, 1
<listed-price> 250,000 </>: 0.4, 0.3, 0
<area> Seattle, WA </>: 0.3, 0.9, 1
<house-addr> Kent, WA </>: 0.6, 0.8, 1
<num-baths> 3 </>: 0.3, 0.3, 0
...
Least-Squares Linear Regression
Weight(Name-Learner,address) = 0.1
Weight(Naive-Bayes,address) = 0.9
62
Sensitivity to Amount of Available Data
Average matching accuracy (%)
Number of data listings per source (Real Estate I)
63
Contribution of Each Component
Average Matching Accuracy (%)
Without Name Learner, Without Naive Bayes, Without Whirl Learner, Without Constraint Handler, The complete LSD system
64
Exploiting Hierarchical Structure

Existing learners flatten out all structures
Developed XML learner
similar to the Naive Bayes learner
input instance bag of tokens
differs in one crucial aspect
consider not only text tokens, but also structure
tokens

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
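A sketch of what "text tokens plus structure tokens" could look like for the two instances above. Using raw tag names and parent-child tag pairs as the structure tokens is an assumption made for illustration; the slide does not spell out how structure tokens are formed.

```python
import re
import xml.etree.ElementTree as ET

def tokens_with_structure(xml_text):
    """Bag of tokens: words, plus one token per tag and per parent-child edge."""
    root = ET.fromstring(xml_text)
    tokens = []

    def visit(node, parent_tag=None):
        tokens.append(f"<{node.tag}>")                     # node (tag) token
        if parent_tag is not None:
            tokens.append(f"<{parent_tag}>-<{node.tag}>")  # edge token
        # Text tokens from the element's own text (tail text is ignored here).
        tokens.extend(re.findall(r"\w+", (node.text or "").lower()))
        for child in node:
            visit(child, node.tag)

    visit(root)
    return tokens

contact = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
description = ("<description>Victorian house with a view. Name your price! "
               "To see it, contact Gail Murphy at MAX Realtors.</description>")
contact_tokens = tokens_with_structure(contact)
description_tokens = tokens_with_structure(description)
```

Once flattened, both instances contain "gail", "murphy", "max" and "realtors", so a purely text-based learner can confuse them; the structure tokens such as `<name>` and `<contact>-<firm>` appear only in the contact instance and give the learner something to separate the two.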
65
Reasons for Incorrect Matchings

Unfamiliarity
suburb
solution: add a suburb-name recognizer
Insufficient information
correctly identified general type, failed to pinpoint exact type
agent-name, phone: Richard Smith, (206) 234 5412
solution: add a proximity learner
Subjectivity
house-style = description?
Victorian, Beautiful neo-gothic house, Mexican, Great location

66
Evaluate Mapping Candidates

For address, Text Searcher returns
(agent-id,0.7)
(concat(agent-id,city),0.8)
(concat(city,zipcode),0.75)
Employ multi-strategy learning to evaluate mappings
Example: (concat(agent-id,city),0.8)
Naive Bayes Learner: 0.8
Name Learner: "address" vs. "agent id city": 0.3
Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
Meta-Learner returns
(agent-id,0.59)
(concat(agent-id,city),0.65)
(concat(city,zipcode),0.70)

67
Relaxation Labeling

Applied to similar problems in
vision, NLP, hypertext classification

Example taxonomies: Dept U.S. (Courses, People, Staff, Faculty) and Dept Australia (Courses, Staff, Tech. Staff, Acad. Staff)
68
Relaxation Labeling for Taxonomy Matching

Must define
neighborhood of a node
k features of neighborhood
how to combine influence of features
Algorithm
init: for each pair <N,L>, compute
loop: for each pair <N,L>, re-compute

Acad. Staff Faculty Tech. Staff Staff
Staff People
Neighborhood configuration
69
Relaxation Labeling for Taxonomy Matching

Huge number of neighborhood configurations!
typically neighborhood = immediate nodes
here neighborhood can be entire graph: 100 nodes, 10 labels => 10^100 configurations
Solution
label abstraction + dynamic programming
guarantee quadratic time for a broad range of domain constraints
Empirical evaluation
GLUE system [Doan et al., WWW-02]
three real-world domains
30 -- 300 nodes / taxonomy
high accuracy 66 -- 97% vs. 52 -- 83% of best base learner
relaxation labeling very fast, finished in several seconds
