AnHai Doan


1
Schema & Ontology Matching: Current Research
Directions
  • AnHai Doan
  • Database and Information System Group
  • University of Illinois, Urbana Champaign
  • Spring 2004

2
Road Map
  • Schema Matching
  • motivation & problem definition
  • representative current solutions: LSD, iMAP, Clio
  • broader picture
  • Ontology Matching
  • motivation & problem definition
  • representative current solution: GLUE
  • broader picture
  • Conclusions & Emerging Directions

3
Motivation: Data Integration
Find houses with 2 bedrooms priced under 200K
New faculty member
homes.com
realestate.com
homeseekers.com
4
Architecture of Data Integration System
Find houses with 2 bedrooms priced under 200K
mediated schema
source schema 2
source schema 3
source schema 1
homes.com
realestate.com
homeseekers.com
5
Semantic Matches between Schemas
Mediated schema: price, agent-name, address
homes.com schema and data:
  listed-price   contact-name   city      state
  320K           Jane Brown     Seattle   WA
  240K           Mike Smith     Miami     FL
1-1 match (e.g., price = listed-price)
complex match (e.g., address = concat(city, state))
6
Schema Matching is Ubiquitous!
  • Fundamental problem in numerous applications
  • Databases
  • data integration
  • data translation
  • schema/view integration
  • data warehousing
  • semantic query processing
  • model management
  • peer data management
  • AI
  • knowledge bases, ontology merging, information
    gathering agents, ...
  • Web
  • e-commerce
  • marking up data using ontologies (e.g., on
    Semantic Web)

7
Why Schema Matching is Difficult
  • Schema & data never fully capture semantics!
  • not adequately documented
  • schema creator has retired to Florida!
  • Must rely on clues in schema & data
  • using names, structures, types, data values, etc.
  • Such clues can be unreliable
  • same names => different entities: area =>
    location or square-feet
  • different names => same entity: area &
    address => location
  • Intended semantics can be subjective
  • house-style = house-description?
  • military applications require committees to
    decide!
  • Cannot be fully automated, needs user feedback!

8
Current State of Affairs
  • Finding semantic mappings is now a key
    bottleneck!
  • largely done by hand
  • labor intensive & error prone
  • data integration at GTE [Li&Clifton, 2000]
  • 40 databases, 27000 elements, estimated time: 12
    years
  • Will only be exacerbated
  • data sharing becomes pervasive
  • translation of legacy data
  • Need semi-automatic approaches to scale up!
  • Many research projects in the past few years
  • Databases: IBM Almaden, Microsoft Research, BYU,
    George Mason, U of Leipzig, U
    Wisconsin, NCSU, UIUC, Washington, ...
  • AI: Stanford, Karlsruhe University, NEC Japan,
    ...

10
LSD
  • Learning Source Description
  • Developed at Univ of Washington 2000-2001
  • with Pedro Domingos and Alon Halevy
  • Designed for data integration settings
  • has been adapted to several other contexts
  • Desirable characteristics
  • learn from previous matching activities
  • exploit multiple types of information in schema
    and data
  • incorporate domain integrity constraints
  • handle user feedback
  • achieves high matching accuracy (66 -- 97%) on
    real-world data

11
Schema Matching for Data Integration: the LSD
Approach
  • Suppose user wants to integrate 100 data
    sources
  • 1. User
  • manually creates matches for a few sources, say 3
  • shows LSD these matches
  • 2. LSD learns from the matches
  • 3. LSD predicts matches for remaining 97 sources

12
Learning from the Manual Matches
Mediated schema
  price, agent-name, agent-phone, office-phone, description
Schema of realestate.com
  listed-price, contact-name, contact-phone, office, comments
realestate.com data
  listed-price   contact-name   contact-phone    office           comments
  250K           James Smith    (305) 729 0831   (305) 616 1822   Fantastic house
  320K           Mike Doan      (617) 253 1429   (617) 112 2315   Great location
Learned rules
  If "office" occurs in name => office-phone
  If "fantastic" & "great" occur frequently in data instances => description
homes.com data
  sold-at   contact-agent    extra-info
  350K      (206) 634 9435   Beautiful yard
  230K      (617) 335 4243   Close to Seattle
13
Must Exploit Multiple Types of Information!
14
Multi-Strategy Learning
  • Use a set of base learners
  • each exploits well certain types of information
  • To match a schema element of a new source
  • apply base learners
  • combine their predictions using a meta-learner
  • Meta-learner
  • uses training sources to measure base learner
    accuracy
  • weighs each learner based on its accuracy

15
Base Learners
  • Training
  • Matching: given an object X, return labels weighted
    by confidence score
  • Name Learner
  • training: ("location", address),
    ("contact name", name)
  • matching: agent-name => (name,0.7), (phone,0.3)
  • Naive Bayes Learner
  • training: ("Seattle, WA", address),
    ("250K", price)
  • matching: "Kent, WA" => (address,0.8), (name,0.2)
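
As a rough illustration of a data-driven base learner like the Naive Bayes learner above, the sketch below trains a tiny word-level Naive Bayes classifier on (instance, label) pairs and returns labels weighted by confidence. The function names, the crude tokenizer, the smoothing choice, and the extra training rows ("Miami, FL", "320K") are assumptions made for illustration, not details taken from LSD.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace and commas (a deliberately crude tokenizer).
    return text.lower().replace(",", " ").split()

def train(examples):
    # examples: list of (instance_text, label) pairs.
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    # Returns {label: confidence}; confidences are normalized to sum to 1.
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for w in tokenize(text):
            # Laplace smoothing so an unseen word does not zero out a label.
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab) + 1))
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {l: math.exp(s - top) for l, s in log_scores.items()}
    z = sum(exp.values())
    return {l: v / z for l, v in exp.items()}

model = train([("Seattle, WA", "address"), ("Miami, FL", "address"),
               ("250K", "price"), ("320K", "price")])
scores = predict(model, "Kent, WA")
```

With this toy training set, "Kent, WA" scores higher for address than for price because it shares the token "wa" with the address examples.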
16
The LSD Architecture
Training Phase
  • Inputs: mediated schema, source schemas
  • Training data for base learners
  • Base-Learner1 .... Base-Learnerk => Hypothesis1 .... Hypothesisk
  • Meta-Learner => weights for base learners
Matching Phase
  • Base-Learner1 .... Base-Learnerk and Meta-Learner => predictions for instances
  • Prediction Combiner => predictions for elements
  • Constraint Handler (uses domain constraints) => mappings
17
Training the Base Learners
Mediated schema
  address, price, agent-name, agent-phone, office-phone, description
realestate.com
  location     price   contact-name   contact-phone    office           comments
  Miami, FL    250K    James Smith    (305) 729 0831   (305) 616 1822   Fantastic house
  Boston, MA   320K    Mike Doan      (617) 253 1429   (617) 112 2315   Great location
18
Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]
  • Training
  • uses training data to learn weights
  • one for each (base-learner, mediated-schema
    element) pair
  • weight (Name-Learner, address) = 0.2
  • weight (Naive-Bayes, address) = 0.8
  • Matching: combine predictions of base learners
  • computes weighted average of base-learner
    confidence scores

Example: source element area, with instances Seattle, WA; Kent, WA; Bend, OR
  Name Learner: (address,0.4)
  Naive Bayes: (address,0.9)
  Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
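
The weighted combination on this slide can be sketched in a few lines. The dictionary layout and the name `combine` are hypothetical; only the weights and confidence scores come from the slide.

```python
# Weights keyed by (base_learner, mediated_schema_element), as on the slide.
weights = {
    ("name_learner", "address"): 0.2,
    ("naive_bayes", "address"): 0.8,
}

def combine(predictions, weights, element):
    # predictions: {base_learner: confidence that the source element matches `element`}
    return sum(weights[(learner, element)] * confidence
               for learner, confidence in predictions.items())

score = combine({"name_learner": 0.4, "naive_bayes": 0.9}, weights, "address")
```

Here `score` is 0.4*0.2 + 0.9*0.8, which is the 0.8 shown for address (up to floating-point rounding).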
20
Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
Instances of area, each scored by Name Learner and Naive Bayes, then Meta-Learner:
  Seattle, WA => (address,0.8), (description,0.2)
  Kent, WA    => (address,0.6), (description,0.4)
  Bend, OR    => (address,0.7), (description,0.3)
Prediction-Combiner output per homes.com element:
  area          => (address,0.7), (description,0.3)
  sold-at       => (price,0.9), (agent-phone,0.1)
  contact-agent => (agent-phone,0.9), (description,0.1)
  extra-info    => (address,0.6), (description,0.4)
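
One simple way to turn per-instance predictions into a single prediction for a schema element is to average each label's confidence over the instances. The sketch below assumes plain averaging, which reproduces the numbers shown for area; the function name is hypothetical.

```python
def combine_instances(instance_predictions):
    # Average each label's confidence over all instances of one source element.
    labels = {label for pred in instance_predictions for label in pred}
    n = len(instance_predictions)
    return {label: sum(pred.get(label, 0.0) for pred in instance_predictions) / n
            for label in labels}

area = combine_instances([
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.6, "description": 0.4},  # Kent, WA
    {"address": 0.7, "description": 0.3},  # Bend, OR
])
```

The averaged result for area is approximately (address, 0.7), (description, 0.3).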
21
Domain Constraints
  • Encode user knowledge about domain
  • Specified only once, by examining mediated schema
  • Examples
  • at most one source-schema element can match
    address
  • if a source-schema element matches house-id then
    it is a key
  • avg-value(price) > avg-value(num-baths)
  • Given a mapping combination
  • can verify if it satisfies a given constraint

Example mapping combination: area => address, sold-at => price,
contact-agent => agent-phone, extra-info => address
22
The Constraint Handler
Predictions from Prediction Combiner
  area          => (address,0.7), (description,0.3)
  sold-at       => (price,0.9), (agent-phone,0.1)
  contact-agent => (agent-phone,0.9), (description,0.1)
  extra-info    => (address,0.6), (description,0.4)
Domain Constraints: at most one element matches address
Scored mapping combinations (labels for area, sold-at, contact-agent, extra-info):
  description, agent-phone, description, description: 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
  address, price, agent-phone, description: 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
  address, price, agent-phone, address: 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (two elements match address)
  • Searches space of mapping combinations
    efficiently
  • Can handle arbitrary constraints
  • Also used to incorporate user feedback
  • sold-at does not match price
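
The scoring above can be sketched as follows. This version enumerates every mapping combination by brute force, which is an assumption made for brevity and stands in for the efficient search the slide mentions; the names `best_mapping` and `at_most_one_address` are hypothetical.

```python
from itertools import product

predictions = {
    "area":          {"address": 0.7, "description": 0.3},
    "sold-at":       {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info":    {"address": 0.6, "description": 0.4},
}

def at_most_one_address(mapping):
    # Domain constraint: at most one source element may match address.
    return list(mapping.values()).count("address") <= 1

def best_mapping(predictions, constraints):
    elements = list(predictions)
    best, best_score = None, 0.0
    # Try every combination of one candidate label per source element.
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

mapping, score = best_mapping(predictions, [at_most_one_address])
```

User feedback fits the same interface: "sold-at does not match price" could be passed as an extra constraint such as `lambda m: m["sold-at"] != "price"`.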

23
The Current LSD System
  • Can also handle data in XML format
  • matches XML DTDs
  • Base learners
  • Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]
  • exploits frequencies of words & symbols
  • WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh
    KDD-98]
  • employs information-retrieval similarity metric
  • Name Learner [SIGMOD-01]
  • matches elements based on their names
  • County-Name Recognizer [SIGMOD-01]
  • stores all U.S. county names
  • XML Learner [SIGMOD-01]
  • exploits hierarchical structure of XML data

24
Empirical Evaluation
  • Four domains
  • Real Estate I & II, Course Offerings, Faculty
    Listings
  • For each domain
  • created mediated schema & domain constraints
  • chose five sources
  • extracted & converted data into XML
  • mediated schemas: 14 - 66 elements, source
    schemas: 13 - 48
  • Ten runs for each domain, in each run
  • manually provided 1-1 matches for 3 sources
  • asked LSD to propose matches for remaining 2
    sources
  • accuracy = % of 1-1 matches correctly identified

25
High Matching Accuracy
Average Matching Accuracy (%)
LSD's accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
26
Contribution of Schema vs. Data
Average matching accuracy (%)

  • LSD with only schema info.
  • LSD with only data info.
  • Complete LSD

More experiments in Doan et al. SIGMOD-01
27
LSD Summary
  • LSD
  • learns from previous matching activities
  • exploits multiple types of information
  • by employing multi-strategy learning
  • incorporates domain constraints & user feedback
  • achieves high matching accuracy
  • LSD focuses on 1-1 matches
  • Next challenge: discover more complex matches!
  • iMAP (illinois Mapping) system [SIGMOD-04]
  • developed at Washington and Illinois, 2002-2004
  • with Robin Dhamankar, Yoonkyong Lee, Alon Halevy,
    Pedro Domingos

28
The iMAP Approach
Mediated-schema: price, num-baths, address
homes.com: listed-price, agent-id, full-baths, half-baths, city, zipcode
  • For each mediated-schema element
  • searches space of all matches
  • finds a small set of likely match candidates
  • uses LSD to evaluate them
  • To search efficiently
  • employs a specialized searcher for each element
    type
  • Text Searcher, Numeric Searcher, Category
    Searcher, ...

29
The iMAP Architecture [SIGMOD-04]
  • Inputs: source schema + data, mediated schema
  • Searcher1, Searcher2, ..., Searcherk => match candidates
  • Base-Learner1 .... Base-Learnerk and Meta-Learner => similarity matrix
  • Match selector => 1-1 and complex matches
  • Domain knowledge and data
  • Explanation module, interacting with the user
30
An Example Text Searcher
  • Beam search in space of all concatenation matches
  • Example find match candidates for address

Mediated-schema: price, num-baths, address
homes.com
  listed-price   agent-id   full-baths   half-baths   city      zipcode
  320K           532a       2            1            Seattle   98105
  240K           115c       1            1            Miami     23591
Candidate concatenations and their values
  concat(agent-id,zipcode): 532a 98105, 115c 23591
  concat(city,zipcode): Seattle 98105, Miami 23591
  concat(agent-id,city): 532a Seattle, 115c Miami
  • Best match candidates for address
  • (agent-id,0.7), (concat(agent-id,city),0.75),
    (concat(city,zipcode),0.9)
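
A beam search over concatenation matches can be sketched as below. The scoring function is a stand-in: it compares each concatenated value position by position against a few known address values (`samples`), which are hypothetical. The slides say candidates are evaluated with learners instead, so the scores here will not match the ones shown above.

```python
def similarity(a, b):
    # Fraction of token positions on which the two strings agree.
    ta, tb = a.split(), b.split()
    matches = sum(x == y for x, y in zip(ta, tb))
    return matches / max(len(ta), len(tb))

def score(columns, rows, samples):
    # Average, over rows, of the best similarity to any known target value.
    values = [" ".join(row[c] for c in columns) for row in rows]
    return sum(max(similarity(v, s) for s in samples) for v in values) / len(values)

def beam_search(rows, samples, beam_width=2, max_len=3):
    all_columns = list(rows[0])
    beam = [(c,) for c in all_columns]
    scored = {}
    for _ in range(max_len):
        for candidate in beam:
            scored[candidate] = score(candidate, rows, samples)
        # Keep only the best few candidates, then extend each by one unused column.
        kept = sorted(beam, key=scored.get, reverse=True)[:beam_width]
        beam = [cand + (c,) for cand in kept for c in all_columns if c not in cand]
        if not beam:
            break
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

rows = [
    {"agent-id": "532a", "city": "Seattle", "zipcode": "98105"},
    {"agent-id": "115c", "city": "Miami", "zipcode": "23591"},
]
samples = ["Seattle 98105", "Miami 23591", "Boston 02139"]  # hypothetical address values
ranked = beam_search(rows, samples)
```

With these inputs the top-ranked candidate is the concatenation of city and zipcode. The beam width bounds how many partial concatenations are extended at each step, which keeps the search tractable as the number of columns grows.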

31
Empirical Evaluation
  • Current iMAP system
  • 12 searchers
  • Four real-world domains
  • real estate, product inventory, cricket,
    financial wizard
  • target schema: 19 -- 42 elements, source schema:
    32 -- 44
  • Accuracy: 43 -- 92%
  • Sample discovered matches
  • agent-name = concat(first-name,last-name)
  • area = building-area / 43560
  • discount-cost = (unit-price * quantity) * (1 -
    discount)
  • More detail in [Dhamankar et al., SIGMOD-04]

32
Observations
  • Finding complex matches much harder than 1-1
    matches!
  • require gluing together many components
  • e.g., num-rooms = bath-rooms + bed-rooms +
    dining-rooms + living-rooms
  • if missing one component => incorrect match
  • However, even partial matches are already very
    useful!
  • so are top-k matches => need methods to handle
    partial/top-k matches
  • Huge/infinite search spaces
  • domain knowledge plays a crucial role!
  • Matches are fairly complex, hard to know if they
    are correct
  • must be able to explain matches
  • Human must be fairly active in the loop
  • need strong user interaction facilities
  • Break matching architecture into multiple
    "atomic" boxes!

34
Finding Matches is only Half of the Job!
  • To translate data/queries, need mappings, not
    matches

Schema S
  HOUSES
    location      price ($)   agent-id
    Atlanta, GA   360,000     32
    Raleigh, NC   430,000     15
  AGENTS
    id   name         city      state   fee-rate
    32   Mike Brown   Athens    GA      0.03
    15   Jean Laup    Raleigh   NC      0.04
Schema T
  LISTINGS
    area          list-price   agent-address   agent-name
    Denver, CO    550,000      Boulder, CO     Laura Smith
    Atlanta, GA   370,800      Athens, GA      Mike Brown
  • Mappings
  • area = SELECT location FROM HOUSES
  • agent-address = SELECT concat(city,state) FROM
    AGENTS
  • list-price = price * (1 + fee-rate)
    FROM HOUSES, AGENTS
    WHERE agent-id = id
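
To make the difference between a match and an executable mapping concrete, the sketch below loads the HOUSES and AGENTS rows into an in-memory SQLite database and runs the mappings as one query. Column names use underscores because hyphens are not valid in unquoted SQL identifiers, and the inner join is an assumption made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
INSERT INTO houses VALUES ('Atlanta, GA', 360000, 32), ('Raleigh, NC', 430000, 15);
INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                          (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# Translate schema S rows into the LISTINGS layout of schema T.
listings = conn.execute("""
SELECT h.location                 AS area,
       h.price * (1 + a.fee_rate) AS list_price,
       a.city || ', ' || a.state  AS agent_address,
       a.name                     AS agent_name
FROM houses h JOIN agents a ON h.agent_id = a.id
ORDER BY h.agent_id DESC
""").fetchall()
```

The first returned row corresponds to the Atlanta listing in LISTINGS: a list price of about 370,800 with agent Mike Brown in Athens, GA.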

35
Clio: Elaborating Matches into Mappings
  • Developed at Univ of Toronto & IBM Almaden,
    2000-2003
  • by Renee Miller, Laura Haas, Mauricio Hernandez,
    Lucian Popa, Howard Ho, Ling Yan, Ron Fagin
  • Given a match
  • list-price = price * (1 + fee-rate)
  • Refine it into a mapping
  • list-price = SELECT price * (1 + fee-rate)
    FROM HOUSES (FULL OUTER JOIN)
    AGENTS WHERE agent-id = id
  • Need to discover
  • the correct join path among tables, e.g.,
    agent-id = id
  • the correct join, e.g., full outer join? inner
    join?
  • Use heuristics to decide
  • when in doubt, ask users
  • employ sophisticated user interaction methods
    [VLDB-00, SIGMOD-01]

36
Clio: Illustrating Examples

38
Broader Picture: Find Matches
  • Hand-crafted rules; exploit schema; 1-1 matches
  • TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99],
    [Palopoli et al. 98], CUPID [Madhavan et al. 01]
  • Single learner; exploit data; 1-1 matches
  • SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95],
    DELTA [Clifton et al. 97], AutoMatch, Autoplex
    [Berlin & Motro, 01-03]
  • Learners + rules, use multi-strategy learning; exploit
    schema + data; 1-1 + complex matches; exploit domain
    constraints
  • LSD [Doan et al., SIGMOD-01], iMAP [Dhamankar et al.,
    SIGMOD-04]
  • Other Important Works
  • COMA by Erhard Rahm group, David Embley group at BYU,
    Jaewoo Kang group at NCSU, Kevin Chang group at UIUC,
    Clement Yu group at UIC
  • More about some of these works soon ....
39
Broader Picture: From Matches to Mappings
  • Learners + rules; exploit schema + data; 1-1 +
    complex matches; automate as much as possible
  • iMAP [Dhamankar et al., SIGMOD-04]
  • Rules; exploit data; powerful user interaction
  • CLIO [Miller et al., 00], [Yan et al. 01]
  • ?
41
Ontology Matching
  • Increasingly critical for
  • knowledge bases, Semantic Web
  • An ontology
  • concepts organized into a taxonomy tree
  • each concept has
  • a set of attributes
  • a set of instances
  • relations among concepts
  • Matching
  • concepts
  • attributes
  • relations

CS Dept. US
Entity
  Undergrad Courses
  Grad Courses
  People
    Staff
    Faculty
      Assistant Professor
      Associate Professor
      Professor
Sample instance: name = Mike Burns, degree = Ph.D.
42
Matching Taxonomies of Concepts
CS Dept. Australia
Entity
  Courses
  Staff
    Technical Staff
    Academic Staff
      Senior Lecturer
      Lecturer
      Professor
43
Glue
  • Solution
  • Use data instances extensively
  • Learn classifiers using information within
    taxonomies
  • Use a rich constraint satisfaction scheme
  • [Doan, Madhavan, Domingos, Halevy; WWW-2002]

44
Concept Similarity
Concept S
Concept A
Hypothetical universe of all examples
Joint Probability Distribution
P(A,S), P(¬A,S), P(A,¬S), P(¬A,¬S)
  • Multiple Similarity measures in terms of the JPD

45
Machine Learning for Computing Similarities
Taxonomy 1
Taxonomy 2
  • JPD estimated by counting the sizes of the
    partitions
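
The counting step can be illustrated with a small sketch. It assumes each example in the universe has already been assigned membership flags for concept A and concept S, and it uses the Jaccard coefficient as one possible similarity measure defined on the joint distribution. The function names and the sample flags are hypothetical.

```python
def joint_distribution(in_a, in_s):
    # in_a[i], in_s[i]: whether example i belongs to concept A / concept S.
    n = len(in_a)
    counts = {(a, s): 0 for a in (True, False) for s in (True, False)}
    for a, s in zip(in_a, in_s):
        counts[(a, s)] += 1
    # Each probability is the relative size of one of the four partitions.
    return {key: count / n for key, count in counts.items()}

def jaccard(jpd):
    # P(A and S) / P(A or S), written in terms of the joint distribution.
    union = jpd[(True, True)] + jpd[(True, False)] + jpd[(False, True)]
    return jpd[(True, True)] / union if union else 0.0

in_a = [True, True, True, False, False, False]
in_s = [True, True, False, True, False, False]
jpd = joint_distribution(in_a, in_s)
sim = jaccard(jpd)
```

For these six examples, two fall in both concepts and four fall in at least one, so the similarity is about 0.5. Other measures can be swapped in by replacing `jaccard` with a different function of the same four probabilities.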

46
The Glue System
  • Inputs: Taxonomy O1 and Taxonomy O2 (tree structure + data instances)
  • Distribution Estimator (Base Learners + Meta Learner)
    => Joint Probability Distribution P(A,B), P(A,¬B), ...
  • Similarity Estimator (Similarity Function) => Similarity Matrix
  • Relaxation Labeling (Common Knowledge & Domain Constraints)
    => Matches for O1, Matches for O2
47
Constraints in Taxonomy Matching
  • Domain-dependent
  • at most one node matches department-chair
  • a node that matches professor can not be a child
    of a node that matches assistant-professor
  • Domain-independent
  • two nodes match if parents & children match
  • if all children of X match Y, then X also
    matches Y
  • Variations have been exploited in many restricted
    settings [Melnik&Garcia-Molina, ICDE-02],
    [Milo&Zohar, VLDB-98], [Noy et al., IJCAI-01],
    [Madhavan et al., VLDB-01]
  • Challenge: find a general & efficient approach

48
Solution: Relaxation Labeling
  • Relaxation labeling [Hummel&Zucker, 83]
  • applied to graph labeling in vision, NLP,
    hypertext classification
  • finds best label assignment, given a set of
    constraints
  • starts with initial label assignment
  • iteratively improves labels, using constraints
  • Standard relax. labeling not applicable
  • extended it in many ways [Doan et al., WWW-02]
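
For orientation, here is a minimal sketch of the standard iterative scheme: each node holds a probability distribution over labels and repeatedly re-weights it by how well each label agrees with its neighbors' current labels. This is the textbook form of relaxation labeling, not the extended version used in GLUE, and the graph, labels, and compatibility function below are hypothetical.

```python
def relax(probs, neighbors, compat, iterations=10):
    # probs: {node: {label: probability}}; neighbors: {node: [node, ...]}
    # compat(l1, l2) in [-1, 1]: how well l1 on a node fits l2 on a neighbor.
    for _ in range(iterations):
        updated = {}
        for node, dist in probs.items():
            raw = {}
            for label, p in dist.items():
                # Average support for `label` from the neighbors' current beliefs.
                support = sum(compat(label, other) * q
                              for nb in neighbors[node]
                              for other, q in probs[nb].items())
                support /= len(neighbors[node])
                raw[label] = p * (1.0 + support)
            total = sum(raw.values())
            updated[node] = {label: v / total for label, v in raw.items()}
        probs = updated
    return probs

def same_label(l1, l2):
    # Toy constraint: neighboring nodes prefer to carry the same label.
    return 1.0 if l1 == l2 else -1.0

initial = {
    "n1": {"Faculty": 0.9, "Staff": 0.1},
    "n2": {"Faculty": 0.5, "Staff": 0.5},
    "n3": {"Faculty": 0.8, "Staff": 0.2},
}
neighbors = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2"]}
final = relax(initial, neighbors, same_label)
```

The undecided node n2 is pulled toward Faculty because both of its neighbors start out favoring that label.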

49
Real World Experiments
  • Taxonomies on the web
  • University organization (UW and Cornell)
  • Colleges, departments and sub-fields
  • Companies (Yahoo and The Standard)
  • Industries and Sectors
  • For each taxonomy
  • Extract data instances: course descriptions,
    company profiles
  • Trivial data cleaning
  • 100 - 300 concepts per taxonomy
  • 3-4 depth of taxonomies
  • 10-90 data instances per concept
  • Evaluation against manual mappings as the gold
    standard

50
GLUE's Performance
(Chart panels: University Depts 1, Company Profiles, University Depts 2)
51
Broader Picture
  • Ontology matching parallels the development of
    schema matching
  • rule-based & learning-based approaches
  • PROMPT family, OntoMorph, OntoMerge, Chimaera,
    Onion, OBSERVER, FCAMerge, ...
  • extensive work by Ed Hovy's group
  • ontology versioning (e.g., by Noy et al.)
  • More powerful user interaction methods
  • e.g., iPROMPT, Chimaera
  • Much more theoretical work in this area

53
Develop the Theoretical Foundation
  • Not much is going on, however ...
  • see works by Alon Halevy (AAAI-02) and Phil
    Bernstein (in model management contexts)
  • some preliminary work in AnHai Doan's Ph.D.
    dissertation
  • work by Stuart Russell and other AI people on
    identity uncertainty is potentially relevant
  • Most likely foundation
  • probability framework

54
Need Much More Domain Knowledge
  • Where to get it?
  • past matches (e.g., LSD, iMAP)
  • other schemas in the domain
  • holistic matching approach by Kevin Chang group
    SIGMOD-02
  • corpus-based matching by Alon Halevy group
    IJCAI-03
  • clustering to achieve bridging effects by Clement
    Yu group SIGMOD-04
  • external data (e.g., iMAP at SIGMOD-04)
  • mass of users (e.g., MOBS at WebDB-03)
  • How to get it and how to use it?
  • no clear answer yet

55
Employ Multi-Module Architecture
  • Many "black boxes", each is good at doing a
    single thing
  • Combine them and tailor them to each application
  • Examples
  • LSD, iMAP, COMA, David Embley's systems
  • Open issues
  • what are these black boxes?
  • how to build them?
  • how to combine them?

56
Powerful User Interaction
  • Minimize user effort, maximize its impact
  • Make it very easy for users to
  • supply domain knowledge
  • provide feedback on matches/mappings
  • Develop powerful explanation facilities

57
Other Issues
  • What to do with partial/top-k matches?
  • Meaning negotiation
  • Fortifying schemas for interoperability
  • Very-large-scale matching scenarios (e.g., the
    Web)
  • What can we do without the mappings?
  • Interaction between schema matching and tuple
    matching?
  • Benchmarks, tools?

58
Summary
  • Schema/ontology matching key to
    numerous data management problems
  • much attention in the database, AI, Semantic Web
    communities
  • Simple problem definition, yet very difficult to
    do
  • no satisfactory solution yet
  • AI complete?
  • We now understand the problems much better
  • still at the beginning of the journey
  • will need techniques from multiple fields

59
Backup Slides
61
Training the Meta-Learner
  • For address

Extracted XML Instances          Name Learner   Naive Bayes   True Predictions
<location> Miami, FL </>         0.5            0.8           1
<listed-price> 250,000 </>       0.4            0.3           0
<area> Seattle, WA </>           0.3            0.9           1
<house-addr> Kent, WA </>        0.6            0.8           1
<num-baths> 3 </>                0.3            0.3           0
...                              ...            ...           ...

Least-Squares Linear Regression
  Weight(Name-Learner,address) = 0.1
  Weight(Naive-Bayes,address) = 0.9
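
The regression step can be sketched in plain Python by solving the 2x2 normal equations for a model without an intercept. Only the five rows visible on the slide are used, so the fitted values will not reproduce 0.1 and 0.9, which presumably come from the full training set. The function name is hypothetical.

```python
def least_squares_2(rows, targets):
    # Minimize sum((w1*x1 + w2*x2 - y)^2) by solving the 2x2 normal equations.
    s11 = sum(a * a for a, _ in rows)
    s12 = sum(a * b for a, b in rows)
    s22 = sum(b * b for _, b in rows)
    t1 = sum(a * y for (a, _), y in zip(rows, targets))
    t2 = sum(b * y for (_, b), y in zip(rows, targets))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det

# (Name Learner score, Naive Bayes score) for address, one row per instance.
rows = [(0.5, 0.8), (0.4, 0.3), (0.3, 0.9), (0.6, 0.8), (0.3, 0.3)]
targets = [1, 0, 1, 1, 0]  # 1 if the instance truly is an address
w_name, w_bayes = least_squares_2(rows, targets)
```

Even on this small sample the Naive Bayes learner receives the larger weight for address, which agrees in direction with the weights shown on the slide.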
62
Sensitivity to Amount of Available Data
Average matching accuracy (%)
Number of data listings per source (Real Estate I)
63
Contribution of Each Component
Average Matching Accuracy (%)
Configurations compared: without Name Learner, without Naive Bayes, without
Whirl Learner, without Constraint Handler, the complete LSD system
64
Exploiting Hierarchical Structure
  • Existing learners flatten out all structures
  • Developed XML learner
  • similar to the Naive Bayes learner
  • input instance = bag of tokens
  • differs in one crucial aspect
  • consider not only text tokens, but also structure
    tokens

<contact> <name> Gail Murphy </name> <firm>
MAX Realtors </firm> </contact>
<description> Victorian house with a view.
Name your price! To see it, contact Gail
Murphy at MAX Realtors. </description>
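
The effect of structure tokens can be shown with a small sketch. Here a structure token is simply the name of a nested tag, which is a simplifying assumption since the slide does not spell out the exact token scheme. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def tokens(xml_text, structure=True):
    # Build a bag of tokens from one XML instance.
    root = ET.fromstring(xml_text)
    bag = []
    for elem in root.iter():
        if structure and elem is not root:
            bag.append("<" + elem.tag + ">")  # structure token for a nested tag
        if elem.text and elem.text.strip():
            bag.extend(elem.text.lower().split())  # text tokens
    return bag

contact = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
description = ("<description>Victorian house with a view. Name your price! "
               "To see it, contact Gail Murphy at MAX Realtors.</description>")

contact_tokens = tokens(contact)
description_tokens = tokens(description)
flat_tokens = tokens(contact, structure=False)
```

Both instances mention Gail Murphy and MAX Realtors, so their text tokens overlap heavily. Only the contact instance yields the `<name>` and `<firm>` structure tokens, which gives a classifier a way to tell the two apart.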
65
Reasons for Incorrect Matchings
  • Unfamiliarity
  • suburb
  • solution: add a suburb-name recognizer
  • Insufficient information
  • correctly identified general type, failed to
    pinpoint exact type
  • agent-name & phone: Richard Smith,
    (206) 234 5412
  • solution: add a proximity learner
  • Subjectivity
  • house-style = description? e.g., Victorian vs.
    Beautiful neo-gothic house, Mexican vs.
    Great location

66
Evaluate Mapping Candidates
  • For address, Text Searcher returns
  • (agent-id,0.7)
  • (concat(agent-id,city),0.8)
  • (concat(city,zipcode),0.75)
  • Employ multi-strategy learning to evaluate
    mappings
  • Example: (concat(agent-id,city),0.8)
  • Naive Bayes Learner: 0.8
  • Name Learner: "address" vs. "agent id city": 0.3
  • Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
  • Meta-Learner returns
  • (agent-id,0.59)
  • (concat(agent-id,city),0.65)
  • (concat(city,zipcode),0.70)

67
Relaxation Labeling
  • Applied to similar problems in
  • vision, NLP, hypertext classification

(Figure: node labels in two taxonomies, Dept U.S. with Courses, People,
Faculty, Staff and Dept Australia with Courses, Staff, Acad. Staff,
Tech. Staff)
68
Relaxation Labeling for Taxonomy Matching
  • Must define
  • neighborhood of a node
  • k features of neighborhood
  • how to combine influence of features
  • Algorithm
  • init: for each pair <N,L>, compute
  • loop: for each pair <N,L>, re-compute

Neighborhood configuration (figure): Acad. Staff: Faculty,
Tech. Staff: Staff, Staff: People
69
Relaxation Labeling for Taxonomy Matching
  • Huge number of neighborhood configurations!
  • typically neighborhood = immediate nodes
  • here neighborhood can be entire graph: 100 nodes,
    10 labels => 10^100 configurations
  • Solution
  • label abstraction + dynamic programming
  • guarantee quadratic time for a broad range of
    domain constraints
  • Empirical evaluation
  • GLUE system [Doan et al., WWW-02]
  • three real-world domains
  • 30 -- 300 nodes / taxonomy
  • high accuracy: 66 -- 97% vs. 52 -- 83% of best
    base learner
  • relaxation labeling very fast, finished in
    several seconds