Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach

1
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach
  • AnHai Doan
  • Pedro Domingos
  • Alon Halevy

2
Data Integration
3
Problem and Solution
  • Problem
      • Large-scale Data Integration Systems
      • Bottleneck: Semantic Mappings
      • 1-1 Mappings
  • Solution
      • Multi-strategy Learning
      • Integrity Constraints
      • XML Structure Learner

4
Learning Source Descriptions (LSD)
  • Components
      • Base learners
      • Meta-learner
      • Prediction converter
      • Constraint handler
  • Operations
      • Training phase
      • Matching phase

5
Learners
  • Base Learners
      • Name Matcher (WHIRL)
      • Content Matcher (WHIRL)
      • Naïve Bayes Learner
      • County-Name Recognizer
      • XML Learner
  • Meta-Learner (Stacking)
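
As a rough illustration of how a stacking meta-learner might combine the base learners' predictions, the sketch below takes a weighted sum of per-label confidence scores and normalises the result. The slides do not give the combination rule or the weights, so the function name `combine_predictions`, the learner keys, and every number are assumptions made for illustration only.

```python
from typing import Dict

Scores = Dict[str, float]  # mediated-schema label -> confidence score


def combine_predictions(
    predictions: Dict[str, Scores],
    weights: Dict[str, Dict[str, float]],
) -> Scores:
    """Combine base-learner scores with a per-(learner, label) weighted sum.

    `predictions` maps a learner name to its label scores for one source tag.
    `weights` maps a learner name to its (hypothetical) per-label weights.
    The combined scores are normalised so that they sum to 1.
    """
    labels = {label for scores in predictions.values() for label in scores}
    combined = {}
    for label in labels:
        combined[label] = sum(
            weights.get(learner, {}).get(label, 0.0) * scores.get(label, 0.0)
            for learner, scores in predictions.items()
        )
    total = sum(combined.values())
    if total == 0:
        return combined  # no learner gave any weighted support
    return {label: score / total for label, score in combined.items()}


# Hypothetical scores for a single source tag from two base learners.
example_predictions = {
    "name_matcher": {"ADDRESS": 0.8, "PRICE": 0.2},
    "naive_bayes": {"ADDRESS": 0.6, "PRICE": 0.4},
}
example_weights = {
    "name_matcher": {"ADDRESS": 0.5, "PRICE": 0.5},
    "naive_bayes": {"ADDRESS": 0.5, "PRICE": 0.5},
}
example_combined = combine_predictions(example_predictions, example_weights)
```

In a real stacking setup the weights would be learned from held-out training data rather than set by hand as they are here.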

6
XML Learner
7
XML Learner (Cont.)
8
Constraint Handler
  • Domain Constraints

9
Constraint Handler (Cont.)
  • Search Heuristic
  • Mapping Cost
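
One way to read "mapping cost" is as a score that combines how unlikely a candidate mapping is under the learners' predictions with a penalty for each violated domain constraint. The sketch below assumes that reading. It enumerates all candidate mappings exhaustively, which is a simple stand-in for the search heuristic named on the slide, not a reconstruction of it. The function names, the penalty value, the example constraint, and the scores are all hypothetical.

```python
import itertools
import math
from typing import Callable, Dict, List, Tuple

Mapping = Dict[str, str]               # source tag -> mediated-schema label
Constraint = Callable[[Mapping], bool]  # returns True when the mapping violates it


def mapping_cost(
    mapping: Mapping,
    scores: Dict[str, Dict[str, float]],
    constraints: List[Constraint],
    penalty: float = 10.0,
) -> float:
    """Negative log-likelihood of the mapping plus a fixed penalty per violation."""
    cost = 0.0
    for tag, label in mapping.items():
        prob = max(scores[tag].get(label, 0.0), 1e-9)  # avoid log(0)
        cost += -math.log(prob)
    cost += penalty * sum(1 for violated in constraints if violated(mapping))
    return cost


def best_mapping(
    scores: Dict[str, Dict[str, float]],
    constraints: List[Constraint],
    penalty: float = 10.0,
) -> Tuple[Mapping, float]:
    """Try every combination of candidate labels and keep the cheapest one."""
    tags = sorted(scores)
    candidates = [sorted(scores[tag]) for tag in tags]
    best: Mapping = {}
    best_cost = math.inf
    for labels in itertools.product(*candidates):
        mapping = dict(zip(tags, labels))
        cost = mapping_cost(mapping, scores, constraints, penalty)
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost


def at_most_one_address(mapping: Mapping) -> bool:
    """Hypothetical domain constraint: at most one source tag maps to ADDRESS."""
    return sum(1 for label in mapping.values() if label == "ADDRESS") > 1


example_scores = {
    "location": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
    "comments": {"ADDRESS": 0.55, "DESCRIPTION": 0.45},
}
unconstrained, _ = best_mapping(example_scores, [])
constrained, _ = best_mapping(example_scores, [at_most_one_address])
```

Without the constraint both tags are assigned to ADDRESS; with it, the second-best label is chosen for `comments`. Exhaustive enumeration grows exponentially with the number of tags, which is why a guided search would be needed for realistic schemas.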

10
Training Phase
11
Example 1 (Training Phase)
12
Example 1 (Cont.)
13
Example 1 (Cont.)
(location, ADDRESS)
("Miami, FL", ADDRESS)
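
The two pairs above suggest that training produces one kind of example from tag names and another from data values. A minimal sketch of that extraction step follows, assuming the source data arrives as XML and the user supplies a tag-to-label mapping. The function name `training_examples`, the sample listing, and the `AGENT-PHONE` label are hypothetical and are not taken from the slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (feature text, mediated-schema label)


def training_examples(
    xml_text: str, tag_to_label: Dict[str, str]
) -> Tuple[List[Example], List[Example]]:
    """Build (tag name, label) and (data value, label) pairs from one XML listing.

    Elements whose tag has no user-supplied label, or that carry no text,
    are skipped.
    """
    name_examples: List[Example] = []
    content_examples: List[Example] = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        label = tag_to_label.get(elem.tag)
        text = (elem.text or "").strip()
        if label is None or not text:
            continue
        if (elem.tag, label) not in name_examples:  # one name example per tag
            name_examples.append((elem.tag, label))
        content_examples.append((text, label))
    return name_examples, content_examples


# Hypothetical source listing and user-supplied mapping.
sample_xml = (
    "<listing>"
    "<location>Miami, FL</location>"
    "<phone>(305) 555 0100</phone>"
    "</listing>"
)
sample_mapping = {"location": "ADDRESS", "phone": "AGENT-PHONE"}
names, contents = training_examples(sample_xml, sample_mapping)
```

Under these assumptions, the name examples would feed a name-based learner and the content examples a content-based learner.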
14
Matching Phase
15
Example 2 (Matching Phase)
16
Example 2 (Cont.)
17
Example 2 (Cont.)
18
Empirical Evaluation
19
Measures
  • Matching accuracy of a source
  • Average matching accuracy of a source
  • Average matching accuracy of a domain
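
The slides name these three measures without defining them. The sketch below assumes a common reading: the matching accuracy of a source is the fraction of its matchable tags that are mapped correctly, the average for a source is the mean over repeated experiment settings, and the average for a domain is the mean over that domain's sources. The function names and the sample data are hypothetical.

```python
from typing import Dict, List


def matching_accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of matchable source tags (the keys of `gold`) mapped correctly."""
    if not gold:
        raise ValueError("gold mapping must contain at least one matchable tag")
    correct = sum(1 for tag, label in gold.items() if predicted.get(tag) == label)
    return correct / len(gold)


def average(values: List[float]) -> float:
    """Plain arithmetic mean, used for both per-source and per-domain averages."""
    if not values:
        raise ValueError("cannot average an empty list")
    return sum(values) / len(values)


# Hypothetical gold mapping and two experiment runs for one source.
gold = {"location": "ADDRESS", "phone": "AGENT-PHONE", "cost": "PRICE", "notes": "DESCRIPTION"}
run_1 = {"location": "ADDRESS", "phone": "AGENT-PHONE", "cost": "PRICE", "notes": "ADDRESS"}
run_2 = {"location": "ADDRESS", "phone": "DESCRIPTION", "cost": "PRICE", "notes": "ADDRESS"}

source_accuracies = [matching_accuracy(run_1, gold), matching_accuracy(run_2, gold)]
source_average = average(source_accuracies)

# Averaging the per-source averages gives the domain-level figure
# (the second source's value is made up).
domain_average = average([source_average, 0.875])
```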

20
Experimental Results
21
Experimental Results (Cont.)
Contributions of base learners and the constraint handler
22
Experimental Results (Cont.)
Contributions of schema information and data instances
23
Experimental Results (Cont.)
Performance sensitivity to the amount of data instances
24
Limitations
  • Need for Enough Training Data
  • Domain-Dependent Learners
  • Ambiguities in Sources
  • Efficiency
  • Overlap of Schemas

25
Conclusion and Future Work
  • Improve over time
  • Extensible framework
  • Multiple types of knowledge
  • Non-1-1 mappings?