Title: Learning Object Identification Rules for Information Integration
1Learning Object Identification Rules for
Information Integration
- Sheila Tejada
- Craig A. Knoblock
- Steven Minton
- University of Southern California
2Introduction
- When integrating information, the same data object
can appear in inconsistent text formats across several
sources
- Previous methods manually construct mapping rules
for object identification
- Active Atlas learns to tailor mapping rules,
through limited user input, to a specific
application domain
- Active Atlas achieves higher accuracy and requires
less user involvement than previous methods
3Object Identification Example
4Ariadne Information Mediator
5Ariadne Information Mediator(contd)
6Active Atlas Approach to Map Objects
- First, determine the text formatting
transformations and propose candidate mappings
- Then, learn domain-specific mapping rules
7Active Atlas Architecture
8Mapping Objects(Transformation Functions)
- General Transformation Functions
- Type I
- Stemming, Soundex, Abbreviation
- Type II
- Equality, Initial, Prefix, Suffix, Substring,
Abbreviation, Acronym
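A few of the Type II transformations listed above can be sketched as predicates over pairs of tokens. This is only an illustration: the function names and the exact matching rules below are assumptions, not the definitions used in Active Atlas.

```python
from typing import Sequence


def equality(a: str, b: str) -> bool:
    """Tokens are identical once case is ignored."""
    return a.lower() == b.lower()


def initial(a: str, b: str) -> bool:
    """One token is the first letter of the other, e.g. 'A.' vs 'Art'."""
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and len(long_) > 1 and long_.startswith(short)


def prefix(a: str, b: str) -> bool:
    """The shorter token is a proper prefix of the longer one."""
    a, b = a.lower(), b.lower()
    short, long_ = sorted((a, b), key=len)
    return len(short) < len(long_) and long_.startswith(short)


def acronym(token: str, phrase: Sequence[str]) -> bool:
    """The token spells out the first letters of the words in the phrase."""
    letters = "".join(word[0].lower() for word in phrase if word)
    return len(phrase) > 1 and token.lower() == letters
```

Under these assumed rules, `prefix("Rest", "Restaurant")` and `acronym("CPK", ["California", "Pizza", "Kitchen"])` both hold, while `prefix("Deli", "Deli")` does not, since that pair is already covered by `equality`.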
9Mapping Objects(Transformation Functions Example)
10Mapping Objects(Compute Attribute Similarity
Scores)
11Mapping Objects(Compute Total Similarity Scores)
- Total object similarity score is computed as a
weighted sum of the attribute similarity scores
- Each attribute has a uniqueness weight that is a
heuristic measure of the importance of that
attribute
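The weighted sum described above can be sketched as follows. The attribute names, the scores, and the weights are made-up values for illustration, and whether Active Atlas normalizes its uniqueness weights is not stated here, so no normalization is assumed.

```python
from typing import Mapping


def total_similarity(
    attribute_scores: Mapping[str, float],
    uniqueness_weights: Mapping[str, float],
) -> float:
    """Weighted sum of per-attribute similarity scores for one candidate mapping."""
    return sum(
        uniqueness_weights[name] * score
        for name, score in attribute_scores.items()
    )


# Hypothetical candidate mapping between two restaurant records.
scores = {"name": 0.9, "street": 0.5, "phone": 1.0}
weights = {"name": 0.5, "street": 0.2, "phone": 0.3}
total = total_similarity(scores, weights)  # 0.45 + 0.10 + 0.30, roughly 0.85
```

An attribute with a high uniqueness weight, such as `name` in this sketch, contributes more to the total than an attribute whose values are shared by many objects.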
12Mapping Objects(Output of Candidate Generator)
13Mapping Objects(Mapping-Rule Learning)
- Decision Tree Learning
- Passive Learning
- Requires a large set of training examples
- Active Learning
- Uses the query-by-bagging technique
- Selects a small set of initial training examples
- Includes a variety of training examples
- Creates a diverse set of decision tree learners
- Actively chooses the examples for the user to label
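The active learning loop above can be sketched as query by bagging over a committee of learners. To stay short, this sketch makes several assumptions: each learner is a one-level decision tree (a stump) rather than a full decision tree, the committee size of 10 is arbitrary, and all names are hypothetical. It is not the Active Atlas implementation.

```python
import random
from typing import List, Optional, Sequence, Tuple

Example = Sequence[float]  # attribute similarity scores of one candidate mapping
Stump = Tuple[int, float]  # (attribute index, threshold): "mapped" if score >= threshold


def train_stump(examples: List[Example], labels: List[bool]) -> Stump:
    """Fit a one-level decision tree by exhaustive search over all splits."""
    best, best_errors = (0, 0.0), len(examples) + 1
    for attr in range(len(examples[0])):
        # float("inf") lets a stump answer "not mapped" for every example.
        thresholds = sorted({ex[attr] for ex in examples}) + [float("inf")]
        for threshold in thresholds:
            errors = sum((ex[attr] >= threshold) != y for ex, y in zip(examples, labels))
            if errors < best_errors:
                best, best_errors = (attr, threshold), errors
    return best


def predict(stump: Stump, example: Example) -> bool:
    attr, threshold = stump
    return example[attr] >= threshold


def train_committee(
    examples: List[Example],
    labels: List[bool],
    size: int = 10,
    rng: Optional[random.Random] = None,
) -> List[Stump]:
    """Bagging: train each learner on a bootstrap resample of the labeled set."""
    rng = rng or random.Random(0)
    committee = []
    for _ in range(size):
        idx = [rng.randrange(len(examples)) for _ in examples]
        committee.append(train_stump([examples[i] for i in idx], [labels[i] for i in idx]))
    return committee


def disagreement(committee: List[Stump], example: Example) -> int:
    """Size of the minority vote; 0 means the committee is unanimous."""
    yes = sum(predict(stump, example) for stump in committee)
    return min(yes, len(committee) - yes)


def choose_query(committee: List[Stump], unlabeled: List[Example]) -> Example:
    """Pick the unlabeled example on which the committee disagrees most."""
    return max(unlabeled, key=lambda ex: disagreement(committee, ex))
```

In each round, the example returned by `choose_query` would be shown to the user, its label added to the training set, and the committee retrained. Resampling is what makes the learners diverse, and disagreement among them marks the examples whose labels are most informative.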
14Mapping Objects(Active Learning)
15Experimental Results
- Three different domains
- Restaurants, Companies and Airports
- Experiments
- Two baseline experiments
- Compare the shared attributes separately
- Compare the object as a whole
- Both require choosing an optimal threshold
- Passive learning
- Active learning
16Experimental Results(Restaurants)
- Source A: 331 objects
- Source B: 533 objects
- 112 correct mappings
- 3259 candidate mappings over 10 runs
17Measurement of Accuracy
- Accuracy
- The total number of correct classifications divided
by the sum of the total number of mappings and the
number of correct mappings not proposed
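The accuracy measure above can be written as a small function. The parameter names are assumptions, "mappings" is read here as the candidate mappings that were classified, and the counts in the example are invented for illustration, not taken from the experiments.

```python
def accuracy(
    correct_classifications: int,
    total_mappings: int,
    missed_correct_mappings: int,
) -> float:
    """Correct classifications over (classified mappings + correct mappings never proposed)."""
    denominator = total_mappings + missed_correct_mappings
    if denominator == 0:
        raise ValueError("no mappings to evaluate")
    return correct_classifications / denominator


# Hypothetical counts: 90 of 100 candidates classified correctly, 5 true mappings never proposed.
example_accuracy = accuracy(90, 100, 5)  # 90 / 105
```

Counting the correct mappings that were never proposed in the denominator penalizes a candidate generator that misses true mappings, so this measure cannot be inflated by proposing fewer candidates.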
18Experimental Results
19Experimental Results
20Related Work
21Conclusion
- The research addresses the problem of mapping
objects between structured web sources
- The experimental results show that Active Atlas
can achieve high accuracy while limiting
user involvement
22Future Work