1
Mapping Maintenance
for Data Integration Systems
  • Robert McCann
  • University of Illinois
  • Joint work with Bedoor AlShebli, Quoc Le, Hoa
    Nguyen, Long Vu, AnHai Doan
  • VLDB 2005

2
Data Integration Systems
Find homes under 300K
mediated schema
source schema 2
source schema 3
source schema 1
wrapper
wrapper
windermere.com
yahoo.com
3
Mapping Maintenance is a Key Bottleneck
  • Constructing mappings has proven difficult
  • (see first speaker)
  • but maintenance often quickly dominates cost
  • E.g., Integrated Genome Database Project [Stein, 03]
  • 12 genomic databases, each remodeled data twice
    per year
  • System broke every two weeks, abandoned after 1
    year
  • E.g., Integration Project at Illinois
  • Integrated 400 DB researcher homepages
  • 2 system administrators, stopped after 3 months

Reducing maintenance costs is now crucial!
4
Problem Definition
mediated schema
mediated schema
?
price location beds baths
180,000 61801 2 2
260,000 98195 3 2
5
Example 1: Change Source Schema or Data
  • Update tuples
  • Change units of price

wrapper
homeseekers.com
6
Example 2: Change Presentation Format
  • Display location as zipcode
  • Rearrange page layout

wrapper
homeseekers.com
7
The MAVERIC Approach
  • Suppose administrator wants to maintain mappings
    for 1 year
  • 1. For a short initial period (e.g., 5 weeks)
  • Administrator manually verifies each mapping
  • MAVERIC probes the source to learn data
    characteristics
  • 2. For remaining time (e.g., 47 weeks)
  • MAVERIC probes the source to observe new data
    instances
  • MAVERIC outputs an alarm if characteristics
    differ
  • If an alarm, administrator repairs mappings

8
Example
  • Training phase

Learned data characteristics
  • Verification phase

If beds < baths, output alarm
If average price < 100,000, output alarm
If layout of attributes changes, output alarm
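
The first two alarm rules above can be sketched as simple checks over one snapshot of wrapped tuples. This is only an illustration, not MAVERIC's implementation: the function name `check_snapshot`, the dict-based record format, and the `min_avg_price` parameter are assumptions, and the layout rule is left out because it needs the HTML page rather than the extracted tuples.

```python
from statistics import mean


def check_snapshot(rows, min_avg_price=100_000):
    """Return a list of alarm messages for one snapshot of wrapped tuples.

    `rows` is a list of dicts with numeric 'price', 'beds' and 'baths'
    keys (a hypothetical representation of the wrapper output).
    """
    alarms = []
    # Rule 1: a listing should not have fewer beds than baths.
    if any(r["beds"] < r["baths"] for r in rows):
        alarms.append("beds < baths")
    # Rule 2: the average price should stay above the learned floor.
    if rows and mean(r["price"] for r in rows) < min_avg_price:
        alarms.append("average price < %d" % min_avg_price)
    return alarms


good = [
    {"price": 180_000, "beds": 2, "baths": 2},
    {"price": 260_000, "beds": 3, "baths": 2},
]
# The same tuples after a hypothetical change of price units
# (dollars to thousands of dollars) at the source.
broken = [
    {"price": 180, "beds": 2, "baths": 2},
    {"price": 260, "beds": 3, "baths": 2},
]

print(check_snapshot(good))
print(check_snapshot(broken))
```

In this sketch the unit change trips only the average-price rule, which is the kind of signal a learned data characteristic is meant to provide.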
9
Contributions
  • Develop core MAVERIC system
  • An ensemble of sensors that exploit multiple
    characteristics of data
  • A combiner that leverages the most effective
    sensors
  • Significantly improve core system
  • Generate synthetic data to improve training
  • Leverage external data to improve training
  • Employ filters to reduce false alarms
  • Extensive evaluation over 114 sources in 6
    domains
  • Core MAVERIC outperforms related work, improving
    F-1 by 4-19%
  • Enhancements further improve F-1 by 2-13%

10
Training the Core MAVERIC System
  • Sensors learn internal profiles of data
    characteristics
  • Combiner learns weight for each sensor

employ Winnow to learn weights
layout of attributes in HTML pages
price location beds / baths
avg value of price
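
The slide only states that Winnow is used to learn the sensor weights. As a rough sketch of what that could look like, the code below applies the standard Winnow2 multiplicative update to binary sensor votes. Turning sensor scores into 0/1 votes, the threshold of half the sensor count, and the names `train_winnow`, `alpha` and `epochs` are all assumptions made for illustration.

```python
def train_winnow(examples, n_sensors, alpha=2.0, epochs=10):
    """Learn one weight per sensor with the Winnow2 update rule.

    `examples` is a list of (votes, label) pairs. `votes` is a tuple of
    0/1 sensor votes (1 = "mapping looks broken") and `label` is 1 if the
    mapping really was broken. Binarizing sensor output is an assumption.
    """
    weights = [1.0] * n_sensors
    threshold = n_sensors / 2  # a conventional Winnow threshold
    for _ in range(epochs):
        for votes, label in examples:
            total = sum(w * v for w, v in zip(weights, votes))
            predicted = 1 if total >= threshold else 0
            if predicted == label:
                continue
            # Missed alarm: promote the sensors that voted "broken".
            # False alarm: demote the sensors that voted "broken".
            factor = alpha if label == 1 else 1.0 / alpha
            weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, threshold


# Hypothetical training data: sensor 0 is reliable, sensors 1 and 2
# sometimes fire on correct mappings.
examples = [
    ((1, 0, 0), 1),
    ((0, 1, 1), 0),
    ((1, 1, 0), 1),
    ((0, 0, 0), 0),
]
weights, threshold = train_winnow(examples, n_sensors=3)
print(weights, threshold)
```

With this toy data the reliable sensor ends up with a larger weight than the two noisy ones, which is the effect the combiner relies on.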
11
Verifying with the Core MAVERIC System
  • Sensors leverage internal profiles to output
    sensor scores
  • Combiner combines scores based upon weights

alarm if combined score ?
score1
new avg price
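
One way to picture the combination step is a weighted average of sensor scores compared against a threshold. The transcript does not preserve the actual threshold or the exact combination formula, so the sketch below assumes scores in [0, 1] where higher means more suspicious, a normalized weighted sum, and a threshold of 0.5. The names `combined_score` and `should_alarm` and the sample weights are hypothetical.

```python
def combined_score(scores, weights):
    """Weighted average of sensor scores (each assumed to lie in [0, 1])."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one sensor needs a non-zero weight")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


def should_alarm(scores, weights, threshold=0.5):
    """Raise an alarm when the combined score exceeds the threshold."""
    return combined_score(scores, weights) > threshold


# Hypothetical weights: the first sensor is trusted four times as much
# as each of the other two.
weights = [2.0, 0.5, 0.5]

# The trusted sensor is suspicious, the others are not.
print(should_alarm([0.9, 0.1, 0.2], weights))
# Only the weakly trusted sensors are suspicious.
print(should_alarm([0.1, 0.9, 0.8], weights))
```

Because the weights favor the first sensor, only the first call raises an alarm in this sketch.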
12
Improving Training via Perturbation
  • Idea: expand training data by generating
    synthetic data
  • Simulate natural source changes during training
  • Source data changes, e.g., insert and delete
    tuples
  • Presentation format changes, e.g., 29.99 becomes
    29.99 USD

perturber - apply change - reapply
wrapper - test results
perturbed results
training data for S
original results
query results at tn
wrapper
source S at tn
System "practices ahead of time"
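
A minimal sketch of the perturbation idea, assuming the two kinds of change named above (deleting tuples and appending a currency code to the price). For brevity it perturbs already-extracted records, whereas the pipeline on the slide perturbs the source and then reapplies the wrapper. All function names, the 50% deletion rate, and the record format are assumptions.

```python
import random


def reformat_price(record, suffix=" USD"):
    """Presentation change: append a currency suffix to the price string."""
    changed = dict(record)  # copy so the original example stays intact
    changed["price"] = record["price"] + suffix
    return changed


def delete_tuples(records, fraction, rng):
    """Data change: keep a random subset, dropping about `fraction` of them."""
    keep = max(1, round(len(records) * (1 - fraction)))
    return rng.sample(records, keep)


def perturb(records, rng):
    """Return one synthetic snapshot derived from the original records."""
    survivors = delete_tuples(records, fraction=0.5, rng=rng)
    return [reformat_price(r) for r in survivors]


original = [
    {"price": "180,000", "location": "61801"},
    {"price": "260,000", "location": "98195"},
]
rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic = perturb(original, rng)
print(synthetic)
```

Each synthetic snapshot can then be added to the training data alongside the original results, so the sensors see plausible changes before the source actually makes them.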
13
Example: Reformatting Price
training data
perturbed training example
original training example
price location beds baths
?
185,000 USD Urbana, IL 3 2
original results
perturbed results
wrapper
wrapper
185,000 Urbana, IL 3bed/2bath
185,000 USD Urbana, IL 3bed/2bath
original HTML
perturbed HTML
homeseekers.com
14
Additional Improvements
  • Improve training by borrowing data from other
    sources

mediated schema
source schema
source schema
comments amount
category price
wrapper
wrapper
This 185,000 USD
house 185,000
S
S
  • Reduce false alarms via filtering
  • Web Search Engines
  • price is 185,000 USD
  • costs 185,000 USD

Other Sources
  • Monetary Recognizers
  • 185,000
  • 185000.00

potentially corrupt attribute
price
price is valid
185,000 USD
amount
210 K
(see paper for details)
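
As an illustration of the "monetary recognizer" style of filter, the sketch below uses a regular expression that accepts the value formats shown on this slide (185,000, 185000.00, 185,000 USD, 210 K). The pattern, the 80% acceptance rule, and the names `is_money` and `price_still_valid` are assumptions and are not taken from the paper.

```python
import re

MONEY = re.compile(
    r"""^\s*\$?\s*                  # optional dollar sign
        (?:\d{1,3}(?:,\d{3})+|\d+)  # 185,000 or 185000
        (?:\.\d{1,2})?              # optional cents, e.g. .00
        (?:\s?[KkMm])?              # optional magnitude suffix, e.g. 210 K
        (?:\s?USD)?\s*$             # optional currency code
    """,
    re.VERBOSE,
)


def is_money(value):
    """Return True if the string looks like a monetary amount."""
    return MONEY.match(value) is not None


def price_still_valid(values, min_fraction=0.8):
    """Suppress an alarm if most flagged values are still monetary."""
    if not values:
        return False
    hits = sum(1 for v in values if is_money(v))
    return hits / len(values) >= min_fraction


flagged = ["185,000 USD", "210 K", "185000.00"]
print(price_still_valid(flagged))
print(price_still_valid(["Urbana, IL", "3bed/2bath"]))
```

In this sketch an alarm on price would be filtered out for the first list, since every value is still recognizably a price, and kept for the second.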
15
Empirical Evaluation
  • Test verification ability over 114 sources in 6
    domains

Domain | Number of Sources | Schema Size (Number of Attributes) | Probing Schedule | Snapshots with Correct Mappings | Snapshots with Broken Mappings
Flights | 19 | 8 | weekly for 10 weeks | 164 | 26
Books | 21 | 6 | weekly for 12 weeks | 210 | 42
Researchers | 60 | 4 | daily for 313 days | 12480 | 6274
Real Estate | 5 | 17 | 11 snapshots per source | 30 | 25
Inventory | 4 | 7 | 11 snapshots per source | 24 | 20
Courses | 5 | 11 | 11 snapshots per source | 30 | 25
16
Core MAVERIC Outperforms Prior Work
  • Compare with recent system
  • [Lerman et al, Journal of AI Research 03]

Domain | Lerman System P / R | Lerman System F-1 | Sensor Ensemble P / R | Sensor Ensemble F-1
Flights | 0.81 / 1.00 | 0.85 | 0.93 / 0.98 | 0.93
Books | 0.83 / 1.00 | 0.89 | 0.90 / 0.99 | 0.93
Researchers | 0.77 / 0.99 | 0.84 | 0.90 / 0.99 | 0.93
Real Estate | 0.45 / 0.90 | 0.63 | 0.80 / 0.82 | 0.82
Inventory | 0.52 / 0.89 | 0.67 | 0.75 / 0.90 | 0.77
Courses | 0.49 / 0.94 | 0.66 | 0.92 / 0.88 | 0.88
Achieve F-1 from 82-93%, an improvement of 4-19%
in all domains
17
Enhancements Boost Performance
  • Progressively enhanced versions of MAVERIC

Each enhancement improved F-1 in at least 4
domains
18
Reasons for Mistakes
  • Unrecognized instance formats
  • E.g., trained over TIME with format 2:00 pm;
    source changed format to 1400, output
    false alarm
  • E.g., trained over DAYS with format M-W-F;
    source changed format to Mon Wed
    Fri, output false alarm
  • Train with additional perturbations? Leverage
    more sources?
  • Attributes with similar values
  • E.g., trained with ORDER-DATE before SHIP-DATE;
    source reversed order, missed alarm
    on reversed values
    (ORDER-DATE 7/13/2004, SHIP-DATE 7/4/2004)
  • Include additional domain constraints?

19
Related Work
  • Schema matching
  • [Dhamankar et al, 04], [He & Chang, 03], [Kang &
    Naughton, 03], [Rahm & Bernstein, 01], [Doan, 01]
  • Quantify semantics to compute matching scores
  • Activity monitoring
  • [Shavlik & Shavlik, 04], [Lazarevic et al, 03],
    [Stolfo et al, 01], [Fawcett & Provost, 99],
    [Allan et al, 98]
  • Profile normal behavior to detect notable events
    (e.g., intrusions)
  • Mapping and wrapper maintenance
  • Wrapper verification: [Lerman et al, 03],
    [Kushmerick, 00]
  • Mapping and wrapper repair: [Velegrakis et al,
    03], [Meng et al, 03], [Chidlovskii, 01]

20
Conclusion & Future Work
  • Developed MAVERIC to reduce maintenance costs
  • An ensemble of sensors that exploit multiple
    characteristics of data
  • Significantly improved core system
  • Perturbation, multi-source training, and
    filtering
  • Extensively evaluated over 114 sources in 6
    domains
  • Core outperformed related work, improving F-1 by
    4-19%
  • Enhancements further improved F-1 by 2-13%
  • Future work
  • Further improve and evaluate MAVERIC
  • Develop a solution for repairing broken mappings