Title: Statistical Schema Matching across Web Query Interfaces

1
Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003

Bin He, Kevin Chen-Chuan Chang

2
Background: Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, The Deep Web]
3
Challenge: matching query interfaces (QIs)
Book Domain
Music Domain
4
Traditional approaches to schema matching: Pairwise Attribute Correspondence

Scale is a challenge
Only small scale
Large scale is a must for our task
Scale is an opportunity
Useful context

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise Attribute Correspondence
S1.author ↔ S3.name
S1.subject ↔ S2.category
5
Deep Web Observation

Proliferating sources
Converging vocabularies

6
Does a hidden schema model exist?

Our View (Hypothesis)

Instantiation probability: P(QI1|M)
Finite Vocabulary
Statistical Model
Generate QIs with different probabilities
[Figure labels: M, P, QI1, QIs]
7
Does a hidden schema model exist?

Our View (Hypothesis)

Instantiation probability: P(QI1|M)
Finite Vocabulary
Statistical Model
Generate QIs with different probabilities
[Figure labels: M, P, QI1, QIs]
Now the problem is: given the QIs, can we discover M?
8
MGS Framework and Goal

Hypothesis modeling
Hypothesis generation
Hypothesis selection
Goal
Verify the phenomena
Validate MGSsd with two metrics

9
Comparison with Related Work
10
Outline

MGS
MGSsd: Hypothesis Modeling, Generation, Selection
Deal with Real World Data
Final Algorithm
Case Study
Metrics
Experimental Results
Conclusion and Future Issues
My Assessment

11
Towards hidden model discovery: statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question
P(QI|M)
12
MGSsd: Specialize MGS for Synonym Discovery

MGS is generally applicable to a wide range of schema matching tasks
E.g., attribute grouping
Focus: discover synonym attributes
Author = Writer, Subject = Category
No hierarchical matching
Query interface as flat schema
No complex matching
(LastName, FirstName) = Author

13
Hypothesis Modeling: Structure

Goal: capture synonym relationships
Two-level model structure
Possible schemas: I1 = {author, title, subject, ISBN}, I2 = {title, category, ISBN}

No overlapping concepts
14
Hypothesis Modeling: Formula

Definition and Formula
Probability that M can generate schema I

15
Hypothesis Modeling: Instantiation probability
1. Observing an attribute
P(author|M) = α1 × β1
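The general formula for Pr(I|M) did not survive in this transcript. The sketch below is one reading of the two-level model, not the authors' stated definition: each concept is chosen independently with probability α, a chosen concept shows exactly one of its attributes with probability β, and an unused concept contributes 1 − α. The function name, the data layout, and the treatment of unknown attributes are assumptions made for illustration. The parameters are those of model M1 on the Examples on Metrics slide.

```python
from typing import Dict, FrozenSet, List, Tuple

# A concept is (alpha, {attribute: beta}); betas within one concept sum to 1.
Concept = Tuple[float, Dict[str, float]]


def instantiation_probability(schema: FrozenSet[str], model: List[Concept]) -> float:
    """Pr(I|M) under the assumed two-level model with mutual exclusion."""
    known = {attr for _, betas in model for attr in betas}
    if not schema <= known:
        return 0.0  # assumption: attributes outside the vocabulary are impossible
    prob = 1.0
    for alpha, betas in model:
        present = [attr for attr in betas if attr in schema]
        if len(present) > 1:
            return 0.0  # synonyms of one concept may not co-occur
        if present:
            prob *= alpha * betas[present[0]]  # concept used, this synonym shown
        else:
            prob *= 1.0 - alpha  # concept not used in this schema
    return prob


# Model M1 from the metrics example: (author:1):0.6, (subject:0.7, category:0.3):1
m1 = [(0.6, {"author": 1.0}), (1.0, {"subject": 0.7, "category": 0.3})]
print(instantiation_probability(frozenset({"author", "subject"}), m1))  # 0.6 * 0.7
print(instantiation_probability(frozenset({"subject", "category"}), m1))  # blocked by mutual exclusion
```

Under this reading a model is consistent with an observed schema exactly when the function returns a value greater than zero.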
16
Consistency check

A set of schemas I as schema observations
<Ii, Bi>: number of occurrences Bi for each Ii
M is consistent if Pr(I|M) > 0
Find consistent models as candidates

17
Hypothesis Generation

Two sub-steps
1. Consistent Concept Construction
2. Build Hypothesis Space

18
Hypothesis Generation: Space pruning

Prune the space of model candidates
Generate M such that P(QI|M) > 0 for any observed QI
Mutual exclusion assumption
Co-occurrence graph
Example
Observations: QI1 = {author, subject} and QI2 = {author, category}
Space of models: any set partition of {author, subject, category}

19
Hypothesis Generation

Prune the space of model candidates
Generate M such that P(QI|M) > 0 for any observed QI
Mutual exclusion assumption
Example
Observations: QI1 = {author, subject} and QI2 = {author, category}
Space of models: any set partition of {author, subject, category}
Model candidates after pruning
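The pruned candidates themselves are missing from the transcript. The following sketch enumerates every set partition of the three attributes and keeps those in which no two co-occurring attributes share a concept, which is how the mutual exclusion assumption is read here. The function names are hypothetical, and brute-force enumeration is only practical for small vocabularies.

```python
from itertools import combinations
from typing import Iterator, List, Set


def set_partitions(items: List[str]) -> Iterator[List[Set[str]]]:
    """Yield every partition of items into non-empty blocks (concepts)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [partition[i] | {first}] + partition[i + 1:]
        # ...or give it a block of its own.
        yield partition + [{first}]


def is_consistent(partition: List[Set[str]], observations: List[Set[str]]) -> bool:
    """Mutual exclusion: attributes observed together cannot be synonyms."""
    for schema in observations:
        for a, b in combinations(sorted(schema), 2):
            if any(a in block and b in block for block in partition):
                return False
    return True


observations = [{"author", "subject"}, {"author", "category"}]
attributes = sorted(set().union(*observations))
candidates = [p for p in set_partitions(attributes) if is_consistent(p, observations)]
for candidate in candidates:
    print(sorted(sorted(block) for block in candidate))
```

Of the five partitions, two survive: one that groups subject with category, and one that keeps all three attributes as separate concepts. These correspond to the structures of M1 and M2 on the Examples on Metrics slide.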

20
Hypothesis Generation (Cont.)

Build Probability Functions
Maximum likelihood estimation
Estimate αi and βj that maximize Pr(I|M)

21
Hypothesis Selection

Rank the model candidates
Select the model that generates the closest distribution to the observations
Approach: hypothesis testing
Example: select schema model at significance level 0.05
3.93 < 7.815: accept
20.20 > 14.067: reject
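The thresholds 7.815 and 14.067 are the upper 5% critical values of the chi-square distribution with 3 and 7 degrees of freedom, which suggests a chi-square goodness-of-fit test. The sketch below follows that reading. The degrees of freedom and the choice of cells are inferred, not stated on the slide, and the helper names are hypothetical. The expected counts are 10 × Pr(I|M1) for the four schemas M1 can generate, using the observation counts and M1 parameters from the Examples on Metrics slide.

```python
from typing import Sequence

# Upper 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592, 7: 14.067}


def chi_square_statistic(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def accept(statistic: float, degrees_of_freedom: int) -> bool:
    """Accept the model when the statistic stays below the 5% critical value."""
    return statistic < CHI2_CRITICAL_05[degrees_of_freedom]


# Cells: {author, subject}, {author, category}, {subject}, {category}
observed = [6, 3, 1, 0]
expected = [4.2, 1.8, 2.8, 1.2]  # assumption: 10 * Pr(I|M1) per cell
stat = chi_square_statistic(observed, expected)
print(round(stat, 2))           # about 3.93, matching the accepted case on the slide
print(accept(stat, 3))          # four cells, so 3 degrees of freedom
print(accept(20.20, 7))         # the rejected case on the slide
```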

22
Dealing with Real-World Data

Head-often, tail-rare distribution
Attribute Selection
Systematically remove rare attributes
Rare Schema Smoothing
Aggregate infrequent schemas into a conceptual event I(rare)
Consensus Projection
Follow the concept mutual independence assumption
Extract and aggregate
New input schemas with re-estimated parameters
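The first two steps can be sketched as simple frequency filters. This is an illustration only: the cutoffs `min_fraction` and `min_count`, the `RARE` marker, and the sample schemas are all hypothetical, and the slides do not say how the authors implement either step.

```python
from collections import Counter
from typing import Dict, FrozenSet, List

RARE = frozenset({"<rare>"})  # hypothetical marker standing in for the event I(rare)


def select_attributes(schemas: List[FrozenSet[str]], min_fraction: float) -> List[FrozenSet[str]]:
    """Attribute selection: drop attributes seen in too few schemas."""
    counts = Counter(attr for schema in schemas for attr in schema)
    keep = {attr for attr, n in counts.items() if n / len(schemas) >= min_fraction}
    projected = [schema & keep for schema in schemas]
    return [schema for schema in projected if schema]  # discard schemas left empty


def smooth_rare_schemas(schemas: List[FrozenSet[str]], min_count: int) -> Dict[FrozenSet[str], int]:
    """Rare schema smoothing: pool schemas observed fewer than min_count times."""
    smoothed: Counter = Counter()
    for schema, n in Counter(schemas).items():
        smoothed[schema if n >= min_count else RARE] += n
    return dict(smoothed)


# Made-up observations with a head-often, tail-rare shape.
schemas = (
    [frozenset({"author", "title"})] * 6
    + [frozenset({"title", "isbn"})] * 3
    + [frozenset({"title", "binding"})] * 1
)
selected = select_attributes(schemas, min_fraction=0.2)  # removes "binding"
print(smooth_rare_schemas(selected, min_count=2))
```

After selection the single {title, binding} schema collapses to {title}, which occurs once and is therefore pooled into the rare event.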

23
Final Algorithm

Two phases
Build initial hypothesis space
Discover the hidden model

Attribute Selection
Extract the common parts of the model candidates of the last iteration
Hypothesis Generation
Combine rare interfaces
Hypothesis Selection
24
Experiment Setup in Case Studies

Over 200 sources on four domains
Threshold f = 10
Significance level 0.05
Can be specified by users

25
Example of the MGSsd Algorithm
M1 = {(ti), (is), (kw), (pr), (fm), (pd), (pu), (su, cg), (au, ln), (fn)}
M2 = {(ti), (is), (kw), (pr), (fm), (pd), (pu), (su, cg), (au, fn), (ln)}
26
Metrics

1. How close it is to the correct schema model
Precision
Recall
2. How well it can answer the target question
Precision
Recall

27
Examples on Metrics

I = {<I1, 6>, <I2, 3>, <I3, 1>}
I1 = {author, subject}, I2 = {author, category}, I3 = {subject}
M1 = {(author:1):0.6, (subject:0.7, category:0.3):1}
M2 = {(author:1):0.6, (subject:1):0.7, (category:1):0.3}
Metrics 1
Pm(M2, Mc) = 0.196 + 0.036 + 0.294 + 0.054 = 0.58
Rm(M2, Mc) = 0.28 + 0.12 + 0.42 + 0.18 = 1
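The Metrics 1 arithmetic can be reproduced with a short sketch. The definitions used here are inferred from the numbers on the slide and are not quoted from the authors: precision sums Pr(I|M2) over the schemas that Mc can generate, recall sums Pr(I|Mc) over the schemas that M2 can generate, and Mc is taken to be M1. The function names and the instantiation probability (independent concepts, one attribute per concept) are likewise assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator, List, Tuple

Concept = Tuple[float, Dict[str, float]]  # (alpha, {attribute: beta})

M1: List[Concept] = [(0.6, {"author": 1.0}), (1.0, {"subject": 0.7, "category": 0.3})]
M2: List[Concept] = [(0.6, {"author": 1.0}), (0.7, {"subject": 1.0}), (0.3, {"category": 1.0})]


def prob(schema: FrozenSet[str], model: List[Concept]) -> float:
    """Assumed Pr(I|M): product over concepts, zero if synonyms co-occur."""
    result = 1.0
    for alpha, betas in model:
        present = [attr for attr in betas if attr in schema]
        if len(present) > 1:
            return 0.0
        result *= alpha * betas[present[0]] if present else 1.0 - alpha
    return result


def all_schemas(attributes: List[str]) -> Iterator[FrozenSet[str]]:
    """Every subset of the vocabulary, including the empty schema."""
    for size in range(len(attributes) + 1):
        for combo in combinations(attributes, size):
            yield frozenset(combo)


def coverage(generator: List[Concept], scorer: List[Concept], attributes: List[str]) -> float:
    """Sum Pr(I|scorer) over the schemas I that `generator` can produce."""
    return sum(prob(s, scorer) for s in all_schemas(attributes) if prob(s, generator) > 0)


attrs = ["author", "subject", "category"]
precision = coverage(M1, M2, attrs)  # Pm(M2, Mc) with Mc = M1
recall = coverage(M2, M1, attrs)     # Rm(M2, Mc) with Mc = M1
print(round(precision, 2), round(recall, 2))
```

With these definitions the four precision terms are 0.196 for {subject}, 0.036 for {category}, 0.294 for {author, subject}, and 0.054 for {author, category}. M2 scores full recall because it can generate every schema that M1 can.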
Metrics 2

28
Experimental Results
The discovered synonyms are all correct ones
Can generate all correct instances

This approach can identify most concepts correctly
Incorrect matchings are due to the small number of observations
Two suites of metrics are needed
Time complexity is exponential

29
Advantages

Scalability: large-scale matching
Solvability: exploit statistical information
Generality

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Holistic Model Discovery
[Figure labels: author, writer, name, subject, category]
30
Conclusions and Future Work