Statistical Schema Matching across Web Query Interfaces

1
Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003
  • Bin He, Kevin Chen-Chuan Chang

2
Background: Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, The Deep Web]
3
Challenge: matching query interfaces (QIs)
[Figure panels: Book Domain, Music Domain]
4
Traditional approaches to schema matching: Pairwise Attribute Correspondence
  • Scale is a challenge
  • Only small scale
  • Large-scale is a must for our task
  • Scale is an opportunity
  • Useful Context

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise Attribute Correspondence
S1.author = S3.name
S1.subject = S2.category
5
Deep Web Observation
  • Proliferating sources
  • Converging vocabularies

6
A hidden schema model exists?
  • Our View (Hypothesis)

[Figure: a statistical model M over a finite vocabulary generates QIs with different probabilities; QI1 is instantiated with probability P(QI1|M)]
7
A hidden schema model exists?
  • Our View (Hypothesis)
  • Now the problem is: given the QIs, can we discover M?

[Figure: a statistical model M over a finite vocabulary generates QIs with different probabilities; QI1 is instantiated with probability P(QI1|M)]
8
MGS framework and Goal
  • Hypothesis modeling
  • Hypothesis generation
  • Hypothesis selection
  • Goal
  • Verify the observed phenomena
  • Validate MGSsd with two metrics

9
Comparison with Related Work
10
Outline
  • MGS
  • MGSsd: Hypothesis Modeling, Generation, Selection
  • Deal with Real World Data
  • Final Algorithm
  • Case Study
  • Metrics
  • Experimental Results
  • Conclusion and Future Issues
  • My Assessment

11
Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question
P(QI|M)
12
MGSsd: Specialize MGS for Synonym Discovery
  • MGS is generally applicable to a wide range of
    schema matching tasks
  • E.g., attribute grouping
  • Focus: discover synonym attributes
  • Author = Writer, Subject = Category
  • No hierarchical matching
  • Query interface as flat schema
  • No complex matching
  • (LastName, FirstName) = Author

13
Hypothesis Modeling: Structure
  • Goal: capture synonym relationships
  • Two-level model structure
  • Possible schemas: I1 = {author, title, subject,
    ISBN}, I2 = {title, category, ISBN}

No overlapping concepts
14
Hypothesis Modeling: Formula
  • Definition and Formula
  • Probability that M can generate schema I
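
The formula itself did not survive in this transcript. A reconstruction that is consistent with the worked numbers on the later "Examples on Metrics" slide is sketched below. It assumes a model M is a set of concepts C_j, each with a concept probability β_j, and that each concept holds attributes A_i with attribute probabilities α_i; the notation is an assumption, not a quotation from the slides.

```latex
\Pr(A_i \mid M) = \alpha_i \,\beta_j \quad (A_i \in C_j),
\qquad
\Pr(I \mid M) =
\begin{cases}
0, & \text{if two attributes of } I \text{ belong to the same concept},\\[4pt]
\displaystyle \prod_{A_i \in I} \Pr(A_i \mid M)\;
\prod_{C_j \cap I = \emptyset} (1 - \beta_j), & \text{otherwise.}
\end{cases}
```

For example, with M1 = {(author:1):0.6, (subject:0.7, category:0.3):1}, the schema {author, subject} gets 0.6 × 0.7 = 0.42, which matches one of the terms in the metrics example.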

15
Hypothesis Modeling: Instantiation probability
1. Observing an attribute
P(author|M) = α1 × β1
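
A minimal Python sketch of the instantiation probability, assuming a model is represented as a list of `(beta, {attribute: alpha})` concepts. The representation and the function names `attribute_probability` and `schema_probability` are illustrative assumptions; the schema-level rule (multiply by 1 − β for every concept the schema does not use) is inferred from the numbers on the metrics slide, not stated here.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed representation: a concept is (beta, {attribute: alpha}).
Concept = Tuple[float, Dict[str, float]]
Model = List[Concept]


def attribute_probability(model: Model, attribute: str) -> float:
    """P(attribute | M) = alpha * beta of the concept that holds the attribute."""
    for beta, attrs in model:
        if attribute in attrs:
            return attrs[attribute] * beta
    return 0.0  # attribute unknown to the model


def schema_probability(model: Model, schema: Iterable[str]) -> float:
    """P(schema | M) under mutual exclusion within a concept."""
    remaining = set(schema)
    prob = 1.0
    for beta, attrs in model:
        picked = remaining & set(attrs)
        if len(picked) > 1:
            return 0.0  # two synonyms in one schema cannot be generated
        if picked:
            (attr,) = picked
            prob *= beta * attrs[attr]
        else:
            prob *= 1.0 - beta  # concept not instantiated in this schema
        remaining -= picked
    # Any attribute left over is outside the model's vocabulary.
    return prob if not remaining else 0.0


# M1 as written on the metrics slide.
M1: Model = [(0.6, {"author": 1.0}), (1.0, {"subject": 0.7, "category": 0.3})]
p_author = attribute_probability(M1, "author")
p_schema = schema_probability(M1, {"author", "subject"})
```

Here `p_author` is α1 × β1 for the author concept, and a schema containing both subject and category gets probability zero because the two share a concept.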
16
Consistency check
  • A set of schemas I as schema observations
  • <Ii, Bi>: number of occurrences Bi for each Ii
  • M is consistent if Pr(I|M) > 0
  • Find consistent models as candidates
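
The consistency check can be sketched structurally. Assuming positive probabilities are fitted afterwards, Pr(I|M) > 0 comes down to two conditions: every observed attribute belongs to some concept, and no observed schema contains two attributes of the same concept. The function name `is_consistent` and the set-based representation are assumptions made for illustration.

```python
from itertools import combinations
from typing import Iterable, List, Set


def is_consistent(concepts: List[Set[str]], observed_schemas: Iterable[Set[str]]) -> bool:
    """Return True if the concept grouping can generate every observed schema."""
    concept_of = {attr: idx for idx, concept in enumerate(concepts) for attr in concept}
    for schema in observed_schemas:
        # Every attribute must be covered by the model's vocabulary.
        if any(attr not in concept_of for attr in schema):
            return False
        # Two attributes of one concept may not co-occur in a schema.
        for a, b in combinations(sorted(schema), 2):
            if concept_of[a] == concept_of[b]:
                return False
    return True


observations = [{"author", "subject"}, {"author", "category"}]
ok = is_consistent([{"author"}, {"subject", "category"}], observations)
bad = is_consistent([{"author", "subject"}, {"category"}], observations)
```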

17
Hypothesis Generation
  • Two sub-steps
  • 1. Consistent Concept Construction
  • 2. Build Hypothesis Space

18
Hypothesis Generation: Space pruning
  • Prune the space of model candidates
  • Generate M such that P(QI|M) > 0 for any observed
    QI
  • Mutual exclusion assumption
  • Co-occurrence graph
  • Example
  • Observations: QI1 = {author, subject} and QI2 =
    {author, category}
  • Space of models: any set partition of {author,
    subject, category}

19
Hypothesis Generation
  • Prune the space of model candidates
  • Generate M such that P(QI|M) > 0 for any observed
    QI
  • Mutual exclusion assumption
  • Example
  • Observations: QI1 = {author, subject} and QI2 =
    {author, category}
  • Space of models: any set partition of {author,
    subject, category}
  • Model candidates after pruning
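
The pruning example can be replayed with a brute-force sketch: enumerate every set partition of the three attributes and drop any partition that puts two co-occurring attributes in the same concept. Exhaustive enumeration is used only because the example is tiny; it is not the co-occurrence-graph construction the slides mention, and the helper names are hypothetical.

```python
from typing import Iterator, List, Set


def set_partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def violates_mutual_exclusion(partition: List[List[str]], observations: List[Set[str]]) -> bool:
    """True if some concept holds two attributes that co-occur in an observed QI."""
    return any(len(set(block) & schema) > 1 for block in partition for schema in observations)


def candidate_models(attributes: List[str], observations: List[Set[str]]) -> List[List[List[str]]]:
    return [p for p in set_partitions(attributes)
            if not violates_mutual_exclusion(p, observations)]


observations = [{"author", "subject"}, {"author", "category"}]
attributes = ["author", "subject", "category"]
candidates = candidate_models(attributes, observations)
```

Of the five partitions of {author, subject, category}, two survive in this sketch: one that keeps all three attributes separate and one that groups subject with category.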

20
Hypothesis Generation (Cont.)
  • Build Probability Functions
  • Maximum likelihood estimation
  • Estimate αi and βj that maximize Pr(I|M)
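
For a fixed concept grouping, the maximizing values have a simple closed form if concepts are instantiated independently, as the instantiation probability above suggests: β is the fraction of observed interfaces that use the concept, and α is the attribute's share within those. The sketch below assumes that reading; the function name and the observation counts are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def estimate_parameters(
    concepts: List[Set[str]],
    observations: List[Tuple[Set[str], int]],
) -> List[Tuple[float, Dict[str, float]]]:
    """Return [(beta, {attribute: alpha})] from (schema, count) observations."""
    total = sum(count for _, count in observations)
    model = []
    for concept in concepts:
        attr_counts: Counter = Counter()
        for schema, count in observations:
            # Under mutual exclusion at most one attribute matches per schema.
            for attr in concept & schema:
                attr_counts[attr] += count
        concept_count = sum(attr_counts.values())
        beta = concept_count / total
        alphas = (
            {attr: attr_counts[attr] / concept_count for attr in concept}
            if concept_count else {}
        )
        model.append((beta, alphas))
    return model


# Hypothetical counts: 7 interfaces ask for subject, 3 for category.
observations = [({"author", "subject"}, 7), ({"author", "category"}, 3)]
fitted = estimate_parameters([{"author"}, {"subject", "category"}], observations)
```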

21
Hypothesis Selection
  • Rank the model candidates
  • Select the model that generates the closest
    distribution to the observations
  • Approach: hypothesis testing
  • Example: select schema model at significance
    level 0.05
  • χ² = 3.93
    3.93 < 7.815: accept
  • χ² = 20.20
    20.20 > 14.067: reject
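
The two thresholds in the example, 7.815 and 14.067, are the standard chi-square critical values at the 0.05 level for 3 and 7 degrees of freedom. A minimal sketch of the accept/reject decision follows; the small lookup table and the function names are assumptions, and how the degrees of freedom are counted for a given model is not shown in this transcript.

```python
from typing import Sequence

# Upper 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488,
    5: 11.070, 6: 12.592, 7: 14.067,
}


def chi_square_statistic(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def accept_model(statistic: float, degrees_of_freedom: int) -> bool:
    """Accept the model if the statistic stays below the critical value."""
    return statistic < CHI2_CRITICAL_05[degrees_of_freedom]


first = accept_model(3.93, 3)    # statistic from the example
second = accept_model(20.20, 7)  # statistic from the example
```

The expected counts would come from the candidate model's schema probabilities multiplied by the number of observed interfaces.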

22
Dealing with Real-World Data
  • Head-often, tail-rare distribution
  • Attribute Selection
  • Systematically remove rare attributes
  • Rare Schema Smoothing
  • Aggregate infrequent schemas into a
    conceptual event I(rare)
  • Consensus Projection
  • Follows the concept mutual independence assumption
  • Extract and aggregate
  • New input schemas with re-estimated parameters
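
The first two steps can be sketched as simple preprocessing over (schema, count) observations. The exact criteria are not given in this transcript, so the relative-frequency threshold `min_fraction`, the count threshold `min_count`, and both function names are assumptions.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def select_attributes(
    observations: List[Tuple[FrozenSet[str], int]],
    min_fraction: float,
) -> Counter:
    """Drop attributes seen in fewer than `min_fraction` of the interfaces."""
    total = sum(count for _, count in observations)
    freq: Counter = Counter()
    for schema, count in observations:
        for attr in schema:
            freq[attr] += count
    keep = {attr for attr, c in freq.items() if c / total >= min_fraction}
    projected: Counter = Counter()
    for schema, count in observations:
        reduced = frozenset(schema & keep)
        if reduced:  # schemas left empty are discarded
            projected[reduced] += count
    return projected


def smooth_rare_schemas(
    schema_counts: Dict[FrozenSet[str], int],
    min_count: int,
) -> Tuple[Dict[FrozenSet[str], int], int]:
    """Keep frequent schemas; pool the rest into one aggregate 'rare' count."""
    frequent = {s: c for s, c in schema_counts.items() if c >= min_count}
    rare_total = sum(c for c in schema_counts.values() if c < min_count)
    return frequent, rare_total


# Hypothetical observations: 'binding' is a rare attribute.
raw = [
    (frozenset({"author", "title"}), 8),
    (frozenset({"author", "title", "binding"}), 1),
    (frozenset({"title"}), 1),
]
selected = select_attributes(raw, min_fraction=0.2)
frequent, rare_total = smooth_rare_schemas(selected, min_count=2)
```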

23
Final Algorithm
  • Two phases
  • Build initial hypothesis space
  • Discover the hidden model

Attribute Selection
Extract the common parts of the model candidates of the last iteration
Hypothesis Generation
Combine rare interfaces
Hypothesis Selection
24
Experiment Setup in Case Studies
  • Over 200 sources in four domains
  • Threshold f = 10
  • Significance level 0.05
  • Can be specified by users

25
Example of the MGSsd Algorithm
M1 = {(ti), (is), (kw), (pr), (fm), (pd), (pu), (su, cg), (au, ln), (fn)}
M2 = {(ti), (is), (kw), (pr), (fm), (pd), (pu), (su, cg), (au, fn), (ln)}
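
The two candidates differ only in how au, ln and fn are grouped. A short sketch makes the agreement and disagreement explicit by treating each model as a set of concepts; intersecting them is one plausible reading of extracting the common parts of model candidates, not a confirmed detail of the algorithm.

```python
from typing import FrozenSet, List, Set, Tuple


def as_model(concepts: List[Tuple[str, ...]]) -> Set[FrozenSet[str]]:
    """Represent a model as a set of concepts, ignoring order."""
    return {frozenset(c) for c in concepts}


M1 = as_model([("ti",), ("is",), ("kw",), ("pr",), ("fm",), ("pd",), ("pu",),
               ("su", "cg"), ("au", "ln"), ("fn",)])
M2 = as_model([("ti",), ("is",), ("kw",), ("pr",), ("fm",), ("pd",), ("pu",),
               ("su", "cg"), ("au", "fn"), ("ln",)])

common = M1 & M2               # concepts both candidates agree on
disputed = (M1 | M2) - common  # concepts found in only one candidate
```

Both candidates agree that su and cg are synonyms; they disagree on whether au pairs with ln or with fn.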
26
Metrics
  • 1. How close it is to the correct schema model
  • Precision
  • Recall
  • 2. How well it can answer the target question
  • Precision
  • Recall

27
Examples on Metrics
  • I = {<I1, 6>, <I2, 3>, <I3, 1>}
  • I1 = {author, subject}, I2 = {author, category},
    I3 = {subject}
  • M1 = {(author:1):0.6, (subject:0.7, category:0.3):1}
  • M2 = {(author:1):0.6, (subject:1):0.7,
    (category:1):0.3}
  • Metrics 1
  • Pm(M2, Mc) = 0.196 + 0.036 + 0.294 + 0.054 = 0.58
  • Rm(M2, Mc) = 0.28 + 0.12 + 0.42 + 0.18 = 1
  • Metrics 2
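
The Metrics 1 totals can be reproduced under two assumptions that are not spelled out in this transcript: the correct model Mc is M1, and the metrics are read as probability mass. Under that reading, Pm(M2, Mc) is the mass M2 puts on schemas that Mc can generate, and Rm(M2, Mc) is the mass Mc puts on schemas that M2 can generate. The representation and function names below are illustrative.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple

Model = List[Tuple[float, Dict[str, float]]]  # [(beta, {attribute: alpha})]


def schema_probability(model: Model, schema: Iterable[str]) -> float:
    """P(schema | M): beta*alpha per used concept, (1 - beta) per unused one."""
    remaining = set(schema)
    prob = 1.0
    for beta, attrs in model:
        picked = remaining & set(attrs)
        if len(picked) > 1:
            return 0.0
        if picked:
            (attr,) = picked
            prob *= beta * attrs[attr]
        else:
            prob *= 1.0 - beta
        remaining -= picked
    return prob if not remaining else 0.0


def all_schemas(attributes: Iterable[str]) -> List[FrozenSet[str]]:
    """Every subset of the vocabulary, including the empty schema."""
    attrs = sorted(attributes)
    return [frozenset(c) for r in range(len(attrs) + 1) for c in combinations(attrs, r)]


def model_precision(hypothesis: Model, correct: Model, attributes: Iterable[str]) -> float:
    return sum(schema_probability(hypothesis, s) for s in all_schemas(attributes)
               if schema_probability(correct, s) > 0)


def model_recall(hypothesis: Model, correct: Model, attributes: Iterable[str]) -> float:
    return sum(schema_probability(correct, s) for s in all_schemas(attributes)
               if schema_probability(hypothesis, s) > 0)


M1: Model = [(0.6, {"author": 1.0}), (1.0, {"subject": 0.7, "category": 0.3})]
M2: Model = [(0.6, {"author": 1.0}), (0.7, {"subject": 1.0}), (0.3, {"category": 1.0})]
ATTRS = {"author", "subject", "category"}

precision = model_precision(M2, M1, ATTRS)  # about 0.58
recall = model_recall(M2, M1, ATTRS)        # about 1
```

M2 loses precision because it spends probability on schemas M1 rules out, such as one that lists both subject and category.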

28
Experimental Results
The discovered synonyms are all correct ones
Can generate all correct instances
  • This approach can identify most concepts
    correctly
  • Incorrect matchings are due to too few observations
  • Two suites of metrics are indeed needed
  • Time complexity is exponential

29
Advantages
  • Scalability: large-scale matching
  • Solvability: exploit statistical information
  • Generality

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
[Figure: Holistic Model Discovery over the attributes category, author, name, subject, writer]
30
Conclusions and Future Work
  • Holistic statistical schema matching of massive
    sources
  • MGS framework to find synonym attributes
  • Discover hidden models
  • Suited to large-scale databases
  • Results verify the observed phenomena and show
    accuracy and effectiveness
  • Future Issues
  • Complex matching: (Last Name, First Name) =
    Author
  • More efficient approximation algorithm
  • Incorporating other matching techniques

31
My Assessments
  • Promise
  • Uses minimal, light-weight information: attribute
    names
  • Effective with sufficient instances
  • Leverages the challenge of scale as an opportunity
  • Limitation
  • Need sufficient observations
  • Simple Assumptions
  • Exponential time complexity
  • Homonyms

32
Questions