Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach

Bin He. Joint work with Kevin Chen-Chuan Chang and Jiawei Han.

1
Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach

Bin He
Joint work with Kevin Chen-Chuan Chang, Jiawei Han
Univ. Illinois at Urbana-Champaign

2
Context: MetaQuerier, large-scale integration of the deep Web
(Figure labels: Query, Result, MetaQuerier, The Deep Web)
3
Challenge: matching query interfaces (QIs)
Book Domain
m:n complex matching
1:1 simple matching
Music Domain
4
Demo.
5
Traditional approaches of schema matching: pairwise attribute correspondence
Pairwise Matching

But, scale is a challenge
How to address the challenge of large scale?
And, scale is an opportunity!
How to leverage the opportunity of large scale?

S1: author, title, subject, ISBN
S2: name, title, category, format
Pairwise Attribute Correspondence
S1.author = S2.name
S1.subject = S2.category
6
A holistic schema matching paradigm
Input: set of schemas
Output: semantic model, for all attribute matchings
Holistic Schema Matching
author = name = writer
subject = category
format = binding
7
Holistic matching is, in essence: data mining to discover semantics for information integration
Generation

Our Hypothesis

Semantics (semantic correspondences)
Observations (attribute occurrences)
Hidden Regularities

Our Approach

Statistical Analysis -- for Model Discovery
8
Regularity: co-occurrence patterns
Author
Last Name, First Name

Grouping Attributes
Synonym Attributes
(Figure panels: (a) amazon.com, (b) www.randomhouse.com, (c) bn.com, (d) 1bookstreet.com)
9
Schema matching as correlation mining

Across many sources:
Synonym attributes are negatively correlated
synonym attributes are semantic alternatives
thus, they rarely co-occur in query interfaces
Grouping attributes are positively correlated
grouping attributes are semantic complements
thus, they often co-occur in query interfaces

10
Data preparation: prepare schema transactions to be mined

Interface Extraction
SIGMOD04
Type Recognition
Types are not declared in Web interfaces
Identify types from instance values, e.g., integer, datetime
Used to constrain merging and matching
Syntactic Merging
merge attributes with syntactically similar names
e.g., "title of book" to "title", "author's name" to "author"
merge attributes with syntactically similar instance values
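
The type-recognition step can be illustrated with a minimal Python sketch. The slides do not say how types are actually recognized, so everything here is an assumption: the function name `recognize_type`, the two date formats, and the choice to return "any" when an attribute has no instance values (for example, a free-text box).

```python
from datetime import datetime

# Hypothetical date formats; the slides do not list which ones are handled.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def _is_integer(value: str) -> bool:
    try:
        int(value)
        return True
    except ValueError:
        return False

def _is_datetime(value: str) -> bool:
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

def recognize_type(instance_values: list[str]) -> str:
    """Guess an attribute type from its instance values (illustrative only)."""
    values = [v.strip() for v in instance_values if v.strip()]
    if not values:
        return "any"  # assumption: no instance values means the type is unconstrained
    if all(_is_integer(v) for v in values):
        return "integer"
    if all(_is_datetime(v) for v in values):
        return "datetime"
    return "string"
```

For example, `recognize_type(["1", "2", "3"])` gives "integer", while a select list such as `["hardcover", "paperback"]` falls back to "string".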

11
DCM: Dual Correlation Mining framework
1. Positive correlation mining, to find potential groups
Mining positive correlations
{Last Name (any), First Name (any)}
2. Negative correlation mining, to find potential matchings
Mining negative correlations
{Author (any)} = {Last Name (any), First Name (any)}
{ISBN (any)} = {Last Name (any), First Name (any)}
3. Matching selection, as model construction
{Author (any)} = {Last Name (any), First Name (any)}
{Subject (string)} = {Category (string)}
{Format (string)} = {Binding (string)}
12
Correlation measure for qualification

To find groups and matchings that pass the correlation threshold
Observation: pairwise correlations
e.g., in the Airfares domain, {to} = {arrival city} = {destination}
to and arrival city are negatively correlated
to and destination are negatively correlated
arrival city and destination are negatively correlated
Measure
m: some correlation measure for two items
supports downward closure, which enables the Apriori algorithm
accommodates different measures m

$C_{\min} = \min m(A_i, A_j)$, for all $i \neq j$
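
The qualification measure on this slide can be sketched in a few lines of Python. The pairwise scores below are made up for illustration (they are not from the talk), and `toy_measure` stands in for whatever correlation measure m is chosen.

```python
from itertools import combinations

def c_min(attrs, m):
    """Smallest pairwise correlation m(Ai, Aj) over all pairs with i != j."""
    return min(m(a, b) for a, b in combinations(attrs, 2))

# Made-up negative-correlation scores for the Airfares example (assumption).
SCORES = {
    frozenset({"to", "arrival city"}): 0.9,
    frozenset({"to", "destination"}): 0.8,
    frozenset({"arrival city", "destination"}): 0.7,
}

def toy_measure(a, b):
    # Hypothetical stand-in for the pairwise measure m.
    return SCORES[frozenset({a, b})]

score_triple = c_min(("to", "arrival city", "destination"), toy_measure)
score_pair = c_min(("to", "arrival city"), toy_measure)
```

Because the minimum over more pairs can only stay the same or drop, a superset never scores higher than any of its subsets; that is the downward closure property the Apriori algorithm relies on.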
13
The mining process: a standard Apriori algorithm
Schema transactions: Departure City, Destination, ...

Correlated items with length 2:
{Destination, To}, {Destination, Arrival City}, {To, Arrival City}, {Departure City, From}, ...
From, To
Departure City, Arrival City
Correlated items with length 3:
{Destination, To, Arrival City}, ...
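
A level-wise search of this kind could look like the following sketch. It is a generic Apriori-style loop under stated assumptions, not the authors' implementation: the scores in `SCORES` are invented, unlisted pairs are assumed to score 0, and the names `mine_correlated_sets`, `m`, and `ATTRIBUTES` are hypothetical.

```python
from itertools import combinations

# Made-up pairwise scores (higher = more negatively correlated); assumption only.
SCORES = {
    frozenset({"to", "arrival city"}): 0.9,
    frozenset({"to", "destination"}): 0.8,
    frozenset({"arrival city", "destination"}): 0.7,
    frozenset({"from", "departure city"}): 0.9,
}

def m(a, b):
    # Hypothetical pairwise measure; unlisted pairs are treated as uncorrelated.
    return SCORES.get(frozenset({a, b}), 0.0)

def mine_correlated_sets(attributes, measure, threshold):
    """Level-wise search for attribute sets whose minimum pairwise score passes the threshold."""
    # Level 2: keep every pair that passes the threshold.
    current = {frozenset(pair) for pair in combinations(sorted(attributes), 2)
               if measure(*pair) >= threshold}
    qualified = set(current)
    size = 2
    while current:
        size += 1
        # Join step: merge two sets of the previous level into one of the next size.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size-1)-subset must already qualify. For a min-based
        # measure this check is also sufficient, since it covers every pair.
        current = {c for c in candidates
                   if all(frozenset(s) in qualified for s in combinations(c, size - 1))}
        qualified |= current
    return qualified

ATTRIBUTES = ["to", "arrival city", "destination", "from", "departure city"]
result = mine_correlated_sets(ATTRIBUTES, m, threshold=0.5)
```

With these toy scores the search returns four pairs and the single triple {to, arrival city, destination}, mirroring the length-2 and length-3 items on the slide.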
14
Correlation measures for ranking

To rank and select matchings in model construction
The qualification measure is not good for ranking
a set cannot win over its subset, because of downward closure
e.g., min(1, 2, 3) < min(2, 3)
but a superset contains more matchings and should be preferred
Ranking measure
A set does not win over its superset
On a tie, break the tie by semantic richness
{A1} = {A2} = {A3} is semantically richer than {A1} = {A2}
{A1} = {A2, A3} is semantically richer than {A1} = {A2}

$C_{\max} = \max m(A_i, A_j)$, for all $i \neq j$
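
The ranking idea can be sketched as follows. This is a simplification under assumptions: candidates are flat attribute tuples, the scores are invented, and "semantic richness" is approximated by the number of attributes, which is cruder than the comparison described on the slide.

```python
from itertools import combinations

# Made-up pairwise scores for illustration (assumption).
SCORES = {
    frozenset({"to", "arrival city"}): 0.9,
    frozenset({"to", "destination"}): 0.8,
    frozenset({"arrival city", "destination"}): 0.7,
}

def m(a, b):
    # Hypothetical pairwise measure.
    return SCORES[frozenset({a, b})]

def c_max(attrs, measure):
    """Largest pairwise correlation m(Ai, Aj) over all pairs with i != j."""
    return max(measure(a, b) for a, b in combinations(attrs, 2))

def rank_matchings(candidates, measure):
    """Sort by C_max, highest first; on a tie, the larger attribute set comes first."""
    return sorted(candidates,
                  key=lambda attrs: (c_max(attrs, measure), len(attrs)),
                  reverse=True)

candidates = [
    ("arrival city", "destination"),
    ("to", "arrival city"),
    ("to", "arrival city", "destination"),
]
ranked = rank_matchings(candidates, m)
```

Here the triple and the pair ("to", "arrival city") tie on C_max, and the tie-break puts the triple first, so the superset is preferred.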
15
Choosing the m: measuring the correlation of two items

Contingency table

We explore 22 measures, e.g.,

$\mathrm{Lift} = f_{00} f_{11} / (f_{01} f_{10})$
$\mathrm{Jaccard} = f_{11} / (f_{11} + f_{01} + f_{10})$
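
To make the contingency counts concrete, here is a small sketch that derives them from schema transactions and evaluates the two measures as written on the slide. The transactions are invented, and the index convention (first digit for the first attribute, second digit for the second, 1 = present) is an assumption, since the table itself did not survive in the transcript.

```python
def contingency(transactions, p, q):
    """Return (f11, f10, f01, f00) for attributes p and q.

    Assumed convention: f10 counts schemas with p but without q.
    """
    f11 = sum(1 for t in transactions if p in t and q in t)
    f10 = sum(1 for t in transactions if p in t and q not in t)
    f01 = sum(1 for t in transactions if p not in t and q in t)
    f00 = len(transactions) - f11 - f10 - f01
    return f11, f10, f01, f00

def lift(f11, f10, f01, f00):
    # Formula as written on the slide; undefined when f01 or f10 is zero.
    return f00 * f11 / (f01 * f10)

def jaccard(f11, f10, f01, f00):
    # f00 is accepted for a uniform signature but is not used.
    return f11 / (f11 + f01 + f10)

# Made-up schema transactions, one set of attribute names per query interface.
transactions = [
    {"author", "title"},
    {"author", "title", "isbn"},
    {"title"},
    {"author"},
    {"isbn"},
]
counts = contingency(transactions, "author", "title")
```

For these five toy schemas the counts are (2, 1, 1, 1), which gives a Lift of 2.0 and a Jaccard of 0.5.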
16
Choosing the m: the problems of existing measures

Co-presence (f11) is more important than co-absence (f00)

Less positive correlation but a higher Lift: 17
More positive correlation but a lower Lift: 0.69

Rare attributes are not statistically convincing

Ap as a rare attribute: Jaccard = 0.02
No rare attributes: Jaccard = 0.02
17
Choosing the m: H-measure
$H = f_{01} f_{10} / (f_{+1} f_{1+})$

H-measure

Ignores the co-absence
Less positive correlation: H = 0.25
More positive correlation: H = 0.07
Differentiates the subtlety of negative correlations
Ap as a rare attribute: H = 0.49
No rare attributes: H = 0.92
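
A direct transcription of the H-measure into Python might look like this. The marginals are taken to be f1+ = f11 + f10 and f+1 = f11 + f01, the usual contingency-table reading; the function name and the sample counts are illustrative assumptions.

```python
def h_measure(f11, f10, f01):
    """H = f01 * f10 / (f+1 * f1+); the co-absence count f00 is not an input."""
    f1_plus = f11 + f10  # schemas containing the first attribute
    f_plus1 = f11 + f01  # schemas containing the second attribute
    return (f01 * f10) / (f_plus1 * f1_plus)

# Two attributes that never co-occur reach the maximum value of 1.
never_together = h_measure(f11=0, f10=5, f01=5)

# Two attributes that mostly co-occur score low.
mostly_together = h_measure(f11=2, f10=1, f01=1)
```

Since f00 never enters the formula, adding more schemas that contain neither attribute leaves H unchanged, which is the "ignore the co-absence" property listed on the slide.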
18
Experimental setup

447 deep Web sources in 8 domains
Domains
Travel: Airfares, Hotels, Car Rentals
Entertainment: Books, Movies, Music Records
Living: Jobs, Automobiles
Available as the TEL-8 dataset in the UIUC Web Integration Repository
http://metaquerier.cs.uiuc.edu/repository/

19
Results in Books and Airfares domains

Books
{author (any)} = {last name (any), first name (any)}
{subject (string)} = {category (string)}
{format (string)} = {binding (string)}
Airfares
{passenger (integer)} = {adult (integer), child (integer), infant (integer)}
{from (string)} = {departure city (string)} = {depart (string)}
{departure date (datetime)} = {depart (datetime)}
{return date (datetime)} = {return (datetime)}
{class (string)} = {cabin (string)}
{destination (string)} = {to (string)} = {departure city (string), arrival city (string)}

20
Contributions

Insight
We build a conceptually novel connection between
data integration and correlation mining
schema matching as a new application of
correlation mining
correlation mining as a new approach for schema
matching
Techniques
The dual correlation mining framework
Measures for qualification and ranking
H-measure, robust for negative correlations

21
Thank You!
