Title: Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
1Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
- Bin He
- Joint work with Kevin Chen-Chuan Chang, Jiawei Han
- Univ. Illinois at Urbana-Champaign
2Context: MetaQuerier --- Large-scale integration of the deep Web
[Figure: MetaQuerier architecture, with labels Query, Result, MetaQuerier, The Deep Web]
3Challenge: Matching query interfaces (QIs)
Book Domain
m:n complex matching
1:1 simple matching
Music Domain
4Demo.
5Traditional approaches of schema matching: Pairwise attribute correspondence
Pairwise Matching
- But, scale is a challenge
- How to address the challenge of large scale?
- And, scale is an opportunity!
- How to leverage the opportunity of large scale?
S1: author, title, subject, ISBN
S2: name, title, category, format
Pairwise Attribute Correspondence
S1.author = S2.name
S1.subject = S2.category
6A holistic schema matching paradigm
Input: Set of schemas
Output: Semantic model for all attribute matchings
Holistic Schema Matching
author = name = writer
subject = category
format = binding
7Holistic matching is, in essence: Data mining to discover semantics for information integration
Generation
Semantics (semantic correspondences)
Observations (attribute occurrences)
Hidden Regularities
Statistical Analysis -- for Model Discovery
8Regularity: Co-occurrence patterns
[Figure: query interfaces of (a) amazon.com, (b) www.randomhouse.com, (c) bn.com, (d) 1bookstreet.com]
Synonym Attributes: Author; Last Name, First Name
Grouping Attributes: Last Name, First Name
9Schema matching as correlation mining
- Across many sources
- Synonym attributes with negative correlation
- synonym attributes are semantic alternatives
- thus, rarely co-occur in query interfaces
- Grouping attributes with positive correlation
- grouping attributes are semantic complements
- thus, often co-occur in query interfaces
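As a minimal sketch of what "co-occur" means operationally, the counts below tally how often two attributes appear together or apart across a set of interfaces. The four transactions and the function name `contingency` are hypothetical, made up for illustration; the symbols f11, f10, f01, f00 follow the notation used later in the slides.

```python
# Hypothetical schema transactions: one set of attribute names per query interface.
transactions = [
    {"author", "title", "subject"},
    {"last name", "first name", "title", "category"},
    {"author", "title", "category"},
    {"last name", "first name", "title", "subject"},
]

def contingency(a, b, transactions):
    """Return (f11, f10, f01, f00) for attributes a and b.

    f11: both present, f10: only a, f01: only b, f00: neither.
    """
    f11 = sum(1 for t in transactions if a in t and b in t)
    f10 = sum(1 for t in transactions if a in t and b not in t)
    f01 = sum(1 for t in transactions if a not in t and b in t)
    f00 = len(transactions) - f11 - f10 - f01
    return f11, f10, f01, f00

# Synonym candidates never co-occur in this toy data (f11 = 0).
print(contingency("author", "last name", transactions))      # (0, 2, 2, 0)
# Grouping candidates always co-occur (f10 = f01 = 0).
print(contingency("last name", "first name", transactions))  # (2, 0, 0, 2)
```

A correlation measure m is then a function of these four counts.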
10Data preparation: Prepare schema transactions to be mined
- Interface Extraction
- SIGMOD04
- Type Recognition
- Type is not declared in Web interfaces
- Identify types from instance values,
- e.g., integer, datetime
- Used for constraining merging and matching
- Syntactic Merging
- merge attributes with syntactically similar names
- e.g., "title of book" to "title", "author's name" to "author"
- merge attributes with syntactically similar instance values
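The type recognition step can be pictured roughly as follows. This is only a sketch under our own assumptions: the slides do not give the recognition rules, so the date formats, the helper names, and the choice of "any" for attributes without instance values are all hypothetical.

```python
from datetime import datetime

# Assumed date formats, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def is_integer(value):
    try:
        int(value)
        return True
    except ValueError:
        return False

def is_datetime(value):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

def recognize_type(instance_values):
    """Guess a type label from the instance values of one attribute."""
    values = [v.strip() for v in instance_values if v.strip()]
    if not values:
        return "any"       # assumption: no instance values to inspect
    if all(is_integer(v) for v in values):
        return "integer"
    if all(is_datetime(v) for v in values):
        return "datetime"
    return "string"

print(recognize_type(["1", "2", "3"]))               # integer
print(recognize_type(["2004-08-22", "12/25/2004"]))  # datetime
print(recognize_type(["hardcover", "paperback"]))    # string
print(recognize_type([]))                            # any
```

The recognized type can then constrain merging and matching, for example by only comparing attributes whose types are compatible.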
11DCM: Dual Correlation Mining framework
1. Positive correlation mining as potential groups
Mining positive correlations
{Last Name (any), First Name (any)}
2. Negative correlation mining as potential matchings
Mining negative correlations
{Author (any)} = {Last Name (any), First Name (any)}
{ISBN (any)} = {Last Name (any), First Name (any)}
3. Matching selection as model construction
{Author (any)} = {Last Name (any), First Name (any)}
{Subject (string)} = {Category (string)}
{Format (string)} = {Binding (string)}
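Step 3 has to drop candidates such as the ISBN matching while keeping the Author one. The slides do not spell out the selection procedure, so the following is only one plausible reading: a greedy pass over candidates that are already ranked, skipping any candidate that reuses an attribute of a matching kept earlier. The function name, the ranking order, and the conflict rule are assumptions.

```python
def select_matchings(ranked):
    """Greedy selection over matchings ordered from best to worst.

    Each matching is a tuple of groups; each group is a frozenset of attributes.
    A matching is skipped if it shares any attribute with one already selected.
    """
    selected, used = [], set()
    for matching in ranked:
        attrs = set().union(*matching)
        if attrs & used:
            continue  # conflicts with a higher-ranked matching
        selected.append(matching)
        used |= attrs
    return selected

name_group = frozenset({"last name", "first name"})
author_m = (frozenset({"author"}), name_group)
isbn_m = (frozenset({"isbn"}), name_group)
subject_m = (frozenset({"subject"}), frozenset({"category"}))

# Assumed ranking: the ISBN candidate scores lowest.
ranked = [author_m, subject_m, isbn_m]
for matching in select_matchings(ranked):
    print(" = ".join("{" + ", ".join(sorted(g)) + "}" for g in matching))
```

With this input the Author and Subject matchings are kept and the ISBN candidate is discarded because it reuses the name group.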
12Correlation measure for qualification
- To find groups and matchings that pass the correlation threshold
- Observation: Pairwise correlations
- e.g., in Airfares domain, to = arrival city = destination
- to and arrival city are negatively correlated
- to and destination are negatively correlated
- arrival city and destination are negatively correlated
- Measure
- m: some correlation measure for two items
- support downward closure --- enable Apriori algorithm
- accommodate different measure m
$C_{\min} = \min m(A_i, A_j)$, for all $i \neq j$
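A small sketch of the qualification measure, taking the minimum pairwise score over a set. The pairwise scores below are invented for illustration, and `c_min` is our own name for the function. Because adding an item can only add more pairs to the minimum, a superset never scores higher than its subsets, which is the downward closure that Apriori needs.

```python
from itertools import combinations

def c_min(items, m):
    """Qualification score: the weakest pairwise correlation within the set."""
    return min(m(a, b) for a, b in combinations(items, 2))

# Hypothetical negative-correlation scores for three Airfares attributes.
scores = {
    frozenset({"to", "arrival city"}): 0.9,
    frozenset({"to", "destination"}): 0.8,
    frozenset({"arrival city", "destination"}): 0.7,
}

def m(a, b):
    return scores[frozenset({a, b})]

pair = ["to", "arrival city"]
triple = ["to", "arrival city", "destination"]
print(c_min(pair, m))    # 0.9
print(c_min(triple, m))  # 0.7
```

Any pairwise measure m can be plugged in without changing `c_min`.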
13The mining process: A standard Apriori algorithm
Schema Transactions
- Departure City
- Destination
- ...
Correlated items with length 2
{Destination, To}, {Destination, Arrival City}, {To, Arrival City}, {Departure City, From}, ...
{From, To}
{Departure City, Arrival City}
Correlated items with length 3
{Destination, To, Arrival City}, ...
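The level-wise search can be sketched as below, assuming a set qualifies when its minimum pairwise score reaches a threshold. The scores, the threshold of 0.5, and the function name are hypothetical; only the join-and-prune structure is the standard Apriori pattern.

```python
from itertools import combinations

def apriori_correlated(items, m, threshold):
    """Level-wise search for item sets whose minimum pairwise score >= threshold."""
    level = {frozenset(p) for p in combinations(items, 2) if m(*p) >= threshold}
    result = []
    while level:
        result.extend(sorted(level, key=sorted))
        # Join: merge two sets of the current level that differ in one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune: every subset one item smaller must have qualified. With a
        # minimum-of-pairs score this also guarantees the candidate qualifies.
        level = {c for c in candidates if all(c - {x} in level for x in c)}
    return result

# Hypothetical pairwise scores; unlisted pairs default to 0.0.
pair_scores = {
    frozenset({"destination", "to"}): 0.9,
    frozenset({"destination", "arrival city"}): 0.8,
    frozenset({"to", "arrival city"}): 0.85,
    frozenset({"departure city", "from"}): 0.9,
    frozenset({"from", "to"}): 0.1,
    frozenset({"departure city", "arrival city"}): 0.05,
}

def m(a, b):
    return pair_scores.get(frozenset({a, b}), 0.0)

items = ["destination", "to", "arrival city", "departure city", "from"]
found = apriori_correlated(items, m, threshold=0.5)
for s in found:
    print(sorted(s))
```

On this toy input the search returns four pairs and the single triple of destination, to, and arrival city.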
14Correlation measures for ranking
- To rank and select matchings in model construction
- Qualification measure is not good for ranking
- a set cannot win over its subset due to the downward closure
- e.g., min(1, 2, 3) < min(2, 3)
- superset contains more matchings and should be preferred
- Ranking measure
- A set does not win over its superset
- When tied, break the tie by semantic richness
- A1 = A2 = A3 is semantically richer than A1 = A2
- A1 = {A2, A3} is semantically richer than A1 = A2
$C_{\max} = \max m(A_i, A_j)$, for all $i \neq j$
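The ranking rule can be illustrated with a short sketch. Matchings are modelled as tuples of groups, m scores a pair of groups, and semantic richness is approximated by the total number of attributes involved. That proxy, the scores, and the function names are our assumptions, not definitions from the slides.

```python
from itertools import combinations

def c_max(matching, m):
    """Ranking score: the strongest pairwise correlation among the groups."""
    return max(m(g1, g2) for g1, g2 in combinations(matching, 2))

def richness(matching):
    """Assumed proxy for semantic richness: number of attributes covered."""
    return sum(len(group) for group in matching)

def rank(matchings, m):
    """Order by the maximum pairwise score, breaking ties by richness."""
    return sorted(matchings, key=lambda x: (c_max(x, m), richness(x)), reverse=True)

A1, A2, A3 = frozenset({"A1"}), frozenset({"A2"}), frozenset({"A3"})
# Hypothetical scores between groups.
scores = {
    frozenset({A1, A2}): 0.9,
    frozenset({A1, A3}): 0.6,
    frozenset({A2, A3}): 0.7,
}

def m(g1, g2):
    return scores[frozenset({g1, g2})]

small = (A1, A2)
big = (A1, A2, A3)
print(c_max(small, m), c_max(big, m))  # 0.9 0.9
print(rank([small, big], m)[0] == big)  # True
```

Both matchings tie at 0.9 under the maximum, so the richer three-way matching is ranked first, whereas the minimum would have put the smaller set ahead.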
15Choosing the m --- Measuring the correlation of
two items
- We explore 22 measures, e.g.,
Lift $= \frac{f_{00} f_{11}}{f_{01} f_{10}}$
Jaccard $= \frac{f_{11}}{f_{11} + f_{01} + f_{10}}$
16Choosing the m --- The problems of existing
measures
- Co-presence (f11) is more important than
co-absence (f00)
Less positive correlation but a higher Lift = 17
More positive correlation but a lower Lift = 0.69
- Rare attributes are not statistically convincing
Ap as a rare attribute: Jaccard = 0.02
No rare attributes: Jaccard = 0.02
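Both problems can be reproduced numerically. The sketch below implements Lift and Jaccard in the form given on the earlier slide. The contingency counts are illustrative values that we chose because they yield the quoted scores of 17, 0.69, and 0.02; they are not taken from the slides.

```python
def lift(f11, f10, f01, f00):
    """Lift in the form given on the slide: f00 * f11 / (f01 * f10)."""
    return (f00 * f11) / (f01 * f10)

def jaccard(f11, f10, f01):
    """Jaccard: co-presence over all interfaces containing either attribute."""
    return f11 / (f11 + f01 + f10)

# Assumed counts (f11, f10, f01, f00) over 100 interfaces.
less_positive = (5, 5, 5, 85)    # rarely together, but co-absence dominates
more_positive = (55, 20, 20, 5)  # together in more than half the interfaces
print(lift(*less_positive))             # 17.0
print(round(lift(*more_positive), 2))   # 0.69

# Assumed counts (f11, f10, f01); Jaccard does not use f00.
with_rare = (1, 49, 1)   # the second attribute appears only twice
no_rare = (1, 25, 25)    # both attributes appear 26 times
print(round(jaccard(*with_rare), 2))    # 0.02
print(round(jaccard(*no_rare), 2))      # 0.02
```

Lift rewards the large f00 in the first table, and Jaccard cannot tell the rare-attribute case from the well-supported one.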
17Choosing the m --- H-measure
$H = \frac{f_{01} f_{10}}{f_{+1} f_{1+}}$
Ignore the co-absence
Less positive correlation: H = 0.25
More positive correlation: H = 0.07
Differentiate the subtlety of negative correlations
Ap as a rare attribute: H = 0.49
No rare attributes: H = 0.92
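A sketch of the H-measure, reading the formula as f01 * f10 / (f+1 * f1+) with f1+ = f11 + f10 and f+1 = f11 + f01. The contingency counts are illustrative values that we chose because they yield the quoted H scores; they are not taken from the slides.

```python
def h_measure(f11, f10, f01):
    """H-measure of negative correlation; f00 (co-absence) is not used."""
    f1_plus = f11 + f10  # interfaces containing the first attribute
    f_plus1 = f11 + f01  # interfaces containing the second attribute
    return (f01 * f10) / (f_plus1 * f1_plus)

# Assumed counts (f11, f10, f01).
less_positive = (5, 5, 5)
more_positive = (55, 20, 20)
print(h_measure(*less_positive))            # 0.25
print(round(h_measure(*more_positive), 2))  # 0.07

with_rare = (1, 49, 1)  # the second attribute appears only twice
no_rare = (1, 25, 25)
print(h_measure(*with_rare))                # 0.49
print(round(h_measure(*no_rare), 2))        # 0.92
```

H reaches 1 when the two attributes never co-occur (f11 = 0) and drops as co-presence grows, so a higher H indicates a stronger negative correlation.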
18Experimental setup
- 447 deep Web sources in 8 domains
- Domains
- Travel: Airfares, Hotels, Car Rentals
- Entertainment: Books, Movies, Music Records
- Living: Jobs, Automobiles
- Available as the TEL-8 dataset in the UIUC Web Integration Repository
- http://metaquerier.cs.uiuc.edu/repository/
19Results in Books and Airfares domains
- Books
- {author (any)} = {last name (any), first name (any)}
- {subject (string)} = {category (string)}
- {format (string)} = {binding (string)}
- Airfares
- {passenger (integer)} = {adult (integer), child (integer), infant (integer)}
- {from (string)} = {departure city (string)} = {depart (string)}
- {departure date (datetime)} = {depart (datetime)}
- {return date (datetime)} = {return (datetime)}
- {class (string)} = {cabin (string)}
- {destination (string)} = {to (string)} = {departure city (string), arrival city (string)}
20Contributions
- Insight
- We build a conceptually novel connection between data integration and correlation mining
- schema matching as a new application of correlation mining
- correlation mining as a new approach for schema matching
- Techniques
- The dual correlation mining framework
- Measures for qualification and ranking
- H-measure, robust for negative correlations
21Thank You!