Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
  • Bin He
  • Joint work with Kevin Chen-Chuan Chang, Jiawei Han
  • Univ. Illinois at Urbana-Champaign

2
Context: MetaQuerier, large-scale integration of the deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
3
Challenge: Matching query interfaces (QIs)
Book Domain
m:n complex matching
1:1 simple matching
Music Domain
4
Demo.
5
Traditional approaches to schema matching: Pairwise attribute correspondence
Pairwise Matching
  • But, scale is a challenge
  • How to address the challenge of large scale?
  • And, scale is an opportunity!
  • How to leverage the opportunity of large scale?

S1: author, title, subject, ISBN
S2: name, title, category, format
Pairwise Attribute Correspondence
S1.author = S2.name
S1.subject = S2.category
6
A holistic schema matching paradigm
Input: Set of schemas
Output: Semantic model for all attribute matchings
Holistic Schema Matching
author = name = writer
subject = category
format = binding
7
Holistic matching is, in essence, data mining to discover semantics for information integration
Generation
  • Our Hypothesis

Semantics (semantic correspondences)
Observations (attribute occurrences)
Hidden Regularities
  • Our Approach

Statistical Analysis -- for Model Discovery
8
Regularity: Co-occurrence patterns
[Figure: four book query interfaces annotated with "Grouping Attributes" and "Synonym Attributes"; labeled fields include Author, Last Name, First Name]
(a) amazon.com
(b) www.randomhouse.com
(c) bn.com
(d) 1bookstreet.com
9
Schema matching as correlation mining
  • Across many sources
  • Synonym attributes with negative correlation
    • synonym attributes are semantic alternatives
    • thus, they rarely co-occur in query interfaces
  • Grouping attributes with positive correlation
    • grouping attributes are semantically complementary
    • thus, they often co-occur in query interfaces

10
Data preparation: Prepare schema transactions to be mined
  • Interface Extraction
    • SIGMOD04
  • Type Recognition
    • Type is not declared in Web interfaces
    • Identify types from instance values, e.g., integer, datetime
    • Used for constraining merging and matching
  • Syntactic Merging
    • merge attributes with syntactically similar names
    • e.g., "title of book" to "title", "author's name" to "author"
    • merge attributes with syntactically similar instance values

11
DCM: Dual Correlation Mining framework
1. Positive correlation mining as potential groups
   {Last Name (any), First Name (any)}
2. Negative correlation mining as potential matchings
   Author (any) = {Last Name (any), First Name (any)}
   ISBN (any) = {Last Name (any), First Name (any)}
3. Matching selection as model construction
   Author (any) = {Last Name (any), First Name (any)}
   Subject (string) = Category (string)
   Format (string) = Binding (string)
12
Correlation measure for qualification
  • To find groups and matchings that pass the correlation threshold
  • Observation: Pairwise correlations
    • e.g., in the Airfares domain, to = arrival city = destination
    • to and arrival city are negatively correlated
    • to and destination are negatively correlated
    • arrival city and destination are negatively correlated
  • Measure
    • m: some correlation measure for two items
    • supports downward closure, which enables the Apriori algorithm
    • accommodates different measures m

$C_{min} = \min m(A_i, A_j)$, for all $i \neq j$
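A minimal Python sketch of the qualification measure may help here. It assumes the pairwise measure m is passed in as a function; the names `c_min` and `scores`, and the score values themselves, are made up for illustration and do not come from the talk.

```python
from itertools import combinations

def c_min(items, m):
    """Smallest pairwise score m(Ai, Aj) over all distinct pairs in items."""
    return min(m(a, b) for a, b in combinations(sorted(items), 2))

# Hypothetical pairwise scores for three Airfares attributes (made-up numbers).
scores = {
    frozenset({"to", "arrival city"}): 0.9,
    frozenset({"to", "destination"}): 0.8,
    frozenset({"arrival city", "destination"}): 0.7,
}

def m(a, b):
    # Look up the assumed score for an unordered pair of attributes.
    return scores[frozenset({a, b})]

full = c_min({"to", "arrival city", "destination"}, m)
sub = c_min({"to", "arrival city"}, m)
print(full, sub)  # 0.7 0.9
```

Because the minimum over a subset's pairs can never be smaller than the minimum over the full set's pairs, any set that passes the threshold has all of its subsets passing too. That is the downward closure property the slide refers to.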
13
The mining process: A standard Apriori algorithm
Schema Transactions
  • {From, To}
  • {Departure City, Arrival City}
  • {Departure City, Destination}
  • ...

Correlated items with length 2
{Destination, To}, {Destination, Arrival City}, {To, Arrival City}, {Departure City, From}, ...
Correlated items with length 3
{Destination, To, Arrival City}, ...
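The level-wise search on this slide can be sketched in Python as follows. This is an illustrative Apriori-style loop, not the authors' implementation: the function names, the threshold of 0.5, and the 0.9/0.1 scores are all assumptions chosen so that the toy output mirrors the itemsets listed above.

```python
from itertools import combinations

def c_min(items, m):
    """Smallest pairwise score over all distinct pairs in items."""
    return min(m(a, b) for a, b in combinations(sorted(items), 2))

def mine_correlated(attributes, m, threshold):
    """Level-wise (Apriori-style) search for sets with c_min >= threshold."""
    level = [frozenset(p) for p in combinations(sorted(attributes), 2)
             if m(*p) >= threshold]
    found = list(level)
    while level:
        survivors = set(level)
        k = len(level[0]) + 1
        # Join: unions of two surviving sets that yield exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets survived,
        # then confirm it against the threshold.
        level = [c for c in candidates
                 if all(frozenset(s) in survivors for s in combinations(c, k - 1))
                 and c_min(c, m) >= threshold]
        found.extend(level)
    return found

# Hypothetical scores: 0.9 for pairs assumed to be synonyms, 0.1 otherwise.
synonym_pairs = {
    frozenset({"destination", "to"}),
    frozenset({"destination", "arrival city"}),
    frozenset({"to", "arrival city"}),
    frozenset({"departure city", "from"}),
}

def m(a, b):
    return 0.9 if frozenset({a, b}) in synonym_pairs else 0.1

attributes = {"destination", "to", "arrival city", "departure city", "from"}
found = mine_correlated(attributes, m, threshold=0.5)
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

With these assumed scores the search yields four itemsets of length 2 and one of length 3, {destination, to, arrival city}.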
14
Correlation measures for ranking
  • To rank and select matchings in model construction
  • Qualification measure is not good for ranking
    • a set cannot beat its subset, due to the downward closure
    • e.g., min(1, 2, 3) < min(2, 3)
    • but a superset contains more matchings and should be preferred
  • Ranking measure
    • A set does not beat its superset
    • On a tie, break the tie by semantic richness
    • A1 = A2 = A3 is semantically richer than A1 = A2
    • A1 = {A2, A3} is semantically richer than A1 = A2

$C_{max} = \max m(A_i, A_j)$, for all $i \neq j$
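A small Python sketch of ranking by the maximum pairwise score, with ties broken by richness. The `richness` function is an assumed proxy (the number of attributes a matching covers) that happens to agree with both examples on the slide; the talk does not define it this way. The scores are made up.

```python
from itertools import combinations

def c_max(groups, m):
    """Largest pairwise score m(Gi, Gj) over distinct groups of a matching."""
    return max(m(a, b) for a, b in combinations(groups, 2))

def richness(groups):
    # Assumed proxy for semantic richness: total number of attributes covered.
    return sum(len(g) for g in groups)

def rank(matchings, m):
    """Order matchings by c_max, breaking ties by richness (highest first)."""
    return sorted(matchings, key=lambda g: (c_max(g, m), richness(g)), reverse=True)

# Each group is a frozenset of attribute names; here every group has one attribute.
to = frozenset({"to"})
arrival = frozenset({"arrival city"})
destination = frozenset({"destination"})

# Hypothetical pairwise scores (made-up numbers).
scores = {
    frozenset({to, arrival}): 0.9,
    frozenset({to, destination}): 0.8,
    frozenset({arrival, destination}): 0.7,
}

def m(a, b):
    return scores[frozenset({a, b})]

pair = (to, arrival)                  # to = arrival city
triple = (to, arrival, destination)   # to = arrival city = destination
ranked = rank([pair, triple], m)
print(ranked[0] == triple)  # True
```

Both matchings score 0.9 under the maximum, so the larger one is placed first by the tie-break. Under the minimum-based qualification measure the triple would score 0.7 and lose to its own subset.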
15
Choosing the m --- Measuring the correlation of
two items
  • Contingency table
  • We explore 22 measures, e.g.,

$\text{Lift} = \frac{f_{00} f_{11}}{f_{01} f_{10}}$
$\text{Jaccard} = \frac{f_{11}}{f_{11} + f_{01} + f_{10}}$
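To make the contingency table concrete, here is a Python sketch that counts the four cells from a list of schema transactions and evaluates the two measures as written on the slide. The index convention (first index for attribute p, second for q, 1 meaning present), the function names, and the six toy interfaces are assumptions for illustration.

```python
def contingency(transactions, p, q):
    """Return (f11, f10, f01, f00): counts of interfaces by presence of p and q."""
    f11 = sum(1 for t in transactions if p in t and q in t)
    f10 = sum(1 for t in transactions if p in t and q not in t)
    f01 = sum(1 for t in transactions if p not in t and q in t)
    f00 = sum(1 for t in transactions if p not in t and q not in t)
    return f11, f10, f01, f00

def lift(f11, f10, f01, f00):
    # Formula as given on the slide; undefined when f01 or f10 is zero.
    return (f00 * f11) / (f01 * f10)

def jaccard(f11, f10, f01, f00):
    # Co-presence over all interfaces containing at least one of the two.
    return f11 / (f11 + f01 + f10)

# Six hypothetical book query interfaces, each a set of attribute names.
transactions = [
    {"last name", "first name", "title"},
    {"last name", "first name"},
    {"last name", "title"},
    {"first name", "isbn"},
    {"author", "title"},
    {"author", "isbn"},
]

cells = contingency(transactions, "last name", "first name")
print(cells)            # (2, 1, 1, 2)
print(lift(*cells))     # 4.0
print(jaccard(*cells))  # 0.5
```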
16
Choosing the m --- The problems of existing
measures
  • Co-presence (f11) is more important than co-absence (f00)

Less positive correlation but a higher Lift: 17
More positive correlation but a lower Lift: 0.69
  • Rare attributes are not statistically convincing

Ap as a rare attribute: Jaccard = 0.02
No rare attributes: Jaccard = 0.02
17
Choosing the m --- H-measure
  • H-measure

$H = \frac{f_{01} f_{10}}{f_{+1} f_{1+}}$
Ignore the co-absence
Less positive correlation: H = 0.25
More positive correlation: H = 0.07
Differentiate the subtlety of negative correlations
Ap as a rare attribute: H = 0.49
No rare attributes: H = 0.92
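The H-measure can be sketched in Python directly from the contingency counts. This assumes the usual marginals, f1+ = f11 + f10 and f+1 = f11 + f01; the function name and the example counts are hypothetical and are not the tables behind the H values quoted on the slide.

```python
def h_measure(f11, f10, f01):
    """H = f01 * f10 / (f+1 * f1+); the co-absence count f00 is not used."""
    f1_plus = f11 + f10   # interfaces containing the first attribute
    f_plus_1 = f11 + f01  # interfaces containing the second attribute
    return (f01 * f10) / (f_plus_1 * f1_plus)

# Two attributes that never co-occur (made-up counts): strongest negative correlation.
never_together = h_measure(f11=0, f10=5, f01=5)

# Two attributes that almost always co-occur (made-up counts).
mostly_together = h_measure(f11=9, f10=1, f01=1)

print(never_together)   # 1.0
print(mostly_together)  # 0.01
```

Since f00 does not appear in the formula, adding any number of interfaces that contain neither attribute leaves H unchanged, which is the "ignore the co-absence" behavior named above.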
18
Experimental setup
  • 447 deep Web sources in 8 domains
  • Domains
    • Travel: Airfares, Hotels, Car Rentals
    • Entertainment: Books, Movies, Music Records
    • Living: Jobs, Automobiles
  • Available as the TEL-8 dataset in the UIUC Web Integration Repository
    • http://metaquerier.cs.uiuc.edu/repository/

19
Results in Books and Airfares domains
  • Books
    • author (any) = {last name (any), first name (any)}
    • subject (string) = category (string)
    • format (string) = binding (string)
  • Airfares
    • passenger (integer) = {adult (integer), child (integer), infant (integer)}
    • from (string) = departure city (string) = depart (string)
    • departure date (datetime) = depart (datetime)
    • return date (datetime) = return (datetime)
    • class (string) = cabin (string)
    • destination (string) = to (string) = {departure city (string), arrival city (string)}

20
Contributions
  • Insight
    • We build a conceptually novel connection between data integration and correlation mining
    • schema matching as a new application of correlation mining
    • correlation mining as a new approach for schema matching
  • Techniques
    • The dual correlation mining framework
    • Measures for qualification and ranking
    • H-measure, robust for negative correlations

21
Thank You!