Text%20Joins%20in%20an%20RDBMS%20for%20Web%20Data%20Integration presentation

About This Presentation

Title:

Text%20Joins%20in%20an%20RDBMS%20for%20Web%20Data%20Integration

Description:

Text Joins in an RDBMS for Web Data Integration. Text Joins in an RDBMS for Web Data Integration Luis Gravano Panagiotis G. Ipeirotis Nick Koudas Divesh Srivastava –

Number of Views:216

Avg rating:3.0/5.0

Slides: 19

Provided by: Pana57

Category:

more less

Transcript and Presenter's Notes

Title: Text%20Joins%20in%20an%20RDBMS%20for%20Web%20Data%20Integration

1
Text Joins in an RDBMS for Web Data Integration

Luis Gravano
Panagiotis G. Ipeirotis

Nick Koudas Divesh Srivastava
Columbia University
ATT Labs - Research
2
Why Text Joins?
Web Service B
HATRONIC CORP
EUROAFT INC
EUROAFT CORP

Web Service A
EUROAFT CORP
HATRONIC INC

Problem Same entity has multiple textual
representations.
3
Matching Text Attributes
Need for a similarity metric!
Many desirable properties

Match entries with typing mistakes
Microsoft Windpws XP vs. Microsoft Windows XP
Match entries with abbreviated information
Zurich International Airport vs. Zurich Intl.
Airport
Match entries with different formatting
conventions
Dept. of Computer Science vs. Computer Science
Dept.
and combinations thereof

4
Matching Text Attributes using Edit Distance
Edit Distance Character insertions, deletions,
and modifications to transform one string to the
other
EUROAFT CORP - EURODRAFT CORP ? 2
COMPUTER SCI. - COMPUTER ? 3
KIA INTERNATIONAL - KIA ? 13
Good for spelling errors, short word insertions
and deletions Problems word order variations,
long word insertions and deletions
Approximate String Joins VLDB 2001
5
Matching Text Attributes using Cosine Similarity
Similar entries should share infrequent tokens
EUROAFT CORP EUROAFT INC
EUROAFT CORP ? HATRONIC CORP
Different token choices result in similarity
metrics with different properties
6
Using Words and Cosine Similarity
Using words as tokens
Infrequent token (high weight)
EUROAFT CORP EUROAFT INC
EUROAFT CORP ? HATRONIC CORP

Split each entry into words
Similar entries share infrequent words

Common token (low weight)

Good for word order variations and common word
insert./del.
Computer Science Dept. Dept. of Computer
Science
Problems with misspellings
Biotechnology Department ? Bioteknology Dept.

WHIRL W.Cohen, SIGMOD98
7
Using q-grams and Cosine Similarity
Using q-grams as tokens

Split each string into small substrings of length
q (q-grams)
Similar entries share many, infrequent q-grams

Handles naturally misspellings, word order
variations, and insertions and deletions of
common or short words
8
Problem
For two entries t1, t2 0 Similarity 1
Problem that we address Given two relations,
report all pairs with cosine similarity above
threshold f
9
Computing Text Joins in an RDBMS
R1
R2
Name
1 EUROAFT CORP
2 HATRONIC INC

Name
1 HATRONIC CORP
2 EUROAFT INC
3 EUROAFT CORP

R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 1.00
EUROAFT CORP HATRONIC CORP 0.01
HATRONIC INC HATRONIC CORP 0.98
HATRONIC INC EUROAFT INC 0.02
10
Sampling Step for Text Joins
Similarity S weight(token, t1) weight(token,
t2)

Similarity is a sum of products
Products cannot be high when weight is small
Can (safely) drop low weights from RiWeights
(adapted from Cohen Lewis, SODA97 for
efficient execution inside an RDBMS)

RiWeights
Token W
EUROAFT 0.9144
HATRONIC 0.8419

CORP 0.01247
INC 0.00504
RiSample
Token TIMES SAMPLED
EUROAFT 18 (18/200.90)
HATRONIC 17 (17/200.85)
Eliminates low similarity pairs (e.g., EUROAFT
INC with HATRONIC INC)
11
Sampling-Based Text Joins in SQL
R1Weights
R2Sample
Token W
1 EUROAFT 0.98
1 CORP 0.02
2 HATRONIC 0.98
2 INC 0.01

Token W
1 HATRONIC 0.98
1 CORP 0.02
2 EUROAFT 0.95
2 INC 0.05
3 EUROAFT 0.97
3 CORP 0.03
R1
Name
1 EUROAFT CORP
2 HATRONIC INC

R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 0.9
HATRONIC INC HATRONIC CORP 0.98
12
Experimental Setup