Title: Text%20Joins%20in%20an%20RDBMS%20for%20Web%20Data%20Integration
1Text Joins in an RDBMS for Web Data Integration
- Luis Gravano
- Panagiotis G. Ipeirotis
Nick Koudas Divesh Srivastava
Columbia University
ATT Labs - Research
2Why Text Joins?
Web Service B
HATRONIC CORP
EUROAFT INC
EUROAFT CORP
Web Service A
EUROAFT CORP
HATRONIC INC
Problem Same entity has multiple textual
representations.
3Matching Text Attributes
Need for a similarity metric!
Many desirable properties
- Match entries with typing mistakes
- Microsoft Windpws XP vs. Microsoft Windows XP
- Match entries with abbreviated information
- Zurich International Airport vs. Zurich Intl.
Airport - Match entries with different formatting
conventions - Dept. of Computer Science vs. Computer Science
Dept. - and combinations thereof
4Matching Text Attributes using Edit Distance
Edit Distance Character insertions, deletions,
and modifications to transform one string to the
other
EUROAFT CORP - EURODRAFT CORP ? 2
COMPUTER SCI. - COMPUTER ? 3
KIA INTERNATIONAL - KIA ? 13
Good for spelling errors, short word insertions
and deletions Problems word order variations,
long word insertions and deletions
Approximate String Joins VLDB 2001
5Matching Text Attributes using Cosine Similarity
Similar entries should share infrequent tokens
EUROAFT CORP EUROAFT INC
EUROAFT CORP ? HATRONIC CORP
Different token choices result in similarity
metrics with different properties
6Using Words and Cosine Similarity
Using words as tokens
Infrequent token (high weight)
EUROAFT CORP EUROAFT INC
EUROAFT CORP ? HATRONIC CORP
- Split each entry into words
- Similar entries share infrequent words
Common token (low weight)
- Good for word order variations and common word
insert./del. - Computer Science Dept. Dept. of Computer
Science - Problems with misspellings
- Biotechnology Department ? Bioteknology Dept.
WHIRL W.Cohen, SIGMOD98
7Using q-grams and Cosine Similarity
Using q-grams as tokens
- Split each string into small substrings of length
q (q-grams) - Similar entries share many, infrequent q-grams
Handles naturally misspellings, word order
variations, and insertions and deletions of
common or short words
8Problem
For two entries t1, t2 0 Similarity 1
Problem that we address Given two relations,
report all pairs with cosine similarity above
threshold f
9Computing Text Joins in an RDBMS
R1
R2
Name
1 EUROAFT CORP
2 HATRONIC INC
Name
1 HATRONIC CORP
2 EUROAFT INC
3 EUROAFT CORP
R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 1.00
EUROAFT CORP HATRONIC CORP 0.01
HATRONIC INC HATRONIC CORP 0.98
HATRONIC INC EUROAFT INC 0.02
10Sampling Step for Text Joins
Similarity S weight(token, t1) weight(token,
t2)
- Similarity is a sum of products
- Products cannot be high when weight is small
- Can (safely) drop low weights from RiWeights
(adapted from Cohen Lewis, SODA97 for
efficient execution inside an RDBMS)
RiWeights
Token W
EUROAFT 0.9144
HATRONIC 0.8419
CORP 0.01247
INC 0.00504
RiSample
Token TIMES SAMPLED
EUROAFT 18 (18/200.90)
HATRONIC 17 (17/200.85)
Eliminates low similarity pairs (e.g., EUROAFT
INC with HATRONIC INC)
11Sampling-Based Text Joins in SQL
R1Weights
R2Sample
Token W
1 EUROAFT 0.98
1 CORP 0.02
2 HATRONIC 0.98
2 INC 0.01
Token W
1 HATRONIC 0.98
1 CORP 0.02
2 EUROAFT 0.95
2 INC 0.05
3 EUROAFT 0.97
3 CORP 0.03
R1
Name
1 EUROAFT CORP
2 HATRONIC INC
R1 R2 Similarity
EUROAFT CORP EUROAFT INC 0.98
EUROAFT CORP EUROAFT CORP 0.9
HATRONIC INC HATRONIC CORP 0.98
12Experimental Setup
- 40,000 entries from ATT customer database, split
into R1 (26,000 entries) and R2 (14,000 entries) - Tokenizations
- Words
- Q-grams, q2 q3
- Methods compared
- Variations of sample-based joins
- Baseline in SQL
- WHIRL SIGMOD98, adapted for handling q-grams
13Metrics
- Execute the (approximate) join for similarity gt
f - Precision (measures accuracy)
- Fraction of the pairs in the answer with real
similarity gt f - Recall (measures completeness)
- Fraction of the pairs with real similarity gt f
that are also in the answer - Execution time
14Comparing WHIRL and Sample-based Joins
- Sample-based Joins Good recall across similarity
thresholds - WHIRL Very low recall (almost 0 recall for
thresholds below 0.7)
15Changing Sample Size
- Increased sample size ? Better recall,
precision - Drawback Increased execution time
16Execution Time
- WHIRL and Sample-based text joins break-even at
S 64, 128
17Contributions
- WHIRL Cohen, SIGMOD98 inside an RDBMS
Scalability, no data exporting/importing - Different tokens choices
- Words Captures word swaps, deletion of common
words - Q-grams All the above, plus spelling mistakes,
but slower
SQL statements tested in MS SQL Server and
available for download at http//www.cs.columbia.
edu/pirot/DataCleaning/
18Questions?