Title: Clustering and Load Balancing Optimization for Redundant Content Removal
1. Clustering and Load Balancing Optimization for Redundant Content Removal
- Shanzhong Zhu (Ask.com)
- Alexandra Potapova, Maha Alabduljalil (Univ. of California at Santa Barbara)
- Xin Liu (Amazon.com)
- Tao Yang (Univ. of California at Santa Barbara)
2. Redundant Content Removal in Search Engines
- Over 1/3 of crawled Web pages are near-duplicates
- When to remove near-duplicates?
  - Offline removal
  - Online removal with query-based duplicate removal
3. Tradeoff of online vs. offline removal
- Online-dominating approach
  - Impact on offline: high precision, low recall; removes fewer duplicates
  - Impact on online: more burden on online deduplication
  - Impact on overall cost: higher serving cost
- Offline-dominating approach
  - Impact on offline: high precision, high recall; removes most duplicates; higher offline burden
  - Impact on online: less burden on online deduplication
  - Impact on overall cost: lower serving cost
4. Challenges in offline duplicate handling
- Achieve high recall with high precision
- All-to-all duplicate comparison for complex/deep pairwise analysis
  - Expensive: requires parallelism management and elimination of unnecessary computation
- Maintain duplicate groups instead of duplicate pairs (see the sketch below)
  - Reduces the storage requirement
  - Aids winner selection for duplicate removal
- Continuous group update is expensive
  - Approximation
  - Error handling
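To make the group-based representation concrete, here is a minimal sketch (not the Ask.com implementation) of keeping duplicate groups instead of pairwise links and choosing one winner per group; DuplicateGroup, its signature field, and the scoring function are illustrative assumptions.

```python
# Illustrative sketch: store near-duplicate groups instead of O(n^2) duplicate pairs.
# The signature field, member ids, and scoring policy are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class DuplicateGroup:
    signature: frozenset                       # representative content signature of the group
    members: set = field(default_factory=set)  # ids of pages in this near-duplicate group

    def pick_winner(self, score):
        """Keep one 'winner' page (e.g. the highest-scored one); the rest become removable."""
        return max(self.members, key=score)

# One group record replaces storing every duplicate pair among its members.
group = DuplicateGroup(signature=frozenset({"shingle1", "shingle2"}),
                       members={"p1", "p2", "p3"})
scores = {"p1": 0.2, "p2": 0.9, "p3": 0.5}
winner = group.pick_winner(score=lambda pid: scores[pid])
losers = group.members - {winner}              # candidates for offline removal
```

Keeping only per-group state (a signature plus a member list) is what makes the storage reduction and the winner selection step cheap.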
5. Optimization for faster offline duplicate handling
- Incremental duplicate clustering and group management
  - Approximated transitive relationship
  - Lazy update
- Avoid unnecessary computation while balancing computation among machines
  - Multi-dimensional partitioning
  - Faster many-to-all duplicate comparisons (see the sketch below)
(Figure: pages are divided into multiple page partitions for parallel comparison)
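A minimal sketch of the many-to-all comparison pattern, under the assumption that pages are already assigned to partitions; candidate_partitions and similarity are hypothetical helpers (the length-based version of the partition filter appears on later slides).

```python
# Illustrative many-to-all comparison: each newly crawled/updated page is compared
# only against partitions that could plausibly contain its near-duplicates.
# candidate_partitions() and similarity() are hypothetical helpers.

def find_duplicate_pairs(updated_pages, partitions, candidate_partitions, similarity, t):
    """Return (page, other) pairs whose similarity reaches threshold t."""
    pairs = []
    for page in updated_pages:
        for part_id in candidate_partitions(page):   # prune partitions that cannot match
            for other in partitions[part_id]:
                if similarity(page, other) >= t:
                    pairs.append((page, other))
        # In a distributed run, different partitions can be scanned on different
        # machines, which is where balancing the comparison load comes in.
    return pairs
```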
6. Two-tier Architecture for Incremental Duplicate Detection
7. Approximation in Incremental Duplicate Group Management
- Example of incremental group merging/splitting
- Approximation (see the sketch below)
  - A group is unchanged when updated pages are still similar to the group signatures
  - Group splitting does not re-validate all relations
- Error of the transitive relation after a content update
  - A <-> B and B <-> C imply A <-> C
  - A <-> C may no longer hold if the content of B is updated
- Error prevention during duplicate filtering
  - Double-check the similarity threshold between a winner and a loser
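A minimal sketch of the two ideas above, assuming each group keeps a signature and that winner/loser pairs are re-checked before removal; the similarity function and threshold t are placeholders rather than the authors' exact scheme.

```python
# Illustrative incremental group update with the approximation described on this slide.
# similarity(), group.signature, and the threshold t are assumed placeholders.

def update_group(group, updated_pages, similarity, t):
    """Keep an updated page in its group only while it still matches the group signature."""
    kept, evicted = [], []
    for page in updated_pages:
        if similarity(page, group.signature) >= t:
            kept.append(page)      # approximation: the group itself is left unchanged
        else:
            evicted.append(page)   # page leaves the group; other pairwise relations
                                   # are not re-validated (lazy splitting)
    return kept, evicted

def safe_to_remove(winner, loser, similarity, t):
    """Error prevention: double-check winner/loser similarity before filtering the loser."""
    return similarity(winner, loser) >= t
```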
8. Multi-dimensional page partitioning
- Objective
  - One page is mapped to one unique partition
  - Dissimilar pages are mapped to different partitions
  - Reduce unnecessary cross-partition comparisons
- Partitioning based on document length
  - Outperforms signature-based mapping at higher recall rates
- Multi-dimensional mapping (see the sketch after the next slide's figure)
  - Improves the load imbalance caused by a skewed length distribution
9. Multi-dimensional page partitioning
(Figure: a page A with total length 600 in the 1D length space maps to the vector A(280, 320) in the 2D length space once the dictionary is split into two sub-dictionaries)
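A minimal sketch of the multi-dimensional length mapping suggested by the figure: the dictionary is split into sub-dictionaries, and a page's length vector counts its terms per sub-dictionary; the splitting rule and the bucketing interval below are assumptions, not the exact Ask.com scheme.

```python
# Illustrative 2D length mapping (assumed splitting rule and interval size).

def length_vector(page_terms, sub_dictionaries):
    """Map a page to a length vector with one component per sub-dictionary."""
    return tuple(sum(1 for term in page_terms if term in sub)
                 for sub in sub_dictionaries)

def partition_id(vector, interval=100):
    """Assign the page to a partition by bucketing each length component."""
    return tuple(component // interval for component in vector)

# As in the figure's example: a page with 600 terms can map to (280, 320) when
# 280 of its terms fall in the first sub-dictionary and 320 in the second.
```

Spreading pages over several length dimensions breaks up the oversized buckets that a single skewed length distribution produces, which is the load-balancing benefit claimed on the previous slide.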
10. When does page A compare with page B?
- Page length vectors A = (A1, A2) and B = (B1, B2)
- Page A needs to be compared with B only if their length vectors are close in every dimension (one possible form of the condition is sketched below)
  - t is the similarity threshold
  - δ is a fixed interval-enlarging factor
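The slide does not reproduce the inequality itself; a plausible reconstruction, assuming a Jaccard-style similarity with threshold t and each comparison interval enlarged by δ (written delta below), is:

```python
# Plausible reconstruction (assumption, not the slide's exact formula): compare A and B
# only if every length component of A lies within the t-scaled window around B's
# component, enlarged by a fixed factor delta.
def should_compare(A, B, t, delta):
    return all(t * b - delta <= a <= b / t + delta for a, b in zip(A, B))

# Example: with t = 0.8 and delta = 10, vectors (280, 320) and (300, 310) pass the test.
print(should_compare((280, 320), (300, 310), t=0.8, delta=10))   # True
```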
11. Implementation and Evaluations
- Implemented in the Ask.com offline platform with C for processing billions of documents
- Impact on relevancy
  - Continuously monitor top query results
  - The error rate of false removal is tiny
- Impact on cost
  - Two approaches compared
    - A: Online-dominating. Offline removes 5% of duplicates first; most duplicates remain hosted on online tier-2 machines
    - B: Offline-dominating.
12. Cost Saving with the Offline-Dominating Approach
- Fixed QPS target; two-tier online index for 3-8 billion URLs
- 8-26% cost saving with the offline-dominating approach
  - Fewer tier-2 machines, since fewer duplicates are hosted
  - Online tier-1 machines can answer more queries
  - Online messages contain fewer duplicates
13. Reduction of unnecessary inter-machine communication and comparison
- Up to 87% saving when using up to 64 machines
14. Effectiveness of 3D mapping
- Load balance factor with up to 64 machines
- Speedup of processing throughput
15. Benefits of incremental computation
- Ratio of non-incremental duplicate detection time to incremental time on a 100-million-page dataset: up to 24-fold speedup
- During a crawling update, 30% of updated pages have signatures similar to their group signatures
16. Accuracy of distributed clustering and duplicate group management
- Relative error in precision compared to a single-machine configuration
- Relative error in recall
17. Concluding remarks
- Budget-conscious solution with offline-dominating redundant content removal
  - Up to 26% cost saving
- Approximated incremental scheme for duplicate clustering with error handling
  - Up to 24-fold speedup
  - Undetected duplicates are handled online
- 3D mapping still reduces unnecessary comparisons (up to 87%) while balancing load (3-fold improvement)