Clustering and Load Balancing Optimization for Redundant Content Removal

Title: Clustering and Load Balancing Optimization for Redundant Content Removal


1
Clustering and Load Balancing Optimization for
Redundant Content Removal
  • Shanzhong Zhu (Ask.com)
  • Alexandra Potapova, Maha Alabduljalil (Univ. of
    California at Santa Barbara)
  • Xin Liu (Amazon.com)
  • Tao Yang (Univ. of California at Santa Barbara)

2
Redundant Content Removal in Search Engines
  • Over 1/3 of Web pages crawled are near duplicates
  • When to remove near duplicates?
  • Offline removal
  • Online removal with query-based duplicate removal

3
Tradeoff of online vs. offline removal
                         Online-dominating approach     Offline-dominating approach
Impact on offline        High precision, low recall;    High precision, high recall;
                         removes fewer duplicates       removes most duplicates;
                                                        higher offline burden
Impact on online         More burden on online          Less burden on online
                         deduplication                  deduplication
Impact on overall cost   Higher serving cost            Lower serving cost

4
Challenges in offline duplicate handling
  • Achieve high-recall with high precision
  • All-to-all duplicate comparison for complex/deep
    pairwise analysis
  • Expensive: requires parallelism management and
    unnecessary computation elimination
  • Maintain duplicate groups instead of duplicate
    pairs
  • Reduce storage requirement.
  • Aid winner selection for duplicate removal
  • Continuous group update is expensive.
  • Approximation.
  • Error handling
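
One way to picture keeping duplicate groups instead of duplicate pairs is a union-find (disjoint-set) structure: each detected pair merges two groups, so storage grows with the number of pages rather than the number of pairs, and a winner can be picked once per group. The sketch below is illustrative only. The `DuplicateGroups` class, the choice of union-find, and the "highest score wins" rule in `pick_winners` are assumptions, not the method described in the slides.

```python
class DuplicateGroups:
    """Hypothetical group store: a union-find over page ids."""

    def __init__(self):
        self.parent = {}

    def find(self, page):
        # Unseen pages start as singleton groups.
        self.parent.setdefault(page, page)
        root = page
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited page at the root.
        while self.parent[page] != root:
            nxt = self.parent[page]
            self.parent[page] = root
            page = nxt
        return root

    def add_pair(self, a, b):
        # Merging on every pair treats "is a duplicate of" as transitive,
        # which is itself an approximation.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def groups(self):
        out = {}
        for page in list(self.parent):
            out.setdefault(self.find(page), set()).add(page)
        return list(out.values())


def pick_winners(groups, score):
    # Assumed rule: keep the highest-scoring page of each group.
    # Ties go to the page id that sorts first.
    return [max(sorted(g), key=score) for g in groups]
```

With pairs (a, b), (b, c) and (x, y), the store holds two groups, and only one page per group survives winner selection.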

5
Optimization for faster offline duplicate
handling
  • Incremental duplicate clustering and group
    management
  • Approximated transitive relationship
  • Lazy update
  • Avoid unnecessary computation while balancing
    computation among machines
  • Multi-dimensional partitioning
  • Faster many-to-all duplicate comparisons

[Figure: diagram with four page partitions]

6
Two-tier Architecture for Incremental Duplicate
Detection
7
Approximation in Incremental Duplicate Group
Management
  • Example of incremental group merging/splitting
  • Approximation
  • Group is unchanged when updated pages are still
    similar to group signatures
  • Group splitting does not re-validate all
    relations
  • Error of transitive relation after content update
  • A <-> B, B <-> C => A <-> C
  • A <-> C may not be true if the content of B is updated.
  • Error prevention during duplicate filtering
  • Double-check the similarity threshold between a
    winner and a loser
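
The approximation on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: page signatures are modeled as sets, similarity is Jaccard similarity, and the names `Group`, `apply_update`, and `filter_losers` are hypothetical. The slides do not specify the signature type or the similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class Group:
    def __init__(self, signature, members):
        self.signature = frozenset(signature)  # group signature
        self.members = set(members)


def apply_update(group, page, new_sig, threshold):
    """Handle a content update for one page of a group.

    Returns True if the group is left unchanged, False if the page
    was split out of the group.
    """
    if jaccard(group.signature, set(new_sig)) >= threshold:
        return True  # still similar to the group signature: no change
    # Split without re-validating the remaining members.
    group.members.discard(page)
    return False


def filter_losers(winner, group, signatures, threshold):
    """Error prevention: re-check each loser against the winner
    and only report pages that still pass the threshold."""
    winner_sig = signatures[winner]
    return sorted(
        p for p in group.members
        if p != winner and jaccard(winner_sig, signatures[p]) >= threshold
    )
```

The final double check is what catches a stale member: a page can remain in a group because the group was never re-validated, yet it is not removed unless it is still similar to the winner.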

8
Multi-dimensional page partitioning
  • Objective
  • One page is mapped to one unique partition
  • Dissimilar pages are mapped to different
    partitions.
  • Reduce unnecessary cross-partition comparisons.
  • Partitioning based on document length
  • Outperforms signature-based mapping for higher
    recall rates.
  • Multi-dimensional mapping
  • Reduces load imbalance caused by skewed length
    distribution
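
A rough sketch of length-based, multi-dimensional mapping is given below. It assumes a page is a list of words, that the dictionary is split into sub-dictionaries by a stable hash of each word (a hypothetical rule chosen only to keep the example deterministic), and that each dimension is cut into fixed-width length buckets. The function names and the `bucket_width` parameter are illustrative and do not come from the slides.

```python
import zlib


def length_vector(words, dims):
    """Count the words of a page that fall into each sub-dictionary.

    Sub-dictionary membership is assumed to be crc32(word) % dims.
    """
    vec = [0] * dims
    for w in words:
        vec[zlib.crc32(w.encode("utf-8")) % dims] += 1
    return tuple(vec)


def partition_id(vec, bucket_width):
    """Map a length vector to exactly one partition (a bucket tuple)."""
    return tuple(n // bucket_width for n in vec)
```

For example, with a bucket width of 100, a page with 1D length vector (600,) lands in partition (6,), while the 2D vector (280, 320) lands in partition (2, 3). Spreading pages over more dimensions gives more, smaller buckets to distribute among machines.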

9
Multi-dimensional page partitioning
[Figure: the dictionary is split into two sub-dictionaries; page A appears as A(600) in the 1D length space and as A(280,320) in the 2D length space]

10
When does Page A compare with B?
  • Page length vectors: A = (A1, A2), B = (B1, B2)
  • Page A needs to be compared with B only if
    [condition not preserved in this transcript]
  • t is the similarity threshold
  • ? is a fixed interval enlarging factor
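
The slide's exact condition is not preserved in this transcript, so the following is only a plausible sketch of a length-based comparison filter, not the authors' formula. It assumes that B must fall, in every dimension i, inside the interval [t*Ai - rho, Ai/t + rho], where `rho` is a hypothetical name for the interval enlarging factor. The intuition is that two pages with very different lengths cannot reach similarity t, so the comparison can be skipped.

```python
def needs_comparison(a, b, t, rho):
    """Return True if length vectors a and b are close enough to compare.

    Assumed rule (not taken from the slides): for every dimension i,
        t * a[i] - rho <= b[i] <= a[i] / t + rho
    t   : similarity threshold, 0 < t <= 1
    rho : hypothetical interval enlarging factor, rho >= 0
    """
    if len(a) != len(b):
        raise ValueError("length vectors must have the same dimension")
    return all(
        t * ai - rho <= bi <= ai / t + rho
        for ai, bi in zip(a, b)
    )
```

Under this assumed rule, with t = 0.8 and rho = 10, A(280, 320) is compared with B(300, 300) but not with B(100, 500). Enlarging rho widens the interval and trades extra comparisons for fewer missed duplicates.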

11
Implementation and Evaluations
  • Implemented in the Ask.com offline platform with C
    for processing billions of documents
  • Impact on relevancy
  • Continuously monitor top query results.
  • Error rate of false removal is tiny.
  • Impact on cost.
  • Compare two approaches
  • A: Online-dominating.
  • Offline removes 5% of duplicates first.
  • Most duplicates are hosted on online tier-2
    machines
  • B: Offline-dominating.

12
Cost Saving with Offline Dominating Approach
  • Fixed QPS target. Two-tier online index for 3-8
    billion URLs.
  • 8-26% cost saving with offline-dominating approach
  • Fewer tier-2 machines because fewer duplicates
    are hosted.
  • Online tier-1 machines can answer more queries
  • Online messages communicated contain fewer
    duplicates

13
Reduction of unnecessary inter-machine
communication and comparison
  • Up to 87% saving when using up to 64 machines

14
Effectiveness of 3D mapping
  • Load balance factor with up to 64 machines
  • Speedup of processing throughput

15
Benefits of incremental computation
  • Ratio of non-incremental duplicate detection time
    over incremental one for a 100 million dataset.
    Up to 24-fold speedup.
  • During a crawling update, 30% of updated pages have
    signatures similar to group signatures

16
Accuracy of distributed clustering and duplicate
group management
  • Relative error in precision compared to a
    single-machine configuration
  • Relative error in recall

17
Concluding remarks
  • Budget-conscious solution with offline-dominating
    redundant content removal
  • Up to 26% cost saving.
  • Approximated incremental scheme for duplicate
    clustering with error handling
  • Up to 24-fold speedup
  • Undetected duplicates are handled online.
  • 3D mapping still reduces unnecessary
    comparisons (up to 87%) while balancing load
    (3-fold improvement)