Algorithms for Low Latency Remote File Synchronization

Provided by: jatinde
1
Algorithms for Low Latency Remote File
Synchronization
  • Hao Yan, Utku Irmak, Torsten Suel
  • Polytechnic University
  • Presented By - Jay

2
Introduction
  • Remote File Synchronization problem
  • Types of Protocols
  • Single Round
  • Multi Round
  • Multi Round protocols suffer from protocol
    complexity, computing and I/O overheads at two
    endpoints

3
Focus
  • Single Round Protocol
  • Communication cost less than rsync
  • Protocol assumes knowledge of (tight) upper bound
    on the number of differences
  • Sampling Techniques to address the above issue
  • Choose suitable parameters based on sampling.

4
Assumptions
  • Collection consists of unstructured files
  • Modified in arbitrary ways
  • Insertion and Deletion may change line and page
    alignments
  • No log shipping
  • Not concerned with issues of consistency in
    between synchronization steps
  • No resolution of conflicts if updates are
    performed at multiple locations.

5
Contributions
  • New single round algorithm for file
    synchronization
  • Random sampling techniques, yielding a practical protocol
  • Evaluation on several data sets.

6
Related Work
  • rsync
  • Divide and Conquer
  • Send larger blocks and then recursively smaller
  • Trade off between rounds of communication and the
    bandwidth consumption
  • Set Reconciliation
  • Puzzles and De Bruijn Graphs

7
Technical Review
  • rsync
  • Client computes hashes/fingerprints
  • Client transfers hashes
  • Server looks through hashes
  • Server sends new data and the location to the
    client.

8
rsync Algorithm
  • Suppose we have two general purpose computers X
    and Y. Computer X has access to a file A and Y has
    access to a file B, where A and B are "similar".
    There is a slow communications link between X and Y.
  • The rsync algorithm consists of the following
    steps
  • Y splits the file B into a series of
    non-overlapping fixed-sized blocks of size S
    bytes. The last block may be shorter than S
    bytes.
  • For each of these blocks, Y calculates two
    checksums: a weak "rolling" 32-bit checksum
    (described below) and a strong 128-bit MD4
    checksum.
  • Y sends these checksums to X.
  • X searches through A to find all blocks of length
    S bytes (at any offset, not just multiples of S)
    that have the same weak and strong checksum as
    one of the blocks of B. This can be done in a
    single pass very quickly using a special property
    of the rolling checksum described below.
  • X sends a sequence of instructions for
    constructing a copy of A. Each instruction is
    either a reference to a block of B, or literal
    data. Literal data is sent only for those
    sections of A which did not match any of the
    blocks of B. The end result is that Y gets a copy
    of A, but only the pieces of A that are not found
    in B (plus a small amount of data for checksums
    and block indexes) are sent over the link. The
    algorithm also only requires one round trip,
    which minimizes the impact of the link latency.
  • The most important details of the algorithm are
    the rolling checksum and the associated
    multi-alternate search mechanism which allows the
    all-offsets checksum search to proceed very
    quickly.
  • http://rsync.samba.org/tech_report/node2.html
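
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the real rsync implementation: it skips the weak rolling checksum and hashes every offset directly, and it substitutes MD5 for MD4 because MD4 is not always available in `hashlib`. The function names `signatures`, `delta` and `patch` are hypothetical.

```python
import hashlib

def signatures(b_data: bytes, size: int) -> dict[bytes, int]:
    """Y's side: map the digest of each fixed-size block of B to its index."""
    sigs = {}
    for start in range(0, len(b_data), size):
        digest = hashlib.md5(b_data[start:start + size]).digest()
        sigs.setdefault(digest, start // size)
    return sigs

def delta(a_data: bytes, sigs: dict[bytes, int], size: int) -> list:
    """X's side: emit ('ref', block_index) or ('lit', bytes) instructions."""
    out, lit, i = [], bytearray(), 0
    while i < len(a_data):
        window = a_data[i:i + size]
        idx = sigs.get(hashlib.md5(window).digest())
        if idx is not None:
            if lit:                      # flush pending literal data first
                out.append(("lit", bytes(lit)))
                lit = bytearray()
            out.append(("ref", idx))
            i += len(window)
        else:
            lit.append(a_data[i])        # no match at this offset: one literal byte
            i += 1
    if lit:
        out.append(("lit", bytes(lit)))
    return out

def patch(b_data: bytes, instructions: list, size: int) -> bytes:
    """Y's side: rebuild A from blocks of B plus the literal data."""
    parts = []
    for kind, value in instructions:
        if kind == "ref":
            parts.append(b_data[value * size:(value + 1) * size])
        else:
            parts.append(value)
    return b"".join(parts)
```

For example, with B = `b"0123456789abcdefghijABCDEFGHIJ"`, A equal to B with `b"XY"` inserted after the first ten bytes, and a block size of 10, only the two inserted bytes travel as literal data; everything else is sent as block references.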

9
rsync Rolling Checksum

s(k, l) = a(k, l) + 2^{16} b(k, l)
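
A minimal sketch of the rolling checksum, following the definitions in the rsync technical report: a(k, l) is the sum of the bytes in the window modulo M, b(k, l) is the position-weighted sum modulo M, and M = 2^16. The function names `weak_checksum` and `roll` are chosen here for illustration.

```python
M = 1 << 16  # modulus M = 2^16, as in the rsync technical report

def weak_checksum(block: bytes) -> tuple[int, int, int]:
    """Compute (a, b, s) for a block from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte has weight n, the last byte has weight 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b, a + (b << 16)

def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int, int]:
    """Slide a window of length n one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b, a + (b << 16)
```

Because `roll` needs only the byte leaving the window and the byte entering it, the checksum at every offset of a file can be computed in a single pass.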
10
rsync Search Steps
  • 3 steps
  • 16 bit hash of the 32 bit rolling checksum and a
    2^16 entry hash table
  • Search the hash table
  • Search the buckets
  • Calculate the strong checksum.
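
The three search levels can be sketched as follows. This is an illustration under assumptions: the 16-bit hash `tag16` is a hypothetical stand-in (the slide does not say how rsync derives it), MD5 replaces MD4, and the names `build_table` and `find_block` are invented for the example.

```python
import hashlib
from collections import defaultdict

def tag16(weak: int) -> int:
    """Hypothetical 16-bit hash of a 32-bit weak checksum (an assumption)."""
    return (weak ^ (weak >> 16)) & 0xFFFF

def build_table(blocks):
    """blocks: list of (weak32, strong_digest) pairs, one per block of B."""
    table = defaultdict(list)
    for index, (weak, strong) in enumerate(blocks):
        table[tag16(weak)].append((weak, strong, index))
    return table

def find_block(table, weak: int, window: bytes):
    """Return the index of the matching block of B, or None."""
    bucket = table.get(tag16(weak))
    if not bucket:
        return None                      # level 1: empty hash table slot
    for cand_weak, cand_strong, index in bucket:
        if cand_weak != weak:
            continue                     # level 2: full 32-bit weak checksum differs
        if hashlib.md5(window).digest() == cand_strong:
            return index                 # level 3: strong checksum confirms the match
    return None
```

The expensive strong checksum is computed only for windows that survive the two cheap levels, which is what keeps the all-offsets search fast.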

11
Set Reconciliation
  • Sa and Sb are sets on Hosts A and B respectively
  • Hosts want to know intersection and union with
    min communication.
  • Goal is to use an amount of communication that is
    proportional to the size of the symmetric difference,
    i.e. the number of elements in (Sa - Sb) U (Sb - Sa)
  • Characteristic Polynomial

12
Set Reconciliation
  • Cancel terms corresponding to Sa intersection Sb
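
The cancellation can be written out as follows, assuming the usual definition of the characteristic polynomial of a set as the product of (z - x) over its elements. The notation is chosen here for illustration and is not taken from the slides.

```latex
\chi_S(z) = \prod_{x \in S} (z - x), \qquad
\frac{\chi_{S_a}(z)}{\chi_{S_b}(z)}
  = \frac{\chi_{S_a \cap S_b}(z)\,\chi_{S_a \setminus S_b}(z)}
         {\chi_{S_a \cap S_b}(z)\,\chi_{S_b \setminus S_a}(z)}
  = \frac{\chi_{S_a \setminus S_b}(z)}{\chi_{S_b \setminus S_a}(z)}
```

The degrees of the remaining numerator and denominator depend only on the size of the symmetric difference, not on the sizes of Sa and Sb themselves.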

13
Reconciliation Puzzles
  • Consider the string 01010010011
  • Mask: 111
  • Mask length l_m = 3
  • Beginning and end 01010010011
  • Multi set of puzzle pieces
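
A small sketch of how the multi-set of puzzle pieces could be built for this example. It assumes that an all-ones mask of length 3 simply selects every contiguous window of 3 symbols; the function name `puzzle_pieces` is hypothetical.

```python
from collections import Counter

def puzzle_pieces(s: str, mask_len: int) -> Counter:
    """Multi-set of all contiguous substrings of length mask_len."""
    return Counter(s[i:i + mask_len] for i in range(len(s) - mask_len + 1))

pieces = puzzle_pieces("01010010011", 3)
# 9 pieces in total: 010 x3, 100 x2, 001 x2, 101 x1, 011 x1
```

A Counter is used rather than a set because the same piece can occur several times, and the number of occurrences matters when the string is reassembled.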

14
Reconciliation Puzzles
15
Reconciliation Puzzles
16
Reconciliation Puzzles
  • Modified De Bruijn Digraph
  • parallel edges are added to the digraph for each
    occurrence of a particular piece in the
    multi-set.
  • edges which represent strings not in the
    multi-set are deleted.
  • vertices with degree zero are deleted.
  • two new vertices and edges corresponding to the
    first and last pieces of the encoded string are
    added
  • An artificial edge is added between the two new
    vertices to make their in-degree equal to their
    out-degree.

17
De Bruijn Modified Graphs
  • Sa = 01010011 and Sb = 01010010011

18
File Partitioning
  • Fixed Block as in rsync
  • Not good for set reconciliation
  • Karp-Rabin fingerprinting
  • Winnowing
  • 2-way min

19
Fingerprinting
20
Karp Rabin Fingerprinting
  • Exploits the fact that if two strings are equal,
    their hashes are equal.
  • Pseudocode:
      function RabinKarp(string s[1..n], string sub[1..m])
          hsub := hash(sub[1..m]); hs := hash(s[1..m])
          for i from 1 to n-m+1
              if hs = hsub
                  if s[i..i+m-1] = sub
                      return i
              hs := hash(s[i+1..i+m])
          return not found
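
A runnable version of the search above, as a sketch. The pseudocode leaves the hash function unspecified; this example assumes a polynomial rolling hash, and the parameters `base` and `mod` are arbitrary illustrative choices. Indices are 0-based here, unlike the 1-based pseudocode.

```python
def rabin_karp(s: str, sub: str, base: int = 256, mod: int = 1_000_003) -> int:
    """Return the 0-based index of the first occurrence of sub in s, or -1."""
    n, m = len(s), len(sub)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    hsub = hs = 0
    for i in range(m):
        hsub = (hsub * base + ord(sub[i])) % mod
        hs = (hs * base + ord(s[i])) % mod
    for i in range(n - m + 1):
        # Equal hashes only suggest a match, so compare the actual strings.
        if hs == hsub and s[i:i + m] == sub:
            return i
        if i + m < n:
            # Drop s[i], shift the window, and bring in s[i + m].
            hs = ((hs - ord(s[i]) * high) * base + ord(s[i + m])) % mod
    return -1
```

The explicit string comparison is what makes the result correct even when two different substrings happen to share a hash value.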

21
Winnowing
22
Winnowing
23
Winnowing
24
2-way min
25
Comparison
26
Authors' Algorithm
  • Locally partition both versions of the file into
    overlapping blocks using the 2-way min technique
  • Represent the blocks by their hashes.
  • Set Reconciliation protocol message from client
    to server
  • Server transmits the blocks to client using
    compression.

27
Algo
28
Algo
29
Encoding and Decoding
  • Golomb Encoding.
  • Golomb coding uses a tunable parameter M to
    divide an input value into two parts: q, the
    quotient of a division by M, and r, the remainder.
    The quotient is sent in unary coding, followed by
    the remainder in truncated binary encoding.

30
Preliminary Experimental Results
31
Preliminary Experimental Results
32
Costs
33
Effect of Similarity of Data
34
Effect of Similarity
35
Sampling
  • Estimate symmetric difference d between the
    hashes of new file and old file.

36
Costs
37
Estimating a Good Block Size
  • Hard to do
  • Sample hashes for several block sizes
  • Trade off between size and no. of blocks and the
    size of the unmatched parts of the file.

38
Concluding Remarks
  • Authors describe and evaluate a new algorithm.
  • Using Reconciliation techniques with overlapping
    content dependent partitioning is a promising
    approach.
  • Sampling can decide whether set reconciliation
    should be used.

39
References
  • Wikipedia
  • The referenced papers.

40
Questions.